2. Lexical Analysis

2.1. Groups

Lexical analysis is (at least in part) the process of converting a body of text into tokens. It is also the process of identifying the class of each token. The Syntax library refers to these classes as groups.

Each syntax module may define its own groups. The Ruby module, for instance, defines 18 different groups:

  1. normal: whitespace and the like. Basically, any text not grouped in any of the other groups.
  2. comment: the delimiters and contents of a comment
  3. keyword: any recognized keyword of the Ruby language
  4. method: the name of a method when it is being declared
  5. class: the name of a class when it is being declared
  6. module: the name of a module when it is being declared
  7. punct: any punctuation character
  8. symbol: a Ruby symbol token
  9. string: the contents (but not delimiters) of a string
  10. char: a character literal (?g)
  11. ident: an identifier, not otherwise recognized as a keyword
  12. constant: a constant (beginning with an uppercase letter)
  13. regex: the contents (but not delimiters) of a regular expression
  14. number: a numeric literal
  15. attribute: an instance variable
  16. global: a global variable
  17. expr: a nested (interpolated) expression within a string or regex
  18. escape: an escape sequence within a string or regex

The only group common to all modules is normal. (When converting text to HTML, the name of the class used in a span will be the name of the corresponding group—this makes it straightforward to determine what CSS classes need to be defined.)
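
As a concrete (and deliberately minimal) sketch, the helper below turns a single token into an HTML fragment whose span class is simply the token's group name. It is not part of the library; it assumes the token API shown in section 2.3 and ignores the region instructions described in section 2.2.

Mapping a token's group to an HTML class [ruby]
require 'cgi'

# Hypothetical helper: wrap one token in a span whose class attribute is
# the token's group name. Tokens in the normal group are emitted without
# a span. (This flat mapping ignores the region instructions covered in
# section 2.2.)
def span_for( token )
  text = CGI.escapeHTML( token.to_s )
  return text if token.group == :normal
  %(<span class="#{token.group}">#{text}</span>)
end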

2.2. Instructions

In addition to groups, each token has an associated instruction. For most tokens, this instruction is the symbol :none, meaning “do nothing special”. However, there are two other instructions defined by the framework:

  • :region_open: begin a “region”. This region is a sequence of tokens that are all nested inside the group of the current token. This is useful for strings and regular expressions, which may contain other kinds of tokens (like expr and escape, in Ruby’s case).
  • :region_close: close the current region.

The HTML convertor uses these instructions to decide whether to emit just an opening span tag, just a closing one, or both. Other convertors may use these instructions in similar ways.
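
To make this concrete, here is a sketch (an illustration only, not the library's actual convertor code) of how a consumer might react to the instructions, keeping a region's span open until the matching :region_close token arrives:

Reacting to token instructions [ruby]
require 'cgi'

# Sketch of an instruction-aware emitter, where "out" is any object that
# responds to <<, such as a String. A :region_open token starts a span
# named after its group and leaves it open; the matching :region_close
# token ends it. Other tokens are wrapped individually, except normal
# text, which passes through untouched. Keeping the delimiters inside
# the region's span is a design choice of this sketch, not a rule.
def emit( token, out )
  text = CGI.escapeHTML( token.to_s )
  case token.instruction
  when :region_open  then out << %(<span class="#{token.group}">) << text
  when :region_close then out << text << "</span>"
  else
    if token.group == :normal
      out << text
    else
      out << %(<span class="#{token.group}">#{text}</span>)
    end
  end
end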

2.3. Analyzing

Lexical analysis is performed by obtaining a tokenizer of the appropriate class and calling tokenize on it, passing the text to be tokenized. Each token is yielded to the associated block as it is discovered.

Tokenizing a Ruby script [ruby]
require 'syntax'

tokenizer = Syntax.load "ruby"
tokenizer.tokenize( File.read( "program.rb" ) ) do |token|
  puts token
  puts "  group: #{token.group}"
  puts "  instruction: #{token.instruction}"
end
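
The same block-based interface lends itself to quick analyses as well. The following sketch (again just an example built on the tokenize call above; program.rb stands in for any Ruby source file) tallies how many tokens fall into each group:

Counting tokens per group [ruby]
require 'syntax'

# Tally the number of tokens in each group -- a quick way to see which
# groups a syntax module actually emits for a given file.
counts = Hash.new( 0 )

tokenizer = Syntax.load "ruby"
tokenizer.tokenize( File.read( "program.rb" ) ) do |token|
  counts[token.group] += 1
end

counts.sort_by { |_, n| -n }.each do |group, n|
  puts "%-10s %d" % [ group, n ]
end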

If you need finer control over the process, you can use the lower-level API:

Tokenizing a Ruby script via step [ruby]
require 'syntax'

tokenizer = Syntax.load "ruby"
tokenizer.start( File.read( "program.rb" ) ) do |token|
  puts token
  puts "  group: #{token.group}"
  puts "  instruction: #{token.instruction}"
end

tokenizer.step
tokenizer.step
# ...call step as many more times as needed...
tokenizer.finish

In this case, each call to #step consumes a portion of the text and yields the resulting tokens to the block. Note that a single step may detect and yield multiple tokens; there is no way to guarantee one token at a time, unless the corresponding syntax module was written to work that way. For efficiency, the existing modules yield multiple tokens when processing (for instance) strings, regular expressions, and heredocs.
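
To see this behaviour, the sketch below counts the tokens yielded by each of the first few steps on a short snippet. The exact counts depend entirely on how the Ruby module chunks its input, and a real consumer would keep stepping until the whole text had been consumed:

Observing how many tokens each step yields [ruby]
require 'syntax'

# Count how many tokens each call to #step produces. The snippet is long
# enough that three steps will not run past its end; the string literal
# with interpolation is likely to show several tokens arriving in a
# single step.
tokenizer = Syntax.load "ruby"

yielded = 0
tokenizer.start( 'str = "a #{b} c" # done' ) { |token| yielded += 1 }

3.times do |i|
  before = yielded
  tokenizer.step
  puts "step #{i + 1} yielded #{yielded - before} token(s)"
end

tokenizer.finish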