2. Lexical Analysis

2.1. Groups

Lexical analysis is (at least in part) the process of converting a body of text into tokens. It is also the process of identifying the class of each token. The Syntax library refers to these classes as groups.

Each syntax module may define its own groups. The Ruby module, for instance, defines 18 different groups:

  1. normal: whitespace and the like. Basically, any text not grouped in any of the other groups.
  2. comment: the delimiters and contents of a comment
  3. keyword: any recognized keyword of the Ruby language
  4. method: the name of a method when it is being declared
  5. class: the name of a class when it is being declared
  6. module: the name of a module when it is being declared
  7. punct: any punctuation character
  8. symbol: a Ruby symbol token
  9. string: the contents (but not delimiters) of a string
  10. char: a character literal (?g)
  11. ident: an identifier, not otherwise recognized as a keyword
  12. constant: a constant (beginning with an uppercase letter)
  13. regex: the contents (but not delimiters) of a regular expression
  14. number: a numeric literal
  15. attribute: an instance variable
  16. global: a global variable
  17. expr: a nested (interpolated) expression within a string or regex
  18. escape: an escape sequence within a string or regex

The only group common to all modules is normal. (When converting text to HTML, the name of the class used in a span will be the name of the corresponding group—this makes it straightforward to determine what CSS classes need to be defined.)
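
As a concrete (and deliberately minimal) sketch, the helper below turns a single token into an HTML fragment whose span class is simply the token's group name. It is not part of the library; it assumes the token API shown in section 2.3 and ignores the region instructions described in section 2.2.

Mapping a token's group to an HTML class [ruby]
require 'cgi'

# Hypothetical helper: wrap one token in a span whose class attribute is
# the token's group name. Tokens in the normal group are emitted without
# a span. (This flat mapping ignores the region instructions covered in
# section 2.2.)
def span_for( token )
  text = CGI.escapeHTML( token.to_s )
  return text if token.group == :normal
  %(<span class="#{token.group}">#{text}</span>)
end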

2.2. Instructions

In addition to groups, each token has an associated instruction. For most tokens, this instruction is the symbol :none, meaning “do nothing special”. However, there are two other instructions defined by the framework:

  • :region_open: begin a “region”. This region is a sequence of tokens that are all nested inside the group of the current token. This is useful for strings and regular expressions, which may contain other kinds of tokens (like expr and escape, in Ruby’s case).
  • :region_close: close the current region.

The HTML convertor uses these instructions to decide whether to emit just an opening span tag, just a closing one, or both. Other convertors may use these instructions in similar ways.
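
To make this concrete, here is a sketch (an illustration only, not the library's actual convertor code) of how a consumer might react to the instructions, keeping a region's span open until the matching :region_close token arrives:

Reacting to token instructions [ruby]
require 'cgi'

# Sketch of an instruction-aware emitter, where "out" is any object that
# responds to <<, such as a String. A :region_open token starts a span
# named after its group and leaves it open; the matching :region_close
# token ends it. Other tokens are wrapped individually, except normal
# text, which passes through untouched. Keeping the delimiters inside
# the region's span is a design choice of this sketch, not a rule.
def emit( token, out )
  text = CGI.escapeHTML( token.to_s )
  case token.instruction
  when :region_open  then out << %(<span class="#{token.group}">) << text
  when :region_close then out << text << "</span>"
  else
    if token.group == :normal
      out << text
    else
      out << %(<span class="#{token.group}">#{text}</span>)
    end
  end
end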

2.3. Analyzing

Lexical analysis is performed by obtaining a tokenizer of the appropriate class and calling tokenize on it, passing the text to be tokenized. Each token is yielded to the associated block as it is discovered.

Tokenizing a Ruby script [ruby]
require 'syntax'

tokenizer = Syntax.load "ruby"
tokenizer.tokenize( File.read( "program.rb" ) ) do |token|
  puts token
  puts "  group: #{token.group}"
  puts "  instruction: #{token.instruction}"
end
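
The same block-based interface lends itself to quick analyses as well. The following sketch (again just an example built on the tokenize call above; program.rb stands in for any Ruby source file) tallies how many tokens fall into each group:

Counting tokens per group [ruby]
require 'syntax'

# Tally the number of tokens in each group -- a quick way to see which
# groups a syntax module actually emits for a given file.
counts = Hash.new( 0 )

tokenizer = Syntax.load "ruby"
tokenizer.tokenize( File.read( "program.rb" ) ) do |token|
  counts[token.group] += 1
end

counts.sort_by { |_, n| -n }.each do |group, n|
  puts "%-10s %d" % [ group, n ]
end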

If you need finer control over the process, you can use the lower-level API:

Tokenizing a Ruby script via step [ruby]
require 'syntax'

tokenizer = Syntax.load "ruby"
tokenizer.start( File.read( "program.rb" ) ) do |token|
  puts token
  puts "  group: #{token.group}"
  puts "  instruction: #{token.instruction}"
end

tokenizer.step
tokenizer.step
# ...call step as many more times as needed...
tokenizer.finish

In this case, each call to #step consumes a portion of the text and yields the resulting tokens to the block. Note that a single step may detect and yield multiple tokens; there is no way to guarantee one token at a time, unless the corresponding syntax module was written to work that way. For efficiency, the existing modules yield multiple tokens when processing (for instance) strings, regular expressions, and heredocs.
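
To see this behaviour, the sketch below counts the tokens yielded by each of the first few steps on a short snippet. The exact counts depend entirely on how the Ruby module chunks its input, and a real consumer would keep stepping until the whole text had been consumed:

Observing how many tokens each step yields [ruby]
require 'syntax'

# Count how many tokens each call to #step produces. The snippet is long
# enough that three steps will not run past its end; the string literal
# with interpolation is likely to show several tokens arriving in a
# single step.
tokenizer = Syntax.load "ruby"

yielded = 0
tokenizer.start( 'str = "a #{b} c" # done' ) { |token| yielded += 1 }

3.times do |i|
  before = yielded
  tokenizer.step
  puts "step #{i + 1} yielded #{yielded - before} token(s)"
end

tokenizer.finish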