4. Extending Syntax

4.1. Introduction

Because of Syntax’s modular design, it is pretty straightforward to create your own syntax modules. The hardest part is doing the actual tokenizing of your chosen syntax.

You can use the existing syntax modules to guide your own implementation if you wish, but note that each module has to solve its own set of problems, because every syntax is different.

4.2. Interface

Your new syntax implementation should extend Syntax::Tokenizer—this sets up a rich domain-specific language for scanning and tokenizing.

Then, all you need to implement is the #step method, which should take no parameters. Each invocation of #step should extract at least one token, but may extract as many as you need it to. (Fewer is generally better, though.)

Additionally, you may also implement #setup, to perform any initialization that should occur when tokenizing begins. Similarly, #teardown may be implemented to do any cleanup that is needed.
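To make the lifecycle concrete, here is a minimal, hypothetical stand-in for the framework's driver loop. The real library invokes these hooks for you when you subclass Syntax::Tokenizer; this sketch only illustrates the order in which they are called, using the standard library's StringScanner:

```ruby
require 'strscan'

# Hypothetical stand-in for the framework's driver loop, to show when
# #setup, #step, and #teardown run (a real tokenizer would instead
# subclass Syntax::Tokenizer and let the library drive it):
class LifecycleSketch
  def tokenize(text)
    @scanner = StringScanner.new(text)
    setup                         # called once, before tokenizing begins
    step until @scanner.eos?      # called repeatedly until input runs out
    teardown                      # called once, after tokenizing ends
    @words
  end

  def setup
    @words = []                   # per-run state, initialized here
  end

  def step
    if word = @scanner.scan(/\w+/)
      @words << word
    else
      @scanner.scan(/\W+/)        # skip anything that is not a word
    end
  end

  def teardown
    @words.freeze                 # any final cleanup goes here
  end
end
```

Each call to #tokenize gets fresh state, because #setup runs at the start of every run rather than only once at object creation.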

4.3. Scanning API

Within a tokenizer, you have access to a rich set of methods for scanning the text. These methods correspond to the methods of the StringScanner class (e.g., scan, scan_until, bol?).

Additionally, subgroups of the most recent regexp match (from scan, scan_until, etc.) can be obtained via subgroup, which takes as a parameter the index of the group you want to query.
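Since the scanning methods mirror StringScanner, the idea behind subgroup can be illustrated with the standard library directly: StringScanner exposes the capture groups of the last match via bracket access, which is the role subgroup plays inside a tokenizer.

```ruby
require 'strscan'

scanner = StringScanner.new("width = 42")
scanner.scan(/(\w+)\s*=\s*(\d+)/)  # scan with capturing groups

scanner[1]  # first subgroup: "width"
scanner[2]  # second subgroup: "42"
```

Inside an actual tokenizer subclass you would write subgroup(1) and subgroup(2) after a successful scan to pull out the same pieces.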

Tokenizing proceeds as follows:

  1. Identify a token (using #peek, #scan, etc.).
  2. Start a new token group (using #start_group, passing the symbol for the group and optionally any text you want to seed the group with).
  3. Append text to the current group, either with additional calls to #start_group using the same group, or with #append (which takes just the text to append to the current group).

Instead of #start_group, you can also use #start_region, which begins a new region for the given group, and #end_region, which closes the region.
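One plausible use of regions is wrapping a quoted string, so that its delimiters and contents nest under a single :string region. The sketch below records the calls a #step method might make; the recorder class is a hypothetical stand-in for the methods a real tokenizer inherits from Syntax::Tokenizer, not part of the library itself.

```ruby
require 'strscan'

# Hypothetical recorder standing in for the inherited group/region
# methods; it just logs each call so the sequence is visible:
class RegionSketch
  attr_reader :events

  def initialize(text)
    @scanner = StringScanner.new(text)
    @events  = []
  end

  def start_group(type, text)
    @events << [:group, type, text]
  end

  def start_region(type)
    @events << [:region_start, type]
  end

  def end_region(type)
    @events << [:region_end, type]
  end

  # One way a #step might tokenize a double-quoted string as a region
  # (assumes the closing quote is present):
  def step
    if @scanner.scan(/"/)
      start_region :string
      start_group :punct, '"'
      start_group :string, @scanner.scan(/[^"]*/)
      start_group :punct, @scanner.scan(/"/)
      end_region :string
    else
      start_group :normal, @scanner.scan(/./m)
    end
  end
end
```

Running #step over the input `"hi"` produces a :region_start event, the three nested groups, and a matching :region_end, which is exactly the shape a region gives you: a named span that contains ordinary groups.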

Here is an example of a very simple tokenizer that simply extracts words and numbers from the text:

Simple tokenizer [ruby]
require 'syntax'

class SimpleTokenizer < Syntax::Tokenizer
  def step
    if digits = scan(/\d+/)
      start_group :digits, digits
    elsif words = scan(/\w+/)
      start_group :words, words
    else
      start_group :normal, scan(/./)
    end
  end
end

4.4. Registering Your New Syntax

Once you’ve written your new syntax module, you need to register it with the Syntax library so that it can be found and used by the framework. To do this, just add it to the Syntax::SYNTAX hash:

Registering a new syntax [ruby]
require 'syntax'

class SimpleTokenizer < Syntax::Tokenizer
  ...
end

Syntax::SYNTAX['simple'] = SimpleTokenizer

That’s it! Once you’ve done that, you can now use your syntax just by requiring the file that defines it, and then using the standard Syntax framework methods:

Using your new syntax [ruby]
require 'simple-tokenizer'
require 'syntax/convertors/html'

convertor = Syntax::Convertors::HTML.for_syntax "simple"
puts convertor.convert( "hello 15 worlds!" )