Implementing a scripting language with Antlr (Part 1: Lexer)

Recently I implemented a small scripting language with Antlr and I thought it would make for a good mid-level example. Antlr is a grammar generator like JavaCC or the venerable lex/yacc tools. Seems like most examples of Antlr you can find out on the web are either trivial (the canonical calculator example) or very complicated (the Java grammar).

The language I’m trying to implement lets you define a set of named scripting blocks. Each scripting block is composed of commands. Each command consists of a command name followed by a set of properties. Properties can be either keyed (key=value) or unkeyed (value).

A BNF for this simple language is:

script : block* 
  block  : STRING LBRACE command* RBRACE 
  command : STRING property* SEMI
  property : (STRING EQUALS)?  STRING

In this BNF, lower-case words represent non-terminal rules and upper-case words represent literal terminal tokens. * is 0..n and ? is 0..1.

An example script might look something like this:

# What to do in the morning
  morning {
      wake_up;
      eat bacon;
      brush_teeth;
      wash hair=true minutes=10;
  } 

  # What to do in the evening
  evening { 
      read book="How To Be a Ninja";
      brush_teeth;
      sleep hours=8;
  }

When you work with Antlr (or pretty much any grammar tool set), you will need to define at least a lexer (aka tokenizer) and a parser (usually defined as a grammar). The lexer takes a stream of characters and emits a stream of tokens. The parser takes a stream of tokens and (usually) emits an Abstract Syntax Tree (AST) which represents the structure of the input stream as a tree.

To visualize how a grammar produces a tree, imagine each concrete token match as a leaf and each non-terminal rule that is activated as an intermediate node. The start rule (script in our example) is the root. In reality, the AST story is not quite so straight-forward, but conceptually, this works.

So, let’s first define the lexer using Antlr. We mentioned and will need to define the following tokens in our grammar: STRING , LBRACE, RBRACE, SEMI, EQUALS. So, let’s start with the punctuation and leave STRING out for the moment:

class ScriptLexer extends Lexer;

	options { 
		k=1; 	// Only 1 lookahead character required
	}
	
	// Punctuation     
	LBRACE : '{';
	RBRACE : '}';
	EQUALS : '=';
	SEMI : ';';

We start the Lexer with a class declaration, then an options block, then the punctuation definitions (officially known as “lexer rules”). The k=1 option defines how many tokens ahead the Lexer should look. In this case, all of the token definitions are unambiguous from the first character, so k=1 is sufficient. Lexer rules must start with a capital letter and by convention they are usually all-caps.

Here we’ve defined tokens as single characters but the token definitions are actually regular expressions which makes things more interesting. So, let’s define the token for STRING values now. For now, let’s do something simple with a limited set of characters:

STRING : ('a'..'z'|'A'..'Z'|'0'..'9'|'_')+

Here we define some character ranges (‘a’..’z’) combine them with an or (

) and state them want one or more of them in a row (+).

One other thing that we’re missing is a rule that tells the lexer to ignore whitespace. Otherwise, the lexer will fail on all those spaces in between words in our script. This is a pretty common thing in grammars as one of the benefits of using a parser-generator (over writing one yourself) is that the generator can take care of the nasty stuff. All we have to do is declare a rule like this:

WS     :
	    (' ' 
	    | '\\t' 
	    | '\\r' '\\n' { newline(); } 
	    | '\\n'      { newline(); }
	    ) 
	    { $setType(Token.SKIP); } ;

In this rule you see a couple of times where we drop out of the grammar and into code blocks. The calls to newline() increment the line counter in the Lexer class so that error messages have accurate line/column information. The $setType command is a special instruction to ignore these tokens and not put them in the token stream the Parser will work on. That means that a string of characters like:

“a { \n b; } \n” is turned into a stream of tokens like: STRING LBRACE STRING SEMI RBRACE and the whitespace is matched but skipped. Many grammars have a whitespace rule that looks very similar to this.

What about comments? We’d like # to serve as a single-line comment character indicating that the rest of the line should be skipped. This sounds similar to what we did with whitespace and it is:

LINE_COMMENT : 
            '#' 
            ( ~('\\n'|'\\r') )* 
            ( '\\n'|'\\r'('\\n')? )? 
	    { $setType(Token.SKIP); newline(); } ;

The LINE_COMMENT rule first matches a # symbol, then a series of any character other than an end of line, then an end of line. Once we match this line comment token, we skip it and mark a new line (since we consumed a new line). One tricky detail here (that tripped me up) is what happens here if you have a comment on the last line of a file. In this case, there will be no end of line, but we still want to match the line comment. This is the reason that the end of line group is marked with a ( )?, which indicates an optional group. In that case, our newline() call is incorrect, but it won’t matter as we are at EOF.

At this point, we can tokenize almost all of our original script example. Can you spot what I left out? I ignored the whole issue of accepting quoted strings as part of the STRING token. Let’s tackle that now. We’ll need to add an additional kind of string that starts and ends with a quote and allows any other character within the quotes. We will redefine the STRING rule as follows:

STRING : ('a'..'z'|'A'..'Z'|'0'..'9'|'_')+ 
                     | ('"' (~'"')* '"');

That concludes the complete lexer for our simple scripting language. So, how do we actually use Antlr to generate the grammar? Well, we need to add the antlr.jar to our classpath, then run:

java antlr.Tool lexer.g

This will produce two files: ScriptLexer.java (the nasty generated tokenizer code) and ScriptLexerTokenTypes.java (just some token constants). Then we can write some sample code that examines the token string emitted from the Lexer for our test script.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;

import antlr.Token;

public class TestLexer {

    public static void main(String[] args) throws Exception {
        System.out.println("Parsing: " + args[0]);
        
        Reader reader = new BufferedReader(new FileReader(args[0]));
        ScriptLexer lexer = new ScriptLexer(reader);

        Token token = lexer.nextToken();
        while(token.getType() != ScriptLexerTokenTypes.EOF) {
            System.out.println("\t" + getTokenType(token.getType()) + "\t\t" + token.getText());
            token = lexer.nextToken();
        }
    }
    
    private static String getTokenType(int tokenType) {
        switch(tokenType) {
        case ScriptLexerTokenTypes.STRING: return "STRING";
        case ScriptLexerTokenTypes.LBRACE: return "LBRACE";
        case ScriptLexerTokenTypes.RBRACE: return "RBRACE";
        case ScriptLexerTokenTypes.EQUALS: return "EQUALS";
        case ScriptLexerTokenTypes.SEMI: return "SEMI";
        default: return "OTHER";
        }
    }

}

When we run this on our test script, you’ll see:

Parsing: src/test.script
	STRING		morning
	LBRACE		{
	STRING		wake_up
	SEMI		;
	STRING		eat
	STRING		bacon
	SEMI		;
	STRING		brush_teeth
	SEMI		;
	STRING		wash
	STRING		hair
	EQUALS		=
	STRING		true
	STRING		minutes
	EQUALS		=
	STRING		10
	SEMI		;
	RBRACE		}
	STRING		evening
	LBRACE		{
	STRING		read
	STRING		book
	EQUALS		=
	STRING		"How To Be a Ninja"
	SEMI		;
	STRING		brush_teeth
	SEMI		;
	STRING		sleep
	STRING		hours
	EQUALS		=
	STRING		8
	SEMI		;
	RBRACE		}

You may think we’re done at this point, but we’re only half-way there. Right now we can tokenize a stream of characters, but we are doing nothing to enforce the actual grammar. The lexer will happily tokenize a stream like this too: “;;;;blah}{=” and emit a stream of tokens. The next step is to define the set of tokens that constitute a valid script and that’s done through the parser.

At this point, we’ll take a break. It took me a long time to explain it, but we wrote a really small and (mostly) simple set of rules to cover all the tokens we needed. In Part 2, we’ll actually define our grammar and create our parser.

Here are the files discussed in Part 1 if you’re interested:

Links to the rest of the series:

Pure Danger Tech

Implementing a scripting language with Antlr (Part 1: Lexer)