Saturday, August 27, 2016

A Pleasurable Journey into Text Translation using ANTLR4

For one of my clients, I needed to spec out a REST service API for developers. Although they had standardized on API Blueprint, I felt compelled to look for alternatives for specifying future REST APIs. The input that API Blueprint requires is just too verbose and takes far too much time to write. If I expected to spend less time specifying REST APIs, this might not be an issue, or at least not as much of one. I also checked out Swagger and RAML and found the same verbosity issue; I just don't have time for that on an ongoing basis. There's a good review of these products that I found particularly useful here.

My thoughts turned to a syntax for specifying REST APIs that would be far more streamlined and concise. Within an hour, I had a draft of a syntax (example below) that was far less verbose and would significantly increase my productivity per API. The problem would be writing something that could interpret specifications in this format and then generate the verbose XML, Markdown, or YAML syntax for one of the API design products mentioned above. It turns out that there are products that specialize in text translation. ANTLR is the most popular of these products right now.

The way ANTLR works is that you specify a grammar that describes the text you want to translate. ANTLR uses that grammar to generate Java code that can take text in that format and interpret it for you in the terms that you specified in the grammar. For instance, I defined a REST Resource block in my grammar and told it about the syntax for the different operations and the JSON arguments they accept as input or emit as output. The ANTLR-generated code analyzes the specifications I write and converts them to a Java object format that I can more easily read, interpret through code, and use to generate translated output with the help of a templating technology such as FreeMarker. An example grammar for the Java language can be found here.

It turns out that specifying a grammar wasn't as easy as I thought it would be going in. What is, in my mind, a simple syntax is actually quite complex when you break it down into constructs that a product like ANTLR can understand. Essentially, you need to specify all whitespace (characters to ignore) and, if the syntax is to support them, comments. All special keywords, and the rules that govern where those keywords may be used, also need to be specified. At this point, I probably should have backed down from the idea and suffered through one of the more verbose solutions. However, by then I was far too interested in how structured text gets specified, and in what's possible by interpreting it through code, to stop.
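
To make the whitespace and comment point concrete, here is a minimal sketch of lexer rules that discard both. The grammar name and rule names (SpecSkipSketch, LINE_COMMENT, WS) are illustrative assumptions, not the grammar from this project, and the comment style assumes the '//' form used in the sample spec at the end of this post.

```antlr
lexer grammar SpecSkipSketch;   // hypothetical grammar name

// A '//' comment runs to the end of the line and is discarded.
LINE_COMMENT : '//' ~[\r\n]* -> skip ;

// Spaces, tabs, and line breaks are discarded as well.
WS : [ \t\r\n]+ -> skip ;
```

The `-> skip` command tells the lexer to drop the matched text, so the parser rules never see it.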

I'm partway through this project and will open source it once it's complete. For those taking similar text-translation journeys with ANTLR, I have ferreted out some techniques that helped me immensely.

Write and test the Lexer portion of the grammar first.

ANTLR breaks grammars into two pieces: a "lexer" and a "parser". The lexer recognizes which characters and keywords are important for what you're doing and skips any unneeded whitespace. It also packages those characters and keywords internally as "tokens" so that they can be used for more sophisticated translation later on. The parser applies rules to those tokens to interpret them in context. For example, a REST resource definition doesn't make sense in the data type structure section of my proposed REST API specification syntax.

Because the parser consumes the lexer's output, it's important to get the lexer portion of your grammar working first. Any testing of the parser before then is premature. Your lexer tests should assert that:
  • All characters and keywords are recognized.
  • The lexer identifies characters and keywords correctly. For instance, I had a bug early on where the keyword 'Resource' was recognized as a string literal. In my syntax, 'Resource' has a special context and meaning.

You can test the lexer generated from your grammar by iterating through the tokens it produces. Any unrecognized token should cause a test failure. If the lexer doesn't recognize your special characters and keywords (e.g., it doesn't find the expected number of 'Resource' keywords in your test sample), that should also cause a test failure.
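
One way a keyword such as 'Resource' can end up classified as something more general is rule order: when two lexer rules match the same text at the same length, ANTLR picks the rule that is defined first. Whether that was the cause of the bug described above is an assumption; the sketch below only illustrates the ordering behavior. All names (KeywordOrderSketch, RESOURCE, COLON, WORD) are hypothetical.

```antlr
lexer grammar KeywordOrderSketch;   // hypothetical grammar name

// Keywords come first so they win ties against the general rule below.
RESOURCE : 'Resource' ;
COLON    : ':' ;

// General rule for names and values. It also matches the text "Resource",
// but loses the tie because RESOURCE is defined earlier. A longer word such
// as "Resources" still becomes WORD, because the longest match is preferred
// before rule order is considered.
WORD : [a-zA-Z0-9_]+ ;

WS : [ \t\r\n]+ -> skip ;
```

Swapping the order of RESOURCE and WORD would turn every 'Resource' into a WORD token, which is exactly the kind of failure a token-by-token lexer test catches.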

Write Parser rules iteratively from general rules to more specific rules.

Parser rules apply context to the tokens identified by the lexer. I found it much easier to start with very general parser rules and get those working. For example, my syntax has two main sections: a bounded context section that describes resources and operations, and a Types section that describes all data types used by the API. My first iteration of the parser rules just identified the two sections. That isn't enough to do what I need, but I didn't leave it there. Over time, I specified the portions of both sections and progressively described them in more detail.

In other words, each parser rule describes a section of your input text. The first test for a parser rule can be simple: just check the start line/column position and end line/column position of the text it matched. If those are correct, then you can write more specific rules that carve up the larger sections from the first iteration. Each parser rule you write has a value object specifically generated for it. That value object holds the starting and ending tokens for the section it covers (you can get the starting and ending positions from those tokens).
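
A first iteration along these lines might look like the sketch below: one entry rule for the whole document and one coarse rule per section, with the section bodies left as undifferentiated runs of tokens. This is an assumption about how such a grammar could be laid out, not the grammar from this project, and every rule and token name in it is hypothetical. It also assumes that a line starting with '#' is a comment, as the first line of the sample spec suggests.

```antlr
grammar SpecSectionsSketch;   // hypothetical grammar name

// Entry rule: covers the entire document. EOF forces the parser to
// consume all input rather than stopping after the last section.
spec : boundedContext types EOF ;

// First iteration: each section is its header followed by any tokens
// that are not a section keyword. Later iterations would replace
// sectionBody with rules for resources, operations, and type definitions.
boundedContext : BOUNDED_CONTEXT COLON sectionBody ;
types          : TYPES COLON sectionBody ;
sectionBody    : ( ~(BOUNDED_CONTEXT | TYPES) )* ;

// Lexer rules. Keywords precede WORD so they win ties.
BOUNDED_CONTEXT : 'Bounded Context' ;
TYPES           : 'Types' ;
COLON           : ':' ;
LBRACE          : '{' ;
RBRACE          : '}' ;
WORD            : ~[ \t\r\n:{}]+ ;

// Assumed comment styles: '//' and a leading '#', both to end of line.
LINE_COMMENT    : '//' ~[\r\n]* -> skip ;
HASH_COMMENT    : '#' ~[\r\n]* -> skip ;
WS              : [ \t\r\n]+ -> skip ;
```

With rules this coarse, the position test described above only needs to check that boundedContext starts at the 'Bounded Context' keyword and that types starts at the 'Types' keyword.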

There are a few non-obvious points about ANTLR worth remembering.
  • Lexer rule names start with an upper-case letter (by convention they are written in UPPER_CASE). Parser rule names start with a lower-case letter.
  • At least one parser rule should apply to the entire document (minus skipped whitespace).

I'll post additional reports and publish the resulting work via GitHub when complete. I'm still midway through this effort.

An Example REST API Specification Syntax

#student.spec - Student REST API
Bounded Context: Student Information {
    Resource: Student // Everything about current and past students
        Operation: /student - POST, json, student //Creates a student
            httpStatus: 201,400
            return: json, studentId
        Operation: /student/{studentId} - PATCH, json, student //Update student (only those attributes provided)
            httpStatus: 200,400,404,405
        Operation: /student/{studentId} - DELETE //Delete student
            httpStatus: 200,400,404,405
        Operation: /student - GET //Finds a list of students by status
            status - string[] // Status values to search by
            httpStatus: 200,400
            return: json, student
        Operation: /student/{studentId} - GET //Finds a student by their id
            httpStatus: 200,400,404
            return: json, student
Types: {
    student {
        studentId - required, string // student identifier that uniquely identifies a student
        firstName - required, string $$Bill
        middleName - string
        lastName - required, string $$Williamson
        title - enum{Mr, Ms, Mrs}
        birthDate - required, date
        primaryAddress - required, address
        schoolAddress - address
        primaryPhone - required, phone
        cellPhone - phone
        status - enum{Applied, Accepted, Active, NonActive}
    address {
        streetAddress1 - string $$123 Testing Lane
        streetAddress2 - string
        City - string $$Somewhere
        StateCode - string(2) $$IL
        zipCode - int(5)
        zipCodeExt - int(4)
    phone {
        countryCode - int(2)
        areaCode - int
        prefix - int
        line - int
        extension - int
    classSection {
        title - string
        discipline - string
        courseNbr - int
        building - string
        room - string
        time - string