Aaron Patterson recently posted this article on
Parsing JSON the hard way using
Racc. I think he did an excellent job explaining Racc. However, it was painful. Racc is an LALR(1) parser-generator similar to Yacc. While Yacc was fun to learn in school, it's also painful to use. LALR is hard to debug ("shift-reduce conflict" - !@#$% - if you know what I mean), and the parser generation steps are tedious to set up and maintain. There is a better way...
Enter
Parsing Expression Grammars. I learned about PEGs from
Treetop. Think of PEGs as greedy regular expressions with recursion. Being greedy, they don't try every possible pattern-match. This turns out to be a good thing; it makes them much easier to understand and debug. Adding recursive rule-matching to regular expressions makes PEGs approximately as expressive as LALR(1) parsers (though not identical). Treetop got me excited to work with parsers again, but sadly it still had some of the awkwardness of Yacc-style parser-generators. You have to build "treetop" files into ruby files before you can use them. There is yet an even better way...
Enter
Babel Bridge (
github). I wanted a PEG based parser-generator that allowed me to use 100% ruby with no extra build steps, so I built Babel Bridge. I've been evolving BB for over 2 years now. It is a powerful tool for building easy-to-understand parsers.
Here is how to write a JSON parser using the Babel Bridge gem. If you want to skip to the punch-line, here is the
JSON parser source code.
Writing a JSON parser with Babel Bridge
First, install the babel_bridge gem:
gem install babel_bridge
With Babel Bridge you can create many different parsers and have many instances of each in one runtime. To define a new parser, create a class and inherit from BabelBridge::Parser.
require "babel_bridge"
class JsonParser < BabelBridge::Parser
# define the parser here
end
The main component for defining your parser is the "rule". A rule is a named pattern for matching strings of characters. The first parameter of the "rule" method is the name of the rule. The remaining parameters specify one or more patterns to match. For more detailed info, please see the
readme.
There are five literal values that are allowed with JSON: string, number, true, false and null. Let's start by making rules for each of them. (Credit: I used Aaron Patterson's regular expressions he derived from
json.org.)
require "babel_bridge"
class JsonParser < BabelBridge::Parser
rule :value, any(:number, :string, :true, :false, :null)
rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
rule :true, "true"
rule :false, "false"
rule :null, "null"
end
BabelBridge::Shell.new(JsonParser.new).start
In addition to the five rules for the literals, I added two more lines of code. First, I added a general rule called :value. It matches any of the five literals. Babel-Bridge starts matching using the first rule you declare. That "root rule" must include everything you want to match. JsonParser will now match any of the JSON literals. Second, I added a "shell". It is a Babel-Bridge tool to help make parser-testing easier. Try running the above code and typing in some JSON literals at the prompts.
> "string literal"
Parse tree:
ValueNode1 > StringNode1 > "\"string literal\""
> 123
Parse tree:
ValueNode1 > NumberNode1 > "123"
> null
Parse tree:
ValueNode1 > NullNode1 > "null"
We are halfway done. There are two more important patterns for JSON: objects (hashes) and arrays. Arrays consist of 0 or more :values, delimited by "," and wrapped in "[]". Objects consist of 0 or more :pairs, again delimited by "," and wrapped in "{}". Pairs consist of a :string, then a ":", then a :value.
require "babel_bridge"
class JsonParser < BabelBridge::Parser
rule :value, any(
:object, :array, :number,
:string, :true, :false, :null
)
rule :array, '[', many?(:value, ','), ']'
rule :object, '{', many?(:pair, ','), '}'
rule :pair, :string, ':', :value
rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
rule :true, "true"
rule :false, "false"
rule :null, "null"
end
BabelBridge::Shell.new(JsonParser.new).start
You can try out the new rules:
> ["my string",123,true]
Parse tree:
ValueNode1 > ArrayNode1
"["
ValueNode1 > StringNode1 > "\"my string\""
ValueNode1 > NumberNode1 > "123"
ValueNode1 > TrueNode1 > "true"
"]"
> {"my key":"my value"}
Parse tree:
ValueNode1 > ObjectNode1
"{"
PairNode1
StringNode1 > "\"my key\""
":"
ValueNode1 > StringNode1 > "\"my value\""
"}"
If JSON allowed literals to stand alone, we'd be done. Unfortunately JSON is a little more restrictive, so we have to add a new root-rule to ensure there is an array or object at the top-most level.
There is one other thing we need to add. Parsing-Expression-Grammars match every character or fail. Right now we are not handling white-space, so any white-space character in our string will cause our parsing to fail. BB makes this easy. Just call "ignore_whitespace" in your class declaration.
Below is the complete JSON parser. It should match any legal JSON and reject anything illegal. However, it just returns the parse-tree when it succeeds. In the next section we'll add the same functionality Aaron did in his blog post.
require "babel_bridge"
class JsonParser < BabelBridge::Parser
ignore_whitespace
rule :document, any(:object, :array)
rule :array, '[', many?(:value, ','), ']'
rule :object, '{', many?(:pair, ','), '}'
rule :pair, :string, ':', :value
rule :value, any(
:object, :array, :number,
:string, :true, :false, :null
)
rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
rule :true, "true"
rule :false, "false"
rule :null, "null"
end
BabelBridge::Shell.new(JsonParser.new).start
source
Adding Functionality to The Parse Tree
Every rule you define creates a node class. Every time an instance of that rule is parsed, an instance of that node class is created. With Babel-Bridge you can easily define methods on the parse-tree nodes for each rule you declare. When calling the rule method, just open up a do-block. The code inside the do-block is executed inside the class for that node.
Our goal now is to convert the resulting parse-tree from our JSON parser into the equivalent Ruby Hashes, Arrays and Strings etc. We are going to add an "evaluate" method to our nodes which returns the correct Ruby data-structures. For example, here is how we can add the evaluate method to our :null rule.
rule :null, "null" do
def evaluate; nil end
end
To evaluate the remaining JSON literals we can leverage the fact that Ruby's literals are nearly identical. In fact, "null" is the only one which is different. To avoid duplicating code (DRY!), I'm going to insert an extra rule called :ruby_compatible_literal. Calling to_s on a node just returns the source string that node matched. Using Ruby's "eval" method, we can evaluate that string and get the Ruby object back.
require "babel_bridge"
class JsonParser < BabelBridge::Parser
ignore_whitespace
rule :document, any(:object, :array)
rule :array, '[', many?(:value, ','), ']'
rule :object, '{', many?(:pair, ','), '}'
rule :pair, :string, ':', :value
rule :value, any(:object, :array,
:ruby_compatible_literal, :null
)
rule :ruby_compatible_literal, any(:number,
:string, :true, :false
) do
def evaluate; eval(to_s); end
end
rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
rule :true, "true"
rule :false, "false"
rule :null, "null" do
def evaluate; nil end
end
end
BabelBridge::Shell.new(JsonParser.new).start
Next, let's add the evaluate method to array. When there are nested rules, you can access them by name. If that sub-rule can be matched more than once, it will return an array of the child nodes which were matched with that rule. Here, the #value method returns a :value nodes in matched in the array. We just need to convert that array to the #evaluated versions of each :value node.
rule :array, '[', many?(:value, ','), ']' do
def evaluate; value.collect {|v| v.evaluate}; end
end
Evaluating Objects takes two steps. The :pair rule returns a 2-element array of the form: [key, value]. Then the :object rule uses the Hash[] constructor for the final result.
rule :object, '{', many?(:pair, ','), '}' do
def evaluate; Hash[ pair.collect {|p| p.evaluate } ] end
end
rule :pair, :string, ':', :value do
def evaluate; [ eval(string.to_s), value.evaluate ] end
end
Here is the full source-code for parsing JSON and generating Ruby data structures.
require "babel_bridge"
class JsonParser < BabelBridge::Parser
ignore_whitespace
rule :document, any(:object, :array)
rule :array, '[', many?(:value, ','), ']' do
def evaluate; value.collect {|v| v.evaluate} end
end
rule :object, '{', many?(:pair, ','), '}' do
def evaluate; Hash[ pair.collect {|p| p.evaluate } ] end
end
rule :pair, :string, ':', :value do
def evaluate; [ eval(string.to_s), value.evaluate ] end
end
rule :value, any(:object, :array, :ruby_compatible_literal, :null)
rule :ruby_compatible_literal, any(:number, :string, :true, :false) do
def evaluate; eval(to_s); end
end
rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
rule :true, "true"
rule :false, "false"
rule :null, "null" do
def evaluate; nil end
end
end
BabelBridge::Shell.new(JsonParser.new).start
source
You'll notice that we didn't have to define evaluate on every node. Babel-Bridge has some intelligent defaults to help minimize code-size. Now run the code and enter some JSON values:
> [null]
=> [nil]
> {"my key":"my value"}
=> {"my key"=>"my value"}
Notice that the shell stopped printing our parse-tree. The shell looks for the "evaluate" method on the root node. If it exists, it prints out root_node.evaluate.inspect.
You might also try inputing some illegal values. This shows off one of Babel-Bridge's powerful debugging tools - the parser-failure-info.
> [nil]
Parsing error at line 1 column 2 offset 1
Source:
...
[<HERE>nil]
...
Parse path at failure:
DocumentNode1
Expecting one of:
"[" ArrayNode1
"]" ArrayNode1
"false" FalseNode1
"null" NullNode1
"true" TrueNode1
"{" ObjectNode1
/"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/ StringNode1
/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/ NumberNode1
If you want to use your parser in real code, you probably won't want to use the shell. Here's how:
json_parser = JsonParser.new
parse_tree = json_parser.parse "[null]"
if parse_tree
ruby_data_structure = parse_tree.evaluate
puts ruby_data_structure.inspect
else
puts "ParsingError: " + json_parser.parser_failure_info
end
For further reading, you might be interested in my earlier blog post:
How to Create a Turing Complete Programming Language in 40 Minutes (and 70 lines of code).