Enter Parsing Expression Grammars. I learned about PEGs from Treetop. Think of PEGs as greedy regular expressions with recursion. Being greedy, they don't try every possible pattern-match. This turns out to be a good thing; it makes them much easier to understand and debug. Adding recursive rule-matching to regular expressions makes PEGs approximately as expressive as LALR(1) parsers (though not identical). Treetop got me excited to work with parsers again, but sadly it still had some of the awkwardness of Yacc-style parser-generators. You have to build "treetop" files into ruby files before you can use them. There is yet an even better way...
Enter Babel Bridge (github). I wanted a PEG based parser-generator that allowed me to use 100% ruby with no extra build steps, so I built Babel Bridge. I've been evolving BB for over 2 years now. It is a powerful tool for building easy-to-understand parsers.
Here is how to write a JSON parser using the Babel Bridge gem. If you want to skip to the punch-line, here is the JSON parser source code.
Writing a JSON parser with Babel Bridge
First, install the babel_bridge gem:gem install babel_bridgeWith Babel Bridge you can create many different parsers and have many instances of each in one runtime. To define a new parser, create a class and inherit from BabelBridge::Parser.
require "babel_bridge" class JsonParser < BabelBridge::Parser # define the parser here endThe main component for defining your parser is the "rule". A rule is a named pattern for matching strings of characters. The first parameter of the "rule" method is the name of the rule. The remaining parameters specify one or more patterns to match. For more detailed info, please see the readme.
There are five literal values that are allowed with JSON: string, number, true, false and null. Let's start by making rules for each of them. (Credit: I used Aaron Patterson's regular expressions he derived from json.org.)
require "babel_bridge" class JsonParser < BabelBridge::Parser rule :value, any(:number, :string, :true, :false, :null) rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/ rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/ rule :true, "true" rule :false, "false" rule :null, "null" end BabelBridge::Shell.new(JsonParser.new).startIn addition to the five rules for the literals, I added two more lines of code. First, I added a general rule called :value. It matches any of the five literals. Babel-Bridge starts matching using the first rule you declare. That "root rule" must include everything you want to match. JsonParser will now match any of the JSON literals. Second, I added a "shell". It is a Babel-Bridge tool to help make parser-testing easier. Try running the above code and typing in some JSON literals at the prompts.
> "string literal" Parse tree: ValueNode1 > StringNode1 > "\"string literal\"" > 123 Parse tree: ValueNode1 > NumberNode1 > "123" > null Parse tree: ValueNode1 > NullNode1 > "null"We are halfway done. There are two more important patterns for JSON: objects (hashes) and arrays. Arrays consist of 0 or more :values, delimited by "," and wrapped in "[]". Objects consist of 0 or more :pairs, again delimited by "," and wrapped in "{}". Pairs consist of a :string, then a ":", then a :value.
require "babel_bridge" class JsonParser < BabelBridge::Parser rule :value, any( :object, :array, :number, :string, :true, :false, :null ) rule :array, '[', many?(:value, ','), ']' rule :object, '{', many?(:pair, ','), '}' rule :pair, :string, ':', :value rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/ rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/ rule :true, "true" rule :false, "false" rule :null, "null" end BabelBridge::Shell.new(JsonParser.new).startYou can try out the new rules:
> ["my string",123,true] Parse tree: ValueNode1 > ArrayNode1 "[" ValueNode1 > StringNode1 > "\"my string\"" ValueNode1 > NumberNode1 > "123" ValueNode1 > TrueNode1 > "true" "]" > {"my key":"my value"} Parse tree: ValueNode1 > ObjectNode1 "{" PairNode1 StringNode1 > "\"my key\"" ":" ValueNode1 > StringNode1 > "\"my value\"" "}"If JSON allowed literals to stand alone, we'd be done. Unfortunately JSON is a little more restrictive, so we have to add a new root-rule to ensure there is an array or object at the top-most level.
There is one other thing we need to add. Parsing-Expression-Grammars match every character or fail. Right now we are not handling white-space, so any white-space character in our string will cause our parsing to fail. BB makes this easy. Just call "ignore_whitespace" in your class declaration.
Below is the complete JSON parser. It should match any legal JSON and reject anything illegal. However, it just returns the parse-tree when it succeeds. In the next section we'll add the same functionality Aaron did in his blog post.
require "babel_bridge" class JsonParser < BabelBridge::Parser ignore_whitespace rule :document, any(:object, :array) rule :array, '[', many?(:value, ','), ']' rule :object, '{', many?(:pair, ','), '}' rule :pair, :string, ':', :value rule :value, any( :object, :array, :number, :string, :true, :false, :null ) rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/ rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/ rule :true, "true" rule :false, "false" rule :null, "null" end BabelBridge::Shell.new(JsonParser.new).startsource
Adding Functionality to The Parse Tree
Every rule you define creates a node class. Every time an instance of that rule is parsed, an instance of that node class is created. With Babel-Bridge you can easily define methods on the parse-tree nodes for each rule you declare. When calling the rule method, just open up a do-block. The code inside the do-block is executed inside the class for that node.Our goal now is to convert the resulting parse-tree from our JSON parser into the equivalent Ruby Hashes, Arrays and Strings etc. We are going to add an "evaluate" method to our nodes which returns the correct Ruby data-structures. For example, here is how we can add the evaluate method to our :null rule.
rule :null, "null" do def evaluate; nil end endTo evaluate the remaining JSON literals we can leverage the fact that Ruby's literals are nearly identical. In fact, "null" is the only one which is different. To avoid duplicating code (DRY!), I'm going to insert an extra rule called :ruby_compatible_literal. Calling to_s on a node just returns the source string that node matched. Using Ruby's "eval" method, we can evaluate that string and get the Ruby object back.
require "babel_bridge" class JsonParser < BabelBridge::Parser ignore_whitespace rule :document, any(:object, :array) rule :array, '[', many?(:value, ','), ']' rule :object, '{', many?(:pair, ','), '}' rule :pair, :string, ':', :value rule :value, any(:object, :array, :ruby_compatible_literal, :null ) rule :ruby_compatible_literal, any(:number, :string, :true, :false ) do def evaluate; eval(to_s); end end rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/ rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/ rule :true, "true" rule :false, "false" rule :null, "null" do def evaluate; nil end end end BabelBridge::Shell.new(JsonParser.new).startNext, let's add the evaluate method to array. When there are nested rules, you can access them by name. If that sub-rule can be matched more than once, it will return an array of the child nodes which were matched with that rule. Here, the #value method returns a :value nodes in matched in the array. We just need to convert that array to the #evaluated versions of each :value node.
rule :array, '[', many?(:value, ','), ']' do def evaluate; value.collect {|v| v.evaluate}; end endEvaluating Objects takes two steps. The :pair rule returns a 2-element array of the form: [key, value]. Then the :object rule uses the Hash[] constructor for the final result.
rule :object, '{', many?(:pair, ','), '}' do def evaluate; Hash[ pair.collect {|p| p.evaluate } ] end end rule :pair, :string, ':', :value do def evaluate; [ eval(string.to_s), value.evaluate ] end endHere is the full source-code for parsing JSON and generating Ruby data structures.
require "babel_bridge" class JsonParser < BabelBridge::Parser ignore_whitespace rule :document, any(:object, :array) rule :array, '[', many?(:value, ','), ']' do def evaluate; value.collect {|v| v.evaluate} end end rule :object, '{', many?(:pair, ','), '}' do def evaluate; Hash[ pair.collect {|p| p.evaluate } ] end end rule :pair, :string, ':', :value do def evaluate; [ eval(string.to_s), value.evaluate ] end end rule :value, any(:object, :array, :ruby_compatible_literal, :null) rule :ruby_compatible_literal, any(:number, :string, :true, :false) do def evaluate; eval(to_s); end end rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/ rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/ rule :true, "true" rule :false, "false" rule :null, "null" do def evaluate; nil end end end BabelBridge::Shell.new(JsonParser.new).startsource You'll notice that we didn't have to define evaluate on every node. Babel-Bridge has some intelligent defaults to help minimize code-size. Now run the code and enter some JSON values:
> [null] => [nil] > {"my key":"my value"} => {"my key"=>"my value"}Notice that the shell stopped printing our parse-tree. The shell looks for the "evaluate" method on the root node. If it exists, it prints out root_node.evaluate.inspect. You might also try inputing some illegal values. This shows off one of Babel-Bridge's powerful debugging tools - the parser-failure-info.
> [nil] Parsing error at line 1 column 2 offset 1 Source: ... [<HERE>nil] ... Parse path at failure: DocumentNode1 Expecting one of: "[" ArrayNode1 "]" ArrayNode1 "false" FalseNode1 "null" NullNode1 "true" TrueNode1 "{" ObjectNode1 /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/ StringNode1 /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/ NumberNode1If you want to use your parser in real code, you probably won't want to use the shell. Here's how:
json_parser = JsonParser.new parse_tree = json_parser.parse "[null]" if parse_tree ruby_data_structure = parse_tree.evaluate puts ruby_data_structure.inspect else puts "ParsingError: " + json_parser.parser_failure_info endFor further reading, you might be interested in my earlier blog post: How to Create a Turing Complete Programming Language in 40 Minutes (and 70 lines of code).