Thursday, January 31, 2013

Parsing JSON the Not-So-Hard Way With Ruby and Babel Bridge

Aaron Patterson recently posted this article on Parsing JSON the hard way using Racc. I think he did an excellent job explaining Racc. However, it was painful. Racc is an LALR(1) parser-generator similar to Yacc. While Yacc was fun to learn in school, it's also painful to use. LALR is hard to debug ("shift-reduce conflict" - !@#$% - if you know what I mean), and the parser generation steps are tedious to set up and maintain. There is a better way...

Enter Parsing Expression Grammars. I learned about PEGs from Treetop. Think of PEGs as greedy regular expressions with recursion. Being greedy, they don't try every possible pattern-match. This turns out to be a good thing; it makes them much easier to understand and debug. Adding recursive rule-matching to regular expressions makes PEGs approximately as expressive as LALR(1) parsers (though not identical). Treetop got me excited to work with parsers again, but sadly it still had some of the awkwardness of Yacc-style parser-generators. You have to build "treetop" files into ruby files before you can use them. There is yet an even better way...

Enter Babel Bridge (github). I wanted a PEG based parser-generator that allowed me to use 100% ruby with no extra build steps, so I built Babel Bridge. I've been evolving BB for over 2 years now. It is a powerful tool for building easy-to-understand parsers.

Here is how to write a JSON parser using the Babel Bridge gem. If you want to skip to the punch-line, here is the JSON parser source code.

Writing a JSON parser with Babel Bridge

First, install the babel_bridge gem:
gem install babel_bridge
With Babel Bridge you can create many different parsers and have many instances of each in one runtime. To define a new parser, create a class and inherit from BabelBridge::Parser.
 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  # define the parser here
end
The main component for defining your parser is the "rule". A rule is a named pattern for matching strings of characters. The first parameter of the "rule" method is the name of the rule. The remaining parameters specify one or more patterns to match. For more detailed info, please see the readme.

There are five literal values that are allowed with JSON: string, number, true, false and null. Let's start by making rules for each of them. (Credit: I used Aaron Patterson's regular expressions he derived from json.org.)
 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  rule :value, any(:number, :string, :true, :false, :null)

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null"
end

BabelBridge::Shell.new(JsonParser.new).start
In addition to the five rules for the literals, I added two more lines of code. First, I added a general rule called :value. It matches any of the five literals. Babel-Bridge starts matching using the first rule you declare. That "root rule" must include everything you want to match. JsonParser will now match any of the JSON literals. Second, I added a "shell". It is a Babel-Bridge tool to help make parser-testing easier. Try running the above code and typing in some JSON literals at the prompts.

 
> "string literal"

Parse tree:
  ValueNode1 > StringNode1 > "\"string literal\""

> 123

Parse tree:
  ValueNode1 > NumberNode1 > "123"

> null

Parse tree:
  ValueNode1 > NullNode1 > "null"
We are halfway done. There are two more important patterns for JSON: objects (hashes) and arrays. Arrays consist of 0 or more :values, delimited by "," and wrapped in "[]". Objects consist of 0 or more :pairs, again delimited by "," and wrapped in "{}". Pairs consist of a :string, then a ":", then a :value.
 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  rule :value, any(
    :object, :array, :number,
    :string, :true, :false, :null
  )

  rule :array,  '[', many?(:value, ','), ']'
  rule :object, '{', many?(:pair,  ','), '}'

  rule :pair, :string, ':', :value

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null"
end

BabelBridge::Shell.new(JsonParser.new).start
You can try out the new rules:
 
> ["my string",123,true]

Parse tree:
  ValueNode1 > ArrayNode1
    "["
    ValueNode1 > StringNode1 > "\"my string\""
    ValueNode1 > NumberNode1 > "123"
    ValueNode1 > TrueNode1 > "true"
    "]"

> {"my key":"my value"}

Parse tree:
  ValueNode1 > ObjectNode1
    "{"
    PairNode1
      StringNode1 > "\"my key\""
      ":"
      ValueNode1 > StringNode1 > "\"my value\""
    "}"
If JSON allowed literals to stand alone, we'd be done. Unfortunately JSON is a little more restrictive, so we have to add a new root-rule to ensure there is an array or object at the top-most level.

There is one other thing we need to add. Parsing-Expression-Grammars match every character or fail. Right now we are not handling white-space, so any white-space character in our string will cause our parsing to fail. BB makes this easy. Just call "ignore_whitespace" in your class declaration.

Below is the complete JSON parser. It should match any legal JSON and reject anything illegal. However, it just returns the parse-tree when it succeeds. In the next section we'll add the same functionality Aaron did in his blog post.
 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  ignore_whitespace

  rule :document, any(:object, :array)

  rule :array,  '[', many?(:value, ','), ']'
  rule :object, '{', many?(:pair,  ','), '}'

  rule :pair, :string, ':', :value

  rule :value, any(
    :object, :array, :number,
    :string, :true, :false, :null
  )

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null"
end

BabelBridge::Shell.new(JsonParser.new).start
source

Adding Functionality to The Parse Tree

Every rule you define creates a node class. Every time an instance of that rule is parsed, an instance of that node class is created. With Babel-Bridge you can easily define methods on the parse-tree nodes for each rule you declare. When calling the rule method, just open up a do-block. The code inside the do-block is executed inside the class for that node.

Our goal now is to convert the resulting parse-tree from our JSON parser into the equivalent Ruby Hashes, Arrays and Strings etc. We are going to add an "evaluate" method to our nodes which returns the correct Ruby data-structures. For example, here is how we can add the evaluate method to our :null rule.
 
  rule :null, "null" do
    def evaluate; nil end
  end
To evaluate the remaining JSON literals we can leverage the fact that Ruby's literals are nearly identical. In fact, "null" is the only one which is different. To avoid duplicating code (DRY!), I'm going to insert an extra rule called :ruby_compatible_literal. Calling to_s on a node just returns the source string that node matched. Using Ruby's "eval" method, we can evaluate that string and get the Ruby object back.
 
require "babel_bridge"
class JsonParser < BabelBridge::Parser
  ignore_whitespace

  rule :document, any(:object, :array)

  rule :array, '[', many?(:value, ','), ']'

  rule :object, '{', many?(:pair,  ','), '}'
  rule :pair, :string, ':', :value

  rule :value, any(:object, :array, 
    :ruby_compatible_literal, :null
  )

  rule :ruby_compatible_literal, any(:number, 
    :string, :true, :false
  ) do
    def evaluate; eval(to_s); end
  end

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null" do
    def evaluate; nil end
  end
end

BabelBridge::Shell.new(JsonParser.new).start
Next, let's add the evaluate method to array. When there are nested rules, you can access them by name. If that sub-rule can be matched more than once, it will return an array of the child nodes which were matched with that rule. Here, the #value method returns a :value nodes in matched in the array. We just need to convert that array to the #evaluated versions of each :value node.
 
  rule :array, '[', many?(:value, ','), ']' do
    def evaluate; value.collect {|v| v.evaluate}; end
  end
Evaluating Objects takes two steps. The :pair rule returns a 2-element array of the form: [key, value]. Then the :object rule uses the Hash[] constructor for the final result.
 
  rule :object, '{', many?(:pair,  ','), '}' do
    def evaluate; Hash[ pair.collect {|p| p.evaluate } ] end
  end

  rule :pair, :string, ':', :value do
    def evaluate; [ eval(string.to_s), value.evaluate ] end
  end
Here is the full source-code for parsing JSON and generating Ruby data structures.
 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  ignore_whitespace

  rule :document, any(:object, :array)

  rule :array, '[', many?(:value, ','), ']' do
    def evaluate; value.collect {|v| v.evaluate} end
  end

  rule :object, '{', many?(:pair,  ','), '}' do
    def evaluate; Hash[ pair.collect {|p| p.evaluate } ] end
  end

  rule :pair, :string, ':', :value do
    def evaluate; [ eval(string.to_s), value.evaluate ] end
  end

  rule :value, any(:object, :array, :ruby_compatible_literal, :null)

  rule :ruby_compatible_literal, any(:number, :string, :true, :false) do
    def evaluate; eval(to_s); end
  end

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null" do
    def evaluate; nil end
  end
end

BabelBridge::Shell.new(JsonParser.new).start
source You'll notice that we didn't have to define evaluate on every node. Babel-Bridge has some intelligent defaults to help minimize code-size. Now run the code and enter some JSON values:
 
> [null]
 => [nil]

> {"my key":"my value"}
 => {"my key"=>"my value"}
Notice that the shell stopped printing our parse-tree. The shell looks for the "evaluate" method on the root node. If it exists, it prints out root_node.evaluate.inspect. You might also try inputing some illegal values. This shows off one of Babel-Bridge's powerful debugging tools - the parser-failure-info.
 
> [nil]
Parsing error at line 1 column 2 offset 1

Source:
...
[<HERE>nil]
...

Parse path at failure:
  DocumentNode1

Expecting one of:
  "["                                                ArrayNode1
  "]"                                                ArrayNode1
  "false"                                            FalseNode1
  "null"                                             NullNode1
  "true"                                             TrueNode1
  "{"                                                ObjectNode1
  /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/ StringNode1
  /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/      NumberNode1
If you want to use your parser in real code, you probably won't want to use the shell. Here's how:
 
json_parser = JsonParser.new

parse_tree = json_parser.parse "[null]"

if parse_tree
  ruby_data_structure = parse_tree.evaluate
  puts ruby_data_structure.inspect
else
  puts "ParsingError: " + json_parser.parser_failure_info
end
For further reading, you might be interested in my earlier blog post: How to Create a Turing Complete Programming Language in 40 Minutes (and 70 lines of code).

4 comments:

  1. Really you have done great job,There are may person searching about that now they will find enough resources by your post, document translation services

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete
  3. I'm having found your blogsh very interesting and useful, especially grateful for a clear example of a script concept. In your example, anyone can follow like for instructions, which is very time-saving, thank you for sharing!
    Richard Brown data room due diligence

    ReplyDelete