Thursday, January 31, 2013

Parsing JSON the Not-So-Hard Way With Ruby and Babel Bridge

Aaron Patterson recently posted this article on Parsing JSON the hard way using Racc. I think he did an excellent job explaining Racc. However, it was painful. Racc is an LALR(1) parser-generator similar to Yacc. While Yacc was fun to learn in school, it's also painful to use. LALR is hard to debug ("shift-reduce conflict" - !@#$% - if you know what I mean), and the parser generation steps are tedious to set up and maintain. There is a better way...

Enter Parsing Expression Grammars. I learned about PEGs from Treetop. Think of PEGs as greedy regular expressions with recursion. Being greedy, they don't try every possible pattern-match. This turns out to be a good thing; it makes them much easier to understand and debug. Adding recursive rule-matching to regular expressions makes PEGs approximately as expressive as LALR(1) parsers (though not identical). Treetop got me excited to work with parsers again, but sadly it still had some of the awkwardness of Yacc-style parser-generators. You have to build "treetop" files into ruby files before you can use them. There is yet an even better way...

Enter Babel Bridge (github). I wanted a PEG based parser-generator that allowed me to use 100% ruby with no extra build steps, so I built Babel Bridge. I've been evolving BB for over 2 years now. It is a powerful tool for building easy-to-understand parsers.

Here is how to write a JSON parser using the Babel Bridge gem. If you want to skip to the punch-line, here is the JSON parser source code.

Writing a JSON parser with Babel Bridge

First, install the babel_bridge gem:
gem install babel_bridge
With Babel Bridge you can create many different parsers and have many instances of each in one runtime. To define a new parser, create a class and inherit from BabelBridge::Parser.
 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  # define the parser here
end
The main component for defining your parser is the "rule". A rule is a named pattern for matching strings of characters. The first parameter of the "rule" method is the name of the rule. The remaining parameters specify one or more patterns to match. For more detailed info, please see the readme.

There are five literal values that are allowed with JSON: string, number, true, false and null. Let's start by making rules for each of them. (Credit: I used Aaron Patterson's regular expressions he derived from json.org.)
 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  rule :value, any(:number, :string, :true, :false, :null)

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null"
end

BabelBridge::Shell.new(JsonParser.new).start
In addition to the five rules for the literals, I added two more lines of code. First, I added a general rule called :value. It matches any of the five literals. Babel-Bridge starts matching using the first rule you declare. That "root rule" must include everything you want to match. JsonParser will now match any of the JSON literals. Second, I added a "shell". It is a Babel-Bridge tool to help make parser-testing easier. Try running the above code and typing in some JSON literals at the prompts.

 
> "string literal"

Parse tree:
  ValueNode1 > StringNode1 > "\"string literal\""

> 123

Parse tree:
  ValueNode1 > NumberNode1 > "123"

> null

Parse tree:
  ValueNode1 > NullNode1 > "null"
We are halfway done. There are two more important patterns for JSON: objects (hashes) and arrays. Arrays consist of 0 or more :values, delimited by "," and wrapped in "[]". Objects consist of 0 or more :pairs, again delimited by "," and wrapped in "{}". Pairs consist of a :string, then a ":", then a :value.
 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  rule :value, any(
    :object, :array, :number,
    :string, :true, :false, :null
  )

  rule :array,  '[', many?(:value, ','), ']'
  rule :object, '{', many?(:pair,  ','), '}'

  rule :pair, :string, ':', :value

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null"
end

BabelBridge::Shell.new(JsonParser.new).start
You can try out the new rules:
 
> ["my string",123,true]

Parse tree:
  ValueNode1 > ArrayNode1
    "["
    ValueNode1 > StringNode1 > "\"my string\""
    ValueNode1 > NumberNode1 > "123"
    ValueNode1 > TrueNode1 > "true"
    "]"

> {"my key":"my value"}

Parse tree:
  ValueNode1 > ObjectNode1
    "{"
    PairNode1
      StringNode1 > "\"my key\""
      ":"
      ValueNode1 > StringNode1 > "\"my value\""
    "}"
If JSON allowed literals to stand alone, we'd be done. Unfortunately JSON is a little more restrictive, so we have to add a new root-rule to ensure there is an array or object at the top-most level.

There is one other thing we need to add. Parsing-Expression-Grammars match every character or fail. Right now we are not handling white-space, so any white-space character in our string will cause our parsing to fail. BB makes this easy. Just call "ignore_whitespace" in your class declaration.

Below is the complete JSON parser. It should match any legal JSON and reject anything illegal. However, it just returns the parse-tree when it succeeds. In the next section we'll add the same functionality Aaron did in his blog post.
 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  ignore_whitespace

  rule :document, any(:object, :array)

  rule :array,  '[', many?(:value, ','), ']'
  rule :object, '{', many?(:pair,  ','), '}'

  rule :pair, :string, ':', :value

  rule :value, any(
    :object, :array, :number,
    :string, :true, :false, :null
  )

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null"
end

BabelBridge::Shell.new(JsonParser.new).start
source

Adding Functionality to The Parse Tree

Every rule you define creates a node class. Every time an instance of that rule is parsed, an instance of that node class is created. With Babel-Bridge you can easily define methods on the parse-tree nodes for each rule you declare. When calling the rule method, just open up a do-block. The code inside the do-block is executed inside the class for that node.

Our goal now is to convert the resulting parse-tree from our JSON parser into the equivalent Ruby Hashes, Arrays and Strings etc. We are going to add an "evaluate" method to our nodes which returns the correct Ruby data-structures. For example, here is how we can add the evaluate method to our :null rule.
 
  rule :null, "null" do
    def evaluate; nil end
  end
To evaluate the remaining JSON literals we can leverage the fact that Ruby's literals are nearly identical. In fact, "null" is the only one which is different. To avoid duplicating code (DRY!), I'm going to insert an extra rule called :ruby_compatible_literal. Calling to_s on a node just returns the source string that node matched. Using Ruby's "eval" method, we can evaluate that string and get the Ruby object back.
 
require "babel_bridge"
class JsonParser < BabelBridge::Parser
  ignore_whitespace

  rule :document, any(:object, :array)

  rule :array, '[', many?(:value, ','), ']'

  rule :object, '{', many?(:pair,  ','), '}'
  rule :pair, :string, ':', :value

  rule :value, any(:object, :array, 
    :ruby_compatible_literal, :null
  )

  rule :ruby_compatible_literal, any(:number, 
    :string, :true, :false
  ) do
    def evaluate; eval(to_s); end
  end

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null" do
    def evaluate; nil end
  end
end

BabelBridge::Shell.new(JsonParser.new).start
Next, let's add the evaluate method to array. When there are nested rules, you can access them by name. If that sub-rule can be matched more than once, it will return an array of the child nodes which were matched with that rule. Here, the #value method returns a :value nodes in matched in the array. We just need to convert that array to the #evaluated versions of each :value node.
 
  rule :array, '[', many?(:value, ','), ']' do
    def evaluate; value.collect {|v| v.evaluate}; end
  end
Evaluating Objects takes two steps. The :pair rule returns a 2-element array of the form: [key, value]. Then the :object rule uses the Hash[] constructor for the final result.
 
  rule :object, '{', many?(:pair,  ','), '}' do
    def evaluate; Hash[ pair.collect {|p| p.evaluate } ] end
  end

  rule :pair, :string, ':', :value do
    def evaluate; [ eval(string.to_s), value.evaluate ] end
  end
Here is the full source-code for parsing JSON and generating Ruby data structures.
 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  ignore_whitespace

  rule :document, any(:object, :array)

  rule :array, '[', many?(:value, ','), ']' do
    def evaluate; value.collect {|v| v.evaluate} end
  end

  rule :object, '{', many?(:pair,  ','), '}' do
    def evaluate; Hash[ pair.collect {|p| p.evaluate } ] end
  end

  rule :pair, :string, ':', :value do
    def evaluate; [ eval(string.to_s), value.evaluate ] end
  end

  rule :value, any(:object, :array, :ruby_compatible_literal, :null)

  rule :ruby_compatible_literal, any(:number, :string, :true, :false) do
    def evaluate; eval(to_s); end
  end

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null" do
    def evaluate; nil end
  end
end

BabelBridge::Shell.new(JsonParser.new).start
source You'll notice that we didn't have to define evaluate on every node. Babel-Bridge has some intelligent defaults to help minimize code-size. Now run the code and enter some JSON values:
 
> [null]
 => [nil]

> {"my key":"my value"}
 => {"my key"=>"my value"}
Notice that the shell stopped printing our parse-tree. The shell looks for the "evaluate" method on the root node. If it exists, it prints out root_node.evaluate.inspect. You might also try inputing some illegal values. This shows off one of Babel-Bridge's powerful debugging tools - the parser-failure-info.
 
> [nil]
Parsing error at line 1 column 2 offset 1

Source:
...
[<HERE>nil]
...

Parse path at failure:
  DocumentNode1

Expecting one of:
  "["                                                ArrayNode1
  "]"                                                ArrayNode1
  "false"                                            FalseNode1
  "null"                                             NullNode1
  "true"                                             TrueNode1
  "{"                                                ObjectNode1
  /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/ StringNode1
  /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/      NumberNode1
If you want to use your parser in real code, you probably won't want to use the shell. Here's how:
 
json_parser = JsonParser.new

parse_tree = json_parser.parse "[null]"

if parse_tree
  ruby_data_structure = parse_tree.evaluate
  puts ruby_data_structure.inspect
else
  puts "ParsingError: " + json_parser.parser_failure_info
end
For further reading, you might be interested in my earlier blog post: How to Create a Turing Complete Programming Language in 40 Minutes (and 70 lines of code).

Saturday, January 12, 2013

Don't Learn a New Programming Language Every Year. Write one!

Don't learn a new language every year

Programmers often recommend learning a new programming language every year. I don't agree. I've studied many languages by reading their specs and dabbling with some code until I feel like I've got the basics, but I don't think I really learned them. My path to really "learning" Ruby showed me why.

Ruby was not a challenging language to learn. I picked up the Pragmatic Programmers book and read it end to end (an endorsement to both the language and the book). When the time came to start Imikimi, I chose Rails to do the web-side. However, I discovered that I hadn't really learned to *think* in the language until after several years of continuous professional use.

I was a C++ programmer before I was a Ruby programmer. A lot of my early Ruby code looks and feels a lot like C++ code. There were constructs in Ruby that I wasn't comfortable with, so I simply didn't use them. Over time I re-discovered language features I "learned" in my first reading of the Pragmatic book. Over time I learned how to program in Ruby-style from the community. By studying the Rails source and other gems I learned how to better use the language and how to write clean, readable code.

Learning a language by reading a book and writing a few small sample apps is equivalent to learning the grammar and a few phrases of spoken language. It's a shallow understanding which only broadens your perspective in shallow ways. I don't think you can claim to really know a language, spoken or programming, until you are conversant in it. It's a long, slow, but ultimately rewarding processes.

Here's my recommendation: To get started, do learn a few extra languages, shallowly. The second and third language you learn will broaden your horizons even if you don't dive too deep into them. After that, though, you need to focus on deepening your understanding either of your current language of choice, or a new one, through hours and hours of work on real projects, OR through my second recommendation:

Write a Toy language every few years

I'm writing a programming language. I've spent more than a decade thinking about programming languages and PL design. From 2001 until 2006 I worked on new language ideas full-time. Unfortunately none of that work saw the light of day, so I'm circling around again. I've learned a lot since then, primarily from Ruby and the Ruby community.

My current project, the River programming language, is an attempt to build the simplest Prototype-Based core language that can support a Ruby-like class-system on-top, written in the language itself. I've learned more about Ruby in the past few months than I have in the past few years.

You learn a lot about programming when you write a language implementation. It forces you to examine all the edge cases you normally ignore. And it happens fast. You are immediately forced into uncomfortable areas of programming languages you were unaware of. There are trade-offs to be made at every level. There are boot-strapping issues to resolve, and there are a host of subtle semantics in your host language that you never thought about.

How to get started

If you decide to follow this path, here are a few tips.
  • Use a flexible, agile host language. You will make many major refactors along the way, and you don't want to get bogged down in irrelevant details. This is true for any project, but especially true when working on something as meta as writing one language inside another. I'd recommend Ruby, but Python or even Javascript would work well.
  • Use a well-designed parser-generator (resource: my Babel-Bridge parser-generator gem)
  • Don't worry about performance at first, even if that is your primary goal.
  • Target Turing-complete functionality before all else. This will give you a solid base-camp to build out from as you explore the rest of your language. (resource: my How to write a Turing-Complete Programming Language in 40 minutes video)

You can follow my design progress on Twitter: #riverlang / @shanebdavis

Happy PL design!