Sunday, December 15, 2013

Principle of Exposure and Transparency - POET

The Visibility Problem

We have a problem in programming. The majority of our work is invisible. It's invisible to the end user. It's invisible to us. We can only see ~50 lines at a time of the thousands and millions of lines of code in our programs. Imagine a miopic Leonardo da Vinci who could only focus on a square inch of canvas trying to paint the Mono Lisa or even the Sistine Chapel.

Most modern debuggers can only show a dozen variables and a stack trace. Generously, that represents no more than a few hundred bytes of memory. Our running programs are measured in megabytes if not gigabytes. In anything but the most trivial program, we can't see more than a small fraction of a percent of our paused runtime state. When our creation is actually running, we are blind.

Two Challenges of Programming

All difficulties in programming can be separated into two categories. There are the problems due to core computer-science limitations or limitations of our personal understanding. These are legitimate challenges which require real engineering, design and exploration. All other problems are due to a lack of understanding of what our creation is actually doing.

One way to measure a programmer's skill is how well they can simulate programs in their head. This can be measured concretely by how much code can they write without error without running or compiling. Mental program simulation is the only way we can mange the complete lack of visibility in running programs. The best programmers have developed an intuitive sense about running programs over years and decades.

Simulating programs in our head is a major part of modern programming. This is broken. The computer is infinitely better at running programs. There is a monumentous opportunity to improve programming by developing better tools for inspecting, exposing and making our running programs transparent.

The Principle of Exposure and Transparency (POET)

POET: All bugs are trivial once their cause is exposed; given enough runtime transparency, all bugs are trivial.

The vast majority of bugs only require a few lines to fix. Most of our debugging time is spent trying to understand bugs well enough to know what two lines to change. POET embodies this observation: All this wasted time can be reclaimed with better instruments for understanding our running programs.

Linus's Law

Given enough eyeballs, all bugs are shallow. - Eric Raymond, The Cathedral and the Bazaar

POET is a generalization of Linus's Law. Getting more people looking at the code - "enough eyeballs" - is one way to better expose the code's runtime properties. However, it is a very inefficient method. It is akin to saying "with enough miopic da Vincis, we can paint the whole Mono Lisa." While perhaps true, a better solution would be repair his vision so Leonardo can do it without assistance.

Homework

Pay attention the next time you run your code. Why are you running it? What information are you attempting to gain? How much time does it take to gather this information? Be sure to include the time it takes you to context switch, type in commands, wait for the run to start and do whatever interaction it takes to bring the program to the required state. How might you gain that information quicker with better tools? LiveReload plugins for web-page development is one such example.
When you are working on a bug or performance problem, note how much time you spend understanding the problem vs how many lines of code were changed in the final fix. If you could send that patch back in time, how much time would you have saved? Note the key insight along your discovery path led to the final solution. How could you have gotten that insight sooner? ( Spark Inspector was just that tool for us in a recent debug jam session for our iPad app. )

Wednesday, October 9, 2013

Bugs and Mutations

It took me hours to find this bug, and it turned out to be pretty surprising. I had good check-ins before and after the change. I was refactoring my code to make the rectangle class immutable. After the change text was no longer positioned correctly:

correct position

incorrect position

I changed one line in my text-layout code to be compatible with immutable rectangles:

- metrics.area.location = metrics.area.location.neg.add offset
+ metrics.area = rect metrics.area.location.neg.add(offset), metrics.area.size

I can tell you there were no bugs in the new immutable Rectangle class.

Can you spot the bug?

There is no bug. The new line is correct, and yet that's all I changed.

It turns out I originally had a bug due to a mutation I wasn't taking into account, but that mutation made the layout algorithm work. When I stopped mutating the rectangle object, the "bug" stopped making the algorithm work. The reason it took me hours to find this bug was the new code was "right", but the algorithm was wrong.

Here's what was actually wrong, in 3 code snippets (CoffeeScript):

1) The original code that worked but had an accidental mutation:

    updateTextLayout: ->
      offset = point()
      @area = @metricsList[0].area  # object assigned here
      for metrics in @metricsList
        metrics.area.location = metrics.area.location.neg.add offset  
          # gets mutated here the first time through the loop
        offset = offset.add 0, @fontSize
        @area = @area.union metrics.area
      @size = @area.size

2) Code with one line changed that seemed correct but didn't work:

    updateTextLayout: ->
      offset = point()
      @area = @metricsList[0].area
      for metrics in @metricsList
        metrics.area = rect metrics.area.location.neg.add(offset), metrics.area.size  
          # @area no longer gets mutated
        offset = offset.add 0, @fontSize
        @area = @area.union metrics.area
      @size = @area.size

3) Final working code:

    updateTextLayout: ->
      offset = point()
      @area = null  # FIX - don't capture the wrong value here
      for metrics in @metricsList
        metrics.area = rect metrics.area.location.neg.add(offset), metrics.area.size
        offset = offset.add 0, @fontSize
        @area = if @area then @area.union metrics.area else metrics.area    
          # FIX - capture the right value here
      @size = @area.size

Wednesday, September 11, 2013

Sashimi Coding

Sashimi is a Japanese delicacy consisting of very fresh raw meat or fish sliced into thin pieces. - Wikipedia

Sashimi-coding* is the art of slicing up code into the smallest possible parts. Sashimi-coding works best when used consistently across the entire application. It may seem painful at first, but once you get used to it, you'll wonder how you ever lived without it.

Sashimi-Functions are Tiny

Sashimi-functions are the smallest possible functions.

10 lines max for dynamic languages (Javascript, Coffeescript, Ruby, Python)
20 lines max for static languages (Java, C++, Objective-C)

The Five Benefits of Sashimi-Functions

Easier to understand
Easier to test
More reusable
Self documenting
Inheritance & monkey-patching friendly

Easier to Understand

A function which is twice as long is more than twice as hard to understand. Sashimi-functions are not just a little easier to understand, they are typically much easier to understand.

Easier to Test

Breaking a large function into sashimi-functions allows you to test simpler components in isolation before testing them together.

More Reusable

Slicing a function into many sashimi-functions opens up the possibility of reusing sub-parts of the original function.

Self Documenting

Comments in the middle of functions are a code-smell. Sashimi-functions eliminate the need for these comments. Split the function in two, and the comment becomes the name of the new function.

Inheritance & Monkey-Patching

In both cases, adding custom code must be done in whole-function units. Sashimi-functions make it easy to augment or replace only the code you care about.

Performance Cost?

There is one down-side. There may be a performance cost to sashimi-functions. In static languages the compiler can usually inline away any overhead, but in dynamic languages it depends a lot on how smart the runtime environment is.

In almost all cases the large increase in developer productivity wins over the small decrease in performance. As always, avoid premature optimization and profile-first so you only optimize what needs optimizing.

* I heard the term Sashimi Coding first from Jason Thane who in turn heard it from Nathan McCoy.

Tuesday, June 11, 2013

Riak and Monotable

The Imikimi Monotable project has been placed on indefinite hold. Unfortunately we don't have the manpower. It is an excellent solution to our back-end data-store needs, and probably for many others as well. However, it is still going to require a lot of work to bring to production-ready.

Instead, we are looking seriously at Riak for our back-end store. We hope to put all our Kimi-data and most of our Sql-data in two separate Riak clusters. Riak has several benefits that Monotable was designed to solve:

no central server - every node is the same and any node can respond to any request
scales easily by adding one node at a time
can configure replication levels on a per-bucket basis
supports map-reduce
can store binary data - up to 16 megabytes per record (this our requirement, not Riak's exact max size)

However, Riak has some down-sides:

Riak's "eventually consistent" system only makes progress on a record's consistency when it is read. It is not clear that any progress is made if the record is never read after a write or consistency failure. This means the latest version of a record could go under-replicated indefinitely.
Riak's scaling is limited by your initial "ring-size". My hazy understanding is we need to set our ring-size to 20x our max expected server-count.
Riak has a lot of cross-talk between nodes. There is potentially an upper limit of 300 servers before performance degrades.
Riak does not allow range-selects on the primary key
The behavior and scalability of secondary indexes is not clear to me. Typically distributed hash tables can't do secondary indexes. Riak has secondary indexes, but there is probably a catch that we don't understand yet.
I don't think Riak has "live" map-reduce.

Monotable is designed to achieve all of the above goals and scale to very large sizes. However, we are a long ways from needing >300 servers. As long as Riak's indexes work reasonably well and we can do off-line map-reduce jobs without disrupting site performance, it should be good enough for us for now.

Monotable and Map-Reduce

Live map-reduce was not an original requirement for Monotable. However, it is a very powerful feature. It would bring Monotable from being a simple key-value store with basic range-selects to a system with powerful indexing and querying capabilities. It turns out the core of Monotable is well suited for it.

In an offline map-reduce, each chunk can be stream-processed by the mapper and fed into the reducer. Once a chunk has been reduced to a single output, the results of all chunks can be merged until a single overall result is output. If the reducer can write data back to Monotable, you can use this to generate secondary indexes.

These indexes are more powerful than SQL indexes which have a 1-to-1 mapping between source and index records. With the reduce step we can arbitrarily choose to index at any level of reduction. In fact we can store the result of every reduction if we wish. This allows us to generate an index + aggregate hybrid. With this data structure we can compute the map-reduce result over any sub-range in O(ln n) time.

This much was already a possibility within the original Monotable requirements. Just take the Monotable core and add a standard map-reduce system over top. The question becomes how can we make it live-updating?

Acknowledgment: Live-Map-Reduce is not new. I believe CouchDB does this already. CouchDB, however, doesn't scale easily to large distributed clusters.

Live-Updating Map-Reduce in Monotable

Anytime a source-record is changed, the off-line generated index will become out-of-date. Using normal map-reduce, we'd have to delete all the old data and re-run the map-reduce. Deleting a large range of records is fast in Monotable, but re-running the map-reduce will be slow. Since we are already storing the entire reduce-tree, couldn't we re-map the changed record and then update only the O(ln n) reduce steps which depend on it?

Structure Definitions

First, I need to briefly describe the structure of Monotable. The entire Monotable stores all records, sorted by key. Monotable is logically partitioned into "ranges" which are similar to separate "tables" in other stores. Each range is broken up into one or more 64 meg "chunks". Each chunk exclusively covers a sub-range of the overall Monotable. All records in Monotable with keys >= the chunk's start key and < it's end key are stored in that chunk.

To keep things simple, let's separate out map-reduce. Each range in Monotable can have zero or more mappers or reducers. A mapper takes each record and transforms it and then writes the results back into Monotable in a different range. Each reducer takes all records in one chunk and reduces them down to one record which is written, again, to a different range. In both cases, the new range can in turn have a mapper or reducer.

To configure a standard map reduce, we would apply a mapper to our data-range which write records to another range we'll call the "index". The index-range would have a reducer which write records to a meta-index-range. This meta-index-range would in turn have the same reducer step to yet another meta-meta-index-range. This continues until the meta-meta-meta-*-index-range fits within one chunk.

Live Updating

Now that we have a sense of our data-structure, how do we keep it up to date? Due to the nature of distributed data, we can't afford to do it synchronously. Monotable guarantees atomic operations within a chunk, but provides no guarantees across chunk boundaries.

When a record changes, the first step is to mark it's chunk as "dirty" with respect to all mappers and reducers applied to that chunk. Asynchronously, the master server for each chunk can later initiate the process of "cleaning" the chunk by updating all effected maps and reductions.

Unfortunately, simply re-applying the map or reduce step for the chunk doesn't work. The mapper may generate arbitrary output based on the record's data. If we naively insert the new value in the index we may be forgetting to remove the old value, which may be anywhere else in the index. We must have both the old and the new value so we can delete the old value before inserting the new one. Note that if the mapped old and mapped new values are identical, we can skip the rest of the update.

Keeping the old value of a record adds complexities, particularly since we want to allow more than one active mapper or reducer. For every mapper/reducer we need to be able to access the version of every changed records as they were the last time that mapper/reducer ran.

A New Chunk Data Structure with Versioning

I believe the answer is similar to how CouchDB stores its entire database - as a continually appending log representation of a b-tree. This format keeps a complete revision history of all changes. Each mapper or reducer just needs to track the offset of the snapshot when it was last run.

If we limit this to each individual chunk, then we avoid the main problem of CouchDB: it is painful to compact large databases. Since chunks are limited in size, we always have relatively small files which will be fast to compact. We have to be a bit careful when we compact chunks to ensure that we don't lose any of the snapshots that the map/reducers need. However, if they are kept up-to-date at roughly the same rate as our compaction schedule they should usually be fairly up-to-date and we only need to keep the most recent few snapshots. We just need a compaction algorithm that takes into account a list of snapshots that must be maintained after compaction.

Keep in mind that all this snapshotting and revision tracking is managed within one server. No synchronization needs to happen with other servers in the cluster.

As a side-effect, Monotable gains a general snapshotting system. On a chunk-by-chunk bases we could establish any arbitrary history tracking plan up to tracking all history.

Chaining

Now we know how on map/reduce step will stay up to date. The rest of the system falls out automatically. The mapper will dump its outputs into a separate range of Monotable which in turn will become "dirty" and trigger a reducer step. The reducer will write to yet another range triggering another reducer step. This will continue until the source range is reduced down to <= 1 chunk.

Chaining will add some latency to the system. Each step invokes an asynchronous delay between being marked "dirty" and the master-server for that chunk scheduling the map/reducer action. The good news is we are effectively working with a tree with very high branching factor. Pessimistically, given the minimum chunk size of 32megabytes and a large, 1k record size, we have a branching factor over 32,000. A two-level tree can address 1 billion records. Four levels can track 1 quadrillion records.

My other concern is how much work this might involves if there are random changes happening all over the source range all the time. Due to the asynchronous nature of the system, though, the answer is good. In the worse case, at the leaf level, there may be as many as one map operation per record change. This could be bad, but the mapper algorithm could be designed to run in time proportional to the number of changes. One change will only require a small amount of work to update.

As we chain "dirty" updates up the reduction tree, the story gets better. If several leaf nodes of one parent change at the same time, they will all mark their parent node as "dirty", but they won't each trigger a new reducer step. When the parent node's server schedulers the mapper step it will integrate all changes from its leaf nodes in one pass.

Brilliant!

The result is quite elegant. Each individual record change would trigger approximately 'k' steps of work reading/writing only approximately 'k' records - where 'k' is the depth of the reduction tree <= log(n) / log(32000) - n == number of records in the source range. K is <= 2 for n <= 1,000,000,000 records, and K is <= 4 for n <= 1,000,000,000,000,000,000 records. If records are changing faster than the reduction steps can propagate up the tree, k will become smaller since updates up the tree can cover more than one child-node per update. K will approach 1 as the speed of record updates increases. The system gets more efficient as it becomes more loaded.

Thursday, January 31, 2013

Parsing JSON the Not-So-Hard Way With Ruby and Babel Bridge

Aaron Patterson recently posted this article on Parsing JSON the hard way using Racc. I think he did an excellent job explaining Racc. However, it was painful. Racc is an LALR(1) parser-generator similar to Yacc. While Yacc was fun to learn in school, it's also painful to use. LALR is hard to debug ("shift-reduce conflict" - !@#$% - if you know what I mean), and the parser generation steps are tedious to set up and maintain. There is a better way...

Enter Parsing Expression Grammars. I learned about PEGs from Treetop. Think of PEGs as greedy regular expressions with recursion. Being greedy, they don't try every possible pattern-match. This turns out to be a good thing; it makes them much easier to understand and debug. Adding recursive rule-matching to regular expressions makes PEGs approximately as expressive as LALR(1) parsers (though not identical). Treetop got me excited to work with parsers again, but sadly it still had some of the awkwardness of Yacc-style parser-generators. You have to build "treetop" files into ruby files before you can use them. There is yet an even better way...

Enter Babel Bridge (github). I wanted a PEG based parser-generator that allowed me to use 100% ruby with no extra build steps, so I built Babel Bridge. I've been evolving BB for over 2 years now. It is a powerful tool for building easy-to-understand parsers.

Here is how to write a JSON parser using the Babel Bridge gem. If you want to skip to the punch-line, here is the JSON parser source code.

Writing a JSON parser with Babel Bridge

First, install the babel_bridge gem:

gem install babel_bridge

With Babel Bridge you can create many different parsers and have many instances of each in one runtime. To define a new parser, create a class and inherit from BabelBridge::Parser.

 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  # define the parser here
end

The main component for defining your parser is the "rule". A rule is a named pattern for matching strings of characters. The first parameter of the "rule" method is the name of the rule. The remaining parameters specify one or more patterns to match. For more detailed info, please see the readme.

There are five literal values that are allowed with JSON: string, number, true, false and null. Let's start by making rules for each of them. (Credit: I used Aaron Patterson's regular expressions he derived from json.org.)

 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  rule :value, any(:number, :string, :true, :false, :null)

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null"
end

BabelBridge::Shell.new(JsonParser.new).start

In addition to the five rules for the literals, I added two more lines of code. First, I added a general rule called :value. It matches any of the five literals. Babel-Bridge starts matching using the first rule you declare. That "root rule" must include everything you want to match. JsonParser will now match any of the JSON literals. Second, I added a "shell". It is a Babel-Bridge tool to help make parser-testing easier. Try running the above code and typing in some JSON literals at the prompts.

 
> "string literal"

Parse tree:
  ValueNode1 > StringNode1 > "\"string literal\""

> 123

Parse tree:
  ValueNode1 > NumberNode1 > "123"

> null

Parse tree:
  ValueNode1 > NullNode1 > "null"

We are halfway done. There are two more important patterns for JSON: objects (hashes) and arrays. Arrays consist of 0 or more :values, delimited by "," and wrapped in "[]". Objects consist of 0 or more :pairs, again delimited by "," and wrapped in "{}". Pairs consist of a :string, then a ":", then a :value.

 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  rule :value, any(
    :object, :array, :number,
    :string, :true, :false, :null
  )

  rule :array,  '[', many?(:value, ','), ']'
  rule :object, '{', many?(:pair,  ','), '}'

  rule :pair, :string, ':', :value

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null"
end

BabelBridge::Shell.new(JsonParser.new).start

You can try out the new rules:

 
> ["my string",123,true]

Parse tree:
  ValueNode1 > ArrayNode1
    "["
    ValueNode1 > StringNode1 > "\"my string\""
    ValueNode1 > NumberNode1 > "123"
    ValueNode1 > TrueNode1 > "true"
    "]"

> {"my key":"my value"}

Parse tree:
  ValueNode1 > ObjectNode1
    "{"
    PairNode1
      StringNode1 > "\"my key\""
      ":"
      ValueNode1 > StringNode1 > "\"my value\""
    "}"

If JSON allowed literals to stand alone, we'd be done. Unfortunately JSON is a little more restrictive, so we have to add a new root-rule to ensure there is an array or object at the top-most level.

There is one other thing we need to add. Parsing-Expression-Grammars match every character or fail. Right now we are not handling white-space, so any white-space character in our string will cause our parsing to fail. BB makes this easy. Just call "ignore_whitespace" in your class declaration.

Below is the complete JSON parser. It should match any legal JSON and reject anything illegal. However, it just returns the parse-tree when it succeeds. In the next section we'll add the same functionality Aaron did in his blog post.

 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  ignore_whitespace

  rule :document, any(:object, :array)

  rule :array,  '[', many?(:value, ','), ']'
  rule :object, '{', many?(:pair,  ','), '}'

  rule :pair, :string, ':', :value

  rule :value, any(
    :object, :array, :number,
    :string, :true, :false, :null
  )

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null"
end

BabelBridge::Shell.new(JsonParser.new).start

source

Adding Functionality to The Parse Tree

Every rule you define creates a node class. Every time an instance of that rule is parsed, an instance of that node class is created. With Babel-Bridge you can easily define methods on the parse-tree nodes for each rule you declare. When calling the rule method, just open up a do-block. The code inside the do-block is executed inside the class for that node.

Our goal now is to convert the resulting parse-tree from our JSON parser into the equivalent Ruby Hashes, Arrays and Strings etc. We are going to add an "evaluate" method to our nodes which returns the correct Ruby data-structures. For example, here is how we can add the evaluate method to our :null rule.

 
  rule :null, "null" do
    def evaluate; nil end
  end

To evaluate the remaining JSON literals we can leverage the fact that Ruby's literals are nearly identical. In fact, "null" is the only one which is different. To avoid duplicating code (DRY!), I'm going to insert an extra rule called :ruby_compatible_literal. Calling to_s on a node just returns the source string that node matched. Using Ruby's "eval" method, we can evaluate that string and get the Ruby object back.

 
require "babel_bridge"
class JsonParser < BabelBridge::Parser
  ignore_whitespace

  rule :document, any(:object, :array)

  rule :array, '[', many?(:value, ','), ']'

  rule :object, '{', many?(:pair,  ','), '}'
  rule :pair, :string, ':', :value

  rule :value, any(:object, :array, 
    :ruby_compatible_literal, :null
  )

  rule :ruby_compatible_literal, any(:number, 
    :string, :true, :false
  ) do
    def evaluate; eval(to_s); end
  end

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null" do
    def evaluate; nil end
  end
end

BabelBridge::Shell.new(JsonParser.new).start

Next, let's add the evaluate method to array. When there are nested rules, you can access them by name. If that sub-rule can be matched more than once, it will return an array of the child nodes which were matched with that rule. Here, the #value method returns a :value nodes in matched in the array. We just need to convert that array to the #evaluated versions of each :value node.

 
  rule :array, '[', many?(:value, ','), ']' do
    def evaluate; value.collect {|v| v.evaluate}; end
  end

Evaluating Objects takes two steps. The :pair rule returns a 2-element array of the form: [key, value]. Then the :object rule uses the Hash[] constructor for the final result.

 
  rule :object, '{', many?(:pair,  ','), '}' do
    def evaluate; Hash[ pair.collect {|p| p.evaluate } ] end
  end

  rule :pair, :string, ':', :value do
    def evaluate; [ eval(string.to_s), value.evaluate ] end
  end

Here is the full source-code for parsing JSON and generating Ruby data structures.

 
require "babel_bridge"

class JsonParser < BabelBridge::Parser
  ignore_whitespace

  rule :document, any(:object, :array)

  rule :array, '[', many?(:value, ','), ']' do
    def evaluate; value.collect {|v| v.evaluate} end
  end

  rule :object, '{', many?(:pair,  ','), '}' do
    def evaluate; Hash[ pair.collect {|p| p.evaluate } ] end
  end

  rule :pair, :string, ':', :value do
    def evaluate; [ eval(string.to_s), value.evaluate ] end
  end

  rule :value, any(:object, :array, :ruby_compatible_literal, :null)

  rule :ruby_compatible_literal, any(:number, :string, :true, :false) do
    def evaluate; eval(to_s); end
  end

  rule :string, /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/
  rule :number, /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/
  rule :true,   "true"
  rule :false,  "false"
  rule :null,   "null" do
    def evaluate; nil end
  end
end

BabelBridge::Shell.new(JsonParser.new).start

source You'll notice that we didn't have to define evaluate on every node. Babel-Bridge has some intelligent defaults to help minimize code-size. Now run the code and enter some JSON values:

 
> [null]
 => [nil]

> {"my key":"my value"}
 => {"my key"=>"my value"}

Notice that the shell stopped printing our parse-tree. The shell looks for the "evaluate" method on the root node. If it exists, it prints out root_node.evaluate.inspect. You might also try inputing some illegal values. This shows off one of Babel-Bridge's powerful debugging tools - the parser-failure-info.

 
> [nil]
Parsing error at line 1 column 2 offset 1

Source:
...
[<HERE>nil]
...

Parse path at failure:
  DocumentNode1

Expecting one of:
  "["                                                ArrayNode1
  "]"                                                ArrayNode1
  "false"                                            FalseNode1
  "null"                                             NullNode1
  "true"                                             TrueNode1
  "{"                                                ObjectNode1
  /"(?:[^"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))*"/ StringNode1
  /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/      NumberNode1

If you want to use your parser in real code, you probably won't want to use the shell. Here's how:

 
json_parser = JsonParser.new

parse_tree = json_parser.parse "[null]"

if parse_tree
  ruby_data_structure = parse_tree.evaluate
  puts ruby_data_structure.inspect
else
  puts "ParsingError: " + json_parser.parser_failure_info
end

For further reading, you might be interested in my earlier blog post: How to Create a Turing Complete Programming Language in 40 Minutes (and 70 lines of code).

Saturday, January 12, 2013

Don't Learn a New Programming Language Every Year. Write one!

Don't learn a new language every year

Programmers often recommend learning a new programming language every year. I don't agree. I've studied many languages by reading their specs and dabbling with some code until I feel like I've got the basics, but I don't think I really learned them. My path to really "learning" Ruby showed me why.

Ruby was not a challenging language to learn. I picked up the Pragmatic Programmers book and read it end to end (an endorsement to both the language and the book). When the time came to start Imikimi, I chose Rails to do the web-side. However, I discovered that I hadn't really learned to *think* in the language until after several years of continuous professional use.

I was a C++ programmer before I was a Ruby programmer. A lot of my early Ruby code looks and feels a lot like C++ code. There were constructs in Ruby that I wasn't comfortable with, so I simply didn't use them. Over time I re-discovered language features I "learned" in my first reading of the Pragmatic book. Over time I learned how to program in Ruby-style from the community. By studying the Rails source and other gems I learned how to better use the language and how to write clean, readable code.

Learning a language by reading a book and writing a few small sample apps is equivalent to learning the grammar and a few phrases of spoken language. It's a shallow understanding which only broadens your perspective in shallow ways. I don't think you can claim to really know a language, spoken or programming, until you are conversant in it. It's a long, slow, but ultimately rewarding processes.

Here's my recommendation: To get started, do learn a few extra languages, shallowly. The second and third language you learn will broaden your horizons even if you don't dive too deep into them. After that, though, you need to focus on deepening your understanding either of your current language of choice, or a new one, through hours and hours of work on real projects, OR through my second recommendation:

Write a Toy language every few years

I'm writing a programming language. I've spent more than a decade thinking about programming languages and PL design. From 2001 until 2006 I worked on new language ideas full-time. Unfortunately none of that work saw the light of day, so I'm circling around again. I've learned a lot since then, primarily from Ruby and the Ruby community.

My current project, the River programming language, is an attempt to build the simplest Prototype-Based core language that can support a Ruby-like class-system on-top, written in the language itself. I've learned more about Ruby in the past few months than I have in the past few years.

You learn a lot about programming when you write a language implementation. It forces you to examine all the edge cases you normally ignore. And it happens fast. You are immediately forced into uncomfortable areas of programming languages you were unaware of. There are trade-offs to be made at every level. There are boot-strapping issues to resolve, and there are a host of subtle semantics in your host language that you never thought about.

How to get started

If you decide to follow this path, here are a few tips.

Use a flexible, agile host language. You will make many major refactors along the way, and you don't want to get bogged down in irrelevant details. This is true for any project, but especially true when working on something as meta as writing one language inside another. I'd recommend Ruby, but Python or even Javascript would work well.
Use a well-designed parser-generator (resource: my Babel-Bridge parser-generator gem)
Don't worry about performance at first, even if that is your primary goal.
Target Turing-complete functionality before all else. This will give you a solid base-camp to build out from as you explore the rest of your language. (resource: my How to write a Turing-Complete Programming Language in 40 minutes video)

You can follow my design progress on Twitter: #riverlang / @shanebdavis

Happy PL design!

Essence and Artifact