Thursday, October 18, 2012

Literate Randomizer

Yesterday I was seeding my database for a new Rails project. This project has profile pages with large free-form text fields. In order to get a good sense of how the site layout feels with real data, I needed some decent fake profiles. The existing tools we were using just weren't cutting it.

There are several gems available to help with this task.

Here is a quick-link to Literate Randomizer's readme and source.

Faker

Faker is great for generating fake names and addresses, but when it comes to generating fake prose, it follows the Lorem Ipsum model of obvious nonsense. This was not going to cut it:

Mollitia id adipisci. A voluptatum et veritatis quas enim veniam consectetur. Id perferendis quia eum et. Rerum consequatur harum quisquam impedit sit et.

Qui est qui autem inventore esse laboriosam. Vitae expedita numquam eveniet repudiandae voluptatibus est in. Ratione eos maxime ut debitis illo. Voluptatibus ea molestiae et quas non est.

Et ut distinctio qui nihil consequatur odio. Labore voluptas magni sapiente totam. Sed sed est nobis quia temporibus pariatur. Nihil accusamus molestiae qui occaecati ut. Sed consequatur nobis quis consequatur rerum.

random_data

RandomData was next. It did generate somewhat more convincing prose, but if you glance at the source you'll see it simply selects randomly from 23 fixed sentences:

Soaring through all the galaxies in search of Earth flying in to the night. Every stop I make I make a new friend; Can't stay for long just turn around and I'm gone again. Hey there where ya goin, not exactly knowin'. Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum. He just keeps on movin' and ladies keep improvin'.

Lorem ipsum dolor sit amet consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. He's going everywhere, B.J. McKay and his best friend Bear. Ulysses, fighting evil and tyranny with all his power and with all of his might. Ulysses, fighting evil and tyranny with all his power and with all of his might. Every stop I make I make a new friend; Can't stay for long just turn around and I'm gone again.

Literate Randomizer

If you want to generate a lot of random prose, neither of the above options works well. With a little googling I ran across this post by Tim Riley. In it he shows some code that uses Markov chains to generate plausible prose based on a source text. From the source text, we can build a probability distribution of which words follow which words. Given that, we can generate some near-English, truly random sentences.
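To make the idea concrete, here is a minimal, hypothetical sketch of the Markov-chain technique in Ruby (illustrative only, not the gem's actual implementation):

```ruby
# Build a table mapping each word to the words that follow it in the
# source text, then walk the table, picking a random successor each step.
source = "the cat sat on the mat and the cat ran"
words  = source.split

# chain["the"] => ["cat", "mat", "cat"] -- duplicates are kept on purpose,
# preserving the probability distribution of the source text.
chain = Hash.new { |h, k| h[k] = [] }
words.each_cons(2) { |a, b| chain[a] << b }

def generate(chain, first_word, length)
  word = first_word
  out  = [word]
  (length - 1).times do
    successors = chain[word]
    break if successors.empty?   # dead end: no word ever followed this one
    word = successors.sample     # weighted by frequency in the source
    out << word
  end
  out.join(" ")
end

sentence = generate(chain, "the", 6)
```

Every adjacent pair in the output is a pair that actually occurred in the source, which is why the result reads as near-English rather than pure noise.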

Note: RandomData has a Markov tool, too, but it is very limited. I found another gem that just copy-pasted Tim's code into a gem file, but it wasn't properly packaged or usable.

I was bummed to realize Tim hadn't made a gem of his wonderful little insight, so I thought I'd dive in and do it for him. I thought I could just copy Tim's code, give him credit, and release a nice clean gem. Turns out it wasn't that simple, or perhaps I just had too much fun with this problem. Another problem with all the tools above is they don't give you much control over the randomly generated output. I wanted control over the length and number of sentences as well as the number of paragraphs. I wanted interesting punctuation. With Markov chaining, you end up with a lot of sentences that end in little function words ("the", "and", "of", etc.).

The result is a nice little gem called literate_randomizer. Here is a simple example. For more, please go to the literate randomizer source/readme.

 require 'literate_randomizer'  
 LiterateRandomizer.word  
 # => "frivolous"   
 LiterateRandomizer.sentence  
 # => "Muscular arms round opening of sorts while Lord John Roxton."   
 puts LiterateRandomizer.paragraphs  
The last line outputs:

Memory at the narrative may have already as I could be reflected upon. Society in a dozen. Headquarters for the stream. Stream which if we. Iguanodon's hide and admirably the right line of Project Gutenberg-tm electronic. Forgive me down hull-down on every detail which! Until they were. Mantelpiece in front of a necklace and that same scene. Eighteen of a creature beside his white merchant or hope.

Rain must be. Falls the paths which may have opened out we limped homewards sadly mauled and no. AGREE THAT THE CHAIRMAN. Twentieth century and horrible havoc. Train of Crayston Medal for canoes we are the scent and they. Swooped something which had a purpose than the envelope but if this. Entirely sir why this book. All these creatures that on through the rope dangled down the dinosaurs. Exhaust of the vertebrate stage of them and it began amid shouting gesticulating. Worthy of the old volcanic vents and distribute it away and under the prevailing. Assured that they took into line of the cliff. Unloosed a frugal breakfast of it run to the chasm which with. Loosened our thwarted wishes of dark until they took one's heart to win through. Cotton jacket which you can exhibit to gain a refund of hers outlined.

Novelist could have a jacket which communicated with its wing. Colliery explosion and stroking. Town was justified if ever moved upon his remarks had returned to bring. XIII A Sight which the plateau and many small Japanese tray of fairyland. Sadly mauled and flitted about that all turning these lumps of such. Cutting a sentiment. Grieved for us give. Whites our largest I should be confuted if animals then the prospect of the risk? Camberwell tram with a sheet of language is our zareba.

LiterateRandomizer.paragraphs options:
  • :first_word => nil - the first word of the paragraph
  • :words => range or int - number of words in sentence
  • :sentences => range or int - number of sentences/paragraph
  • :punctuation => nil - punctuation to end the paragraph with
    (nil == randomly selected from punctuation_distribution)
  • :paragraphs => range or int - number of paragraphs
  • :join => "\n\n" - join the paragraphs
    If :join => false, returns an array of the paragraphs

Thursday, September 27, 2012

Learnable Programming

More great stuff from Bret Victor: http://worrydream.com/LearnableProgramming/

Start by jumping through and watching the embedded videos. They are self-explanatory. Watch them right now. It'll only take a few seconds each.

I started working along a similar line about a decade ago. I called it "interactive execution". The key idea is the program is re-run with every single change you make. All values are logged. You can see the actual inputs and outputs of every function for the entire run. You can see them update in real time as you edit your code. It is easy to dismiss these ideas as non-applicable to "real-world code":

These design principles were presented in the context of systems for learning, but they apply universally. An experienced programmer may not need to know what an "if" statement means, but she does need to understand the runtime behavior of her program, and she needs to understand it while she's programming.
A frequent question about the sort of techniques presented here is, "How does this scale to real-world programming?" This is a reasonable question, but it's somewhat like asking how the internal combustion engine will benefit horses. The question assumes the wrong kind of change.
Here is a more useful attitude: Programming has to work like this. Programmers must be able to read the vocabulary, follow the flow, and see the state. Programmers have to create by reacting and create by abstracting. Assume that these are requirements. Given these requirements, how do we redesign programming?

I believe it isn't that hard to imagine how to apply "interactive execution" to real-world code. Two simple tricks allow you to achieve useful application of these ideas in any sized (single-threaded) code-base:

  1. Allow the programmer to select a function as the start-point. Execute the program up to the start of that function and start a transaction against the entire running state of the code. Then, roll back the transaction with every change and restart execution from that start function.
  2. Run the execution in a separate thread from the editing thread. Every character typed aborts the execution and restarts it. If the execution takes a second or two, that's OK. When the programmer wants to see the current outputs, they just stop typing for a bit. If it takes minutes or hours, the programmer is in charge of picking a better start point for her transactional, interactive-execution session.
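Trick #2 can be sketched in a few lines of Ruby. Everything here is hypothetical scaffolding, just to show the abort-and-restart loop:

```ruby
# Toy sketch: run the "user's program" in a worker thread and restart it
# from scratch on every edit. When the programmer pauses, join and show
# the result.
class InteractiveRunner
  def initialize(&program)
    @program = program
    @worker  = nil
    @result  = nil
  end

  # Called on every keystroke: abort any in-flight run and start over.
  def source_changed
    @worker&.kill
    @result = nil
    @worker = Thread.new { @result = @program.call }
  end

  # When the programmer stops typing, wait for the current run to finish.
  def current_output
    @worker&.join
    @result
  end
end

runner = InteractiveRunner.new { (1..10).sum }
runner.source_changed   # keystroke: start execution
runner.source_changed   # another keystroke: abort and restart
runner.current_output   # => 55
```

The real version would also snapshot state at the chosen start-point function (trick #1) so each restart doesn't re-run the whole program from the beginning.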

This is really powerful stuff. I've often noticed that the skill level of a programmer is proportional to how long they can go between coding and executing their code to see if it works. The key skill here is being able to simulate the computer in your head. What a waste! The computer can simulate itself much better than your head can. Let the computer do what it's good at, and leave the human mind open to being creative.

I liked this quote about OO:

Bob Barton [said] "The basic principle of recursive design is to make the parts have the same power as the whole." For the first time I thought of the whole as the entire computer, and wondered why anyone would want to divide it up into weaker things called data structures and procedures. Why not divide it up into little computers... Why not thousands of them, each simulating a useful structure?

I think I'll start calling objects "little computers".

Wednesday, September 19, 2012

How to Create a Turing Complete Programming Language in 40 Minutes

Using a programming language is cool. Writing a programming language is cooler. Writing one in 40 minutes rocks! I just released a screencast on how to create a Turing Complete programming language in 40 minutes and 70 lines of code using my Babel Bridge parser generation gem and Ruby.

In this example I use my Babel Bridge parser-generator gem: homepage / github

Here is the 70-line source code from the screencast.


We continue to need fresh minds and new approaches to programming. There are many great challenges facing us in improving programmer productivity and supporting programming in our massively parallel age. I hope this screencast whets your appetite to pick up a parser-generator and play around with your own language. Maybe you'll invent the next Ruby, Java or JavaScript!

Monday, March 19, 2012

Bret Victor - Inventing on Principle


I want to recognize excellence. Bret Victor's talk on Inventing on Principle is excellent. Go watch it now if you haven't already. vimeo-link

I loved how the first half exactly paralleled thoughts I've had for a long time, which I've summed up as POET: the Principle of Exposure & Transparency - any problem becomes trivial with enough exposure. I was thinking more from the problem-solving angle, whereas Bret Victor was thinking from the creativity angle. I think the ideas mirrored nicely. I definitely liked how he visualized interactive execution in text-based code. That's a nut I hadn't cracked yet. This is just the tip of the iceberg. Lots of interesting work to be done here.

The principled approach of the second half of Bret's talk was fascinating. I've had a focused path for most of my career, but even still, after watching his talk I've had a hard time simplifying my path down to a single principle. I'm not sure if that's because I don't fully understand my path's essence as well as Bret understands his, or if some goals are not distillable down to a one-sentence description. I like his point regarding how a good principle can give you guidance, whereas a bad one can be meaningless. "Write beautiful code" - meaningless. "Creators need immediate connection with what they are creating" - meaningful, providing a system for measuring, giving feedback, and guiding you through design-space.

I feel pretty strongly about eliminating "artifacts". Historical legacies that are no longer relevant can gum up the works, particularly when they are hidden in plain sight. I also love to invent. As a contrast to eliminating artifacts, I am also drawn to "essential" technologies. What technologies, if they existed and were done "right" (with as few artifacts as possible), could change the world? I suppose in that light, my Essence & Artifact theme might just be my core principle. Eliminate artifacts; understand the Essence. One might argue it's closer to the meaningless end of the principle continuum. It certainly isn't as concrete as "No Modes". On the other hand, every feature, function, design and architecture can be measured against an Occam's Razor of "is this Essential, or is it Artificial?"

Essence is timeless, simple, powerful, maximally applicable, modular, and focused. But be careful, the purely essential is also purely theoretical.

Artifacts are historical, complex, constrained, difficult, narrowly focused and/or un-focused. But be respectful, artifacts are the only things that are actual.

Thursday, December 15, 2011

Where do Profits come from?

This week's EconTalk Podcast: Munger on Profits, Entrepreneurship, and Storytelling

In this week's EconTalk, Munger and Roberts presented and discussed several mini-stories about trading. They explored many angles of society's common, mistaken maligning of trade. Specifically, the maligning of profit acquired solely through trade without actually "producing anything". The entire 'cast is excellent and well worth your time listening to or reading the transcript (also available at the link above).

I want to point out two statements from the pod-cast that were particularly awesome.

"The source of profits is correcting errors."

The whole quote is:
The source of profits is correcting errors. The economy around us is full of errors. And what I mean by errors is a maladjustment, a divergence between what we are actually producing and what we "should" produce in order to use the stuff that we have--our mental resources, physical capital, labor--to produce the highest possible level of satisfaction of what the public wants. -Munger 

If you chew on that for a bit you'll find it is an astonishingly deep insight. In economics there is a concept of optimality where supply has been balanced to exactly meet demand and price has been adjusted to exactly match costs. In other words, an optimal market has no profits. What we have, instead, in the real world, is a sub-optimal market. Resources have not been balanced optimally. There are gaps in the market where goods and services are not being put to their optimal use. If someone can discover these, they can connect the good or service with a more optimal use, benefitting both the producer and the consumer, and gaining some profit. All three parties benefit from the trader correcting an error in the market.

Now for the second awesome statement. First I'll paraphrase:

Trading as a more equitable means of maximizing 
individual satisfaction than 'goodwill'.

The second statement is from the last story Munger and Roberts discuss. It appears to be a true story about a British officer named R.A. Radford, who had a background in economics and became a POW during WWII. His POW camp of 2400 inmates lived solely on rations provided by the Red Cross. The rations were identical for every inmate and identical month to month. From an economics point of view, the production of the entire closed society was static and egalitarian. This clearly sets up a fascinating case-study for how a society might organize when production is not an issue. Radford wrote up his economic observations later in an essay which is available online: The Economic Organisation of a P.O.W. Camp - R.A. Radford.

The whole quote, directly from Radford's essay is:
Very soon after capture people realized that it was both undesirable and unnecessary, in view of the limited size and the equality of supplies, to give away or to accept gifts of cigarettes or food. "Goodwill" developed into trading as a more equitable means of maximizing individual satisfaction. -Radford

What a wonderful statement! Wonderful, I suppose, if you are already pro-free-markets. If you think about it, though, it makes sense. If I give you the bits of my rations I don't want I may benefit you some, but I don't know if you are the person in camp who would benefit most from my gift. How should I answer that question? I need some way of finding out the value of my extra rations to every person in the camp. The problem is there are 2400 people in the camp, and they each have some bit of extra rations they too aren't using and would like to find the most valuable use for. How do you solve such a complicated problem, and do you really need the cooperative 'goodwill' of all 2400 people?

No, you just need trade.

Trade and, by implication, a price system are the best known way of solving such a complicated constraint-optimization problem. Every single trade benefits both parties, generating wealth for both. If it were not so, they wouldn't trade. If there are inefficiencies in the trades, if the overall economy is sub-optimal, if there are errors in the system, then traders can arise to correct those errors and bring the overall system closer to optimality - profiting themselves, yes, but also the people they trade with, and, via proof-by-mathematical-induction, everyone in the camp.

Fascinating! This simple podcast, which was pretty light as EconTalk goes, really got me thinking.

NOTE: All the above assumes the trades are without fraud or coercion. This appears to have been mostly true in the POW example, though if you read the essay, various interesting market failures did occur and were, for the most part, corrected by market forces. Roberts and Munger address this in their discussion as well.

Thursday, September 22, 2011

Parsing, Ruby and Babel-Bridge


Here is the video from a recent talk I did on parsing and the Babel-Bridge gem I wrote for generating parsers with ease in Ruby. I recommend viewing it full-screen in 480 or 720p so you can see the text in the live-demo. Enjoy!

Wednesday, September 21, 2011

The (Sad) State of Concurrency in Ruby 1.9.2

Why do we want concurrency? Because we want to speed things up!

What is making us slow?
  1. We are waiting for the CPU
  2. We are waiting for network I/O
  3. We are waiting for the user
  4. We are waiting for disk I/O
Cpu-Bound
What do we do? We have other cores, let's make use of them!

Threading   The mainstream answer these days is spawning threads. However, if you try this in Ruby you will be in for a surprise. Though Ruby 1.9 uses OS threads, the interpreter has what is called a "global interpreter lock" or "GIL". This means that any time one thread is running Ruby code, no other thread can be running Ruby code. This effectively means that, in Ruby, even if you have multiple threads, you can only use at most one core of your CPU at a time.

Threading also has all the common problems of race conditions, locks, synchronizing and general "thread safety". This can be a little intimidating to manage, but usually a clean efficient solution can be arrived at. The best idea is to share as little global data as possible. If you are writing a server, assign at most one thread per request and keep all the data for that request local and you are already 99% there.
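A minimal sketch of that one-thread-per-request pattern, using only the standard library (the request data here is made up for illustration):

```ruby
# Each request runs in its own thread and keeps all mutable state local
# to that thread, so almost nothing is shared and thread-safety concerns
# nearly vanish.
requests = %w[alpha beta gamma]

threads = requests.map do |req|
  Thread.new(req) do |name|               # pass data in as an argument,
    local = { request: name, steps: [] }  # keep mutable state thread-local
    local[:steps] << "parsed #{name}"
    local[:steps] << "rendered #{name}"
    local                                 # a thread's value is its result
  end
end

results = threads.map(&:value)            # join and collect each result
```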

Notes:
  • Rubinius is said to be working on a version of their runtime that doesn't have a GIL and could therefore use multiple cores.
  • jRuby works today without a GIL and allows the full use of multiple cores.
Forking   Okay, so threading isn't really an answer. Instead of running multiple threads, how about running multiple processes? Generally, this can work. If you run more than one instance of the Ruby VM, you can use more than one core. But, there are a lot of limitations, some of them unique to Ruby.

Processes don't share memory. Once you fork a new process you can still read the state of your program from before the fork, but any changes you make in process B won't update process A, and vice versa. Once you fork, you must coordinate your processes through some other means - usually pipes. This is a highly error-prone and tedious "API".
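Here is what that pipe coordination looks like with just the Ruby standard library (Unix-only, since it uses fork):

```ruby
# Coordinate a forked child through a pipe. The child's changes to its
# own variables never appear in the parent; the pipe is the only channel
# back.
reader, writer = IO.pipe

counter = 0
pid = fork do
  reader.close
  counter += 1                 # modifies the CHILD's copy only
  writer.puts counter          # explicit message back to the parent
  writer.close
end

writer.close
from_child = reader.gets.to_i  # => 1
reader.close
Process.wait(pid)

counter                        # still 0 in the parent -- no shared memory
```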

On Linux or other Unix-based systems (including MacOS), forking a new process is usually very efficient. In the abstract, forking means cloning the entire process. In practice, though, the operating system has a nifty feature called "copy on write" (CoW): the two processes can share almost all of the memory that hasn't been written to since the fork. As soon as one or the other process writes to a piece of memory (usually a CPU "page" = 4k), it is copied over into the new process. This means that when you fork a new process, nothing needs to be copied at first, and only a few things need to be copied over time as you start altering things. This is not only CPU efficient, it is also memory efficient. Great, right? Well, in Ruby it's not so great. Ruby's garbage collector (1.9.x) writes to every object every time it runs. This means that the first time either process triggers garbage collection, the entire process must be copied. In other words, Ruby loses all the benefits of CoW. If you are writing a large application that uses a lot of memory, forking only makes sense for very long-running processes because the overhead of forking is very high.

Notes:
  • Ruby Enterprise Edition patched Ruby 1.8.x to be CoW friendly. There is currently no Enterprise Edition for Ruby 1.9.x and forward support of REE seems to be in question.
  • Rubinius is supposed to be CoW friendly. I have not tested this.
  • jRuby, well, just use Threads. Forking may work, too, if Java allows it and is CoW friendly. Both seem likely, but I don't know.

Conclusion   If your app is CPU-bound, your best bet is to optimize your code. It ranges from very hard to impossible to use multiple cores in Ruby unless you are willing to switch fully to jRuby.

Network I/O Bound
EventMachine   If you are writing some sort of server that responds to network requests, or if your app sends out network requests, it is highly likely that you are network I/O bound. In the server case, you'd like to be processing another request while you wait for the first one's input to arrive. In the client case, you'd like to be doing other things while waiting for the server to get back to you.

If your app is network I/O bound, there is a good solution in Ruby: EventMachine. All network I/O is "evented". When you fire off a network request, you pass in a block of what you want to do after the network request completes. The call returns immediately. You can then do other work while the network request processes and when it is done, the EventMachine "Reactor" (global event loop) will queue your block and execute it. Similarly if you are waiting for input, you can pass in a block of what to do once the input is ready, the call returns immediately and you can do other work while the input processes.
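EventMachine's Reactor is the real tool here; purely to illustrate the evented model, here is a toy single-threaded reactor built on the standard library's IO.select (this is NOT EventMachine's API):

```ruby
# Register a callback per I/O source, then loop, waiting for whichever
# source becomes readable first and firing its callback. One thread,
# no locks, no thread-safety concerns.
reader1, writer1 = IO.pipe
reader2, writer2 = IO.pipe

results  = []
handlers = {
  reader1 => ->(data) { results << "first: #{data}" },
  reader2 => ->(data) { results << "second: #{data}" },
}

# Simulate two network peers answering at different times.
Thread.new { sleep 0.05; writer2.puts "pong"; writer2.close }
Thread.new { sleep 0.10; writer1.puts "ping"; writer1.close }

pending = handlers.keys
until pending.empty?
  ready, = IO.select(pending)              # block until ANY source is readable
  ready.each do |io|
    handlers[io].call(io.gets.to_s.chomp)  # fire that source's callback
    pending.delete(io)
  end
end

results  # callbacks fired in arrival order, not registration order
```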

Eventing has the nifty advantage of being concurrent without having to be "thread-safe". It is known as "cooperative multitasking"; threading is "preemptive multitasking". The downside of eventing is you have to pass around a lot of blocks that are "deferred" to run at some later point. Code execution order can get confusing, and your code size goes up significantly.

Another thing to be aware of is eventing only runs on one thread. Even if you are using jRuby, you will still only be using one core. EventMachine has the idea of "offline" processes which run on separate threads from a thread pool. This gets you halfway there, but keep in mind that your Reactor, the global event loop, is still single-threaded. At some point that one thread running the event loop is going to be a bottleneck no matter how many offline threads you have running. Obviously if you are running offline threads you have to think about thread-safety again, too.

Fibers   Fibers are a cool new technology in Ruby 1.9. They are basically threads, except they actually share just one OS thread and they only multitask when you tell them to. This makes thread-safety trivial, much like eventing, PLUS you can still write your code in a linear fashion. A mix of eventing and fibers basically gets you all the simplicity of ignoring thread-safety with little of the conceptual or code overhead of pure eventing. Sounding pretty good, right?
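A tiny stdlib example of that only-when-you-tell-them-to behavior:

```ruby
# A Fiber runs until it calls Fiber.yield, then suspends until someone
# calls #resume. Nothing can preempt it in between -- that's what makes
# thread-safety trivial.
log = []

fiber = Fiber.new do
  log << "step 1"
  Fiber.yield            # hand control back; nothing else can preempt us
  log << "step 2"
end

fiber.resume             # runs the fiber until the first Fiber.yield
log << "between resumes" # no other fiber can have run in the meantime
fiber.resume             # continues the fiber after the yield

log # => ["step 1", "between resumes", "step 2"]
```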

Conclusion   Eventing, optionally with Fibers, can solve your network-I/O-bound situation. It is a clean framework that isn't as complicated to write for as a pure threading solution. The key benefit over threading is simpler concurrency and allegedly higher performance on a single core. The downside is you are limited to one core.

Note   Threading works well for network-bound processes, too. Although Ruby has the GIL, the GIL is released whenever Ruby makes calls to the OS - including networking calls. This means that while a thread is waiting for network I/O, another Ruby thread can be running on the core. Threads solve the network-bound problem if you can manage the thread safety.

Waiting for the User
There aren't many user interfaces written in Ruby, but concurrency is a common problem in GUI design. Generally, pure eventing works great for this. Inputs, in the form of user mouse and keyboard interactions, can trigger events, and outputs, in the form of screen updates, are usually quick enough that they don't need their own threads. I'd be curious to hear from anyone who has practical experience with this in Ruby, but I suspect this isn't a major problem. You are most likely to end up CPU-bound if your interface is properly evented, rather than user-bound.

Disk I/O Bound
What if your app is disk I/O bound? I'm writing a storage server in Ruby called Monotable. Ideally, under maximum load, it should keep N disks reading and writing at their max capacity at all times. In principle, eventing and Fibers should work here. Whenever I read or write to disk, I could suspend execution of the current Fiber and let other events fire. When the read/write completes, the system could queue my fiber as an event, and it would continue on the main thread when its turn comes up.

Unfortunately, neither EventMachine nor any of the Fiber-aware gems available supports evented disk I/O. Looking at the Ruby IO library, it appears there is no way to asynchronously read or write and be notified (or test later) whether the read/write has completed. I've looked at IO#read_nonblock and IO#write_nonblock, and it is not clear that either solves this problem.
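For reference, here is the behavior IO#read_nonblock actually provides, demonstrated on a pipe (regular disk files report "ready" immediately under select semantics, which is exactly why this doesn't solve the disk I/O problem):

```ruby
# read_nonblock returns immediately: data if any is available, otherwise
# it raises IO::WaitReadable. You can poll, but there is no way to kick
# off a read and be notified later when it completes.
reader, writer = IO.pipe

state =
  begin
    reader.read_nonblock(1024)
  rescue IO::WaitReadable
    :would_block                    # nothing to read yet; try again later
  end

writer.puts "now there is data"
data = reader.read_nonblock(1024)   # succeeds once data has arrived
```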

Conclusion   Only threading can solve the disk-I/O-bound problem. Since the GIL is released with any file I/O, other threads can be doing work while the first is blocked. It seems the best solution will be to use EventMachine for network I/O and defer any disk I/O work to threads.

Back to Being CPU Bound...
If you are network IO bound, you can use EventMachine and Fibers to speed you up, but under enough load you will be CPU bound. For disk IO bottlenecks, threads are your best bet, and they will speed you up, but you will ultimately be CPU bound here as well. Unfortunately, Ruby doesn't help us much with this. Rubinius is looking promising, and jRuby is here today with multi-core support, but if you want your code to work with the standard, mainstream Ruby 1.9.x you are mostly out of luck.

This is an incomplete exploration of the problem. I'd love to hear if people discover more on this topic.

This is written 2011-09-21. My current version of Ruby is ruby 1.9.2p290 (2011-07-09 revision 32553) [x86_64-linux].

Other Interpreters
Gems dealing with non-blocking IO
GIL - Global Interpreter Lock / Global VM Lock
No one has an answer to the non-blocking Disk IO question