Tuesday, March 16, 2010

Parsing

I'm about as anal about source code as anyone can get. I loose tooth enamel over long functions and winding spaghetti code, and even still, I have to say Upstart's code quality is pretty good. Its mostly straightforward, pretty well partitioned, and its about 30-50% comment by volume in most places.

Then you get to configuration file parsing.

Upstart grew up alongside libnih, a general purpose "toolbox" library full of useful generic building blocks for low-level C code. Upstart's config file format is parsed by nih-config. This is where the issues begin. nih-config is an awkward beast. A bit more than a tokenizer, a bit less than a parser, its clumsy and tends to leave the user with a good bit of the parsing to do by themselves.

The Right Way to do anything in programming is, of course, to steal the work of people who already did it better. So I went on my way to find pre-existing parser generators and parsing libraries.

flex/bison was the name I'd heard most before. Its not perfect but it more or less does the job. Its failing, though, can be summarized in 3 words: start on start.

The problem with flex/bison is that they need to separate the string into larger tokens, and they need to do this without examining the progress of parsing so far, which means that we can't tell bison that the first start is a keyword and the second start is an identifier.

On the other hand, take a look at treetop. Treetop is based on Parsing Expression Grammars, which are a lot like regular expressions, but with added firepower that lets them handle context-free languages.

To explain that better for non-automata-theory-nerds, try to write a regex that looks at a string of < and > characters and determines if they're well nested (i.e. <<>><> matches but <>>< doesn't). It can't be done. Regexes don't have that power.

But now imagine that you could have many regexes assigned to variables, and that you could "use" one regex from inside another, so you could write "a followed by b followed by something that matches regex C". If we allow for recursion, we can solve our problem easily.

A = /^(<$A>|)$/

(Note, I'm assuming you're using a regex dialect where < and > aren't meaningful).

So Treetop gives us this power, but again, its no good. Treetop is for Ruby. Upstart is not in Ruby and never will be. We need something with Treetop's powers, that's written in, or at least can output, C code.

And that's exactly what I've been writing.

8 comments:

Jef Spaleta said...

been writing....as in as a standalone library project?

-jef

Alexander said...

While its sort of a pain its not *impossible* to handle a case like this in flex/bison, you just have to extend the grammar a bit so you have tokens for keywords, identifiers and keyword_or_identifier.

Casey said...

@Jef: It will be part of libnih, which is light enough that if you needed this it'd almost be worth rolling it in.

@Alexander: Yeah, looked at it. Not really worth the effort IMHO.

Kevin Kofler said...

Actually the easiest solution is to have each keyword be a dedicated token and then:
identifier: ID | START { $$.id = "start"; } | ...
If that makes the grammar no longer LR(1), use Bison's GLR mode.

lutter said...

Shameless plug: you can also use Augeas for this. The added benefit to your users is that they can use the 'lens' you use for reading configs to modify them from their programs.

Casey said...

@lutter: Told Scott about that. He asked what your test coverage was :)

lutter said...

@Casey: real men don't need no stinking tests ;) Seriously though, it's pretty thorough. Pop by #augeas on freenode and I'll be happy to answer more questions

驚訝 said...

good, excellent, nice job!!........................................