Archive for the 'LMX' Category

LMX and Adium message history Q&A

Saturday, March 17th, 2007

There’s been some discussion of LMX on the web since I announced LMX 1.0’s release. As I mentioned then, LMX is the library that powers Adium’s message history feature. Mostly, people have questioned whether XML was the best choice for logging given the message history requirement.

I recommend first reading my post on the Adium blog about message history. (If you came here from the Adium blog, sorry for the bouncing back and forth—that’s the last bounce, I promise. ;)

Welcome back. Let’s begin.

The questions and objections listed here are drawn from the comments on my LMX 1.0 announcement post, this article on the O’Reilly XML blog, the reddit post about LMX, and Tim Bray’s mention of LMX.

  • Why not just store the messages in reverse order? Then you wouldn’t need a backward parser; you could retrieve the n most recent messages from the top with an ordinary parser.

    Because file I/O doesn’t have an insert mode; you can only overwrite. That means that Adium would have to rewrite the entire rest of the file every time it inserted a message (which is when the message is received or sent). That would get very expensive for long transcripts, and some Adium users leave their chats open all the time, so their transcripts would indeed get very long.

  • How do you append to the transcript? You must have to leave off the end tag, which means that the file is not a valid XML document until you close it, which would be bad if Adium crashed, since the end tag would never get written and the transcript would be broken XML.

    Not so. This time, overwrite behavior is our friend: Adium simply overwrites the </chat> tag each time it writes a message, and appends a new </chat> tag in the same write. The file is always a valid XML document, thanks to overwriting.

    Yes, this is slightly wasteful, but the waste here is constant (that is, it does not go up over time) and insignificant. The upsides vastly outweigh the downsides.

  • Why go with XML if you have to perpetrate such hackery as a backward parser? Why not use SQLite or a plain-text format?

    SQLite: We would have had to include it with Adium, since Adium 1.0’s minimum requirement was Mac OS X 10.3, and SQLite has only been bundled with Mac OS X since 10.4. LMX is much smaller than SQLite. Also, we’re not big on formats that aren’t directly human-readable.

    Plain-text format: A simple format (e.g. TSV) would have some growing pains if we ever wanted to grow (or shrink) the format, and a more complex format would require a new parser from the ground up, just like XML does. For this purpose, we like XML’s trade-off between readability and extensibility, and LMX fills in the gap for reading from the end.

    For more on formats we didn’t elect and why not, you can read our LogFormatIdeas page on the Trac (deprecated since we chose a format, but still around for posterity).

  • How will you determine the encoding of the data, or read entity declarations? Those things are at the start of the file, and you’re parsing from the end.

    LMX naïvely assumes that the data is UTF-8 and that the application knows about any entities it will need. Yes, this is wrong, but Adium didn’t need anything different.

    Either 2.0 or 3.0 will do a forward parse until the opening tag of the root element, in order to discover the actual encoding and any entity declarations. (I’m not doing it in a 1.0 version because 1.0’s parser is a hedge of thorns, and I’m not willing to touch it for something that most people won’t need anyway. And I’m tempted to leave this out of 2.0 as well, since 2.0 will be a big enough version with its rewrite of the parser in pure C.)

  • How does LMX tell whether --> is the end of a comment or simply an unescaped > following two hyphens?

    Simple: It assumes it’s the end of a comment.

    There’s no way to definitively find out one way or another without scanning all the way to the start of the data and backtracking. This is one of the pitfalls of a backward parser. It’s the nature of the game, so all I can do is say “make really sure you’re feeding good XML to the parser”. That includes not having unescaped ‘>’s in your text.

  • What about storing one message per line and scanning through the file line-by-line?

    We’d rather not, because you can have a valid XML log file without that constraint, and a constraint like that is the sort of detail you don’t want to rely on, since other apps can break it. (To Tim Bray: Part of the point of the Unified Logging Format is that we want other IM/chat clients to use it, which means that we should be forgiving when their output doesn’t exactly match ours.)

  • Can I grep these logs?

    Mostly. You can grep an XML log in the usual way, but your expression can’t contain non-ASCII characters, <, >, or & unless you replace them with the appropriate entity references. We recommend using the search field in the Chat Transcript Viewer anyway.
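
For the curious, the overwrite-and-reappend trick from the second answer above might look roughly like this in C. This is a sketch, not Adium's actual code: the append_message function, the CLOSE_TAG constant, and the assumption that the file's last bytes are exactly the closing tag plus a newline are all made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CLOSE_TAG "</chat>\n"   /* assumption: these are the file's final bytes */

/* Hypothetical helper: append one serialized element to a transcript.
 * Returns 0 on success, -1 on failure. */
static int append_message(const char *path, const char *element_xml)
{
    assert(path != NULL && element_xml != NULL);
    const size_t tag_len = strlen(CLOSE_TAG);
    const size_t elem_len = strlen(element_xml);

    /* "r+b" is update mode: we can overwrite bytes, but never insert. */
    FILE *f = fopen(path, "r+b");
    if (f == NULL) return -1;

    /* Check that the file really ends with the closing tag. */
    char tail[sizeof CLOSE_TAG];
    if (fseek(f, -(long)tag_len, SEEK_END) != 0 ||
        fread(tail, 1, tag_len, f) != tag_len ||
        memcmp(tail, CLOSE_TAG, tag_len) != 0) {
        fclose(f);
        return -1;
    }

    /* Build "element\n</chat>\n" so both parts go out in one fwrite call. */
    const size_t out_len = elem_len + 1 + tag_len;
    char *out = malloc(out_len);
    if (out == NULL) { fclose(f); return -1; }
    memcpy(out, element_xml, elem_len);
    out[elem_len] = '\n';
    memcpy(out + elem_len + 1, CLOSE_TAG, tag_len);

    /* Seek back onto the old closing tag and overwrite it. */
    int ok = fseek(f, -(long)tag_len, SEEK_END) == 0 &&
             fwrite(out, 1, out_len, f) == out_len;
    free(out);
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}
```

Calling append_message("transcript.xml", "<message>hi</message>") on a file containing only an opening <chat> line and the closing tag leaves a file that still ends in </chat>. The cost per message is the length of the new element plus the few bytes of the tag, no matter how long the transcript already is.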

I’m glad I finally announced LMX 1.0—not just because it is now, finally, out the door, but also because people have suggested new alternatives that we on the Adium team never thought of. For example, this reddit comment suggests saving one file per message (in a directory per chat), and this other one suggests inserting a fake start tag before the –nth message element, and the O’Reilly article suggests a hybrid XML+binary format. We never thought of any of these.

To be totally clear, we’re not switching—this post is a clarification, not an announcement. Two of those ideas won’t work for various reasons; the problem with the one-file-per-message idea can be overcome by tarring old chats. But LMX is not a future plan—we’ve written it and it’s here, and the same goes for the Unified Logging Format.

Call it inertia, but replacing either one with something else will require either the existing solution to break or the proposed replacement to exhibit massive, world-changing superiority. These things are done and they work, so at this point, we’re not going to rock the boat. It ain’t broke anymore, so we’re not fixing it.

The design for LMX 2.0

Monday, March 12th, 2007

LMX 1.0 didn’t really have much design to it. I set out to clone NSXMLParser’s API, which I did, but I didn’t give a whole lot of thought to how I would actually implement the parser.

As a result, the parser itself is one humongous method that takes a lot of effort to read. It is only navigable at all because I had the foresight to put in lots of #pragma marks.

LMX 2.0 will not make that mistake. This time, there’s a design, and the parser will not all be in one function. Here’s the design, which I drew on a quadrille pad:

All states have prefix “lmx_state_”. All states are functions; struct LMXParser's “state” member has type LMXParserStateFunc, which is a function pointer. There is also a “saved_state” member, used when entering entity_ref state. parser->state is called for every character in the XML data.
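
To make the notes above a bit more concrete, here is a toy sketch of the state-function idea in C. Only the names LMXParser, LMXParserStateFunc, state, saved_state, the lmx_state_ prefix, and the entity_ref state come from the design. Everything else is an assumption for illustration: the function signature, the lmx_state_text state, the counters, and the lmx_toy_parse driver. It recognizes nothing but text and entity references, and it feeds the characters in from the end, as LMX does.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

struct LMXParser;
/* Assumed signature: one call per character of the XML data. */
typedef void (*LMXParserStateFunc)(struct LMXParser *parser, char ch);

struct LMXParser {
    LMXParserStateFunc state;        /* called for every character */
    LMXParserStateFunc saved_state;  /* where to return after entity_ref */
    size_t text_chars;               /* illustrative output only */
    size_t entity_refs;              /* illustrative output only */
    size_t pending;                  /* chars held back in entity_ref */
};

static void lmx_state_text(struct LMXParser *parser, char ch);
static void lmx_state_entity_ref(struct LMXParser *parser, char ch);

static void lmx_state_text(struct LMXParser *parser, char ch)
{
    if (ch == ';') {
        /* Going backward, ';' is the first sign of a possible "&name;". */
        parser->saved_state = parser->state;
        parser->state = lmx_state_entity_ref;
        parser->pending = 1;
    } else {
        parser->text_chars++;
    }
}

static void lmx_state_entity_ref(struct LMXParser *parser, char ch)
{
    if (ch == '&' && parser->pending > 1) {
        /* Reached the start of "&name;": report it and go back. */
        parser->entity_refs++;
        parser->state = parser->saved_state;
    } else if (isalnum((unsigned char)ch) || ch == '#') {
        parser->pending++;
    } else {
        /* Not an entity after all: count the held-back characters as
         * text, restore the saved state, and let it handle this one. */
        parser->text_chars += parser->pending;
        parser->state = parser->saved_state;
        parser->state(parser, ch);
    }
}

/* Hypothetical driver: feeds the data to the current state, last char first. */
static void lmx_toy_parse(struct LMXParser *parser, const char *data, size_t len)
{
    assert(parser != NULL && data != NULL);
    parser->state = lmx_state_text;
    parser->saved_state = lmx_state_text;
    parser->text_chars = parser->entity_refs = parser->pending = 0;
    for (size_t i = len; i-- > 0; )
        parser->state(parser, data[i]);
    if (parser->state == lmx_state_entity_ref)
        parser->text_chars += parser->pending;  /* input ended mid-candidate */
}
```

The appeal of this shape is that each state is a small function that only decides what the current character means and which state comes next, instead of one giant method with a branch for every situation.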

To be explicit, these are all implementation details which will not be exposed to clients of the API.

And the scanner I used to import this from dead tree format is the CanoScan LiDE 600F I mentioned in passing in my post about my HP M425 camera.

LMX 1.0 released

Saturday, March 3rd, 2007

Some of you know that I’m a developer on Adium. (Hopefully all of you; it is mentioned in the sidebar. ;)

Adium has a feature called “message history”. When you open a new chat with a person, message history shows you the last n messages from your previous chat with that person. Since 1.0 (which changed message history to draw from the logs rather than separate storage and changed the log format to be XML rather than bastardized HTML—more info on the Adium blog post), message history has been implemented using a library that I wrote called LMX.

LMX is a reverse XML parser. Whereas most XML parsers (AFAIK, all of them except LMX) parse the XML data from the start to the end, LMX parses it from the end to the start. Thus, while characters are kept in their original order (“foo” will still be “foo”; it will not become “oof”), everything else is reported in the reverse order: elements close before they are opened, and appear from last to first. All this is by design, so that Adium can retrieve the last n message elements without having to parse all the message elements before them.
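
To illustrate why starting from the end pays off, here is a deliberately naive sketch in C. It is not how LMX works (LMX is a real parser with an NSXMLParser-style API). It is just a backward byte scan that finds where the nth-from-last element of a given name begins. The function name is hypothetical, and the scan assumes well-formed input with no look-alike text inside comments or CDATA sections.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the start tag of the n-th
 * element called `name`, counting from the end of the data, or
 * (size_t)-1 if there are fewer than n such elements. */
static size_t find_nth_last_start_tag(const char *data, size_t len,
                                      const char *name, size_t n)
{
    assert(data != NULL && name != NULL);
    const size_t name_len = strlen(name);
    if (n == 0 || name_len == 0) return (size_t)-1;

    for (size_t i = len; i-- > 0; ) {
        if (data[i] != '<') continue;

        /* Need the whole name plus one delimiter character after it. */
        const size_t after = i + 1 + name_len;
        if (after >= len) continue;
        if (memcmp(data + i + 1, name, name_len) != 0) continue;

        /* The name must end here, so "<messages>" does not match "message". */
        const char c = data[after];
        if (c != '>' && c != '/' && c != ' ' &&
            c != '\t' && c != '\n' && c != '\r') continue;

        if (--n == 0) return i;
    }
    return (size_t)-1;
}
```

The scan stops as soon as it has seen n start tags, so the work depends on the size of the last n elements rather than on the length of the whole transcript. Everything from the returned offset up to (but not including) the root element's end tag could then be handed to an ordinary forward parser.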

Today, LMX gets its very own webpage (not just a page on the Adium wiki, but a real webpage), and is released at version 1.0. It’s the same code as shipped with Adium 1.0.1, but shined up into a release tarball.

So, if you too ever find yourself in desperate need of a reverse XML parser, now there is one.