LMX and Adium message history Q&A
There’s been some discussion of LMX on the web since I announced LMX 1.0’s release. As I mentioned then, LMX is the library that powers Adium’s message history feature. Mostly, people have questioned whether XML was the best choice for logging given the message history requirement.
I recommend first reading my post on the Adium blog about message history. (If you came here from the Adium blog, sorry for the bouncing back and forth—that’s the last bounce, I promise. ;)
Welcome back. Let’s begin.
The questions and objections listed here are drawn from the comments on my LMX 1.0 announcement post, this article on the O’Reilly XML blog, the reddit post about LMX, and Tim Bray’s mention of LMX.
-
Why not just store the messages in reverse order? Then you wouldn’t need a backward parser; you could retrieve the n most recent messages from the top with an ordinary parser.
Because file I/O doesn’t have an insert mode; you can only overwrite. That means that Adium would have to rewrite the entire rest of the file every time it inserted a message (which is when the message is received or sent). That would get very expensive for long transcripts, and some Adium users leave their chats open all the time, so their transcripts would indeed get very long.
-
How do you append to the transcript? You must have to leave off the end tag, which means that the file is not a valid XML document until you close it, which would be bad if Adium crashed, since the end tag would never get written and the transcript would be broken XML.
Not so. This time, overwrite behavior is our friend: Adium simply overwrites the </chat> tag each time it writes a message, and appends a new </chat> tag in the same write. The file is always a valid XML document, thanks to overwriting.
Yes, this is slightly wasteful, but the waste here is constant (that is, it does not go up over time) and insignificant. The upsides vastly outweigh the downsides.
-
Why go with XML if you have to perpetrate such hackery as a backward parser? Why not use SQLite or a plain-text format?
SQLite: We would have had to include it with Adium, since Adium 1.0’s minimum requirement was OS X 10.3, and SQLite has only been bundled with Mac OS X since 10.4. LMX is much smaller than SQLite. Also, we’re not big on formats that aren’t directly human-readable.
Plain-text format: A simple format (e.g. TSV) would have some growing pains if we ever wanted to grow (or shrink) the format, and a more complex format would require a new parser from the ground-up just like XML does. For this purpose, we like XML’s trade-off between readability and extensibility, and LMX fills in the gap for reading from the end.
For more on formats we didn’t elect and why not, you can read our LogFormatIdeas page on the Trac (deprecated since we chose a format, but still around for posterity).
-
How will you determine the encoding of the data, or read entity declarations? Those things are at the start of the file, and you’re parsing from the end.
LMX naïvely assumes that the data is UTF-8 and that the application knows about any entities it will need. Yes, this is wrong, but Adium didn’t need anything different.
Either 2.0 or 3.0 will do a forward parse until the opening tag of the root element, in order to discover the actual encoding and any entity declarations. (I’m not doing it in a 1.0 version because 1.0’s parser is a hedge of thorns, and I’m not willing to touch it for something that most people won’t need anyway. And I’m tempted to leave this out of 2.0 as well, since 2.0 will be a big enough version with its rewrite of the parser in pure C.)
-
How does LMX tell whether –> is the end of a comment or simply an unescaped > following two hyphens?
Simple: It assumes it’s the end of a comment.
There’s no way to definitively find out one way or another without scanning all the way to the start of the data and backtracking. This is one of the pitfalls of a backward parser. It’s the nature of the game, so all I can do is say “make really sure you’re feeding good XML to the parser”. That includes not having unescaped ‘>’s in your text.
-
What about storing one message per line and scanning through the file line-by-line?
Because you can have a valid XML log file without that constraint, and constraints like that are the sort of detail you don’t want to rely on, because other apps can break them. (To Tim Bray: Part of the point of the Unified Logging Format is that we want other IM/chat clients to use it, which means that we should be forgiving when their output doesn’t exactly match ours.)
-
Can I grep these logs?
Mostly. You can grep an XML log in the usual way, but your expression can’t contain non-ASCII characters, <, >, or & unless you replace them with the appropriate entity references. We recommend using the search field in the Chat Transcript Viewer anyway.
I’m glad I finally announced LMX 1.0—not just because it is now, finally, out the door, but also because people have suggested new alternatives that we on the Adium team never thought of. For example, this reddit comment suggests saving one file per message (in a directory per chat), and this other one suggests inserting a fake start tag before the –nth message element, and the O’Reilly article suggests a hybrid XML+binary format. We never thought of any of these.
To be totally clear, we’re not switching—this post is a clarification, not an announcement. Two of those ideas won’t work for various reasons; the problem with the one-file-per-message idea can be overcome by tarring old chats. But LMX is not a future plan—we’ve written it and it’s here, and the same goes for the Unified Logging Format.
Call it inertia, but replacing either one with something else will require either the existing solution to break or the proposed replacement to exhibit massive, world-changing superiority. These things are done and they work, so at this point, we’re not going to rock the boat. It ain’t broke anymore, so we’re not fixing it.