In a previous post, I promised to blog in a bit more detail about MarkLogic, a native XML database. Not wanting to break my word, I've finally put my thoughts on "paper."
The company and its database grew out of a need in the marketplace for a unified platform that combines an integrated search engine with the features typically associated with database management systems. It marries the advanced capabilities of the former (indexing of semi-structured data, fast querying of information stored in disparate formats, Boolean searches, stemming, and thesauri) with typical features of database systems (role-based security, clustering, and others). This synergy was pioneered in 2001 by people from Google and other successful companies in these two domains.
This pairing is part of what provides the following features of MarkLogic Server:
- The ability to secure content by roles
- Content stored in the database is accessible via a RESTful Web service interface or via a .NET programming library similar to ADO.NET, which they call XCC
- A WebDAV interface that allows Office applications, for example, to open documents directly from the database, which can facilitate ad hoc reporting requirements among other things
- The ability to import Microsoft Office documents, PDFs, Web pages, and other document types into the database, where they can be searched and exported to alternative formats
- Support for the W3C's XQuery language
- XML indexing that provides fast full-text and semi-structured searches
- An architecture that can purportedly scale to Internet-levels (hundreds of terabytes)
- Support for searches within XML content using stemming, thesauri, and spell-checking
- Transactional updates
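To make the RESTful access route concrete, here is a minimal sketch of building a search request against such an interface. The host, port, endpoint path, and query parameters are hypothetical placeholders for illustration only, not MarkLogic's actual URL scheme:

```python
from urllib.parse import urlencode

def build_search_url(host: str, terms: str, page: int = 1) -> str:
    """Construct a search URL for a hypothetical RESTful search endpoint."""
    params = urlencode({"q": terms, "page": page})
    return f"http://{host}/search?{params}"

# A client could pass this URL to any HTTP library to run the query.
url = build_search_url("marklogic.example.com:8002", "native XML database")
print(url)  # http://marklogic.example.com:8002/search?q=native+XML+database&page=1
```

The appeal of this style of access is that any HTTP-capable client, in any language, can query the content store without a driver.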
In traditional RDBMSs, text indexes and B-tree indexes are disconnected, so they constantly drift out of sync, and keeping them aligned is costly. If the system can't keep up with demand, or synchronization simply isn't done, searches won't find content that is actually in the database. Keeping the two types of indexes in sync consumes CPU and disk resources and eventually hits the upper limit of what the hardware can provide; at that threshold, the RDBMS prevents large systems from scaling further. This design is only a theoretical limitation under small loads, but it becomes a real issue when the database grows to enterprise or Internet scale.
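The double-bookkeeping cost is easy to see in a toy model. This sketch (my own simplification, not anything from MarkLogic or any RDBMS) maintains a sorted key index standing in for a B-tree alongside an inverted full-text index; notice that every single write must pay for both:

```python
from bisect import insort

class DualIndexStore:
    """Toy store showing why separate structured and full-text indexes
    are expensive to keep aligned: every write must update both, and a
    search only finds a document if both updates actually happened."""
    def __init__(self):
        self.docs = {}        # doc_id -> text
        self.sorted_ids = []  # stands in for a B-tree over the key
        self.inverted = {}    # word -> set of doc_ids (full-text index)

    def insert(self, doc_id, text):
        self.docs[doc_id] = text
        insort(self.sorted_ids, doc_id)        # structured-index cost
        for word in text.lower().split():      # full-text-index cost
            self.inverted.setdefault(word, set()).add(doc_id)

    def search(self, word):
        return self.inverted.get(word.lower(), set())

store = DualIndexStore()
store.insert(2, "native XML database")
store.insert(1, "relational database engine")
print(store.search("database"))  # finds both docs only because both writes updated both indexes
```

In a real RDBMS the two indexes often live in separate subsystems updated on different schedules, which is exactly where the drift described above comes from.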
Many RDBMSs currently support an XML data type; Microsoft SQL Server, Oracle, and DB2 all have such a feature. However, not all RDBMSs store XML natively: the distinction lies in how the information is serialized, and native XML databases do not store it in a relational backend. Oracle originally insisted that all XML could be shredded into relational form and that native XML storage wasn't needed. IBM disagreed, recognized that XML needed to be stored in a non-relational form, and did so from the very first version of its product that offered XML support. Oracle has since changed its position and now stores XML in a non-relational form in its current release.
This reversal of opinions makes MarkLogic look very good, because it saw early on that shredding XML into a relational backend was an antiquated design that would not work for document content the way it had for tabular data over the last few decades. Unlike traditional database engines, MarkLogic does not use a relational storage system and thus does not shred XML into that form. Instead, verbose XML content is converted to a logical representation that is compressed and serialized. Compression and decompression are costly in time and CPU usage, so MarkLogic has striven to find a good balance between the amount of compression applied and the quantity of storage space consumed.
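The trade-off is easy to demonstrate with a generic compressor. This sketch uses zlib purely for illustration; MarkLogic's internal representation is its own compact encoding, not zlib, but the shape of the bargain is the same: verbose, repetitive XML shrinks dramatically, while every read pays a decompression cost.

```python
import zlib

# Verbose, repetitive markup -- the worst case for raw storage,
# the best case for compression.
xml = ("<book><title>MarkLogic</title><author>Anon</author></book>" * 200).encode()

packed = zlib.compress(xml, level=6)
print(len(xml), len(packed))             # compressed form is far smaller
assert zlib.decompress(packed) == xml    # the round trip is lossless
```

Raising the compression level buys more space at the cost of more CPU per write, which is precisely the balance the paragraph above describes.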
MarkLogic, as of version 4, supports XQuery 1.0 (with or without its proprietary extensions) and is backward compatible with older versions of the draft standard.
MarkLogic has a number of customers in various industries, including publishing and government. Their first customer, Elsevier, is the largest publisher in the world. With its very first installation, MarkLogic was able to meet the publishing giant's requirement to store terabytes of data while simultaneously serving between 10 and 100 queries per second. The Elsevier deal cost many hundreds of millions of dollars. [Update: I talked to David Rees of MarkLogic today (Mar. 10, 2009) and he assured me that the deal with Elsevier was competitively priced and far, far less than what I originally reported. I apologize for the mistake and for the misinformation.]
Since that initial deployment, MarkLogic has delivered solutions for many other customers, including a Dallas-based company that needed to inspect messages flowing in and out of mainframe computers from many different sources, storing this freeform data in a repository that allowed fast retrieval and analysis. This customer evaluated both DB2 and MarkLogic, and the latter won out because of its ability to ingest massive amounts of XML data at optimal speed. DB2 stored the information without issue; however, as the database grew, the time it took to ingest additional information became prohibitive. Conversely, the time it took MarkLogic to insert new data grew at a much slower rate and was almost independent of the amount of content under its control.
If MarkLogic can outperform traditional databases like this, why aren't other vendors changing their products to keep up? Because their architectures and approaches are difficult to alter given their age and installed base. With the chance to start from scratch in a world where disks and memory were cheap (relative to the 1970s, when the architectures of many RDBMSs were fixed by their common ancestor, System R), MarkLogic was able to create a system that didn't discount ideas requiring larger amounts of disk usage or memory.
One of the ways MarkLogic diverges from traditional systems in this regard is that it does in-line updates, caching modifications in memory and performing sequential I/O on disk. This in-memory cache is indexed efficiently using techniques that aren't possible in traditional engines lacking such a caching mechanism. When a delete is performed in a MarkLogic server, a flag is set in memory indicating that the record has been removed; the information on disk isn't removed at the same time. Rather, a background process periodically finds the records that have been marked for deletion and (if configured to do so) removes them from disk. If a piece of content has been updated in the cache but is deleted before this background process runs, the update is never reflected on disk and the records are simply removed. Thus, I/O is reduced and requests are fulfilled more quickly.
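The flow above can be sketched in a few lines. This is a toy model of the lazy-deletion idea as I understand it, not MarkLogic's actual implementation: updates sit in a memory cache, deletes only set a tombstone flag, and a periodic background pass does all the disk work in one sweep.

```python
class LazyStore:
    """Toy sketch of lazy I/O: puts are cached in memory, deletes set a
    tombstone flag, and a background pass later does the disk work. An
    update that is deleted before the pass runs never touches disk."""
    def __init__(self):
        self.disk = {}          # records persisted "on disk"
        self.cache = {}         # pending in-memory updates
        self.tombstones = set() # keys flagged as deleted

    def put(self, key, value):
        self.cache[key] = value
        self.tombstones.discard(key)

    def delete(self, key):
        self.tombstones.add(key)   # flag only -- no disk I/O here

    def get(self, key):
        if key in self.tombstones:
            return None
        return self.cache.get(key, self.disk.get(key))

    def background_pass(self):
        """Flush live updates and purge tombstoned records in one sweep."""
        for key, value in self.cache.items():
            if key not in self.tombstones:
                self.disk[key] = value
        for key in self.tombstones:
            self.disk.pop(key, None)
        self.cache.clear()
        self.tombstones.clear()

s = LazyStore()
s.put("a", "<doc/>")
s.put("b", "<doc/>")
s.delete("b")          # flagged in memory; the update to "b" will never hit disk
s.background_pass()
print(sorted(s.disk))  # only "a" was ever written out
```

The win is that the expensive disk work happens in batches of sequential I/O rather than as scattered random writes on every operation.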
The memory MarkLogic uses for its cache is journaled and can be recovered if the system crashes before the background process has had a chance to flush it to disk. The server is also careful not to keep too much data in RAM, both to limit resource pressure on the system and to ensure that restoration times after a crash aren't prohibitively long. The journaling mechanism MarkLogic employs is purportedly less costly than that of other RDBMSs.
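The journaling idea generically looks like this: append every cached update to a log before acknowledging it, then rebuild the cache from the log after a crash. This is a minimal sketch of write-ahead journaling in general, assuming a simple line-per-entry JSON format; MarkLogic's actual journal format is not public in this post.

```python
import json, os, tempfile

def journal_put(journal_path, key, value):
    """Append an update to the journal before it is acknowledged."""
    with open(journal_path, "a") as j:
        j.write(json.dumps({"op": "put", "key": key, "value": value}) + "\n")

def recover(journal_path):
    """Rebuild the in-memory cache by replaying the journal in order."""
    cache = {}
    with open(journal_path) as j:
        for line in j:
            entry = json.loads(line)
            if entry["op"] == "put":
                cache[entry["key"]] = entry["value"]
    return cache

path = os.path.join(tempfile.mkdtemp(), "journal.log")
journal_put(path, "doc1", "<a/>")
journal_put(path, "doc1", "<b/>")  # a later entry wins on replay
print(recover(path))               # {'doc1': '<b/>'}
```

Sequential appends like this are cheap, which is why journaling a cache can cost far less than synchronously updating the main data structures on every write.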
This approach also allows deleted data to be queried. If the database is configured not to remove information from disk but simply to mark it as deleted, queries can be run over information that was previously removed. The net effect is the ability to "time travel" through old data (if you will), much like a recycle bin that is never emptied and has no size cap. Of course, more disk storage must be available as a result, so the usefulness of this feature must be weighed against the storage costs.
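The time-travel effect falls out of the tombstone design naturally. In this sketch (again my own toy model, not MarkLogic's mechanism), a delete records a timestamp instead of removing anything, so a query can be evaluated as of any earlier moment:

```python
class VersionedStore:
    """Toy "recycle bin that is never emptied": deletes stamp a time on
    the row instead of removing it, so queries can look into the past."""
    def __init__(self):
        self.rows = []   # each row: [key, value, created_at, deleted_at_or_None]
        self.clock = 0

    def put(self, key, value):
        self.clock += 1
        self.rows.append([key, value, self.clock, None])

    def delete(self, key):
        self.clock += 1
        for row in self.rows:
            if row[0] == key and row[3] is None:
                row[3] = self.clock   # tombstone timestamp, not removal

    def query(self, key, as_of=None):
        t = self.clock if as_of is None else as_of
        for row in self.rows:
            if row[0] == key and row[2] <= t and (row[3] is None or row[3] > t):
                return row[1]
        return None

s = VersionedStore()
s.put("doc", "v1")              # time 1
s.delete("doc")                 # time 2
print(s.query("doc"))           # None -- it is gone "now"
print(s.query("doc", as_of=1))  # v1 -- time travel to before the delete
```

Every retained tombstone is disk you never get back, which is the storage trade-off noted above.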
MarkLogic is a really compelling technology that is certainly worth investigating, and I hope this little writeup has played a small part in that process.