In a previous post, I promised to blog in a bit more detail about MarkLogic, a native XML database. Not wanting to break my word, I've finally put my thoughts on "paper."
The company and its database grew out of a need in the marketplace for a unified platform that combines an integrated search engine with the features typically associated with database management systems. It marries the advanced capabilities found in the former (including indexing of semi-structured data, fast querying of information stored in disparate formats, searches based on Boolean logic, stemming, and thesauri) with typical features of database systems (including role-based security, clustering, and others). This synergy was pioneered in 2001 by people from Google and other successful companies that are famous in these two domains.
This pairing is part of what provides the following features of MarkLogic Server:
- The ability to secure content by roles
- Content stored in the database is accessible via a RESTful Web service interface or a .NET programming library, similar to ADO.NET, which they call XCC
- The server also provides a WebDAV interface, allowing Office applications, for example, to open documents/content directly, which can facilitate ad-hoc reporting requirements, among other things
- Ability to import Microsoft Office documents, PDFs, Web pages, and other document types into the database, which can then be searched and exported to alternative formats
- Support for the W3C's XQuery language
- XML indexing that provides fast full-text and semi-structured searches
- An architecture that can purportedly scale to Internet levels (hundreds of terabytes)
- Support for searches within XML content using stemming, thesauri, and spell-checking
- Transactional updates
In traditional RDBMSs, text and B-tree indexes are disconnected. As a result, they are constantly out of sync, and keeping them aligned is costly. If the machine can't keep up with demand, or synchronization isn't done, searches won't find content that is stored in the DB even though it's there. Keeping the two types of indexes in sync costs CPU and disc resources, and that cost eventually hits the upper limit of what the computer can provide. As a result, large systems are prevented by the RDBMS from scaling beyond this threshold. This design is only a limitation in the theoretical sense for small loads, but it can become a real issue when the size of the database increases to enterprise or Internet levels.
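To make the out-of-sync problem concrete, here is a minimal sketch in Python. It is a toy model of my own, not how any particular RDBMS is implemented: the class name `SplitIndexStore` and its methods are hypothetical. Rows are stored immediately, while the text index is only refreshed when a separate synchronization step runs.

```python
class SplitIndexStore:
    """Toy model: rows are stored at once; the text index is refreshed later."""

    def __init__(self):
        self.rows = {}         # primary storage: row id -> text
        self.text_index = {}   # word -> set of row ids
        self.unindexed = []    # row ids waiting for synchronization

    def insert(self, row_id, text):
        # The row is stored immediately, but the text index is not touched.
        self.rows[row_id] = text
        self.unindexed.append(row_id)

    def sync_text_index(self):
        # The costly catch-up step that brings the text index up to date.
        for row_id in self.unindexed:
            for word in self.rows[row_id].lower().split():
                self.text_index.setdefault(word, set()).add(row_id)
        self.unindexed.clear()

    def search(self, word):
        # Searches consult only the text index, never the rows themselves.
        return sorted(self.text_index.get(word.lower(), set()))


store = SplitIndexStore()
store.insert(1, "native XML database")
before_sync = store.search("xml")   # the row exists but is not yet searchable
store.sync_text_index()
after_sync = store.search("xml")    # now the search finds it
```

Until `sync_text_index` runs, the search misses a row that is already in storage, which is the situation described above.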
Many RDBMSs currently support an XML data type. For example, Microsoft SQL Server, Oracle, and DB2 have such a feature; however, not all RDBMSs store XML natively. This distinction is made based on the way that the information is serialized. Native XML databases do not store the information in a relational backend. Oracle originally insisted that all XML could be shredded into a relational form and that native XML storage wasn't needed. IBM disagreed and saw that they needed to store XML in a non-relational form, and did so even with the first version of their product that offered XML support. Oracle has since changed its opinion and now stores XML in a non-relational form in its current release.
This reversal of opinion makes MarkLogic look very good because they saw that the shredding process and relational backend storage design were antiquated and would not work for document content as they had for the last few decades with tabular data. Unlike traditional database engines, MarkLogic does not use a relational storage system and thus does not shred XML into this form. Instead, verbose XML content is converted to a logical representation which is compressed and serialized. This compression and decompression is costly in terms of time and CPU usage, so MarkLogic has striven to find a good balance between the amount of compression needed and the quantity of storage space consumed.
MarkLogic, as of version 4, has support for XQuery 1.0 (with or without its proprietary extensions) and is backward compatible with older versions of the draft standard.
MarkLogic has a number of customers in various industries including publishing, government, and others. Their first customer, Elsevier, is the largest publisher in the world. With their very first installation, they were able to meet the publication giant's requirement to store terabytes of data while simultaneously being able to query it between 10 and 100 times per second. The Elsevier deal cost many hundreds of millions of dollars. [Update: I talked to David Rees of MarkLogic today (Mar. 10, 2009) and he assured me that the deal with Elsevier was competitively priced and far, far less than what I originally reported. I apologize for the mistake and for the misinformation.]
Since their initial offering, MarkLogic has deployed solutions for many other customers, including a Dallas-based company that needed to snoop on messages that went in and out of mainframe computers from many different sources. This company had to store this freeform data in a repository that would allow for fast retrieval and analysis. This customer tested DB2 and MarkLogic, but the latter eventually won out because of its ability to ingest massive amounts of XML data at high speed. DB2 was able to store the information without issue; however, as the database grew, the time that it took to ingest additional information became prohibitive. Conversely, the time that it took MarkLogic to insert new data grew at a much slower rate and was almost independent of the amount of content under its control.
If MarkLogic is able to outperform traditional databases like this, why aren't other vendors changing to speed up their products? The reason is that their architectures and approaches are difficult to alter considering their age and install base. With a chance to start from scratch in a world where discs and memory were cheap (relative to the 70s, when the roots of many RDBMSs were architecturally fixed by their ancestor, System R), MarkLogic was able to create a system that didn't discount ideas that required larger amounts of disc usage or memory.
One of the ways that MarkLogic diverges from traditional systems in this regard is that it does in-line updates. This allows it to cache modifications in memory and to do sequential I/O on disc. This in-memory cache is indexed in an efficient manner using techniques that aren't possible in traditional engines that can't use such caching mechanisms. When a delete is performed in a MarkLogic server, a flag is set in memory indicating that the record has been removed; however, the information on disc isn't removed at the same time. Rather, a background process periodically finds the records that have been marked for deletion and, if configured to do so, removes them from disc. If a piece of content has been updated in the cache but is subsequently deleted before this background process has run, the update is never reflected on disc and the records are removed. Thus, I/O is reduced and requests are fulfilled more quickly.
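The delete-flag idea can be sketched in a few lines of Python. This is only a toy model under my own assumptions, not MarkLogic's actual implementation: the class `DeferredDeleteStore`, its method names, and the `disc_writes` counter are all hypothetical, and a plain dictionary stands in for the disc.

```python
class DeferredDeleteStore:
    """Toy store: updates and deletes touch memory first; disc is written lazily."""

    def __init__(self):
        self.disc = {}         # stands in for records on disc
        self.pending = {}      # in-memory cache of updates not yet flushed
        self.deleted = set()   # in-memory delete flags
        self.disc_writes = 0   # counts simulated disc writes

    def update(self, key, value):
        self.pending[key] = value
        self.deleted.discard(key)

    def delete(self, key):
        # Only a flag is set; nothing on disc changes yet.
        self.deleted.add(key)

    def read(self, key):
        if key in self.deleted:
            return None
        if key in self.pending:
            return self.pending[key]
        return self.disc.get(key)

    def background_merge(self, purge_deleted=True):
        # Flush cached updates, skipping records deleted in the meantime.
        for key, value in self.pending.items():
            if key not in self.deleted:
                self.disc[key] = value
                self.disc_writes += 1
        self.pending.clear()
        if purge_deleted:
            # Physically remove flagged records that made it to disc earlier.
            for key in self.deleted:
                if key in self.disc:
                    del self.disc[key]
                    self.disc_writes += 1
            self.deleted.clear()


store = DeferredDeleteStore()
store.update("/a.xml", "<a/>")
store.update("/b.xml", "<b/>")
store.delete("/b.xml")       # deleted before the background process runs
store.background_merge()
```

In this run, `/b.xml` never reaches the simulated disc at all, so only one write happens instead of three (two inserts and a delete).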
The memory that MarkLogic uses for its cache is journaled and can be recovered if the system crashes before this background process has had a chance to flush it to disc. It is also careful not to store too much data in RAM, both to limit resource pressure on the system and to ensure that restoration after a crash isn't prohibitively time consuming. The journaling mechanism employed by MarkLogic is purportedly less costly than that of other RDBMSs.
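As a rough illustration of how a journaled cache can survive a crash, here is a small Python sketch. It is a generic write-ahead journal under my own assumptions, not MarkLogic's journal format: the class `JournaledCache`, the JSON-lines file layout, and the method names are all hypothetical.

```python
import json
import os
import tempfile


class JournaledCache:
    """Toy in-memory cache whose changes are first appended to a journal file."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.cache = {}

    def put(self, key, value):
        # Append the change to the journal before applying it in memory.
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps({"key": key, "value": value}) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        self.cache[key] = value

    def flush_to_disc(self, disc):
        # Once the cache is safely written out, the journal can be emptied.
        disc.update(self.cache)
        self.cache.clear()
        open(self.journal_path, "w", encoding="utf-8").close()

    @classmethod
    def recover(cls, journal_path):
        # Rebuild the in-memory cache by replaying the journal in order.
        recovered = cls(journal_path)
        if os.path.exists(journal_path):
            with open(journal_path, encoding="utf-8") as journal:
                for line in journal:
                    entry = json.loads(line)
                    recovered.cache[entry["key"]] = entry["value"]
        return recovered


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "journal.log")
    live = JournaledCache(path)
    live.put("/a.xml", "<a/>")
    live.put("/a.xml", "<a>v2</a>")
    del live                                  # simulate a crash before any flush
    restored = JournaledCache.recover(path)
    restored_cache = dict(restored.cache)
    disc = {}
    restored.flush_to_disc(disc)
    after_flush = JournaledCache.recover(path).cache
```

The sketch also hints at why the cache size matters: the more unflushed changes sit in the journal, the longer the replay takes after a crash.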
This approach also allows for data to be queried even if it has been deleted. If the database is configured not to remove the information from disc but to simply mark it as deleted, queries can be made over information that was previously removed. The net effect is the ability to "time travel" through old data (if you will). This notion is similar to a recycling bin that is never emptied and isn't capped in size. The result of course is that more disc storage must be available, so the usefulness of this feature must be weighed against the storage costs.
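One way to picture this "time travel" is to stamp every record with the logical time at which it was created and, if applicable, deleted. The Python sketch below is a toy model under that assumption, not a description of MarkLogic's internals; `Version`, `TimeTravelStore`, and the `as_of` parameter are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    uri: str
    content: str
    created_at: int
    deleted_at: Optional[int] = None   # None means "still live"


class TimeTravelStore:
    """Toy store that marks records as deleted instead of removing them."""

    def __init__(self):
        self.versions = []
        self.clock = 0   # logical timestamp, incremented on every change

    def _tick(self):
        self.clock += 1
        return self.clock

    def insert(self, uri, content):
        ts = self._tick()
        self.versions.append(Version(uri, content, ts))
        return ts

    def delete(self, uri):
        ts = self._tick()
        for version in self.versions:
            if version.uri == uri and version.deleted_at is None:
                version.deleted_at = ts   # mark only; the data is kept
        return ts

    def query(self, as_of=None):
        # A version is visible if it existed at time t and was not yet deleted.
        t = self.clock if as_of is None else as_of
        return {
            v.uri: v.content
            for v in self.versions
            if v.created_at <= t and (v.deleted_at is None or v.deleted_at > t)
        }


store = TimeTravelStore()
t_insert = store.insert("/a.xml", "<a/>")
t_delete = store.delete("/a.xml")
current_view = store.query()                 # the document is gone now
past_view = store.query(as_of=t_insert)      # but still visible in the past
```

Nothing is ever removed from `versions` here, which is exactly the storage cost the paragraph above warns about.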
MarkLogic is a really compelling technology that is certainly worth investigating, and I hope this little writeup has played a small part in that process.
Excellent write up.
Thanks for taking the time.
Do you have any exposure to RPG writing to MarkLogic?
Do you mean IBM RPG? If so, no, I haven't heard of anything. MarkLogic's API is pretty universal; it has a RESTful Web service. If your programming language can do HTTP, you're pretty much set. Good luck :-)