Recently in Databases Category

Anyone who knows me or reads this blog regularly can tell you that I'm totally sold on cloud computing. Recently, I wrote a little about MarkLogic, blogged about IaaS, and got to thinking: MarkLogic should build an infrastructure service that runs in Amazon's cloud, scales to Internet levels, and comes with a pay-per-use sales model.

MarkLogic's native XML database is proven, mature, and has been deployed at many companies with household names; however, it isn't cheap. It is competitively priced, but still prohibitively expensive for many companies. The product is licensed by CPU socket (IIRC), and a license to run on a one-socket machine is something like $10,000. That's a lot of zeros for small and medium businesses, especially when trusted, better-known relational products like MySQL, PostgreSQL, and SQL Server are available at no cost or for pennies on the dollar compared to MarkLogic. Switching from a tried-and-true relational database to an unfamiliar product from a lesser-known vendor sounds really risky to many purse-string holders. If MarkLogic provided its wares as a service that organizations could consume over the Internet on a pay-per-use basis, reluctant companies would be able to use MarkLogic's product in an actual program that goes to production, allowing naysayers to see that this alternative has a lot of benefits and that the risks are not as great as they fear.

As the software's creator, who better to create this infrastructure service? They have already used it to build a highly scalable and highly available system called MarkMail, which currently indexes and searches over 7,000 news groups. For example, they know the caching and proxying problems caused by Squid, which they used in an early release of the news group indexing service. They solved those problems, made the system scale, and I'm sure they could do the same with a general-purpose data storage service.

If you survey the landscape of infrastructure services running in the cloud today, your choices are few. You have Amazon's SimpleDB, Microsoft's SQL Data Services (eventually), or Google's App Engine datastore. (There might be others, but I don't know of them.) These services all have different, proprietary interfaces. If MarkLogic were to create an alternative to these, its interface would be standard XQuery 1.0. I don't know about you, but I would find a standardized API more appealing from both a business and a development point of view. From the former perspective, it would assure me that I'm not writing code that locks me into one vendor's service; from the latter, it would allow me to use existing knowledge, libraries, and tools rather than forcing me to use what the vendors provide or to create my own.

Considering some of the recent partnerships that MarkLogic's competitors IBM and Oracle have made with Amazon, I wonder if MarkLogic can afford not to, at the very least, offer a cloud-compatible license and a pricing model that allows for elastic scalability. While that would be a great first step, I'm convinced that a full-blown data storage service would be adopted by many companies trying to store large amounts of semi-structured content with Internet-sized demands. If MarkLogic does go after this market, running it in Amazon's cloud rather than someone else's would benefit others running in Amazon's data centers because data transferred within Amazon's private network is free. If MarkLogic does not provide a data storage service that is powered by a native XML datastore, supports a standardized query interface, scales to Internet levels, and is highly available, redundant, and performant, then someone else should, using an open source alternative. I would be very interested in such a service regardless of who provides it.

In a previous post, I promised to blog about a bit more of the details of MarkLogic, a native XML database.  Not wanting to break my word, I've finally put my thoughts on "paper."


The company and its database grew out of a need in the marketplace for a unified platform that integrates a search engine with the features typically associated with database management systems.  It marries the advanced capabilities found in the former (including indexing of semi-structured data, fast querying of information stored in disparate formats, searches based on Boolean logic, stemming, and thesauri) with typical features of database systems (including role-based security, clustering, and others).  This synergy was pioneered in 2001 by people from Google and other successful companies that are famous in these two domains.

This pairing is part of what provides the following features of MarkLogic Server:

  • The ability to secure content by role
  • Content stored in the database is accessible via a RESTful Web service interface or a .NET programming library similar to ADO.NET, which they call XCC
  • A WebDAV interface that allows Office applications, for example, to open documents/content directly, which can facilitate ad-hoc reporting requirements among other things
  • The ability to import Microsoft Office documents, PDFs, Web pages, and other document types into the database, which can then be searched and exported to alternative formats
  • Support for the W3C's XQuery language
  • XML indexing that provides fast full-text and semi-structured searches
  • An architecture that can purportedly scale to Internet levels (hundreds of terabytes)
  • Support for searches within XML content using stemming, thesauri, and spell-checking
  • Transactional updates

In traditional RDBMSs, full-text and B-tree indexes are disconnected structures.  As a result, they are constantly out of sync, and keeping them aligned is costly.  If these resources can't keep up with demand, or synchronization isn't done, searches won't find content that is stored in the DB even though it's there.  Keeping the two types of indexes in sync at the cost of CPU and disc resources eventually hits an upper limit of what the computer can provide.  As a result, large systems are prevented by the RDBMS from scaling beyond this threshold.  This design is only a theoretical limitation for small loads but can become a real issue when the size of the database increases to enterprise or Internet levels.
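
To make the two-index point concrete in SQL Server terms, here is a minimal sketch (the table, index, and catalog names are hypothetical, and it assumes the full-text feature is installed); the takeaway is simply that the B-tree indexes and the full-text index are separate structures maintained by separate machinery, with the full-text side populated asynchronously:

-- Hypothetical schema illustrating the two kinds of indexes.
CREATE TABLE dbo.Document
(
    DocumentId INT NOT NULL CONSTRAINT PK_Document PRIMARY KEY, -- clustered B-tree
    Title      NVARCHAR(200) NOT NULL,
    Body       NVARCHAR(MAX) NOT NULL
);

CREATE INDEX IX_Document_Title ON dbo.Document (Title); -- another B-tree

CREATE FULLTEXT CATALOG DocumentCatalog; -- a separate full-text structure
CREATE FULLTEXT INDEX ON dbo.Document (Body)
    KEY INDEX PK_Document ON DocumentCatalog
    WITH CHANGE_TRACKING AUTO; -- kept in sync by a background population process

-- This search only finds rows the background process has already indexed.
SELECT DocumentId, Title
FROM dbo.Document
WHERE CONTAINS(Body, 'MarkLogic');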

 

Many RDBMSs currently support an XML data type.  For example, Microsoft SQL Server, Oracle, and DB2 have such a feature; however, not all RDBMSs store XML natively.  The distinction comes down to the way the information is serialized: native XML databases do not store the information in a relational backend.  Oracle originally insisted that all XML could be shredded into a relational form and that native XML storage wasn't needed.  IBM disagreed, saw the need to store XML in a non-relational form, and did so in the very first version of its product that offered XML support.  Oracle has since changed its opinion and now stores XML in a non-relational form in its current release.
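
For reference, here's what that XML data type looks like on the surface in SQL Server; this is just a generic sketch with made-up names, and it says nothing about how any particular engine serializes the XML underneath:

-- Hypothetical table with a relational XML column. Whether the engine
-- shreds, encodes, or natively stores the XML is hidden behind this syntax.
CREATE TABLE dbo.Invoice
(
    InvoiceId  INT NOT NULL PRIMARY KEY,
    InvoiceXml XML NOT NULL
);

INSERT INTO dbo.Invoice (InvoiceId, InvoiceXml)
VALUES (1, N'<invoice><customer>Acme</customer><total>42.00</total></invoice>');

-- Query inside the XML using the xml type's XQuery-based methods.
SELECT InvoiceId,
       InvoiceXml.value('(/invoice/customer)[1]', 'NVARCHAR(100)') AS Customer
FROM dbo.Invoice;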

 

This reversal of opinions makes MarkLogic look very good because they saw that this shredding process and relational backend storage design was antiquated and would not work for document content as it had for the last few decades with tabular data.  Unlike traditional database engines, MarkLogic does not use a relational storage system, and thus it does not shred XML into this form.  Instead, verbose XML content is converted to a logical representation which is compressed and serialized.  This compression and decompression is costly in terms of time and CPU usage, so MarkLogic has striven to find a good balance between the CPU spent on compression and the amount of storage space consumed.


MarkLogic, as of version 4, supports XQuery 1.0 (with or without its proprietary extensions) and is backward compatible with older versions of the draft standard.

MarkLogic has a number of customers in various industries including publishing, government, and others.  Their first customer, Elsevier, is the largest publisher in the world.  With their very first installation, they were able to meet the publishing giant's requirement to store terabytes of data while simultaneously being able to query it between 10 and 100 times per second.  The Elsevier deal cost many hundreds of millions of dollars.  [Update: I talked to David Rees of MarkLogic today (Mar. 10, 2009) and he assured me that the deal with Elsevier was competitively priced and far, far less than what I originally reported.  I apologize for the mistake and for the misinformation.]

Since their initial offering, MarkLogic has deployed solutions for many other customers, including a Dallas-based company that needed to snoop on messages that went in and out of mainframe computers from many different sources.  This company had to store this freeform data in a repository that would allow for fast retrieval and analysis.  The customer tested DB2 and MarkLogic, but the latter eventually won out because of its ability to ingest massive amounts of XML data at high speed.  DB2 was able to store the information without issue; however, as the database grew, the time that it took to ingest additional information became prohibitive.  Conversely, the time that it took MarkLogic to insert new data grew at a much slower rate and was almost independent of the amount of content under its control.

If MarkLogic is able to outperform traditional databases like this, why aren't other vendors changing their products to keep up?  The reason is that their architectures and approaches are difficult to alter considering their age and install base.  With a chance to start from scratch in a world where discs and memory were cheap (relative to the 1970s, when the roots of many RDBMSs were architecturally fixed by their ancestor, System R), MarkLogic was able to create a system that didn't discount ideas requiring larger amounts of disc usage or memory.

One of the ways that MarkLogic diverges from traditional systems in this regard is that it does in-line updates, which allows it to cache modifications in memory and to do sequential I/O on disc.  This in-memory cache is indexed efficiently using techniques that aren't possible in traditional engines that can't use such caching mechanisms.  When a delete is performed in a MarkLogic server, a flag is set in memory indicating that the record has been removed; however, the information on disc isn't removed at the same time.  Rather, a background process periodically finds the records that have been marked for deletion and (if configured to do so) removes them from disc.  If a piece of content has been updated in the cache but is subsequently deleted before this background process has run, the update is never reflected on disc and the records are simply removed.  Thus, I/O is reduced and requests are fulfilled more quickly.
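
As a rough analogy in relational terms (and emphatically not MarkLogic's actual implementation), the mark-now, purge-later idea looks something like this hypothetical soft-delete pattern:

-- Hypothetical soft-delete analogy: a delete only flips a flag, and a
-- scheduled background job performs the physical removal later, in batches.
CREATE TABLE dbo.Content
(
    ContentId    INT NOT NULL PRIMARY KEY,
    Body         NVARCHAR(MAX) NOT NULL,
    IsDeleted    BIT NOT NULL DEFAULT 0,
    DeletedAtUtc DATETIME NULL
);

-- The "delete": the row stays on disc for now.
UPDATE dbo.Content
SET IsDeleted = 1, DeletedAtUtc = GETUTCDATE()
WHERE ContentId = 42;

-- Run periodically by a background job to reclaim space.
DELETE TOP (1000) FROM dbo.Content
WHERE IsDeleted = 1
  AND DeletedAtUtc < DATEADD(DAY, -7, GETUTCDATE());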

 

The memory that MarkLogic uses for its cache is journaled and can be recovered if the system crashes before this background process has had a chance to flush it to disc.  It is also careful not to store too much data in RAM, both to limit resource pressure on the system and to ensure that restoration times after crashes aren't prohibitively long.  The journaling mechanism employed by MarkLogic is purportedly less costly than that of other RDBMSs.

This approach also allows data to be queried even after it has been deleted.  If the database is configured not to remove the information from disc but to simply mark it as deleted, queries can be made over information that was previously removed.  The net effect is the ability to "time travel" through old data, if you will.  This notion is similar to a recycling bin that is never emptied and isn't capped in size.  The result, of course, is that more disc storage must be available, so the usefulness of this feature must be weighed against the storage costs.

MarkLogic is a really compelling technology that is certainly worth investigating, and I hope this little writeup has played a small part in that process.

SQL Server 2008 has a great new feature called table-valued parameters (TVPs). I think there are a lot of applications for this new technology, but what about using it with some of Redmond's other new inventions, namely Entity Framework (EF) or LINQ to SQL? Specifically, can you drag and drop a stored procedure from the Server Explorer onto the Entity Data Model Designer (in the case of EF) or the Object Relational Designer (in the case of LINQ to SQL)? The answer to this question is yes, but with a catch. Once you try to build the solution, you'll get one of the following compilation errors:

  • DBML1005: Mapping between DbType 'Structured' and Type 'System.Object' in Parameter 'source_key_list' of Function 'dbo.stp_GetCustomerSK' is not supported.
  • The function 'stp_GetCustomerSK' has a parameter 'source_key_list' at parameter index 0 that has a data type 'table type' which is not supported, the function was excluded.

The first error is what you'll get when you're trying to use TVPs with LINQ to SQL, and the second is what you'll get when using EF. These errors were reported from Visual Studio 2008 SP 1.
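
For context, the kind of T-SQL that triggers these errors looks roughly like the following. Only the names stp_GetCustomerSK and source_key_list come from the error messages above; the table type's definition and the dbo.Customer table it queries are my own placeholders:

-- Hypothetical table type and TVP-taking procedure (names other than
-- stp_GetCustomerSK and source_key_list are invented for illustration).
CREATE TYPE dbo.SourceKeyList AS TABLE
(
    SourceKey INT NOT NULL
);
GO

CREATE PROCEDURE dbo.stp_GetCustomerSK
    @source_key_list dbo.SourceKeyList READONLY
AS
BEGIN
    SELECT c.CustomerSK
    FROM dbo.Customer AS c
    INNER JOIN @source_key_list AS k ON k.SourceKey = c.SourceKey;
END

It's the table-type parameter that the designers can't map: drag the procedure onto either designer and the build fails with the errors shown above.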

This state of affairs reinforces my opinion about these two ORM technologies. I think EF is an immature version-one technology that should be avoided in most cases. (Try back with version two or three.) As for LINQ to SQL, I think it should only be used on products with a short life span, considering that Microsoft is no longer investing any R&D dollars into it and will probably deprecate it in the next release of the .NET Framework.

I was looking into RDF Gateway the other week and found it to be a very interesting product. It provides an RDF store which can be accessed using SPARQL or T-SQL and persists its data in a SQL Server database. It also allows you to expose your SPARQL queries by creating RDF Server Pages (RSPs). This technology is similar in nature to ASP or JSP in that it mixes markup with SPARQL queries and a JavaScript-like language called RDF Query Language (RQL), much as ASP embeds JScript/VBScript in markup and JSP intermixes it with Java. RSP documents are interpreted by RDF Gateway, which runs as a Windows NT service. These scripts are requested over HTTP, making them accessible to a wide array of applications and programming environments. After just a cursory investigation, it isn't hard to see how one could use this product to expose an RDF store as a collection of RESTful Web services. In case you're having trouble visualizing this, have a look at the following figure that shows the overall architecture of how this might be done:

In my view, each RESTful service would have one RSP script. The client would indicate which action or operation it wanted to invoke by setting a parameter on the query string. The RSP page would grab this argument and invoke the corresponding function to fulfill that specific type of request. I realize that this design would make the service a REST-RPC hybrid (to use the term Richardson and Ruby coined in their book RESTful Web Services), but it would avoid the need to use something like Squid simply to do URL rewriting (since RDF Gateway doesn't have such abilities natively).

To show how this would work, I'll describe the idiom used by the different RSP pages above. Because each service would have one RSP page, it would perform the role of "dispatcher." It would grab the action off the query string, switch on it, and call the appropriate function. To demonstrate how this might look, consider the following snippet of RQL code:

<%
import "operations/a1.rql";
import "operations/a2.rql";
import "/common.rql";
%>
<Response>
<%
Response.contentType = "text/xml";

var action = Request["action"]; // Dispatcher

switch (action)
{
    case "a1":
        a1();
        break;
    case "a2":
        a2();
        break;
    default:
        _Error(); // Imported from common.rql
        break;
}
%>
</Response>

This code fragment is the service's entry point, its "main" if you will. To keep the code organized, each of the service's operations is fulfilled by a different function, which is factored out into a separate RQL script and imported into the main RSP page. Each of the RSP pages for the different services would be almost identical. (I can imagine a little code generator to produce this code and a plug-in for Visual Studio or other editors to call it.)

When the action is fired, it would process the body of the HTTP request, use SPARQL to update or query the RDF store, and return the results in XML or JSON format. (The snippet above precludes the use of JSON, but that could easily be changed, and which format is returned could be based on the value of a query string argument, e.g., json=1.)

Like the format of the results, the contents of the request would be XML or JSON. In the latter case, I think that RQL's eval statement could be used to turn it into an object that could be processed directly. In the case of XML, a little munging would have to be done, and I'm not sure what support RDF Gateway has for this. Once converted to objects that can be used in the RQL script, they would be passed to SPARQL to perform the operation-specific processing, as shown in the following snippet:

function a1()
{
    var input;

    // RQL doesn't have a contentType property on its
    // Request object, so we have to see if the input
    // is JSON by trying to eval it. I realize that
    // this makes the system susceptible to script injection,
    // but this is just a POC.
    //
    // The object returned from eval and MungeXml contains
    // two properties, lbound (lower bound) and ubound (upper
    // bound).

    try
    {
        input = eval(Request.binaryRead());
        input.lbound; input.ubound;
    }
    catch (ex)
    {
        input = MungeXml();
    }

    // Use the input parameters in a SPARQL expression to update
    // or query the datastore.

    var rs = (SELECT TOP 5 ?p ?s ?o USING mydata
              WHERE {?p ?s ?o} and between(?o, #(input.lbound),
              #(input.ubound)));

    Response.Write("<data>");

    // Loop until the end of the result set is reached.
    for (; !rs.EOF; rs.MoveNext())
    {
        Response.Write("<datum><p>");
        Response.Write(rs["p"]);
        Response.Write("</p><s>");
        Response.Write(rs["s"]);
        Response.Write("</s><o>");
        Response.Write(rs["o"]);        
        Response.Write("</o></datum>");
    }    

    Response.Write("</data>");
}

As I mentioned at the beginning of this post, I've only had time to perform a short investigation into RDF Gateway. So, if there are glaring problems or omissions with the ideas presented here, please let me know as I learn more about this exciting product.

If you would like to get started with RDF Gateway, surf over to Intellidimension's Web site and request a 60 day trial license by emailing [email protected].

Part of optimizing the IO subsystem of a machine running Microsoft SQL Server is deciding how many files a user database and tempdb should be broken up into. Denny Cherry provides some guidelines to determine this in an article that he published earlier this year. In it, he provides a couple of formulas that can be used to mathematically determine the appropriate number of files depending on the hardware being used. In the following discussion of these equations, let C, O, and P be defined as follows:

C = Number of CPUs
O = Number of cores per CPU
P = Physical files

According to Cherry, the number of files that should make up a user database can be computed using this formula:

P = C * O / 4

For example, if you have two dual-core CPUs, you would have one physical file for the user database:

P = C * O / 4
P = 2 * 2 / 4
P = 1

Or, with a dual quad-core machine, the number of physical files is two, as determined thus:

P = C * O / 4
P = 2 * 4 / 4
P = 2

To compute the number of files that make up tempdb, use this formula:

P = C * O

For example, if you have two dual-core CPUs, you would have 4 files for tempdb:

P = C * O
P = 2 * 2
P = 4

Or, if you have two quad-core CPUs, the number of physical files is 8:

P = C * O
P = 2 * 4
P = 8
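
As a sketch of what the dual quad-core case might look like in practice, the statement below brings tempdb up to eight data files (tempdb already has one, tempdev, by default); the drive letter, sizes, and growth settings are placeholders, not recommendations:

-- Add seven data files so tempdb has 8 in total (P = 2 * 4).
-- Paths and sizes are assumptions; match them to your own storage layout.
ALTER DATABASE tempdb
ADD FILE
    (NAME = tempdev2, FILENAME = 'T:\tempdb2.ndf', SIZE = 100MB, FILEGROWTH = 50MB),
    (NAME = tempdev3, FILENAME = 'T:\tempdb3.ndf', SIZE = 100MB, FILEGROWTH = 50MB),
    (NAME = tempdev4, FILENAME = 'T:\tempdb4.ndf', SIZE = 100MB, FILEGROWTH = 50MB),
    (NAME = tempdev5, FILENAME = 'T:\tempdb5.ndf', SIZE = 100MB, FILEGROWTH = 50MB),
    (NAME = tempdev6, FILENAME = 'T:\tempdb6.ndf', SIZE = 100MB, FILEGROWTH = 50MB),
    (NAME = tempdev7, FILENAME = 'T:\tempdb7.ndf', SIZE = 100MB, FILEGROWTH = 50MB),
    (NAME = tempdev8, FILENAME = 'T:\tempdb8.ndf', SIZE = 100MB, FILEGROWTH = 50MB)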

Keep in mind that I'm talking about files here, not filegroups. Filegroups can purportedly be used to increase parallelism, as noted on Microsoft's Web site; however, I would take that suggestion with a grain of salt. In her book Inside SQL Server 2005: The Storage Engine, Kalen Delaney has this to say about using filegroups to increase parallelism:

Most SQL Server databases have a single data file in the default filegroup....As a user acquires greater database sophistication, she might want to use multiple devices to spread out the IO...The easiest way to do this is to create the database file on a RAID device. Still no need for filegroups...the user might decide that she really wants multiple files...In this case, she still does not need the filegroups — she can accomplish her goals by using create database with a list of files in separate drives.


More sophisticated database administrators might decide to have different tables assigned to different drives or to use the table and index partition feature of SQL Server 2005. Only then they will need to use filegroups...

There's usually no performance benefit in doing so.

To make this distinction clear, what I'm talking about here is a create statement such as this one:

CREATE DATABASE Archive
ON
PRIMARY
(NAME = Arch1,
FILENAME = 'W:\archdat1.mdf', -- Drive one: W
SIZE = 100MB,
MAXSIZE = 200,
FILEGROWTH = 20),
(NAME = Arch2,
FILENAME = 'X:\archdat2.ndf', -- Drive two: X
SIZE = 100MB,
MAXSIZE = 200,
FILEGROWTH = 20),
(NAME = Arch3,
FILENAME = 'Y:\archdat3.ndf', -- Drive three: Y
SIZE = 100MB,
MAXSIZE = 200,
FILEGROWTH = 20)
LOG ON
(NAME = Archlog1,
FILENAME = 'Z:\archlog1.ldf', -- Drive four: Z
SIZE = 100MB,
MAXSIZE = 200,
FILEGROWTH = 20),
(NAME = Archlog2,
FILENAME = 'Z:\archlog2.ldf', -- Drive four: Z
SIZE = 100MB,
MAXSIZE = 200,
FILEGROWTH = 20)

In doing this, you are instructing SQL Server to create different files to store your data. You are not, however, telling it which objects to put in which files; that can only be done with filegroups. Rather, SQL Server decides which files to put things in. If each of these files is on a different drive, or array of drives, you can achieve greater performance, as Delaney points out.
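
To make the contrast concrete, here's a hedged sketch of what controlling placement with a filegroup would look like for the Archive database above; the filegroup, file, path, and table names are all made up:

-- Unlike the file-only example above, this pins a specific table to a
-- specific drive by way of a named filegroup (all names are hypothetical).
ALTER DATABASE Archive ADD FILEGROUP ArchiveHistory;

ALTER DATABASE Archive
ADD FILE (NAME = ArchHist1, FILENAME = 'V:\archhist1.ndf', SIZE = 100MB)
TO FILEGROUP ArchiveHistory;

-- Objects created ON the filegroup are stored in that filegroup's files.
CREATE TABLE dbo.OrderHistory
(
    OrderId   INT NOT NULL PRIMARY KEY,
    OrderedOn DATETIME NOT NULL
) ON ArchiveHistory;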