Part of optimizing the IO subsystem of a machine running Microsoft SQL Server is deciding how many files a user database and tempdb should be broken up into. Denny Cherry provides some guidelines to determine this in an article that he published earlier this year. In it, he provides a couple of formulas that can be used to mathematically determine the appropriate number of files depending on the hardware being used. In the following discussion of these equations, let C, O, and P be defined as follows:

C = Number of CPUs
O = Number of cores per CPU
P = Physical files

According to Cherry, the number of files that should make up a user database can be computed using this formula:

P = C * O / 4

For example, if you have dual dual cores, you would have one physical file for the user databases:

P = C * O / 4
P = 2 * 2 / 4
P = 1

Or, with a dual quad core machine, the number of physical files is two as determined thus:

P = C * O / 4
P = 2 * 4 / 4
P = 2

To compute the number of files that makeup tempdb, use this formula:

P = C * O

For example, if you have dual dual cores, you would have 4 files for tempdb:

P = C * O
P = 2 * 2
P = 4

Or, if you have dual quads, the number of physical files is 8:

P = C * O
P = 2 * 4
P = 8

Keep in mind that I'm talking about files here not filegroups. Filegroups can purportedly be used to increase parallelism as noted on Microsoft's Web site; however, I would take that suggestion with a grain of salt. In his book, Inside SQL Server 2005: The Storage Engine Kalen Delaney has this to say about using file groups to increase parallelism:

Most SQL Server databases have a single data file in the default filegroup....As a user acquires greater database sophistication, she might want to use multiple devices to spread out the IO...The easiest way to do this is to create the database file on a RAID device. Still no need for filegroups...the user might decide that she really wants multiple files...In this case, she still does not need the filegroups — she can accomplish her goals by using create database with a list of files in separate drives.


More sophisticated database administrators might decide to have different tables assigned to different drives or to use the table and index partition feature of SQL Server 2005. Only then they will need to use filegroups...

There's usually no performance benefit in doing so.

To make this distinction clear, what I'm talking about here is a create statement such as this one:

CREATE DATABASE Archive
ON
PRIMARY
(NAME = Arch1,
FILENAME = 'W:\archdat1.mdf', -- Drive one: W
SIZE = 100MB,
MAXSIZE = 200,
FILEGROWTH = 20),
(NAME = Arch2,
FILENAME = 'X:\archdat2.ndf', -- Drive two: X
SIZE = 100MB,
MAXSIZE = 200,
FILEGROWTH = 20),
(NAME = Arch3,
FILENAME = 'Y:\archdat3.ndf', -- Drive three: Y
SIZE = 100MB,
MAXSIZE = 200,
FILEGROWTH = 20)
LOG ON
(NAME = Archlog1,
FILENAME = 'Z:\archlog1.ldf', -- Drive four: Z
SIZE = 100MB,
MAXSIZE = 200,
FILEGROWTH = 20),
(NAME = Archlog2,
FILENAME = 'Z:\archlog2.ldf', -- Drive four: Z
SIZE = 100MB,
MAXSIZE = 200,
FILEGROWTH = 20)

In doing this, you are instructing SQL Server to create different files to store your data. You are not telling it what objects to put in which files, however. This can only be done with filegorups. Rather, SQL Server decides which files to put which things in. If each of these files is on a different drive, or array of drives, you can achieve greater performance as Delaney points out.

Lately, I've been doing a lot of investigation of native XML databases in general and MarkLogic in particular. The motivation for this analysis was sparked by something I read in a research paper that was published last year by Michael Stonebraker et al wherein they state that:

...major RDBMS vendors can be outperformed by 1-2 orders of magnitude by specialized engines....We conclude that the current RDBMS code lines, while attempting to be a "one size fits all" solution, in fact, excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of "from scratch" specialized engines. The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow's requirements, not continue to push code lines and architectures designed for yesterday's needs.

I found this paper on the Web site of the Very Large Data Base Endowment, an organization promoting and exchanging scholarly work related to databases, while looking for information about designing large-scale, terabyte-size databases.

Because the problems that I'm typically confronted with involve XML and Web services, I wondered:

  • What specialized XML database engines are available?
  • Are these specialized databases able to outperform RDBMS in my scenarios as Stonebraker and his colleagues concluded?

I surfed around a learned that there are two types of XML databases: Hybrid and native XML databases. The difference between these two is an XML database that uses a hybrid architecture shreds the XML into a relational model for storage while a native one stores the information in its native document form. The former is what Stonebraker is advising against. To further explain the difference, a native XML database often exhibits the following characterizes (as described on XMLmind's Web site):

  • Their basic unit of storage is the XML document, the structure of which is preserved according to a data model such as the XML Infoset or the XQuery/XPath Data Model;
  • They accept any well-formed XML document irrespective of shape (a property sometimes called Schema independence); and
  • They support an XML-aware query language, typically XQuery, XPath and/or XSLT.

For more information about native XML databases, check out Ronald Bourret's article Going Native: Making the Case for XML Databases on xml.com.

With this understanding, I compiled the following list of specialized native XML database servers. I intend to investigate these different products as time permits. (This list doesn't include embeddable engines, only standalone servers.) I'm sure there are others. If you know of any, please let me know.

Product

Vendor

Comment

DB2

IBM

Documentum XML Store

EMC

Formerly X-Hive/DB by X-Hive

eXist

N/A

Open source

MarkLogic Server

MarkLogic

MonetDB/XQuery

MonetDB

Open source

Progress Sonic XML Server

Progress Sonic

Oracle Database

Oracle

SQL Server

Microsoft

Support in the 2005 edition is very poor IMO and should not be used carelessly

TEXTML Server

IXIA Software

TigerLogic XDMS

TigerLogic

Company was recently renamed from RainingData

XMS

Xpriori

XQuantum XML Database Server

Cognetic Systems, Inc.

XStreamDB

Bluestream

Of these, I've delved into MarkLogic the deepest so far. I've been in contact with these guys a number of times via phone and email, I've posted messages on their developer list, installed their community edition (which is limited to a measly 100 MB), read a bunch of their documentation, and watched some of their Web casts. We may have one of their consultants onsite in the near term as well to explain how we can utilize their product. In the coming weeks, I'll blog about more details about MarkLogic, so stay tunes.

North Plains' announcement earlier this month that they their DAM solution, TeleScope, is now being offered as a service in the cloud.  Some market analysts are saying that running DAM in the cloud isn't for everyone (yet?), but I've heard that 10% of the market is ready now (10% * 1/2B= 50M).  North Plains isn't alone is providing DAM SaaS.  Widen is doing it too (as their VP of marketing describes in this video).  To relieve concerns about uploading assets to the cloud, they have created an appliance that runs within a customer's facility, allowing their assets to remain local, but is maintained by Widen remotely.  Besides these two DAM vendors, ClearStory just announced that they are getting into the SaaS market too with ActiveMedia

Cloud computing is quickly coming to the forefront even in a market that is full of paranoid companies, requires specialized hardware, and is characterized by extremely large media files. It's only a matter of time!

I found a company called CMS Watch the other day that offers a number of intriguing comparative analyst reports on DAM systems, Enterprise Content Management (ECM) suites, Web Content Management (WCM) products, and Enterprise Search. They also have an interesting blog that I would recommend subscribing to called CMS Watch. In one entry in this blog, the author, Joseph Bachana, discusses innovations in the digital asset management marketplace that he expects to see this year and next. I've summarized his predictions here:

  • Many DAM vendors will devote their development efforts to OEMing third-party products into their applications.
  • More and more vendors will expose Web services APIs in response to a market that is hungry to integrate DAM systems into other systems (e.g., Web Content Management systems, Customer Relationship Management systems, Workflow Management Systems, Resource Management Systems, etc.).
  • DAM offerings will increasingly adopt Adobe's XMP format which will allow customers to transfer an asset's metadata with its content as it's passed around different (possibly external) systems.
  • More DAM providers will be integrating with Adobe's Version Cue which will allow their creative customers to take part in workflows, collaborate with their peers, version their work, and check in/out content.
  • Many companies will be working to integrate their DAM system with SharePoint for its content management capabilities, workflow engine functionality, and to further integrate into the customer's back office IT infrastructure.
  • DAM vendors will invest in building Rich Internet Applications (RIAs) using Adobe Air, Adobe Flex, and Microsoft Silverlight, so that they can provide UIs for their systems that are accessible from various operating systems and connected via the Web.
  • DAM repositories will be viewed more and more as the single, authoritative source for all intellectual property. For this reason, DAM providers will be expected to be increasingly open and interoperable, so that the content they stored is accessible to other systems that don't hold the authoritative copy of the media.
  • Digital Asset Management providers will partner with or purchase suppliers of XML databases and text mining engines to facilitate the creation of these authoritative repositories that can store creative and textual content.

Will we ever see an open source DAM system that can go head to head with EMC, IBM, Oracle and the rest? Tony Byrne says in his blog that he thinks it will not happen soon if ever. Joseph Bachana disagrees and says that it "may take the harnessed fervor of the open source community to bring all these threads and more together in the marketplace." I think the CEO of MarkLogic would agree with Bachana. In his blog, Dave Kellogg conjectures that, in the future, the "high-end of the [Enterprise Content Management (ECM)] market will split between Documentum and Alfresco," an open source ECM system. Though it is an ECM and not a DAM solution, if Kellogg is correct that about half of the big ECM customers will use Alfresco, many will also require it to support some DAM-related features as their needs won't stop at information processing.  Many will also need a secure repository that facilitates the management and distribution of media files, thus crossing over into the realm of the digital asset management systems.  As a result, this OSS project may be augmented to include DAM-like capabilities proving Bachana right and giving us a free (as in beer) DAM alternative.