The eDiscovery Paradigm Shift: eDiscovery from Database Management Systems

Wednesday, November 3, 2010

eDiscovery from Database Management Systems

I have spent the majority of my career building software and services companies that develop and sell enterprise-class applications running on large SQL databases for the Fortune 2000. Therefore, when I read Jason Kruse’s article on the Law Technology News Blog titled “Database Discovery Is Dubious, but Unavoidable”, I had to smile and agree.

Jason’s point is that if the legal community thought harvesting electronic evidence from email systems was difficult, they haven’t seen anything yet: extracting information from databases (where most enterprise electronic information is stored) is going to prove much more challenging. It will also require much more cooperation between the legal department and the Information Technology (IT) department.

As an enterprise application development expert, I have spent many hours trying to figure out how to get data into databases. As an eDiscovery technologist, I am looking forward to the challenge of getting it back out again. The next few years are going to be fun.

The full text of Jason’s article is as follows:

Structuring data into databases has long been a solution to store complex data that can be retrieved and reported in variable ways. That data solution, however, has a legal problem in the e-discovery context.

It took years for many litigators and judges to become comfortable with discovery of e-mail and other electronic records. But as more forms of electronic records enter into discovery disputes, lawyers are back on unfamiliar ground. "There are still types of evidence that lawyers prefer to ignore and hope will go away, the way e-mail discovery was ten years ago," says Rob Brunner, who leads the Financial and Enterprise Data Analytics practice at FTI Consulting. "And I hate to say it, but e-mail was an easy problem compared to what's next."

Brunner is speaking specifically of structured data, especially electronic evidence from databases. However, structured data includes a broad swath of content types, including common sources most might consider a document, such as e-mail. Any time a software system, whether a large, enterprise database or an e-mail server, pulls information from a number of different files and merges them into a single view, it is functioning like a database. Unstructured data is commonly defined as data that is not stored in a database or in a semantically tagged document.

Structured data is discoverable in litigation, but lawyers are finding that there is little guidance for handling it. "This is an issue that's only going to demand more attention, because we're swimming in this kind of information," says Brunner. "In trying to explain why discovery of this information is important, I like to point out that 500 out of 500 Fortune 500 companies have structured data. It's something you won't be able to avoid in many cases."

With structured data, information is in the form of separate files that are linked so that information can be pulled from different sources, analyzed, and compiled. For example, a typical corporate human resources system contains information about employees that can be viewed as individual employee records or compiled into statistics about the entire work force.
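A minimal sketch of the HR example above, using Python's built-in sqlite3 module. The table names, columns, and sample rows are hypothetical and exist only for illustration; a real HR system would be far larger and vendor-specific. It shows the same linked tables answering two different questions: one employee's record, and statistics about the whole work force.

```python
import sqlite3

def build_hr_db():
    """Create a tiny in-memory HR database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE employees (
            id INTEGER PRIMARY KEY,
            name TEXT,
            dept_id INTEGER REFERENCES departments(id),
            salary INTEGER
        );
        INSERT INTO departments VALUES (1, 'Legal'), (2, 'IT');
        INSERT INTO employees VALUES
            (1, 'Ada', 1, 120000),
            (2, 'Ben', 2, 90000),
            (3, 'Cy', 2, 110000);
    """)
    return conn

def employee_record(conn, emp_id):
    """View one employee by joining the two linked tables."""
    return conn.execute(
        "SELECT e.name, d.name, e.salary FROM employees e "
        "JOIN departments d ON d.id = e.dept_id WHERE e.id = ?",
        (emp_id,),
    ).fetchone()

def workforce_stats(conn):
    """Compile the same rows into per-department headcount and average salary."""
    return conn.execute(
        "SELECT d.name, COUNT(*), AVG(e.salary) FROM employees e "
        "JOIN departments d ON d.id = e.dept_id "
        "GROUP BY d.name ORDER BY d.name"
    ).fetchall()

conn = build_hr_db()
print(employee_record(conn, 2))
print(workforce_stats(conn))
```

Neither result exists as a stored "document": each is assembled at query time, which is part of why producing "the record" from such a system is harder than producing an e-mail.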

Like the mass of ice below an iceberg's waterline, structured data often makes up the bulk of corporate data but is rarely seen. According to the Data Warehousing Institute, a technology research firm, approximately 47 percent of corporate data is structured in nature, compared to 31 percent that is unstructured. (The remaining 22 percent was described as semi-structured data.)

IGNORED NO LONGER
The Sedona Conference, a nonprofit legal think tank largely concerned with preservation and production of electronically stored information in civil litigation, has announced that it will publish a commentary on the discovery of information from databases after December 2010.

This will be one of the first such efforts to provide guidance for the discovery of structured data. "The commentary focuses on what is the basis of relevance in a database," says Conrad Jacoby, the founder of efficientEDD.com and editor of the forthcoming document. "Structured data is so difficult to define that we have to start by building the most basic groundwork for discovery of this information."

Unfortunately, the drafting committee found the issue of discovery of structured data so problematic that it had to scale back its ambitions. Initially, the organizers had hoped to address structured data in many forms, including the emerging problem of structured data that is accessed over the web. But in the end, the commentary addresses only database evidence and ignores all other structured sources. "This is a complicated conversation to have and sometimes the sides talked past one another," says Jacoby. "It's a matter of vocabulary. You can have really excellent lawyers and technical people, but their vocabulary and logic are not the same. The same words can have different meanings and you can go around and around and around."

For example, Jacoby says defining a word as simple as "search" created a headache for his group. In many cases, a database only logs the first words of the text field, meaning a lot of data is not easily retrievable. "You might look at a database and assume that a database query would search all records in the database," he says. "But it turns out that some information is not indexed or searchable. So then what are you searching? Is it even possible to get all relevant information out of a database cost-effectively?"
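The search problem Jacoby describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a model of any particular database product: it assumes a hypothetical index built over only the first few words of each text field, and the records and the `max_words` cutoff are invented for the example.

```python
def build_prefix_index(records, max_words=5):
    """Index only the first `max_words` words of each text field (assumed behavior)."""
    index = {}
    for rec_id, text in records.items():
        for word in text.lower().split()[:max_words]:
            index.setdefault(word, set()).add(rec_id)
    return index

def indexed_search(index, term):
    """Fast lookup that sees only what was indexed."""
    return sorted(index.get(term.lower(), set()))

def full_scan(records, term):
    """Slow lookup that reads every word of every record."""
    term = term.lower()
    return sorted(
        rec_id for rec_id, text in records.items()
        if term in text.lower().split()
    )

# Hypothetical complaint records.
records = {
    1: "Customer reported rash after second dose of the product",
    2: "Rash and swelling reported by customer",
}
index = build_prefix_index(records)

print(indexed_search(index, "dose"))  # index never saw the sixth word of record 1
print(full_scan(records, "dose"))     # full scan finds it
```

Both functions are a "search" in everyday language, yet they return different answers for the same term, which is the kind of vocabulary gap the drafting group ran into.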

Structured data is often important for litigation, especially for establishing damages and issues of liability. Unfortunately, it is often ephemeral and endlessly changing. Jacoby points out that automated transaction logs for many businesses are continually deleted and overwritten from point of sale systems. A cash register often keeps a record until cash out; the record is then uploaded to a regional database, then a national one, and is often overwritten when a credit card transaction clears. "Even finding a record is hard," he says. "A record could be in multiple places or none of the expected places."

A CREDIBILITY PROBLEM
The Sedona Conference and many court systems still struggle with very basic questions, such as how to identify discoverable structured data for litigation. Databases are different in almost every organization. Even the common systems are typically customized for each customer. But even more problematic, many databases are purpose-built and understood in depth by only a few people. "The nastiest issues arise with proprietary systems," says Craig Carpenter, vice president of marketing with Recommind. "You tend to find them in large multinationals and the odds are that only 15 people on the planet know how to access some of these systems."

Discovery of databases can become expensive, but for different reasons than the discovery of e-mail and other records. In e-mail, much of the cost is in human review to protect privilege when producing a collection of records. With databases, cost overruns are more likely to arise when you are trying to get information out of a system. "In many cases, pulling a single record is impossible without preserving the larger data set," says Jacoby. "That's when people say, 'fine, then give me the whole database,' and the producing party resists, and now you have a fight on your hands."

Though database files are discoverable in electronic form, courts have been reluctant to grant plaintiffs broad access to them for litigation. Courts struggle with information that is not contained in discrete documents, since a database lacks the static artifact traditionally considered to be a document. There is little case law on authenticating structured data, but there are procedures that can be used to try to verify that information pulled from such sources is complete.

For example, experts recommend that the most basic step to take in database discovery is to review the regular checks a database makes of the records it produces, which confirm that the results of a database search query match the information produced. Most databases are designed with complex reporting and data mining tools, and lawyers can take advantage of these functions to obtain detailed reports of information being produced. "There is no industry checklist, but you can do some verification to make sure data is at least not corrupted," says Brunner.
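One simple form such verification could take is comparing a row count and a checksum of the query results against the same figures computed over the produced export. The sketch below is one possible approach, not an industry standard or a method the article's experts describe; the `fingerprint` helper and the sample rows are hypothetical.

```python
import hashlib

def fingerprint(rows):
    """Return (row_count, sha256_hex) for a collection of rows.

    Rows are sorted by their text representation first, so the result
    does not depend on the order in which the rows were exported.
    """
    digest = hashlib.sha256()
    count = 0
    for row in sorted(rows, key=repr):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()

# Hypothetical query results from the source system.
query_rows = [
    (1001, "2010-10-01", 250.00),
    (1002, "2010-10-02", 75.50),
    (1003, "2010-10-02", 310.25),
]

# The same rows as delivered in a production, in a different order.
produced_rows = list(reversed(query_rows))

# A production that silently lost one row.
incomplete_rows = query_rows[:2]

print(fingerprint(query_rows) == fingerprint(produced_rows))    # same content
print(fingerprint(query_rows) == fingerprint(incomplete_rows))  # missing a row
```

A matching fingerprint shows only that the production matches what the query returned. It says nothing about whether the query itself captured everything relevant, or whether the underlying records were accurate.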

Because information stored in a database is constantly changing, both the producing and requesting parties can manipulate data and present it in any light they choose. "Just because it comes out of a database doesn't mean it is accurate," says Jacoby. "I think courts make that mistake, and it's important to make sure that a judge understands a database record can be wrong."

Database evidence is often incomplete and misleading, and validating the integrity of data does not validate its meaning. Unlike written records, which can be read and interpreted based on their literal meanings, database records are often stripped of context when produced for litigation. "A pharmaceutical company may have 10,000 adverse records in a database for a particular drug, but how many of those are legitimate complaints?" Jacoby says. "You can slice and dice data all kinds of ways that may be statistically valid but misleading."
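Jacoby's point about slicing can be made concrete with a toy example. The status labels and rows below are entirely hypothetical and are not drawn from any real adverse-event system; the sketch only shows how one table supports very different headline numbers depending on which filter is applied.

```python
from collections import Counter

# Hypothetical adverse-event rows: (report_id, review_status).
REPORTS = [
    (1, "confirmed"),
    (2, "duplicate"),
    (3, "unrelated"),
    (4, "confirmed"),
    (5, "unreviewed"),
]

def headline_counts(reports):
    """Summarize the same rows three ways: all, confirmed only, and by status."""
    by_status = Counter(status for _, status in reports)
    return {
        "all_records": len(reports),
        "confirmed_only": by_status["confirmed"],
        "by_status": dict(by_status),
    }

print(headline_counts(REPORTS))
```

Each number is an accurate count of something in the table. Which one reaches the court, and with what explanation, determines whether the picture is fair.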

To head off this problem, lawyers need to be conversant in the language of structured data and be able to explain the issues to the court so that records are presented accurately. Experts say that the problems of database discovery are so complicated and technical that sometimes the only way to communicate the issues is to find simple analogies to technology that laymen understand. "I testified a year and a half ago in a $250 million suit, and I struggled to explain to the judge how things work," says Brunner. "I described the system as a big calculator, as in 'you put the data into the computer and add it up to create a record.' It was an oversimplification, but it worked."

Unfortunately, e-discovery vendors have been slow to respond to this issue. Brunner specializes in this kind of discovery, but he says the industry has yet to provide a reliable, reusable road map for structured data in e-discovery as it has for other data types. "Services have developed to address the low-hanging fruit like e-mail and other file types that are relatively easy to build a solution for," says Brunner. "We're just beginning to tackle the question of how you create a repeatable process for structured data."