The eDiscovery Paradigm Shift

Saturday, May 10, 2008

In Search of Integrated Conceptual eDiscovery Search Technology

Over the past six months I have been investigating cost-effective, integrated conceptual eDiscovery search technology delivered under a SaaS model. The goal of this investigation is to identify a way to extend the current capabilities of eDiscovery search through forward-thinking search technology that can be tightly integrated, on the same Microsoft-stack-based eDiscovery platform, with email archiving and other proactive data retention technology, Electronic Data Discovery (EDD) software and an Online Review Tool (ORT). My finding is that the current state of forward-thinking search technology is such that it requires the support of a separate, proprietary database and therefore does not lend itself to integration with EDD and ORT platforms that sit on standard SQL Server databases.

This current state of the market leaves the user with a choice of either moving large amounts of data, or at least large amounts of index files and associated data, between platforms, or investing in a completely proprietary eDiscovery solution.

In the process of this investigation, I have found several outstanding articles that touch on the various topics inherent in this discussion. The first article, found on Law.com, titled "In Search of Better E-Discovery Methods" by H. Christopher Boehning and Daniel J. Toal, does an excellent job of discussing some of the standard criteria for new search technology and whether or not it surpasses currently available keyword and Boolean search technology.

The second article is actually a Blog posting by Cher Devey, titled "Alternative Search Technologies - Too Good to be True," on her "eDiscovery Myth or Reality?" Blog. Ms. Devey discusses the concept and viability of human intervention in the search process. (Please note that the full text of Ms. Devey's Blog post can be found at the bottom of this posting.)

The full text of Mr. Boehning's and Mr. Toal's article is as follows:

As the burdens of e-discovery continue to mount, the search for a technological solution has only intensified. The holy grail here is a search methodology that will enable litigants to identify potentially relevant electronic documents reliably and efficiently.

In an effort to achieve these often competing objectives, litigants most commonly search repositories of electronic data for documents containing any number of defined search terms (keyword searches) or search terms appearing in a specified relation to one another (Boolean searches). These search technologies have been in use for years, both in litigation and elsewhere, and accordingly are well understood and widely accepted by courts and practitioners.

But keyword and Boolean searches are far from perfect solutions; they are blunt instruments. Such searches will identify only those electronic documents containing the precise terms specified. These methodologies therefore will not catch documents using words that are close, but not identical, to the specified search terms, such as abbreviations, synonyms, nicknames, initials and misspelled words.

On the other hand, using more search terms may reduce the risk that an electronic search will miss a relevant document, but only at the price of increasing -- often quite dramatically -- the number of irrelevant documents found in the search. This is a serious problem because counsel must manually review whatever documents the searches yield in order to sift out nonresponsive materials, make privilege determinations and designate confidential documents. Keyword and Boolean searches thus require a careful balance to be struck: Unduly restrictive searches may miss too many responsive documents while overbroad searches threaten stratospheric discovery costs.
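To make that balance concrete, here is a minimal Python sketch, using entirely made-up documents and terms, of how an OR-style keyword search widens as terms are added (a toy illustration, not any vendor's search engine):

    # Toy keyword (OR) search over hypothetical documents.
    documents = {
        1: "settlement agreement signed by the parties",
        2: "lunch agreement with the sandwich vendor",       # irrelevant hit on "agreement"
        3: "signed settlement terms attached for review",
        4: "quarterly sales figures",
    }

    def keyword_search(terms):
        """Return ids of documents containing ANY of the given terms."""
        return {doc_id for doc_id, text in documents.items()
                if any(term in text for term in terms)}

    print(keyword_search(["settlement"]))                # narrow: returns documents 1 and 3
    print(keyword_search(["settlement", "agreement"]))   # broader: also pulls in the irrelevant lunch email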

Against this backdrop, courts and litigants understandably have been intrigued by the claims of those promoting alternative search technologies, such as "concept searching." The vendors of such technologies suggest their search strategies are able to identify the overwhelming majority of responsive documents while virtually eliminating the need for lawyer involvement in the review process.

Such claims strike many in the legal community as too good to be true. And their skepticism is appropriately heightened because the precise methodologies that such vendors use often are shrouded in mystery, owing to their stated desire to safeguard their proprietary processes and techniques. But this also means their tantalizing claims cannot readily be subjected to independent scrutiny. The question thus posed -- and still largely unexplored -- is whether these alternative search technologies have anything to offer and, if so, how best to evaluate the competing technologies and the often sensational claims of their promoters.

To evaluate whether an alternative search technology might be helpfully employed in any particular case, it is first essential to understand how it works. Some of the principal alternative search technologies, which fall under the broad heading of "concept searching" methodologies, are as follows:

Clustering. Whereas keyword and Boolean searches mechanically apply certain logical rules to identify potentially relevant documents, clustering relies on statistical relationships, which results in documents containing similar words being clustered together in relevant categories. The clustering tool compares each document in a pool to "seed" documents, which have already been designated as relevant. The more words a document has in common with a seed document, the more likely it is to be about the same subject and therefore to be responsive. Moreover, clustering tools generally rank documents based on their statistical similarity to the seed documents.
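As a rough sketch of that idea, the following Python snippet ranks hypothetical documents by simple word overlap with a single "seed" document; real clustering tools use much richer statistical models, so treat this only as an illustration of the ranking concept:

    # Rank documents by word overlap with a "seed" document already judged relevant.
    seed = "golden retriever kennel boarding invoice"
    documents = {
        "doc1": "invoice for kennel boarding of two golden retrievers",
        "doc2": "quarterly sales report for the retail division",
        "doc3": "veterinarian bill for a retriever checkup",
    }

    def overlap(a, b):
        """Jaccard similarity of the word sets of two texts (0 = nothing shared)."""
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb)

    ranked = sorted(documents, key=lambda d: overlap(seed, documents[d]), reverse=True)
    for doc_id in ranked:
        print(doc_id, round(overlap(seed, documents[doc_id]), 2))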

Taxonomies and ontologies. A taxonomy tool is used to categorize documents containing words that are subsets of the topics relevant to a litigation. For example, if one of the topics of interest is "dogs," a taxonomy tool would capture documents that mention "golden retrievers," "poodles" and "chihuahuas." Ontology tools perform similar searches, but are not confined to identifying subset relationships. Building on the last example, an ontology tool would capture documents that mention "kennels" or "veterinarians."
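One very simplified way to picture the difference is as two kinds of term expansion, sketched below with invented term lists (real taxonomy and ontology tools are, of course, far more sophisticated):

    # Taxonomy: expand a topic into narrower terms (subset relationships only).
    taxonomy = {"dogs": ["golden retriever", "poodle", "chihuahua"]}

    # Ontology: expand a topic into related concepts as well, not just subsets.
    ontology = {"dogs": ["golden retriever", "poodle", "chihuahua", "kennel", "veterinarian"]}

    def expand(topic, relation):
        """Return the topic plus whatever terms the relation associates with it."""
        return [topic] + relation.get(topic, [])

    print(expand("dogs", taxonomy))   # narrower terms only
    print(expand("dogs", ontology))   # narrower terms plus related concepts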

Bayesian Classifiers. Bayesian search systems use probability theory to make educated inferences about the relevance of documents based on the system's prior experience in identifying relevant documents in the particular litigation. The search results then would be ranked based on the predicted likelihood of their relevance to the litigation.
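A toy version of that approach, assuming a tiny hand-labeled training set and word-level probabilities (a production system would handle feature extraction, smoothing and scale far more carefully), might look like this in Python:

    import math
    from collections import Counter

    # Hypothetical documents already reviewed and labeled in this matter.
    training = [
        ("settlement payment schedule attached", True),
        ("please review the settlement draft", True),
        ("team lunch on friday", False),
        ("parking garage closed friday", False),
    ]

    counts = {True: Counter(), False: Counter()}
    priors = Counter()
    for text, label in training:
        priors[label] += 1
        counts[label].update(text.lower().split())
    vocab = set(counts[True]) | set(counts[False])

    def relevance_score(text):
        """Naive Bayes log-odds that a document is relevant, given its words."""
        logp = {}
        for label in (True, False):
            total = sum(counts[label].values())
            logp[label] = math.log(priors[label] / sum(priors.values()))
            for word in text.lower().split():
                logp[label] += math.log((counts[label][word] + 1) / (total + len(vocab)))
        return logp[True] - logp[False]   # higher means more likely relevant

    for doc in ["settlement review meeting", "friday lunch plans"]:
        print(doc, round(relevance_score(doc), 2))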

HOW APPROACHES COMPARE
These alternative search technologies may sound promising in concept, and the claims about their efficiency and accuracy likely add to their allure, but the question remains whether these approaches outperform the standard search approach.

Keyword searching (including with the use of Boolean connectors), its acknowledged limitations notwithstanding, has secured such widespread acceptance for a reason. As an initial matter, the technology and search methodology is well understood and familiar to anyone who has used Westlaw, Lexis or similar search engines. It therefore can be easily discussed with both opposing counsel and judges. The simplicity of keyword searching also doubtlessly promotes negotiated resolution of discovery disputes because the parties have less reason to fear that ignorance about the technology will lead them to strike a bad bargain.

But the simplicity of keyword searching is also its principal weakness. Keyword searches capture only documents containing the precise terms designated, which virtually assures that such a search will miss relevant documents. And, on the other side of the equation, keyword searches will mechanically capture every document -- whether relevant or not -- containing any search term. This means keyword searches may be both substantially under- and over-inclusive. Concept searching systems, by contrast, are not dependent on a particular term appearing in a document and therefore may locate documents a Boolean search would not. But they may suffer from other infirmities.

So how does concept searching stack up? The best evidence to date comes from the Text REtrieval Conference, which in 2006 designed an independent research project to compare the efficacy of various search methods. In view of the prevalence of keyword and Boolean searches in litigation today, TREC was particularly interested in determining whether the alternative search methodologies outlined above were better than Boolean.

As its starting point, the TREC study used a test set of 7 million documents that had been made available to the public pursuant to a Master Settlement Agreement between tobacco companies and several state attorneys general. Attorneys assisting in the study then drafted five test complaints and 43 sample document requests (referred to as topics). The topic creator and a TREC coordinator then took on the roles of the requesting and responding counsel and negotiated over the form of a Boolean search to be run for each document request.

In addition to the Boolean searches, computer scientists from academia and other institutions attempted to locate responsive documents for each topic utilizing 31 different automated search methodologies, including concept searching. The results were striking. On average, across all the topics, the negotiated Boolean searches located 57 percent of the known relevant documents.

But none of the alternative search methodologies reliably performed any better. That is to say, for each topic, the Boolean search did about as well as the best alternative search methodology.

Interestingly, although the Boolean searches generally outperformed the alternative search protocols, the methods did not necessarily retrieve the same responsive documents. In fact, when all of the responsive documents found by the 31 alternative runs were combined, TREC discovered that the alternative search runs collectively had located, on average, an additional 32 percent of the responsive documents in each topic.

As a result, while the Boolean search generally equaled or outperformed any of the individual alternative search methods, those searches also captured at least some responsive documents that the Boolean search had missed.

COST ANALYSIS
This suggests that even if alternative search methodologies have not yet been shown to beat Boolean searches, their use to supplement Boolean searches might increase the number of responsive documents located. But at what cost? The potential benefits of locating any additional documents through use of an alternative search methodology would still have to be weighed against the cost, both in money and resources, required to locate them.

The relevant cost here is not just the price of using the alternative search technology, but also the number of false positives identified by the approach (i.e., documents retrieved by the search that turn out not to be responsive). Any automated search method -- whether a keyword or concept search -- will yield false positives, which counsel must review and filter out prior to production, and that review can be a costly process. It therefore is far from clear that use of an alternative search methodology in addition to a keyword or Boolean search will be appropriate in any particular case, a question the TREC study does not attempt to address.
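The arithmetic behind that trade-off is easy to sketch; the figures below are entirely hypothetical and are meant only to show how quickly false positives dominate the cost:

    # Hypothetical review-cost arithmetic: every document a search retrieves must
    # be reviewed by counsel, whether or not it turns out to be responsive.
    docs_retrieved = 100_000        # documents returned by a supplemental search (assumed)
    precision = 0.10                # fraction actually responsive (assumed)
    review_cost_per_doc = 1.50      # dollars of reviewer time per document (assumed)

    responsive = int(docs_retrieved * precision)
    total_review_cost = docs_retrieved * review_cost_per_doc

    print(f"responsive: {responsive:,}, false positives: {docs_retrieved - responsive:,}")
    print(f"review cost: ${total_review_cost:,.0f}, "
          f"or ${total_review_cost / responsive:,.2f} per responsive document found")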

For now, the available evidence suggests that keyword and Boolean searches remain the state-of-the-art and the most appropriate search technology for most cases. This seems particularly true when keyword or Boolean searches are used in an iterative manner, where litigants: (i) negotiate search terms and Boolean operators, (ii) run the agreed-upon searches, (iii) review the preliminary results, and (iv) adjust the searches through a series of meet-and-confers. This type of "virtuous cycle of iterative feedback" has been endorsed by courts and commentators alike.

The intuition of the legal community that an iterative approach to electronic discovery promotes reliability and efficiency finds empirical support in the TREC study. As part of its study, TREC employed an expert tobacco document searcher who used an "interactive" search methodology.

TREC found that the expert searcher located, on average, an additional 11 percent of the relevant documents beyond those that had been located by the initial Boolean searches, which means that an interactive Boolean approach ultimately located 68 percent of the relevant documents -- far better than any of the alternative search methodologies.
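Those percentages combine in a straightforward way; a quick back-of-the-envelope check using only the figures quoted above:

    # Average recall figures as reported in the TREC study discussion above.
    boolean_recall = 0.57          # negotiated Boolean searches alone
    added_by_alternatives = 0.32   # additional relevant docs found only by the 31 alternative runs combined
    added_by_interactive = 0.11    # additional relevant docs found by the expert interactive searcher

    print(f"Boolean plus all alternative runs combined: {boolean_recall + added_by_alternatives:.0%}")
    print(f"Boolean plus interactive expert searching:  {boolean_recall + added_by_interactive:.0%}")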

CONCLUSION
It may be that alternative search methodologies eventually will surpass the performance of keyword and Boolean searches, but that day does not yet seem to have arrived.

The independent research conducted to date suggests that, for the time being at least, nothing beats Boolean, particularly when used as part of an iterative process.

That does not necessarily mean that alternative search technologies are not worth considering, either independently or along with Boolean or keyword searches. But practitioners would be well advised to carefully scrutinize the marketing claims of the purveyors of such technologies and to factor in often substantial direct and indirect costs of such approaches.

H. Christopher Boehning and Daniel J. Toal are litigation partners at Paul, Weiss, Rifkind, Wharton & Garrison. Associate Jason D. Jones and Aaron Gardner, the firm's discovery process manager, assisted in the preparation of this article.

The Full Text of Ms. Devey's Blog Posting is as follows:

It seems that alternative search technologies (alternative to the familiar Keyword and Boolean searches) touted by Vendors are considered ‘too good to be true’. Check it out yourself in "In Search of Better E-Discovery Methods" by H. Christopher Boehning and Daniel J. Toal, New York Law Journal, April 23, 2008.

The above legal article also mentioned the Text Retrieval Conference (TREC) 2006 study, which was also examined by Will Uppington in the article "Better Search for E-Discovery," March 11th, 2008.

What I find interesting in Will Uppington’s article is the finding: ‘One of the best ways to get better search queries is to commit human resources to improving them, by putting a “human-in-the-loop” while performing searches’.

Reading between these two ‘search themed’ titles, one from the legal side and the other from a technical perspective, highlighted the contrasting findings and interpretations of the TREC 2006 study.

What else can we say/talk about the ‘human-in-the-loop’, the ‘virtuous cycle of iterative feedback’ & “interactive” search methodology?

Well, such phrases/concepts are not new. What is new is that the ‘human actions’ aspects are creeping (awareness?) into the ediscovery space. Other knowledge researchers outside the ediscovery domain have been busily coming up with phrases/concepts such as the ‘concept searching’ methodologies. Reality (or inertia adoption) testing of such newer technologies is clearly not well understood (too good to be true?) by the courts and practitioners.

On human actions and computer programs, a beautiful quote comes from my friend, Roger C: “While computer programs can write other computer programs, they can’t write the first program”.


Thursday, May 8, 2008

How Will the Courts React to the Demise of Bates Numbers?

I was very intrigued by a recent article by Tom O'Connor on the law.com legal technology site titled "Bates Stamps' Days May Be Numbered". Mr. O'Connor astutely points out that current eDiscovery platform vendors, in response to the changes to the Federal Rules of Civil Procedure in December 2006, now enable users to manage native Electronically Stored Information (ESI) throughout the entire case lifecycle without ever having to convert to an image file (i.e., TIFF or PDF) and add Bates numbers for tracking purposes. I would also suspect that within the next several years these same eDiscovery platforms will begin to enable the integration, management and open sharing of all responsive and relevant evidence (paper and ESI), not only between opposing parties but also with the courts, in a manner that fulfills the requirements of source location, chain of custody, audit/use logs, proof of authenticity, proof that the evidence hasn't been tampered with, and a standard digital numbering system that enables all parties to collaborate and communicate on common pieces of evidence without confusion. This "single source of truth" is probably something akin to the holy grail in concept, probably not that far outside the box from a technical standpoint, but more than likely a real stretch for the courts.
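As a rough sketch of what one entry in such a shared evidence record might need to carry, here is a hypothetical Python structure (the field names are invented for illustration and do not reflect any vendor's or court's actual format):

    import hashlib
    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class EvidenceRecord:
        """Hypothetical 'single source of truth' entry for one piece of ESI."""
        evidence_id: str                 # stable digital identifier shared by all parties
        source_location: str             # where the item was collected from
        custodian: str
        sha256: str                      # content hash showing the item has not been altered
        chain_of_custody: list = field(default_factory=list)   # who touched it, and when

        def log_access(self, actor, action):
            self.chain_of_custody.append((datetime.utcnow().isoformat(), actor, action))

    content = b"example native file bytes"
    record = EvidenceRecord(
        evidence_id="ESI-000001",
        source_location=r"\\fileserver\contracts\msa.docx",
        custodian="J. Smith",
        sha256=hashlib.sha256(content).hexdigest(),
    )
    record.log_access("collection vendor", "collected")
    record.log_access("outside counsel", "reviewed")
    print(record.evidence_id, record.sha256[:12], record.chain_of_custody)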

And this last point is the basis for my fascination with Mr. O'Connor's contention that "Bates Stamps' Days May Be Numbered". I have no doubt that the eDiscovery technologists are close, if not already there, in regards to retiring Bates numbers. I just don't see the market, and more importantly case law, moving fast enough for it to happen anytime soon.

As a point of reference, the full content of Tom O'Connor's article follows:

One of the most challenging problems facing litigation attorneys is how to work with the massive volume of digital documents produced during the discovery phase of a case.

For years, they have relied on a system of scanning and sequentially numbering individual document pages, extracting the text electronically and producing single-page TIFF files as the standard method. But that process is simply not effective when dealing with terabytes of data.

To address the sheer volume, many vendors are advocating a new way of working with electronic documents that can reduce costs as much as 65 percent by eliminating the need for text extraction and imaging in the processing phase.

Beyond immediate cost savings, this approach also provides cheaper native file production, reducing imaging costs for production sets and saving up to 90 percent of the time needed to process documents. How? By not using Bates numbers on every page.

It also may solve a second problem, because it addresses the preference (under recent federal and state rule changes) for using native files in productions, which cannot be Bates numbered.

Currently, to provide Bates numbering, many vendors generate TIFF images from native files and then Bates number those images. But this process complicates native file review and -- at anywhere from eight to 20 cents per TIFF -- adds considerable cost to the process.

Typically, during processing, data is culled, de-duplicated; metadata and text are extracted; and then a TIFF file is created. An unavoidable consequence is that the relationship of the pages to other pages, or attachments, is broken -- and then must be re-created for the review process.

Page-oriented programs handle this by using a load file to tie everything together from the key of a page number. But most new software uses a relational database that stores the data about a document in multiple tables. Loading single-page TIFFs into a relational database involves a substantial amount of additional and duplicative work in the data load process.
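To make that concrete, here is a bare-bones sketch of a document-based relational schema using an in-memory SQLite database; the table and column names are invented for illustration only:

    import sqlite3

    # Hypothetical document-centric schema: one row per document, with related
    # tables for custodians and parent/attachment relationships (no page table at all).
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE custodians (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY,
            custodian_id INTEGER REFERENCES custodians(id),
            parent_id INTEGER REFERENCES documents(id),  -- e.g. the email an attachment belongs to
            file_name TEXT,
            sha1 TEXT                                    -- hash of the native file
        );
    """)
    conn.execute("INSERT INTO custodians VALUES (1, 'J. Smith')")
    conn.execute("INSERT INTO documents VALUES (1, 1, NULL, 'status_update.msg', 'a1b2c3')")
    conn.execute("INSERT INTO documents VALUES (2, 1, 1, 'forecast.xlsx', 'd4e5f6')")

    # One-to-many in action: every document (email or attachment) for a custodian.
    for row in conn.execute("""
            SELECT d.file_name, d.parent_id FROM documents d
            JOIN custodians c ON c.id = d.custodian_id WHERE c.name = 'J. Smith'"""):
        print(row)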

A document-based data model, rather than a page-based approach, eliminates the text extraction and image creation steps from the processing stage and cuts the cost of that process in half. Documents become available in the review platform much faster -- as imaging often accounts for as much as 90 percent of the time to process. This enables early case assessment without any processing, by simply dragging and dropping a native file or a PST straight into the application -- which cannot be achieved with the page-based batch process.

Relational databases allow for one-to-many and many-to-many relationships and support advanced features and functions -- as well as compatibility with external engines for tasks such as de-duping and concept searching.

Applications that support these functions -- such as software from Equivio, Recommind and Vivisimo Inc. -- are all document-based and will not perform in the old page environment.

Programs that use the document model can eliminate batch transfer. Batch transfer increases data storage, due to the need for data replication in the transfer process, and is also prone to a high rate of human error. Eliminating the time that inventory (in this case, electronic data) sits stationary will reduce overall cost as well as production time.

Firms responding to a litigation hold letter can use a tool such as Recommind's Axcelerate to automate first pass review. David Baskin, Recommind's vice president of product management, says this process "improves review accuracy and consistency, which drastically expedites the review process."

"It helps attorneys gain insight into a document collection before review has even begun while insuring that all documents reviewed and the legacy workflow is maintained."
Using the document approach, files can be moved between the first-pass tool and stronger litigation support programs as needs arise, rather than in large batch transfers. This integration creates an easier, faster and more cost-effective e-discovery process.

Another example of this is eCapture from IPRO Tech Inc. It helps users load large sets of documents, files and materials rapidly and in a native workflow, eliminating the conversion of native files to TIFF images and the extraction of text files.

A modern litigation support program must be able to review native documents that are not just paper equivalents, and directly enable review of any file that is in common use in business today. The future belongs to these new technologies, where native files are processed without the need to convert to TIFF and are identified by their unique hash algorithm.
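Identifying a native file by its hash rather than by a stamped page number is straightforward to sketch; the snippet below uses SHA-1, though which algorithm a given platform actually uses is an assumption here (MD5 and SHA-256 are also common):

    import hashlib

    def native_file_id(path, algorithm="sha1", chunk_size=1024 * 1024):
        """Return the hex digest of a native file, read in chunks so large files are handled."""
        digest = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Example with a hypothetical path; the digest becomes the document's identifier.
    # print(native_file_id("productions/native/msa_final.docx"))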

Attorneys and clients who focus on a document-based system will save time and money and can conduct native file review. In today's world of vast quantities of electronic documents, the days of the Bates stamp are numbered.


Tuesday, May 6, 2008

Kazeon Press Release Focuses Attention on the Cost of a Latte and the Falling Cost of eDiscovery

Like lots of other eDiscovery professionals this morning, I had to do a double take when I read today's press release from Kazeon announcing their new partnership with Attenex, titled "Attenex and Kazeon Announce eDiscovery Alliance". The double take was caused by the fact that Kazeon announced that they can deliver industry-leading price/performance for in-house processing of ESI, in preparation for reactive and proactive eDiscovery matters, for as low as $4.30 per gigabyte. This is actually not new news, as they had previously made the same financial claim, in a slightly less alarming way, in a press release titled "ESG Lab Finds Kazeon’s Information Server Delivers Fast and Cost Effective Information Access", which stated that ESG Lab verified impressive price/performance, with a single Kazeon Information Server appliance able to index 2,500 documents per dollar at a cost as low as $4,300 per terabyte (approximately $4.19 per gigabyte).
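For anyone who wants to check the unit conversion behind those two figures, the arithmetic is simple (numbers exactly as quoted in the press materials):

    # Price/performance figures as quoted: $4,300 per terabyte, 2,500 documents per dollar.
    cost_per_tb = 4300.0
    print(f"per GB at 1 TB = 1,024 GB: ${cost_per_tb / 1024:.2f}")
    print(f"per GB at 1 TB = 1,000 GB: ${cost_per_tb / 1000:.2f}")

    docs_per_dollar = 2500
    print(f"implied cost per million documents: ${1_000_000 / docs_per_dollar:,.0f}")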

However, any way you crunch the numbers, position the cost or spin the offering, it is just flat-out alarming, bordering on unbelievable, for both users and technology vendors in the eDiscovery market. Bottom line: whether or not you believe that Kazeon is comparing true eDiscovery apples with the rest of the apples in the market, it doesn't matter, as this is definitely the first shot across the bow of the rest of the eDiscovery vendors. Prices are coming down, and the rest of the market is going to have to keep up. Unfortunately, many of them have very old technology that requires lots of manual manipulation and processing and therefore may not have the "legs" to stay in the race. I don't see any dramatic changes in 2008, as users will still be trying to figure out what they are or aren't getting for their investments from each of the vendors. However, once this is all common knowledge, 2009 may be the year of the changing of the vendor guard in eDiscovery.

All of that being said, the best comments on this press release came from Kurt Leafstrand of Clearwell on his eDiscovery 2.0 Blog. His posting, titled "eDiscovery Processing: You Get What You Pay For", does a really great job of questioning whether or not Kazeon can really support even the minimum requirements of the EDRM processing node for what basically amounts to less than the cost of my favorite Starbucks venti, breve, sugar-free hazelnut latte.

With all the talk of the cost of lattes, maybe the next big eDiscovery winner will be the vendor that announces a cross-marketing deal with Starbucks to include a free venti latte with every GB of data processed. Remember that you heard it first here on the eDiscovery Paradigm Shift Blog.

Because I believe that it is one of the more humorous eDiscovery posts that I have read in some time, I am including the entire contents of Kurt's Blog post below:

Anyone reading today’s announcement from Kazeon could be forgiven for doing a double-take: did someone misplace the decimal point? Kazeon claims that it can perform “processing of ESI in preparation for eDiscovery matters as low as $4.30 per Gigabyte.” Assuming that’s not simply a typo, it begs an obvious question: If Kazeon really can process information at a tiny fraction of what e-discovery service providers are charging, how come every e-discovery service provider isn’t going out of business? Why wouldn’t everyone take this incredibly good deal?

The answer (in press releases, as in politics) lies in definitions. Exactly what sort of processing would you be getting for your four dollars and change?

You’ll have to ask Kazeon to get the answer to that one, but give a venti latte to a bleary-eyed e-discovery service provider who’s just pulled an all-nighter preparing for a meet-and-confer, and they’ll tell you all about the nuances, complexities, and risks inherent in e-discovery processing that may be difficult for enterprise search/information lifecycle management vendors to grasp. Quite likely, they will refer you to EDRM’s processing node overview, which outlines the basic goals of robust processing:

  1. Capture and preserve the body of electronic documents;
  2. Associate document collections with particular users (custodians);
  3. Capture and preserve the metadata associated with the electronic files within the collections;
  4. Establish the parent-child relationship between the various source data files;
  5. Automate the identification and elimination of redundant, duplicate data within the given dataset;
  6. Provide a means to programmatically suppress material that is not relevant to the review based on criteria such as keywords, date ranges or other available metadata;
  7. Unprotect and reveal information within files; and
  8. Accomplish all of these goals in a manner that is both defensible with respect to clients’ legal obligations and appropriately cost-effective and expedient in the context of the matter.

And that’s just the high-level overview. After the caffeine from the latte starts to kick in, they’ll tell you it’s also absolutely critical to:

  1. Provide statistical count tie-outs that reconcile every incoming email, loose file, and attachment with the processed document set
  2. Automatically scan critical large container files (such as PSTs) for errors and problems prior to processing
  3. Automatically perform custodian mapping to track ownership of all documents
  4. Maintain detailed reports on every anomaly encountered during processing, down to the individual email, loose file, and attachment
  5. Automatically handle common metadata anomalies (with logging) so that the maximum number of documents are made available for review
  6. Provide robust and thorough handling for container files regardless of container format
  7. Support non-email content types such as contacts, calendar entries, tasks, and notes
  8. Robustly handle embedded objects
  9. Provide full visibility into exceptions encountered during processing, along with an integrated exception handling process to allow repaired/decrypted data to be easily added back into the document set

All that for under five bucks? That’s quite a deal! But remember, if you drive by your corner gas station tomorrow morning and they’re advertising regular unleaded for 20 cents a gallon: It may be cheap, but it’s probably not gas you’re getting.


Sunday, May 4, 2008

Update on the EDRM XML Standard

As an enterprise software technologist with over 25 years of experience struggling with integrating disparate data sources and proprietary applications and enabling new solutions to access legacy platforms, I have a deep appreciation for standards in the pursuit of interoperability.

Given all of this, I recently came across a tremendous Blog posting titled "Does the emperor have any clothes on? Thoughts on EDRM" by Rob Robinson on his Information Governance Blog that raises some very valuable questions in regard to the new XML standard set forth by the EDRM.

In response, I agree that George Socha and Tom Gelbmann, founders of the EDRM, have done a superb job in creating both the EDRM and the resulting XML standard. However, I am not sure that I agree that this new standard "has no clothes" and is nothing more than a marketing ploy. First of all, I spend most of my waking hours talking to litigators at law firms and in the legal departments of Fortune 1000 enterprises, and I still haven't had anyone ask about this standard. So, if it was designed as a marketing ploy, and I don't believe that it was, it isn't working.

As with any new industry standard, there is definitely room for improvement, maturation and evolution based on additional input and response from the market. And, without a doubt, Mr. Robinson's considerations are a good start on a list of potential improvements.

Following is the complete Blog posting:

One of the most prominent topics today in electronic discovery - from both a news and views standpoint - is the EDRM (Electronic Discovery Reference Model) and its XML2 (Extensible Markup Language) project. As a technology marketer by trade, I find that the Electronic Discovery Reference Model provides a great way in which to "break down" electronic discovery into components that can easily be described, compared, and considered. In my opinion, George Socha and Tom Gelbmann, founders of the EDRM, have done a superb job in creating a "lingua franca" for discussing electronic discovery. However, a question still exists in the minds of many about the true long term viability of the current EDRM approach to its XML standard. While the importance of the standard is championed by marketers across the electronic discovery landscape, does it really provide any advantage beyond the marketing hype for consumers of electronic-discovery-related legal technology?

Per the EDRM website, the goal of the EDRM XML2 project is to provide a standard, generally accepted XML schema to facilitate the movement of electronically stored information (ESI) from one step of the electronic discovery process to the next, from one software program to the next, and from one organization to the next. The ESI includes both underlying discovery materials (e.g., email messages and attachments, loose files, and databases) and information about those materials (e.g., the source of the underlying ESI, processing of that ESI, and production of that ESI). While I truly believe in the benefits of technology standards to enhance and ensure interoperability of products and services, I think that the current EDRM approach to XML interoperability is one that may leave a lot to be desired in the area of objective accountability. It is in this area of objective accountability that I might suggest we take a deeper look at the current EDRM XML approach and determine if the "emperor (EDRM XML standard) has any clothes on." This consideration of the "emperor's clothes" is in no way, shape, or form "people-centric" in focus, as I believe that EDRM leaders and participants have the industry's best interest at heart. However, the consideration is "approach-centric" in focus and hopefully presents some questions, oft-spoken in private yet never spoken in public, about the EDRM XML project.
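To picture what "moving ESI plus information about that ESI" between programs involves, here is a purely illustrative Python snippet that builds a small XML record; the element and attribute names are invented for this sketch and are not the actual EDRM XML schema:

    import xml.etree.ElementTree as ET

    # Invented element names for illustration only -- not the real EDRM XML schema.
    batch = ET.Element("ProductionBatch", attrib={"matter": "Acme v. Widget Co."})
    doc = ET.SubElement(batch, "Document", attrib={"id": "DOC-000123", "hash": "a1b2c3d4"})
    ET.SubElement(doc, "Custodian").text = "J. Smith"
    ET.SubElement(doc, "SourcePath").text = r"\\mail01\pst\jsmith.pst"
    ET.SubElement(doc, "NativeFile").text = "natives/DOC-000123.msg"

    history = ET.SubElement(doc, "ProcessingHistory")
    ET.SubElement(history, "Step", attrib={"tool": "Vendor A processor", "action": "deduplicated"})

    print(ET.tostring(batch, encoding="unicode"))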

Consideration #1: Is the EDRM XML standard coordinated with other industry standards bodies in the technology arena?
The desire for legal XML standardization is certainly a need recognized by legal professionals beyond the EDRM organization. In fact, one of the leading standardization bodies, OASIS* (Organization for the Advancement of Structured Information Standards), has a specific group (LegalXML) focused specifically on legal electronic exchange of information. Yes, that group may not currently have a specific technical group for electronic-discovery-related information exchange, but one question I might submit is: if there is a structure in place for the development and evaluation of standards, and if that group already has legal-focused technical committees, why would a group set out to develop a standard autonomously from that group? Is there a technical reason why one would not at least coordinate efforts with such a group? Or is there an accountability reason that one might not coordinate with such a group?

Additionally, if the focus is on true interoperability, one organization that has a great model for standardization is SNIA** (Storage Networking Industry Association). SNIA standards are primarily related to data, storage, and information management and address such challenges as interoperability, usability, and complexity. Considering that law firms, corporations, and governmental agencies have a high propensity to use equipment from SNIA member organizations, might it not make sense to coordinate with SNIA to where the EDRM XML standard might fit in the data, storage, and information management area? Is there a technical reason why one would not at least coordinate efforts with SNIA? Or, is there an accountability reason that one might not coordinate with SNIA?

Based on my current understanding of EDRM coordination activities, it appears that a majority of standards considerations have been more introspective (focused on EDRM participating vendors) than extrospective. With this introspection in mind, I might suggest the emperor (EDRM XML standard) is not as fully clothed as he may like others to believe.

Consideration #2: Does the EDRM XML standard represent the true needs of legal technology professionals in the field of eDiscovery?
No doubt the working organizations and members of the current EDRM XML2 project represent a great many of the thought leaders in the electronic discovery vendor arena. However, in developing a standard ultimately designed to help consumers of electronic discovery technology, I would ask how many of the top law firms, corporations, and governmental agencies have even reviewed the standard to ensure it meets their needs? Interoperability between participating vendors' legal products and services is great when one is championing the ease of use and integration of products/services with other products/services, but if the interoperability is based on the transfer of standard information between applications/devices, does it not make sense that the information be fully vetted with a representative body of the actual consumers before establishing a standard and beginning to announce vendor compliance with such a standard? Is there a technical reason why one would not at least seek to survey top law firms, corporations, and governmental agencies on what they believe the standard should contain? Or is there an accountability reason why one would not seek to survey top law firms, corporations, and governmental agencies on what they believe the standard should contain?

Based on my current understanding, it appears that the EDRM XML standard has not truly been vetted with potential end user consumers. With this lack of "vetting" in mind, I might suggest the emperor (EDRM XML standard) is not as fully clothed as he may like others to believe. (One argument that could be made is that end users are not interested in spending the time to understand the standard and pronounce their needs concerning the standard. While I agree this is a solid argument, it also begs the question of who is ultimately driving the standardization: client needs or vendor/group desire?)

Consideration #3: Is there true interoperability testing prior to certifying a product/service as EDRM XML compliant?
With respect to software, the term interoperability is used to describe the capability of different programs to exchange data via a common set of exchange formats, to read and write the same file formats, and to use the same protocols. From a legal technology perspective, is it wise to pronounce a product or service "interoperable" when in fact those services may have never been tested with actual "other vendor" products/services? Said in a different way, does interoperability assume that if Widget A works with Product A and Widget A works with Product B, that Product A and Product B work together? When one considers the extensive interoperability approach of organizations such as the SNIA Interoperability Committee and Microsoft (Windows Hardware Qualification Lab***), I might suggest the emperor (EDRM XML standard) is again not as fully clothed as he may like others to believe.

Consideration #4: Can the EDRM organization be truly objective in evaluating its work and work product?
When you consider that standards organizations such as OASIS and SNIA are non-profit organizations that have elected leaders, it is understandable why they are considered objective in presenting their work and work product. Does that mean that because EDRM is not a non-profit organization and does not have elected leadership that it is not objective? Certainly not - however, I might suggest to you that for industry-wide acceptance of standards and work from standards bodies, it is very important to ensure an organization is viewed as one that is structured for objectivity. Is the EDRM structured in a manner today to portray objectivity? While that is certainly a subjective question, I might suggest that currently EDRM does not appear to be fully objective based on both the fact it does not have elected leadership and the fact that the founders have dual relationships with many of the EDRM participants (as they are very well known and well respected consultants in the electronic discovery space). Can an organization like this be truly objective? Based on my view, I would say that perception is reality to a marketer and no matter how objective and noble the organization may be - it does need to come across as objective to be viewed as "an emperor with clothes". Again, as I mentioned earlier, this view is based on the current approach of the EDRM toward the XML standard, not the people involved.

Does the emperor have any clothes on?
With these four aforementioned considerations in mind, I might suggest that in its current form today, the EDRM XML2 project and its XML standard will not gain widespread acceptance beyond those organizations that participate purely for the marketing benefit of participating. I might also suggest that there may come a point when, based on acceptance without critique by analysts/media/users, the standard becomes a check box in Requests For Proposals and thus might hurt those organizations with excellent products/services that are not compliant with the EDRM XML standard.

Personally, I do believe in the value of an XML standard for the electronic exchange of information among electronic discovery vendors. However, in seeking this standard I would certainly recommend seeking it through the framework of discussion with those who have been down the standardization path before (OASIS/SNIA) and leverage as many of their resources as possible so as not to have to "reinvent" practices/processes. Also, I would recommend seeking to ensure the standard represents what the end user/consumer requirements are as stated by the end user/consumer. Asking vendors for their thoughts is important - but certainly not as important as asking the end user/consumer. Finally, I would also recommend that the standards are prepared, presented, and evaluated by an organization structured for objectivity. Objective accountability seems to be a common denominator for the success of standards and standards organizations. If past performance is an indicator of future performance, it would make sense for the EDRM to organize to create not only the perception of objectivity, but a structure that lends itself to objectivity.

Does the emperor (EDRM XML standard) have any clothes on? While there may be areas that I have overlooked and/or about which I am misinformed, if asked today whether the emperor had any clothes on, I would have to confess that I don't see any. What do you think?
