
Saturday, May 10, 2008

In Search of Integrated Conceptual eDiscovery Search Technology

Over the past six months I have been investigating cost-effective, integrated conceptual eDiscovery search technology delivered under a SaaS model. The goal is to find a way to extend the current capabilities of eDiscovery search with a forward-thinking search technology that can be tightly integrated, on the same Microsoft-stack eDiscovery platform, with email archiving and other proactive data retention technology, Electronic Data Discovery (EDD) software and an Online Review Tool (ORT). My finding is that forward-thinking search technology currently requires a separate, proprietary database and therefore does not lend itself to integration with EDD and ORT platforms that sit on standard SQL Server solutions.

This state of the market leaves the user with a choice: either move large amounts of data, or at least large amounts of index files and associated data, between platforms, or invest in a completely proprietary eDiscovery solution.

In the course of this investigation, I found several outstanding articles that touch on the topics central to this discussion. The first, published on Law.com and titled "In Search of Better E-Discovery Methods" by H. Christopher Boehning and Daniel J. Toal, does an excellent job of discussing some of the standard criteria for evaluating new search technology and whether it surpasses currently available keyword and Boolean search technology.

The second is actually a blog posting by Cher Devey, titled "Alternative Search Technologies - Too Good to be True", on her "eDiscovery Myth or Reality?" blog. Ms. Devey discusses the concept and viability of human intervention in the search process. (Please note that the full text of Ms. Devey's blog post can be found at the bottom of this posting.)

The full text of Mr. Boehning's and Mr. Toal's article is as follows:

As the burdens of e-discovery continue to mount, the search for a technological solution has only intensified. The holy grail here is a search methodology that will enable litigants to identify potentially relevant electronic documents reliably and efficiently.

In an effort to achieve these often competing objectives, litigants most commonly search repositories of electronic data for documents containing any number of defined search terms (keyword searches) or search terms appearing in a specified relation to one another (Boolean searches). These search technologies have been in use for years, both in litigation and elsewhere, and accordingly are well understood and widely accepted by courts and practitioners.

But keyword and Boolean searches are far from perfect solutions; they are blunt instruments. Such searches will identify only those electronic documents containing the precise terms specified. These methodologies therefore will not catch documents using words that are close, but not identical, to the specified search terms, such as abbreviations, synonyms, nicknames, initials and misspelled words.
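To make the limitation concrete, here is a minimal Python sketch of exact-term matching. It is an illustration only, not any vendor's or review platform's implementation; the function names (`tokenize`, `keyword_search`, `boolean_and_search`) and the three sample documents are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(docs, terms):
    """Return ids of documents containing ANY of the terms (keyword search)."""
    wanted = {t.lower() for t in terms}
    return [doc_id for doc_id, text in docs.items() if tokenize(text) & wanted]

def boolean_and_search(docs, terms):
    """Return ids of documents containing ALL of the terms (Boolean AND)."""
    wanted = {t.lower() for t in terms}
    return [doc_id for doc_id, text in docs.items() if wanted <= tokenize(text)]

# Hypothetical documents, invented for illustration only.
docs = {
    "d1": "The contract was signed by Robert on Friday.",
    "d2": "Bob signed the contrcat late.",        # nickname and misspelling
    "d3": "The agreement was executed by R.S.",   # synonym and initials
}

hits_any = keyword_search(docs, ["contract", "Robert"])
hits_all = boolean_and_search(docs, ["contract", "signed"])
print(hits_any)  # only d1 matches; d2 and d3 are missed
print(hits_all)  # only d1 matches
```

All three sample documents concern the same event, yet only the one using the exact terms is retrieved. The nickname, the misspelling, the synonym and the initials all slip through.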

On the other hand, using more search terms may reduce the risk that an electronic search will miss a relevant document, but only at the price of increasing -- often quite dramatically -- the number of irrelevant documents found in the search. This is a serious problem because counsel must manually review whatever documents the searches yield in order to sift out nonresponsive materials, make privilege determinations and designate confidential documents. Keyword and Boolean searches thus require a careful balance to be struck: Unduly restrictive searches may miss too many responsive documents while overbroad searches threaten stratospheric discovery costs.

Against this backdrop, courts and litigants understandably have been intrigued by the claims of those promoting alternative search technologies, such as "concept searching." The vendors of such technologies suggest their search strategies are able to identify the overwhelming majority of responsive documents while virtually eliminating the need for lawyer involvement in the review process.

Such claims strike many in the legal community as too good to be true. And their skepticism is appropriately heightened because the precise methodologies that such vendors use often are shrouded in mystery, owing to their stated desire to safeguard their proprietary processes and techniques. But this also means their tantalizing claims cannot readily be subjected to independent scrutiny. The question thus posed -- and still largely unexplored -- is whether these alternative search technologies have anything to offer and, if so, how best to evaluate the competing technologies and the often sensational claims of their promoters.

To evaluate whether an alternative search technology might be helpfully employed in any particular case, it is first essential to understand how it works. Some of the principal alternative search technologies, which fall under the broad heading of "concept searching" methodologies, are as follows:

Clustering. Whereas keyword and Boolean searches mechanically apply certain logical rules to identify potentially relevant documents, clustering relies on statistical relationships, which results in documents containing similar words being clustered together in relevant categories. The clustering tool compares each document in a pool to "seed" documents, which have already been designated as relevant. The more words a document has in common with a seed document, the more likely it is to be about the same subject and therefore to be responsive. Moreover, clustering tools generally rank documents based on their statistical similarity to the seed documents.
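The seed-document idea described above can be sketched in a few lines of Python. This is a simplified assumption on my part: it scores each document by cosine similarity of raw word counts against the closest seed and ranks the results. Commercial clustering tools keep their actual methods proprietary, so the function names and sample texts here are hypothetical.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Count lowercase alphanumeric word tokens in a string."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_seed_similarity(docs, seeds):
    """Rank documents by their best similarity to any seed, highest first."""
    seed_vectors = [term_counts(s) for s in seeds]
    scored = []
    for doc_id, text in docs.items():
        vec = term_counts(text)
        score = max(cosine_similarity(vec, s) for s in seed_vectors)
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical seed (already designated relevant) and document pool.
seeds = ["nicotine levels in cigarette filters were tested"]
docs = {
    "a": "lab tested nicotine levels in new cigarette filters",
    "b": "quarterly marketing budget for the sales team",
    "c": "filters were replaced in the office air conditioner",
}

ranking = rank_by_seed_similarity(docs, seeds)
for doc_id, score in ranking:
    print(doc_id, round(score, 2))
```

Note that document "c" earns a middling score purely because it shares common words with the seed, which hints at why statistical similarity alone can produce false positives.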

Taxonomies and ontologies. A taxonomy tool is used to categorize documents containing words that are subsets of the topics relevant to a litigation. For example, if one of the topics of interest is "dogs," a taxonomy tool would capture documents that mention "golden retrievers," "poodles" and "chihuahuas." Ontology tools perform similar searches, but are not confined to identifying subset relationships. Building on the last example, an ontology tool would capture documents that mention "kennels" or "veterinarians."
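The "dogs" example can be expressed as a small Python sketch of query expansion. The `TAXONOMY` and `ONTOLOGY` tables, the function names and the sample documents are hypothetical, and the simple substring matching is an assumption chosen for brevity rather than how any particular tool works.

```python
# Hypothetical taxonomy: topic -> narrower terms (subset relationships).
TAXONOMY = {"dogs": ["golden retrievers", "poodles", "chihuahuas"]}

# Hypothetical ontology: topic -> related terms that are not subsets.
ONTOLOGY = {"dogs": ["kennels", "veterinarians"]}

def expand_topic(topic, use_ontology=False):
    """Return the topic plus its taxonomy terms, and optionally its ontology terms."""
    terms = [topic] + TAXONOMY.get(topic, [])
    if use_ontology:
        terms += ONTOLOGY.get(topic, [])
    return terms

def find_documents(docs, terms):
    """Return ids of documents whose text contains any term (substring match)."""
    return [doc_id for doc_id, text in docs.items()
            if any(term in text.lower() for term in terms)]

# Hypothetical documents, invented for illustration only.
docs = {
    "m1": "Our poodles won the regional show.",
    "m2": "Invoice from the veterinarians for last month.",
    "m3": "Meeting notes about the cafeteria menu.",
}

taxonomy_hits = find_documents(docs, expand_topic("dogs"))
ontology_hits = find_documents(docs, expand_topic("dogs", use_ontology=True))
print(taxonomy_hits)
print(ontology_hits)
```

Neither matching document contains the word "dogs". The taxonomy reaches the first through a subset term, and the ontology reaches the second through a related term.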

Bayesian Classifiers. Bayesian search systems use probability theory to make educated inferences about the relevance of documents based on the system's prior experience in identifying relevant documents in the particular litigation. The search results then would be ranked based on the predicted likelihood of their relevance to the litigation.
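As a rough illustration of the idea, the sketch below uses a naive Bayes classifier with add-one smoothing, which is one common Bayesian technique and an assumption on my part; the article does not say which models the vendors actually use. The class name `NaiveBayesRelevance` and the training texts are hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Split a string into lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayesRelevance:
    """Naive Bayes relevance scorer trained on already-reviewed documents."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, relevant):
        """Record one reviewed document and its relevance decision."""
        self.word_counts[relevant].update(tokens(text))
        self.doc_counts[relevant] += 1

    def prob_relevant(self, text):
        """Estimate P(relevant | text) from the prior review decisions."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (True, False):
            # Smoothed prior from the share of documents with this label.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokens(text):
                if word not in vocab:
                    continue  # words never seen in training carry no evidence
                count = self.word_counts[label][word]
                log_p += math.log((count + 1) / (total_words + len(vocab)))
            log_scores[label] = log_p
        # Normalize the two log scores into a probability.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights[True] / (weights[True] + weights[False])

# Hypothetical prior review decisions in the litigation.
model = NaiveBayesRelevance()
model.train("nicotine additive research memo", relevant=True)
model.train("research on nicotine addiction", relevant=True)
model.train("holiday party schedule", relevant=False)
model.train("parking garage schedule update", relevant=False)

unreviewed = {
    "u1": "memo about nicotine research",
    "u2": "party schedule",
}
ranked = sorted(unreviewed, key=lambda d: model.prob_relevant(unreviewed[d]),
                reverse=True)
print(ranked)
```

The ranking reflects the "prior experience" described above: each new review decision passed to `train` shifts the predicted likelihoods for the remaining documents.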

HOW APPROACHES COMPARE
These alternative search technologies may sound promising in concept, and the claims about their efficiency and accuracy likely add to their allure, but the question remains whether these approaches outperform the standard search approach.

Keyword searching (including with the use of Boolean connectors), its acknowledged limitations notwithstanding, has secured such widespread acceptance for a reason. As an initial matter, the technology and search methodology is well understood and familiar to anyone who has used Westlaw, Lexis or similar search engines. It therefore can be easily discussed with both opposing counsel and judges. The simplicity of keyword searching also doubtlessly promotes negotiated resolution of discovery disputes because the parties have less reason to fear that ignorance about the technology will lead them to strike a bad bargain.

But the simplicity of keyword searching is also its principal weakness. Keyword searches capture only documents containing the precise terms designated, which virtually assures that such a search will miss relevant documents. And, on the other side of the equation, keyword searches will mechanically capture every document -- whether relevant or not -- containing any search term. This means keyword searches may be both substantially under- and over-inclusive. Concept searching systems, by contrast, are not dependent on a particular term appearing in a document and therefore may locate documents a Boolean search would not. But they may suffer from other infirmities.

So how does concept searching stack up? The best evidence to date comes from the Text REtrieval Conference, which in 2006 designed an independent research project to compare the efficacy of various search methods. In view of the prevalence of keyword and Boolean searches in litigation today, TREC was particularly interested in determining whether the alternative search methodologies outlined above were better than Boolean.

As its starting point, the TREC study used a test set of 7 million documents that had been made available to the public pursuant to a Master Settlement Agreement between tobacco companies and several state attorneys general. Attorneys assisting in the study then drafted five test complaints and 43 sample document requests (referred to as topics). The topic creator and a TREC coordinator then took on the roles of the requesting and responding counsel and negotiated over the form of a Boolean search to be run for each document request.

In addition to the Boolean searches, computer scientists from academia and other institutions attempted to locate responsive documents for each topic utilizing 31 different automated search methodologies, including concept searching. The results were striking. On average, across all the topics, the negotiated Boolean searches located 57 percent of the known relevant documents.

But none of the alternative search methodologies reliably performed any better. That is to say, for each topic, the Boolean search did about as well as the best alternative search methodology.

Interestingly, although the Boolean searches generally outperformed the alternative search protocols, the methods did not necessarily retrieve the same responsive documents. In fact, when all of the responsive documents found by the 31 alternative runs were combined, TREC discovered that the alternative search runs collectively had located, on average, an additional 32 percent of the responsive documents in each topic.

As a result, while the Boolean search generally equaled or outperformed any of the individual alternative search methods, those searches also captured at least some responsive documents that the Boolean search had missed.
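The point that weaker searches can still add value when combined is easy to see with sets. In the Python sketch below, the document ids and hit lists are invented for illustration and are not TREC's data; `recall` is simply the share of known relevant documents that a search retrieved.

```python
def recall(retrieved, relevant):
    """Fraction of the known relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

# Hypothetical data: documents 1-10 are relevant, ids above 100 are not.
relevant = set(range(1, 11))
boolean_hits = {1, 2, 3, 4, 5, 6, 101, 102}
alt_a_hits = {1, 2, 7, 103}
alt_b_hits = {3, 8, 9, 104}

combined_alt = alt_a_hits | alt_b_hits
# Relevant documents found only by the alternative runs.
missed_by_boolean = (combined_alt - boolean_hits) & relevant

print(recall(boolean_hits, relevant))                 # Boolean alone
print(recall(alt_a_hits, relevant))                   # each alternative alone
print(recall(alt_b_hits, relevant))
print(recall(boolean_hits | combined_alt, relevant))  # everything combined
print(sorted(missed_by_boolean))
```

In this toy data each alternative run finds fewer relevant documents than the Boolean search, yet together they surface relevant documents the Boolean search never touched.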

COST ANALYSIS
This suggests that even if alternative search methodologies have not yet been shown to beat Boolean searches, their use to supplement Boolean searches might increase the number of responsive documents located. But at what cost? The potential benefits of locating any additional documents through use of an alternative search methodology would still have to be weighed against the cost, both in money and resources, required to locate them.

The relevant cost here is not just the price of using the alternative search technology, but also the number of false positives identified by the approach (i.e., documents that are retrieved by the search but turn out not to be responsive). Any automated search method -- whether a keyword or concept search -- will yield false positives, which counsel must review and filter out prior to production, a potentially costly process. It therefore is far from clear that use of an alternative search methodology in addition to a keyword or Boolean search will be appropriate in any particular case, a question the TREC study does not attempt to address.
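One way to frame that trade-off is to look only at the documents a supplemental search adds on top of the Boolean results. The Python sketch below is a back-of-the-envelope illustration: the function name `marginal_review_cost`, the hit lists and the per-document review cost of 2.0 are all hypothetical assumptions, not figures from the article or the TREC study.

```python
def marginal_review_cost(base_hits, extra_hits, relevant, cost_per_doc):
    """Summarize what a supplemental search adds beyond the base search."""
    new_docs = extra_hits - base_hits        # documents not already under review
    new_relevant = new_docs & relevant       # additional responsive documents
    false_positives = new_docs - relevant    # reviewed, then filtered out
    total_cost = len(new_docs) * cost_per_doc
    cost_per_find = (total_cost / len(new_relevant)
                     if new_relevant else float("inf"))
    return {
        "new_docs": len(new_docs),
        "new_relevant": len(new_relevant),
        "false_positives": len(false_positives),
        "total_cost": total_cost,
        "cost_per_new_relevant_doc": cost_per_find,
    }

# Hypothetical data: documents 1-10 are relevant, ids above 100 are not.
relevant = set(range(1, 11))
boolean_hits = {1, 2, 3, 4, 5, 6, 101, 102}
concept_hits = {5, 6, 7, 8, 103, 104, 105, 106, 107, 108}

summary = marginal_review_cost(boolean_hits, concept_hits, relevant,
                               cost_per_doc=2.0)
print(summary)
```

Here the supplemental search adds eight documents to the review pile to find two more responsive ones. Whether that price is worth paying depends on the case, which is exactly the judgment the article says remains open.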

For now, the available evidence suggests that keyword and Boolean searches remain the state-of-the-art and the most appropriate search technology for most cases. This seems particularly true when keyword or Boolean searches are used in an iterative manner, where litigants: (i) negotiate search terms and Boolean operators, (ii) run the agreed-upon searches, (iii) review the preliminary results, and (iv) adjust the searches through a series of meet-and-confers. This type of "virtuous cycle of iterative feedback" has been endorsed by courts and commentators alike.

The intuition of the legal community that an iterative approach to electronic discovery promotes reliability and efficiency finds empirical support in the TREC study. As part of its study, TREC employed an expert tobacco document searcher who used an "interactive" search methodology.

TREC found that the expert searcher located, on average, an additional 11 percent of the relevant documents beyond those that had been located by the initial Boolean searches, which means that an interactive Boolean approach ultimately located 68 percent of the relevant documents -- far better than any of the alternative search methodologies.

CONCLUSION
It may be that alternative search methodologies eventually will surpass the performance of keyword and Boolean searches, but that day does not yet seem to have arrived.

The independent research conducted to date suggests that, for the time being at least, nothing beats Boolean, particularly when used as part of an iterative process.

That does not necessarily mean that alternative search technologies are not worth considering, either independently or along with Boolean or keyword searches. But practitioners would be well advised to carefully scrutinize the marketing claims of the purveyors of such technologies and to factor in often substantial direct and indirect costs of such approaches.

H. Christopher Boehning and Daniel J. Toal are litigation partners at Paul, Weiss, Rifkind, Wharton & Garrison. Associate Jason D. Jones and Aaron Gardner, the firm's discovery process manager, assisted in the preparation of this article.

The Full Text of Ms. Devey's Blog Posting is as follows:

It seems that alternative search technologies (alternative to the familiar Keyword and Boolean searches) touted by Vendors are considered as ‘too good to be true’. Check it out yourself at In Search of Better E-Discovery Methods By H. Christopher Boehning and Daniel J. Toal, New York Law Journal April 23, 2008

The above legal article also mentioned the Text Retrieval Conference (TREC) 2006 study which was also examined by Will Uppington in the article, Better Search for E-Discovery, March 11th, 2008

What I find interesting in Will Uppington’s article is the finding; ‘One of the best ways to get better search queries is to commit human resources to improving them, by putting a “human-in-the-loop” while performing searches’.

Reading between these two ‘search themed’ titles, one from the legal side and the other from a technical perspective, highlights their contrasting findings and interpretations of the TREC 2006 study.

What else can we say about the ‘human-in-the-loop’, the ‘virtuous cycle of iterative feedback’ and the “interactive” search methodology?

Well, such phrases/concepts are not new. What is new is that the ‘human actions’ aspects are creeping (awareness?) into the ediscovery space. Other knowledge researchers outside the ediscovery domain have been busily coming up with phrases/concepts such as the ‘concept searching’ methodologies. Reality (or inertia adoption) testing of such newer technologies is clearly not well understood (too good to be true?) by the courts and practitioners.

On human actions and computer programs, a beautiful quote comes from my friend, Roger C: “While computer programs can write other computer programs, they can’t write the first program”.
