The eDiscovery Paradigm Shift: The New Generation of eDiscovery Search

Thursday, February 12, 2009

The New Generation of eDiscovery Search

The New Generation of eDiscovery Search technology train is getting ready to leave the station. However, after walking the tradeshow floors and attending many of the breakout sessions at last week's LegalTech in New York, it is obvious that there is a tremendous amount of confusion regarding the definition and scope of the New Generation of eDiscovery Search technology and, more importantly, how the courts view the use of such technology.

With the accelerating volume of Electronically Stored Information (ESI), or what I like to call Electronically Stored Evidence (ESE), the legacy search technologies built into current eDiscovery tools, and the associated best practices for document review, are having a hard time keeping up. Further, there is a tremendous amount of confusion and trepidation among litigators with regard to potential malpractice claims, sanctions, and Rule 702 and Daubert challenges associated with employing the New Generation of eDiscovery Search technology. Finally, litigation technology vendors, whether purposely or not, have confused the market with fancy new marketing terms like "conceptual search", "transparent search", "linguistic search" and "clustering" without any real explanation of how they work or how to employ them correctly. Therefore, I thought it was time to restart the campaign to both educate and lobby the eDiscovery industry with regard to the New Generation of eDiscovery Search.

First of all, I want to start with education regarding the pertinent issues. Without a doubt, one of the best articles posted over the past 12 months on the legal issues surrounding what I am calling the New Generation of eDiscovery Search was written by Wayne C. Matus and John E. Davis in the New York Law Journal on October 31, 2008, titled "Do Your Searches Pass Judicial Scrutiny?"

Since it is my impression that many of these very important issues are still either unknown to most in the eDiscovery "business" or are being ignored, I contacted Mr. Matus this week to get permission to repost his article.

Following is the full text of "Do Your Searches Pass Judicial Scrutiny?":
Electronically stored information is increasing exponentially, and bills from law firms and discovery vendors to deal with this vast sea of data escalate significantly each year. Jason Baron, the director of litigation at the National Archives and Records Administration, believes that ESI is growing so fast that even with unlimited funds and human resources it will soon be impossible for humans to review these large document populations.[FOOTNOTE 1] Still, lawyers faced with potential malpractice claims and sanctions are loath to try new methods for handling the problem. It is time for change.

The traditional means used by litigators to address ESI is the application of keywords and Boolean search terms to identify relevant and non-privileged materials.[FOOTNOTE 2] While acknowledging that this method is unquestionably deficient, a recent article published in this publication concluded that "the available evidence suggests that keyword and Boolean searches remain the state of the art and the most appropriate search technology for most cases."[FOOTNOTE 3] We agree that, in a perfect world, if the parties can nevertheless meet and confer, and agree upon keywords to reduce the population to manageable proportions, the traditional judgmental method can be made to work. However, this is an imperfect world where plaintiffs and defendants do not always agree, and are not always equally motivated, to reduce costs. In fact, it is often quite the opposite. Moreover, even where the sides use judgmental sampling to agree upon keywords, the costs nevertheless usually remain too high.

THE JUDGMENTAL APPROACH
The judgmental approach to keywords ultimately fails because of "recall" and "precision." "Recall" measures how completely a process captures target data. "Precision" measures efficiency - the amount of irrelevant data captured along with the target data. Keywords, as judgmentally used by lawyers, recall too little, while capturing much that is irrelevant. An early landmark empirical study by David Blair and M.E. Maron[FOOTNOTE 4] showed that while lawyers thought they were retrieving about 75 percent of the relevant data, the true results were more like 20 percent. A subsequent study, conducted by the Text Retrieval Conference,[FOOTNOTE 5] confirmed this result, finding that only 22 percent of relevant documents were recalled using keyword search techniques, as opposed to approximately 78 percent found by other search techniques.[FOOTNOTE 6] Many lawyers will also tell you that it is common for reviewers to find only 10 to 40 percent of the recalled documents to be relevant, meaning lawyers are reading mostly junk.
We advocate two different approaches to yield better and more efficient results. First, we suggest that keywords are best used coupled with statistical, rather than judgmental, sampling. Second, we suggest that experienced counsel and vendors working with a combination of advanced conceptual search techniques can more efficiently and effectively deal with large amounts of ESI, resulting in a narrowed and enriched review set with a concomitant reduction in lawyer hours.
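
The recall and precision measures defined in this section can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name, the document IDs and the counts are hypothetical and are not drawn from the Blair and Maron or TREC studies.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for retrieved document IDs
    measured against the set of truly relevant document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents the search actually found
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical collection: documents 1 through 10 are the relevant ones.
relevant_ids = set(range(1, 11))
# Hypothetical keyword search: returns 2 relevant and 6 irrelevant documents.
retrieved_ids = {1, 2, 101, 102, 103, 104, 105, 106}

recall, precision = recall_precision(retrieved_ids, relevant_ids)
print(f"recall={recall:.0%} precision={precision:.0%}")  # recall=20% precision=25%
```

In this made-up case the search finds only one in five of the relevant documents, and three of every four documents it returns are irrelevant, which is the pattern the paragraph above describes.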

KEYWORDS DONE RIGHT
In Victor Stanley v. Creative Pipe, 250 FRD 251 (D. Md. 2008), Chief Magistrate Judge Paul W. Grimm of the U.S. District Court for the District of Maryland found counsel had waived the attorney-client privilege as to 165 inadvertently produced documents -- despite the use of 70 separate keyword searches in conducting their privilege screen -- because, among other reasons, counsel had failed to conduct "quality assurance testing." Clearly, judgmental sampling did not pass judicial scrutiny, while statistical sampling would likely have.

Counsel seeking to conduct a proper keyword search should instead consider the following steps:

• Sample the data. Counsel should isolate a random and statistically significant sample of the relevant datasets and then conduct a manual review of such data for relevance and privilege.[FOOTNOTE 7] This will educate counsel as to what to expect from the larger population and help in formulating keywords.
• Analyze and rank keywords. Counsel should then create and run search terms against the sample set, and (based on the information derived from the sample review) analyze their effectiveness by "recall" and "precision." This preliminary knowledge of the contents and richness of particular datasets will provide the basis to predict retrieval and review costs, and whether, for example, counsel should conduct any or just a limited review of such data.[FOOTNOTE 8]
• Review and repeat until satisfied that the search plan is defensible. This approach is plainly iterative in nature; it is the rare set of searches that achieves acceptable returns without adjustment. Successive application and fine-tuning of terms should permit counsel to achieve defensible levels of recall with superior precision rates. Practitioners should take note: One of the main factors cited by the court in Victor Stanley to determine if a party has conducted a reasonable search is if the party has reviewed a sample of the results to "assess its reliability, appropriateness for the task, and the quality of its implementation."[FOOTNOTE 9]
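
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a defensible protocol: the corpus, the relevance coding, the candidate terms and all function names are hypothetical, the matching is naive substring matching, and a real matter would need a far larger sample chosen with statistical advice.

```python
import random


def draw_sample(doc_ids, sample_size, seed=0):
    """Draw a simple random sample of document IDs (seeded so it can be reproduced)."""
    rng = random.Random(seed)
    return rng.sample(sorted(doc_ids), sample_size)


def score_term(term, sample_docs, relevant_ids):
    """Recall and precision of one search term against a manually reviewed sample."""
    hits = {doc_id for doc_id, text in sample_docs.items()
            if term.lower() in text.lower()}
    true_hits = hits & relevant_ids
    recall = len(true_hits) / len(relevant_ids) if relevant_ids else 0.0
    precision = len(true_hits) / len(hits) if hits else 0.0
    return recall, precision


def rank_terms(terms, sample_docs, relevant_ids):
    """Score each candidate term and sort by recall, then precision, best first."""
    scored = [(term, *score_term(term, sample_docs, relevant_ids)) for term in terms]
    return sorted(scored, key=lambda row: (row[1], row[2]), reverse=True)


# Hypothetical six-document corpus; reviewers coded documents 1, 3 and 4 relevant.
corpus = {
    1: "Pricing agreement with the distributor attached",
    2: "Lunch on Friday?",
    3: "Revised pricing schedule for the merger",
    4: "Merger timeline and board approval",
    5: "Fantasy football pricing pool",
    6: "Board meeting parking instructions",
}
relevant_ids = {1, 3, 4}
candidate_terms = ["pricing", "merger", "board"]

# Step 1: sample the data. Step 2: score and rank the terms on the sample.
sample_ids = draw_sample(corpus, sample_size=4)
sample_docs = {doc_id: corpus[doc_id] for doc_id in sample_ids}
reviewed_relevant = relevant_ids & set(sample_ids)
for term, recall, precision in rank_terms(candidate_terms, sample_docs, reviewed_relevant):
    print(f"{term}: recall={recall:.2f} precision={precision:.2f}")
```

Step 3, the iteration, would amount to revising `candidate_terms` and rerunning the scoring until the figures reach a level counsel is prepared to explain to the court.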

There is no consensus as to what percentage of recall will pass muster. Instead, counsel must be able to explain to the court the "reasonableness" under the circumstances of each step of the process, including the point at which a party was satisfied with the effectiveness of its search terms.[FOOTNOTE 10]

KEYWORDS: TO DISCLOSE OR NOT
Search terms created by counsel are generally protected, at least initially, by the attorney work-product doctrine, as their "mental impressions, conclusions, opinions, or legal theories ... concerning the litigation."[FOOTNOTE 11] But how does one show "reasonableness" of the search methodology without disclosing the keywords? Three recent cases, Victor Stanley, O'Keefe and Equity Analytics,[FOOTNOTE 12] have required that attorneys be able to explain and defend to the court, at its request, the methodology used to employ the search terms. One court has indicated it might find a waiver of privilege and require disclosure of the search terms.[FOOTNOTE 13]

OTHER FILTERING TECHNOLOGIES
Magistrate Judge John M. Facciola of the U.S. District Court for the District of Columbia recently pointed to authority that "concept searching" applications -- which use statistical and linguistic models to search for ideas as well as words and impose order upon disparate documents -- are "more efficient and more likely to produce comprehensive results" than keyword or Boolean searches.[FOOTNOTE 14] For example, the TREC 2007 Legal Track study found that 78 percent of relevant documents in a dataset were not found by Boolean keyword searches, but only by alternative search techniques.[FOOTNOTE 15] It is little wonder that Magistrate Judge Grimm has expressed optimism that concept-based searches studied by TREC would supplant keywords as the preferred method "for a variety of ESI discovery tasks."[FOOTNOTE 16]

These findings appear to have been confirmed. Earlier this year, the eDiscovery Institute disclosed its preliminary assessment of the study it conducted on the performance of computerized document review against human review. The study was conducted against a dataset drawn from the Verizon-MCI merger consisting of 1.3 terabytes and over two million documents. They concluded that computer systems allowed a comparable level of performance to be achieved with fewer people, less time and lower cost. While actual cost of traditional review was over $13.5 million, computer-assisted review was projected to cost just a fraction of that amount.[FOOTNOTE 17]

Concept search methodologies fall into three basic (and sometimes overlapping) categories:

• Probabilistic. This technique relies upon probabilistic search models such as "Bayesian classifiers," which evaluate and classify documents based on the interrelationships, proximity and frequency of usage of words found therein. The model may be given a "head start" by a sample set of relevant documents developed by attorneys at the outset of the process, which the computer analyzes and applies to the remaining documents. This technique can order groups and documents based on perceived potential importance to assist in the review process.
• Rule-based (or "clustering"). This statistically driven process analyzes the prevalence of words in documents and, based on such analysis, groups together documents interpreted as featuring like concepts. This technique can order documents by perceived potential importance as well.
• Linguistic. Sometimes referenced as "fuzzy search models," this technique seeks documents containing all forms of a target word or its synonyms in a general and/or case-specific thesaurus. Linguistic approaches may also rely upon statistics to analyze documents for terms along the same subject lines -- or sometimes to identify documents that use different ways to make the same point.[FOOTNOTE 18]
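
To give a feel for the probabilistic category, here is a minimal sketch of a naive Bayes classifier that is given a "head start" by a seed set of attorney-coded documents and then orders unreviewed documents by estimated relevance. It is an assumption-laden toy: it uses word frequency only (not proximity or interrelationships), the class and function names and all documents are hypothetical, and it does not represent how any particular vendor's product works.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class SeedSetClassifier:
    """Tiny multinomial naive Bayes model trained on attorney-coded seed documents."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def train(self, text, is_relevant):
        """Add one coded seed document to the model."""
        words = tokenize(text)
        self.word_counts[is_relevant].update(words)
        self.doc_counts[is_relevant] += 1
        self.vocab.update(words)

    def _log_score(self, words, label):
        """Log of (class prior x word likelihoods), with add-one smoothing."""
        total_docs = sum(self.doc_counts.values())
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(self.word_counts[label].values())
        for word in words:
            if word in self.vocab:  # words never seen in training are ignored
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def relevance_log_odds(self, text):
        """Positive values favor 'relevant'; negative values favor 'not relevant'."""
        words = tokenize(text)
        return self._log_score(words, True) - self._log_score(words, False)


def rank_by_relevance(classifier, docs):
    """Return document IDs ordered from most to least likely relevant."""
    return sorted(docs, key=lambda d: classifier.relevance_log_odds(docs[d]),
                  reverse=True)


# Hypothetical seed set coded by attorneys.
clf = SeedSetClassifier()
clf.train("pricing agreement with distributor", True)
clf.train("revised pricing schedule for distributor", True)
clf.train("lunch on friday", False)
clf.train("parking instructions for the board meeting", False)

# Hypothetical unreviewed documents, ranked for review.
unreviewed = {"A": "distributor pricing terms attached", "B": "friday lunch menu"}
print(rank_by_relevance(clf, unreviewed))
```

Retraining the model as reviewers code more documents is one simple way to picture the "learning" behavior that some concept-searching applications are said to exhibit.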

Differing tools often produce differing results, but some combination of each of these approaches (as well as Boolean keyword searches) -- using a transparent, iterative and measured process as described above -- can be used to best effect. As successive waves of ESI are received (as is often the case), moreover, certain of the concept-searching applications "learn" and become better at identifying correlations, associating documents with particular attributes with concepts of interest to counsel and minimizing false positives. Further, the statistics generated by this process permit counsel to draw educated lines as to where review should proceed and, sometimes more importantly, when it is reasonable to stop. The advantages of these powerful, computerized techniques become even more apparent where ESI reaches the terabyte range and the steep recall/precision tradeoff exhibited by keyword analyses may reach unacceptable levels. Two jurists have indicated in opinions an interest in hearing from experts as to such new approaches.[FOOTNOTE 19]

CONCLUSION
Given escalating volumes of ESI, with no end in sight, and the general impatience of courts with e-discovery mistakes, counsel and their clients soon may have no choice but to adopt discovery tools that are more efficient and precise than traditional Boolean search techniques. Courts have already put practitioners on notice of this emerging obligation. Combining measured approaches to search methodologies with advanced techniques can greatly assist in the organization of ESI and the cost-effective conduct of litigations and investigations. The future is now for these state-of-the-art search techniques.
Wayne C. Matus is a litigation partner in the New York office of Pillsbury Winthrop Shaw Pittman and one of two national leaders of the firm's e-discovery practice. John E. Davis is a senior associate in the firm's New York office specializing in e-discovery. Sandra Barragan, an associate at the firm, assisted in the preparation of this article.

FOOTNOTES
FN1 "EDD Showcase: Discovery Overload," Law Technology News, January 2008.
FN2 Although keyword searches and Boolean term searches are undeniably distinct, for purposes of this article we will refer to them interchangeably.
FN3 See "Assessing Alternative Search Methodologies," H. Christopher Boehning and Daniel J. Toal (NYLJ, April 22, 2008).
FN4 "An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System," Communications of the Association for Computing Machinery at 289-99, March 1985.
FN5 TREC is sponsored by the National Institute of Standards and Technology (NIST) and the Advanced Research and Development Activity of the Department of Defense.
FN6 These were the results of TREC 2007, the second year of the Legal Track study. See Jason R. Baron, Douglas W. Oard, Paul Thompson & Stephen Tomlinson, Overview of the TREC-2007 Legal Track, at §6 (linked at http://trec-legal.umiacs.umd.edu/). In a prior TREC-6 Ad Hoc Task study, for keywords to achieve just 50 percent recall, the architects had to accept a dismal 20 percent precision rate (whereby four of every five documents selected by keywords were nonresponsive). See H5 White Paper, Concept Search: Perceived Security, Actual Risk, at 2, citing Voorhees, Ellen M., and Harman, Donna, Overview of the Sixth Text REtrieval Conference (TREC-6), in NIST Special Publication 500-240: The Sixth Text REtrieval Conference (TREC 6), ed. E.M. Voorhees and D.K. Harman, 1-24 (Gaithersburg, MD: NIST 1997), and Voorhees, Ellen M., and Harman, Donna, Overview of the Seventh Text REtrieval Conference (TREC 7), ed. E.M. Voorhees and D. K. Harman, 1-24 (Gaithersburg, MD: NIST 1998).
FN7 E.g., Treppel v. Biovail Corp., 233 FRD 363, 374 (SDNY 2006).
FN8 See McPeek v. Ashcroft, 212 FRD 33, 35 (D. D.C. 2003) (ordering sampling of backup tapes to determine whether they contained relevant documents); Wiginton v. DB Richard Ellis Inc., 229 FRD 568, 570 (N.D. Ill. 2004) (ordering sampling of archived material based on keywords to determine whether it should be restored); see also Victor Stanley, 250 FRD at 261, citing The Sedona Conference Best Practices Commentary on the Use of Search & Information Retrieval Methods in E-Discovery, 8 Sedona Conf. J. 189 (2007) [hereinafter, "The Sedona Best Practices"]. For example, the sampling process may reveal that certain data sources (such as local hard drives) or file types (such as Microsoft Access files) have such low yield that the collection and review effort is not worthwhile. While opposing counsel may not agree to such decision, the data provided by the sample review will provide the evidentiary support needed to defend the reasonableness of such steps to the court.
FN9 Victor Stanley, 250 FRD at 256.
FN10 See, e.g., Security Financial Life Insurance Company v. Dept. of Treasury, 2005 WL 839543, *4 (D. D.C. April 12, 2005) ("In deciding whether an agency's document search is adequate, the issue is not whether other responsive records might possibly exist, but whether the search was adequate, judged by a reasonableness standard.") (internal citations omitted); see also Victor Stanley, 250 FRD at 261 n.10 ("the cost-benefit balancing factors of [FRCP] 26(b)(2)(c) apply to all aspects of discovery").
FN11 Fed. R. Civ. P. 26(b)(3); see Lockheed Martin Corp. v. L-3 Comm. Corp., 2007 WL 2209250 (M.D. Fl. July 29, 2007) ("documents containing instructions about how to conduct the [ESI] search and what specifically to search for are opinion work product" and therefore protected as attorney work product privileged material); see also Gibson v. Ford Motor Co., 2007 WL 41954, at *6 (N.D. Ga. Jan. 4, 2007) (document retention notice that included a list of search terms reflected attorney mental impressions and so constituted protected work product).
FN12 Victor Stanley, 250 FRD at 256, United States v. O'Keefe, 537 F.Supp.2d 14 (D. D.C. 2008), and Equity Analytics, LLC v. Lundin, 248 FRD 331 (D. D.C. 2008).
FN13 Counsel, early in the process, should consider disclosure of keywords and other aspects of the search protocol to the adversary and the court, and invite their comment and approval, as a means of managing discovery costs and risk. Magistrate Judge Grimm in Victor Stanley Inc., 250 FRD at 256, found, among other things, that defense counsel's failure to disclose the keywords used to screen for privileged documents in defending its search methodology justified a finding of waiver as to the inadvertently produced documents. The court provided a checklist for attorneys to follow when preparing the methodology to be used to gather and produce ESI: Attorneys should consider the reasonableness of "the keywords used; the rationale for their selection; the qualifications of the [creators of the search] to design an effective and reliable search and information retrieval method; whether the search [is] a simple keyword search, or a more sophisticated one, such as one employing Boolean proximity operators ... ." Id.
FN14 Disability Rights Council v. Washington Metropolitan Transit Authority, 242 FRD 139 (D. D.C. 2007), citing George L. Paul & Jason R. Baron, "Information Inflation: Can the Legal System Adapt?" 13 Rich. J.L. & Tech. 10 (2007).
FN15 TREC 2007 Legal Track.
FN16 Victor Stanley, 250 FRD at 261 n.10.
FN17 http://www.ediscoveryinstitute.org/research/index.html.
FN18 Such tools are described in further detail in The Sedona Best Practices at 191-216 & Appendix. While we are unaware of a court that has expressly endorsed this approach, parties have used these search techniques in conducting, among other things, internal investigations.
FN19 Magistrate Judge Grimm in Victor Stanley, 250 FRD at 260, and Magistrate Judge Facciola in O'Keefe, 537 F.Supp.2d at 24, and Equity Analytics, 248 FRD at 333, have indicated that conducting and defending e-discovery may at times require experts. Indeed, Magistrate Judge Facciola stated that search methodologies in e-discovery may be scrutinized under Rule of Evidence 702.