Monday, March 29, 2010

Become an eDiscovery Superhero with Conceptual Search and Categorization Technology

I grew up watching Batman and Robin on a really small black and white TV and reading Superman comic books. And, I have really enjoyed the resurgence of the Batman movie series. Given this fondness for superheroes, I wanted to post a recent paper entitled "Become an ESI Superhero" by Herbert L. Roitblat, Ph.D. of OrcaTec LLC and eDSG.

Dr. Roitblat takes a really clever approach to explaining eDiscovery categorization technology in terms that we can all understand. He contends that it's time for us all to put on our tights and join the Justice League of Superheroes. I'm not sure that the world is ready to see some of us in tights. But, Dr. Roitblat's analogy is very helpful in understanding how these new technologies can be used to make you look like a superhero.

He contends that, "Categorization can help to make the review much more reliable by using the machine to learn the decision patterns of an expert and then using those decisions as recommendations for more detailed review. Think of it as a form of Vulcan mind meld that transfers the expertise of the best expert available to the rest of the review staff."

He states that, "Semantic clusters help to gain a quick overview of what the collection is about and also suggest key terms and phrases that can be used to identify responsive documents. They provide a quick method of determining which documents merit further immediate review and which can be safely set aside. After review, if some documents in a cluster are marked responsive and others are not, it may be useful to examine why not."

And, he concludes that, "Concept selection gives you x-ray vision into the meaning of your collection. It lets you identify what the words mean in this particular context and to identify documents based on their meaning. It helps to highlight the documents that are most about a specific concept from the specific point of view of the context. For example, among the Enron emails, the word "osprey" is not used to refer to the bird, or the aircraft but to one of the off-books partnerships that got Enron into so much trouble."

If all of this really works as advertised, it will make us all look like superheroes. Whether or not you decide to wear the tights is up to you.

The full text of Dr. Roitblat's paper is as follows:

Superheroes have super powers. Their powers enable them to do things far beyond the capabilities of ordinary people. Superman had his X-ray vision. Wonder Woman had a lariat that compelled complete honesty. The Flash had super speed. Batman had his superior intellect and technology.

These members of the Justice League each had powers that would come in handy managing today's eDiscovery. If these powers were available to lawyers today, would you use them?

How would or should lawyers go about making decisions about using their super powers? There are legal issues, certainly. But as far as I can see, the really critical question is whether these powers provide capabilities beyond those of traditional eDiscovery processes.

The eDiscovery powers I'm thinking about revolve around technology. We cannot afford to wait for a lightning bolt to strike. They include categorization, clustering, and concept searching. Tools like these have the power to help you see inside the case materials, derive the honest information from them with super speed, and amplify your superior intellect.

The goal for the eDiscovery process is to identify the ESI that is potentially relevant and to separate it from the ESI that is not. Ten years ago, we could argue about paper versus plastic, about whether it was more or less efficient to review documents on a computer screen or on paper. At the time, I worked on a case involving 13 million pages, which the producing party was determined to produce on paper. That would have been about 65 tons of paper, the weight of an old-fashioned steam locomotive. There was simply no way that the receiving party could go through all of that paper in a timely way. Just sorting the pages into date order was a daunting task.

Since that time, the volume of ESI that must be considered has only continued to leap over tall buildings. It is no longer practical to have the managing partner on a case read through all of the available documents. Few attorneys actually have the super power, the Flash's speed, for example, to read through millions of pages in a short time, so they resort to other means to help them get through it.

Categorization
One of the oldest approaches is to hire an army of temporary attorneys to read the documents. When Verizon was acquiring MCI in 2005, they responded to a DOJ second request by hiring 225 attorneys for four months, working 16 hours per day, 7 days per week. And all this effort was needed to review just 1.3 terabytes, or 1.6 million documents. The review alone cost over $13 million, or about $8.50 per document. As the volume of ESI continues to grow into multi-terabyte collections, this approach is simply not sustainable.

There is also evidence to suggest that this approach is not as accurate as it might be. What level of attention can a reviewer sustain while reading documents 16 hours a day, seven days a week, week after week? We had one client, in fact, in an unrelated matter, who asked us to use search technology to filter out the jokes because the reviewers were spending way too much time on them, rather than plowing through the potentially relevant material.

We (Roitblat, Kershaw, & Oot, 2010) recently published a paper in the Journal of the American Society for Information Science and Technology (JASIST) analyzing the performance of human reviewers in comparison to two computer-assisted review systems. None of the authors of that paper has any financial relationship with the reviewers or the companies providing the computer categorizers.

We set out to examine the idea that these computer systems could yield results comparable to those that would be obtained with a human review, and we found that they could. If we somehow knew which documents were truly responsive, then we could compare the judgments made during the first review with these true judgments. Unfortunately, no such oracle was available, so instead we had to settle for a comparative method. We assessed the level of agreement between a new traditional review, conducted by new professional reviewers, and the original review, and between the computer systems and the original review. By comparing a new human review with the original review, we get an assessment of how well the traditional approach captures reliable aspects of the document collection. If the computer systems perform no worse, then it may be reasonable to use systems like this, rather than to spend the time and money needed to hire humans to do the work. I'll return to this assertion in a bit, after discussing some of the results.

After being trained on the issues involved, two teams of experienced professional reviewers were given a random sample of 5,000 documents to review for responsiveness. We could now assess the level of agreement of each of the teams with the original review and with each other. Team A agreed with the responsiveness decisions of the original review on about 76% of the documents. Team B agreed with the original review on about 72% of the documents. They agreed with each other on about 70% of the documents.
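
A minimal sketch of how this kind of pairwise agreement can be computed, assuming each review is stored as a mapping from document ID to a responsive/nonresponsive call (the function name and the toy data below are illustrative, not the actual study data):

```python
# Illustrative sketch: pairwise agreement between two document reviews.
# Each review maps a document ID to True (responsive) or False (nonresponsive).

def agreement_rate(review_a: dict, review_b: dict) -> float:
    """Fraction of commonly reviewed documents on which two reviews agree."""
    shared = review_a.keys() & review_b.keys()
    if not shared:
        return 0.0
    matches = sum(1 for doc_id in shared if review_a[doc_id] == review_b[doc_id])
    return matches / len(shared)

# Hypothetical toy data, not the JASIST sample.
original = {"doc1": True, "doc2": False, "doc3": True, "doc4": False}
team_a = {"doc1": True, "doc2": False, "doc3": False, "doc4": False}

print(f"Team A vs. original: {agreement_rate(team_a, original):.0%}")  # 75%
```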

You might think that the reason for such low agreement between the teams and the original review was the extreme conditions under which the original review was conducted. I have little doubt that the original review could have been done more reliably with more time (and more expense), but these results do not support that conclusion: the two re-review teams, who were under practically no time pressure, agreed with each other even less often than they agreed with the original review. Instead, it would seem that the low level of agreement is more likely explained by the unreliability of human responsiveness judgments in general. People are just not very reliable at identifying responsive documents, even in a small collection of a few thousand documents.

The two computer systems, on the other hand, agreed with the original review on about 83% of the documents, certainly no worse than the level of agreement to be expected based on the human review. Replacing the army of human reviewers with a computer did not decrease the level of agreement, but it would very likely save a substantial amount of money.

Alan Turing, considered by many to be the father of computing, developed a test for assessing machine intelligence. His argument was that intelligence is a function. If the outputs of a machine are indistinguishable from the outputs of a human under a specific set of circumstances, then the computer could be said to implement the same function as the human—in this case the intelligence function.

In Turing's test all communication is done through a written medium, say a computer keyboard. The tester is supposed to have a conversation with a partner in another room. If the tester cannot tell whether she is communicating with a person or a machine, then the machine is said to be producing equivalent results and to be executing the same function. The computer could be said to be genuinely intelligent.

We could apply this same methodology in assessing computer versus human judgments of responsiveness. If we cannot tell the difference between the two kinds of systems, then we can conclude that they perform the same function. Based on the data reported in JASIST, that seems to be the case. If anything, the computers were a bit more reliable than the humans. If Batman were an attorney, he would use this kind of technology to gain an advantage over his adversaries. Think of it as a kind of BatReview.

Not every attorney is convinced by these results. For some, this skepticism reflects a romantic notion that there must be something special about having real live humans read every document. Their claim is that humans will find responsive documents that the computer will miss.

Although it may be true that humans will find documents that the computer might miss, it is at least equally true that one human will find documents that another human might miss and, conversely, miss documents that another human reviewer might find responsive. It is also true that the computer is likely to find responsive documents that the human might miss. The agreement between two groups of humans was no higher than the agreement between the computer systems and the original review. The available evidence does not support the claim that humans will find more responsive documents or that they will retrieve fewer nonresponsive documents than either of these categorization systems will.

Another claim is that the character of the documents missed by the computer will somehow be different from the character of the documents missed by the humans. This claim is more difficult to assess, because it is not obvious how to measure the character of the documents that are missed by one system or another. For this notion to be valid, however, while the machines still achieve levels of agreement equal to or higher than those achieved by humans, the humans would have to miss an equal number of documents that the computer finds responsive. It implies that there is some systematic difference in the documents that the people find and the computer does not, and some systematic difference in the documents that the computer finds and the people do not. It's not impossible that both systematic differences exist, but their practical significance is, at best, elusive.

How well any system, whether human or machine, performs is a matter of measurement. It seems unreasonable to claim that humans are somehow better than computers at distinguishing responsive from nonresponsive documents without measuring their performance along dimensions that matter. For example, the low level of agreement between human reviewers often comes as a shock to attorneys. Every review should include quality measures, whatever technology is used to perform it. Superman was not a superhero just because he flew around; that would only make him super. He was a superhero because he was successful at thwarting bad guys. His effectiveness could be measured by the number and evilness of the villains he defeated. The effectiveness of review should be measured by the ability to identify responsive documents and eliminate the nonresponsive.

Practically every process can be improved once it is measured. The steps taken to improve the quality of the review should be determined by the degree of risk in the case, in other words, they should be based on judgments of reasonableness.

The two computer systems used in the JASIST study did not make up their categorization by themselves. They don't actually decide what is responsive and what is not responsive. They form their categories on the basis of input from people. They implement the judgment of their "trainers" rather than make up their own. One system used in this test learns how to distinguish between responsive and nonresponsive documents from example judgments made by reviewers. The system is given a set of documents that the reviewers determined were responsive and a set that the reviewers determined were nonresponsive. From its analysis of these two sets, it derives a set of computational rules that distinguish between the two and applies these rules systematically to the remaining documents. This is the most common form of machine learning and automated categorization.
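
The systems used in the study are proprietary, but the general example-based approach can be sketched with generic, off-the-shelf tools. The snippet below is only an illustration under that assumption (it uses scikit-learn and made-up training documents, not either vendor's actual method): it learns a decision rule from reviewer-labeled examples and applies it to unreviewed documents.

```python
# Generic sketch of example-based categorization, not the actual JASIST systems.
# Assumes scikit-learn is installed: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical reviewer judgments: 1 = responsive, 0 = nonresponsive.
train_docs = [
    "pricing terms for the northwest power contract",
    "lunch order for the holiday party",
    "revised draft of the transaction agreement",
    "fantasy football picks for this weekend",
]
train_labels = [1, 0, 1, 0]

# Learn a decision rule from the examples, then apply it to the rest of the collection.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

unreviewed = ["counterparty credit terms attached", "who is bringing the pizza"]
print(model.predict(unreviewed))  # predicted labels, 1 = likely responsive
```

In practice the model's predictions, or its confidence scores, would be treated as recommendations that prioritize human review rather than as final responsiveness calls.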

The other system was trained by linguists and attorneys to distinguish between responsive and nonresponsive documents. The trainers read the request and the training information provided to the original reviewers. They then adjusted the system's algorithms until it distinguished between documents in the way that these people determined was accurate. In both cases, the computer simply systematically implemented the decisions made by its team of trainers without getting bored, distracted, tired, or needing a vacation.

Keyword selection
Over the last few years, many attorneys have grown comfortable with one weak form of machine classification. Many of them have been using keywords to select or cull documents for further review. The attorneys pick a set of keywords or Boolean queries to use to select documents. These terms may be created by one side or negotiated between the two sides. Any document that contains one of these keywords or matches the query is selected for further processing; the others are simply ignored. For the most part, if a document does not have one of the keywords in it, it is never looked at again, so its information is effectively lost.
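
Keyword culling itself is simple enough to sketch in a few lines; the search terms and documents below are purely hypothetical:

```python
# Illustrative keyword culling: keep any document containing at least one search term.
import re

search_terms = ["osprey", "swap", "bonus"]  # hypothetical negotiated terms
pattern = re.compile(r"\b(" + "|".join(map(re.escape, search_terms)) + r")\b", re.IGNORECASE)

documents = {
    "doc1": "The Osprey partnership needs additional collateral.",
    "doc2": "Reminder: the softball game starts at six.",
}

selected = {doc_id for doc_id, text in documents.items() if pattern.search(text)}
print(selected)  # {'doc1'} -- doc2 is never looked at again
```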

Keyword searching, although important, is the weakest form of machine classification available, hardly up to superhero standards, kind of like Green Lantern confronted with anything yellow. The success of keyword searching depends critically on the ability of the attorneys to pick the right words, words that identify the responsive documents without overwhelming the review with nonresponsive ones. For more than 20 years we have known that attorneys are only about 20% successful at guessing the right words to search for (Blair and Maron, 1985). In the 2008 TREC Legal Track, they had two sides of an issue create search terms. The "defendant's" search terms retrieved just 4% of the responsive documents. The "plaintiff's" search terms retrieved 43% of the responsive documents, but 77% of the documents returned were nonresponsive, much higher than the 59% nonresponsive rate for the "defendant."

As you would expect, when they negotiated the search terms, as many of the thought-leaders in eDiscovery suggest, the results were intermediate, but still not stellar. Negotiated search retrieved only 24% of the responsive documents (http://trec.nist.gov/pubs/trec17/papers/LEGAL.OVERVIEW08.pdf, p. 10). In the 2007 TREC, it was even lower. Of the documents that were retrieved, 72% were found to be nonresponsive. Keyword culling serves to reduce the volume of documents that will be considered for review, but it does not do a very good job of identifying responsive documents. Cooperation may be the better policy, but it's no lariat of truth, unless further steps are taken.
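
Those TREC figures are recall and precision restated. A small sketch of how the two measures are computed from a retrieved set and the set of truly responsive documents (toy sets, assumed for illustration):

```python
# Recall: share of responsive documents that were retrieved.
# Precision: share of retrieved documents that were responsive;
# the "nonresponsive rate" quoted above is 1 - precision.

def recall_and_precision(retrieved: set, responsive: set) -> tuple:
    hits = retrieved & responsive
    recall = len(hits) / len(responsive) if responsive else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical toy sets, not TREC data.
retrieved = {"d1", "d2", "d3", "d4"}
responsive = {"d1", "d5", "d6"}
r, p = recall_and_precision(retrieved, responsive)
print(f"recall={r:.0%}, precision={p:.0%}, nonresponsive rate={1 - p:.0%}")
```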

One reason for the poor performance of these search terms might be that they were based on the attorneys' expectations for the words that ought to be in the collection and ought to distinguish between responsive and nonresponsive documents, rather than on the words that actually were there. For example, people often misspell words in emails. A name like "Brian," for example, might be spelled "Brain," or "Bryan." "Believe" might be misspelled as "beleive." Document authors, especially email authors, may use nonstandard abbreviations. They can be very creative in the words they choose to use. There are over 200 synonyms, for example, for the word "think." Did they "buy," "purchase," "acquire," or just "get" a new car, for example? Groups all have specialized jargon. Lawyers may have one way to talk about issues in the case, but the document authors rarely think or write like lawyers.

Identifying the right terms to search for can be helped by looking at the words that are actually in the ESI. One easy way to do this is to simply print out or display an index of the collection. This list is sometimes called a word wheel. Then, the parties can select terms from this list, rather than trying to make up a list from their imagination.
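
A word wheel is essentially a frequency-sorted list of the terms that actually occur in the collection. A minimal sketch, assuming naive tokenization and a toy document list:

```python
# Minimal "word wheel": list the terms that actually occur in the collection.
import re
from collections import Counter

documents = [
    "Brain will beleive the Osprey deal closes in 2Q",   # note the misspellings
    "Brian said the bonus pool depends on the 2Q close",
]

counts = Counter(
    token
    for doc in documents
    for token in re.findall(r"[a-z0-9]+", doc.lower())
)

for term, freq in counts.most_common(10):
    print(f"{freq:4d}  {term}")
```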

Clustering
A more powerful technique is to use semantic clustering to group documents with similar content. Using one of a number of techniques, the computer groups together similar documents and finds a word or a phrase that describes that group. A quick examination of these groups is often enough to determine whether the documents they contain are likely to be responsive or not. The labels can be used as potential search terms. Some of these clusters may be obviously responsive, some obviously nonresponsive and others may require looking at a few of the documents to figure out.

There is an added benefit if a given document can be in more than one cluster because documents can be about more than one topic. An email might say something like, "I'm bringing pizza to the party on Saturday, and by the way, the money that we stole is now in our Swiss bank account." If that email were clustered into only the pizza party cluster, no one would ever see it again and that information would be lost.

The clusters can also be used to select or cull. You can scan down a list of clusters and determine whether the documents in that cluster are likely to be responsive or not. Any document in a potentially responsive cluster could then be reviewed further, but documents that never appear in a responsive cluster can be set aside.
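
As a rough illustration of the idea, the sketch below clusters a handful of toy documents with k-means over TF-IDF vectors and labels each cluster with its highest-weighted terms. This is a generic technique assumed for illustration, not the method used by any particular eDiscovery product, and it assumes scikit-learn is installed.

```python
# Generic illustration of semantic clustering with automatic cluster labels.
# Assumes scikit-learn: pip install scikit-learn
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the bonus pool will be paid after the 2q close",
    "bonus targets depend on 2q transactions",
    "bring your glove to the softball game on saturday",
    "the game was rained out and the coaches will reschedule",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()

# Label each cluster with its highest-weighted terms.
for i, center in enumerate(km.cluster_centers_):
    top = [terms[j] for j in center.argsort()[::-1][:3]]
    print(f"cluster {i}: {' '.join(top)}")
```

Note that k-means assigns each document to exactly one cluster; the overlapping clusters described above require a soft clustering method, but the labeling idea is the same.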

Here are a few of the cluster labels derived from a subset of the Enron data:

- ibm
- 2q
- bal month
- transactions
- software
- luis gasparini
- coaches
- generators
- lon ect
- hou ees
- el paso
- ebner daniel
- officials
- epmi long term northwest
- bonus
- game

If the issue involves financial information, then documents in 2q, epmi long term, transactions, and bonus clusters may be relevant. Terms like these would make good search terms, and you know that they are actually used in the documents and that there are enough of them to make up a cluster. Documents in the coaches, software, and game clusters are unlikely to be relevant. Because the same document can appear in more than one cluster, it will still be selected for review if it appears in at least one responsive cluster, no matter how many nonresponsive clusters it appears in.

Concept Selection
Concept searching is another tool that can help to amplify your powers. Concept searching identifies the meanings of words using any of a number of different technologies, including ontologies, thesauri, latent semantic indexing (LSI), and language modeling. Ontologies and thesauri are usually created by experts, who program word relations into the system. These knowledge engineers identify that the word "car" is related to the word "vehicle," for example. Systems based on one of these approaches contain only those relationships that have been explicitly programmed into them.

LSI and language modeling derive the meanings of words automatically from the context in which those words are used. These systems reflect the actual patterns of word use and are capable of discovering unanticipated relationships.

Concept searching works by expanding the query that the user submits to include the original term and additional, related terms. Concept searching identifies documents as relevant, even if the exact query term happens not to be in the document, because it searches not just for the original term submitted by the user, but for it and related terms. This approach tends to push the most relevant documents, the ones with the query term and the most context, to the top of the results list and to add a small number of related documents, which do not happen to have the query term, to the tail end. Concept search tends to mitigate the difficulty of guessing the right terms to search for, because it learns what terms are related to which. It helps to find documents based on their meaning, rather than solely on the presence of specific words.
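
One way to sketch the LSI flavor of this idea with generic tools is to project a TF-IDF term space into a low-dimensional "concept" space and expand the query with the terms closest to it there. The snippet below assumes scikit-learn and a toy corpus; it is a rough sketch of the technique, not any vendor's implementation.

```python
# Toy illustration of concept-style query expansion via latent semantic indexing
# (TruncatedSVD over a TF-IDF matrix). Assumes scikit-learn is installed.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "the osprey partnership kept the debt off the books",
    "osprey and the other off books entities hid losses",
    "raptor was another off balance sheet partnership",
    "the osprey is a fish eating bird of prey",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)          # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
term_vectors = svd.components_.T                 # terms x latent "concepts"

terms = list(vectorizer.get_feature_names_out())
query = "osprey"
q_idx = terms.index(query)
sims = cosine_similarity(term_vectors[q_idx:q_idx + 1], term_vectors)[0]

# Expand the query with the terms closest to it in concept space.
expansion = [terms[i] for i in sims.argsort()[::-1] if terms[i] != query][:3]
print(f"{query} -> {expansion}")
```

The expanded query can then retrieve documents that use related language even when the literal query term is missing, which is the behavior described above.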

Conclusion: Using These Superpowers
The eDiscovery superpowers described above can really help to reduce the cost and burden of eDiscovery. Categorization can help to make the review much more reliable by using the machine to learn the decision patterns of an expert and then using those decisions as recommendations for more detailed review. Think of it as a form of Vulcan mind meld that transfers the expertise of the best expert available to the rest of the review staff.

Documents identified by the categorizer can be reviewed first. It prioritizes the review so that the documents most likely to be responsive are completed first. It allows the reviewers to quickly gain exposure to the documents most likely to be relevant and learn from those examples what makes them relevant.

After the review, a sample of the decisions made by the human reviewers can be fed back to the categorizer and it can be retrained. If the reviewers were consistent in their review, then there should be few discrepancies between the categories assigned by the computer and the categories assigned by the reviewers. These discrepancies can then be examined to resolve those inconsistencies.
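
A tiny sketch of that discrepancy check, assuming the computer's calls and a sample of the reviewers' calls are stored as dictionaries keyed by document ID (illustrative data only):

```python
# Flag documents where the retrained categorizer and the human reviewers disagree.
computer_calls = {"doc1": "responsive", "doc2": "nonresponsive", "doc3": "responsive"}
reviewer_calls = {"doc1": "responsive", "doc2": "responsive", "doc3": "responsive"}

discrepancies = sorted(
    doc_id
    for doc_id in computer_calls.keys() & reviewer_calls.keys()
    if computer_calls[doc_id] != reviewer_calls[doc_id]
)
print(discrepancies)  # ['doc2'] -- candidates for a closer look
```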

By examining the actual words used in the documents, the quality of keyword selection can be greatly improved. It then becomes easier to negotiate sensibly about which keywords to use. It becomes easier to demonstrate that you have met the requirement of the Federal Rules to conduct a reasonable search of the ESI. Most importantly, you increase your chances of finding the responsive documents without over-burdening the collection with irrelevant ones.

Semantic clusters help to gain a quick overview of what the collection is about and also suggest key terms and phrases that can be used to identify responsive documents. They provide a quick method of determining which documents merit further immediate review and which can be safely set aside. After review, if some documents in a cluster are marked responsive and others are not, it may be useful to examine why not.

Concept selection gives you x-ray vision into the meaning of your collection. It lets you identify what the words mean in this particular context and to identify documents based on their meaning. It helps to highlight the documents that are most about a specific concept from the specific point of view of the context. For example, among the Enron emails, the word "osprey" is not used to refer to the bird, or the aircraft but to one of the off-books partnerships that got Enron into so much trouble.

Armed with these superpowers you are now ready to combat your adversaries in the world of eDiscovery. Go put on your tights and join the Justice League.
