The eDiscovery Paradigm Shift: 12/7/08

Tuesday, December 9, 2008

Concept Search vs. Keyword Search in eDiscovery

Having grown up in the enterprise class solutions world with relational databases, I am very comfortable with SQL query based searching. And, over the past several years, with my focus on the litigation market and litigation technology, I have now become very familiar with keyword searching against scanned and OCR'd document files. However, with my passion for "leading edge" technology and/or solutions that can meet the demands of a market going through a paradigm shift, I have not been overly excited or impressed with the state of search technology in eDiscovery.

However, that has all changed with the emergence of conceptual search technology. As such, I have spent a tremendous amount of time researching conceptual search and how it compares from both a technology standpoint and a business value standpoint. And, although I have come to the early conclusion that there is room and a need for all three, I have also determined that there is still a tremendous amount of confusion in regards to concept search vs. keyword search technology and the best use of both.

Therefore, in an effort to keep the followers of my blog informed, I have been following a series of excellent posts on the ediscovery 2.0 blog discussing conceptual search. Following is the full text of the latest post titled "Concept Search Versus Keyword Search in Electronic Discovery" by Will Uppington:

In my last post, I started a discussion on the myths surrounding concept search. The first myth I dispelled was the “concept search is concept search” myth. The myth is that there is an agreed upon definition of concept search. In actuality, when people in e-discovery use the term concept search, they don’t always mean the same thing. Frequently they are not actually talking about concept search technology at all and are actually talking about concept or content categorization technology, which is very different. The second myth that needs dispelling is that concept search is better than keyword search.

The thinking behind this myth goes something like this:

Keyword search has a lot of problems. It is prone to being over-inclusive, i.e., finding some non-relevant documents, and under-inclusive, i.e., not finding some relevant documents. Concept search technologies are new and interesting and using these technologies you can find documents that keyword search can’t find. Therefore, concept search must be better than keyword search.

Let’s examine this thinking. The first two statements are accurate. Keyword search is not perfect and can produce over- and under-inclusive results. And concept search and content categorization technologies can both help identify documents that keyword search technologies might not find. However, the conclusion that concept search is better than keyword search is not valid and doesn’t follow from these two statements. Why?

In order to answer this question, we first need to go back to the difference between concept search and content categorization. Because these are different technologies, we really need to separately compare concept search versus keyword search and content categorization versus keyword search. Let’s start with content categorization and keyword search.

The issue with this comparison is that keyword search and content categorization do different things. Keyword search can be used in many ways in e-discovery. The two most common are: (1) analysis or case assessment: finding the hot documents and understanding the matter by determining who knew what, when, how and why, etc., and (2) culling: removing non-responsive documents and/or identifying potentially privileged documents in order to reduce a large, starting set of documents to a smaller set before review.

Content categorization, on the other hand, has historically been used within the review phase of e-discovery. Categorization can help reviewers to better understand the documents they are reviewing and thus potentially increase the speed of review. Practitioners with whom I have worked also find that categorization can be useful during analysis by helping to understand a matter and identify potentially important keywords.

However, content categorization has not been used as part of culling. First, culling needs to be transparent. You need to be able to get agreement with or at least explain to the opposing side and the court exactly how you have culled the data set. If you cull based on categories of documents that have been generated by a proprietary, black-box algorithm, it’s going to be difficult to gain agreement on or explain your culling methodology. This is why the typical method of culling is still to use keyword search and either agree on the set of search terms with the opposing side or to use e-discovery search best practices to perform keyword searches on your own.
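To make the transparency point concrete, here is a minimal Python sketch of keyword culling that records which agreed term pulled in which document, so the method can be explained to the other side. The function name `cull_by_keywords`, the sample documents, and the search terms are all hypothetical illustrations, not taken from any e-discovery product.

```python
import re


def cull_by_keywords(documents, terms):
    """Keep documents that hit at least one agreed term.

    Returns the kept document ids plus a per-term hit report that
    can be shared to explain exactly how the set was culled.
    """
    # Whole-word, case-insensitive match for each agreed search term.
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    hits_per_term = {term: [] for term in terms}
    kept = []
    for doc_id, text in documents.items():
        matched = [term for term, pattern in patterns.items() if pattern.search(text)]
        for term in matched:
            hits_per_term[term].append(doc_id)
        if matched:
            kept.append(doc_id)
    return kept, hits_per_term


# Hypothetical three-document collection and two agreed terms.
documents = {
    "doc1": "We plan to terminate the contract with the vendor.",
    "doc2": "Lunch menu for Friday.",
    "doc3": "HR will review the hiring policy before the layoff.",
}
kept, report = cull_by_keywords(documents, ["terminate", "layoff"])
print(kept)
print(report)
```

Because the report maps every term to the documents it retrieved, anyone can re-run the same terms and check the result, which is exactly what a black-box category assignment does not offer.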

Second, content categorization has its own issues when it comes to being over- and under-inclusive. There is no guarantee that your group of documents that have been categorized as being related to, for example, a company’s hiring policies include all of the documents in your matter related to hiring policies or that they do not include some documents that may not really be related to hiring policies. Content categorization, like keyword search and virtually every information retrieval technology, is not perfect.

So what about concept search technology? Surely, concept search technology is better than old, boring keyword search. Well, actually it’s not that clear-cut. The problem with concept search technology is that while it might find more relevant documents than plain keyword search, it will also likely find more false positives. Imagine searching for documents containing “terminate” in an employment matter and your concept search technology automatically searching for “fire”, “dismiss”, etc. as well. You’ll find more documents related to the termination of employees, but you’ll also find a lot more non-relevant documents concerning house fires, the fire department, etc.
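The "terminate" example can be sketched in a few lines of Python. Everything here is an assumption for illustration: the tiny thesaurus, the five sample documents, and the hand-assigned relevance flags are made up, and matching is on exact lowercase words with no stemming.

```python
import re

# Hypothetical thesaurus entry mirroring the "terminate" example.
THESAURUS = {"terminate": {"fire", "dismiss"}}

# Hypothetical documents: (text, relevant to the employment matter?)
DOCS = {
    "d1": ("The manager decided to terminate the employee.", True),
    "d2": ("They plan to fire him next week.", True),
    "d3": ("The fire department responded to a house fire.", False),
    "d4": ("The board voted to dismiss the director.", True),
    "d5": ("Quarterly sales figures attached.", False),
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def search(terms):
    """Return ids of documents containing at least one of the terms."""
    return {doc_id for doc_id, (text, _) in DOCS.items() if tokenize(text) & terms}


def precision_recall(found):
    """Precision: share of hits that are relevant. Recall: share of relevant docs found."""
    relevant = {doc_id for doc_id, (_, is_relevant) in DOCS.items() if is_relevant}
    true_positives = len(found & relevant)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


keyword_hits = search({"terminate"})
concept_hits = search({"terminate"} | THESAURUS["terminate"])

print("keyword:", sorted(keyword_hits), precision_recall(keyword_hits))
print("concept:", sorted(concept_hits), precision_recall(concept_hits))
```

On this toy set the plain keyword search finds only d1 (every hit relevant, but two relevant documents missed), while the expanded search finds all three relevant documents and also pulls in the fire department document d3. That is the trade-off in miniature: less under-inclusive, more over-inclusive.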

So concept search can help address the under-inclusive problem with keyword search (though it won’t solve it) and can be helpful during analysis. But it can often increase the over-inclusive problem. In addition, today’s concept search technologies share the transparency problem with concept categorization. These technologies have largely been designed as “black boxes”, which, as I have discussed in the past, makes sense for Enterprise search but not for e-discovery search, and, as a result, could also be potentially difficult to explain and defend. For these reasons, concept search technology isn’t used very much in e-discovery today. In order for its use to become widespread, it will need to become more transparent. But that’s a topic for another day.

The bottom line here is that despite all the hype, concept search and content categorization technologies do not solve all the challenges of e-discovery search. Both of these technologies can be very useful and the technology behind them is always improving. However, as most of the experienced practitioners I work with already know, these technologies are generally better thought of as supplements to keyword search, not replacements. The important question is not whether to use one technology over the other but which technology is best suited to your objectives and how best to use all the available technologies to achieve the desired goal.

Labels: Conceptual Search, culling, eDiscovery, ediscovery 2.0, Keyword Search, Will Uppington

The Fog is Lifting on Concept Search in eDiscovery

As I have been reporting on my Blog for the last 6 months, Concept search is becoming a very hot topic in the world of eDiscovery. Brought to the forefront by cases such as Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md. May 29, 2008), many in the litigation industry are now moving beyond initial interest in conceptual search to wanting/demanding a more in depth understanding of the technology and the value proposition vs. legacy boolean and keyword search technology. Further, and in some respects in response to this new appreciation for conceptual search, numerous new pure play conceptual search technology vendors are emerging along with most, if not all, of the robust eDiscovery vendors announcing that they now also have conceptual search capabilities.

Unfortunately, I believe that there is still tremendous confusion in regards to what conceptual search actually does and when it provides the best return on investment. Therefore, in an effort to educate the followers on my Blog, I would like to recommend a series of blog posts titled "Demystifying Concept Search in Electronic Discovery" by Will Uppington on the e-discovery 2.0 Blog. The full text of Will's post is as follows:

Concept or content search continues to be a hot topic within the e-discovery community. There’s a continuous stream of articles that discuss it. Some that point out the positive. Others that point out the limitations. The courts have also gotten involved in the discussion. Judge Grimm refers to concept search in e-discovery in Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md. May 29, 2008). Judge Facciola discusses concept search in Disability Rights Council of Greater Washington v. Washington Metropolitan Transit Authority, 242 F.R.D. 139 and other opinions. Despite (or maybe because of) all the commentary on this topic, I find that while a lot of people think that concept search in e-discovery is good, many are not fully sure of exactly what concept search is, and how it is practically useful in e-discovery. It’s pretty clear that after several years of commentary and hype, concept search has become something of a buzzword associated with many myths and misconceptions. In an effort to better understand what concept search is and how it can help in e-discovery, I want to dispel two of the most common myths I have heard.

The “Concept Search is Concept Search” Myth

The first myth around concept search actually revolves around what it is. In my experience, people tend to lump two different technologies together when talking about concept search: concept search and concept categorization. It’s very common, for example, to see commentators say concept search even when what they are really talking about is concept categorization. To make matters more confusing, people also use a plethora of other names including content search, content clustering or concept clustering when what they really mean is concept categorization.

So, what are the differences between concept search and concept categorization? First, let’s start with concept search. Concept search technologies find documents containing “concepts”. I think that the Sedona Conference’s “Best Practices Commentary on the Use of Search & Information Retrieval Methods in E-Discovery” provides a good definition of “concept” when used in a search context: “the combination of [a] query term and the additional terms identified by the thesaurus.” In other words, concept search technologies find documents containing a specified term plus additional terms with similar meanings derived from a thesaurus.
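Read literally, that definition can be sketched in Python as a query term plus its thesaurus terms, matched against each document. This is only an illustration under stated assumptions: the thesaurus entry, the sample documents, and the function names `build_concept` and `concept_search` are hypothetical, and real products use far richer thesauri and matching.

```python
import re

# Hypothetical thesaurus: query term -> additional terms with similar meaning.
THESAURUS = {"car": ["automobile", "vehicle"]}


def build_concept(query_term, thesaurus):
    """A 'concept' in the quoted sense: the query term plus its thesaurus terms."""
    return [query_term] + list(thesaurus.get(query_term, []))


def concept_search(query_term, documents, thesaurus):
    """Return {doc_id: matched terms} for documents containing any term in the concept."""
    concept = build_concept(query_term, thesaurus)
    results = {}
    for doc_id, text in documents.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        matched = [term for term in concept if term in words]
        if matched:
            results[doc_id] = matched
    return results


# Hypothetical sample documents.
documents = {
    "a": "The car was parked outside.",
    "b": "The vehicle was towed on Monday.",
    "c": "Meeting notes from Tuesday.",
}
results = concept_search("car", documents, THESAURUS)
print(results)
```

Document "b" never mentions "car", yet it is returned because "vehicle" is part of the concept. Keeping the matched terms alongside each hit also shows why a document was retrieved.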

Concept categorization, on the other hand, is actually not a search technology at all. Concept categorization technologies do not “find” documents. Rather, they categorize or group documents based on their similarity. There are many different ways to group documents based on similarity. Techniques include statistical (which assesses similarity based on word frequency), Bayesian classification (which weights words differently depending on factors in addition to statistical frequency, such as where the terms appear in a document), and semantic indexing (which takes into account the fact that many words used in a similar context may have a similar meaning). It would take more time to describe these technologies in detail but the Sedona commentary has a good summary of these different technologies if you are interested in learning more.
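As a rough illustration of the simplest of these techniques, the statistical one, the Python sketch below groups documents by the similarity of their word counts. It is a toy under stated assumptions: the cosine similarity measure, the greedy grouping rule, the 0.3 threshold, and the sample documents are my own choices for illustration, with no stop-word removal, and it does not represent Bayesian classification or semantic indexing.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase word appears in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_documents(documents, threshold=0.3):
    """Greedy grouping: join the first group whose first document is similar enough."""
    groups = []  # each entry: (word counts of the group's first document, member ids)
    for doc_id, text in documents.items():
        vector = term_frequencies(text)
        for representative, members in groups:
            if cosine_similarity(vector, representative) >= threshold:
                members.append(doc_id)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append((vector, [doc_id]))
    return [members for _, members in groups]


# Hypothetical sample documents.
documents = {
    "d1": "hiring policy for new employees",
    "d2": "updated hiring policy and employees handbook",
    "d3": "server outage report for the data center",
    "d4": "data center server maintenance report",
}
groups = group_documents(documents)
print(groups)
```

Note that nothing here is "found" by a query. The documents are simply sorted into groups, which is why categorization and search are different tools.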

As should now be apparent, these technologies are very different and using the same words to describe them is confusing. It’s why it’s not surprising that a lot of the users of e-discovery services and software don’t have a strong understanding of what these technologies are or what benefits they can actually provide in practice. Dispelling the myth that they can be lumped together is a critical first step in any conversation about concept search and how it can help in e-discovery. This leads us to a second myth, that Concept Search is better than Keyword Search. I’ll discuss this in my next blog post.

Labels: Conceptual Search, Enterprise Search, Judge Grimm, The Sedona Conference, Victor Stanley Inc. v. Creative Pipe Inc.
