The eDiscovery Paradigm Shift: The Fog is Lifting on Concept Search in eDiscovery

Tuesday, December 9, 2008

The Fog is Lifting on Concept Search in eDiscovery

As I have been reporting on my blog for the last six months, concept search is becoming a very hot topic in the world of eDiscovery. Brought to the forefront by cases such as Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md. May 29, 2008), many in the litigation industry are now moving beyond initial interest in conceptual search to demanding a more in-depth understanding of the technology and its value proposition versus legacy Boolean and keyword search technology. Further, and partly in response to this new appreciation for conceptual search, numerous new pure-play conceptual search technology vendors are emerging, and most, if not all, of the robust eDiscovery vendors are announcing that they now also have conceptual search capabilities.

Unfortunately, I believe that there is still tremendous confusion about what conceptual search actually does and when it provides the best return on investment. Therefore, in an effort to educate the followers of my blog, I would like to recommend a series of blog posts titled "Demystifying Concept Search in Electronic Discovery" by Will Uppington on the e-discovery 2.0 blog. The full text of Will's post is as follows:

Concept or content search continues to be a hot topic within the e-discovery community. There’s a continuous stream of articles that discuss it. Some that point out the positive. Others that point out the limitations. The courts have also gotten involved in the discussion. Judge Grimm refers to concept search in e-discovery in Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md. May 29, 2008). Judge Facciola discusses concept search in Disability Rights Council of Greater Washington v. Washington Metropolitan Transit Authority, 242 F.R.D. 139 and other opinions. Despite (or maybe because of) all the commentary on this topic, I find that while a lot of people think that concept search in e-discovery is good, many are not fully sure of exactly what concept search is, and how it is practically useful in e-discovery. It’s pretty clear that after several years of commentary and hype, concept search has become something of a buzzword associated with many myths and misconceptions. In an effort to better understand what concept search is and how it can help in e-discovery, I want to dispel two of the most common myths I have heard.

The “Concept Search is Concept Search” Myth
The first myth around concept search actually revolves around what it is. In my experience, people tend to lump two different technologies together when talking about concept search: concept search and concept categorization. It’s very common, for example, to see commentators say concept search even when what they are really talking about is concept categorization. To make matters more confusing, people also use a plethora of other names including content search, content clustering or concept clustering when what they really mean is concept categorization.

So, what are the differences between concept search and concept categorization? First, let’s start with concept search. Concept search technologies find documents containing “concepts”. I think that the Sedona Conference’s “Best Practices Commentary on the Use of Search & Information Retrieval Methods in E-Discovery” provides a good definition of “concept” when used in a search context: “the combination of [a] query term and the additional terms identified by the thesaurus.” In other words, concept search technologies find documents containing a specified term plus additional terms with similar meanings derived from a thesaurus.
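To make that definition concrete, here is a minimal Python sketch of thesaurus-based query expansion. It is an illustration only, not how any particular e-discovery product works: the tiny `THESAURUS` dictionary, the sample documents, and the function names are all hypothetical, and the sketch assumes a document matches if it contains the query term or any of its thesaurus terms.

```python
import re

# Hypothetical toy thesaurus; a real system would use a far larger resource.
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "contract": {"agreement"},
}


def expand_query(term, thesaurus=THESAURUS):
    """Build the 'concept': the query term plus its thesaurus terms."""
    term = term.lower()
    return {term} | thesaurus.get(term, set())


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def concept_search(term, documents, thesaurus=THESAURUS):
    """Return ids of documents containing any term in the expanded concept."""
    concept = expand_query(term, thesaurus)
    return [doc_id for doc_id, text in documents.items()
            if concept & tokenize(text)]


# Made-up sample documents for illustration.
documents = {
    "doc1": "The automobile was delivered on Friday.",
    "doc2": "Please review the attached agreement.",
    "doc3": "Lunch is at noon.",
}

print(concept_search("car", documents))  # ['doc1']
```

A plain keyword search for "car" would find nothing in these three documents; the expanded concept picks up doc1 because the thesaurus links "car" to "automobile".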

Concept categorization, on the other hand, is actually not a search technology at all. Concept categorization technologies do not “find” documents. Rather, they categorize or group documents based on their similarity. There are many different ways to group documents based on similarity. Techniques include statistical methods (which assess similarity based on word frequency), Bayesian classification (which weights words differently depending on factors in addition to statistical frequency, such as where the terms appear in a document), and semantic indexing (which takes into account the fact that many words used in a similar context may have a similar meaning). It would take more time to describe these technologies in detail, but the Sedona commentary has a good summary of these different technologies if you are interested in learning more.
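As a rough illustration of the simplest of these approaches, the word-frequency one, here is a minimal Python sketch that groups documents by the cosine similarity of their word counts. Everything in it is an assumption made for illustration: the sample documents, the `threshold` of 0.5, and the greedy "join the first similar group" strategy. It does not attempt Bayesian classification or semantic indexing, and real categorization engines are considerably more sophisticated.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each word appears in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_documents(documents, threshold=0.5):
    """Greedy grouping: a document joins the first group whose first
    member is similar enough; otherwise it starts a new group."""
    groups = []  # list of (representative word counts, member ids)
    for doc_id, text in documents.items():
        vec = term_frequencies(text)
        for representative, members in groups:
            if cosine_similarity(vec, representative) >= threshold:
                members.append(doc_id)
                break
        else:
            groups.append((vec, [doc_id]))
    return [members for _, members in groups]


# Made-up sample documents for illustration.
documents = {
    "a": "pipe pricing contract pipe supplier",
    "b": "contract pricing for pipe supplier",
    "c": "lunch menu friday lunch",
    "d": "friday lunch menu options",
}

print(group_documents(documents))  # [['a', 'b'], ['c', 'd']]
```

Note that nothing here is "searched for": no query is supplied, and the documents are simply sorted into groups according to how much vocabulary they share.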

As should now be apparent, these technologies are very different, and using the same words to describe them is confusing. That is why it’s not surprising that a lot of the users of e-discovery services and software don’t have a strong understanding of what these technologies are or what benefits they can actually provide in practice. Dispelling the myth that they can be lumped together is a critical first step in any conversation about concept search and how it can help in e-discovery. This leads us to a second myth: that concept search is better than keyword search. I’ll discuss this in my next blog post.

Labels: Conceptual Search, Enterprise Search, Judge Grimm, The Sedona Conference, Victor Stanley Inc. v. Creative Pipe Inc.
