The eDiscovery Paradigm Shift: Concept Search vs. Keyword Search in eDiscovery

Tuesday, December 9, 2008

Concept Search vs. Keyword Search in eDiscovery

Having grown up in the enterprise class solutions world with relational databases, I am very comfortable with SQL query based searching. And, over the past several years, with my focus on the litigation market and litigation technology, I have become very familiar with keyword searching against scanned and OCR'd document files. However, given my passion for "leading edge" technology and solutions that can meet the demands of a market going through a paradigm shift, I have not been overly excited or impressed with the state of search technology in eDiscovery.

However, that has all changed with the emergence of conceptual search technology. As such, I have spent a tremendous amount of time researching conceptual search and how it compares from both a technology standpoint and a business value standpoint. And, although I have come to the early conclusion that there is room and a need for all three, I have also determined that there is still a tremendous amount of confusion regarding concept search vs. keyword search technology and the best use of both.

Therefore, in an effort to keep the followers of my blog informed, I have been following a series of excellent posts on the ediscovery 2.0 blog discussing conceptual search. Following is the full text of the latest post titled "Concept Search Versus Keyword Search in Electronic Discovery" by Will Uppington:

In my last post, I started a discussion on the myths surrounding concept search. The first myth I dispelled was the “concept search is concept search” myth. The myth is that there is an agreed upon definition of concept search. In actuality, when people in e-discovery use the term concept search, they don’t always mean the same thing. Frequently they are not actually talking about concept search technology at all and are actually talking about concept or content categorization technology, which is very different. The second myth that needs dispelling is that concept search is better than keyword search.

The thinking behind this myth goes something like this:

Keyword search has a lot of problems. It is prone to being over-inclusive, i.e., finding some non-relevant documents, and under-inclusive, i.e., not finding some relevant documents. Concept search technologies are new and interesting and using these technologies you can find documents that keyword search can’t find. Therefore, concept search must be better than keyword search.

Let’s examine this thinking. The first two statements are accurate. Keyword search is not perfect and can produce over- and under-inclusive results. And concept search and content categorization technologies can both help identify documents that keyword search technologies might not find. However, the conclusion that concept search is better than keyword search is not valid and doesn’t follow from these two statements. Why?

In order to answer this question, we first need to go back to the difference between concept search and content categorization. Because these are different technologies, we really need to separately compare concept search versus keyword search and content categorization versus keyword search. Let’s start with content categorization and keyword search.

The issue with this comparison is that keyword search and content categorization do different things. Keyword search can be used in many ways in e-discovery. The two most common are: (1) analysis or case assessment: finding the hot documents and understanding the matter by determining who knew what, when, how and why, etc., and (2) culling: removing non-responsive documents and/or identifying potentially privileged documents in order to reduce a large, starting set of documents to a smaller set before review.
Content categorization, on the other hand, has historically been used within the review phase of e-discovery. Categorization can help reviewers to better understand the documents they are reviewing and thus potentially increase the speed of review. Practitioners with whom I have worked also find that categorization can be useful during analysis by helping to understand a matter and identify potentially important keywords.

However, content categorization has not been used as part of culling. First, culling needs to be transparent. You need to be able to get agreement with or at least explain to the opposing side and the court exactly how you have culled the data set. If you cull based on categories of documents that have been generated by a proprietary, black-box algorithm, it’s going to be difficult to gain agreement on or explain your culling methodology. This is why the typical method of culling is still to use keyword search and either agree on the set of search terms with the opposing side or to use e-discovery search best practices to perform keyword searches on your own.

Second, content categorization has its own issues when it comes to being over- and under-inclusive. There is no guarantee that your group of documents that have been categorized as being related to, for example, a company’s hiring policies include all of the documents in your matter related to hiring policies or that they do not include some documents that may not really be related to hiring policies. Content categorization, like keyword search and virtually every information retrieval technology, is not perfect.
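The transparency point about keyword culling can be made concrete with a small Python sketch. It assumes a toy in-memory document set; the function name `cull_by_keywords`, the sample documents, and the search terms are all hypothetical and are not drawn from any e-discovery product. The idea is only that a keyword cull can report exactly which terms were used and how many documents each one hit, which is the kind of record that can be shared with the opposing side.

```python
import re
from collections import Counter


def cull_by_keywords(documents, terms):
    """Keep documents matching any term; also report per-term hit counts.

    Matching is case-insensitive and on whole words or phrases.
    """
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    hits_per_term = Counter({term: 0 for term in terms})
    kept_ids = []
    for doc_id, text in documents.items():
        matched = [term for term, pattern in patterns.items() if pattern.search(text)]
        for term in matched:
            hits_per_term[term] += 1
        if matched:
            kept_ids.append(doc_id)
    return kept_ids, dict(hits_per_term)


# Hypothetical sample data, for illustration only.
documents = {
    "doc1": "Revised hiring policy for the sales team.",
    "doc2": "Lunch menu for Friday.",
    "doc3": "Interview notes and the offer letter for the new hire.",
}
kept, report = cull_by_keywords(documents, ["hiring policy", "offer letter"])
print(kept)    # ids of documents that survive the cull
print(report)  # how many documents each agreed term matched
```

Because the output is just a list of terms and counts, both sides can inspect and rerun it, which is harder to do with categories produced by an opaque algorithm.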

So what about concept search technology? Surely, concept search technology is better than old, boring keyword search. Well, actually it’s not that clear-cut. The problem with concept search technology is that while it might find more relevant documents than plain keyword search, it will also likely find more false positives. Imagine searching for documents containing “terminate” in an employment matter and your concept search technology automatically searching for “fire”, “dismiss”, etc. as well. You’ll find more documents related to the termination of employees, but you’ll also find a lot more non-relevant documents concerning house fires, the fire department, etc.
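The "terminate" example can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the six toy documents, the hand-written `EXPANSIONS` table standing in for whatever a real concept search engine would generate, and the set of documents marked as relevant. The sketch measures precision (how many retrieved documents are relevant) and recall (how many relevant documents are retrieved), which correspond to the over-inclusive and under-inclusive problems respectively.

```python
def search(documents, terms):
    """Return the ids of documents containing any of the terms as a whole word."""
    wanted = set(terms)
    return {
        doc_id
        for doc_id, text in documents.items()
        if wanted & set(text.lower().split())
    }


def precision_recall(retrieved, relevant):
    """Precision = relevant share of what was retrieved; recall = share of relevant found."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical employment-matter documents, for illustration only.
documents = {
    "d1": "we will terminate his contract on friday",
    "d2": "hr decided to fire the regional manager",
    "d3": "the board voted to dismiss the director",
    "d4": "the fire department inspected the warehouse",
    "d5": "insurance claim for the house fire",
    "d6": "quarterly sales figures attached",
}
relevant = {"d1", "d2", "d3"}  # assumed ground truth

# Hand-written stand-in for the related terms a concept search engine might add.
EXPANSIONS = {"terminate": ["fire", "dismiss"]}

keyword_hits = search(documents, ["terminate"])
concept_hits = search(documents, ["terminate"] + EXPANSIONS["terminate"])

print(sorted(keyword_hits), precision_recall(keyword_hits, relevant))
print(sorted(concept_hits), precision_recall(concept_hits, relevant))
```

In this toy set the plain keyword search finds only one of the three relevant documents but returns nothing irrelevant, while the expanded search finds all three and also pulls in the two documents about the fire department and the house fire.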

So concept search can help address the under-inclusive problem with keyword search (though it won’t solve it) and can be helpful during analysis. But it can often increase the over-inclusive problem. In addition, today’s concept search technologies share the transparency problem with concept categorization. These technologies have largely been designed as “black boxes”, which, as I have discussed in the past, makes sense for enterprise search but not for e-discovery search, and, as a result, could also be potentially difficult to explain and defend. For these reasons, concept search technology isn’t used very much in e-discovery today. In order for its use to become widespread, it will need to become more transparent. But that’s a topic for another day.

The bottom line here is that despite all the hype, concept search and content categorization technologies do not solve all the challenges of e-discovery search. Both of these technologies can be very useful and the technology behind them is always improving. However, as most of the experienced practitioners I work with already know, these technologies are generally better thought of as supplements to keyword search, not replacements. The important question is not whether to use one technology over the other but which technology is best suited to your objectives and how best to use all the available technologies to achieve the desired goal.

Labels: Conceptual Search, culling, eDiscovery, ediscovery 2.0, Keyword Search, Will Uppington

5 Comments:

At December 9, 2008 at 3:09 PM , Unknown said...: Thanks for sharing Charles - and some additional material/thoughts in this area that I have found very informative can be seen at the OrcaTec website - (Herb Roitblat) - at http://www.orcatec.com/whitepapers/.

Additionally, in considering the article by Mr. Uppington - I might suggest the following thoughts:

1) He does not make the case that keyword search is better, only that it might not be worse. That is not a strong argument.

2) He makes the case that transparency is good, but who would disagree with that? Attorneys have to be able to understand what they have done.

3) His argument is of the form--they don't solve all of the problems, so why bother. In fact, his conclusion completely undermines any argument he might want to raise: Concept search is a supplement to keyword search, but surprise, surprise, concept search already includes keyword search.

These considerations are by no means original (gleaned from discussions with others) - but may add some context to the article - as well as to your points.
At December 9, 2008 at 4:11 PM , Charles Skamser said...: I appreciate your comments about Mr. Uppington not taking a very strong stance on keyword vs. conceptual search. In reality, I think that when they are used appropriately, it would be difficult to compare keyword searching with conceptual searching because in most cases they would be used in very different situations. In my opinion, keyword searching is best suited to finding documents that literally contain the keywords that you are searching for, without regard to context. An example of this would be an initial review of documents looking for an instance of a custodian's name or company. An example of conceptual search may be to search the keyword result set for conceptually similar documents (i.e., documents with a high incidence of additional relevant keywords within the result set). Utilizing both technologies would enable you to cull your data set down to a more relevant set than using just one or the other.

Obviously, even in my simple example, if you don't really have any true keywords to start with and are searching documents based upon date parameters, conceptual search would be a much better place to start, as it would provide you with a set of potential keywords. In this case, unless you wanted to guess which keywords might work, conceptual search would be a more appropriate approach.

Finally, I agree 100% regarding your comment about transparency.
At December 9, 2008 at 7:37 PM , Shirley said...: Microsoft Office SharePoint Server 2007 is the Microsoft enterprise search solution for organizations that want to increase productivity and reduce information overload by providing their employees, partners, and customers the ability to find relevant content in a wide range of repositories and formats.

With actionable search results that respect security permissions, Office SharePoint Server 2007 lets users go beyond documents and across repositories to unlock information, find people, and locate expertise in the enterprise.

For more information about this solution, you can visit http://www.nsynergy.com/Products/SharePoint/Pages/Enterprise_Search.aspx.
At December 10, 2008 at 10:52 AM , Herbert L Roitblat, Ph.D said...: There are many technologies that could be called "concept search." There are multiple ways to generate concepts. Among these are ontology/taxonomy/thesaurus-based methods, where knowledge engineers construct lists of related terms, and there are neural network, language modeling, and statistical systems that learn about how words are related in the context of a set of documents. Both approaches have limitations and benefits. The first group takes a lot of effort to build up a set of relationships; the second approach does not require any preconstruction of relationships. The first will give you exactly what you expect; the second will sometimes give surprising results and occasionally uninteresting results. There are other differences as well, but that is not my point. Neither of these can or should be a black box. Neither is really very difficult to explain.
Both approaches work by expanding a search query to include additional terms. (Clustering approaches are a bit different.) Whatever terms a user enters, the system finds related terms and adds them to the query. These additional terms focus the search and allow additional documents to be retrieved. This focus aspect is often lost in our discussions of concept search. You can see a clear example of the focus added by concept search in our Web search engine, Truevert (www.truevert.com). Truevert uses the same concept search technology as in our Information Discovery Toolkit and applies it to green search on the Internet. A search for "CFL" (http://www.truevert.com/search?query=cfl) returns green results--compact fluorescent light bulbs--rather than pages about the Canadian Football League. It understands what CFL means from the green pages that were used to train it. More subtly, a search for "meat" (http://www.truevert.com/search?query=meat) returns pages about organic meat and about the environmental impact of raising meat.
Concept search returns focused relevant results. When trained on the documents in the litigation, concept search will return a ranked list of documents, where the top-ranked ones are the most focused on the query as understood from the point of view of the document collection or the ontology. Concept search helps to highlight the most significant use of a term in a collection, it provides better education to the reviewers, and it brings together documents on the same topic to enable reviewers to better recognize the meaning of terms and their relationships in context.
A second problem that concept search is intended to address is the difficulty guessing the right words to search for. People use an amazing variety of words to say the same thing. Concept search allows users to find additional documents that may be about the same topic even if they use different words. In the second group of concept search tools, they learn that words do not occur at random. Instead, a document that has the word lawyer in it is likely to also have the words attorney, case, matter, judge, court, and so on in it. Conversely, documents that have attorney, etc. are likely to be about the same topic as those that contain the word lawyer.
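The idea that statistical tools learn which words tend to appear together can be illustrated with a minimal Python sketch. This is a toy document-level co-occurrence count, not the method used by any particular product; the sample documents, the `STOPWORDS` set, and the function names `build_cooccurrence` and `related_terms` are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Small assumed stopword list so that common words do not dominate the counts.
STOPWORDS = {"the", "to", "our", "will", "in", "and", "for"}


def build_cooccurrence(documents):
    """For each pair of words, count how many documents contain both."""
    cooc = defaultdict(Counter)
    for text in documents:
        words = sorted(set(text.lower().split()) - STOPWORDS)
        for a, b in combinations(words, 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def related_terms(cooc, term, min_docs=2):
    """Words that share at least min_docs documents with term, most frequent first."""
    counts = cooc.get(term, Counter())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, n in ranked if n >= min_docs]


# Hypothetical collection, for illustration only.
documents = [
    "the lawyer asked the judge to dismiss the case",
    "our lawyer will meet the judge in court",
    "the attorney and the lawyer reviewed the case",
    "the attorney filed the case in court",
    "lunch order for the team",
]

cooc = build_cooccurrence(documents)
print(related_terms(cooc, "lawyer"))    # candidate terms to add to a "lawyer" query
print(related_terms(cooc, "attorney"))  # candidate terms to add to an "attorney" query
```

Because the related terms come from plain counts over the collection, the expansion can be listed and explained term by term rather than treated as a black box.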
Finally, it is not that there are different situations where one might want to use concept search, nor is it the case that it is some independent process. Rather, in most tools, concept search includes and subsumes word search. You get both along with better ranking, more focus, and an improved ability to get the documents you really want.

You can learn more at www.orcatec.com.

Herb Roitblat
