The ease with which we have long retrieved information from the World Wide Web (WWW), using increasingly efficient and high-quality search engines, underscores the less-than-impressive performance of search engines serving the enterprise environment. Off-the-shelf tools that let organizations retrieve their enterprise information simply do not deliver the same experience as Google or Bing. But what if you could build your own enterprise information retrieval system by leveraging open source tools and platforms?
In this blog, we will explore the feasibility of doing just that.
A Tougher Job
The reason enterprise searches fall short is that organizational information retrieval (IR) is actually a much more difficult task. An organizational IR system must handle a variety of document formats, unique dictionaries, and domain-specific metadata. For example, the frequency of the term RecoverPoint differs between WWW documents and EMC documents. Furthermore, an organizational IR system has to function in a highly redundant data environment with numerous duplications and file hierarchies. For example, different organizational units may store the same document under different directories, tagged with different metadata. In addition, organization-specific frontend issues, such as interpreting users’ queries and presenting the most relevant results, need to be addressed to optimize results and user experience.
Vendors in the information retrieval market provide tools that help people find what they need in enterprise environments by discovering information, indexing it, and communicating it back to the user efficiently. There are dozens of these tools today. However, recent polls and reviews suggest that there is no “magic solution” that works for every business problem or every data environment. Some vendors provide an excellent user experience but weaker natural-language processing; others work perfectly well – but only for a specific business domain (e.g. manufacturing). In some cases, connectivity options are poor, preventing the ingestion of some data sources. In other cases, cloud support is limited (solutions cannot connect to, or will not run on, the cloud) or mobile functionality is lacking.
In general, it is clear that off-the-shelf solutions do not meet every business or organizational need. If a business has specialized requirements, building its own tool may be better suited to addressing them, avoiding scalability issues and feature overload. However, the “Build vs. Buy” question is rarely considered in the information retrieval market, probably because the ‘build’ approach is seen as difficult and time-consuming, requiring deep expertise to develop and maintain such an ‘owned’ retrieval system.
How to Build It Yourself
In this blog post, we won’t pretend to give a decisive answer, but we will demonstrate feasibility using, as an example, a system we developed within two weeks as part of professional training within our Data Science-as-a-Service team. The system stores textual documents in a database and retrieves relevant documents based on users’ queries. In the next paragraphs, we describe how this system was built and implemented, first outlining its general components and then describing how those components were implemented using open source tools.
As presented in figure 1, the system runs two concurrent sub-processes: data collection and information retrieval. Data collection covers extracting each document’s textual content, pre-processing it, and converting the document into a vector representation. The information retrieval sub-process enables users to retrieve relevant documents based on their textual queries.
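The two sub-processes can be sketched end to end in a few lines of Python (the actual system used Java/Spring and GemFire; the function names, document IDs, and contents below are illustrative, and the retrieval step here uses simple query-term overlap rather than the ranking function described later):

```python
# In-memory stand-in for the document store (GemFire in the actual system).
store = {}

def collect(doc_id, text):
    """Data collection: parse the text and store its term-vector representation."""
    terms = text.lower().split()
    store[doc_id] = {t: terms.count(t) for t in set(terms)}

def retrieve(query):
    """Information retrieval: rank stored documents by query-term overlap."""
    q = query.lower().split()
    scores = {doc_id: sum(vec.get(t, 0) for t in q) for doc_id, vec in store.items()}
    return sorted(scores, key=scores.get, reverse=True)

collect("doc1", "backup and restore procedures for backup servers")
collect("doc2", "cafeteria menu for the week")
```

Both sub-processes share only the store, which is what lets them run concurrently in the real system.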
The data collection process was implemented using Spring Integration (SI), an open source Java framework that encourages highly modular software development. We found the SI framework very suitable for most data collection tasks, as it already provides the relevant boilerplate components. For example, a connection to an external data source can be established using an “inbound channel adapter,” an SI component that connects a message channel to a single sender (e.g. SQL Server, local file system, HTTP, TCP, a Twitter stream, etc.).
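As a rough illustration, a file-system inbound channel adapter can be declared in SI’s XML configuration along these lines (the id, channel name, directory, and polling rate below are hypothetical, not taken from the system described here):

```xml
<!-- Polls a directory every 5 seconds and publishes each new file
     as a message on the "documentsIn" channel. -->
<int-file:inbound-channel-adapter id="docSource"
    directory="file:/data/incoming-docs"
    channel="documentsIn"
    prevent-duplicates="true">
  <int:poller fixed-rate="5000"/>
</int-file:inbound-channel-adapter>
```

Swapping the data source then means swapping the adapter, not the downstream parsing logic.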
Since SI enforces separation between business logic and integration logic, the document parsing and representation steps become much easier. In an IR system, documents are commonly represented by a “vector of terms.” A vector of terms (a term being, in our case, a word or a phrase) maps each term to a counter giving the number of its occurrences in the given document. The parsing and representation steps include a stemming process, in which each word is conflated to its dictionary form, so users who search for information about “building” will also get documents that contain “build” and “built.”
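In Python, the idea can be sketched as follows (the toy stemmer below is a stand-in for a real algorithm such as Porter’s; it handles only a few suffixes plus an irregular-form lookup):

```python
from collections import Counter
import re

IRREGULAR = {"built": "build"}  # irregular forms handled by lookup

def stem(word):
    """Toy stemmer: conflate a word toward its dictionary form."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_vector(text):
    """Map each stemmed term to its occurrence count in the document."""
    return Counter(stem(w) for w in re.findall(r"[a-zA-Z]+", text))

vec = term_vector("Building a search index: we build what others built.")
# "building", "build" and "built" all conflate to the single term "build"
```

With this representation, matching a query against a document reduces to comparing two counters.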
One of the goals of our system is to provide scalable processing in a corporate environment. Spring Integration itself is less suitable for this purpose. However, it is easy to port any code written in SI to Spring XD, a distributed and extensible system for data ingestion. Using Spring XD makes it possible to scale out the implementation described above.
The collected data and its representation vectors are stored in an in-memory GemFire database. In parallel with the continuous data collection sub-process, users can retrieve information from GemFire by entering the desired terms. As in most information retrieval systems, we rank documents by estimating their relevance to the user’s query. The ranking relies on a numeric score based on the Okapi BM25 ranking function, widely used by search engines. The Okapi score takes into account factors such as the term’s frequency in the document, its frequency in the query, the number of documents that contain the term, document length, and so on. As developers, we can choose alternative algorithms for our retrieval task. For example, if the query contains a known topic, the engine can present relevant documents based on topic modelling.
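A minimal Python sketch of Okapi BM25 scoring follows; the corpus contents are made up for illustration, and the k1 and b values are the textbook defaults, not necessarily those used in the system described above:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25: sum, over query terms, of IDF times a saturated term frequency."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # documents containing the term
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # rarer terms weigh more
        tf = doc.count(term)                             # term frequency in this document
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Documents as stemmed token lists.
corpus = [
    ["recoverpoint", "replication", "backup", "guide"],
    ["backup", "restore"],
    ["cafeteria", "menu"],
]
ranked = sorted(range(len(corpus)),
                key=lambda i: bm25_score(["backup"], corpus[i], corpus),
                reverse=True)
```

Note how the length normalization favors the shorter of the two matching documents: a term occurrence counts for more in a short document than in a long one.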
To test the system, we ran it on Stack Overflow documents, and it showed very encouraging results from both a performance and a document-relevance perspective.
To conclude, implementing a simple information retrieval system using only open source tools is feasible in a relatively short period (two weeks in this case). In addition, the system is highly modular, so its storage architecture and retrieval algorithms can be replaced with minimal development effort. The “Build vs. Buy” question is therefore reasonable and should be considered by organizations when it comes to information retrieval requirements.
Think about it… Are you satisfied with the results of your organizational IR system? We hope this blog has provided you with some information to consider and will help you in choosing the best approach for your own IR solution.
Tags: Analytics, Big Data, data science, source:itb