Monday, August 15, 2011

Content Analytics: Integrating Natural Language Processing (NLP) in your ECM solutions

Natural Language Processing (NLP) is a branch of Artificial Intelligence that deals with the interaction between computers and human (natural) languages. IBM's Watson, which recently competed against human contestants on the game show Jeopardy!, is a great real-life example of an NLP system. NLP technology can be leveraged to solve problems in the unstructured content management space within enterprises.
Enterprises deal with a mountain of unstructured content in their day-to-day business processes. Until recently, organizations ingested this huge volume of data into a traditional content management system such as IBM FileNet or EMC Documentum and indexed it with a few fields like account number, name and location. Because the content is so wildly unstructured, most of that indexing is done manually. The index fields are then the only searchable information about each content object. This approach no longer meets the requirements of today's enterprises, which need content to be in motion and widely used across all departments.


To address this ever-growing problem, my organization, AlphaCloud Labs, decided to embrace NLP using machine learning. The paradigm of machine learning is different from that of most prior attempts at language processing. Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of rules. The machine-learning paradigm calls instead for using general learning algorithms (often, although not always, grounded in statistical inference) to automatically learn such rules through the analysis of large corpora of typical real-world examples. A corpus (plural, "corpora") is a set of documents (or sometimes, individual sentences) that have been hand-annotated with the correct values to be learned.
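
To make the contrast with hand-coded rules concrete, here is a minimal sketch in Python of learning from a hand-annotated corpus. It uses a small multinomial Naive Bayes classifier, chosen purely for illustration; it is not a description of our production system. The four-document corpus, the labels and the function names are all made up for this example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical hand-annotated corpus: (document text, correct label) pairs.
CORPUS = [
    ("claim filed for water damage to the insured property", "insurance"),
    ("policy holder requests renewal of auto insurance premium", "insurance"),
    ("patient admitted with chest pain and high blood pressure", "medical"),
    ("doctor prescribed medication after reviewing patient records", "medical"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(corpus):
    """Count words per label; these counts are the 'rules' learned from the data."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, label in corpus:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    tokens = tokenize(text)
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # prior for the label
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(CORPUS)
print(classify("patient reports pain after medication", model))  # prints "medical"
```

Nobody wrote a rule saying that "patient" implies a medical record; the association comes entirely from the annotated examples. A real corpus would of course need far more than four documents.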


Our NLP solutions address the following requirements:

  • Information retrieval (IR): This is concerned with storing, searching and retrieving information. It is a separate field within computer science (closer to databases), but IR relies on some NLP methods (for example, stemming). Some current research and applications seek to bridge the gap between IR and NLP.
  • Information extraction (IE): This is concerned in general with the extraction of semantic information from text. This covers tasks such as named entity recognition, coreference resolution, relationship extraction, etc.
  • Question answering: Given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?").
  • Automatic summarization: Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper.
  • Named entity recognition (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization).
  • Optical character recognition (OCR): Given an image representing printed text, determine the corresponding text.
As you can infer from the above, NLP can be widely leveraged to solve unstructured information and content analytics use cases in an enterprise. Typical use cases include risk management in financial services and insurance, electronic medical records in health care, and sentiment analysis for social media and brand management. Our NLP solution is built on an open-standards-based architecture that can be easily integrated into existing content management systems like FileNet, Content Manager, SharePoint and Documentum. One of our clients recently asked us to analyze 5TB of image data and provide a searchable interface for finding patterns and meaning in the documents. We are seeing a growing trend toward this kind of content analytics targeting unstructured content.
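
As a rough illustration of how information retrieval differs from searching a handful of index fields, the Python sketch below builds an inverted index over the full text of each document and applies stemming so that "claim" also matches "claims". Everything here is an assumption made for the example: the three sample documents stand in for text already extracted from content objects (for instance by OCR), and the suffix-stripping `stem` function is a deliberately crude stand-in for a real stemmer such as Porter's.

```python
import re
from collections import defaultdict

# Hypothetical extracted text, keyed by a made-up document id.
DOCUMENTS = {
    "doc-1": "Customer disputes charges on the account statement",
    "doc-2": "Statement of claims filed by the customer",
    "doc-3": "Account opening form for a new customer",
}

SUFFIXES = ("ing", "es", "ed", "s")


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into alphabetic words and stem each one."""
    return [stem(word) for word in re.findall(r"[a-z]+", text.lower())]


def build_index(documents):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(DOCUMENTS)
print(sorted(search(index, "claim")))              # ['doc-2']
print(sorted(search(index, "customer accounts")))  # ['doc-1', 'doc-3']
```

With this kind of index every word in the content becomes searchable, not just the fields someone keyed in manually. At multi-terabyte scale you would use a dedicated search engine rather than an in-memory dictionary, but the principle is the same.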

I hope this post has provided an introduction to how organizations can plan and create a strategy to leverage NLP for content analytics.




Friday, July 8, 2011

Content at Rest vs Content in Motion - What's good for your enterprise?

Enterprises without a clear information management strategy invariably manage unstructured content by storing it in file systems, mail servers and desktops. Content objects are rarely searched or accessed. Once an employee switches jobs, the device and user account hosting the content are erased as part of the off-boarding process, leading to information destruction and loss of data. This is typically the life cycle of content at rest: Content at Rest = Risk + Static Enterprise. In contrast, agile enterprises with a clear information strategy have content objects in motion. Content in motion supports business intelligence and analytics, enables collaboration and workflow, and delivers true customer service. Such organizations lead the market in innovation, rank high in customer satisfaction and are forward-looking in terms of return on investment to shareholders: Content in Motion = Reward + Agile Enterprise. Craig Rhinehart from IBM discusses the Content at Rest vs Content in Motion theory in detail on his blog: http://craigrhinehart.wordpress.com/2011/05/26/content-at-rest-or-content-in-motion-which-is-better/.

Organizations can transform Content at Rest to Content in Motion through the following actions:

  1. Identify Content: Inventory your current content objects by systematically identifying them across all departments.
  2. Defensibly Dispose: Dispose of content objects that are erroneous, duplicated or unused. Examples include stranded SharePoint sites, legacy file formats and documents that are no longer valid. Ensure that you keep objects that are relevant to current business processes and to legal and compliance requirements.
  3. Content Analytics: Work with your line-of-business leaders to study and analyze the content, understand trends and patterns, and incorporate the results into the decision-making process.
  4. Customer Service: Ensure that customer service associates are brought into the loop so they can use the intelligent content in their day-to-day activities.
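
For the first two steps, a small script can help get the inventory started. The Python sketch below assumes the content sits on a file share that the script can read; the function names are made up for this example. It walks a directory tree, records each file's size and SHA-256 hash, and groups files whose contents are byte-for-byte identical. It only flags candidates: deciding what can be defensibly disposed of still requires retention, legal and compliance review.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def content_hash(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory(root):
    """Step 1: list every file under root with its size and content hash."""
    records = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            records.append({
                "path": path,
                "size": path.stat().st_size,
                "sha256": content_hash(path),
            })
    return records


def find_duplicates(records):
    """Step 2 (in part): group files whose contents are exactly identical."""
    groups = defaultdict(list)
    for record in records:
        groups[record["sha256"]].append(record["path"])
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_duplicates(inventory(root))` on a shared folder returns one list of paths per set of identical files. Near-duplicates, such as the same document saved in two formats, are not detected by this approach.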
The steps above need to be a repeatable process within organizations to put content into motion. I hope it's obvious by now that content in motion is always the winner in any organization. I have borrowed ideas for this post from Craig Rhinehart of IBM and would like to thank him for his excellent blog post. I hope this encourages more organizations to put their content into motion. Watch this space for my next post on Content Analytics using Machine Learning and Natural Language Processing to put content into motion.