Natural Language Processing (NLP) is a branch of Artificial Intelligence that deals with the interaction between computers and human (natural) languages. The recently held Jeopardy! match in which IBM's Watson competed against human contestants is a great example of an NLP system in real life. NLP technology can be leveraged successfully to solve problems in the unstructured content management space within enterprises.
Enterprises deal with a mountain of unstructured content in their day-to-day business processes. Until recently, organizations have ingested huge volumes of data into a traditional content management system such as IBM FileNet or EMC Documentum, indexed with a few fields such as account number, name and location. Invariably, most of the indexing is performed manually by people because of the wildly unstructured nature of the content. As a result, the index fields are the only searchable information about the unstructured content objects. This way of dealing with content no longer meets the requirements of today's enterprises, which need content to be in motion and widely used across all departments.
To address this ever-growing problem, my organization, AlphaCloud Labs, decided to embrace NLP using machine learning. The machine-learning paradigm differs from most prior attempts at language processing, which typically involved hand-coding large sets of rules directly. Machine learning instead uses general learning algorithms (often, although not always, grounded in statistical inference) to learn such rules automatically by analyzing large corpora of typical real-world examples. A corpus (plural, "corpora") is a set of documents (or sometimes individual sentences) that have been hand-annotated with the correct values to be learned.
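To make the idea of learning from a hand-annotated corpus concrete, here is a minimal sketch in Python. It is not our production approach. It assumes a tiny made-up corpus (`CORPUS`) with two hypothetical labels, and trains a simple Naive Bayes classifier with add-one smoothing, one common example of a statistical learning algorithm.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-annotated corpus: (document text, correct label).
CORPUS = [
    ("claim filed for water damage to the insured property", "insurance"),
    ("policy holder reported a vehicle accident claim", "insurance"),
    ("patient was prescribed medication after diagnosis", "medical"),
    ("lab results and diagnosis added to the patient record", "medical"),
]

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def train(corpus):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in corpus:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(CORPUS)
print(classify("new claim for accident damage", model))  # insurance
print(classify("diagnosis for the patient", model))      # medical
```

No rule such as "documents mentioning a claim are insurance documents" was written by hand; the association comes from the annotated examples. A real corpus would of course need far more than four documents.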
Our NLP solutions address the following requirements:
- Information retrieval (IR): This is concerned with storing, searching and retrieving information. It is a separate field within computer science (closer to databases), but IR relies on some NLP methods (for example, stemming). Some current research and applications seek to bridge the gap between IR and NLP.
- Information extraction (IE): This is concerned in general with the extraction of semantic information from text. This covers tasks such as named entity recognition, coreference resolution, relationship extraction, etc.
- Question answering: Given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?").
- Automatic summarization: Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper.
- Named entity recognition (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization).
- Optical character recognition (OCR): Given an image representing printed text, determine the corresponding text.
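As an illustration of how information retrieval leans on an NLP method such as stemming, the sketch below builds a small inverted index in Python. Everything in it is an assumption made for the example: the three sample documents, the function names, and especially the `stem` function, which is a deliberately crude suffix stripper rather than a real stemming algorithm.

```python
import re
from collections import defaultdict

# Hypothetical documents standing in for content objects in a repository.
DOCUMENTS = {
    "doc1": "Customer disputes charges on account 1042",
    "doc2": "Account closed after the customer disputed a charge",
    "doc3": "Scanned invoice from a supplier in Toronto",
}

def stem(word):
    """Strip a few common English suffixes (illustrative only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(text):
    """Split text into lowercase alphanumeric tokens and stem each one."""
    return [stem(word) for word in re.findall(r"[a-z0-9]+", text.lower())]

def build_index(documents):
    """Map each stemmed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in terms(text):
            index[term].add(doc_id)
    return index

def search(query, index):
    """Return ids of documents that contain every stemmed query term."""
    query_terms = terms(query)
    if not query_terms:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in query_terms))
    return sorted(matches)

index = build_index(DOCUMENTS)
print(search("disputing customers", index))  # ['doc1', 'doc2']
print(search("supplier toronto", index))     # ['doc3']
```

The first query matches both documents even though neither contains the word "disputing", because "disputes", "disputed" and "disputing" all reduce to the same stem. The crude stemmer still misses pairs such as "charges" and "charge", which is why real systems use an established algorithm such as the Porter stemmer.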
As you can infer from the above, NLP can be widely leveraged to solve use cases involving unstructured information and content analytics in an enterprise. Typical use cases include risk management in financial services and insurance, electronic medical records in health care, and sentiment analysis for social media and brand management. Our NLP solution is built on an open-standards-based architecture that can be easily integrated into existing content management systems such as FileNet, Content Manager, SharePoint and Documentum. One of our clients recently asked us to analyze 5 TB of image data and provide a searchable interface for finding patterns and meaning in the documents. We are seeing a growing trend toward this kind of content analytics for unstructured content.
I hope this post has provided an introduction to how organizations can plan and create a strategy to leverage NLP for content analytics.