Named Entity Recognition (NER) is an application of Natural Language Processing (NLP) that processes and understands large amounts of unstructured human language. Also known as entity identification, entity chunking and entity extraction. NER extraction is the first step in answering questions, retrieving information and topic modeling. There are many models for implementing NER depending on the application need. Here we compare Spacy pre-trained and Stanford NER models.
NER has various applications in multiple industries such as news and media, search engines, and content recommendations.
Use Cases of Named Entity Recognition:
An NER System is capable of discovering entity elements from raw data and determines the category the element belongs to. The system reads the sentence and highlights the important entity elements in the text. NER might be given separate sensitive entities depending on the project. This means that NER systems designed for one project may not be reused for another task.
For a general entity such as name, location, organization, date and pre-trained library, Stanford NER and Spacy can be used. But for a domain specific entity, an NER model can be trained with custom training data requires lots of human efforts and time. There are a few other approaches such as, Feedforward Neural Networks for NER, BILSTM, CNNS, and Residual Stack BILSTMS with Biased Decoding which can be used to perform NER using deep learning.
This is implemented in Java and is based on linear chain CRF (Conditional Random Field) sequence models. For various applications, custom models can be trained with labeled data sets.
It has three models:
Stanford NER stated that Conditional Random Field (CRF) is a sequence modeling algorithm that assumes features are dependent on each other, and also considers future observations while learning a new pattern. This combines the best of HMM (Hidden Markov Model) and MEMM (Maximum Entropy Markov Model). In terms of performance, it is one the best methods for entity recognition problems.
Example:
Result
Spacy is an open source library for natural language processing written in Python and Cython, and it is compatible with 64-bit CPython 2.7 / 3.5+ and runs on Unix/Linux, macOS/OS X and Windows. Spacy provides a Tokenizer, a POS-tagger and a Named Entity Recognizer and uses word embedding strategy. The advantage of Spacy is having Pre-trained models in several languages: English, German, French, Spanish, Portuguese, Italian, Dutch, and Greek.
The current architecture of a Spacy model hasn’t been published yet, but the above video explains how the models work, with particular focus on NER.
The Spacy NER system contains a word embedding strategy using sub word features and "Bloom" embed, and a deep convolution neural network with residual connections. The system is designed to give a good balance of efficiency, accuracy and adaptability.
Figure: Spacy NER Architecture. Language processing pipeline
Example:
Output:
Comparing results from Spacy as well as NLTK implementation of Stanford NLP concludes that both can be used for NER to achieve good results. Spacy has support for word vectors, so it's fast and accurate. It is recommended to use Spacy NER for production over Stanford NER. For customizing the process of NER, both models can be used. This requires data labeling and annotation which means giving tags to entities.