Entity Recognition with Machine Learning

Posted by Daniela Di Noi on 1/21/20 9:39 AM

In this pilot study, Tom Magerman (MSc Business and Information Systems Engineering, PhD at UCLL, University Colleges Leuven Limburg) and our team at Xenit want to test how effective automated named entity recognition is on a set of miscellaneous documents in a corporate document management system. The idea is to use machine learning (artificial intelligence) techniques to identify and extract the names of natural persons from documents.


This results in a list of person names and the positions where those names appear in the document; that information can then be added to the document as metadata.
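As an illustration, the recognizer output described above could be turned into a metadata record along the following lines. This is only a minimal sketch: the `PersonMention` class, the `find_mentions` stand-in (a plain string search, not a machine learning model) and the metadata field names are assumptions made for this example, not the structure used in the study.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PersonMention:
    name: str   # surface form as it appears in the text
    start: int  # character offset of the first character (inclusive)
    end: int    # character offset just past the last character (exclusive)


def find_mentions(text, names):
    """Stand-in for a real recognizer: locate every occurrence of known names."""
    mentions = []
    for name in names:
        if not name:
            continue  # an empty string would match everywhere
        pos = text.find(name)
        while pos != -1:
            mentions.append(PersonMention(name, pos, pos + len(name)))
            pos = text.find(name, pos + len(name))
    return sorted(mentions, key=lambda m: m.start)


def to_metadata(mentions):
    """Build a metadata record (hypothetical field names) from the mentions."""
    return {
        "containsPersonNames": bool(mentions),
        "personNames": sorted({m.name for m in mentions}),
        "personMentions": [asdict(m) for m in mentions],
    }


text = "Claim filed by Jane Doe. Jane Doe signed on Monday."
metadata = to_metadata(find_mentions(text, ["Jane Doe"]))
```

Keeping the character offsets next to the names means a later tool can highlight or redact the exact spans without running the recognizer again.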

We are interested in person name extraction because it is instrumental for automatically classifying the GDPR-sensitivity of documents. The presence of natural person names is an important factor in determining GDPR-sensitivity, and having such metadata available helps in developing tools for automated GDPR-classification of documents in a corporate document management system.

Having those names available as metadata also helps improve the quality of corporate search and retrieval systems: such metadata can be used to structure information automatically and to reveal links between documents. In an insurance context, for example, this kind of metadata makes it possible to centralise information around customers (all contracts of a person; all claims a person is involved in) and to derive patterns and relationships (e.g. the same person popping up in multiple claims related to a limited set of contracts).
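The insurance example can be sketched as a small inverted index from person names to documents. The document identifiers and names below are made up for illustration, and the function names are hypothetical; a real system would more likely rely on the search index of the document management system.

```python
from collections import defaultdict


def index_by_person(doc_metadata):
    """Map each person name to the set of documents that mention it."""
    index = defaultdict(set)
    for doc_id, names in doc_metadata.items():
        for name in names:
            index[name].add(doc_id)
    return index


def people_in_multiple_documents(index, minimum=2):
    """Return the people who appear in at least `minimum` documents."""
    return {
        name: sorted(docs)
        for name, docs in index.items()
        if len(docs) >= minimum
    }


# Hypothetical per-document person-name metadata.
docs = {
    "claim-001": ["Jane Doe", "John Roe"],
    "claim-002": ["Jane Doe"],
    "contract-17": ["John Roe"],
    "claim-003": ["Ann Poe"],
}
recurring = people_in_multiple_documents(index_by_person(docs))
```

Note that this matches names as exact strings; spelling variants or two different people sharing a name would need extra handling.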





The objectives are twofold.

  • On the one hand, we want to find out whether machine learning-based named entity recognition methods and tools can correctly identify private person names out of the box, using machine learning models that are readily available off the shelf (i.e. models trained on a large set of general documents). To do so, we compare the pre-trained English models of OpenNLP (en-ner-person.bin) and Stanford CoreNLP (english.all.3class.caseless.distsim.crf.ser.gz).
  • On the other hand, we want to find out to what extent custom training, i.e. training on a particular document set, can improve the results, and how the training effort (the number of cases available for training) influences effectiveness. To do so, we compile training sets with an increasing number of documents from a custom corporate document set, derive classification models from those training sets using OpenNLP and Stanford CoreNLP, and compare the results.
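The second objective, training on sets of increasing size, can be sketched as a simple learning-curve loop. The sketch below assumes nested training sets (each larger set contains the smaller ones) and takes the `train` and `evaluate` steps as caller-supplied functions; both choices are assumptions for illustration, and the actual OpenNLP or Stanford CoreNLP training calls are not shown.

```python
import random


def nested_training_sets(documents, sizes, seed=0):
    """Return {size: subset}, where every larger subset contains the smaller ones."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs comparable
    # Sizes larger than the available document set are skipped.
    return {n: shuffled[:n] for n in sorted(sizes) if n <= len(shuffled)}


def learning_curve(documents, sizes, train, evaluate, seed=0):
    """Train one model per training-set size and record its score."""
    curve = []
    for n, subset in nested_training_sets(documents, sizes, seed).items():
        model = train(subset)              # hypothetical training step
        curve.append((n, evaluate(model)))  # hypothetical evaluation step
    return curve


documents = [f"doc{i}" for i in range(10)]
subsets = nested_training_sets(documents, [2, 5, 10, 20])
```

Using nested subsets means any change in the score between two sizes comes from the added documents rather than from a different random sample.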

For all models, we compare both effectiveness (how many person names are correctly identified) and efficiency (how much calculation time it takes to get results).
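One common way to quantify these two dimensions is precision, recall and F1 over the recognized name spans, plus wall-clock time per run. The sketch below assumes exact span matching against a manually annotated gold standard; these metrics and the function names are assumptions for this example, since the post itself does not specify how effectiveness was scored.

```python
import time


def score_spans(predicted, gold):
    """Precision, recall and F1 for exact (start, end) span matches."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def timed(recognizer, text):
    """Run a recognizer callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = recognizer(text)
    return result, time.perf_counter() - start


# Hypothetical spans: one correct hit, one false positive, one missed name.
gold = [(15, 23), (25, 33)]
predicted = [(15, 23), (40, 45)]
scores = score_spans(predicted, gold)
```

Exact matching is strict: a span that covers only the first name of a full name counts as both a false positive and a miss, so a more lenient overlap-based score may be worth reporting alongside it.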





Topics: GDPR, Machine Learning, Artificial Intelligence, data recognition

About Xenit 


Xenit is a Belgium-based IT company focusing on content services solutions and covering all document-related business processes, from data migration to digital archiving to hybrid/cloud hosting solutions, to help organizations get control of their information. A Premier Partner and System Integrator of the Alfresco Digital Business Platform, Xenit has more than 10 years of experience in Alfresco Content and Process Services.

