In this pilot study, Tom Magerman (MSc in Business and Information Systems Engineering, PhD at UCLL, University Colleges Leuven Limburg) and our team at Xenit investigate the effectiveness of applying automated named entity recognition to a set of miscellaneous documents in a corporate document management system. The idea is to use machine learning (artificial intelligence) techniques to identify and extract natural person names from documents.
This yields a list of person names together with the positions at which those names appear in the document, and that information can then be attached to the document as metadata.
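Such a name-plus-position record can be represented very simply. The sketch below is illustrative only (the structure and the `find_mentions` helper are assumptions for this example, not the pilot's actual tooling): each mention stores the name and its character offsets, ready to be attached as metadata.

```python
from dataclasses import dataclass

@dataclass
class PersonMention:
    text: str    # the name as it appears in the document
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention

def find_mentions(document: str, names: list[str]) -> list[PersonMention]:
    """Naive illustration: record every occurrence of each known name.
    A real NER system produces these spans itself rather than matching
    a predefined name list."""
    mentions = []
    for name in names:
        pos = document.find(name)
        while pos != -1:
            mentions.append(PersonMention(name, pos, pos + len(name)))
            pos = document.find(name, pos + 1)
    return mentions

doc = "Contract signed by Jan Peeters. Witness: An De Smet and Jan Peeters."
mentions = find_mentions(doc, ["Jan Peeters", "An De Smet"])
```

Note that the same name can appear more than once; keeping each occurrence with its own offsets is what makes the metadata usable for highlighting and redaction, not just for tagging.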
We are interested in person name extraction because it is instrumental for the automated classification of the GDPR sensitivity of documents. Indeed, the presence of natural person names is an important indicator of GDPR sensitivity, and having such metadata available helps in developing tools for automated GDPR classification of documents in a corporate document management system.
Having those names available as metadata also helps improve the quality of corporate search and retrieval systems: such metadata can be used to automatically structure information and reveal links between documents. In an insurance context, for example, this kind of metadata makes it possible to centralise information around customers (all contracts of a person; all claims a person is involved in) and to derive patterns and relationships (e.g. the same person popping up in multiple claims related to a limited set of contracts).
OBJECTIVES OF THE PILOT
The objectives are twofold.
- On the one hand, we want to find out whether machine learning-based named entity recognition methods and tools can correctly identify natural person names out of the box, i.e. using readily available off-the-shelf models trained on a large set of general documents. To do so, we compare the pre-trained English models of OpenNLP (en-ner-person.bin) and Stanford CoreNLP (english.all.3class.caseless.distsim.crf.ser.gz).
- On the other hand, we want to find out to what extent custom training, i.e. training on a particular document set, can improve the results, and how training effort (the number of cases available for training) influences effectiveness. To do so, we compile training sets with an increasing number of documents from a custom corporate document set, derive classification models with OpenNLP and Stanford CoreNLP from those training sets, and compare the results.
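The incremental-training setup can be sketched as follows. Everything here is illustrative bookkeeping (the document identifiers and subset sizes are placeholders, not the pilot's actual corpus): for each subset size one would run an OpenNLP and a CoreNLP training job and score the resulting model on a held-out test set.

```python
def cumulative_subsets(documents, sizes):
    """Nested training sets of increasing size: each larger set contains
    all smaller ones, so score differences reflect added training data
    rather than resampling noise."""
    return [documents[:n] for n in sizes if n <= len(documents)]

docs = [f"doc_{i}" for i in range(500)]     # placeholder document ids
sizes = [50, 100, 200, 400]
subsets = cumulative_subsets(docs, sizes)

# For each subset, train an OpenNLP and a CoreNLP model and evaluate it
# on a fixed held-out set; here we only show the per-size bookkeeping.
results = {len(s): None for s in subsets}   # size -> score, filled per run
```

Keeping the subsets nested is a deliberate choice: it isolates the effect of training-set size, which is exactly the question this part of the pilot asks.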
For all models, we compare both effectiveness (how many person names are correctly identified) and efficiency (how much computation time it takes to obtain results).
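Effectiveness and efficiency could be scored along these lines. The metrics (precision, recall, F1) are standard, but the exact-span matching and wall-clock timing shown here are a simplified assumption for illustration, not necessarily the pilot's evaluation protocol.

```python
import time

def prf1(gold: set, predicted: set):
    """Precision, recall and F1 over exact-span matches between
    gold-standard and predicted (name, offset) pairs."""
    tp = len(gold & predicted)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Jan Peeters", 19), ("An De Smet", 41)}
pred = {("Jan Peeters", 19), ("Smet", 47)}           # one hit, one partial miss

start = time.perf_counter()
p, r, f = prf1(gold, pred)
elapsed = time.perf_counter() - start                # efficiency: wall-clock time
```

In practice the timed section would wrap the model's extraction run over the whole document set rather than the metric computation; the point is that both numbers come out of the same pipeline and can be compared per model.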