Automatic Document Classification with Machine Learning and AI

Posted by Rawia Benhmida on 12/10/19 10:53 AM



Modern businesses are leveraging machine learning (ML) based solutions to help automate operations and make the whole process of document management faster and more effective.

The latest systems are incorporating artificial intelligence (AI) to “read” documents like a human, to identify and classify the type of document and extract key data. Such systems can efficiently and accurately convert the varied content that documents contain into standard data types, scanning for relevant information and feeding that information into common data stores.

The importance of cognitive capture

Automated data capture technology already increases workplace efficiency and decreases business costs—but “intelligent” capture is even more powerful, leveraging AI and robotic process automation (RPA) to bring additional benefits to enterprises. Cognitive capture also uses natural language understanding to recognise phrases and determine the “emotion” of the document.

When business information is trapped in unstructured documents, it remains essentially invisible. Until now, most companies have simply scanned these documents, indexed them with a date and document number, and stored them in a repository.

Only now, with these new cognitive systems, has data capture made it possible to capture all documents—not just structured layouts—and make that data actionable with a minimum of human intervention (IBM - The essential buyer’s guide to data capture and automation).

The most valuable data capture solution is one that can help:

  • Classify: the software learns to recognise different types of documents after being given a few variations and examples.
  • Extract: the software trains itself to understand context, such as what an invoice number is and what should (or shouldn’t) be around the number, so there’s a high degree of accuracy in the extraction.
  • Validate: advanced search capabilities can validate extracted data from a document with existing information in another system.

Every industry has its own unique document types, and every organisation handles documents in its own unique way according to its policies and procedures. Xenit is now working closely with customers and investing in automatic document classification, as well as data capture and extraction.


A real case of data classification using ML/AI

One of our customers is an international company that supplies ready-to-use building product systems for waterproofing, building repairs, tile laying, and industrial floor coatings. They had a long and impractical process involving multiple manual steps to integrate documents into their system. This laborious process was repeated several times a day. Our client’s goal was to automate document classification.

The daily workflow for processing their documents consisted of:

  • Selecting the right template for the document;
  • Adding metadata to the template, such as the project number, project manager and client, among many other things;
  • Printing the document pending signatures;
  • Acquiring the necessary signatures;
  • Scanning the document to PDF and assigning it to the project manager;
  • Placing the document under the right department folder, based on its type.

Machine Learning Data Classification


The huge number of documents that needed to be processed and classified, combined with the complexity of the above process, led to a high probability of human error.


To overcome this challenge, our team suggested using Machine Learning models to automatically classify documents into a set of predefined categories (i.e. two documents belong to the same category if they are typically related).


The solution

The company provided us with almost 20 thousand documents with their respective templates, stored on an encrypted drive to protect private information. Our team built a scalable, deployable model to classify this set of documents and extract information from them.

In these 20 thousand PDF documents, there were 7 different classes. The model extracts the text of each PDF document and applies vectorization to it. Count vectorization involves counting the number of times each word appears in a document (i.e. a distinct text such as an article, a book, or even a paragraph).
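As a rough illustration of count vectorization, here is a minimal sketch in plain Python. The tokenizer, the function names and the two example texts are assumptions made for this illustration; they are not taken from the project described here.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def count_vectorize(documents):
    """Return (vocabulary, vectors): one word-count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary is every distinct word seen in any document.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical example texts, not taken from the customer's documents.
docs = [
    "Invoice for floor coating, invoice number attached",
    "Repair report for floor tiles",
]
vocabulary, vectors = count_vectorize(docs)
for word, *counts in zip(vocabulary, *vectors):
    print(word, counts)
```

Each document becomes a vector with one position per vocabulary word, which is the form a classifier can consume.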

The vectorized representation is then fed to the model for prediction. The vectorization uses two popular NLP (Natural Language Processing) approaches:

  • Extraction of BoW (bag of words), and
  • TFIDF (term frequency-inverse document frequency)
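To make the difference between the two approaches concrete, the sketch below computes one common TFIDF variant, the raw count multiplied by log(N / document frequency). Several variants exist (libraries often add smoothing or normalisation), so the exact formula, the function name and the example documents are assumptions for illustration only.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Weight each word count by log(N / number of documents containing it)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized_docs for word in set(tokens))
    weighted = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weighted.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weighted

# Hypothetical, pre-tokenised example documents.
docs = [
    ["invoice", "number", "invoice", "floor"],
    ["repair", "report", "floor"],
    ["tile", "report", "floor"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "floor" appears in every document, so its weight drops to zero, while a word such as "invoice" that is specific to one document keeps a high weight.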



Once these features were extracted, and after splitting the data into a test set and validation set, we applied different classification models to the data (logistic regression, support vector machines, and neural networks).
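The overall shape of such an experiment can be sketched with scikit-learn. This is an assumption: the post does not say which library was used, and the toy corpus, labels and split parameters below are hypothetical. The sketch only shows TFIDF features feeding a logistic regression that is evaluated on held-out documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus with two classes; the real project had 7 classes.
texts = [
    "invoice number total amount due",
    "invoice payment amount due date",
    "invoice total payment reference",
    "invoice amount reference number",
    "repair report cracked floor tiles",
    "repair report waterproofing damage",
    "inspection report floor coating damage",
    "repair inspection report tiles",
]
labels = ["invoice"] * 4 + ["report"] * 4

# Hold out a quarter of the documents, keeping both classes represented.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline vectorizes the text and then fits the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)
predictions = model.predict(test_texts)
print(list(zip(test_texts, predictions)))
```

Swapping `LogisticRegression()` for another estimator, or `TfidfVectorizer()` for `CountVectorizer()`, is one way to compare models and feature types on the same split.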

This NLP approach proved very effective even with simple vectorization. Comparing TFIDF with BoW, TFIDF produced better results on the metrics described in the glossary below, and it was also more stable. The prediction and training times for the models (not including the time taken to extract the text from the PDF documents) were negligible.

The models’ metrics showed very promising results: accuracy, precision, recall and F1 score varied only slightly between models, but the best result was recorded with logistic regression on TFIDF features, with an accuracy of 0.945, precision of 0.966, recall of 0.919 and F1 score of 0.942.

By taking advantage of the existing documents and by looking rigorously into a functional problem, we were able to provide a machine learning-based solution using a basic yet effective way to extract data, build a model, and deliver a classification.

Eliminating manual processing could be a turning point for a company. Implementing such a solution could reduce administrative overhead and accelerate document delivery, resulting in improved customer satisfaction.




Glossary

  • Bag of Words (BoW): it builds a vocabulary of all the unique words in the whole set of documents and represents each document by how often each of those words appears in it.
  • TFIDF: it works similarly to BoW, but here the word count (or term frequency) is also down-weighted by how common the word is in the entire set of documents.
  • Accuracy: it answers the following question: how many documents did we correctly classify out of all the documents?
  • Precision: it answers the following question: of the documents we classified as Type1 (the name of a class), how many are actually Type1?
  • Recall: it answers the following question: of all the documents that are Type1, how many did we correctly predict?
  • F1: the harmonic mean (a type of average) of precision and recall.
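The four metrics can be written out in a few lines of plain Python. The function name and the six example labels below are hypothetical. The last line applies the harmonic mean to the precision and recall reported above.

```python
def classification_metrics(true_labels, predicted_labels, positive):
    """Accuracy over all documents; precision, recall and F1 for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(1 for t, p in pairs if t == p)
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    predicted_pos = sum(1 for _, p in pairs if p == positive)
    actual_pos = sum(1 for t, _ in pairs if t == positive)

    accuracy = correct / len(pairs)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical labels for six documents.
truth = ["Type1", "Type1", "Type1", "Type2", "Type2", "Type2"]
predicted = ["Type1", "Type1", "Type2", "Type1", "Type2", "Type2"]
print(classification_metrics(truth, predicted, positive="Type1"))

# Harmonic mean of the reported precision (0.966) and recall (0.919).
reported_f1 = round(2 * 0.966 * 0.919 / (0.966 + 0.919), 3)
print(reported_f1)
```

The harmonic mean of the reported precision and recall rounds to 0.942, which matches the F1 score quoted for the logistic regression model.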




Topics: Alfresco, ECM

About Xenit 


Xenit is a Belgium-based IT company focusing on content services solutions and covering all document-related business processes, from data migration to digital archiving to hybrid/cloud hosting solutions, to help organizations get control of their information. A Premier Partner and System Integrator of the Alfresco Digital Business Platform, Xenit has more than 10 years of experience in Alfresco Content and Process Services.

