GDPR - Challenge 1: Identify documents containing personal data

Posted by Daniela Di Noi on 1/9/18 3:06 PM


In the previous post we introduced seven challenges for securing your company’s documents. In this post we elaborate on the first challenge: identify documents that contain personal data and label them appropriately.

Several companies may be well on the way to defining how to handle GDPR compliance for structured data, but many still have not found a good way to handle it for unstructured data.

Let us take a step back and explain the difference between structured and unstructured data.

Structured data is highly organized information, usually held in databases or, for instance, in spreadsheets with titled columns or rows. It can easily be sorted, processed and found via search options or algorithms. Structured data is relatively simple to enter, store (in a database), query and analyze, but it must be strictly defined in terms of field name and type (e.g. alpha, numeric, date, currency), and as a result it is often restricted in the number of characters or to specific terminology.

Unstructured data is more like human language. It is not stored in a relational database, and the information has no pre-defined data model and is not organized in a pre-defined manner. A typical example of unstructured data is text.


Structured data

  • Stored in databases
  • Accessed through applications
  • More control = easier to protect

Unstructured data

  • Stored on local or network drives / cloud / USB keys
  • Easily copied and distributed
  • Less control = more difficult to protect


As we have seen, unstructured data is found everywhere: in emails and documents, stored on local or network drives, in the cloud or on USB keys. If we want to adequately protect the personal data stored in these documents, the first hurdle is knowing what kind of personal data is in which document. As documents are created and modified all the time, this classification has to be a recurring process. The results of the classification process must be stored in such a way that security tools can use this information.

If we look at Alfresco, every change to a document (e.g. creation or modification) can trigger a policy that calls the classification process. The results of the process can then be stored as metadata of the document. How this metadata is used will be the topic of a later article (“Challenge 3 - Monitor and control access to the documents”).
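To make the trigger idea concrete, here is a minimal sketch of a change-triggered classification hook. It does not use Alfresco's own policy mechanism: `Document`, `Repository`, `register_policy` and `classify_policy` are hypothetical names invented for this illustration, and the classifier is a trivial placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a document in a repository."""
    name: str
    content: str = ""
    metadata: Dict[str, object] = field(default_factory=dict)


class Repository:
    """Toy repository that calls every registered policy on each change."""

    def __init__(self) -> None:
        self._policies: List[Callable[[Document], None]] = []

    def register_policy(self, policy: Callable[[Document], None]) -> None:
        self._policies.append(policy)

    def save(self, doc: Document, content: str) -> None:
        # Creation and modification both pass through here, so the
        # classification is repeated whenever the content changes.
        doc.content = content
        for policy in self._policies:
            policy(doc)


def classify_policy(doc: Document) -> None:
    # Placeholder classifier (assumption): flags any e-mail-like token.
    # The result is stored on the document as metadata.
    doc.metadata["contains_personal_data"] = "@" in doc.content


repo = Repository()
repo.register_policy(classify_policy)
doc = Document(name="note.txt")
repo.save(doc, "Contact: jane.doe@example.com")
print(doc.metadata)
```

Because the policy runs on every save, the stored label always reflects the latest version of the document.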

For now, let’s go deeper into the functionality of the classification process. This process scans every individual file for personal data, regardless of the file type.

The most basic algorithm searches for structured data types: national numbers, bank account numbers, credit card numbers. This algorithm can be extended with company-specific structured data types, for instance client numbers that contain a check digit.
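As an illustration of this basic approach, the sketch below combines a regular expression with a check-digit test, which removes most random digit sequences. It covers credit card numbers (Luhn check) and IBAN bank account numbers (mod 97 check). The patterns are deliberately simple assumptions, not production rules, and the function names are ours.

```python
import re

# Simplified patterns (assumptions): 13 to 19 digits with optional
# separators, and an IBAN written in groups of four characters.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b")


def luhn_valid(candidate: str) -> bool:
    """Luhn check: double every second digit, counting from the right."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def iban_valid(candidate: str) -> bool:
    """IBAN check: move the first 4 characters to the end, map letters
    to numbers (A=10 ... Z=35) and test that the result mod 97 is 1."""
    iban = re.sub(r"\s+", "", candidate).upper()
    if not 15 <= len(iban) <= 34:
        return False
    rearranged = iban[4:] + iban[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def find_structured_identifiers(text: str):
    """Return (type, matched text) pairs that pass their check digit."""
    hits = []
    for m in CARD_RE.finditer(text):
        if luhn_valid(m.group()):
            hits.append(("credit_card", m.group()))
    for m in IBAN_RE.finditer(text):
        if iban_valid(m.group()):
            hits.append(("iban", m.group()))
    return hits


sample = ("Pay to BE68 5390 0754 7034 using card 4111 1111 1111 1111, "
          "ref 1234 5678 9012 3456.")
print(find_structured_identifiers(sample))
```

The reference number in the sample looks like a card number but fails the Luhn check, so it is not reported. A company-specific client number could be added in the same way: one pattern plus one check-digit function.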


Xenit Use Case: Screening 3 million documents and 8 million emails for sensitive data.




A more advanced algorithm takes company-specific data as input and scans documents and files for this data. For instance, the names and addresses from the client database are used as input and documents that contain this information are flagged.
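A minimal sketch of such a lookup, assuming the client names and addresses have already been exported from the client database as a list of strings. Text and terms are normalized (case, accents, whitespace) before matching, so small spelling differences do not hide a hit. The function names and sample values are hypothetical.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lower-case, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", no_accents).casefold().strip()


def build_matcher(terms):
    """Compile one pattern that matches any known term as a whole phrase."""
    normalized = sorted({normalize(t) for t in terms if t.strip()},
                        key=len, reverse=True)
    if not normalized:
        raise ValueError("at least one non-empty term is required")
    pattern = "|".join(re.escape(t) for t in normalized)
    # The lookarounds prevent matches inside longer words.
    return re.compile(rf"(?<!\w)(?:{pattern})(?!\w)")


def find_known_clients(text: str, matcher) -> list:
    """Return the normalized client terms that occur in the text."""
    return sorted({m.group() for m in matcher.finditer(normalize(text))})


# Hypothetical extract from a client database.
client_terms = ["Jané Doe", "Rue de la Loi 16"]
matcher = build_matcher(client_terms)
print(find_known_clients("Meeting with JANE   DOE at rue de la loi 16.", matcher))
```

For a client database with many thousands of entries, a single regular expression becomes slow, and a dedicated multi-pattern search structure would be a better fit. The principle stays the same.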

The most advanced algorithms use artificial intelligence such as natural language processing to search for certain data-types in plain text. This type of algorithm will return a probability. A threshold will have to be defined and tested on a representative random sample of documents.
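How the threshold is chosen depends on how many missed documents are acceptable. The sketch below shows one possible way, under our own assumptions: a manually labeled random sample of (probability, contains personal data) pairs, and a rule that picks the highest threshold that still reaches a required recall. The sample values are made up for illustration.

```python
def precision_recall(sample, threshold):
    """Precision and recall when flagging every score >= threshold."""
    tp = sum(1 for p, label in sample if p >= threshold and label)
    fp = sum(1 for p, label in sample if p >= threshold and not label)
    fn = sum(1 for p, label in sample if p < threshold and label)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def choose_threshold(sample, min_recall):
    """Highest threshold that still reaches min_recall on the sample."""
    for threshold in sorted({p for p, _ in sample}, reverse=True):
        _, recall = precision_recall(sample, threshold)
        if recall >= min_recall:
            return threshold
    return 0.0  # Empty sample: flag everything.


# Illustrative labeled sample: (model probability, reviewer's verdict).
labeled_sample = [
    (0.95, True), (0.80, True), (0.65, False),
    (0.60, True), (0.30, False), (0.10, False),
]

threshold = choose_threshold(labeled_sample, min_recall=1.0)
print(threshold, precision_recall(labeled_sample, threshold))
```

Requiring full recall on this sample lowers the threshold far enough that one document without personal data is flagged as well. That trade-off between missed documents and false alarms is the reason the threshold has to be tested on a representative sample.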

In addition to searching the content of the document, the type of document and the place where it is stored can also be used: for instance, all documents in the folder ‘employment contracts’.
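A location rule of this kind can be very small. The sketch below labels a document based on the folders in its path. The folder-to-label mapping is a hypothetical example and would be maintained per company.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from folder name (lower case) to a label.
FOLDER_LABELS = {
    "employment contracts": "employment_contract",
    "payroll": "salary_data",
}


def classify_by_location(path: str) -> list:
    """Return the labels of all known folders on the document's path."""
    folders = PurePosixPath(path).parent.parts
    return [FOLDER_LABELS[f.casefold()] for f in folders
            if f.casefold() in FOLDER_LABELS]


print(classify_by_location("/Company/HR/Employment Contracts/2017/doe.pdf"))
print(classify_by_location("/Company/Marketing/brochure.pdf"))
```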


[Figure: Unstructured data process]

The second part of the challenge was to ‘label the documents appropriately’. We have already mentioned that this information can be stored as metadata in Alfresco. We record whether the document contains personal data and, if so, which type of personal data, the content found and, if possible, the link to the individual person.
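As a sketch of what such a label could look like, the structure below holds the four pieces of information listed above. The class and property names are hypothetical and are not part of any Alfresco content model.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Finding:
    """One piece of personal data found in a document."""
    data_type: str                   # e.g. "iban", "client_name"
    content: str                     # the content found
    person_id: Optional[str] = None  # link to the individual, if known


@dataclass
class ClassificationLabel:
    """Classification result to be stored as document metadata."""
    findings: List[Finding] = field(default_factory=list)

    @property
    def contains_personal_data(self) -> bool:
        return bool(self.findings)

    @property
    def data_types(self) -> List[str]:
        return sorted({f.data_type for f in self.findings})

    def to_metadata(self) -> dict:
        """Flatten the label into plain values for a metadata store."""
        return {
            "containsPersonalData": self.contains_personal_data,
            "dataTypes": self.data_types,
            "findings": [asdict(f) for f in self.findings],
        }


label = ClassificationLabel([
    Finding("iban", "BE68 5390 0754 7034", person_id="client-0042"),
    Finding("client_name", "jane doe"),
])
print(label.to_metadata())
```

Note that the findings themselves contain personal data, so this metadata needs the same protection as the document it describes.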

As you can see, by using the classification algorithm we transform unstructured personal data into semi-structured personal data. The advantage is that we can implement stricter security on the documents and control how and by whom they are used. The disadvantage is that we have created more data that has to be protected.

We will go into this topic in the next article: “GDPR - Challenge 2: State of the art security of the documents and the metadata.” Stay tuned.

The series is not legal advice for your company to use in complying with EU data privacy laws like the GDPR. Instead, it provides background information to help you better understand the GDPR. 



About Xenit 


Xenit delivers Products and Solutions to create Return on Content, on top of Alfresco, the Digital Business Platform. 

Our platform, Alfred, is a blueprint content services architecture with prefabricated components, to unlock the value of Alfresco.

  • Alfred Desktop is a desktop application for Alfresco, that acts as Alfresco and looks like Microsoft Explorer
  • Alfred Finder is a web application to find and retrieve documents on Alfresco, preview them and edit metadata
  • Alfred Edge is an API Gateway, a single point of entry to Alfresco that simplifies and decouples your architecture
  • Alfred Archive is a secure, durable and extremely low cost storage service for data archiving and long-term backup.

