GDPR - Challenge 1 : Identify documents containing personal data

Posted by Daniela Di Noi on 1/9/18 3:06 PM

In the previous post we introduced 7 challenges for securing your company’s documents. In this post we elaborate on the first challenge: identify documents that contain personal data and label them appropriately.

Several companies may be well on the way to defining how to handle GDPR compliance for structured data. But many companies still haven't found a good way to handle GDPR compliance for unstructured data.

Let us take a step back and explain the difference between structured data and unstructured data.

Structured data is highly organized information, usually held in databases or, for instance, in spreadsheets with titled columns or rows, which can easily be sorted, processed and found via search options or algorithms. Structured data is relatively simple to enter, store (in a database), query, and analyze, but it must be strictly defined in terms of field name and type (e.g. alphabetic, numeric, date, currency), and as a result is often restricted in the number of characters or to specific terminology.

Unstructured data is more like human language. It is not stored in a relational database, and the information has no pre-defined data model and is not organized in a pre-defined manner. A typical example of unstructured data is text.


Structured data

  • Stored in databases
  • Accessed through applications
  • More control = easier to protect

Unstructured data

  • Stored on local or network drives / Cloud / USB keys
  • Easily copied and distributed
  • Less control = more difficult to protect


As we have seen, unstructured data is found everywhere: in emails and documents, stored on local or network drives, in the Cloud or on USB keys. If we want to adequately protect the personal data stored in these documents, the first hurdle to clear is knowing what kind of personal data is in which document. As documents are created or modified all the time, this classification has to be a recurring process. The results of the classification process must be stored in such a way that security tools are able to use this information.
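For documents on a plain file share, one simple way to make the classification recurring is to remember a hash of each document and re-classify only what is new or changed. The sketch below is an illustration under our own assumptions, not a description of any particular product: the function names and the in-memory `seen_hashes` store are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a SHA-256 fingerprint of the document content."""
    return hashlib.sha256(data).hexdigest()


def documents_to_classify(documents: dict[str, bytes],
                          seen_hashes: dict[str, str]) -> list[str]:
    """Return the paths that are new or changed since the previous run.

    `documents` maps a path to the current content; `seen_hashes` maps a
    path to the hash recorded last time and is updated in place.
    Both structures are hypothetical stand-ins for a real crawler and store.
    """
    pending = []
    for path, data in documents.items():
        digest = content_hash(data)
        if seen_hashes.get(path) != digest:
            pending.append(path)
            seen_hashes[path] = digest
    return pending
```

In a real setup the hash would probably be recorded only after the classification of that document has succeeded, so that a failed run is retried.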

If we look at Alfresco, every change to a document (e.g. creation or modification) can trigger a policy that calls the classification process. The results of the process can then be stored as metadata of the document. How this metadata is to be used will be the topic of a later article (“Challenge 3 - Monitor and control access to the documents”).

For now, let’s go deeper into the functionality of the classification process. This process scans every individual file for personal data, regardless of the file type.

The most basic algorithm will search for structured data types: national numbers, bank account numbers, credit card numbers. This algorithm can be enhanced with company-specific structured data types, for instance client numbers that contain a check digit.
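To make this concrete, here is a minimal sketch of such a search for credit card numbers. It assumes card numbers of 13 to 19 digits, optionally separated by spaces or hyphens, and uses the Luhn check digit to discard random digit runs. The pattern and function names are our own illustration; a company-specific client number would need its own pattern and check-digit rule.

```python
import re

# Candidate card numbers: 13 to 19 digits, optionally separated
# by single spaces or hyphens, not adjacent to other digits.
CARD_PATTERN = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn check-digit test."""
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return the candidate card numbers in text that pass the Luhn check."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

The check digit matters here: without it, every order number or phone number of the right length would be flagged as a card number.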


Xenit Use Case: Screening 3 million documents and 8 million emails for sensitive data.




A more advanced algorithm takes company-specific data as input and scans documents and files for this data. For instance, the names and addresses from the client database are used as input and documents that contain this information are flagged.
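As an illustration, the sketch below flags a document when it contains the name of a known client. The `clients` dictionary stands in for an export from the client database, and its IDs and names are invented for the example. Matching is case-insensitive and ignores line breaks, but is otherwise deliberately naive.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_known_clients(text: str, clients: dict[str, str]) -> list[str]:
    """Return the IDs of clients whose full name appears in the text.

    `clients` maps a (hypothetical) client ID to the client's full name.
    A name only counts when it is not part of a longer word.
    """
    haystack = normalize(text)
    found = []
    for client_id, name in clients.items():
        needle = re.escape(normalize(name))
        if re.search(rf"(?<!\w){needle}(?!\w)", haystack):
            found.append(client_id)
    return found
```

With a large client database, looping over every name for every document becomes slow; a production version would more likely build an index or use a multi-pattern matching algorithm.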

The most advanced algorithms use artificial intelligence such as natural language processing to search for certain data-types in plain text. This type of algorithm will return a probability. A threshold will have to be defined and tested on a representative random sample of documents.
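Testing a threshold on a sample could look like the sketch below. It assumes a hand-labelled sample of (probability, contains personal data) pairs and picks the highest threshold that still reaches a required recall, on the assumption that missing personal data is the more costly error. The function names and the selection rule are our own choices for the example.

```python
def precision_recall(scored_sample, threshold):
    """Compute precision and recall of the classifier at one threshold.

    scored_sample: list of (probability, has_personal_data) pairs, where
    the boolean comes from manually reviewing the sampled document.
    """
    tp = sum(1 for p, label in scored_sample if p >= threshold and label)
    fp = sum(1 for p, label in scored_sample if p >= threshold and not label)
    fn = sum(1 for p, label in scored_sample if p < threshold and label)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def pick_threshold(scored_sample, candidates, min_recall):
    """Return the highest candidate threshold whose recall is at least
    min_recall, or None if no candidate qualifies."""
    qualifying = [
        t for t in candidates
        if precision_recall(scored_sample, t)[1] >= min_recall
    ]
    return max(qualifying) if qualifying else None
```

A higher threshold means fewer false alarms but more missed documents, so the required recall is a business decision rather than a technical one.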

Besides searching the content of the document, the type of document and the place where it is stored can also be used: for instance, all documents in the folder ‘employment contracts’.
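A location-based rule of this kind can be very small. In the sketch below, the folder names and labels in `FOLDER_LABELS` are made up for the example; the function returns a label as soon as one of the document's parent folders is on the list.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical folder names (lowercase) and the label each one implies.
FOLDER_LABELS = {
    "employment contracts": "hr-personal-data",
    "payroll": "hr-personal-data",
}


def label_from_location(path: str) -> Optional[str]:
    """Return a label if any parent folder of the document is listed
    in FOLDER_LABELS, otherwise None. The file name itself is ignored."""
    for part in PurePosixPath(path).parent.parts:
        label = FOLDER_LABELS.get(part.lower())
        if label:
            return label
    return None
```

Such a rule is cheap and easy to explain, but it only works as long as people actually file documents in the right folder, so it complements the content scan rather than replacing it.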


[Figure: Unstructured data process]

The second part of the challenge was ‘label the documents appropriately’. We have already mentioned that this information can be stored as metadata in Alfresco. We will record whether the document contains personal data and, if so, which type of personal data, the content found and, if possible, the link to the individual person.
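The four pieces of information listed above could be modelled as follows. This is only a sketch: the class names and the property names in `to_metadata` are hypothetical and do not correspond to an actual Alfresco content model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    """One piece of personal data found in a document."""
    data_type: str                   # e.g. "credit_card" or "client_name"
    value: str                       # the content that was found
    person_id: Optional[str] = None  # link to the individual, if known


@dataclass
class ClassificationResult:
    """All findings for one document."""
    findings: list[Finding] = field(default_factory=list)

    @property
    def contains_personal_data(self) -> bool:
        return bool(self.findings)

    def to_metadata(self) -> dict:
        """Flatten the result into properties that could be stored
        on the document (property names are hypothetical)."""
        return {
            "containsPersonalData": self.contains_personal_data,
            "dataTypes": sorted({f.data_type for f in self.findings}),
            "contentFound": [f.value for f in self.findings],
            "personIds": sorted(
                {f.person_id for f in self.findings if f.person_id}
            ),
        }
```

For example, a document with a card number and a client name that both belong to client `C001` would yield two data types, two content values and a single person ID.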

As you can see, by using the classification algorithm we transform unstructured personal data into semi-structured personal data. The advantage is that we can implement stricter security on the documents and control how and by whom these documents are used. The disadvantage is that we have created more data that has to be protected.

We will go into this topic in the next article: “GDPR - Challenge 2: State of the art security of the documents and the metadata.” Stay tuned.

This series is not legal advice for your company to use in complying with EU data privacy laws like the GDPR. Instead, it provides background information to help you better understand the GDPR.


Topics: Alfresco, Content Services, Handling Documents, Edit Online, Edit Offline, GDPR, Compliance, personal data, Security, sensitive data, PROCESSING, PRIVACY, governance, breaches, Alfred, document

About Xenit 


Xenit is a Belgium-based IT company focusing on content services solutions and covering all document-related business processes, from data migration to digital archiving to hybrid/cloud hosting solutions, to help organizations get control of their information. Premier Partner and System Integrator of Alfresco Digital Business Platform, Xenit has more than 10 years of experience in Alfresco Content and Process Services.

