PDF/A: how to make your PDF files searchable and GDPR-compliant

The Portable Document Format (PDF) has been one of the most commonly used file formats ever since it was introduced in 1993. It’s a way to send read-only documents that preserve the layout of a text. Even though it is so common, many people don’t know that there are actually several different types of PDF files. One of the most important for long-term archiving is PDF/A. The PDF/A standard was first published in 2005 and has a few benefits over standard PDF documents. In this blog we will tell you more about what PDF/A is, which versions there are and what the benefits are.

Do you need to make your PDF documents searchable? And do you wish to learn more about PDF/A? Sit tight and scroll on down, we’ll tell you all about it.

What does PDF/A entail?

PDF/A is an ISO-standardized version of PDF, tailor-made for the archiving and long-term preservation of electronic documents. The A actually stands for “Archiving”. ISO standards are agreed upon by experts and describe the best way of doing something. The PDF/A standard identifies a set of characteristics for electronic documents that ensures the documents can be reproduced in exactly the same manner with various software, now and in the future. This is something normal PDF documents cannot guarantee, and therefore normal PDFs can be a compliance issue for long-term data storage.

A key element is that PDF/A documents are 100% self-contained. Everything needed to display the document is embedded in the file: all content (text, raster images and vector graphics), fonts, color information and metadata. A PDF/A document cannot rely on data from external sources (font programs and data streams), but can include hyperlinks to external documents. PDF/A prohibits features unsuitable for long-term archiving, such as font linking and encryption.

PDF/A comes in many different variations, created by combining the different parts of the PDF/A standard with conformance levels. Each part has a different mix of available features and image compression technologies that help with the preservation of content.

Which versions of PDF/A are there?

The first part of the standard was published in 2005 and consisted of two levels:

PDF/A-1b – Level B (basic) conformance
PDF/A-1a – Level A (accessible) conformance

Level B is the least complex one and is commonly used for archiving. Level A adds a few requirements that make documents better suited to visually impaired readers and easier to search through. The downside is that it is not always possible to create a Level A document from a specific source, and creating PDF/A-1a documents is more complicated and takes more time. Below are the extra Level A requirements:

  • Language specification
  • Hierarchical document structure
  • Tagged text spans and descriptive text for images and symbols

Because technology is improving quickly, new versions of PDF/A have been developed over time. PDF/A-1 is the original PDF/A standard, and is both the most commonly used and the most restrictive. Because it is based on an older PDF version, PDF 1.4, it does not support JPEG 2000, attachments or layers. Level A conformance was intended to increase accessibility for users with disabilities by allowing assistive software, such as screen readers, to better interpret a file’s contents.

The second part of the standard was published in 2011. PDF/A-1 files do not necessarily conform to PDF/A-2 and vice versa. This part contains the following new features and is now commonly used:

  • Digital signatures
  • JPEG 2000 and JBIG2 image compression
  • Transparency effects and layers
  • Option to archive sets of documents in a single file
  • Embedding of OpenType fonts
  • Conformance level U (Unicode), which ensures text can be reliably searched and copied without the file having to meet the other Level A requirements

Level U (Unicode) was introduced along with PDF/A-2 and provides character mappings to Unicode.

Part 3 has one new feature: it permits any file format (XML, CSV, CAD, Word, Excel, etc.) as an attachment. It is not widely used yet.

Part 4 is expected to be published sometime this year (2020).
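To make the naming scheme a bit more concrete, here is a minimal Python sketch that splits a label such as "PDF/A-2u" into its part and conformance level and checks that the combination exists. The function name `parse_flavour` and the lookup table are our own illustration, not part of the standard, and the table only covers parts 1 to 3 (we assume part 3 keeps the same levels as part 2, and we leave out part 4).

```python
import re

# Conformance levels per part, as described in this article (parts 1 to 3 only).
ALLOWED_LEVELS = {
    1: {"a", "b"},
    2: {"a", "b", "u"},
    3: {"a", "b", "u"},
}


def parse_flavour(name):
    """Split a label such as 'PDF/A-2u' into (part, level); raise ValueError if unknown."""
    match = re.fullmatch(r"PDF/A-(\d)([A-Za-z])", name.strip())
    if not match:
        raise ValueError(f"not a PDF/A label: {name!r}")
    part, level = int(match.group(1)), match.group(2).lower()
    if level not in ALLOWED_LEVELS.get(part, set()):
        raise ValueError(f"PDF/A-{part} has no conformance level {level!r}")
    return part, level


print(parse_flavour("PDF/A-1b"))  # (1, 'b')
print(parse_flavour("PDF/A-2u"))  # (2, 'u')
```

A label such as "PDF/A-1u" is rejected, because Level U only arrived with part 2.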

What are the benefits of PDF/A? 

There are many advantages to using PDF/A as opposed to, for instance, the traditional PDF file format. So as not to make this blog annoyingly long, we will list what we think are the five most important ones.

1 – PDF/A documents are fully text searchable: The answer to many people’s headaches: PDF/A documents are fully searchable! This feature may help save numerous hours of manual labor. The text is preserved in the document, even text extracted with optical character recognition (OCR). The PDF/A file saves both the extracted text and the scanned image.

2 – PDF/A takes up relatively little storage space: Although PDF/A documents contain more information than images (such as TIFF), the PDF/A files are usually smaller due to the use of efficient compression algorithms.

3 – PDF/A documents stay valid forever: Existing PDF/A documents don’t need to be migrated when new standard amendments are introduced by the ISO committee. They will always stay compliant because the ISO cannot withdraw the PDF/A standard. This guarantees that you have a safe and usable document archive and that you will not lose any data or become non-compliant.

4 – Digital signatures guarantee security: Combining PDF/A with digital signatures ensures that PDF documents have not been altered and that they are authentic. For long-term archiving, this means optimal legal security.

5 – PDF/A is widely accepted: In Europe and Asia, PDF/A is already widely used for long-term archiving by governments, organizations and businesses alike. In North America, demand for the standard is growing in certain sectors. The PDF Association plays an important role in supporting PDF/A.

PDF/A and GDPR compliant archives 

We can talk about benefits as long as we want, but we should also consider legal restrictions. On May 25th, 2018, the EU’s General Data Protection Regulation (GDPR), a data protection law with global reach, came into effect. The goal of the regulation is to respect people’s privacy and be transparent as an organization, but also to guarantee free movement of data within the European internal market. The GDPR applies to EU companies and to companies elsewhere, for example in Canada and the United States, that work with the personal data of people in the EU. It basically means that when you ask for and save customers’ personal details, you need a legal basis for doing so, such as their consent, and you cannot keep the data longer than necessary. This data could be anywhere and could already be in your archives, but you just might not know because your archive is either not digital or not searchable. Converting to or creating searchable PDF/A files is therefore very relevant to ensuring GDPR-compliant archives. Combine this with automated anonymization or pseudonymization and you are much better placed to store only the right data. So how do you become GDPR-compliant?

1 – Convert any paper documents into digital files.
2 – Make your PDFs searchable (more on how Klippa can help you do this below).
3 – Identify & anonymize all sensitive data that you are not allowed to store.
4 – Use the PDF/A format for safe long-term archiving.
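To give an idea of what step 3 can look like once the text of a document is searchable, here is a small Python sketch that replaces e-mail addresses with salted tokens. It is an illustration under our own assumptions, not a description of how Klippa’s service works: the function name, the token format and the simple e-mail pattern are all hypothetical, and real archives contain many more kinds of personal data than e-mail addresses.

```python
import hashlib
import re

# Deliberately simple pattern; it will not catch every valid e-mail address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymise_emails(text, salt):
    """Replace each e-mail address with a stable token derived from a salted hash."""

    def replace(match):
        address = match.group(0).lower()
        digest = hashlib.sha256((salt + address).encode("utf-8")).hexdigest()
        return f"<email:{digest[:10]}>"

    return EMAIL_PATTERN.sub(replace, text)


print(pseudonymise_emails("Invoice sent to jane.doe@example.com", salt="keep-this-secret"))
```

Because the same address always maps to the same token, records can still be linked to each other. That makes this pseudonymization rather than anonymization: anyone who holds the salt can test whether a known address appears in the archive.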

Common use cases of PDF/A

Below are some examples of use cases:

Digitization: Insurance companies that want to say goodbye to printed invoices and create a digital archive so they can quickly search them when necessary, thereby improving productivity.
Digital documents: Law firms that wish to convert their legal documents to PDF/A for archiving and compliance purposes.
Searchability: Converting all your documents and PDFs to searchable PDFs makes it much easier and less time-consuming to find data in your archives.
Documentation: Banks that offer a new service can refer to the exact terms and conditions of old services.
Collaboration: Engineers who share drafts of a document and store the finished version in PDF/A for long term accessibility.
Email/mail: Healthcare providers that want to automatically archive all communications with patients in order to access them quickly.

How to create PDF/A and searchable PDFs

If you want to create a single PDF/A file, you could just use Microsoft Word to do so. Creating PDF/A files automatically on a large scale is technically quite complex. If you are very technical and are looking for a way to do it yourself, check out the PDF Association. If you lack technical experience, don’t wish to spend a lot of time investigating how it works, or have large volumes of documents that need converting, we can automate the process for you. We can convert all of your scans, images or PDFs to any version of PDF/A. Even your entire archive. With our service, you can make your entire database of files searchable and safe to store, without data corruption. With the traditional PDF format you can’t guarantee that a file will still work if you try to open it in five years’ time. With searchable PDF/A, you can.
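For readers who do want to experiment themselves, the sketch below shows one possible do-it-yourself route in Python. It assumes the open-source command-line tool OCRmyPDF is installed; that tool is our choice for illustration and is not mentioned elsewhere in this blog. The helper names `build_ocrmypdf_command` and `convert_folder` are hypothetical. The code only builds the commands unless you pass `run=True`.

```python
import subprocess
from pathlib import Path


def build_ocrmypdf_command(source, target, part=2):
    """Return the OCRmyPDF command line that turns one file into searchable PDF/A."""
    if part not in (1, 2, 3):
        raise ValueError("part must be 1, 2 or 3")
    return [
        "ocrmypdf",
        "--output-type", f"pdfa-{part}",  # ask for PDF/A output of the given part
        "--skip-text",                    # do not OCR pages that already contain text
        str(source),
        str(target),
    ]


def convert_folder(in_dir, out_dir, part=2, run=False):
    """Build (and optionally run) one command per PDF found directly in in_dir."""
    out_path = Path(out_dir)
    commands = []
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        command = build_ocrmypdf_command(pdf, out_path / pdf.name, part)
        commands.append(command)
        if run:
            out_path.mkdir(parents=True, exist_ok=True)
            subprocess.run(command, check=True)  # raises if the conversion fails
    return commands


print(" ".join(build_ocrmypdf_command("scan.pdf", "scan_pdfa.pdf")))
```

Calling `convert_folder("scans", "archive", run=True)` would process every PDF in a folder named `scans`. Treat the result as a starting point: the output should still be checked with a PDF/A validator before it goes into an archive.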

How to validate PDF/A files

It is hard to judge a book by its cover. The same goes for PDF files. If it’s hard to validate a document by looking at it, how can you be sure that a file is actually a PDF/A file and that it conforms to the standard? PDF/A validators are the answer. These are (online) tools that verify whether all the requirements of the standard have been met. A good starting point is VeraPDF.
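A useful first check before running a full validator is to see what a file claims to be. PDF/A files declare their part and conformance level in their XMP metadata, using the properties `pdfaid:part` and `pdfaid:conformance`. The Python sketch below looks for those two properties in the raw bytes of a file. It is a rough illustration: the function name is our own, and it assumes the XMP packet is stored uncompressed and encoded as UTF-8 or ASCII.

```python
import re

# XMP can store each property either as an element or as an attribute.
_PART = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
_LEVEL = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])")


def claimed_pdfa_flavour(data):
    """Return the PDF/A label a file claims in its XMP metadata, or None if absent."""
    part = _PART.search(data)
    if part is None:
        return None
    level = _LEVEL.search(data)
    suffix = level.group(1).decode("ascii").lower() if level else ""
    return f"PDF/A-{part.group(1).decode('ascii')}{suffix}"


sample = (
    b'<rdf:Description xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/" '
    b'pdfaid:part="2" pdfaid:conformance="U"/>'
)
print(claimed_pdfa_flavour(sample))  # PDF/A-2u
```

For a real file you would pass in `Path("document.pdf").read_bytes()`. Keep in mind that this only reads the claim. Whether the file actually meets the requirements is exactly what a validator has to establish.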

