People have been printing information on paper and storing it in folders or boxes for hundreds of years. When personal computers and printers became popular, it became very easy to print any information you wanted. Contracts, invoices, tickets, resumes and much more have been printed on a massive scale in the last 20 years. But the personal computer also made it much easier to store information digitally, and in the last 5 years cloud storage solutions such as Dropbox and Google Drive have made this even more convenient. Slowly, we are shifting to fully digital storage of information. Because a lot of information already exists on paper, this shift has led to the field of document data extraction. But what is it? How does it work? And how can Klippa help you with document data extraction?
What is document data extraction?
Document data extraction is a technology that extracts information (or parts of it) from paper documents and stores it in a structured format, based on scans or photos of the documents (even photos taken with a mobile phone). But what is a structured data format? It is a consistent, predictable format that helps computers understand and exchange data. Instead of one large block of text, the text is split up and labelled with identifiers that mark the important information, a bit like highlighting text on paper with a marker to create a summary. The identifiers and their values are then stored in a format such as CSV, JSON, XLSX or XML. Below you can see an example in JSON format:
[
  {
    "Merchant": "Nelson",
    "Date": "20-01-2019",
    "Amount": "20",
    "Currency": "EUR"
  }
]
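To make the idea of interchangeable structured formats a bit more concrete, here is a minimal Python sketch that writes the record above as both JSON and CSV using only the standard library. The helper names to_json and to_csv are made up for this illustration and are not part of any Klippa product.

```python
import csv
import io
import json

# The record from the JSON example above; field names follow the article.
record = {
    "Merchant": "Nelson",
    "Date": "20-01-2019",
    "Amount": "20",
    "Currency": "EUR",
}


def to_json(records):
    """Serialize a list of records to a JSON string (hypothetical helper)."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize a list of records to CSV text (hypothetical helper).

    Assumes at least one record; the column order is taken from its keys.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(records[0].keys()),
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_json([record]))
    print(to_csv([record]))
```

Because every value sits under a fixed identifier, the same data can be written out in whichever format the receiving system prefers.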
How does document data extraction work?
Extracting information from paper documents takes a few steps. The first step is converting the paper document into a digital file, for example a PDF or JPG. This is usually done with a scanning device or a mobile phone. Once the document is digital, you have an image of it, but it contains no information a computer can read. To a computer it is just an image, not text. We will use the image of a receipt below to demonstrate the next steps, but it could just as well be an invoice, a contract, a passport, a utility bill or many other things.


To convert this picture to text, OCR technology is used. OCR stands for optical character recognition. This technology converts the picture of a document into an unstructured text file. The quality of the picture, the lighting and the distance between the scanning point and the document all influence the accuracy of the conversion. After the OCR conversion we have a text document, but a computer still cannot understand what the text means. Besides, in many cases only a few values are actually relevant, not the entire document. Think of the total amount on an invoice or the signatures and dates on a contract. The next step is to use a smart parsing system that reads the text, identifies the important information and extracts it for storage in a database. From the database it can then easily be converted into your preferred data format. In the image below you can see how our systems outline important information before extraction.
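As a rough illustration of the parsing step described above, the Python sketch below pulls a merchant, date, total and currency out of unstructured OCR text using a few hand-written rules. Everything in it is an assumption made for the example: the sample text, the rule that the first non-empty line is the merchant, the DD-MM-YYYY date pattern and the "Total EUR 20.00" layout. It is a simple rule-based stand-in, not a description of how Klippa's parser works.

```python
import re

# Hypothetical OCR output for a receipt; real OCR text is usually noisier.
SAMPLE_OCR_TEXT = """\
Nelson
Main Street 1
Date: 20-01-2019
Coffee        4.50
Lunch        15.50
Total EUR 20.00
"""

# Assumed date layout: DD-MM-YYYY.
DATE_PATTERN = re.compile(r"\b(\d{2}-\d{2}-\d{4})\b")

# Assumed total line layout: "Total <3-letter currency> <amount>".
TOTAL_PATTERN = re.compile(
    r"^total\s+([A-Z]{3})\s+(\d+(?:[.,]\d{2})?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_receipt(text):
    """Extract a few fields from OCR text; missing fields become None."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Assumption: the first non-empty line is the merchant name.
    merchant = lines[0] if lines else None
    date_match = DATE_PATTERN.search(text)
    total_match = TOTAL_PATTERN.search(text)
    return {
        "Merchant": merchant,
        "Date": date_match.group(1) if date_match else None,
        # Normalize a decimal comma to a decimal point.
        "Amount": total_match.group(2).replace(",", ".") if total_match else None,
        "Currency": total_match.group(1).upper() if total_match else None,
    }


if __name__ == "__main__":
    print(parse_receipt(SAMPLE_OCR_TEXT))
```

Fixed rules like these break as soon as a receipt uses a different layout, which is why a more flexible parsing system is needed when documents come from many different sources.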


We just took you through the process of extracting information from a receipt. If you need document data extraction software, you could decide to build it yourself. But in many cases it is much more efficient, in both time and money, to use a specialised third party. Klippa is a company specialising in this type of work. At Klippa we provide very flexible OCR APIs to extract data from any type of document you like, without having to build templates yourself. Input can be in many file types, such as TXT, JPG, PNG or PDF. The output of our OCR API is also very flexible. We prefer to communicate via JSON, but XML, CSV or XLSX are also possible. With an API-key you can be up and running within a day!
Let’s talk about your use case!
At Klippa we love to work on interesting document data extraction use cases. We have done projects for companies all around the world, in over 10 different languages and with every type of file you can imagine. If you have an interesting challenge for us or would like to request an API-key, send us a message via chat or mail, or give us a call.