Companies all over the world work with receipts, in some cases for warranty purposes and in others for administration. Over the last couple of years, receipts have also been identified as a valuable source of data for loyalty and data analytics companies. More and more companies in these areas contact us to ask whether our OCR technology can extract line item data from receipts. Luckily, the answer is yes! In this blog post we explain how we extract line item data from receipts.
So what are line items on receipts?
The Klippa OCR software can turn any image into a structured text document that can be used for data analytics. For administrative tasks, the merchant, dates, amounts and VAT values are usually relevant. For loyalty and data companies it is much more relevant to know which products were bought, in which combinations and at what prices. Line items are the individual product lines on a receipt, and when we talk about data extraction on receipt line items, this is the type of information we mean: the bread someone bought for €1 at the grocery store, combined with the two cartons of milk that cost €1.50 each.
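To make this concrete, here is a minimal sketch of how the bread and milk example could look once it is structured data. The `LineItem` class, its field names and the `receipt_total` helper are our own illustrative assumptions, not the actual Klippa output format.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:
    """One product line on a receipt (hypothetical structure for illustration)."""
    description: str
    quantity: int
    unit_price: Decimal  # price per unit in euros

    @property
    def total(self) -> Decimal:
        # Decimal avoids the rounding surprises of binary floats for money.
        return self.quantity * self.unit_price


def receipt_total(items: list[LineItem]) -> Decimal:
    """Sum the totals of all line items."""
    return sum((item.total for item in items), Decimal("0"))


# The example from the text: one bread for EUR 1 and two cartons of milk for EUR 1.50 each.
items = [
    LineItem("bread", 1, Decimal("1.00")),
    LineItem("milk (carton)", 2, Decimal("1.50")),
]

if __name__ == "__main__":
    for item in items:
        print(f"{item.quantity} x {item.description}: {item.total}")
    print("total:", receipt_total(items))
```

With the data in this shape, questions such as "which products are bought together" become simple queries over lists of line items.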
So how does it work?
Performing text mining on receipt line items is a step-by-step process. As soon as a picture of a receipt comes in, it is processed by multiple Klippa systems.
First, we determine the document quality based on lighting, size and resolution.
Second, if the quality is sufficient, the document is converted into a raw text file using OCR. This text file is completely unstructured and can be compared to a note in a plain text editor. At this stage it is still hard for a computer to tell the line items apart from the other information.
Third, our AI-based document classification algorithm determines from the content whether the document actually is a receipt, or perhaps an invoice or a payment slip. The document type is relevant for the last processing step.
In this last step, our software converts the raw text into structured information. We label every piece of text in the document to give it meaning: the merchant name, dates, times, amounts, VAT values, line items and more are all labeled separately. Once everything is labeled, the data can be shared as XML, JSON or CSV.
With these four steps we have converted a picture of a document into structured data, ready for data analytics and loyalty purposes. In the visual below you can see three of the four steps:


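The four steps described above can be sketched as a small pipeline. This is a toy illustration only: every function, threshold and field name is a hypothetical stand-in, the OCR step is simulated with text stored on the image object, and the keyword and regex rules take the place of the machine learning models a real system would use.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Image:
    """Hypothetical stand-in for an uploaded picture."""
    width: int
    height: int
    mean_brightness: float  # 0.0 (black) to 1.0 (white)
    simulated_text: str     # what a real OCR engine would read from the pixels


def has_proper_quality(image: Image, min_side: int = 600,
                       min_brightness: float = 0.3) -> bool:
    """Step 1: reject pictures that are too small or too dark (assumed thresholds)."""
    return (min(image.width, image.height) >= min_side
            and image.mean_brightness >= min_brightness)


def run_ocr(image: Image) -> str:
    """Step 2: produce raw, unstructured text (simulated here)."""
    return image.simulated_text


def classify(raw_text: str) -> str:
    """Step 3: guess the document type with a naive keyword rule."""
    lowered = raw_text.lower()
    if "invoice" in lowered:
        return "invoice"
    if "total" in lowered:
        return "receipt"
    return "unknown"


# A line that ends in an amount such as "1.50" is treated as a candidate line item.
AMOUNT_LINE = re.compile(r"^(?P<description>.+?)\s+(?P<amount>\d+\.\d{2})$")


def label(raw_text: str) -> dict:
    """Step 4: label the merchant, the line items and the total."""
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    result = {"merchant": lines[0] if lines else None,
              "line_items": [], "total": None}
    for line in lines[1:]:
        match = AMOUNT_LINE.match(line)
        if match is None:
            continue
        description = match.group("description")
        amount = match.group("amount")
        if description.lower().startswith("total"):
            result["total"] = amount
        else:
            result["line_items"].append(
                {"description": description, "amount": amount})
    return result


def process(image: Image) -> Optional[str]:
    """Run all four steps and return JSON, or None if the quality check fails."""
    if not has_proper_quality(image):
        return None
    raw_text = run_ocr(image)
    data = label(raw_text)
    data["document_type"] = classify(raw_text)
    return json.dumps(data)


if __name__ == "__main__":
    sample = Image(1200, 1600, 0.7,
                   "Corner Bakery\nBread 1.00\nMilk 1.50\nMilk 1.50\nTOTAL 4.00")
    print(process(sample))
```

The point of the sketch is the order of the steps: the quality check guards the OCR, and the labeling only runs on text that has already been produced and classified.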
For what receipts does Klippa OCR work?
Good question! There are two ways to extract line item data from receipts: a template-based solution and a universal solution. Working with templates means that you have to build a fixed parser for every type of receipt that you would like to analyse. The benefit is that the quality can be very good if you have only one or a few different merchants in your document set. The problem arises when you work with many different merchants. Because almost every shop uses its own receipt layout, working with templates can become very time-consuming. At Klippa we usually prefer to work with a universal solution based on machine learning. Its accuracy is around 95%, far above the market average. Our universal solution can process any type of receipt in Europe within 2 seconds, from grocery stores to electronics stores! Depending on your use case, we will always find the best solution.
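To illustrate why templates do not scale, here is a minimal sketch of a template-based parser. The merchant names "Shop A" and "Shop B", their line layouts and the `parse_with_template` function are invented for this example and do not describe real receipts or Klippa code.

```python
import re

# One fixed pattern per merchant layout. Every new layout needs a new entry.
TEMPLATES = {
    # Hypothetical layout: "2 x Milk 1.50"
    "Shop A": re.compile(
        r"^(?P<qty>\d+)\s*x\s+(?P<description>.+?)\s+(?P<price>\d+\.\d{2})$"),
    # Hypothetical layout: "Bread EUR 1.00"
    "Shop B": re.compile(
        r"^(?P<description>.+?)\s+EUR\s+(?P<price>\d+\.\d{2})$"),
}


def parse_with_template(merchant: str, lines: list[str]) -> list[dict]:
    """Extract line items using the fixed pattern for a known merchant."""
    pattern = TEMPLATES.get(merchant)
    if pattern is None:
        # An unknown merchant cannot be parsed until someone writes a template.
        raise KeyError(f"no template for merchant {merchant!r}")
    items = []
    for line in lines:
        match = pattern.match(line.strip())
        if match:
            items.append(match.groupdict())
    return items


if __name__ == "__main__":
    print(parse_with_template("Shop A", ["2 x Milk 1.50", "TOTAL 4.00"]))
    print(parse_with_template("Shop B", ["Bread EUR 1.00"]))
```

Each pattern works well for its own layout, but a receipt from a third shop fails until a new template is written, which is the maintenance cost described above.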
Does it only work for receipts?
No. Besides line items on receipts, Klippa can capture many different data fields from many different document types: line item data on invoices, for example, but also data from contracts or identity documents.
Next steps
If you are interested in implementing our OCR API or camera SDK for OCR and data extraction, you can always reach out to us. Do you have another OCR or machine learning problem you would like to have solved? Challenge us!