How to use document classification and sorting with machine learning and OCR

In many organizations, both corporate and governmental, incoming documents used to be processed in mailrooms. These documents could range from tax statements and fines to customer service letters and invoices. Each of them had to be routed to the appropriate department. Eventually, they were manually processed and ended up in a big archive.

Since the majority of these organizations have digitized their systems over the past decade, the number of paper documents they receive is getting smaller every year. Many organizations have already moved to digital mailrooms, case management systems and archives, and receive most documents by email. Some work with large scanners to digitize the remaining paper flow.

Receiving and converting documents to a digital format, however, is just the first step in reducing errors and improving operational efficiency. Classifying the content of documents, sorting them, routing them to the right department and making sure they are available as searchable text are valuable next steps that can be automated and built into your document processing setup.

In this blog, we will reveal how you can do this with our document classification solution.


The secret is algorithms

Klippa has created machine learning algorithms that are trained with a set of more than 1 million documents. The algorithms extract many document characteristics such as file formats, file sizes and layouts.

The software extracts the content of documents using Optical Character Recognition (OCR), and applies text analysis and statistics using natural language processing (NLP) to determine topic clusters. It identifies patterns within sets of document types that enable it to match unknown documents to one of these sets.

For any unknown document that has to be classified, the characteristics are extracted and fed to the algorithms. An algorithm is basically a mathematical formula, so the result will be a certain score. We call this a similarity score. This score is compared to all the document categories in the dataset that the model was trained with. The best match between the document score and the category score is the most likely candidate for classification.
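To make the idea of a similarity score more concrete, here is a minimal sketch in Python. It is not Klippa's actual implementation: it assumes simple word counts as the only document characteristic and cosine similarity as the score, and the function names (`vectorize`, `train`, `classify`) and the tiny training set are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn text into word counts (a stand-in for richer document features)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Return a score between 0 (nothing shared) and 1 (same word distribution)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def train(examples):
    """Build one word-count profile per category from labeled example texts."""
    profiles = {}
    for category, text in examples:
        profiles.setdefault(category, Counter()).update(vectorize(text))
    return profiles

def classify(text, profiles):
    """Score the document against every category and return the best match."""
    doc = vectorize(text)
    scores = {cat: cosine_similarity(doc, prof) for cat, prof in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical training data; a real model would learn from far more examples.
TRAINING = [
    ("invoice", "invoice number total amount due vat payment terms"),
    ("invoice", "invoice date amount due total vat"),
    ("reminder", "payment reminder overdue amount please pay immediately"),
    ("reminder", "reminder your payment is overdue"),
]

profiles = train(TRAINING)
label, scores = classify("invoice total amount vat", profiles)
print(label, scores)
```

The category with the highest score wins. In practice you would also want a minimum score below which a document is sent to a person for review instead of being sorted automatically.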

The visual below gives a simplified example of a document classification flow:

It is possible to achieve more than 99% accuracy using automated document classification, while a single sorting action takes around 1/10th of a second. Manual classification is much slower: people take at least a few seconds to sort a document. Besides lacking speed, people are generally no more than 95% accurate, depending on the complexity of the sorting task.

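So if we are talking about large volumes, say sorting 100,000 documents a month, manual sorting at 2 seconds per document will take at least 20 times longer than the algorithm. At 95% versus 99% accuracy, it will also result in roughly five times as many mistakes (about 5,000 misclassified documents instead of 1,000). This will easily cost a large organization thousands of euros per month, while an algorithm would only cost you a fraction of that.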
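The comparison can be worked through as a quick calculation. The sketch below assumes 2 seconds per manual sorting action (the low end of "a few seconds") and the accuracy figures mentioned earlier; all constant names are ours.

```python
DOCS_PER_MONTH = 100_000
AUTO_SECONDS = 0.1       # "around 1/10th of a second" per sorting action
MANUAL_SECONDS = 2.0     # assumption: low end of "a few seconds"
AUTO_ACCURACY = 0.99
MANUAL_ACCURACY = 0.95

def monthly_hours(seconds_per_doc, docs=DOCS_PER_MONTH):
    """Total sorting time per month, in hours."""
    return docs * seconds_per_doc / 3600

def monthly_errors(accuracy, docs=DOCS_PER_MONTH):
    """Expected number of misclassified documents per month."""
    return round(docs * (1 - accuracy))

speed_ratio = monthly_hours(MANUAL_SECONDS) / monthly_hours(AUTO_SECONDS)
extra_errors = monthly_errors(MANUAL_ACCURACY) - monthly_errors(AUTO_ACCURACY)

print(f"Manual: {monthly_hours(MANUAL_SECONDS):.1f} h, "
      f"automated: {monthly_hours(AUTO_SECONDS):.1f} h per month")
print(f"Manual sorting is {speed_ratio:.0f}x slower "
      f"and produces {extra_errors} more errors per month")
```

Under these assumptions, manual sorting takes about 56 hours a month against about 3 hours for the algorithm, and the accuracy gap of 4 percentage points amounts to 4,000 extra misclassified documents.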


Classification works for almost any type of document

Any feature (characteristic) that a person could identify can be classified by our software, plus a little more. The most important prerequisite is that there is enough data to train a model to understand the differences between certain features.

In that regard, machine learning algorithms are not that different from humans. They learn about the differences between, for example, an invoice and a payment reminder through one thing: experience.

This is what the Klippa software can do for you:

  • File type classification
  • Document type classification
  • Document language classification
  • Country of origin classification
  • Merchant classification
  • Line item classification
  • Classification of risk or urgency
  • Classification of privacy-sensitive data

File type classification

If you don’t know what files you have in your mailroom or archive, the first step is to quickly identify every single stored file. You can think about file types like PDFs, Word documents, Excel sheets, emails, images, scans or any other type.
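As an illustration of the simplest form of file type classification, the sketch below looks at the first bytes of a file (its "magic number") instead of trusting the file extension. It only covers a handful of common formats, and the function name `sniff_file_type` and the labels are our own; it says nothing about how Klippa's software does this internally.

```python
# Well-known leading byte signatures for a few common formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),            # little-endian TIFF
    (b"MM\x00*", "tiff"),            # big-endian TIFF
    (b"PK\x03\x04", "zip-container"),  # also used by .docx and .xlsx
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc, .xls, .msg
]

def sniff_file_type(data: bytes) -> str:
    """Return a coarse file type label based on the leading bytes."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"

def sniff_path(path: str) -> str:
    """Read only the first bytes of a file on disk and classify it."""
    with open(path, "rb") as handle:
        return sniff_file_type(handle.read(16))

print(sniff_file_type(b"%PDF-1.7\n..."))
```

Modern Word and Excel files are ZIP containers, so telling them apart requires looking inside the archive. That is one reason a real solution needs more than a signature table.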

Document type classification

Document types can be classified and sorted as well. For example, you can classify invoices, receipts, contracts, customer service letters, bills of lading, purchase orders, delivery slips, bank statements, identity documents, salary slips and many more. Klippa can classify more than 30 different types of documents.

Document language classification

The document language can be classified and sorted as well. Each document can get a label such as ‘English’, ‘Dutch’, ‘Spanish’ or any other language. This can be very useful if you have documents in multiple languages and you are looking for a specific one.
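A very rough way to picture language labeling is counting how many common function words of each language appear in the OCR text. The sketch below does exactly that for three languages. The word lists are deliberately tiny and hand-picked by us, and the approach is a simplification for illustration, not the method used in Klippa's software.

```python
import re

# Tiny, hand-picked lists of frequent function words (assumption: illustrative only).
STOPWORDS = {
    "English": {"the", "and", "of", "to", "is", "in", "for", "with", "your"},
    "Dutch": {"de", "het", "een", "en", "van", "is", "voor", "met", "uw"},
    "Spanish": {"el", "la", "los", "y", "de", "en", "es", "para", "con", "su"},
}

def detect_language(text):
    """Label text with the language whose function words occur most often."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    counts = {
        language: sum(word in stopwords for word in words)
        for language, stopwords in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"

print(detect_language("De factuur voor uw bestelling is met de post verzonden"))
```

Short texts and languages that share many words are easy to confuse with this method, which is why production systems rely on trained models and much more data.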

Country of origin classification

Some documents, such as shipping labels or passports, contain information about the country of origin. This can be used to label the documents for sorting purposes. Think of country labels such as ‘The Netherlands’, ‘United Kingdom’ or region labels such as ‘Europe’.

Merchant classification

Merchants are important when processing receipts and invoices. The merchant can tell you at what type of store the purchase was made. Category labels can be used to classify the type of store (e.g. hardware store, supermarket, electronics store or pharmacy).

Line item classification

Classification of line items (i.e. product purchases) is also an option. With a smart algorithm that learned from analyzing 500,000 receipts and invoices, Klippa can classify products into more than 20 categories, such as ‘Food & drinks’, ‘Electronics’, ‘Alcoholic’, ‘Transportation’ and more. This can be used to determine tax return eligibility, loyalty point distribution, customer analytics and other things.

Classification of risk or urgency

Risk or urgency classification can be important when setting priorities in high-volume customer support applications. Complaint letters or emails from customers who are angry or are planning legal action can be classified as ‘high priority’, while a support question about a specific feature can be classified as ‘low priority’.

Classification of privacy-sensitive data

In some industries, it’s important to identify and classify documents containing sensitive data because of GDPR or other privacy regulations. Think of documents such as passports, ID cards, driver’s licenses, credit cards and contracts. Klippa’s OCR API can automatically detect and label these documents for you. It’s even possible to automatically anonymize them by removing or blacking out specific lines on a document.
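To give a feel for what detecting and anonymizing sensitive data can look like on extracted text, here is a small Python sketch. It does not call Klippa's OCR API. It assumes the text has already been extracted, and the two regular expressions (email addresses and IBAN-like account numbers) as well as the function name `anonymize` are simplified, hypothetical examples.

```python
import re

# Simplified patterns for illustration; real detectors need to be far stricter.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def anonymize(text):
    """Mask every match with asterisks and report which labels were found."""
    found = set()
    for label, pattern in PATTERNS.items():
        def mask(match, label=label):
            found.add(label)
            return "*" * len(match.group())
        text = pattern.sub(mask, text)
    return text, sorted(found)

sample = "Contact jan@example.com, IBAN NL91ABNA0417164300."
redacted, labels = anonymize(sample)
print(redacted)
print(labels)
```

The returned labels can be used to sort the document into a ‘privacy-sensitive’ category, while the masked text can be stored or shared more safely. Masking keeps the text the same length, so positions in the document still line up with the original.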


The benefits of document classification and sorting

Which benefits of OCR and AI-based document classification apply to you depends on your situation. In general, all the benefits boil down to two things:

  • Increased operational efficiency → Increasing processing speed and reducing processing cost
  • Improved compliance → Reducing errors and finding indicators of risks in large collections of data

If you are replacing manual document sorting with a classification-based document sorting solution, you can easily reduce your operational cost by 70%.


Next steps

If your organization has any challenges with regard to efficient document processing, Klippa is here to help. We are happy to advise on best practices, demonstrate the capabilities of our software or just get to know each other. Below you can find an online demo scheduler that might be the next step in your digital transformation.

You can also read more about organizing, labelling and anonymizing archives in one of our other blogs.

 Schedule a free online demonstration

A clear overview of Klippa in only 30 minutes.