Fine-tuning Tesseract OCR for German Invoices

Denis Gontcharov Blog, Data Science

Management Summary

OCR (Optical Character Recognition) is a major challenge for many companies. The OCR market comprises various open-source and commercial providers. A well-known open-source tool for OCR is Tesseract, whose development has been sponsored by Google. Tesseract is currently available in version 4, which performs OCR extraction using recurrent neural networks. However, the OCR performance of Tesseract is still volatile and depends on various factors. A particular challenge is the application of Tesseract to documents that combine different structures, e.g. text, tables and images. Invoices are one such document type, and OCR tools from all vendors continue to underperform on them.

This article demonstrates how fine-tuning the Tesseract OCR engine on a small sample of data can already significantly improve OCR performance on invoice documents. The presented process is not only applicable to invoices but to any type of document.

The use case aims at the correct extraction of all text (words and numbers) from one fictional yet realistic German invoice. The extracted information is assumed to be destined for downstream accounting purposes. Therefore, the correct extraction of numbers and the Euro symbol is considered critical.

The OCR performance of two Tesseract models for the German language is compared: the standard (non-fine-tuned) model and its fine-tuned variant. The standard model is downloaded from the Tesseract OCR GitHub repository. The fine-tuned model is created using the steps outlined in this article. A second German invoice similar to the first one is used for fine-tuning. Both the standard model and fine-tuned model are evaluated on the first German invoice to ensure a fair comparison.

The OCR performance of the standard model on numbers is poor. Certain numbers are falsely recognized as other numbers. This is especially true for digits that look similar, such as 1 and 7. The Euro symbol is misrecognized in 50% of cases, making the result unsuitable for any downstream accounting application.

The fine-tuned model shows a similar OCR performance for German words. However, the OCR performance on numbers improves significantly. All numbers and every Euro symbol are extracted correctly.

It is concluded that fine-tuning can yield a large improvement for a minimal amount of effort and training data. This makes Tesseract OCR with its open-source licensing an attractive solution compared to proprietary OCR software. Final recommendations are offered for fine-tuning Tesseract LSTM models given a real use case for which more training data is available.

Download the Tesseract Docker Container

The entire fine-tuning process of the Tesseract LSTM model is discussed in detail below. Since the installation and application of Tesseract can be quite complicated, we have prepared a Docker Container that already contains all necessary installations. By using the container, you can follow all steps below.

    Introduction

    Tesseract 4 with its LSTM engine works reasonably well out-of-the-box for plain text pages.

    There are, however, certain challenging scenarios for which an off-the-shelf model performs poorly. Examples include text set in unusual fonts, images with backgrounds and text in tables. Luckily, Tesseract provides a way to fine-tune the LSTM engine to improve its OCR performance on a specific use case.

    In this article the OCR (Optical Character Recognition) performance of an off-the-shelf Tesseract LSTM model is benchmarked on a German invoice. Next, this model is fine-tuned on a second German invoice. The OCR performance of both models is compared, and further improvements are suggested.

    Why OCR on invoices remains challenging

    Even though OCR is considered to be a solved problem, extracting a large corpus of text without any mistakes remains challenging. This is especially true for OCR on invoice documents which, compared to a book-type text, face three additional problems:

    1. colored backgrounds and table structures pose a challenge for page segmentation
    2. invoices typically contain rare characters such as the EUR or USD sign
    3. numbers can’t be verified against a language dictionary

    In addition, the margin for error is small: for accounting applications an exact extraction of numeric data is paramount for all subsequent process steps.

    The first problem can generally be resolved by selecting a suitable page segmentation mode of the fourteen that are provided by Tesseract. The latter two problems can be resolved by fine-tuning the LSTM engine on examples of similar invoices.

    Use case objective and data

    Two similar example invoices are considered in the article. The first invoice shown in Figure 1 will be used to evaluate the OCR performance of both the standard and the fine-tuned Tesseract model. Special attention is devoted to the correct extraction of numbers for accounting purposes. The second invoice shown in Figure 2 will be used as training data to fine-tune Tesseract.

    Invoices are mostly written in a very readable type font like “Arial”. To illustrate the benefits of fine-tuning, the initial OCR problem is made more challenging by considering invoices written in the font “Impact”. This is a font for which Tesseract struggles to resolve certain characters.

    It will be shown that after fine-tuning on a very small amount of data, Tesseract will yield very satisfactory results in spite of this challenging font.

    Figure 1: Evaluation invoice on which both the standard and fine-tuned Tesseract models will be evaluated
    Figure 2: Training invoice on which the Tesseract OCR LSTM model will be fine-tuned

    Using the Tesseract 4.0 Docker container

    The setup for fine-tuning the Tesseract LSTM engine currently only works on Linux and can be a bit tricky. Therefore, a Docker container with pre-installed Tesseract 4.0, along with the compiled training tools and scripts, is provided with this article.

    Load the Docker image from the provided archive file (or pull the container image using the provided link):

    docker load -i docker/tesseract_image.tar

    Once the image is loaded, run the container in detached mode:

    docker run -d --rm --name tesseract_container tesseract:latest

    Finally, access the running container’s shell to replicate the commands in this article:

    docker exec -it tesseract_container /bin/bash

    General improvements of OCR Performance

    There are three ways in which Tesseract’s OCR performance can be improved before resorting to fine-tuning the LSTM engine.

    1. Image preprocessing

    Invoice images may have a skewed orientation if they weren’t properly aligned on the scanner. Rotated images should be deskewed to improve Tesseract’s line segmentation performance.

    In addition, scanning may introduce image noise which should be removed by a denoising algorithm. Note that by default Tesseract performs thresholding using Otsu’s algorithm to binarize grayscale images into black and white pixels.
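
As a rough illustration of the thresholding step mentioned above, Otsu's method picks the gray level that maximizes the variance between the dark and the bright pixel class. The sketch below is a plain-Python version for illustration only and is not Tesseract's own implementation; the function names `otsu_threshold` and `binarize`, and the choice of a flat list of 8-bit gray values as input, are assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes the between-class variance.

    Pixels with a value <= t form the dark class, all others the bright class.
    `pixels` is assumed to be a flat iterable of integers in the range 0-255.
    """
    pixels = list(pixels)
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    weight_dark = 0  # number of pixels in the dark class so far
    sum_dark = 0     # sum of gray values in the dark class so far
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no dark pixels yet, nothing to separate
        weight_bright = total - weight_dark
        if weight_bright == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_bright = (total_sum - sum_dark) / weight_bright
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_dark * weight_bright * (mean_dark - mean_bright) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def binarize(pixels, threshold):
    """Map every pixel to black (0) or white (255) using the given threshold."""
    return [0 if value <= threshold else 255 for value in pixels]
```

In practice an image library would supply the pixel values, and Tesseract applies its own binarization internally, so a sketch like this is mainly useful for understanding why low-contrast scans binarize poorly.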

    A thorough treatment of image preprocessing is beyond the scope of this article and is not necessary to obtain satisfactory results in the given use case. The Tesseract documentation provides a convenient overview.

    2. Page segmentation

    During page segmentation Tesseract attempts to identify rectangular regions of text.

    Only these regions are selected for OCR in the next step. It’s therefore critical to capture all regions with text lest information be lost.

    Tesseract offers 14 different page segmentation methods, which can be listed with the following command:

    tesseract --help-psm

    The default segmentation method expects a page of text similar to a book page. However, this mode fails to identify all text regions on an invoice because of its additional tabular structure. A better segmentation method is given by option 4: Assume a single column of text of variable sizes.

    To illustrate the importance of a suitable page segmentation method, consider the result of using the default method “Fully automatic page segmentation, but no OSD” in Figure 3:

    Figure 3: Page segmentation using the default method fails to determine all text regions

    Note that the texts “Rechnungsinformationen:”, “Pos.” and “Produkt” were not segmented. In Figure 4 a more suitable method results in a perfect page segmentation.
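
When several page segmentation methods need to be compared on the same invoice, it can help to assemble the Tesseract call programmatically. The following is a minimal sketch, assuming the `tesseract` binary is on the PATH; the helper names `build_tesseract_command` and `run_tesseract` are hypothetical and not part of the Docker container.

```python
import subprocess


def build_tesseract_command(image_path, output_base, psm=4, lang="deu", configs=()):
    """Assemble the argument list for a Tesseract call with a given segmentation mode.

    Tesseract 4 numbers its page segmentation modes from 0 to 13.
    `configs` can hold trailing config names such as "lstmbox".
    """
    if psm not in range(14):
        raise ValueError(f"page segmentation mode must be between 0 and 13, got {psm}")
    command = ["tesseract", str(image_path), str(output_base), "--psm", str(psm), "-l", lang]
    command.extend(configs)
    return command


def run_tesseract(image_path, output_base, **options):
    """Run Tesseract and raise CalledProcessError if it exits with an error."""
    command = build_tesseract_command(image_path, output_base, **options)
    return subprocess.run(command, check=True, capture_output=True, text=True)
```

Calling `run_tesseract("eval_invoice.tiff", "eval_invoice_psm4", psm=4)` once per candidate mode would then produce one output file per mode for a side-by-side comparison; the output base name here is only an example.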

    3. Use of dictionaries, word lists and patterns for text

    The LSTM models used by Tesseract were trained on large amounts of text in one specific language. This command shows the languages that are currently available for Tesseract:

    tesseract --list-langs 

    Additional language models can be obtained by downloading the corresponding traineddata file (for example deu.traineddata) and placing it in the tessdata folder of the local Tesseract installation. The Tesseract project on GitHub provides three variants of language models: normal, fast and best. According to the Tesseract documentation, only the best variant is suitable as a starting point for fine-tuning, because the fast models are integerized. As their names imply, fast and best are the fastest and most accurate variants respectively. Other models have been trained for specific use cases, such as exclusively recognizing digits and punctuation, and are listed in the references.

    As the language of the invoices in this use case is German, the Docker image belonging to this article comes with the deu.traineddata model.

    For a chosen language, Tesseract’s word list can be further expanded or limited to certain words or even characters. This subject lies outside the scope of this article as it’s not necessary to obtain satisfactory results in this use case.

    Setup for the fine-tuning process

    Three file types must be created for fine-tuning:

    1. tiff files

    Tagged Image File Format, or TIFF, is an image file format that is typically stored uncompressed or with lossless compression (as opposed to JPG, which uses lossy compression). TIFF files can be obtained from PNG or JPG images with a conversion tool. Although Tesseract can work with PNG and JPG images, the TIFF format is recommended.

    2. box files

    To train the LSTM model, Tesseract relies on so-called box files with the “.box” extension. A box file contains the recognized text along with the coordinates of the bounding box in which the text is situated. Box files contain six columns representing symbol, left, bottom, right, top and page.

    P 157 2566 1465 2609 0
    r 157 2566 1465 2609 0
    o 157 2566 1465 2609 0
    d 157 2566 1465 2609 0
    u 157 2566 1465 2609 0
    k 157 2566 1465 2609 0
    t 157 2566 1465 2609 0
      157 2566 1465 2609 0
    P 157 2566 1465 2609 0
    r 157 2566 1465 2609 0
    e 157 2566 1465 2609 0
    i 157 2566 1465 2609 0
    s 157 2566 1465 2609 0
      157 2566 1465 2609 0
    ( 157 2566 1465 2609 0
    N 157 2566 1465 2609 0
    e 157 2566 1465 2609 0
    t 157 2566 1465 2609 0
    t 157 2566 1465 2609 0
    o 157 2566 1465 2609 0
    ) 157 2566 1465 2609 0
      157 2566 1465 2609 0
    

    Each individual character is situated on a separate line in the box file. The LSTM model accepts either the coordinates of individual characters or a whole text line. In the example box file above the text “Produkt Preis (Netto)” is located on the same line. All characters have the same coordinates, namely the coordinates of the bounding box around that text line. Using line-level coordinates is considerably easier and will be provided by default when the box file is generated with the following command:

    cd /home/fine_tune/train
    tesseract train_invoice.tiff train_invoice --psm 4 -l best/deu lstmbox

    The first argument is the image file, the second the box file name. The language parameter -l instructs Tesseract to use the German model for OCR. The parameter --psm instructs Tesseract to use page segmentation method number four.

    Unavoidably, the generated box file will contain OCR errors in the symbol column. Each symbol in the training box file must therefore be verified by hand. This is a tedious process given that the box file of the training invoice contains nearly a thousand lines (one for each character in the invoice). To simplify the correction, the Docker container provides a Python script that draws the bounding boxes along with the OCR text on the original invoice image for easier comparison. The result is shown in Figure 4. The Docker container already contains the corrected box files indicated by the suffix “_correct”.
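
The six-column box file layout described above can be read back into plain text lines, which makes a first pass over the symbol column easier than scanning one character per line. The sketch below assumes the line-level format produced by the lstmbox config, in which a space symbol separates words and a tab symbol marks the end of a text line; `BoxEntry`, `parse_box_line` and `box_entries_to_text_lines` are illustrative names, not part of Tesseract or of the Docker container.

```python
from typing import NamedTuple


class BoxEntry(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box file line such as 'P 157 2566 1465 2609 0'.

    Splitting from the right keeps a symbol that is itself a space or a tab intact.
    """
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(number) for number in numbers)
    return BoxEntry(symbol, left, bottom, right, top, page)


def box_entries_to_text_lines(entries):
    """Join symbols into text lines, treating a tab symbol as the end of a line."""
    lines = []
    current = []
    for entry in entries:
        if entry.symbol == "\t":
            lines.append("".join(current))
            current = []
        else:
            current.append(entry.symbol)
    if current:  # last line without a closing tab marker
        lines.append("".join(current))
    return lines


def read_box_file(path):
    """Read a whole box file and return its entries, skipping empty lines."""
    with open(path, encoding="utf-8") as handle:
        return [parse_box_line(line) for line in handle if line.strip("\n")]
```

Printing the result of `box_entries_to_text_lines(read_box_file("train_invoice.box"))` next to the invoice image gives a quick impression of which lines need manual correction.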

    Figure 4: Extracted text using the standard German model “deu”

    3. lstmf files

    During fine-tuning Tesseract extracts text from the tiff file using OCR and verifies its prediction using the coordinates and the symbol in the box file. Tesseract does not rely on the tiff and box file directly, but expects an lstmf file constructed from both previous files. Note that in order to create the lstmf file the tiff and box files must have the same name, for example train_invoice.tiff and train_invoice.box.
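
Because a mismatch between the tiff and box file names is easy to overlook, a small check before generating the lstmf files can save a failed run. This is a minimal sketch using only the Python standard library; the function name `find_unpaired_tiffs` is hypothetical.

```python
from pathlib import Path


def find_unpaired_tiffs(directory):
    """Return the names of .tiff files that have no box file with the same base name."""
    directory = Path(directory)
    return sorted(
        tiff.name
        for tiff in directory.glob("*.tiff")
        if not tiff.with_suffix(".box").exists()
    )
```

An empty list from `find_unpaired_tiffs("/home/fine_tune/train")` would indicate that every training image has a matching box file.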

    The following command generates an lstmf file for the train invoice:

    cd /home/fine_tune/train
    tesseract train_invoice_correct.tiff train_invoice_correct lstm.train

    All lstmf files destined for training must be specified by their relative path in a text file called deu.training_files.txt. In this use case only one lstmf file will be used for training, so deu.training_files.txt contains just one line: train/train_invoice_correct.lstmf.

    It’s recommended to create an lstmf file for the eval invoice as well. This way the model performance can be evaluated during model training.
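
With more than one training invoice, the list file is easier to generate than to maintain by hand. The following sketch writes one relative path per line, which is the layout described above; the helper name `write_training_list` is an assumption for this example.

```python
from pathlib import Path


def write_training_list(lstmf_paths, list_file):
    """Write one lstmf path per line to the given list file and return its path."""
    list_file = Path(list_file)
    lines = [f"{Path(path).as_posix()}\n" for path in lstmf_paths]
    list_file.write_text("".join(lines), encoding="utf-8")
    return list_file
```

For the single training invoice of this use case, `write_training_list(["train/train_invoice_correct.lstmf"], "train/deu.training_files.txt")` would produce the one-line file.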

    cd /home/fine_tune/eval
    tesseract eval_invoice_correct.tiff eval_invoice_correct lstm.train

    Evaluating the standard LSTM model

    OCR predictions from the standard German model “deu” will serve as a benchmark. An accurate overview of the standard German model’s OCR performance can be obtained by generating a box file for the eval invoice and visualizing the OCR text using the Python script mentioned earlier. This script is located under /home/fine_tune/src/draw_box_file_data.py in the Docker container. It expects the path to a tiff file, the corresponding box file and a name for the output tiff file. The OCR text extracted by the standard German model is saved as eval/eval_invoice_ocr_deu.tiff and shown in Figure 1.

    At first glance the text extracted by OCR looks good. The model correctly extracts German characters such as ä, ö, ü and ß. In fact, there are only three occasions where words contain errors:

    OCR                 Truth
    Jessel GmbH 8 Co    Jessel GmbH & Co
    11 Glasbehälter     1l Glasbehälter
    Zeki64@hloch.com    Zeki64@bloch.com

    The German model performs well on common German words but has difficulties with single symbols such as “&” and “l” and with words such as “bloch” that are not present in the model’s word list.

    Prices and numbers in general are a different story. Here the errors are numerous.

    OCR          Truth
    159,16       159,1€
    1%           7%
    1305.816     1305.81€
    227.66       227.6€
    341.51       347.57€
    1115.16      1115.7€
    242.86       242.8€
    1456.86      1456.8€
    51.46        54.1€
    1954.719€    1954.79€

    Note that the standard German model failed to extract the Euro-symbol € in 9 of 18 occurrences. This represents an error rate of 50%.
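
Counts like these can also be computed from pairs of OCR output and ground truth rather than tallied by hand. The sketch below uses the edit (Levenshtein) distance to derive a character error rate and counts missed occurrences of a given symbol. It is an illustration only and not necessarily the metric that Tesseract reports during training; the function names are assumptions, and the three sample pairs are taken from the table above.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(pairs):
    """Total edit distance divided by the total number of ground-truth characters."""
    total_edits = sum(levenshtein(ocr, truth) for ocr, truth in pairs)
    total_chars = sum(len(truth) for _, truth in pairs)
    return total_edits / total_chars


def missed_symbol_count(pairs, symbol="€"):
    """Count how many ground-truth occurrences of `symbol` are absent from the OCR text."""
    return sum(max(truth.count(symbol) - ocr.count(symbol), 0) for ocr, truth in pairs)


# (OCR, truth) pairs copied from the first rows of the table above.
sample_pairs = [("159,16", "159,1€"), ("1%", "7%"), ("227.66", "227.6€")]
print(character_error_rate(sample_pairs))
print(missed_symbol_count(sample_pairs))
```

Running the same functions on the output of both models gives a like-for-like comparison that does not depend on visual inspection.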

    Fine-tuning the standard LSTM model

    The LSTM model will now be fine-tuned on the training invoice shown in Figure 2. Next the OCR performance will be evaluated on the evaluation invoice shown in Figure 1 that was used for benchmarking the standard German model.

    To fine-tune the LSTM model, it must first be extracted from deu.traineddata. The following command extracts the LSTM model from the standard German model into the directory lstm_model:

    cd /home/fine_tune
    combine_tessdata -e tesseract/tessdata/best/deu.traineddata lstm_model/deu.lstm

    Now all necessary files are obtained for fine-tuning. The files are also present in the Docker container:

    1. The training files train_invoice_correct.lstmf and deu.training_files.txt in the train directory.
    2. The evaluation files eval_invoice_correct.lstmf and deu.training_files.txt in the eval directory.
    3. The extracted LSTM model deu.lstm in the lstm_model directory.

    The Docker container contains the script src/fine_tune.sh that launches the fine-tuning process. Its contents are:

    /usr/bin/lstmtraining \
     --model_output output/fine_tuned \
     --continue_from lstm_model/deu.lstm \
     --traineddata tesseract/tessdata/best/deu.traineddata \
     --train_listfile train/deu.training_files.txt \
     --eval_listfile eval/deu.training_files.txt \
     --max_iterations 400

    This command fine-tunes the extracted deu.lstm model on the train_invoice_correct.lstmf file specified in train/deu.training_files.txt. Fine-tuning the LSTM model requires language-specific information that is contained in the deu.traineddata file. The eval_invoice_correct.lstmf file specified in eval/deu.training_files.txt will be used to compute OCR performance metrics during training. Fine-tuning will stop after 400 iterations. Training takes less than two minutes.

    The following command runs the script and logs the output to a file:

    cd /home/fine_tune
    sh src/fine_tune.sh > output/fine_tune.log 2>&1

    The contents of the log file after training are shown below:

    output/fine_tune.log
    Loaded file lstm_model/deu.lstm, unpacking...
    Warning: LSTMTrainer deserialized an LSTMRecognizer!
    Continuing from lstm_model/deu.lstm
    Loaded 20/20 lines (1-20) of document train/train_invoice_correct.lstmf
    Loaded 24/24 lines (1-24) of document eval/eval_invoice_correct.lstmf
    
    2 Percent improvement time=69, best error was 100 @ 0
    At iteration 69/100/100, Mean rms=1.249%, delta=2.886%, char train=8.17%, word train=22.249%, skip ratio=0%, New best char error = 8.17 Transitioned to stage 1 wrote best model:output/deu_fine_tuned8.17_69.checkpoint wrote checkpoint.
    -----
    2 Percent improvement time=62, best error was 8.17 @ 69
    At iteration 131/200/200, Mean rms=1.008%, delta=2.033%, char train=5.887%, word train=20.832%, skip ratio=0%, New best char error = 5.887 wrote best model:output/deu_fine_tuned5.887_131.checkpoint wrote checkpoint.
    -----
    2 Percent improvement time=112, best error was 8.17 @ 69
    At iteration 181/300/300, Mean rms=0.88%, delta=1.599%, char train=4.647%, word train=17.388%, skip ratio=0%, New best char error = 4.647 wrote best model:output/deu_fine_tuned4.647_181.checkpoint wrote checkpoint.
    -----
    2 Percent improvement time=159, best error was 8.17 @ 69
    At iteration 228/400/400, Mean rms=0.822%, delta=1.416%, char train=4.144%, word train=16.126%, skip ratio=0%, New best char error = 4.144 wrote best model:output/deu_fine_tuned4.144_228.checkpoint wrote checkpoint.
    -----
    Finished! Error rate = 4.144

    During training Tesseract saves a model checkpoint after every iteration. The performance of the model at this checkpoint is tested on the evaluation data and compared against the current best score. If the score improves, i.e. the character error decreases, a labeled copy of the checkpoint is saved. The first number of the checkpoint’s label represents the character error and the second number the training iteration.
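
Since the checkpoint label encodes both numbers, the best checkpoint can be picked programmatically instead of by reading the log. The sketch below assumes file names of the form seen in the log above, i.e. a prefix followed by the character error, an underscore, the iteration and the extension “.checkpoint”; the helper names are hypothetical.

```python
import re

# Matches the trailing "<error>_<iteration>.checkpoint" part of a checkpoint name.
CHECKPOINT_PATTERN = re.compile(r"(?P<error>\d+(?:\.\d+)?)_(?P<iteration>\d+)\.checkpoint$")


def parse_checkpoint_name(path):
    """Return (character_error, iteration) extracted from a checkpoint file name."""
    match = CHECKPOINT_PATTERN.search(str(path))
    if match is None:
        raise ValueError(f"not a labeled checkpoint: {path}")
    return float(match["error"]), int(match["iteration"])


def best_checkpoint(paths):
    """Return the checkpoint path with the lowest character error."""
    return min(paths, key=lambda path: parse_checkpoint_name(path)[0])
```

Note that the lowest character error in the label is not automatically the best choice for unseen invoices, so the selected checkpoint should still be verified on evaluation data.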

    The last remaining step is to re-assemble the fine-tuned LSTM model so that a “traineddata” model is obtained once again. Assuming the checkpoint at the 181st iteration is desired, the following command converts the chosen checkpoint “deu_fine_tuned4.647_181.checkpoint” into a fully functional Tesseract model “deu_fine_tuned.traineddata”:

    cd /home/fine_tune
    /usr/bin/lstmtraining \
     --stop_training \
     --continue_from output/deu_fine_tuned4.647_181.checkpoint \
     --traineddata tesseract/tessdata/best/deu.traineddata \
     --model_output output/deu_fine_tuned.traineddata

    This model must be copied into the tessdata folder of the local Tesseract installation to make it available to Tesseract. This has already been done in the Docker container.

    Verify that the fine-tuned model is available in Tesseract:

    tesseract --list-langs

    Evaluating the fine-tuned LSTM model

    The fine-tuned model will be evaluated analogously to the standard model: a box file of the evaluation invoice is created, and the OCR text is displayed on the evaluation invoice image using the Python script.

    The command to generate the box files must be modified to use the fine-tuned model “deu_fine_tuned” instead of the standard model “deu”:

    cd /home/fine_tune/eval
    tesseract eval_invoice.tiff eval_invoice --psm 4 -l deu_fine_tuned lstmbox

    The OCR text extracted by the fine-tuned model is shown in Figure 5 below.

    Figure 5: OCR using the fine-tuned German model “deu_fine_tuned”

    As with the standard German model, the performance on words remains good but not perfect. To improve the performance on rare words the model’s word list could be expanded to include specific jargon.

    OCR                 Truth
    Jessel GmbH 8 Co    Jessel GmbH & Co
    1! Glasbehälte      1l Glasbehälter
    Zeki64@hloch.com    Zeki64@bloch.com

    More importantly, the OCR performance on numbers has improved significantly: the fine-tuned model extracted all numbers and every occurrence of the € sign correctly.

    OCR         Truth
    159,1€      159,1€
    7%          7%
    1305.81€    1305.81€
    227.6€      227.6€
    347.57€     347.57€
    1115.7€     1115.7€
    242.8€      242.8€
    1456.8€     1456.8€
    54.1€       54.1€
    1954.79€    1954.79€

    Conclusion and further improvements

    This article demonstrated that the performance on a difficult problem such as OCR on German invoices written in the challenging font “Impact” is greatly improved by fine-tuning on just one example invoice. The ability to fine-tune on a specific use case, combined with its open-source licensing, makes Tesseract OCR version 4 with its LSTM engine an attractive solution for challenging OCR problems.

    It might be tempting to run the fine-tuning for more iterations to improve the accuracy even further. In this use case the number of iterations was deliberately limited because only one training invoice was used. More iterations increase the risk of overfitting the LSTM model on certain symbols which increases the error rate of other symbols.  In practice though, it’s desirable to increase the number of iterations on the condition that sufficient training data is provided. Nevertheless, finding the optimal number of iterations is more an art than a science. The final OCR performance should always be verified on a different yet representative set of evaluation data.
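
One simple heuristic for choosing the number of iterations is to track the error on the evaluation data and stop once it no longer improves. The sketch below is not a feature of lstmtraining; it assumes that pairs of iteration and evaluation error have been collected separately, for example by evaluating saved checkpoints, and the function name and the `patience` parameter are hypothetical.

```python
def select_best_iteration(history, patience=2):
    """Pick the iteration with the lowest evaluation error.

    `history` is a list of (iteration, eval_error) pairs in training order.
    Scanning stops once the error has failed to improve `patience` times in a row,
    which is taken as a sign that further training starts to overfit.
    """
    if not history:
        raise ValueError("history must not be empty")
    best_iteration, best_error = history[0]
    stale = 0  # consecutive evaluations without improvement
    for iteration, error in history[1:]:
        if error < best_error:
            best_iteration, best_error = iteration, error
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_iteration, best_error
```

A larger `patience` tolerates more noise in the evaluation error, at the cost of training longer before stopping.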

    About the author

    Denis Gontcharov

    I’m a data scientist at STATWORX. Writing about data science helps me to test my understanding of a topic. I enjoy reading Russian classics, American science-fiction and blogs.
