By Dmytro Sharapov, CV engineer @It-Jim

Tutorial for Installing Tesseract

You’ve undoubtedly seen optical character recognition before. It is used for everything from processing scanned documents to reading the handwritten scribbles on your tablet PC, and in Google Translate. Today you’ll create your first app for text recognition.

What is OCR?

Optical Character Recognition, or OCR, is the process of electronically extracting text from images and reusing it in a variety of ways such as document editing, free-text searches, or compression. In this tutorial, you’ll learn how to install Tesseract, an open-source OCR engine maintained by Google.

How to Install Tesseract for Microsoft Visual Studio?

Step 1:

To install Tesseract, first install the following programs:

git: https://git-scm.com/
slik-svn: https://www.sliksvn.com/en/download
visual-studio: https://www.visualstudio.com

Step 2:

Next, create a folder where Tesseract will be installed. This can be any directory on your computer, for example “D:\Tesseract-files”.
After that, run GIT CMD and change to that folder. Your GIT command line should look like this:

Fig. 1. GIT CMD example

Step 3:

Now copy the build dependencies from the GitHub repository to your computer. To do this, run the following command in GIT CMD:
git clone git://github.com/pvorb/tesseract-vs2013.git
While the command runs, the GIT CMD console will show something like this:

Fig. 2. Clone tesseract-vs2013.git

When the command finishes, you will see the following in the console:

Fig. 3. Clone tesseract-vs2013 done

Step 4:

For the next step, run the VS2013 Developer Command Prompt. It is located at {directory of MS VS}\Common7\Tools\Shortcuts\Developer Command Prompt VS2013. Then change to D:\Tesseract-files\tesseract-vs2013.

Fig. 4. Command prompt for VS2013

Now build the dependencies with the command msbuild build.proj:

Fig. 5. Start performing build

After this step, the VS2013 command prompt can be closed.

Step 5:

Reopen GIT CMD and check that the working directory is “D:\Tesseract-files\”. Then get the latest source using SVN by running the following in GIT CMD: svn checkout https://github.com/svn2github/Tesseract.git

Fig. 6. Checkout Tesseract

After the checkout completes, a new folder named Tesseract.git\ appears in D:\Tesseract-files\.
In GIT CMD, change to D:\Tesseract-files\Tesseract.git\trunk and apply the patch provided in tesseract-vs2013 by running: svn patch D:\Tesseract-files\tesseract-vs2013\vs2013+64bit_support.patch

Fig. 7. Patch provided in tesseract-vs2013

Copy both directories (lib and include) from D:\Tesseract-files\tesseract-vs2013\release into D:\Tesseract-files\Tesseract.git\trunk\
Open D:\Tesseract-files\Tesseract.git\trunk\vs2013\tesseract.sln with Visual Studio 2013.

Step 6:

Open the Property Pages of libtesseract304. Under Configuration Properties->C/C++->General->Additional Include Directories, add D:\Tesseract-files\Tesseract.git\trunk\include\ and D:\Tesseract-files\Tesseract.git\trunk\include\leptonica\. Under Linker->General->Additional Library Directories, add D:\Tesseract-files\Tesseract.git\trunk\lib\x64\.
Repeat this for both the Debug and Release configurations, then build the project in Release and Debug.

Step 7:

To recognize text, Tesseract needs trained data files. They can be found at https://github.com/tesseract-ocr/tessdata. Download the files for the languages you need and copy them to D:\Tesseract-files\Tesseract.git\trunk\tessdata\
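
Before running any OCR code, it can help to confirm that the language file is where Tesseract will look for it. The sketch below assumes the layout from this step (a tessdata folder inside the trunk folder) and the usual <lang>.traineddata file name, such as eng.traineddata. The names trainedDataPath and fileExists are hypothetical helpers written for this tutorial, not part of the Tesseract API.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: builds "<dataRoot>\tessdata\<lang>.traineddata".
// dataRoot is the folder that contains tessdata (the trunk folder in this tutorial).
std::string trainedDataPath(const std::string& dataRoot, const std::string& lang) {
    std::string path = dataRoot;
    // Add a separator only if the caller did not supply one.
    if (!path.empty() && path.back() != '\\' && path.back() != '/') {
        path += '\\';
    }
    return path + "tessdata\\" + lang + ".traineddata";
}

// Hypothetical helper: returns true if the file can be opened for reading.
bool fileExists(const std::string& path) {
    std::ifstream file(path.c_str(), std::ios::binary);
    return file.good();
}
```

For example, fileExists(trainedDataPath("D:\\Tesseract-files\\Tesseract.git\\trunk", "eng")) should return true once the English data file has been copied.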

Step 8:

Copy Tesseract’s .dll files into the project that will use it. From D:\Tesseract-files\Tesseract.git\lib, copy libtesseract304.dll (or libtesseract304d.dll) to the project’s Release (or Debug) folder, which is the folder that contains the .exe file. From D:\Tesseract-files\tesseract-vs2013\lib\x64 (or X64), copy liblept171.dll (or liblept171d.dll) to the same Release (or Debug) folder.

Next, connect Tesseract to the project. This is necessary for both Debug and Release.

Set the following properties of the project:

In C/C++ –> General –> Additional Include Directories:
D:\Tesseract-files\Tesseract.git\trunk\
D:\Tesseract-files\Tesseract.git\trunk\ccmain
D:\Tesseract-files\Tesseract.git\trunk\ccstruct
D:\Tesseract-files\Tesseract.git\trunk\ccutil
D:\Tesseract-files\Tesseract.git\trunk\leptonica
D:\Tesseract-files\Tesseract.git\trunk\api
D:\Tesseract-files\Tesseract.git\trunk\include

In Linker –> General –> Additional Library Directories:
D:\Tesseract-files\Tesseract.git\lib\x64
D:\Tesseract-files\Tesseract.git\lib\

In Linker –> Input –> Additional Dependencies:

for Debug

libtesseract304d.lib
liblept171d.lib

for Release

libtesseract304.lib
liblept171.lib
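
As an alternative to typing the library names into Additional Dependencies, MSVC lets you request libraries from source code with #pragma comment(lib, ...). The sketch below is one possible way to do that. It assumes that Visual Studio defines _DEBUG in Debug builds and that the Release libraries follow the same naming as the DLLs above (no trailing d). The macros TESS_LIB_NAME and LEPT_LIB_NAME and the function linkedLibraries are hypothetical names used only for illustration. The Additional Library Directories setting is still required.

```cpp
#include <string>
#include <vector>

// Pick the library names for the current configuration.
// Assumption: _DEBUG is defined in Debug builds (the Visual Studio default).
#ifdef _DEBUG
#define TESS_LIB_NAME "libtesseract304d.lib"
#define LEPT_LIB_NAME "liblept171d.lib"
#else
#define TESS_LIB_NAME "libtesseract304.lib"
#define LEPT_LIB_NAME "liblept171.lib"
#endif

// Ask the MSVC linker to pull in the selected libraries.
// Other compilers skip this section entirely.
#ifdef _MSC_VER
#pragma comment(lib, TESS_LIB_NAME)
#pragma comment(lib, LEPT_LIB_NAME)
#endif

// Hypothetical helper: reports which library names were selected,
// which is handy when checking a build configuration.
std::vector<std::string> linkedLibraries() {
    std::vector<std::string> names;
    names.push_back(TESS_LIB_NAME);
    names.push_back(LEPT_LIB_NAME);
    return names;
}
```

Keeping the selection in one header means the Debug and Release configurations cannot drift apart in the project settings.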

Step 9:

Now create a new console application and paste this code:

#include <cstdio>
#include <cstdlib>

#include "baseapi.h"
#include "allheaders.h"

int main()
{
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

    // Initialize tesseract-ocr with English; the data path is the folder that holds tessdata
    if (api->Init("D:\\Tesseract-files\\Tesseract.git\\trunk", "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    // Open the input image and stop if it cannot be read
    Pix *image = pixRead("your_image.tif");
    if (!image) {
        fprintf(stderr, "Could not read the input image.\n");
        api->End();
        delete api;
        return 1;
    }
    api->SetImage(image);

    // Set the list of allowed characters
    api->SetVariable("tessedit_char_whitelist", "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-.;,:/0123456789");

    // Get the OCR result
    char *outText = api->GetUTF8Text();
    printf("OCR output:\n%s", outText);

    // Destroy used objects and release memory
    api->End();
    delete api;
    delete[] outText;
    pixDestroy(&image);

    return 0;
}
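
A long whitelist literal is easy to mistype, and a single missing letter means that character can never be recognized. The sketch below builds the same set of allowed characters from ranges instead. The names appendRange and buildWhitelist are hypothetical helpers, and the code assumes an ASCII-compatible character set in which letters and digits are contiguous.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: appends every character from first to last (inclusive).
void appendRange(std::string& out, char first, char last) {
    assert(first <= last);
    for (int c = first; c <= last; ++c) {
        out += static_cast<char>(c);
    }
}

// Hypothetical helper: letters, the punctuation used in this tutorial, and digits.
std::string buildWhitelist() {
    std::string whitelist;
    appendRange(whitelist, 'a', 'z');
    appendRange(whitelist, 'A', 'Z');
    whitelist += "-.;,:/";
    appendRange(whitelist, '0', '9');
    return whitelist;
}
```

The result could then be passed to Tesseract with api->SetVariable("tessedit_char_whitelist", buildWhitelist().c_str());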

Then build and run the project.

As a result, you will get:

Fig. 8. Input image

Fig. 9. Output result


Congratulations! You have installed Tesseract and run your first text recognition program!
