
gImageReader – Extract Text from Images and PDFs in Linux

Written by Divine Okoi

gImageReader is a free and open-source PDF reader that can extract text from images and PDFs. It is built as a simple Gtk/Qt front-end to Tesseract-OCR, an open-source optical character recognition (OCR) engine that recognizes text in documents and images using machine-learning models.

On its own, Tesseract is a command-line tool, which limits it to Linux users who are comfortable working in a terminal. Thanks to gImageReader, anyone can take advantage of the engine’s OCR capabilities.
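For comparison, this is roughly what using Tesseract directly from a terminal looks like. It is only a minimal sketch: the helper name ocr_to_text and the file scan.png are made up for illustration, and it assumes the tesseract binary and the English language data are already installed.

```shell
# ocr_to_text IMAGE [LANG]
# Hypothetical helper: run Tesseract on IMAGE and write the recognized
# text next to it, e.g. scan.png -> scan.txt. LANG defaults to "eng".
ocr_to_text() {
    image="$1"
    lang="${2:-eng}"

    # Refuse to continue if the input file does not exist.
    if [ ! -f "$image" ]; then
        echo "no such file: $image" >&2
        return 1
    fi

    # Refuse to continue if Tesseract itself is not installed.
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract is not installed" >&2
        return 127
    fi

    # Tesseract takes an input image and an output base name (without
    # extension); with no further options it writes <output base>.txt.
    tesseract "$image" "${image%.*}" -l "$lang"
}
```

Calling ocr_to_text scan.png would leave the recognized text in scan.txt. Several languages can be combined by joining their codes with a plus sign, for example ocr_to_text scan.png eng+deu, as long as the data for each language is installed.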

gImageReader works by recognizing the text in a PDF or image file in any of the many languages that Tesseract supports. It features a simple, well-organized, customizable user interface through which you can also carry out spellcheck and translation tasks.

Features in gImageReader

  • Free and open-source software. Source code available on GitHub.
  • Available on GNU/Linux and Windows platforms.
  • Themeable UI with familiar editing layout.
  • Import PDF documents and images from disk, scanning devices, screenshots, and clipboard.
  • Generate PDF documents from hOCR documents.
  • Manual or automatic recognition area definition.
  • Process multiple images and documents in batches.
  • Recognize to hOCR documents or to plain text.
  • Recognized text displayed next to images.
  • Post-process the recognized text, including spellchecking.

gImageReader is easy to use and works with soft-copy documents as well as captured media such as screenshots. You even have the option to select the area of text that you’re interested in and extract only the text you need. Ultimately, gImageReader functions as both a PDF reader and a text extraction tool. Good stuff.


Install gImageReader on Linux


To use gImageReader to its fullest, you must manually install the Tesseract language packs for the languages you want to recognize; without them, images and files cannot be analyzed properly. On Debian-based distros the English pack is called ‘tesseract-ocr-eng’ (Fedora uses names of the form ‘tesseract-langpack-eng’), and it is available from the software manager.
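To check which language data Tesseract can already see, you can ask the engine itself. The snippet below is a small sketch rather than an official procedure: has_lang is a made-up helper name, and the package name in the message is the Debian-style one mentioned above.

```shell
# has_lang CODE
# Hypothetical helper: succeed if CODE (e.g. "eng") appears as a whole
# line on standard input.
has_lang() {
    grep -qx -- "$1"
}

if command -v tesseract >/dev/null 2>&1; then
    # --list-langs prints one language code per line; some older
    # versions print the list to stderr, hence the 2>&1.
    if tesseract --list-langs 2>&1 | has_lang eng; then
        echo "English language data is installed"
    else
        echo "English language data is missing (Debian package: tesseract-ocr-eng)"
    fi
else
    echo "tesseract is not installed"
fi
```

Swap eng for another language code, such as deu or fra, to check for a different pack.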

If you’re running Ubuntu, you can simply add the PPA and run the install command using the commands below:

$ sudo add-apt-repository ppa:sandromani/gimagereader
$ sudo apt update
$ sudo apt install gimagereader

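On Debian, Fedora, and openSUSE, install it from the package manager. If the plain package name is not found, search your package manager for ‘gimagereader’, since some distros package the Gtk and Qt builds separately.

$ sudo apt install gimagereader     [On Debian]
$ sudo dnf install gimagereader     [On Fedora]
$ sudo zypper install gimagereader  [On openSUSE]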

Don’t feel left out if you’re running Arch Linux or any of its derivatives. The AUR has got you covered. And if you would rather build the app from source, the instructions are on the wiki of its GitHub repository.

Do you often need to extract printed text from images? You can even photograph selected areas with your phone and transfer the pictures to your laptop for recognition. What’s even cooler is the multi-language support, which, although not perfect, is already one of the best options in the community right now.

gImageReader is among the best PDF readers in the open-source world, especially given its OCR capability, so give it a try and see how you like it.


As usual, you are welcome to share your experiences with the app, and to add other suggestions, in the comments section below.


About the author

Divine Okoi

Divine Okoi is a cybersecurity postgrad with a passion for the open-source community. Having written 700+ articles covering different topics in IT, he can always be trusted to inform you about the coolest tech.