Content-based Image Retrieval using Tesseract OCR Engine and Levenshtein Algorithm
Date
2021
Publisher
IJACSA
Abstract
Image Retrieval Systems (IRSs) are applications
that allow one to retrieve images saved at any location on a
network. Most IRSs make use of reverse lookup to find images
stored on the network based on image properties such as size,
filename, title, color, texture, shape, and description. This paper
presents a technique for retrieving a full image document when
the user has only some portions of the document being searched for. To
demonstrate the reliability of the proposed technique, we designed
a system that implements the algorithm. A combination of an Optical
Character Recognition (OCR) engine and an improved text matching
algorithm was used in the system implementation. The
Tesseract OCR engine and the Levenshtein algorithm were integrated
to perform the image search. The text extracted from the query image
is compared to the text stored in the database. For example, a query
result is returned when a ratio of 0.15 or above, which is treated as
significant, is obtained.
The results showed 100% successful retrieval of the appropriate
file based on the match, even when partial query images were
submitted.
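
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define how the ratio is computed, so the sketch assumes a normalized ratio of 1 - distance / max(length), and the names `extract_text`, `similarity_ratio`, `normalize`, and `search` are hypothetical. The OCR step assumes the `pytesseract` and `Pillow` packages are installed and is imported lazily, so the matching logic runs without them.

```python
def extract_text(image_path: str) -> str:
    """OCR step (assumption: pytesseract and Pillow are installed)."""
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(image_path))


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Assumed ratio: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace to reduce OCR layout noise."""
    return " ".join(text.lower().split())


def search(query_text: str, stored: dict, threshold: float = 0.15) -> list:
    """Return (name, ratio) pairs at or above the threshold, best first."""
    query = normalize(query_text)
    scored = [(name, similarity_ratio(query, normalize(text)))
              for name, text in stored.items()]
    hits = [(name, ratio) for name, ratio in scored if ratio >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical database of OCR text keyed by file name.
    database = {
        "invoice.png": "Invoice number 12345 total amount due 250 dollars",
        "letter.png": "Dear sir I am writing to you about the weather",
    }
    # Text as it might be extracted from a partial query image.
    for name, ratio in search("total amount due", database):
        print(f"{name}: {ratio:.3f}")
```

Under this assumed ratio, a query fragment that appears verbatim in a stored document scores roughly the fraction of the document it covers, which is one plausible reason a threshold as low as 0.15 can still pick out the right file for partial queries.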
Description
Research Article
Keywords
Image Retrieval Systems, image processing, Optical Character Recognition (OCR), text matching algorithm, Tesseract OCR engine, Levenshtein Algorithm