A Proposal of Printed Table Digitization Algorithm with Image Processing

by: Chenrui Shi, Nobuo Funabiki, Yuanzhi Huo, Mustika Mentari, Kohei Suga, Takashi Toshida

Format: Article
Published: MDPI AG 2022-12-01

Description

Nowadays, <i>digital transformation (DX)</i> is a key concept for changing and improving operations in governments, companies, and schools, and it requires that data be digitized so that computers can process them. Unfortunately, a lot of data and information is still printed and handled on paper, even when it originally came from digital sources. Data on paper can be digitized using <i>optical character recognition (OCR)</i> software. However, when the paper contains a table, digitization becomes difficult because the characters are separated into rows and columns. This raises the research question: “How can a printed table on paper be converted into an <i>Excel</i> table while keeping the relationships between the cells?” In this paper, we propose a <i>printed table digitization algorithm</i> that combines image processing techniques with OCR software. First, the target paper is scanned into an image file. Second, each table is divided into a collection of <i>cells</i>, and their topology information is obtained. Third, the characters in each cell are digitized by the OCR software. Finally, the digitized data are arranged in an <i>Excel</i> file using the topology information. We implement the algorithm in <i>Python</i>, using <i>OpenCV</i> as the image processing library and <i>Tesseract</i> as the OCR software. For evaluation, we applied the proposal to 19 scanned and 17 screenshotted table images. The results show that, for every image, the <i>Excel</i> file is generated with the correct structure, although some characters are misrecognized by the OCR software. Addressing this misrecognition is left for future work.
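
To make the idea of "topology information" more concrete, the following minimal Python sketch shows one way cell bounding boxes could be mapped to row and column indices with spans, so that merged cells keep their relationships when written to a spreadsheet. It is not the authors' implementation. The function names (`cluster_edges`, `nearest_index`, `cells_to_grid`), the pixel tolerance, and the sample boxes are all assumptions made for illustration. The boxes are assumed to have been extracted already by an image processing step, and no OCR is performed here.

```python
# Illustrative sketch only: hypothetical helpers, not the algorithm from the paper.
# Input: cell bounding boxes as (x, y, width, height) in pixels.
# Output: for each cell, its grid row/column and how many rows/columns it spans.

def cluster_edges(values, tol=5):
    """Merge nearly equal coordinates into one representative grid line each."""
    lines = []
    for v in sorted(values):
        # Start a new grid line only if v is clearly beyond the previous one.
        if not lines or v - lines[-1] > tol:
            lines.append(v)
    return lines


def nearest_index(lines, v):
    """Return the index of the grid line closest to coordinate v."""
    return min(range(len(lines)), key=lambda i: abs(lines[i] - v))


def cells_to_grid(boxes, tol=5):
    """Map each box to a grid position and span using shared grid lines."""
    xs = cluster_edges([x for x, _, _, _ in boxes] +
                       [x + w for x, _, w, _ in boxes], tol)
    ys = cluster_edges([y for _, y, _, _ in boxes] +
                       [y + h for _, y, _, h in boxes], tol)
    placed = []
    for x, y, w, h in boxes:
        row = nearest_index(ys, y)
        col = nearest_index(xs, x)
        placed.append({
            "row": row,
            "col": col,
            "row_span": nearest_index(ys, y + h) - row,
            "col_span": nearest_index(xs, x + w) - col,
        })
    return placed


# Assumed sample: one header cell spanning two columns, above two body cells.
# Coordinates are slightly noisy, as they might be after scanning.
boxes = [(0, 0, 200, 30), (1, 31, 99, 29), (101, 30, 100, 31)]
for cell in cells_to_grid(boxes):
    print(cell)
```

The resulting indices could then be passed to a spreadsheet library, for example openpyxl's `Worksheet.merge_cells`, to recreate merged cells. That step, and filling each cell with OCR output, is omitted here.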