How to segregate data from a single cell in a scanned PDF file

Rakshit

2 Mar 2022

I am assuming your table does not contain any images or logos.

I think you can do this easily with the Tabula API. Here are the steps:

Pass your PDF path to tabula.read_pdf, which returns the tables it finds as pandas DataFrames.
If your PDF contains multiple tables, each table is returned as its own DataFrame.
The tables come back as a list of DataFrames, so you need pandas to work with them.
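
To illustrate working with that list, here is a minimal sketch. The two DataFrames are hand-built stand-ins for what tabula.read_pdf might return, and summarize_tables is a hypothetical helper name, not part of tabula or pandas.

```python
import pandas as pd

def summarize_tables(tables):
    """Return (index, row count, column count) for each DataFrame in the list."""
    return [(i, df.shape[0], df.shape[1]) for i, df in enumerate(tables)]

# Stand-in for the list returned by tabula.read_pdf (made-up data)
tables = [
    pd.DataFrame({"Name": ["A", "B"], "Qty": [1, 2]}),
    pd.DataFrame({"Item": ["X"], "Price": [9.5], "Tax": [0.5]}),
]

summary = summarize_tables(tables)
for i, rows, cols in summary:
    print(f"table {i}: {rows} rows x {cols} columns")

# Pick one table by its position in the list and work on it with pandas
first = tables[0]
print(first.head())
```

Checking the shape of each table first makes it easier to tell which list entry is the one you are after.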

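Follow the code snippet below:

import pandas as pd
import tabula

file = "your_scan_file_name.pdf"
# Replace the placeholder with your directory path (include the trailing separator)
path = "<enter your directory path here>" + file
# read_pdf returns a list of DataFrames, one per table found on the page
tables = tabula.read_pdf(path, pages="1", multiple_tables=True)
print(tables)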

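If you want to extract a particular table, you need the coordinates of that table, passed as area=(top, left, bottom, right) in points:

files = ["your_scan_file_name.pdf"]  # placeholder: the PDF file names to process
for file in files:
    path = "<enter your directory path here>" + file
    # Only the region inside these coordinates is read
    tables = tabula.read_pdf(path, area=(234.019, 38.991, 313.638, 555.396), pages=1)
    print(tables)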
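
Once the table is in a DataFrame, a single cell that holds several values can be split with pandas. This is only a sketch under assumptions: it assumes the values in the cell are joined by a carriage return ("\r"), which you should check against your own output, and split_cell_column and the column names are hypothetical.

```python
import pandas as pd

def split_cell_column(df, column, sep="\r"):
    """Split multi-value cells in `column` on `sep`, giving one row per value."""
    out = df.copy()
    # Turn each cell into a list of pieces
    out[column] = out[column].str.split(sep)
    # One row per piece; the other columns are repeated
    out = out.explode(column).reset_index(drop=True)
    # Remove stray whitespace around each piece
    out[column] = out[column].str.strip()
    return out

# Made-up row where one cell contains three values
df = pd.DataFrame({"Invoice": ["INV-1"], "Items": ["Pen\rPencil \rEraser"]})
result = split_cell_column(df, "Items")
print(result)
```

If your cells use a different separator (a newline or a comma, for example), pass it as sep.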

I hope this approach helps. Let me know if it does not work for you!
