AKY King of Life

2 Mar 2022

How to segregate data from a single cell in a scanned PDF file


I can't share the original scanned PDF file. It has columns, but it is a non-grid PDF. I want to extract the data and separate it into different columns.



2 Mar 2022

I am assuming your table does not contain any images or logos.

I think you can do this easily with the Tabula API. Here are the steps:

  1. Pass your PDF as an argument to the Tabula API; it returns each table it finds as a dataframe.
  2. If your PDF contains multiple tables, each table is returned as its own dataframe.
  3. The tables come back as a list of dataframes, so you need pandas to work with them.

Follow the code snippet below:

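import pandas as pd
import tabula

file = "your_scan_file_name.pdf"
# Replace the placeholder with your directory path, including the trailing separator
path = "<enter your directory path here>" + file
# With multiple_tables=True, read_pdf returns a list of dataframes, one per table
df = tabula.read_pdf(path, pages="1", multiple_tables=True)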
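If a row still comes back as one cell holding several values, pandas can split it into columns afterwards. The following is only a minimal sketch: the `raw` dataframe, its `merged` column, and the names `id`, `name`, and `amount` are made-up stand-ins for whatever Tabula actually returns from your file, and it assumes the values are separated by whitespace.

```python
import pandas as pd

# Hypothetical stand-in for one dataframe returned by tabula.read_pdf,
# where each row arrived as a single cell of whitespace-separated values.
raw = pd.DataFrame({"merged": ["101 Alice 2500", "102 Bob 3100"]})

# Split each cell on whitespace; expand=True spreads the pieces into columns.
split = raw["merged"].str.split(expand=True)
split.columns = ["id", "name", "amount"]  # hypothetical column names

# The pieces are strings after splitting, so convert the numeric ones.
split["id"] = pd.to_numeric(split["id"])
split["amount"] = pd.to_numeric(split["amount"])

print(split)
```

If your values are separated by something other than whitespace, pass that separator as the first argument to `str.split`.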

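If you want to extract a particular table, you need the coordinates of that table:

# Example list of PDF file names to process
files = ["your_scan_file_name.pdf"]
for file in files:
    path = "<enter your directory path here>" + file
    # area is (top, left, bottom, right), measured in points
    df = tabula.read_pdf(path, area=(234.019, 38.991, 313.638, 555.396), pages=1)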
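Note that a loop like this overwrites `df` on every pass, so only the last file's result survives. One way to keep everything is to collect the tables as you go. This is a rough sketch, not the only way to do it: `collect_tables` and `fake_reader` are hypothetical helpers, and the fake reader stands in for Tabula so the example runs without Java or a real PDF.

```python
import pandas as pd

def collect_tables(files, read_tables):
    """Read every file with read_tables and stack all tables into one dataframe."""
    frames = []
    for name in files:
        # read_tables is expected to return a list of dataframes, one per table
        for table in read_tables(name):
            # Tag each row with the file it came from
            frames.append(table.assign(source_file=name))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Hypothetical stand-in reader: returns one small table per file.
def fake_reader(name):
    return [pd.DataFrame({"value": [1, 2]})]

combined = collect_tables(["a.pdf", "b.pdf"], fake_reader)
print(combined)
```

With real files you would pass something like `lambda p: tabula.read_pdf(p, area=(234.019, 38.991, 313.638, 555.396), pages=1)` in place of `fake_reader`.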

I hope this approach helps. Let me know if it does not work for you!
