AKY King of Life
Published in : 2022-03-02

How to segregate data from a single cell in a scanned PDF file

Python

I can't share the original scanned PDF file. It has columns, but it is a non-grid PDF. My problem is that I want to extract the data and separate it into different columns.

Comments

Rakshit Date : 2022-03-02

Best answers

34

I am assuming your table does not contain any images or logos.

I think you can do this fairly easily with the Tabula API (the tabula-py package). Here are the steps:

  1. Pass your PDF as an argument to tabula.read_pdf, which returns the tables it finds as pandas DataFrames.
  2. If your PDF contains multiple tables, each table is returned as a separate DataFrame.
  3. The tables come back as a list of DataFrames, so you need pandas to work with them.

Follow the code snippet below:

    import pandas as pd
    import tabula

    file = "your_scan_file_name.pdf"
    path = "<enter your directory path here>" + file

    # read_pdf returns a list of DataFrames, one per table found on the page
    df = tabula.read_pdf(path, pages="1", multiple_tables=True)
    print(df)

If you want to extract a particular table, you need the coordinates of that table:

    # files: list of PDF file names to process
    for file in files:
        path = "<enter your directory path here>" + file
        # area = (top, left, bottom, right), measured in points
        df = tabula.read_pdf(path, area=(234.019, 38.991, 313.638, 555.396), pages=1)
        print(df)
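If Tabula still returns several values merged into a single cell, which can happen with non-grid tables, one option is to split that cell text yourself. The sketch below is only an illustration under assumptions: the helper name split_cell, the min_gap parameter, and the sample rows are all hypothetical, and it assumes the values inside a cell are separated by at least two whitespace characters. It also assumes the PDF has a text layer; an image-only scan would need OCR before any text can be split.

```python
import re


def split_cell(cell, min_gap=2):
    """Split one merged cell into column values (hypothetical helper)."""
    # A run of `min_gap` or more whitespace characters is treated as a
    # column break, so single spaces inside a value ("John Smith") survive.
    pattern = r"\s{%d,}" % min_gap
    return [part for part in re.split(pattern, cell.strip()) if part]


# Made-up sample rows standing in for cells that hold a whole table row.
rows = [
    "John Smith   42   New York",
    "Ann Lee      37   Boston",
]

table = [split_cell(row) for row in rows]
for record in table:
    print(record)
```

On a DataFrame, the same idea could probably be applied to one column with pandas' Series.str.split using a whitespace pattern and expand=True, but whether the gaps in your file are wide enough for this to work depends on the actual scan.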

I hope this approach helps. Let me know if it doesn't work for you!
