How to Extract Tables from PDF files and save them as CSV using Python

Snapshot of our Final CSV…

Step 1 — Install Camelot

  • To install the Camelot library, run the following command in your terminal.
pip install "camelot-py[cv]"

Step 2 — Importing required libraries

  • For today’s use case, we only need to import the Camelot library.
import camelot

Step 3 — Reading the PDF file.

  • Download the PDF file.
  • Here we simply use the camelot.read_pdf function to read our PDF file and extract tables from it automatically.
  • By default, only the first page is read. If our PDF has more than one page, we can specify the page numbers from which the tables should be extracted.
  • If our PDF file is password protected, we can pass the password as a parameter to the read_pdf function.
tables = camelot.read_pdf('table.pdf')
# tables = camelot.read_pdf('table.pdf', pages='1,2,3,5-7,8')
# tables = camelot.read_pdf('table.pdf', password='*******')
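The pages parameter is a plain string, so when the page numbers come from elsewhere in your program it can help to build that string in code. The sketch below is one way to do it; build_pages_arg is a hypothetical helper written for this article, not part of Camelot, and it assumes a tuple means an inclusive page range.

```python
def build_pages_arg(pages):
    """Turn [1, 2, (5, 7), 8] into the string '1,2,5-7,8'.

    Hypothetical helper: an int is a single page, a (start, end)
    tuple is an inclusive page range.
    """
    parts = []
    for item in pages:
        if isinstance(item, tuple):
            start, end = item
            parts.append(f"{start}-{end}")
        else:
            parts.append(str(item))
    return ",".join(parts)


pages_arg = build_pages_arg([1, 2, 3, (5, 7), 8])
print(pages_arg)  # 1,2,3,5-7,8

# With Camelot installed and 'table.pdf' present, this would be passed as:
# tables = camelot.read_pdf('table.pdf', pages=pages_arg)
```

To read every page, Camelot also accepts pages='all'.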

Step 4 — Let’s extract tables from PDF files

  • Since we already know that our PDF file has just one table, we use tables[0].df, which returns the first table (index 0) in tables as a pandas DataFrame.
  • When you are working with multiple tables, simply iterate over tables with a for-loop.
# Access the ith table as a pandas DataFrame
tables[0].df
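For a PDF with several tables, the for-loop mentioned above could look like the following sketch. summarize_tables is a hypothetical helper name; it only relies on each table exposing a .df DataFrame, as shown in this step.

```python
def summarize_tables(tables):
    """Return (index, rows, columns) for each extracted table."""
    summary = []
    for i, table in enumerate(tables):
        rows, cols = table.df.shape  # table.df is a pandas DataFrame
        summary.append((i, rows, cols))
    return summary


# Usage with Camelot (assumes 'table.pdf' exists in the working directory):
# tables = camelot.read_pdf('table.pdf', pages='all')
# for i, rows, cols in summarize_tables(tables):
#     print(f"Table {i}: {rows} rows x {cols} columns")
```

Checking the shape of every table first is a quick way to spot tables that were split or merged during extraction.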

Step 5 — Save the table in CSV format

# Writes one CSV per table, with names like found_table-page-1-table-1.csv
tables.export('found_table.csv', f='csv')
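If you would rather choose the output file names yourself, one option is to write each table's DataFrame with pandas. This is a minimal sketch, not the only way to do it; save_tables, the out_dir argument, and the table_<i>.csv naming are assumptions made for illustration.

```python
import os


def save_tables(tables, out_dir, prefix="table"):
    """Write each table to out_dir/<prefix>_<i>.csv and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, table in enumerate(tables):
        path = os.path.join(out_dir, f"{prefix}_{i}.csv")
        # Skip the integer index and integer column labels so the
        # CSV contains only the cell values.
        table.df.to_csv(path, index=False, header=False)
        paths.append(path)
    return paths


# Usage with Camelot (assumes 'table.pdf' exists):
# tables = camelot.read_pdf('table.pdf')
# print(save_tables(tables, 'csv_output'))
```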

Step 6 — Visualizing the conversion metrics

tables[0].parsing_report
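The parsing report is a dictionary with the keys accuracy, whitespace, order, and page. As a rough sketch, it can be used to keep only the tables that were extracted cleanly. keep_accurate is a hypothetical helper, and the 90.0 threshold is an arbitrary assumption to tune for your own documents.

```python
def keep_accurate(tables, min_accuracy=90.0):
    """Return the tables whose reported accuracy meets the threshold.

    min_accuracy is a percentage; 90.0 is an arbitrary starting point.
    """
    return [
        table for table in tables
        if table.parsing_report["accuracy"] >= min_accuracy
    ]


# Usage with Camelot (assumes 'table.pdf' exists):
# tables = camelot.read_pdf('table.pdf', pages='all')
# good_tables = keep_accurate(tables, min_accuracy=95.0)
```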
  • Read more about the advanced usage of the Camelot library in its official documentation.

Abhishek Sharma

Data Scientist || Blogger || machinelearningprojects.net || Contact me for freelance projects on asharma70420@gmail.com