From PDF to Data: Extracting Text from Online Documents with Python

Introduction

PDF files are a staple in today’s digital world due to their consistent formatting across devices. They’re commonly employed in everything from business reports and legal contracts to educational materials and research papers, providing a reliable way to present complex, structured information. Because PDFs preserve fonts, images, and layouts, they are ideal for official documents and visually heavy content. However, their static nature also means the text inside them isn’t inherently easy to extract or analyze, making specialized tools essential for unlocking the data within them for broader use in data analysis, research, and automation.

Whether you’re a data analyst, a researcher, or just someone who needs to extract text from PDFs, learning how to automate this process in Python can save you countless hours. In this post, I’ll walk you through a Python script that downloads a PDF from a web link, extracts specific text, and filters it for relevant information. Let’s dive into the details of how this works.

Why Extract from PDFs?

PDFs are great for presenting formatted documents but can be a headache if you need to extract and analyze the data inside them. PDFs are commonly used for official reports, research papers, forms, and brochures—often containing valuable insights hidden within their pages. By using Python to extract text from PDFs, you can efficiently analyze this data and integrate it into broader workflows.

I recently came across the need for this when I was dabbling with some data in a data science competition.

PDF Use Case

The goal in this challenge is to explore the application of unsupervised machine learning methods on emergency department visit narratives about older (age 65+) adult falls. Ultimately, insights gained through such analyses can help inform policies and interventions to reduce older adult falls.

I chose to look at this data to practice some unsupervised techniques and see if I could identify trends in the data.

First, I decided to do a little preliminary research on what the CDC currently provides to help prevent older adult falls. I came across the CDC fall prevention checklist PDF, which gives families a checklist to go through to ensure they are preventing falls at home.

Due to the nature of the document, I decided I could use the groupings in it to cluster the narratives for further analysis. So I wanted to extract the groups and their respective elements.

1. Python Libraries for Extracting from PDFs

These libraries work together to download, read, and process text from a PDF file.

1. requests – This library allows the code to send HTTP requests to download content from the web, in this case, a PDF file.

2. io – The io module provides the BytesIO class, which is used here to handle the PDF data in a file-like binary stream, allowing it to be read by the PDF processing library.

3. pypdf (the maintained successor to PyPDF2) – This library is used for reading and extracting text from PDF files. In this code, PdfReader reads the PDF content from the BytesIO stream and allows extraction of text from specific pages.

4. re – This is the regular expressions library, which is used for pattern matching. In this code, re.findall() finds runs of two or more uppercase letters by matching the pattern [A-Z][A-Z]+ in the extracted text.

First, you will need to install the appropriate libraries (for example, with pip install requests pypdf) and import them:

import io
import re

import requests
from pypdf import PdfReader

Then, let’s get the PDF file.

2. Downloading the PDF File

To begin, you need to download the PDF file from a specific URL. We’re using the requests library, which sends an HTTP GET request to retrieve the file’s content:

# Get PDF
url = 'https://www.cdc.gov/steadi/pdf/STEADI-Brochure-CheckForSafety-508.pdf'
req = requests.get(url)
req.raise_for_status()  # Raise an error early if the download failed
f = io.BytesIO(req.content)

  • The code defines a url that points to a PDF document online.
  • It uses requests.get(url) to download the content, and raise_for_status() stops the script with an error if the request failed.
  • The file content is saved into a byte stream with io.BytesIO, allowing Python to handle the PDF as a file-like object.
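
If you also want a local copy of the PDF (so you aren’t re-downloading it on every run), you can write the bytes to disk first. This is a minimal sketch, and the filename is just an illustrative choice:

# Optional: save a local copy of the PDF (filename is illustrative)
with open('fall_prevention_checklist.pdf', 'wb') as out_file:
    out_file.write(req.content)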

3. Extracting Text from the PDF

Once the PDF is loaded, we can use the pypdf library to read its contents:

reader = PdfReader(f)
contents = reader.pages[1].extract_text().split('\n')

  • PdfReader reads the PDF file f.
  • reader.pages[1].extract_text() pulls the text from the second page (page indexing starts at 0) and stores it in contents.
  • The split('\n') breaks the text into lines, making it easier to filter and analyze each line individually.
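
This example only needs the second page, but if you ever want the text from every page, a small loop over reader.pages does the job. A minimal sketch:

# Sketch: extract text from every page instead of just one
all_lines = []
for page in reader.pages:
    all_lines.extend(page.extract_text().split('\n'))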

4. Cleaning the Extracted Text

PDFs can have extra spaces, line breaks, and unwanted formatting. Here, we remove empty strings and unnecessary spaces:

# Filter out empty strings and stray punctuation lines
remove_ls = ["", " ", "  ", ". "]
contents2 = [i for i in contents if i not in remove_ls]

# Strip leading and trailing spaces
cleaned = [x.strip() for x in contents2]

This creates a clean list, contents2, containing only relevant text lines, then strips any extra spaces from the start and end of each line, storing the result in cleaned.
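
As a tiny illustration (with made-up input lines, not actual output from this PDF), the two steps behave like this:

# Illustrative input, not actual lines from the CDC PDF
contents = ["", " ", "  FLOORS", "Keep objects off the floor. "]

remove_ls = ["", " ", "  ", ". "]
contents2 = [i for i in contents if i not in remove_ls]  # ['  FLOORS', 'Keep objects off the floor. ']
cleaned = [x.strip() for x in contents2]                 # ['FLOORS', 'Keep objects off the floor.']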

5. Finding and Isolating Uppercase Words

Now that we have clean text, let’s look for specific patterns. In this example, we focus on finding uppercase words, which may represent headings, labels, or important terms like the groupings in the document:

words = []
for line in cleaned:
    matches = re.findall(r'[A-Z][A-Z]+', line)
    words.append(matches)

# Remove empty lists
list_of_lists = [i for i in words if i != []]

The regex pattern [A-Z][A-Z]+ matches runs of two or more uppercase letters, helping to isolate potentially meaningful labels, which are appended to the words list. The list comprehension named list_of_lists then filters out the empty lists produced by lines with no matches.
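
As an aside, the loop and the empty-list filter can be collapsed into two comprehensions; this sketch is equivalent to the code above:

# Equivalent, more compact version of the loop and filter
words = [re.findall(r'[A-Z][A-Z]+', line) for line in cleaned]
list_of_lists = [matches for matches in words if matches]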

6. Flattening and Further Filtering

With our list of uppercase words, we can remove any redundant labels or combine related words. Here, we merge the first two labels, “STAIRS” and “STEPS”, into “STAIRS & STEPS” and filter out unnecessary labels:

# Flatten list
flat_list = [item for row in list_of_lists for item in row]

# Merge STAIRS and STEPS into one label
flat_list[0:2] = [' & '.join(flat_list[0:2])]

remove_ls = ['INDOORS', 'OUTDOORS']
labels = [i for i in flat_list if i not in remove_ls]

print(labels)
# ['STAIRS & STEPS', 'FLOORS', 'KITCHEN', 'BEDROOMS', 'BATHROOMS']

The result is a list of labels or keywords that are relevant to our analysis.

  • This “flattens” the list by merging all sub-lists in list_of_lists into a single list, flat_list.
  • The first two items in flat_list are joined with an ampersand (&), so “STAIRS” and “STEPS” become “STAIRS & STEPS”.
  • The list remove_ls is redefined to filter out specific words ('INDOORS' and 'OUTDOORS').
  • labels stores all words except those specified in remove_ls.

Final Thoughts

With Python, extracting and analyzing data from PDFs is easier than ever. You can customize this code to target specific terms, perform sentiment analysis, or even turn this process into a batch pipeline for multiple PDFs.
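
For instance, a batch version might loop over several PDF links and apply the same steps to each. Here is a minimal sketch; the helper name and any URLs beyond the CDC brochure are hypothetical:

import io
import re

import requests
from pypdf import PdfReader

def extract_uppercase_labels(url, page_index=1):
    """Download a PDF and return runs of uppercase letters from one page."""
    req = requests.get(url)
    req.raise_for_status()
    reader = PdfReader(io.BytesIO(req.content))
    text = reader.pages[page_index].extract_text()
    return re.findall(r'[A-Z][A-Z]+', text)

# Hypothetical batch of PDF URLs to process
pdf_urls = [
    'https://www.cdc.gov/steadi/pdf/STEADI-Brochure-CheckForSafety-508.pdf',
]

for url in pdf_urls:
    print(url, extract_uppercase_labels(url))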

The code provided is not web scraping in the typical sense. Web scraping usually involves extracting data from HTML web pages by parsing and navigating the HTML structure (using libraries like BeautifulSoup). Check out this post I wrote using this library. This process allows you to capture structured information from a website’s content, such as tables, text, and images.

In this case, the code is not navigating or extracting data from an HTML structure. Instead, it:

  1. Downloads a PDF file from a specified URL.
  2. Extracts text from the PDF file’s content.
  3. Processes the text by filtering, formatting, and searching for specific patterns.

Because this code directly downloads a document (PDF) rather than extracting structured data from a web page, it’s more accurately described as PDF extraction or processing rather than web scraping. The only similarity is that it downloads content from the web, but it doesn’t involve navigating a web page’s structure or parsing HTML.
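
For contrast, a typical web-scraping snippet parses HTML tags rather than a binary document. A minimal sketch with BeautifulSoup (assuming beautifulsoup4 is installed, and using an illustrative URL):

import requests
from bs4 import BeautifulSoup

# Web scraping: parse the HTML structure of a page (URL is illustrative)
page = requests.get('https://www.cdc.gov/steadi/index.html')
soup = BeautifulSoup(page.text, 'html.parser')

# Pull text out of specific HTML tags, e.g., every second-level heading
headings = [h.get_text(strip=True) for h in soup.find_all('h2')]
print(headings)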
