
Web Scraping Underwater: Extracting Text from Episodes

Many people grew up watching SpongeBob SquarePants. Even if you didn’t, you still know the whimsical cast and their colorful escapades. I have always thought the show had some of the most distinctive dialogue and clever writing, so I suspected there might be a treasure trove of text data I could use to practice different techniques, including web scraping and NLP.

I decided to dive into the underwater world of Bikini Bottom and scrape episode transcripts to analyze how the show’s themes and character dynamics evolve over time.

Why scrape transcripts?

Transcripts serve as a valuable resource for analyzing a show’s narrative structure, character interactions, and emotional arcs. By extracting dialogue, I can perform sentiment analysis, track character word counts, and even study the evolution of language and themes across seasons. This data-driven approach provides insights into how the series has changed and what makes it resonate with audiences.

The Web Scraping Process

To gather the transcripts, let’s build a Python scraper using libraries like Beautiful Soup and Requests. The web scraping process involves navigating the site, identifying the relevant HTML elements containing episode transcripts, and extracting the text while ensuring that we adhere to ethical web scraping practices (a small example of polite request settings follows the list below). Here’s a high-level overview of the steps involved:

  1. Identify the Source: Locate a reliable website that hosts transcripts for every episode.
  2. Set Up the Scraper: Using Beautiful Soup, we will write code to request web pages and parse the HTML content.
  3. Extract Data: We will focus on extracting key details, including episode ID, name, URL, and the full transcript.
  4. Store the Data: Finally, we can save the extracted data into a structured format (e.g., a CSV file) for further analysis.
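
On the ethical side, two habits help: identifying your scraper and pacing your requests. Here is a minimal sketch, assuming a custom User-Agent string and a one-second delay (both are my own choices, not requirements of the wiki):

import time
import requests

# Hypothetical identifying header; the exact string is an assumption, not a site requirement
HEADERS = {"User-Agent": "spongebob-transcript-scraper (personal NLP project)"}

def polite_get(url, delay=1.0):
    """Fetch a page with an identifying User-Agent and pause briefly between requests."""
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()  # fail loudly on HTTP errors
    time.sleep(delay)            # avoid hammering the server
    return response

The functions later in this post use plain requests.get() for simplicity, but they could call polite_get() instead.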

Identify the Source

We need a website that we can scrape for the transcript of every episode. The website https://spongebob.fandom.com/wiki/List_of_transcripts is a viable option since it lists every episode in order, each with a link to the respective transcript.

Navigating around the website, we can see that episodes are grouped by season, and next to each episode there is a link that says “View transcript”. Clicking on this link takes you to a new page with additional information about the episode followed by its full transcript.

Snippet of the transcript list page, showing the “View transcript” link next to each episode

Now, we need the scraper to visit each of these links and extract only the content we care about. Each transcript page also contains some items we do not need, such as the notes that appear before the episode starts.

Snippet of a transcript page, showing the notes that appear before the episode dialogue begins
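
Conceptually, the pattern we care about is any anchor tag whose href ends in /transcript. Here is a minimal sketch of that filter applied to a toy snippet of HTML (simplified for illustration, not the wiki’s actual markup):

import re
from bs4 import BeautifulSoup

# Toy HTML standing in for the transcript list page (not the real wiki markup)
sample_html = """
<ul>
  <li><a href="/wiki/Help_Wanted/transcript">View transcript</a></li>
  <li><a href="/wiki/Help_Wanted">Help Wanted</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keep only anchors whose href contains "/transcript"
links = [a.get("href") for a in soup.find_all("a", href=re.compile(r"/transcript"))]
print(links)  # ['/wiki/Help_Wanted/transcript']

The real get_urls() function below applies the same filter to the full List of Transcripts page.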

Set Up the Scraper

In order to successfully scrape this website and get the data we need, let’s map out exactly how the process should work.

We need a function to get the URLs from the List of Transcripts page so that we can scrape each page for its transcript. To do this, we will use the Requests and Beautiful Soup libraries in Python. For a more detailed description of how to get these started, read the post scraping a website with beautifulsoup in python.

The function get_urls will collect every transcript URL from the List of Transcripts page; we will then scrape each individual URL to get the respective transcript. We will define this function in the code below.

Then, we will create a data frame of the extracted data and save it to a CSV file for later use.

import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
# re is used for the href filter below; time and pandas are used in later steps

def get_urls(url="https://spongebob.fandom.com/wiki/List_of_transcripts"):
    """
    Function to get all URLs from List of Transcripts page to scrape each transcript
    """

    page = requests.get(url)

    soup = BeautifulSoup(page.content, "html.parser")

    # Collect every link whose href points to a transcript page
    links = []
    for link in soup.find_all('a', attrs={'href': re.compile(r"/transcript")}):
        links.append(link.get('href'))

    # Build absolute URLs from the relative hrefs
    link_list = ["https://spongebob.fandom.com/" + x for x in links]

    return link_list

link_list = get_urls()

# The last few entries of link_list:
 'https://spongebob.fandom.com//wiki/Nicktoons_Freeze_Frame_Frenzy/transcript',
 'https://spongebob.fandom.com//wiki/Nicktoons_Unite!/transcript',
 'https://spongebob.fandom.com//wiki/Nicktoons:_Battle_for_Volcano_Island/transcript',
 'https://spongebob.fandom.com//wiki/Nicktoons:_Attack_of_the_Toybots/transcript',
 'https://spongebob.fandom.com//wiki/Nicktoons:_Android_Invasion/transcript',
 'https://spongebob.fandom.com//wiki/SpongeBob_SquarePants_featuring_Nicktoons:_Globs_of_Doom/transcript',
 'https://spongebob.fandom.com//wiki/Nickverse/transcript',
 'https://spongebob.fandom.com//wiki/Nickelodeon_All-Star_Brawl_2/transcript']

Notice that the last few links point to entries we do not care about, because they are not standard episodes of the show.

So we are going to create a function that stops at the last episode we want. Checking manually, that is season 14, episode 294a, “Single-Celled Defense,” which is the latest episode transcript as of this writing.

This snippet of code will be called within get_urls():

def remove_strings_after_specific_element(input_list, specific_string):
    """
    Function to end the scraping of URLs after the last URL of the regular seasons
    (Used in the get_urls function)
    """
    if specific_string in input_list:
        index = input_list.index(specific_string)
        del input_list[index + 1:]  # drop everything after the last regular-season episode
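
Note that the helper modifies the list in place and returns nothing. On a small hypothetical list it behaves like this:

urls = ['/wiki/Help_Wanted/transcript',
        '/wiki/Single-Celled_Defense/transcript',
        '/wiki/Nickverse/transcript']
remove_strings_after_specific_element(urls, '/wiki/Single-Celled_Defense/transcript')
print(urls)  # ['/wiki/Help_Wanted/transcript', '/wiki/Single-Celled_Defense/transcript']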

def get_urls(url="https://spongebob.fandom.com/wiki/List_of_transcripts"):
    """
    Function to get all URLs from List of Transcripts page to scrape each transcript
    """

    page = requests.get(url)

    soup = BeautifulSoup(page.content, "html.parser")

    # Collect every link whose href points to a transcript page
    links = []
    for link in soup.find_all('a', attrs={'href': re.compile(r"/transcript")}):
        links.append(link.get('href'))

    # Cut the list off after the last regular-season episode
    specific_string = '/wiki/Single-Celled_Defense/transcript'
    remove_strings_after_specific_element(links, specific_string)

    # Build absolute URLs from the relative hrefs
    link_list = ["https://spongebob.fandom.com/" + x for x in links]

    return link_list

This should return a list of strings containing the URLs for every desired episode transcript.
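
A quick sanity check is worth doing before scraping anything (the exact count depends on when you run it, so treat this as a sketch):

link_list = get_urls()
print(len(link_list))   # total number of transcript URLs collected
print(link_list[:3])    # the first few entries should be season 1 episodes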

Now we need to get each transcript, so we will define a function that requests each URL in the list:

def get_transcript(row):
    '''
    Function to get only the transcript text from the given URL and return the final output as a string
    '''
    url = row
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    # Flatten the page to plain text and strip newlines and tabs
    soup_sp = " ".join([a.get_text().strip() for a in soup])
    soup_sn = soup_sp.replace('\n', ' ')
    soup_st = soup_sn.replace('\t', '')

    # Keep only the text between the air-date note and the "Categories" footer
    begin_str = 'which aired on '
    ending_str = 'Categories'
    res = soup_st.split(begin_str, 1)
    res2 = ''.join(res[-1])
    end = res2.split(ending_str)[0]

    time.sleep(1)  # pause between requests to be polite to the server
    return end
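
Before applying it to every row, it is worth trying the function on a single URL (this assumes the first entry in link_list is a regular episode page):

sample = get_transcript(link_list[0])
print(sample[:300])  # preview the first few hundred characters of the transcript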

Extract Data by Web Scraping

Now, let’s create a data frame of the extracted data and add some extra columns. Then, to fill the transcript column, apply get_transcript():

# create dataframe of the URLs
links_df = pd.DataFrame(link_list, columns=['URL'])

# Get name of episode + '/transcript' and add it to dataframe
links_df['ep_name'] = [link[34:] for link in link_list]

# Create column for transcript and apply get_transcript
links_df['transcript'] = links_df['URL'].apply(get_transcript)
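
A brief look at the result helps confirm the scrape worked before saving anything; the exact numbers will depend on when the site is scraped:

print(links_df.shape)    # (number of episodes, 3)
print(links_df.head())   # URL, ep_name, transcript

# Rough distribution of transcript lengths, as a sanity check for empty scrapes
print(links_df['transcript'].str.len().describe())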

Store Data

After we get our desired data frame, we can save it to CSV or Excel format for later use:

# Save dataframe to CSV file
links_df.to_csv('transcripts.csv')
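
To confirm the file is usable later, we can read it straight back (index_col=0 accounts for the default pandas index written above):

check = pd.read_csv('transcripts.csv', index_col=0)
print(check.shape)
print(check.columns.tolist())  # ['URL', 'ep_name', 'transcript']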

What’s Next?

With the transcripts now in hand, I plan to conduct various analyses to uncover interesting trends and insights. From sentiment analysis to character dialogue dynamics, there’s so much to explore. Stay tuned for more posts where I’ll share my findings and the fascinating ways SpongeBob SquarePants has evolved over the years!


Please leave a comment if there are any suggestions on the process I use or the code, as well as any ideas for further analysis!
