
Stripping Strings in Python

Introduction

When working with strings in Python, cleaning up unwanted whitespace or characters is often necessary. Whether you’re processing user input, reading data from a file, or cleaning text for analysis, string manipulation is a key part of the process. Python provides a set of powerful string methods, including strip(), lstrip(), and rstrip(), to help strip unwanted characters efficiently. These methods allow you to tidy up your strings by stripping leading, trailing, or both leading and trailing characters, making them helpful for data cleaning tasks.

In this post, we’ll dive into how these strip functions work, when to use each, and practical examples to demonstrate their utility in real-world scenarios.

Strip() Functions Overview

  • strip(): Removes whitespace (or any of the specified characters) from both ends of a string.
  • lstrip(): Removes leading whitespace (or any of the specified characters), that is, from the left side of a string only.
  • rstrip(): Removes trailing whitespace (or any of the specified characters), that is, from the right side of a string only.
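
As a quick side-by-side, here is a minimal sketch of all three methods applied to the same string. The sample value and variable names are made up for illustration.

```python
# Hypothetical sample string with whitespace on both sides.
raw = "   hello world \n"

both = raw.strip()    # removes whitespace from both ends
left = raw.lstrip()   # removes whitespace from the left only
right = raw.rstrip()  # removes whitespace from the right only

# repr() makes the remaining whitespace visible.
print(repr(both))   # 'hello world'
print(repr(left))   # 'hello world \n'
print(repr(right))  # '   hello world'
```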

Practical Examples

Using strip()

Suppose we are building a form that takes user input such as a name and address. Free-text input leaves room for human error, such as accidentally adding extra spaces, which could cause issues later when we try to use the data for analysis. This is where we can use strip() to remove the extra spaces.

name = "  Rodney Dangerfield   "
clean_name = name.strip()
print(clean_name)  # Output: "Rodney Dangerfield"

In this case, strip() effectively removed the leading and trailing whitespace.

Now what if instead of extra spaces, the user accidentally inputs specific characters such as @ or ***? Here, we can specify the characters to remove.

name = "**Rodney Dangerfield**"
clean_code = name.strip("*")
print(clean_code)  # Output: "Rodney Dangerfield"

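Notice how it removed every “*” from both ends of the string. strip() only works on the ends, so a “*” in the middle of the string would have been left untouched.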
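
Here is a small sketch of how the characters argument behaves, using a made-up input: the argument is read as a set of characters, and only the ends of the string are affected.

```python
# Hypothetical input with stray characters at the ends and a "*" inside.
name = "@**Rodney *Dangerfield**@"

# The argument is a set of characters: any mix of "*" and "@" is stripped
# from both ends until a character outside the set is reached.
clean_name = name.strip("*@")

print(clean_name)  # Rodney *Dangerfield
```

The “*” before “Dangerfield” survives because it is not at either end of the string.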

Using lstrip() and rstrip()

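Maybe the user entered their name with a title such as “Mr.” or “Ms.” and we want to remove it. Here we can use lstrip(). Keep in mind that the argument is treated as a set of characters rather than a prefix; it works in this example because the “R” in “Rodney” is not one of them.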

name = "Mr. Rodney Dangerfield"
clean_code = name.lstrip("Mr. ")
print(clean_code)  # Output: "Rodney Dangerfield"

Or if they put a suffix such as “Jr.” or “III”, we can use rstrip().

name = "Rodney Dangerfield III"
clean_code = name.rstrip(" III")
print(clean_code)  # Output: "Rodney Dangerfield"
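
One caveat with the two examples above: lstrip() and rstrip() treat their argument as a set of characters, not as a whole prefix or suffix, so they can remove more than intended. On Python 3.9 and later, str.removeprefix() and str.removesuffix() remove an exact substring instead. A minimal sketch, with a made-up name chosen to show the difference:

```python
# Requires Python 3.9+ for removeprefix()/removesuffix().
name = "Mr. Marcus Rossi III"  # hypothetical name

# lstrip() keeps stripping while characters are in the set {"M", "r", ".", " "},
# so the "M" of "Marcus" is removed too.
stripped = name.lstrip("Mr. ")
print(stripped)  # arcus Rossi III

# removeprefix()/removesuffix() only remove the exact substring, if present.
exact = name.removeprefix("Mr. ").removesuffix(" III")
print(exact)  # Marcus Rossi
```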

When the strip functions fail

Unfortunately, sometimes these functions aren’t enough. The argument is treated as a set of characters rather than a substring, so if it includes letters, the method keeps stripping from the end of the string for as long as it finds any of those letters, potentially removing some that we want to keep. I recently ran into this issue when cleaning some text.

I built a scraper to collect transcripts of SpongeBob SquarePants episodes from the web and put them in a data frame. Right now, the name of each episode is saved to a column called “ep_name_transcript”, which is extracted from the end of the URL used to get the transcript. So, the text for each episode has the format “/episode_name/transcript”, and we want to remove the slashes and the substring “transcript”. I will first try using strip(), lstrip() and rstrip() to clean the column.

transcripts['ep_name_l'] = transcripts['ep_name_transcript'].str.lstrip('/')
transcripts['ep_name_r'] = transcripts['ep_name_transcript'].str.rstrip('/transcript')
transcripts['ep_name_strip'] = transcripts['ep_name_transcript'].str.strip('/transcript')

transcripts.head()
[Image: data frame of transcripts showing the output of the strip functions]

None of these gave us the clean episode name. lstrip() removed only the leading slash. rstrip() took off “/transcript” but also ate into the end of some episode names, as in the last row. strip() removed both slashes and “transcript”, but in the last row it again took off more than we wanted.
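
The over-stripping can be reproduced on a single plain string. The sketch below uses a made-up value in the same “/episode_name/transcript” format (not a row taken from the actual data frame) and prints the set of characters that the argument '/transcript' really stands for.

```python
path = "/Ripped_Pants/transcript"  # made-up example value

# The characters that rstrip()/strip() are allowed to remove.
chars = sorted(set("/transcript"))
print(chars)  # ['/', 'a', 'c', 'i', 'n', 'p', 'r', 's', 't']

# After "/transcript" is gone, "ants" is still made of characters in the
# set, so it is stripped as well. The uppercase "P" finally stops it.
right_only = path.rstrip("/transcript")
both_ends = path.strip("/transcript")

print(right_only)  # /Ripped_P
print(both_ends)   # Ripped_P
```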

In order to do this properly, I will instead apply the split() function:

transcripts['ep_name_split'] = transcripts['ep_name_transcript'].str.split('/')
transcripts.head()

This function splits the string at each occurrence of the delimiter and returns the pieces as a list. Because each string starts with a slash, the first piece is an empty string, the second is the episode name, and the third is “transcript”. To get just the episode name, we select index 1 of the list, wrapping the call in a lambda function passed to apply().
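
To see which piece of the list holds the episode name, here is a minimal sketch on a single plain string in the format described earlier:

```python
path = "/episode_name/transcript"  # the format described earlier

# The leading slash produces an empty string as the first piece.
parts = path.split("/")
print(parts)     # ['', 'episode_name', 'transcript']

# Index 1 is therefore the episode name.
print(parts[1])  # episode_name
```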

transcripts['ep_name_fcn'] = transcripts['ep_name_transcript'].apply(lambda x: x.split('/')[1])

transcripts.head()
[Image: data frame showing the episode name extracted with the split function]

Finally, we have the episode name we want in its own column within the data frame. To read more about this project in particular, please refer to the series of posts on the project with the first being Web Scraping Underwater: Extracting SpongeBob SquarePants Episodes.

Conclusion

Text cleaning and string manipulation are fundamental skills for any Python programmer, especially when working with messy data. The strip(), lstrip(), and rstrip() functions provide simple but powerful ways to clean strings, making your data ready for further analysis and easier to work with. Whether you’re removing unwanted whitespace, trimming special characters, or cleaning up text fields, these functions are essential tools in your Python toolkit.

The real power of these tools comes from their versatility. Once you understand how and when to use them, you’ll find that they simplify and speed up your data cleaning tasks. Keep experimenting with them on different datasets, and you’ll quickly see how they can enhance your workflow.

These functions may seem trivial next to advanced text analysis techniques, but text preprocessing and cleaning is a critical step in the data science workflow that comes before more complex tasks such as sentiment analysis or other natural language processing. Removing unnecessary spaces and unwanted characters helps keep your data consistent, setting a strong foundation for the more advanced work that follows.
