5 min to read
Text Cleaning in Pandas
Read more about the backstory of this project and its dream here
tldr; I participated in one of the projects of Open Taggle - to use A.I. to generate a poem from a person’s feelings. The feelings are captured using IoT to read EGG waves. I come in this project as the person to develop the A.I. algorithm to generate the poems.
The data we use is a scrapped list from the Poetry Foundry’s website available for download on Kaggle.
When loaded into a pandas dataframe looks something like this:
Here is what I cover in this post:
- Removing opening and trailing unwanted characters for all rows in a column
- Removing duplciates
- Split a row value with multiple new lines in it into different rows
- Remove non-ascii characters from text in columns
1. Removing opening and trailing unwanted characters for all rows in a column
As you can see each line in the Poem column starts with \r\r\n. Each of these lines finishes with this character sequence too.
Here is a column-wise way to remove them
df # the pandas dataframe df.Poem # the Poem column in the dataframe df.Poem.str.lstrip('\r\r\n') # removes \r\r\n fromt the beginning of the string for each row in the Poem column df.Poem.str.rstrip('\r\r\n') # removes \r\r\n fromt the end of the string for each row in the Poem column
2. Removing duplciates
Some poems are repeated several times and in order to remove these duplicates, we need to call:
df = df.drop_duplicates()
3. Split a column into multiple columns
The Poem column contains the whole poem with all lines included. In order to automatically generate a new poem which contains many carefully chosen lines from multiple poems - we need to separate the value of each Poem into Lines. The character sequence \r\r\n is also used as a new line delimiter - e.g. as a separator between lines.
The goal here then is to split each Poem in its constituent lines and keep all other associated information to the Poem, associated to its lines as well.
The first link I listed in the section below called Reading is the one that helped me to this. It explains the approach with a simple example.
a. Create a dataframe in the following way:
b. Split all the values in column var1 at , and create a separate row for each string:
s = a.var1.str.split(",").apply(pd.Series, 1).stack()
c. remove the extra index column which is created by stack()
s.index = s.index.droplevel(-1)
Note: Keeping the original index is very important here. As you can see the index value for a, b and c is 0 as they all originate from the initial row with index 0. As we want the values of var2 to be associated in the original way to the values of a, b and c, it is important that all the values a, b and c have index 0 so that pandas join can attach to them the same row values for the other columns as if they were still in one row - a,b,c
d. give a name of the only column in the series s
s.name = 'var 1 separated'
e. add the column s to the dataframe and verify results
I kept the var1 column in the dataframe for clarity.
Putting it altogether, this is the code to separate the poems into lines.
lines = df.Poem.str.split("\r\r\n").apply(pd.Series, 1).stack() lines.index = lines.index.droplevel(-1) lines.name = 'Lines' df = df.join(lines)
This is how the dataset looks like with the lines printed into new rows:
4. Remove non-ascii characters
I discovered that some values in Poem and some values in Title contains non-ascii characters such as \ufeff. To remove them from each value in these two columns, the method remove_non_ascii() is applied to the columns
def remove_non_ascii(text): # removes all non-ascii characters return ''.join(i for i in text if ord(i)<128) df.Poem = df.Poem.apply(remove_non_ascii) df.Title = df.Title.apply(remove_non_ascii)
I find these articles very useful:
- Split strings in a column at comma and separate into multiple columns [the answer which starts with Similar question as: pandas: How do I split text in a column into multiple rows?]