Text Cleaning in Pandas


Read more about the backstory of this project and its vision here

tl;dr: I participated in one of the projects of Open Taggle - using A.I. to generate a poem from a person’s feelings. The feelings are captured using IoT to read EEG waves. I come into this project as the person developing the A.I. algorithm to generate the poems.

Poem Data

The data we use is a list scraped from the Poetry Foundry’s website, available for download on Kaggle.

When loaded into a pandas dataframe, the data looks something like this:

Raw Poetry Foundry Data

Here is what I cover in this post:

  1. Removing leading and trailing unwanted characters from all rows in a column
  2. Removing duplicates
  3. Splitting a row value with multiple new lines into multiple rows
  4. Removing non-ASCII characters from text columns

1. Removing leading and trailing unwanted characters from all rows in a column

As you can see, each value in the Poem column starts with \r\r\n, and each value ends with this character sequence too.

Here is a column-wise way to remove them:

df       # the pandas dataframe
df.Poem  # the Poem column in the dataframe

df.Poem = df.Poem.str.lstrip('\r\r\n')  # removes \r\r\n from the beginning of the string for each row in the Poem column
df.Poem = df.Poem.str.rstrip('\r\r\n')  # removes \r\r\n from the end of the string for each row in the Poem column
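One thing worth knowing: lstrip and rstrip treat their argument as a set of characters rather than a literal sequence, so '\r\r\n' strips any run of leading or trailing '\r' and '\n' characters. That is exactly what we want here, but if you ever need to remove only the exact sequence, a regex replace is one possible sketch:

df.Poem = df.Poem.str.replace(r'^(\r\r\n)+|(\r\r\n)+$', '', regex=True)  # strips only literal \r\r\n runs at the ends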

2. Removing duplicates

Some poems appear several times in the dataset. In order to remove these duplicates, we need to call:

df = df.drop_duplicates()
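By default drop_duplicates compares all columns. If the duplicates should instead be detected by the poem text alone (an assumption about how strict the matching needs to be), the subset parameter can be used:

df = df.drop_duplicates(subset=['Poem'])  # keeps the first occurrence of each poem text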

3. Splitting a row value with multiple new lines into multiple rows

The Poem column contains the whole poem, with all lines included. In order to automatically generate a new poem out of carefully chosen lines from multiple poems, we need to separate the value of each Poem into Lines. The character sequence \r\r\n is also used as a new-line delimiter, i.e. as a separator between lines.

The goal here, then, is to split each Poem into its constituent lines while keeping all the other information associated with the Poem associated with each of its lines as well.

The first link listed in the Reading section below is the one that helped me do this. It explains the approach with a simple example.

a. Create a dataframe in the following way:

import pandas as pd
a = pd.DataFrame({"var1": "a,b,c d,e,f".split(), "var2": [1, 2]})

Example Dataframe a
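The example is small enough to reproduce here; printing a gives something like:

    var1  var2
0  a,b,c     1
1  d,e,f     2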

b. Split all the values in column var1 at , and create a separate row for each string:

s = a.var1.str.split(",").apply(pd.Series, 1).stack()  # one column per split part, then stack the columns into rows

The result s
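It is a Series with a two-level index: the outer level holds the original row index and the inner level holds the position of each split part. Reproduced from the snippet above, it looks something like:

0  0    a
   1    b
   2    c
1  0    d
   1    e
   2    f
dtype: object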

c. Remove the extra index level which is created by stack():

s.index = s.index.droplevel(-1)

The result s without the extra index column
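Now s carries only the original row indices, repeated once per split value:

0    a
0    b
0    c
1    d
1    e
1    f
dtype: object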

Note: Keeping the original index is very important here. As you can see, the index value for a, b and c is 0, because they all originate from the initial row with index 0. Since we want the values of var2 to stay associated with a, b and c in the original way, it is important that a, b and c all keep index 0, so that pandas join can attach the same row values from the other columns to them as if they were still in one cell - a,b,c.

d. Give the series s a name; this becomes the column name when it is joined to the dataframe:

s.name = 'var 1 separated' 

e. Add s as a column to the dataframe and verify the results:

a.join(s)

The split-out column added to a
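Reproducing the join on the toy example gives something like:

    var1  var2 var 1 separated
0  a,b,c     1               a
0  a,b,c     1               b
0  a,b,c     1               c
1  d,e,f     2               d
1  d,e,f     2               e
1  d,e,f     2               f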

I kept the var1 column in the dataframe for clarity.

Putting it all together, this is the code to separate the poems into lines:

lines = df.Poem.str.split("\r\r\n").apply(pd.Series, 1).stack()
lines.index = lines.index.droplevel(-1)
lines.name = 'Lines'
df = df.join(lines)
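As a side note, newer versions of pandas (0.25 and later) can do the split-and-stack in a single pass with explode. A minimal sketch of the same operation, assuming the df used above:

# split each poem into a list of lines, then expand each list element into its own row,
# repeating the values of the other columns for every line
df = df.assign(Lines=df.Poem.str.split("\r\r\n")).explode('Lines')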

This is how the dataset looks with the lines split into new rows:

Raw and Clean Poetry Foundry Data

4. Removing non-ASCII characters

I discovered that some values in Poem and some values in Title contain non-ASCII characters such as \ufeff. To remove them from each value in these two columns, the function remove_non_ascii() is applied to both:

def remove_non_ascii(text): # removes all non-ascii characters
    return ''.join(i for i in text if ord(i)<128)

df.Poem = df.Poem.apply(remove_non_ascii)
df.Title = df.Title.apply(remove_non_ascii)
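The same cleanup can also be done with pandas’ vectorized string methods instead of a row-wise apply; a sketch, assuming both columns hold plain strings:

df.Poem = df.Poem.str.encode('ascii', 'ignore').str.decode('ascii')    # drop everything outside ASCII, then decode back to str
df.Title = df.Title.str.encode('ascii', 'ignore').str.decode('ascii')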

Reading

I find these articles very useful: