Girl Talk

Using Natural Language Processing to Decode Song Lyrics by Women

I developed "Girl Talk" as my final project for the Data Analysis bootcamp at Ironhack Lisbon, during a dystopian week of social isolation in March 2020. I worked in Python and all the code is available on GitHub. Feedback is very welcome!

Some major inspiration for this work came from the following articles:


What about music made by women?

It's hardly a coincidence that none of the articles above is focused on women. While the ones by the Pudding focus on hip-hop as a whole, it is noteworthy that women remain a smaller slice of the rappers' pie. It's not a rap-only problem, though. As the BBC points out, the gender gap in the music industry has been widening this last decade in a multitude of ways - from accolades to charts, from collaborations to festival bookings.

For this project, I wanted to pay tribute to music made by women. What better way to do it than by listening to their work and talent?

Quick note: I tried to make the text as simple and clear as possible while still providing some technical details. Regardless, things might get a bit geeky.

Getting and cleaning the data

For this project, I looked at lyrics by 236 female-only and female-led musical acts. The list encompasses mostly solo artists, some girl groups and a few bands whose lead singers are women.

I used the Genius API and the LyricsGenius Python client to gather the lyrics. To reduce duplicates, I filtered edits, remixes, demos, bootlegs and live versions out of the data retrieved through the API. Since some of the songs were not properly tagged, I also tried to minimize repeated lyrics with some manual labour (using string operations).
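The title-based filtering described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the keyword list, the `is_alternate_version` helper and the sample titles are all assumptions made for the example.

```python
import re

# Hypothetical keyword list; the terms actually used in the project may differ.
VERSION_KEYWORDS = ["edit", "remix", "demo", "bootleg", "live"]

# Whole-word, case-insensitive match, so "Alive" does not count as "live".
VERSION_PATTERN = re.compile(
    r"\b(?:" + "|".join(VERSION_KEYWORDS) + r")\b", re.IGNORECASE
)


def is_alternate_version(title):
    """Return True if a bracketed tag in the title mentions a version keyword."""
    # Collect the text inside (...) or [...]; findall yields one tuple per match.
    tags = re.findall(r"\(([^)]*)\)|\[([^\]]*)\]", title)
    tag_text = " ".join(part for groups in tags for part in groups)
    return bool(VERSION_PATTERN.search(tag_text))


# Made-up titles, for illustration only.
titles = [
    "Example Song",
    "Example Song (Remix)",
    "Example Song [Radio Edit]",
    "Alive",
    "Another Song (Live at Some Venue)",
]

kept = [t for t in titles if not is_alternate_version(t)]
print(kept)
```

Looking only inside brackets keeps titles that merely contain a word like "live" in the clear, but a bracketed subtitle that happens to use one of the keywords would still be dropped, which is one reason a manual pass remains useful.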

My initial list of artists also included Céline Dion, Gloria Estefan and Shakira. After a few trials, I decided it was better to remove them, because the model was identifying French and Spanish lyrics as separate topics. This was good in the sense that it proved the model was recognizing text patterns, but bad for my goal. Since language tags were not available, there was no easy way to drop only the non-English lyrics, so I decided to leave these artists' records out of the sample altogether. To avoid having many songs in a language other than English, I also dropped three Spanish-sung albums: "Como Ama Una Mujer" and "Por Primera Vez" by Jennifer Lopez, and "Mi Reflejo" by Christina Aguilera. Despite my efforts, it is obvious that some songs in other languages sneaked through.

Finally, I got rid of outliers. I dropped the 2% of lyrics with the smallest word count, as these were mostly interludes, and the 1% with the largest word count, as those were not actually songs (book excerpts, press conferences and even Beyoncé's "Lemonade" film script). In the end, I analyzed 29,772 songs with natural language processing.
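As a rough sketch of this trimming step, the snippet below ranks lyrics by word count and cuts both tails. The function name, its parameters and the toy data are assumptions for illustration; word count is approximated here by splitting on whitespace, which may not match how the project counted words.

```python
def trim_by_word_count(lyrics, drop_shortest_pct=2, drop_longest_pct=1):
    """Drop the shortest and longest lyrics by word count (percentages of the total)."""
    # Rank lyrics from fewest to most words.
    ranked = sorted(lyrics, key=lambda text: len(text.split()))
    n = len(ranked)
    # Integer arithmetic avoids floating-point surprises in the cut-offs.
    drop_low = n * drop_shortest_pct // 100
    drop_high = n * drop_longest_pct // 100
    return ranked[drop_low:n - drop_high]


# Toy data: 100 "songs" with 1 to 100 words each.
toy_lyrics = [" ".join(["la"] * k) for k in range(1, 101)]
trimmed = trim_by_word_count(toy_lyrics)
print(len(trimmed))
```

Note that the result comes back ordered by word count rather than in the original order, which is fine for a sketch but would need adjusting if the order of the records matters.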

How many words does a song have?

The further right on the plot, the more words an artist uses on average per song. English spoken word performer Kate Tempest leads the way, and the podium is completed by two rappers: Megan Thee Stallion and Little Simz. Rap and spoken word are anchored on the power of words and lyricism, so it comes as no surprise that most of the top positions in this category are taken by rappers: Missy Elliott, Ciara and Iggy Azalea are all in the top 10.

Around the 2000 words per song mark, we also find multiple girl groups: Spice Girls, Destiny's Child and The Pussycat Dolls. One explanation might be that their songs had to showcase the vocal talents of their various members (even if Beyoncé and Nicole Scherzinger featured more prominently).

On the bottom end there are some indie, rock, jazz and electronic acts. Pianists like Diana Krall, Dinah Washington and Norah Jones; indie rock sensations such as Anna Calvi, Cat Power and Sharon Van Etten; electronic legends like Björk, Goldfrapp and Portishead all fall below 800 words per song. Jazz pioneer Billie Holiday sings the shortest songs - probably due to the fact that her career happened before the record industry boom of the 60s.

Does this mean that pop artists have longer lyrics? Not quite. With their catchy choruses, pop songs rely heavily on repetition, which means that the word count can be skewed in that way.

Top 10 Artists Word Count

(average, descending)

Artist Word Count per Song
Kate Tempest 2393.33
Megan Thee Stallion 2303.36
Little Simz 2225.39
Missy Elliott 2052.59
Spice Girls 2046
Ciara 2044.75
Iggy Azalea 2041.66
Destiny’s Child 2028.39
Jennifer Lopez 2017.89
The Pussycat Dolls 2002.25

Bottom 10 Artists Word Count

(average, ascending)

Artist Word Count per Song
Billie Holiday 571.68
Crystal Castles 610.88
Portishead 630.52
Anna Calvi 667.02
Sarah Vaughan 671.95
Dinah Washington 681.87
Nadine Shah 690.78
Goldfrapp 694.44
Diana Krall 706.35
PJ Harvey 715.87

The widest dictionary

My first approach to assess the diversity in vocabulary was to do a unique word count by artist. This happened after removing a list of stop words. These words convey meaning in the human world, but are not useful in a computer-processing context. Most pronouns, prepositions and conjunctions fall in this category.
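A minimal sketch of this kind of unique word count is shown below. The stop-word list is a tiny hand-made stand-in (the lists shipped with NLP libraries are much longer), and the tokenizer, function name and sample lyrics are likewise assumptions made for the example.

```python
import re

# Tiny illustrative stop-word list; a real one would be far more complete.
STOP_WORDS = {"i", "you", "the", "a", "and", "to", "in", "of", "it", "me", "my"}


def unique_words(lyrics):
    """Return the set of distinct non-stop words across an artist's lyrics."""
    words = set()
    for text in lyrics:
        # Lowercase and keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        words.update(token for token in tokens if token not in STOP_WORDS)
    return words


# Made-up lyrics, for illustration only.
sample_lyrics = ["I love you, I love you", "You and me in the rain"]
vocabulary = unique_words(sample_lyrics)
print(len(vocabulary), sorted(vocabulary))
```

Because the result is a set, a word repeated in a chorus counts only once, no matter how many times it is sung.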

After plotting, it seemed that, at least in some cases, bigger discographies translated into a richer lexicon. For instance, Joan Baez tops the ranking just slightly ahead of Nicki Minaj, but she has almost twice as many songs. Dolly Parton and Barbra Streisand also appear fairly isolated in the following positions, but they are also the artists with the most songs under scrutiny (643 and 553 respectively). In the bottom positions we see Naomi Scott, who is yet to release an album, and a few others who have only published one LP so far, such as Kacy Hill, Kelsey Lu and Sudan Archives.

You can explore these results on the plot below. The further to the right, the bigger the unique word count. The wider the circle, the bigger the song count.

Top 10 Artists Unique Word Count

(absolute values, descending)

Artist Unique Words
Joan Baez 7574
Nicki Minaj 7567
Barbra Streisand 6274
Dolly Parton 5963
Joni Mitchell 5662
Ani DiFranco 5529
Tori Amos 5393
Azealia Banks 5210
Lana Del Rey 5121
Taylor Swift 4922

Bottom 10 Artists Unique Word Count

(absolute values, ascending)

Artist Unique Words
Naomi Scott 568
Kacy Hill 592
Kelsey Lu 601
Sudan Archives 679
Alabama Shakes 688
London Grammar 712
Shura 737
Laura Mvula 781
Anna Wise 796
Kadhja Bonet 800

To provide a more balanced view of the actual range of each act's dictionary, I computed a simple ratio of unique words to number of songs. Surprise, surprise! Not only did the leading positions change, but most of the top 10 and bottom 10 are completely different.
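To make the ratio concrete, here is a small worked example using two sets of figures quoted in this post: the unique word counts from the table above and the song counts mentioned in the text (643 songs for Dolly Parton, 553 for Barbra Streisand). The function and variable names are just illustrative.

```python
# Figures quoted in this post: unique words from the table, song counts from the text.
stats = {
    "Dolly Parton": {"unique_words": 5963, "songs": 643},
    "Barbra Streisand": {"unique_words": 6274, "songs": 553},
}


def unique_words_per_song(unique_words, songs):
    """Ratio of an artist's distinct words to the number of songs analyzed."""
    return unique_words / songs


ratios = {
    artist: round(unique_words_per_song(**figures), 2)
    for artist, figures in stats.items()
}
print(ratios)  # Dolly Parton comes out at 9.27
```

Two of the biggest vocabularies in absolute terms end up with low ratios here, simply because they are spread over several hundred songs.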

The ratio below accounts for the number of unique words per song analyzed. We see Kate Tempest once again in the top spot, with a ratio of 78 unique words per song. With the exception of multi-instrumentalist Joanna Newsom and R&B sisters VanJess, most of the names in the top 10 are rappers - that's 7 out of 10. Notably, Azealia Banks is the only artist who sits comfortably in the top 10 twice, both in absolute unique words and in ratio of unique words per song. Another interesting point is Dolly Parton moving from the top 10 to the bottom 10 when we replace the absolute values with this ratio.

You can explore these results on the plot below. The further to the right, the higher the ratio of unique words per song. The wider the dot (and the closer to a yellow shade), the bigger the unique word count.

Top 10 Artists Unique Words per Song

(ratio, descending)

Artist Word/Song Ratio
Kate Tempest 78.48
Noname 67.19
Ivy Sole 65.59
Joanna Newsom 57.27
Lizzo 51.35
Sampa The Great 48.43
Lauryn Hill 45.30
Little Simz 45.06
VanJess 44.95
Azealia Banks 44.53

Bottom 10 Artists Unique Words per Song

(ratio, ascending)

Artist Word/Song Ratio
Aretha Franklin 8.33
Billie Holiday 8.35
The Supremes 9.19
Diana Ross 9.15
Dolly Parton 9.27
Kylie Minogue 9.96
Sarah Vaughan 10.33
Dionne Warwick 10.60
Janet Jackson 10.93
Etta James 11.11

So, what are women singing about?

The technical explanation

For topic modelling, I used Latent Dirichlet Allocation (LDA), an unsupervised machine learning model that groups documents by the topics they contain. In very simplistic terms, the model uses the Dirichlet distribution to identify patterns in the text, such as terms that tend to appear together. The NLP part was done using the NLTK and Gensim libraries.

After pre-processing the text with tokenization, stemming and lemmatization, I trained my model on a randomized sample of 80% of my data set, with twenty passes over it. To evaluate the model, I used the c_v coherence measure. At its latest stage, the average score of my model was 45-48%. During the tests, the lowest score was 43% and the highest was 54%. The chart below displays a coherence score of 46.5%, which is far from good and may indicate that I need a bigger data set.
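The randomized 80% sample can be sketched with the standard library alone, as below. The function name, the fixed seed and the placeholder documents are assumptions for illustration; the project itself may have drawn its sample differently.

```python
import random


def split_sample(documents, train_fraction=0.8, seed=42):
    """Shuffle the documents and split them into a training and a held-out part."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(documents)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder documents standing in for pre-processed lyrics.
documents = [f"song_{i}" for i in range(100)]
train_docs, held_out_docs = split_sample(documents)
print(len(train_docs), len(held_out_docs))
```

Changing the seed gives a different sample, which is one way to see how much the topics and the coherence score move from one trial to the next.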

The topics

While the main words for each topic (and their weight) vary slightly depending on the sample, the topics were mostly the same throughout the trials: breakups, female empowerment, posse tracks (judging by the slang, I think it's safe to assume those are hip-hop tracks), parties, Christmas and seduction themes seem to always appear. The most distant topic on the distance map always corresponds to words in Spanish - the sneaky songs I didn't manage to remove.


Future improvements (or things I'd like to do)

This was, quite literally, an educational project. From improving the coherence score of my model to having a bigger data set, there's a lot of room for improvement here. For starters, the data should be manually verified. While the Genius community of contributors, editors and moderators does incredible work keeping things tidy, there was a lot of both missing and duplicate information. For instance, I would have liked to do a time series to check whether there were more unique words over time, but the release date was available for less than half of the full data set. Further categorization, such as music genres and nationalities, could also be interesting to see how language has evolved in different music scenes.

Thank you for reading!

Feel free to share it and press the icons below to find me in other places on the web.