Girl Talk
Using Natural Language Processing to Decode Song Lyrics by Women
I developed "Girl Talk" as my final project for the Data Analysis bootcamp at Ironhack Lisbon, during a dystopian week of social isolation in March 2020. I worked in Python and all the code is available on GitHub. Feedback is very welcome!
Some major inspiration for this work came from the following articles:
- The Pudding's feature "The Largest Vocabulary In Hip-Hop" - an exploration of the lyrical richness of a wide variety of rappers
- Another Pudding article: "The Words That Are Most Hip-Hop" - a deep dive into the language of the genre (broken down by rapper)
- An article by Brandon Punturo, who used natural language processing to understand Drake's lyrics
What about music made by women?
It's hardly a coincidence that none of the articles above is focused on women. While the ones by the Pudding focus on hip-hop as a whole, it is noteworthy that women remain a smaller slice of the rappers' pie. It's not a rap-only problem, though. As the BBC points out, the gender gap in the music industry has been widening this last decade in a multitude of ways - from accolades to charts, from collaborations to festival bookings.
For this project I wanted to pay tribute to music made by women. What better way to do it than by listening to their work and talent?
Quick note: I tried to make the text as simple and clear as possible while still providing some technical details. Regardless, things might get a bit geeky.
Getting and cleaning the data
For this project, I looked at lyrics by 236 female-only and female-led musical acts. The list encompasses mostly solo artists, some girl bands and a few bands whose lead singers are women.
I used the Genius API and the LyricsGenius Python client to gather the lyrics. To reduce duplicates, I filtered edits, remixes, demos, bootlegs and live versions out of the data retrieved through the API. Since some of the songs were not properly tagged, I tried to minimize the remaining repeated lyrics with some manual labour (using string operations).
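The filtering step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the marker list, the bracket-based matching rule and the function names are all assumptions, and the sketch works on plain title strings rather than on objects returned by the API.

```python
import re

# Hypothetical marker list: the categories come from the text above,
# the exact strings are an assumption.
EXCLUDED_MARKERS = ["edit", "remix", "demo", "bootleg", "live"]

# Only look for markers inside (...) or [...], so titles such as
# "Live Your Life" or "Alive" are not dropped by mistake.
_MARKER_RE = re.compile(
    r"[\(\[][^\)\]]*\b(?:" + "|".join(EXCLUDED_MARKERS) + r")\b[^\)\]]*[\)\]]",
    re.IGNORECASE,
)


def is_alternate_version(title):
    """True when a bracketed tag marks the title as an edit, remix, demo, bootleg or live cut."""
    return _MARKER_RE.search(title) is not None


def normalize_title(title):
    """Lowercase the title and strip bracketed suffixes and punctuation."""
    base = re.sub(r"[\(\[].*?[\)\]]", "", title)    # drop "(feat. X)", "[Radio Edit]", ...
    base = re.sub(r"[^a-z0-9 ]", "", base.lower())  # keep letters, digits and spaces
    return " ".join(base.split())                   # collapse repeated whitespace


def deduplicate(titles):
    """Keep the first occurrence of each song, skipping alternate versions."""
    seen = set()
    kept = []
    for title in titles:
        if is_alternate_version(title):
            continue
        key = normalize_title(title)
        if key in seen:
            continue
        seen.add(key)
        kept.append(title)
    return kept


titles = ["Halo", "Halo (Remix)", "Halo (Live at Glastonbury)", "halo",
          "Alive", "Live Your Life", "Crazy in Love [Demo]"]
print(deduplicate(titles))  # ['Halo', 'Alive', 'Live Your Life']
```

A rule like this only catches versions that are tagged in the title, which is why untagged duplicates still need a manual pass.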
My initial list of artists also included Céline Dion, Gloria Estefan and Shakira. After a few trials, I decided it was better to remove them, because the model was identifying French and Spanish lyrics as separate topics. This was good in the sense that it proved the model was recognizing text patterns, but bad for my goal. Since language tags were not available, there was no easy way to drop non-English lyrics, so I decided to leave their records out of the sample. To avoid having many songs in a language other than English, I also dropped three Spanish-sung albums: "Como Ama Una Mujer" and "Por Primera Vez" by Jennifer Lopez, and "Mi Reflejo" by Christina Aguilera. Despite my efforts, it is obvious that some songs in other languages sneaked through.
Finally, I got rid of outliers. I dropped the 2% of lyrics with the smallest word count, as these were mostly interludes, and the 1% with the largest word count, as those were not actually songs (book excerpts, press conferences and even Beyoncé's "Lemonade" film script). In the end, I analyzed 29772 songs with natural language processing.
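As a rough sketch of that trimming step, assuming word counts come from simple whitespace splitting (the function name and the rounding of the cut-offs are illustrative choices, not necessarily what the project did):

```python
def trim_by_word_count(lyrics, lower_pct=0.02, upper_pct=0.01):
    """Drop the shortest `lower_pct` and the longest `upper_pct` of lyrics by word count.

    Note: the result comes back sorted by word count, not in the original order.
    """
    ranked = sorted(lyrics, key=lambda text: len(text.split()))
    n = len(ranked)
    drop_low = int(n * lower_pct)    # how many of the shortest texts to drop
    drop_high = int(n * upper_pct)   # how many of the longest texts to drop
    return ranked[drop_low:n - drop_high]


# Toy data: 100 "songs" with 1, 2, ..., 100 words each.
toy = ["la " * k for k in range(1, 101)]
kept = trim_by_word_count(toy)
print(len(kept))  # 97
```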
How many words does a song have?
The further right an artist sits on the plot, the more words they use on average per song. English spoken word performer Kate Tempest leads the way, and two rappers complete the podium: Megan Thee Stallion and Little Simz. Rap and spoken word are anchored on the power of words and lyricism, so it comes as no surprise that most of the top positions in this category are taken by rappers: Missy Elliott, Ciara and Iggy Azalea are all in the top 10.
Around the 2000 words per song mark, we also find multiple girl bands: Spice Girls, Destiny's Child and The Pussycat Dolls. The fact that their songs had to showcase the vocal talents of their various members (even if Beyoncé and Nicole Scherzinger were featured more prominently) might explain this.
On the bottom end there are some indie, rock, jazz and electronic acts. Pianists like Diana Krall, Dinah Washington and Norah Jones; indie rock sensations such as Anna Calvi, Cat Power and Sharon Van Etten; electronic legends like Björk, Goldfrapp and Portishead all fall below 800 words per song. Jazz pioneer Billie Holiday sings the shortest songs - probably because her career happened before the record industry boom of the 60s.
Does this mean that pop artists have longer lyrics? Not quite. With their catchy choruses, pop songs rely heavily on repetition, which inflates the word count without adding new words.
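To make the repetition point concrete, here is a small sketch that computes the average word count per artist next to the share of distinct words in a lyric. The function names and the toy lyrics are made up for illustration, and words are split on whitespace for simplicity.

```python
from collections import defaultdict


def average_word_count(songs):
    """songs: iterable of (artist, lyrics) pairs. Returns {artist: mean words per song}."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for artist, lyrics in songs:
        totals[artist] += len(lyrics.split())
        counts[artist] += 1
    return {artist: totals[artist] / counts[artist] for artist in totals}


def distinct_share(lyrics):
    """Share of distinct words in a lyric; a low value signals heavy repetition."""
    words = lyrics.lower().split()
    return len(set(words)) / len(words) if words else 0.0


toy_songs = [
    ("Artist A", "one two three four"),
    ("Artist A", "one two"),
    ("Artist B", "la la la la"),
]
print(average_word_count(toy_songs))  # {'Artist A': 3.0, 'Artist B': 4.0}
print(distinct_share("la la la la"))  # 0.25
```

In this toy case "Artist B" has the higher average word count even though the song uses a single distinct word.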
Top 10 Artists Word Count
(average, descending)
Artist | Word Count per Song |
---|---|
Kate Tempest | 2393.33 |
Megan Thee Stallion | 2303.36 |
Little Simz | 2225.39 |
Missy Elliott | 2052.59 |
Spice Girls | 2046.00 |
Ciara | 2044.75 |
Iggy Azalea | 2041.66 |
Destiny’s Child | 2028.39 |
Jennifer Lopez | 2017.89 |
The Pussycat Dolls | 2002.25 |
Bottom 10 Artists Word Count
(average, ascending)
Artist | Word Count per Song |
---|---|
Billie Holiday | 571.68 |
Crystal Castles | 610.88 |
Portishead | 630.52 |
Anna Calvi | 667.02 |
Sarah Vaughan | 671.95 |
Dinah Washington | 681.87 |
Nadine Shah | 690.78 |
Goldfrapp | 694.44 |
Diana Krall | 706.35 |
PJ Harvey | 715.87 |
The widest dictionary
My first approach to assess the diversity in vocabulary was to do a unique word count by artist, after removing a list of stop words. These words convey meaning in the human world, but are not useful in a computer-processing context. Most pronouns, prepositions and conjunctions fall in this category.
After plotting, it seemed that, at least in some cases, bigger discographies translated into a richer lexicon. For instance, Joan Baez appears just slightly ahead of Nicki Minaj at the top of the ranking, but she has almost twice as many songs. Dolly Parton and Barbra Streisand also appear fairly isolated in the following positions, but they are also the artists with the most songs under scrutiny (643 and 553 respectively). In the bottom positions we see Naomi Scott, who is yet to release an album, and a few others who have only published one LP so far, such as Kacy Hill, Kelsey Lu and Sudan Archives.
You can explore these results on the plot below. The further to the right, the bigger the unique word count. The wider the circle, the bigger the song count.
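A minimal sketch of that count is shown below. The tiny stop-word set and the regex tokenizer are stand-ins chosen for illustration; a real run would use a fuller stop-word list, such as the one that ships with NLTK.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not the list used in the project).
STOP_WORDS = {"i", "you", "the", "a", "and", "to", "in", "of", "it", "my", "me", "on"}


def unique_words_by_artist(songs):
    """songs: iterable of (artist, lyrics) pairs. Returns {artist: number of unique words}."""
    vocab = defaultdict(set)
    for artist, lyrics in songs:
        # Lowercase, then keep alphabetic tokens (with an optional inner apostrophe).
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", lyrics.lower())
        vocab[artist].update(t for t in tokens if t not in STOP_WORDS)
    return {artist: len(words) for artist, words in vocab.items()}


toy_songs = [
    ("Artist A", "I love you and the moon"),
    ("Artist A", "The moon, the stars"),
    ("Artist B", "Love me, love me"),
]
print(unique_words_by_artist(toy_songs))  # {'Artist A': 3, 'Artist B': 1}
```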
Top 10 Artists Unique Word Count
(absolute values, descending)
Artist | Unique Words |
---|---|
Joan Baez | 7574 |
Nicki Minaj | 7567 |
Barbra Streisand | 6274 |
Dolly Parton | 5963 |
Joni Mitchell | 5662 |
Ani DiFranco | 5529 |
Tori Amos | 5393 |
Azealia Banks | 5210 |
Lana Del Rey | 5121 |
Taylor Swift | 4922 |
Bottom 10 Artists Unique Word Count
(absolute values, ascending)
Artist | Unique Words |
---|---|
Naomi Scott | 568 |
Kacy Hill | 592 |
Kelsey Lu | 601 |
Sudan Archives | 679 |
Alabama Shakes | 688 |
London Grammar | 712 |
Shura | 737 |
Laura Mvula | 781 |
Anna Wise | 796 |
Kadhja Bonet | 800 |
To provide a more balanced view of the actual range of each act's vocabulary, I computed a simple ratio of unique words per number of songs. Surprise, surprise! Not only did the leading positions change, but most of the top 10 and bottom 10 are completely different.
The ratio below accounts for the number of unique words per song analyzed. We see Kate Tempest once again in the top spot with a ratio of 78 unique words per song. With the exception of multi-instrumentalist Joanna Newsom and R&B sisters VanJess, most of the names in the top 10 are rappers - that's 7 out of 10. Notably, Azealia Banks is the only artist who sits comfortably in the top 10 twice, both in absolute unique words and in the ratio of unique words per song. Another interesting point is Dolly Parton moving from the top 10 to the bottom 10 when we replace the absolute values with this ratio.
You can explore these results on the plot below. The further to the right, the higher the ratio of unique words per song. The wider (and closer to a yellow shade) the dot is, the bigger the unique word count.
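The ratio itself is a single division. As a worked example using figures quoted in this article for Dolly Parton (5963 unique words across 643 songs), with a hypothetical helper name:

```python
def unique_words_per_song(unique_words, song_count):
    """Ratio of an artist's unique words to the number of songs analyzed."""
    if song_count <= 0:
        raise ValueError("song_count must be positive")
    return unique_words / song_count


# Dolly Parton: 5963 unique words, 643 songs (both figures from the article).
ratio = unique_words_per_song(5963, 643)
print(round(ratio, 2))  # 9.27
```

The same vocabulary that ranks fourth in absolute terms lands near the bottom once it is spread over such a large discography.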
Top 10 Artists Unique Words per Song
(ratio, descending)
Artist | Word/Song Ratio |
---|---|
Kate Tempest | 78.48 |
Noname | 67.19 |
Ivy Sole | 65.59 |
Joanna Newsom | 57.27 |
Lizzo | 51.35 |
Sampa The Great | 48.43 |
Lauryn Hill | 45.30 |
Little Simz | 45.06 |
VanJess | 44.95 |
Azealia Banks | 44.53 |
Bottom 10 Artists Unique Words per Song
(ratio, ascending)
Artist | Word/Song Ratio |
---|---|
Aretha Franklin | 8.33 |
Billie Holiday | 8.35 |
Diana Ross | 9.15 |
The Supremes | 9.19 |
Dolly Parton | 9.27 |
Kylie Minogue | 9.96 |
Sarah Vaughan | 10.33 |
Dionne Warwick | 10.60 |
Janet Jackson | 10.93 |
Etta James | 11.11 |
So, what are women singing about?
The technical explanation
For topic modelling, I used Latent Dirichlet Allocation (LDA), which is an unsupervised machine learning model for grouping documents by topic. In very simplistic terms, it treats each song as a mix of topics and each topic as a set of weighted words, and it uses the Dirichlet distribution to find those topics from the words that tend to appear together. The NLP part was done using the NLTK and Gensim libraries.
After pre-processing the text with adequate tokenization, stemming and lemmatization, I trained my model on a randomized sample of 80% of my data set, with twenty passes over it. To evaluate the model, I used the c_v coherence measure, which scores how well the top words of each topic fit together. At its latest stage, the average score of my model was 45-48%. During the tests, the lowest score was 43% and the highest was 54%. The chart below displays a coherence score of 46.5%, which is far from good and may indicate that I need a bigger data set.
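The randomized 80% sample can be sketched with the standard library alone. The function name, the fixed seed and the choice to keep the remaining 20% as a held-out set are assumptions made for illustration.

```python
import random


def split_sample(documents, train_share=0.8, seed=42):
    """Shuffle a copy of `documents` and split it into (train, held_out)."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)   # local RNG, so the global random state is untouched
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


train, held_out = split_sample(range(10))
print(len(train), len(held_out))  # 8 2
```

Changing the seed produces a different sample, which is one reason a score can vary from one trial to the next.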
The topics
While the main words for each topic (and their weight) vary slightly depending on the sample, the topics were mostly the same throughout the trials: breakups, female empowerment, posse tracks (judging by the slang, I think it's safe to assume those are hip-hop tracks), parties, Christmas and seduction themes seem to always appear. The most distant topic on the distance map always corresponds to words in Spanish - the sneaky songs I didn't manage to remove.
Future improvements (or things I'd like to do)
This was, quite literally, an educational project. From improving the coherence score of my model to having a bigger data set, there's a lot of room for improvement here. For starters, the data should be manually verified. While the Genius community of contributors, editors and moderators does incredible work keeping things tidy, there was a lot of both missing and duplicate information. For instance, I would have liked to do a time series to check if there were more unique words over time, but release dates were available for less than half of the full data set. Further categorization, like music genres and nationalities, could also be interesting to see how language has evolved in different music scenes.