What’s Eminem Thinking?

Building a TFIDF-NMF model in Python and performing topic modelling on Eminem’s Lyrics

Jai Agrawal
5 min read · Jul 20, 2021
Image by Ilya Pavlov on Unsplash

Natural Language Processing (NLP) concerns itself with how machines comprehend human language. Real-life applications of NLP range from digital voice assistants such as Siri and Alexa to the predictive keyboards we now have on our phones. One fascinating combined use of NLP and machine learning is analysing texts through their diction. This is achieved through concepts such as Term Frequency-Inverse Document Frequency (TF-IDF), which measures how significant a given word is within a corpus of texts, and Non-Negative Matrix Factorization (NMF), which groups those significant words, unsupervised, into topics of said corpus. Together, the TFIDF-NMF model lets topics be modelled from a corpus of texts, for which the user needs only the raw text files. I decided to leverage these concepts and extract topics from rapper Eminem’s lyrics.

Over the course of this article, I’ll illustrate how I gathered the data, prepared it, wrote the topic-modelling code, and analysed that code’s results:

Table of Contents

· Data
Data Used
Gathering Data
· Code
· Analysis
· Evaluation
Future Work

The GitHub repo of this project can be found here.

Data

Data Used

The data used for this project is pulled solely from the Genius website; the next subsection details how it was gathered.

Gathering Data

As outlined in data.py, I made a Genius account to get an API key, which enabled me to use the lyricsgenius library to pull data. I decided to analyse only the lyrics from Eminem’s 11 studio albums, an arbitrary decision but one that will be useful for future work (see the last subsection). Since I couldn’t find an album-list feature in the lyricsgenius library, I made my own list of the albums and then pulled the lyrical data for each one; these were saved to my directory in JSON format. Next, I opened these JSON files and saved the relevant data into a dictionary (this, too, will help future work rather than this project specifically), and used that dictionary to build a list containing only the lyrics. While data fed into a TFIDF vectorizer once had to be normalized by hand, the vectorizer now performs its own preprocessing, including tokenization and lowercasing. Building this list and saving it (into a pickle file, for use in main.py) was therefore the last step of data preparation:
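The data.py gist isn’t reproduced here, so below is a minimal sketch of the process. The album titles are a truncated, illustrative list, and the JSON layout (a `tracks` list of `song` objects) is an assumption about what lyricsgenius saves; the lyricsgenius calls themselves are left as comments, since they need a live API key:

```python
import json
import pickle  # used at the end to save the lyrics list for main.py

# Pulling the data (requires a Genius API key; sketched here for reference):
#   import lyricsgenius
#   genius = lyricsgenius.Genius("YOUR_API_KEY")
#   album = genius.search_album("The Marshall Mathers LP", "Eminem")
#   album.save_lyrics()  # writes a JSON file to the working directory

# Illustrative, truncated album list (the real one has all 11 studio albums)
ALBUMS = ["The Slim Shady LP", "The Marshall Mathers LP", "The Eminem Show"]

def collect_lyrics(album_data):
    """Extract the raw lyrics of every track from one parsed album JSON."""
    return [track["song"]["lyrics"] for track in album_data["tracks"]]

def build_lyrics_list(json_paths):
    """Open each saved album JSON and flatten all lyrics into one list."""
    lyrics = []
    for path in json_paths:
        with open(path, encoding="utf-8") as f:
            lyrics.extend(collect_lyrics(json.load(f)))
    return lyrics

# Finally, the list is pickled for use in main.py:
#   with open("lyrics.pkl", "wb") as f:
#       pickle.dump(lyrics, f)
```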

Code

In this section, I’ll review the code in depth and explain how to go about topic-modelling any corpus of lyrics. Do skip ahead if this is boring (how dare you?).

First, the imports:
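The import cell isn’t shown here; judging by the libraries used below, it was presumably along these lines (a reconstruction, not the original code):

```python
import pickle  # to load the saved lyrics list

from nltk.corpus import stopwords                          # base stop-word list
from sklearn.decomposition import NMF                      # topic model
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF features
```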

Next, stop-words are defined. These are basically words such as ‘and’, ‘the’, etc., which do not add much meaning to a sentence. NLTK’s library already provides such a list (all hail NLTK). After noticing some patterns in the lyrics, I added my own stop-words to this list:

*excuse the profanity, I did not speak these words

The model itself is now created. The min_df and max_df parameters outline how rarely and how often a word may appear in the corpus before it is ignored. For this model, I chose 5 and 0.95, which means a word must appear in at least 5 documents (songs) to be considered, and will be dropped if it appears in more than 95% of the documents. The stop-words are passed along too, together with the ngram range, which means phrases of one to three words will be analysed independently. The model is then fitted on the lyrics data, which I loaded from the pickle file containing the lyrics list:
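A sketch of this step follows. The small corpus at the bottom is a synthetic stand-in, since the real lyrics come from the pickle file written during data preparation, and the short stop-word list stands in for the NLTK-plus-custom list built above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = ["the", "and", "a", "to"]  # stand-in for the NLTK + custom list

vectorizer = TfidfVectorizer(
    min_df=5,            # keep terms appearing in at least 5 songs
    max_df=0.95,         # drop terms appearing in over 95% of songs
    stop_words=stop_words,
    ngram_range=(1, 3),  # analyse single words, bigrams and trigrams
)

# In the original, `lyrics` is loaded from the pickle file; a synthetic
# corpus of 10 "songs" stands in for it here.
lyrics = ["shady slim kill name"] * 7 + [
    "baby crazy dad",
    "love wanna girl",
    "album paul call",
]
tfidf = vectorizer.fit_transform(lyrics)  # documents x terms matrix
```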

I found this method to make the displaying of the results (topics modelled) easier:
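The helper itself isn’t reproduced here; a common pattern, likely close to what was used (it appears in scikit-learn’s own topic-modelling examples), prints the highest-weighted words of each fitted component:

```python
import numpy as np

def display_topics(model, feature_names, no_top_words):
    """Print the highest-weighted words of each topic in a fitted model."""
    for topic_idx, topic in enumerate(model.components_):
        # indices of the no_top_words largest weights, descending
        top = topic.argsort()[: -no_top_words - 1 : -1]
        print(f"Topic #{topic_idx}: " + " ".join(feature_names[i] for i in top))

# Quick demonstration with a toy 2-topic weight matrix:
class Toy:  # stands in for a fitted NMF/LDA model
    components_ = np.array([[0.1, 0.9, 0.5], [0.7, 0.0, 0.2]])

feature_names = ["love", "shady", "slim"]
display_topics(Toy, feature_names, 2)
# Topic #0: shady slim
# Topic #1: love slim
```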

Finally, an NMF model is created and fitted on the TFIDF matrix, and then used to produce the results. I chose to display the top 7 words for each topic:
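A sketch of this final step, with a random non-negative matrix standing in for the real TF-IDF matrix and 5 components matching the 5 topics reported below:

```python
import numpy as np
from sklearn.decomposition import NMF

# Stand-in for the TF-IDF matrix produced earlier (docs x terms, non-negative)
rng = np.random.RandomState(42)
tfidf = rng.rand(20, 12)

# 5 topics; init and max_iter are illustrative choices
nmf = NMF(n_components=5, random_state=1, init="nndsvd", max_iter=500)
doc_topic = nmf.fit_transform(tfidf)  # document-topic weights
topic_term = nmf.components_          # topic-term weights, fed to display_topics
```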

Analysis

During testing, I built models using both TFIDF and Bag-of-Words (BOW) representations. While TFIDF is preferred when analysing corpora, it did not take the separate corpora (albums) as an argument, so I’m not sure how well it performed; the BOW model was not that relevant either. I used these vectorizers to transform the data into separate TFIDF and BOW matrices, then fed each into both a Latent Dirichlet Allocation model (LDA, also used for topic modelling) and an NMF model. The best result came from the TFIDF-NMF model, which I have shown in this article. The following topics were found:
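The comparison can be sketched as follows, with a synthetic corpus and illustrative parameters; the BOW-LDA and TFIDF-NMF pairings shown are the two that mattered most in testing:

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Small synthetic corpus standing in for the lyrics list
docs = ["shady slim kill name"] * 7 + [
    "baby crazy dad love",
    "love wanna baby",
    "album paul call",
]

# BOW counts feed LDA; TF-IDF weights feed NMF
bow = CountVectorizer(min_df=2).fit_transform(docs)
tfidf = TfidfVectorizer(min_df=2).fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=1).fit(bow)
nmf = NMF(n_components=2, random_state=1, init="nndsvd").fit(tfidf)
```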

Topics found in NMF model (tfidf):
Topic #0: let time man back bitch stay
Topic #1: shady slim eminem slimshady kill name back
Topic #2: baby never crazy meant girl meant dad
Topic #3: love wanna want n*gga us love dick
Topic #4: album paul mean thing back call now

*excuse the profanity yet again

While the topics seem obscure, I could find some meaning in them. For example, I would associate Topic #1 with Eminem returning to features, Topic #2 with fatherhood, and Topic #3 with love. A further improvement that needs to be made to the code is to find ways to make TFIDF more useful by incorporating the corpus concept, that is, by inputting the album data separately.

Evaluation

I’m not completely happy with the results. The topics aren’t visibly separate from one another, nor are they distinct enough to be categorised. I’m also not sure why the vectorizer does not give the user an option to prepare their own data; I tried many different permutations but was unable to do so. I do think this impacted the results, and had I been able to prepare the data myself I would have obtained better topics.

Future Work

I would firstly want, as stated in previous subsections, to analyse Eminem’s works not all together but separately, album by album, and then to measure how significant each of these topics is within each album. I would also want to make a t-SNE plot of all the vocabulary the code has analysed. Lastly, I would want to apply this to other artists and writers, perhaps even authors.
