For TidyTuesday 2020 week 40 I created code to randomly pair Beyoncé and Taylor Swift rhyming lyrics.
If you want to make your own BeyonTay verse, you need the following things:
- be willing to code in R
If the above criteria are satisfied, head over to the GitHub repo. Feel free to reach out if you have any issues or questions.
In the rest of this blog post I’ll talk about how I went about getting to this output, with some musings on the way. Each section will end with a BeyonTay verse to break up the text.
The raw data needed some playing with to make the different sets of Beyoncé and Taylor Swift lyrics comparable. I used the tidyverse flavour of R for this (code here), in keeping with TidyTuesday, but I won’t say much more than that as it’s not that interesting.
I was a bit slapdash with my data prep, but I think it’s important to note that this doesn’t matter because this is a silly for-fun project. The consequences of getting this wrong are… nothing? Maybe mild embarrassment at worst. The point being, it is OK to be a bit fast and carefree if the project is inconsequential. But our code can, and often does, have consequences, so it is important to really think about the potential harm down the line once your code is deployed. Tom Scott explains this point far more eloquently.
I had this thought ‘how can I combine the data sets?’, which led to the thought ‘could I make a Beyoncé / Taylor Swift lyric set?’ which led to ‘could I find rhymes to make a BeyonTay verse?’
Not documented are the many other failed ideas I had. One example was seeing ‘who uses words which don’t exist in formal English?’ but it didn’t really work. There’s always a lot of the iceberg you don’t see.
How can you tell if two words rhyme?
My first thought was to try matching the end of words, but the English language isn’t that straightforward. We would miss that “time” and “rhyme” do rhyme, and incorrectly assert that “food” and “good” rhyme. As an aside, I wonder if the reason the Co-Op dropped their slogan “good with food” is because it reads like it should rhyme, but doesn’t.
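To make that failure concrete, here is a minimal base R sketch of the naive spelling-based rule. The helper `ends_match()` and the choice of 3 letters are my own illustration, not code from the repo.

```r
# Naive rule: two words "rhyme" if their last n letters are identical.
# ends_match() is a hypothetical helper written for this illustration.
ends_match <- function(a, b, n = 3) {
  last_n <- function(w) substr(w, pmax(1, nchar(w) - n + 1), nchar(w))
  last_n(tolower(a)) == last_n(tolower(b))
}

time_rhyme <- ends_match("time", "rhyme")  # "ime" vs "yme": no match, yet they rhyme
food_good  <- ends_match("food", "good")   # "ood" vs "ood": match, yet they don't rhyme
```

Spelling gets both pairs wrong, which is why something sound-based is needed.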
Any other ideas?
Aha! The International Phonetic Alphabet (IPA)! It tells you how words sound, which is what we’re after really. It turns out that if you want a good API, one where you send an English word and get the IPA back, you have to spend money. It’s one of the services the big dictionary companies provide now. Ah well. Good for them, I guess.
As an alternative to an IPA API, I found this site phonetizer, which has a user interface for inputting English and outputting IPA.
I did some real code bodging at this step (for a bodging explanation see this Tom Scott video - yes I am aware I have referenced him twice now).
My Bodge Solution:
- Strip words of all non-alphanumeric characters except for apostrophes. Through trial and error with phonetizer on a small subset of words this looked like the best bet.
- Select only the unique words at the end of a lyric line. This gave me the 3,635 words I wanted to find IPA for, which is much more manageable than the 29,874 individual lines (about 8 times as many) or the 211,430 words (almost 60 times as many).
- Print them as output in my R console with ‘::’ as a word separator. This is because phonetizer removed characters such as the comma, but :: translated to (), which was perfect as a word separator.
- Copy the entire text from R, put it in phonetizer, copy the IPA output into Libre Office Writer, save it as a text file with UTF-8 encoding, then read that output back into R. Not the prettiest of processes. I came to this because, as I’ve learnt many times, encoding can be a real pain. Special characters are fiddly, and IPA is almost by definition full of them. My initial bodge was to use the datapasta R package, but it couldn’t handle the IPA characters.
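The R side of those steps can be sketched roughly as below. This is a base R approximation under my own assumptions (the real code uses the tidyverse, and the toy lines and variable names here are made up for illustration), not a copy of what is in the repo.

```r
# Toy lyric lines standing in for the real data set.
lyrics <- c(
  "I threw myself into a volcano!",
  "And you don't know what you don't know",
  "Say my name, say my name"
)

# Keep letters, digits, apostrophes and spaces; drop everything else.
clean <- gsub("[^[:alnum:]' ]", "", tolower(lyrics))

# Take the last word of each line (everything after the final space),
# then keep each word only once.
last_words <- unique(sub(".* ", "", trimws(clean)))

# One long string with "::" between words, ready to paste into phonetizer.
to_phonetize <- paste(last_words, collapse = "::")
cat(to_phonetize, "\n")
```

The copy-and-paste round trip through phonetizer and a UTF-8 text file is manual, so it is not shown here.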
If two words have the same last 3 IPA characters, it’s a rhyme. From trial and error this seemed the most sensible rule. I chose to exclude homophones, words that sound identical, because as far as I am concerned “air” doesn’t rhyme with “heir” (nor, for that matter, does “air” rhyme with “air”).
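As a rough sketch, that rule could look like this in base R. The function names are hypothetical, and the IPA transcriptions are my own assumptions about what phonetizer would return. I have written the IPA with `\u` escapes so the strings survive whatever encoding the script is run under.

```r
# Last n characters of an IPA string (works on vectors too).
ipa_tail <- function(ipa, n = 3) {
  substr(ipa, pmax(1, nchar(ipa) - n + 1), nchar(ipa))
}

# Rhyme rule: same last 3 IPA characters, but not identical-sounding words.
is_rhyme <- function(ipa_a, ipa_b, n = 3) {
  ipa_tail(ipa_a, n) == ipa_tail(ipa_b, n) & ipa_a != ipa_b
}

# Assumed transcriptions (not taken from the real phonetizer output).
know    <- "n\u0259\u028a"                        # nəʊ
volcano <- "v\u0252l\u02c8ke\u026an\u0259\u028a"  # vɒlˈkeɪnəʊ
air     <- "e\u0259"                              # eə
heir    <- "e\u0259"                              # eə

know_volcano <- is_rhyme(know, volcano)  # tails match, words differ
air_heir     <- is_rhyme(air, heir)      # homophones, so excluded
```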
Visualising the answers
I decided that rhyming couplets were the way forward, and that it should always be a match across artists. The idea of a BeyonTay verse came quite naturally as a pair of rhyming couplets, so I used ggplot to visualise such a thing. I also limited the length of each line to at most 40 characters, to stop them running off the right-hand side of the image.
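The pairing logic, minus the ggplot part, might be sketched like this. Everything in the toy table is made up for illustration: the column names, the placeholder artists `artist_a` and `artist_b`, the lines, and the `rhyme_key` values (which stand in for the last 3 IPA characters).

```r
# Toy table of lyric lines with a precomputed rhyme key per line.
lyric_lines <- data.frame(
  artist    = c("artist_a", "artist_a", "artist_b", "artist_b", "artist_b"),
  line      = c("made-up line ending in know",
                "made-up line ending in day",
                "made-up line ending in volcano",
                "a deliberately long made-up line that ends in away",
                "made-up line ending in know"),
  word      = c("know", "day", "volcano", "away", "know"),
  rhyme_key = c("K1", "K2", "K1", "K2", "K1"),
  stringsAsFactors = FALSE
)

# Drop lines that would run off the right-hand side of the image.
short <- lyric_lines[nchar(lyric_lines$line) <= 40, ]

# Pair every line from one artist with every rhyming line from the other.
a <- short[short$artist == "artist_a", ]
b <- short[short$artist == "artist_b", ]
couplets <- merge(a, b, by = "rhyme_key", suffixes = c("_a", "_b"))

# A word paired with itself is not a rhyme.
couplets <- couplets[couplets$word_a != couplets$word_b, ]

# Pick one couplet at random; a verse would be two of these.
chosen <- couplets[sample.int(nrow(couplets), 1), ]
```

Only the know/volcano pair survives here: the long line is filtered out, and know/know is dropped as a self-match.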
Issues with the results
There are a few problems with my approach.
Firstly, the last 3 IPA characters aren’t always sufficient. For example, “don’t” and “amount” are given as valid rhymes via this method because their IPAs are “dəʊnt” and “əˈmaʊnt”. Damn. I can’t see a sensible way round this problem (as you probably have worked out by now, I ain’t no linguist). Another example is in the above image; “running” and “drowning”.
Secondly, a vast number of words are totally ignored. For example, the two most common line-ending words are “you” and “me”, and they account for 7.6% of all the line enders (2,260 out of 29,571). Neither of them has a rhyme pair in the data set. Why? Because they are short words with under 4 IPA characters, so the entire word is taken as the match for the rhyme. So “me” and “be” don’t rhyme under my definition of matching the last 3 IPA characters. Dangit.
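Both problems are easy to reproduce with a few lines of base R. The helper `ipa_tail()` is a hypothetical stand-in for the last-3-characters rule, and the transcriptions for “me” and “be” are my assumption of what phonetizer gives.

```r
# Last n characters of an IPA string.
ipa_tail <- function(ipa, n = 3) {
  substr(ipa, pmax(1, nchar(ipa) - n + 1), nchar(ipa))
}

dont   <- "d\u0259\u028ant"         # dəʊnt
amount <- "\u0259\u02c8ma\u028ant"  # əˈmaʊnt
me     <- "mi\u02d0"                # miː (assumed transcription)
be     <- "bi\u02d0"                # biː (assumed transcription)

# Problem 1: the tails are both "ʊnt", so the rule calls this a rhyme.
dont_amount <- ipa_tail(dont) == ipa_tail(amount)

# Problem 2: each word is only 3 characters long, so the whole word is the
# tail, and the differing first consonant blocks the match.
me_be <- ipa_tail(me) == ipa_tail(be)
```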
This is stupid, but fun.
My favourite rhyming couplet to come out of it was this:
And you don’t know what you don’t know
I threw myself into a volcano
Pure poetry from BeyonTay.