A couple of months ago I found a website with extremely rich data, an event which usually makes me very happy. This website didn't have that effect on me. I was trying to figure out the weight of a specific baseball player, and stumbled upon a database of detailed celebrity body measurements (all women, of course), found here. Later I found that the data included political candidates, and it raised a question in my mind about the different ways we talk about men and women in politics.
Simultaneously, I was looking for a way to measure the presence of certain ideas across the internet. I can already measure sentiments and topics on Twitter, but Twitter is only a portion of the internet, and most people go through Google Search when seeking out new information. Could I write code that would start my text mining operations from Google Search?
(NON-Nerds Skip this)
I had a social idea (how we talk about candidates based on gender) and a coding/statistical concept to test: mining Google search results. I went forward with a formalized test plan:
- I would use the Google Search API to pull results for "Candidate's Name" + Body Measurements.
- I would capture the data and turn it into mine-able text.
- I would compare the top words for each candidate, and look at the results more generally. (Note: rate limits on the Google API, as well as some Google restrictions, slow me down; in the future I may apply more sophisticated text mining techniques.)
I wrote some code to pull the Google Search results. The Google API only allows us to pull four results at a time, so I wrote a loop to pull them four at a time. Here's what that looks like (building step by step for ease of understanding):
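A minimal Python sketch of that kind of paging loop, under stated assumptions. This is not the original code. The `fetch_page(query, start)` callable is hypothetical: it stands in for whatever API call returns one page of result snippets beginning at offset `start`. `fake_fetch_page` is a placeholder that returns labelled dummy strings so the loop can run without network access. The page size of 4 and the cap of 44 results per candidate come from the numbers in this post.

```python
from typing import Callable, List

PAGE_SIZE = 4      # results per request, per the limit described above
MAX_RESULTS = 44   # results kept per candidate in this analysis


def collect_results(query: str,
                    fetch_page: Callable[[str, int], List[str]],
                    page_size: int = PAGE_SIZE,
                    max_results: int = MAX_RESULTS) -> List[str]:
    """Gather up to max_results snippets by requesting one page at a time."""
    results: List[str] = []
    # Offsets 0, 4, 8, ... 40 give eleven pages of four results each.
    for start in range(0, max_results, page_size):
        page = fetch_page(query, start)  # hypothetical API wrapper
        if not page:                     # stop early if results run out
            break
        results.extend(page)
    return results[:max_results]


def fake_fetch_page(query: str, start: int) -> List[str]:
    """Placeholder for the real API call: returns dummy snippets."""
    return [f"{query} result {start + i}" for i in range(PAGE_SIZE)]


snippets = collect_results('"Candidate Name" body measurements', fake_fetch_page)
print(len(snippets))
```

Passing the fetcher in as an argument keeps the loop independent of any particular search API, so a real client (with its own rate limiting) can be swapped in later.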
So what are the results of googling Candidate Names + Body Measurements? I googled four candidates, two men, two women. My observations:
- Men: The men's results were generally about the campaign, with each returning a few references to BMI (Body Mass Index).
- Women: The women's results were heavily focused on the size of their bodies. In fact, the top four words for each woman were the same: size, weight, height, and bra.
This table shows the top 10 words returned for each candidate. This is obviously a small sample (four candidates, only the top 44 Google results for each) but is interesting nonetheless.
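One way a top-words list like this could be computed, sketched in Python. The stopword list, the tokenizing regex, and the `exclude` parameter (for dropping the search terms themselves) are all assumptions rather than the method actually used here, and the sample snippets are invented placeholders, not real search results.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Small illustrative stopword list (an assumption; a real run would use a fuller one).
STOPWORDS = {"the", "and", "of", "a", "to", "in", "is", "her", "his", "for", "on", "with"}


def top_words(snippets: Iterable[str],
              n: int = 10,
              exclude: Iterable[str] = ()) -> List[Tuple[str, int]]:
    """Return the n most frequent words across all snippets."""
    skip = STOPWORDS | {w.lower() for w in exclude}
    counts: Counter = Counter()
    for text in snippets:
        text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
        words = re.findall(r"[a-z']+", text.lower())  # crude tokenizer
        counts.update(w for w in words if w not in skip and len(w) > 2)
    return counts.most_common(n)


# Invented placeholder snippets, only to show the call.
sample = [
    "<b>Height</b> and weight of the candidate",
    "Candidate height, weight and shoe size",
    "Campaign height",
]
print(top_words(sample, n=2, exclude=["candidate"]))
# → [('height', 3), ('weight', 2)]
```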
Some final takeaways from this analysis:
- It's definitely possible to text mine Google results in order to measure how prevalent an idea is on the internet. I probably need to refine my methodology in the future, and obviously implement more sophisticated techniques, but the basic scraping method is complete.
- There exists relatively little information on the internet regarding the body measurements of male candidates. And I really wanted to know Ben Carson's waist-to-hip ratio!
- Female candidates are talked about online a lot more in terms of their bodies. I'm not an expert in feminist discourse analysis, or even really qualified to give an opinion here, but I have certainly measured a difference in the way candidates are talked about online.