Friday, July 1, 2016

Search Origin and "Hidden Identities"

I've been very busy lately, so sorry to my normal readers for the lack of posts.  Here's a post I have had in the works for a while, a bit of a "meta" post on what types of people find this blog from different sources.


I've been running this blog for nearly a year and a half now, and in the last few months it has seen quite an improvement in organic search traffic (i.e., people finding this blog via search engines). Somewhat ironically, I don't do a lot of deep-dive data science type work on the blog stats because, honestly, the numbers are small and I have access to much more interesting web analytics data.

Over the past few weeks, however, I noticed something interesting in the blog statistics: the blog gets quite a bit of traffic from a relatively obscure (or at least low-use) search engine: DuckDuckGo.  I first read about DuckDuckGo in depth in Bruce Schneier's book Data and Goliath, a book that takes fairly extreme views on cyber security.  I was surprised to see the volume of hits to datascience from such a little-used search engine, and I wondered what was going on.

What is DuckDuckGo?  Effectively, DuckDuckGo is a search engine that doesn't track users, and offers a theoretically more secure view of the internet.  It also gives "user agnostic" search results, which is a topic for a more in-depth post on another day.

(Side note: Data and Goliath is worth a read. I deal with Big Data in my everyday working life and don't share Schneier's paranoia, but I appreciate his point of view and perspective.)


The first thing I noted when analyzing the Google Analytics data was that Google search traffic and DuckDuckGo search traffic tended to land on different resources on the blog.  So I dug into the Google Analytics data on traffic sources and landing pages. Here's a summary of top landing pages by search engine:

On the Google side, we see that most people are directed to my homepage (which is a good thing, btw), followed by two posts on specific issues related to the R statistical engine, a Bernie Sanders post, a Voter Fraud post and a few more general data science posts.  Generally speaking, traditional search users find this blog for its intent: data science, with a couple of "pop-science" posts mixed in.

The DuckDuckGo results, on the other hand, go exclusively to Bernie Sanders and Election Fraud related posts.  The Election Fraud and Bernie Sanders posts do get quite a bit of traffic overall on this blog, but comparatively little of it comes from Google and other traditional search channels.
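
For anyone who wants to run the same comparison on their own site, here's a minimal sketch of the landing-page breakdown in Python. It assumes the analytics data has been exported as rows of (search engine, landing page, sessions). The function name `top_landing_pages`, the page paths and the session counts are all made-up placeholders for illustration, not this blog's actual numbers.

```python
from collections import defaultdict

# Hypothetical rows: (search engine, landing page, sessions).
# Placeholder values for illustration only, not real blog data.
rows = [
    ("google", "/", 40),
    ("google", "/r-post", 25),
    ("google", "/election-post", 5),
    ("duckduckgo", "/election-post", 30),
    ("duckduckgo", "/candidate-post", 20),
]


def top_landing_pages(rows, n=3):
    """Return {source: [(page, share of that source's sessions), ...]}.

    Pages are ranked by sessions within each source, top n kept.
    """
    totals = defaultdict(int)
    by_page = defaultdict(lambda: defaultdict(int))
    for source, page, sessions in rows:
        totals[source] += sessions
        by_page[source][page] += sessions

    result = {}
    for source, pages in by_page.items():
        ranked = sorted(pages.items(), key=lambda kv: kv[1], reverse=True)[:n]
        result[source] = [(page, count / totals[source]) for page, count in ranked]
    return result


summary = top_landing_pages(rows)
for source, pages in summary.items():
    for page, share in pages:
        print(f"{source:12s} {page:20s} {share:.0%}")
```

Using shares within each search engine, rather than raw counts, keeps a low-volume source like DuckDuckGo comparable to a high-volume one like Google.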


There are two main reasons this difference in search engine redirects could occur:

  1. The type of people who use DuckDuckGo could be more interested in Bernie Sanders and Election Fraud. There's actually quite a bit of face validity to this view: both election fraud truthers and Bernie Sanders voters believe that the system is rigged, and are somewhat paranoid about systems (voting; economic) that actively monitor, control and put down "defectors."
  2. The search engine DuckDuckGo is better optimized towards my Bernie Sanders and Election Fraud posts than Google.  This one is difficult to falsify because I don't have a list of the search terms used to find this blog in DuckDuckGo.  A quick test of both websites (from a clean browser) gives similar results, and it is known that DuckDuckGo relies on other search engines for results, so it seems somewhat unlikely that optimization is creating a large variation in search results.  However, if DuckDuckGo is self-optimizing, and these posts draw more clicks within a more-paranoid user group, it's possible that optimization is still in play.

It's possible that a combination of both factors is at play, but on its face the first explanation is more likely: the users of DuckDuckGo are more interested in protecting their search and browsing patterns from government intrusion.  That's interesting, but maybe not all that surprising.

What's more interesting to me is that these users are interested in protecting their identities while *searching* the internet, but not while browsing the internet. What does that mean?  Essentially this: while election fraud/Bernie Sanders users protect their identities by using DuckDuckGo to find this blog, once they reach the website I can generally ascertain a lot about them by looking at logs and IP-related data.
