Thursday, May 5, 2016

Social Media Data: Identifying Tweet Schedulers

I've been working with social media data recently, which always makes me a bit more attuned to my interactions on Twitter and Facebook. I saw an interesting post from an acquaintance: the person was describing their current emotional state, though the post was obviously scheduled in advance.

The ability to schedule posts is quite useful for organizations trying to make multiple contacts with users, or to reach users during off-hours.  To those of us who work with social media data, it's fairly easy to identify scheduled posts on their face.  If an algorithm could be developed to identify likely users of scheduled posts, we could remove them from algorithmic analysis, which is a good thing for a few reasons:
  • Scheduled posts are more likely to be spam.
  • Scheduled posts are more likely to be out-of-time (not posted in direct relation to a current event).
  • Scheduled posts are more likely to be emotional fraud (people posting about how great they are, in effect scheduling their own emotions, which is fairly difficult to do).
In my career there's a fairly big payoff for identifying tweet schedulers and spammers, but other people seem interested in this too (wanting to know when their friends are just generating content rather than being sincere on social media).  I'll look at both a simple way (for the everyday user) and a data-driven, algorithmic way to identify scheduled posts.


In order to predict or identify a process or event, we need an understanding of said process, and how it differs from "ordinary" events.  Here's what scheduled social media processes look like:
  1. You use a scheduling program (TweetDeck, HootSuite, Buffer).
  2. You pick a time that you want to post.
  3. You create content you want to post.
Using this knowledge of the process, we can design a method to determine whether someone is using a tweet scheduler. I'll use my prior Twitter data mining of the #ksleg hashtag.  Here's my R code to pull the data:

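 library(twitteR)
 #search settings
 searchterm <- "#ksleg"
 numtweets <- 10000000
 since_date <- "2016-02-10"
 min_freq_terms <- 100
 topic_num <- 4
 #auth in (key and secret values are placeholders; the originals are not shown)
 setup_twitter_oauth(consumer_key = "YOUR_CONSUMER_KEY",
                     consumer_secret = "YOUR_CONSUMER_SECRET",
                     access_token = NULL,
                     access_secret = NULL)
 #pull from twitter (arguments after searchterm are assumed; the original call is cut off)
 tweets <- searchTwitter(searchterm,
                         n = numtweets,
                         since = since_date)
 #pull out retweets
 tweets <- strip_retweets(tweets,
                          strip_manual = TRUE,
                          strip_mt = TRUE)
 #list of status objects to data frame
 tweets <- twListToDF(tweets)
 #create a clean object
 mydata <- tweets
 #create a field of direct URL (URL pattern is assumed; the original string is missing)
 mydata$tl <- paste("https://twitter.com/", mydata$screenName,
                    "/status/", mydata$id, sep = "")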

This creates a nice data frame, which I exported to Excel for demonstrative purposes.  We'll start our analysis from that.


Non-Technical Explanation:  The easiest-to-identify element of a scheduled tweet is its origin, because scheduled social media posts usually come from some external program.  If you just want to know whether your friends are using a scheduler, you can often identify it from the post itself. For instance, in Facebook an icon will appear next to the post like this:

Technical Explanation: Using our downloaded data, identifying tweet scheduling programs is just as easy with the *statusSource* field from our data frame above.  This field shows the originating program for each tweet, whether it came through the web interface, Buffer, or Hootsuite. Here's a screenshot that shows that column, with yellow indicating common scheduler programs.

Identifying tweet scheduling programs gives us some good candidates for tweet schedulers, but you can also tweet directly (without scheduling) from Buffer or Hootsuite, so we need to look at other factors.
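
As a rough sketch of the *statusSource* check, assuming the field arrives wrapped in an HTML anchor tag and that the short list of client names below (my own illustrative guess, not a complete inventory) covers the schedulers of interest:

```r
# Hypothetical list of scheduler client names; extend as needed
scheduler_clients <- c("Buffer", "Hootsuite", "TweetDeck")

# Strip the HTML anchor tag to leave only the client name
clean_source <- function(status_source) {
  trimws(gsub("<[^>]+>", "", status_source))
}

# TRUE when the cleaned client name matches one of the listed schedulers
is_scheduler_source <- function(status_source) {
  pattern <- paste(scheduler_clients, collapse = "|")
  grepl(pattern, clean_source(status_source), ignore.case = TRUE)
}

# Toy rows standing in for mydata$statusSource
example_sources <- c(
  '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
  '<a href="http://bufferapp.com" rel="nofollow">Buffer</a>',
  '<a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a>'
)
is_scheduler_source(example_sources)
```

Stripping the tag first means the match runs on the client name only, not on whatever happens to appear in the link URL.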


Identifying the time of a social media post is fairly easy, both online and through our tweet metadata data set.  Here's what the timestamp looks like in each:

But how do those help us identify tweet schedulers?

Non-Technical Explanation:  Look for your friends that tend to post everything at the same "minute" mark.  Most notable are those that always post on the hour, half hour, or ten-minute increment.  But also the people who schedule on the 5's or other seemingly pre-set timing (I have a friend that tends to schedule posts on the 3's).

Technical Explanation: Digital analysis.  This relies on a simple phenomenon: distributions that people choose don't match random, natural, or stochastic distributions.  To understand this, you can read about Benford's law and its application to forensic accounting.  For an algorithmic implementation of this analysis, you can test each user's "minute" distribution against the overall distribution using a chi-squared goodness-of-fit test.  Users whose minutes don't fit a flat or global distribution are more likely to be tweet schedulers.
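
Here is a minimal sketch of that test in base R. It assumes a flat expected distribution over the last digit of the posting minute (one of the two baselines mentioned above), and the two toy users are fabricated purely for illustration; the function names are my own.

```r
# Last digit (0-9) of the posting minute for each timestamp
minute_last_digit <- function(created) {
  as.POSIXlt(created, tz = "UTC")$min %% 10
}

# Chi-squared goodness-of-fit p-value against a flat distribution
flat_fit_pvalue <- function(digits) {
  observed <- table(factor(digits, levels = 0:9))
  chisq.test(observed, p = rep(1 / 10, 10))$p.value
}

# Toy data: "organic" posts on every minute of one hour,
# "scheduler" posts only on the hour or half hour
base_time <- as.POSIXct("2016-03-01 00:00:00", tz = "UTC")
toy <- data.frame(
  screenName = rep(c("organic", "scheduler"), each = 60),
  created = c(base_time + 60 * (0:59),
              base_time + 3600 * (0:59) + 60 * rep(c(0, 30), 30)),
  stringsAsFactors = FALSE
)

# One p-value per user; small values suggest non-flat minute choices
p_values <- sapply(
  split(toy$created, toy$screenName),
  function(x) flat_fit_pvalue(minute_last_digit(x))
)
p_values
```

A very small p-value only says the user's minutes are not flat; it is a reason to look closer, not proof of scheduling. With few tweets per user the chi-squared approximation is weak, so a minimum tweet count per user is a sensible guard.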


The content of the posts often varies significantly for post schedulers too.  In simple terms, the decision to schedule your posts is a very deliberate content-creation choice. A few simple rules for the content created by this process:
  • The posts are less likely to be a reply to another user's post.
  • The posts are MORE likely to contain a link or other pre-planned content.
  • The posts are tonally more likely to be a call to action in some way.
Non-Technical Explanation: Your friend's posts seem like a pre-planned call to action, aren't responses, and tend to contain links or marketing material.

Technical Explanation:  For items 1 and 2 (replies and links) you can fairly easily create factors in your data set that identify these types of posts (and either remove them from consideration, or model association probabilities).  For the call to action we can use sentiment and topic modeling to attempt to identify posts that resemble "asks" or "calls to action."
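
A small sketch of the reply and link factors, assuming the `replyToSN` and `text` columns that `twListToDF()` produces; the toy rows and the simple `https?://` link pattern are my own illustrative choices:

```r
# Toy data frame using column names that twListToDF() produces
tweets <- data.frame(
  screenName = c("a", "a", "b"),
  text = c("Call your legislator today! https://example.com/act",
           "@someone thanks for the update",
           "Long day at the statehouse"),
  replyToSN = c(NA, "someone", NA),
  stringsAsFactors = FALSE
)

# Factor 1: the tweet is a reply when replyToSN is populated
tweets$is_reply <- !is.na(tweets$replyToSN)

# Factor 2: the tweet contains a link (simple http/https pattern)
tweets$has_link <- grepl("https?://", tweets$text)

# Keep only non-reply tweets as scheduling candidates
candidates <- tweets[!tweets$is_reply, ]
```

Whether to drop replies outright or keep `is_reply` as a model input is the same choice described above; the sketch shows the removal route.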


As I mentioned before, there is a business advantage to identifying scheduled posts, especially if we can create an algorithm and automate it.  Then we can use this algorithm to identify fraudulent and spam users, and move them out of analytical processes.

Starting on that algorithm (using our data from above) and implementing the three areas above, I coded a solution that does the following things:
  • Removes tweets that came from the Twitter web application (these are unlikely to be scheduled).
  • Removes posts that are replies to other users.
  • Calculates a "goodness of fit" test at the user level, comparing last-digit date-time frequencies (minutes) to a flat-line prior.
  • Calculates the % of posts that are content creation (largely links and graphics).
  • Calculates each user's % of time posting from services versus the web application.
  • Runs topic modeling on tweets, to find "call to action" type tweets.
  • Builds a classifier that marries the above factors together in a robust way and classifies tweets by their overall likelihood of being scheduled.
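
To make the user-level steps concrete, here is a hedged sketch of rolling tweet-level flags up to per-user shares. It is not the classifier itself: the `from_service` and `has_link` columns, the toy rows, and the unweighted score are all hypothetical stand-ins.

```r
# Toy tweet-level flags; in practice these come from the steps above
tweets <- data.frame(
  screenName   = c("a", "a", "a", "a", "b", "b"),
  from_service = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  has_link     = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Share of a user's tweets carrying a flag, in a fixed user order
users <- sort(unique(tweets$screenName))
pct <- function(flag) {
  as.vector(tapply(flag, tweets$screenName, mean)[users])
}

user_features <- data.frame(
  screenName  = users,
  pct_service = pct(tweets$from_service),
  pct_link    = pct(tweets$has_link),
  stringsAsFactors = FALSE
)

# Hypothetical score: unweighted mean of the two shares
user_features$score <- rowMeans(user_features[, c("pct_service", "pct_link")])

# Rank users from most to least likely to be scheduling
user_features[order(-user_features$score), ]
```

A goodness-of-fit p-value per user could be added as another column in the same way before the factors are combined.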
I'm still working on refining the classifier, and hope to bring in some more sophisticated methods eventually, but the initial results are good.  Here are some tweets from the top two *likely to be scheduler* accounts:


Scheduled social media posts are a fact of life now, and they are useful for organizations trying to optimize messaging. They are also useful for spammers and others looking to use social media for manipulative purposes.  Either way, there is a business and analytical advantage in being able to discriminate between organic and scheduled posts, and my initial attempts at an algorithm provide a road map.

