I've been working with social media data recently, which always makes me a bit more attentive to my interactions on Twitter and Facebook. I saw an interesting post from an acquaintance: the person was talking about their current emotional state, though the post was obviously scheduled in advance.
The ability to schedule posts is quite useful for organizations trying to make multiple contacts with users, or to reach users during off-hours. To those of us who work with social media data, scheduled posts are fairly easy to identify on their face. If an algorithm could be developed to identify likely users of scheduled posts, we could remove them from algorithmic analysis, which is a good thing for a few reasons:
- Scheduled posts are more likely to be spam.
- Scheduled posts are more likely to be out-of-time (not posted in direct relation to a current event).
- Scheduled posts are more likely to be emotional fraud (people posting about how great they are, in effect scheduling their own emotions, which is fairly difficult to do).
BACKGROUND
In order to predict or identify a process or event, we need an understanding of said process, and how it differs from "ordinary" events. Here's what scheduled social media processes look like:
- You use a scheduling program (TweetDeck, HootSuite, Buffer).
- You pick a time that you want to post.
- You create content you want to post.
```r
# search settings
searchterm <- "#ksleg"
numtweets <- 10000000
since_date <- "2016-02-10"
min_freq_terms <- 100
topic_num <- 4

# packages
library(httr)
library(twitteR)
library(tm)
library(RTextTools)
library(topicmodels)
library(wordcloud)

# auth in
setup_twitter_oauth("", "", access_token = NULL, access_secret = NULL)

# pull from twitter
tweets <- searchTwitter(searchterm, n = numtweets, lang = "en", since = since_date)

# pull out retweets
tweets <- strip_retweets(tweets, strip_manual = TRUE, strip_mt = TRUE)

# JSON to Data Frame
tweets <- twListToDF(tweets)

# create a clean object
mydata <- tweets

# create a field of direct URL
mydata$tl <- paste("http://twitter.com/", mydata$screenName, "/status/", mydata$id, sep = "")
```
This creates a nice data frame, which I exported to Excel for demonstration purposes. We'll start our analysis from that.
SCHEDULING PROGRAMS
Non-Technical Explanation: The easiest element of a scheduled tweet to identify is its origin, because scheduled social media posts usually come from some external program. If you just want to know whether your friends are using a scheduler, you can often identify it from the post itself. On Facebook, for instance, an icon will appear next to the post like this:
Technical Explanation: Using our downloaded data, identifying tweet scheduling programs is just as easy with the *statusSource* field from our data frame above. This field shows the originating program for each tweet, whether it came through the web interface, Buffer, or Hootsuite. Here's a screenshot that shows that column, with yellow indicating common scheduler programs.
Identifying tweet scheduling programs gives us some good candidates for tweet schedulers, but you can also tweet directly from Buffer or Hootsuite, so we need to look at other factors.
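As a rough illustration of the source check, here is a minimal base R sketch. It assumes *statusSource* arrives as an HTML anchor string, as it typically does from twitteR. The sample values, the helper names `clean_source` and `is_scheduler_source`, and the list of scheduler clients are all hypothetical and should be adjusted to your own data.

```r
# Hypothetical sample of statusSource values (HTML anchor strings).
status_source <- c(
  '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
  '<a href="https://buffer.com" rel="nofollow">Buffer</a>',
  '<a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a>',
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
)

# Strip the HTML tags so only the client name remains.
clean_source <- function(x) trimws(gsub("<[^>]+>", "", x))

# Assumed list of scheduling clients; extend it as needed.
scheduler_clients <- c("buffer", "hootsuite", "tweetdeck")

# TRUE when the cleaned client name matches a known scheduler.
is_scheduler_source <- function(x) {
  tolower(clean_source(x)) %in% scheduler_clients
}

source_name <- clean_source(status_source)
from_scheduler <- is_scheduler_source(status_source)
```

A flag like `from_scheduler` only marks candidates, since these clients can also post immediately.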
TIME OF SCHEDULE
Identifying the time of a social media post is fairly easy, both online and through our tweet metadata data set. Here's what it looks like in each:
But how do those help us identify tweet schedulers?
Non-Technical Explanation: Look for friends who tend to post everything at the same "minute" mark. Most notable are those who always post on the hour, half hour, or ten-minute increment. But also watch for people who schedule on the 5's or another seemingly pre-set timing (I have a friend who tends to schedule posts on the 3's).
Technical Explanation: Digital Analysis. This rests on a simple phenomenon: distributions that people choose don't match random, natural, or stochastic distributions. To understand this, you can read about Benford's law and its application to forensic accounting. For an algorithmic implementation of this analysis, you can test each user's "minute" distribution against the distribution for all users using a Chi-Squared Goodness of Fit test. Users whose minutes don't fit a flat or global distribution are more likely to be tweet schedulers.
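The goodness-of-fit idea can be sketched in base R with `chisq.test`. This is one possible reading of the approach, not a definitive implementation: it tests the last digit of each posting minute against a flat distribution by default, and the helper names `minute_of` and `minute_digit_test`, along with the two example users, are hypothetical. Passing a different `ref_p` (for example, the digit proportions across all users) would give the global comparison instead.

```r
# Extract the minute (0-59) from a vector of POSIXct timestamps.
minute_of <- function(created) as.integer(format(created, "%M"))

# Chi-squared goodness-of-fit test of the last digit of the posting minute
# against a reference distribution (flat by default).
minute_digit_test <- function(created, ref_p = rep(1 / 10, 10)) {
  digits <- minute_of(created) %% 10
  observed <- as.vector(table(factor(digits, levels = 0:9)))
  # Per-user samples are often small, so simulate the p-value
  # rather than rely on the chi-squared approximation.
  chisq.test(observed, p = ref_p, simulate.p.value = TRUE, B = 2000)
}

base_time <- as.POSIXct("2016-02-10 00:00:00", tz = "UTC")

# Hypothetical scheduler: 60 posts, always at minute 00 or 30.
scheduled <- base_time + 3600 * (0:59) + 60 * rep(c(0, 30), 30)

# Hypothetical organic user: 60 posts spread evenly over every minute.
organic <- base_time + 60 * (0:59)

sched_test <- minute_digit_test(scheduled)
organic_test <- minute_digit_test(organic)
```

A small p-value for a user means their minute digits are unlikely under the reference distribution, which makes that user a scheduling candidate rather than a confirmed scheduler.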
CONTENT OF POST
The content of posts often varies significantly for post schedulers too. In simple terms, the decision to schedule your posts is a very deliberate content-creation choice. A few simple rules for the content created by this process:
- The posts are less likely to be a reply to another user's post.
- The posts are MORE likely to contain a link or other pre-planned content.
- The posts are tonally more likely to be a call to action in some way.
Technical Explanation: For items 1 and 2 (replies and links) you can fairly easily create factors in your data set that identify these types of posts (and either remove them from consideration, or model association probabilities). For the call to action we can use sentiment and topic modeling to attempt to identify posts that resemble "asks" or "calls to action."
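For the reply and link factors, a minimal base R sketch might look like the following. It assumes the `text` and `replyToSN` columns that `twListToDF` produces. The sample rows and the new column names `is_reply` and `has_link` are hypothetical, and the URL pattern is a deliberately simple assumption.

```r
# Hypothetical rows mimicking two of the columns returned by twListToDF().
mydata <- data.frame(
  text = c(
    "Vote today! http://example.com/bill",
    "@someone I agree",
    "Long day at the capitol"
  ),
  replyToSN = c(NA, "someone", NA),
  stringsAsFactors = FALSE
)

# A post counts as a reply if it has a reply target or starts with a mention.
mydata$is_reply <- !is.na(mydata$replyToSN) | grepl("^@", mydata$text)

# A post counts as link content if it contains an http or https URL.
mydata$has_link <- grepl("https?://", mydata$text)

# Keep only non-replies as candidates for scheduled posts.
candidates <- mydata[!mydata$is_reply, ]
```

Whether to drop replies outright or keep `is_reply` as a model input depends on which of the two routes described above you take.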
TECHNICAL IMPLEMENTATION
As I mentioned before, there is a business advantage to identifying scheduled posts, especially if we can create an algorithm and automate it. We can then use this algorithm to identify fraudulent and spam users, and move them out of analytical processes.
Starting on that algorithm (using our data from above) and implementing the three areas above, I coded a solution that does the following things:
- Remove tweets that came from the Twitter web application (these are unlikely to be scheduled).
- Remove posts that are replies to other users.
- Calculate a "goodness of fit" test at the user level, comparing last-digit date-time frequencies (minutes) to a flat-line prior.
- Calculate % of posts that are content creation (largely links and graphics).
- Calculate user % of time posting from services versus the web application.
- Topic modeling of tweets, to find "call to action" type tweets.
- A classifier based on marrying the above factors together in a robust way that can classify tweets by their overall likelihood to be scheduled.
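The two user-level percentages in the list above could be computed along these lines. This is only a sketch of the aggregation step, not the classifier itself. It assumes each tweet already carries two logical columns, here given the hypothetical names `from_scheduler` and `has_link`, and the helper `user_features` and the sample data are likewise made up for illustration.

```r
# Hypothetical tweet-level data with one row per tweet.
tweets_flagged <- data.frame(
  screenName = c("a", "a", "a", "a", "b", "b"),
  from_scheduler = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  has_link = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Summarize each user: tweet count, share posted through a scheduling
# service, and share containing a link.
user_features <- function(df) {
  users <- sort(unique(df$screenName))
  data.frame(
    screenName = users,
    n_tweets = as.vector(tapply(df$from_scheduler, df$screenName, length)[users]),
    pct_scheduler = as.vector(tapply(df$from_scheduler, df$screenName, mean)[users]),
    pct_link = as.vector(tapply(df$has_link, df$screenName, mean)[users]),
    stringsAsFactors = FALSE
  )
}

features <- user_features(tweets_flagged)
```

A table like `features` could then be joined with the per-user goodness-of-fit results and passed to whatever classifier is chosen.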