WeRateDogs (https://twitter.com/dog_rates) is a Twitter account that rates people's dogs with a humorous comment about the dog. The goal of this project is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. This requires first gathering, then assessing and cleaning the data.
#load Python libraries
import pandas as pd
import numpy as np
import requests
import tweepy
import json
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
#set up matplotlib to work interactively
%matplotlib inline
sns.set() # switch on seaborn defaults
Three different pieces of data are gathered for the project, as described in the Project Details page:
from the course's project description :
The WeRateDogs Twitter archive. I am giving this file to you, so imagine it as a file on hand. Download this file manually by clicking the following link: twitter_archive_enhanced.csv
#this file was downloaded and saved in the same folder as the Jupyter notebook file
# it can simply be read into a dataframe
df_archive = pd.read_csv('twitter-archive-enhanced.csv')
#initial check to see what the dataframe looks like
df_archive.head(3)
from the course's project description :
The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
#load .tsv into working memory using a method from the requests library
df_predictions = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')
#store in working directory
with open('image-predictions.tsv', mode='wb') as file:
    file.write(df_predictions.content)
#read .tsv into dataframe, specifying the tab separator
df_predictions = pd.read_csv('image-predictions.tsv',sep="\t")
#initial check what the dataframe looks like
df_predictions.head(3)
from the course's project description :
Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt file. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.
#I obtained a Twitter dev account and the relevant API keys, secrets
#and tokens are below :
consumer_key = '########################################################'
consumer_secret = '########################################################'
access_token = '########################################################'
access_secret = '########################################################'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
#api = tweepy.API(auth)
#ensure appropriate rate_limit as described here: https://stackoverflow.com/a/44586034
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
#function to look up list of tweets, adapted from https://stackoverflow.com/a/44586034
def lookup_tweets(tweet_IDs):
    '''
    Returns list of tweets as generated by the Twitter API.
    Parameter: Tweet IDs, either as a list of integers, or
    column of a Pandas dataframe;
    handles 100 tweet limit, uses api.statuses_lookup()
    '''
    if isinstance(tweet_IDs, pd.Series): #https://stackoverflow.com/a/18117744
        tweet_IDs = tweet_IDs.tolist() #convert pandas df column to list
    full_tweets = [] #initialise list for list of tweets
    tweet_count = len(tweet_IDs)
    try:
        for i in range((tweet_count // 100) + 1): #handle 100 tweet limit
            # Catch the last group if it is less than 100 tweets
            end_loc = min((i + 1) * 100, tweet_count)
            print(i) #get feedback during downloading
            print("range {} to {}".format(i*100, end_loc))
            full_tweets.extend(
                api.statuses_lookup(tweet_IDs[i * 100:end_loc])
            )
        return full_tweets
    except tweepy.TweepError:
        print("Something went wrong, quitting...")
# The aim is to use the Tweet IDs from WeRateDogs Twitter archive to gather
# additional information using the lookup_tweets() function
# with the tweet_id column from df_archive
results = lookup_tweets(df_archive.tweet_id)
#Next the tweets will be stored in a text file with one line per tweet
#using the _json property which contains JSON serializable response data
#as the tweepy status object itself is not JSON serializable https://stackoverflow.com/a/27901076
for tweet in results:
    with open('tweet_json.txt', 'a', encoding='utf8') as file: #append to text file
        json.dump(tweet._json, file)
        #json.dump() as described in https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/
        file.write("\n") #add newline character
The structure of JSON tweet objects is explained in https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.html
Tweet objects contain information such as the tweet "id", "text" and "user", as well as the retweet_count and favorite_count - the latter two are specified in the project description. I will also use the tweet id (so that it can be matched with the archive) as well as the user name.
The JSON of tweets which are retweets (i.e. re-posting an existing tweet, potentially authored by another user, as is, thereby creating a new tweet) or quote tweets (i.e. re-posting an existing tweet with additional tweet text) contains additional tweet objects: a "retweeted_status" object or a "quoted_status" object, respectively. The "retweeted_status" and "quoted_status" objects also contain information such as the "id", "text" and "user" of the tweet being retweeted/quoted. They are simply absent from tweet objects that are not retweets/quotes. In order to easily distinguish original tweets by WeRateDogs from retweets and quote tweets, I will add columns stating whether a tweet is a retweet or a quote tweet, as well as the author of the original tweet being retweeted/quoted. Similarly, I will add information about whether a tweet is a reply to another tweet, again to distinguish original tweets by WeRateDogs from tweets that are part of an ongoing conversation.
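As a quick illustration of this structure, the sketch below (assuming tweet_json.txt has been written by the cell above) checks which of these optional objects are present in the first stored tweet; testing for key membership is equivalent to the try/except approach used in the next cell.
#minimal sketch: peek at the first stored tweet and check which optional objects are present
with open('tweet_json.txt', 'r', encoding='utf8') as file:
    first_tweet = json.loads(file.readline())
'retweeted_status' in first_tweet, 'quoted_status' in first_tweet, first_tweet['in_reply_to_status_id'] is not None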
#The json file with the tweets will be read in and processed to build a
#dataframe with IDs, retweet count and favourite count
#also check if a tweet is a retweet, and if it is a reply
additional = [] #initialise list that will hold dictionaries
with open('tweet_json.txt','r', encoding='utf8') as file:
    for line in file:
        tweet = json.loads(line)
        #get retweet status, user if they exist
        try:
            retweet_user = tweet['retweeted_status']['user']['name']
            is_retweet = True
        except KeyError:
            retweet_user = None
            is_retweet = False
        #get quote status, user if they exist
        try:
            quote_user = tweet['quoted_status']['user']['name']
            is_quote = True
        except KeyError:
            quote_user = None
            is_quote = False
        additional.append({'tweet_id': tweet['id'],
                           'user_name': tweet['user']['name'],
                           'retweet_count': tweet['retweet_count'],
                           'favorite_count': tweet['favorite_count'],
                           #'is_retweet' : 'retweeted_status' in tweet,
                           #check if it's a retweet, adapted from https://stackoverflow.com/a/18937252
                           'is_retweet': is_retweet,
                           'retweet_user': retweet_user,
                           'is_quote': is_quote,
                           'quote_user': quote_user,
                           'is_reply': tweet['in_reply_to_status_id'] is not None
                           #check if it's a reply, adapted from https://stackoverflow.com/a/49469052
                           })
        # break
df_additional=pd.DataFrame(additional, columns=['tweet_id','user_name','retweet_count','favorite_count','is_retweet','retweet_user','is_quote','quote_user','is_reply'])
#initial check of new dataframe
df_additional.head(3)
df_additional.shape
Data quality needs to be assessed in the context of the questions that analysis of a data set is meant to answer. Before starting the assessment, it's helpful to be aware of what those questions might be. For the present data, questions might include:
df_additional.info()
Data types look OK. None values for retweet_user, quote_user are accurate rather than missing values because these tweets are not retweets/quote tweets. tweet_id could also be a string rather than integer, but integer may be more efficient here.
#visual assessment of the dataframe in Pandas - also done in separate spreadsheet software
df_additional
df_additional.describe()
retweet_count and favorite_count values of 0 seem low compared to their respective median values - what are these?
df_additional[df_additional.favorite_count==0].head(10)
df_additional[df_additional.favorite_count==0].is_retweet.value_counts()
Low favorite_count values are seen on retweets - because favourites are added to the original tweet rather than the retweet. This is true whether the retweet is of another user's tweet or of a tweet by WeRateDogs. These low favorite_count values do not reflect a data quality issue, but they are important for understanding the data.
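This can be checked directly in the stored JSON: for a retweet, the favourites accrue to the original tweet nested inside retweeted_status. A minimal sketch (assuming tweet_json.txt from the gathering step above):
#sketch: for the first retweet found, compare its own favorite_count with
#that of the original tweet nested in retweeted_status
with open('tweet_json.txt', 'r', encoding='utf8') as file:
    for line in file:
        tweet = json.loads(line)
        if 'retweeted_status' in tweet:
            print("retweet favourites :", tweet['favorite_count'])
            print("original favourites:", tweet['retweeted_status']['favorite_count'])
            break #one example is enough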
df_additional[df_additional.retweet_count<10]
Low retweet_count values are seen on tweets which are replies - presumably these are often of lesser interest in isolation and therefore generally not worth retweeting. Again: not reflective of a data quality issue, but important for understanding the data.
df_additional.retweet_user.value_counts()
df_additional.quote_user.value_counts()
WeRateDogs doesn't quote-tweet their own tweets, but commonly retweets their own tweets - this is a popular way for tweeters to boost their recent posts.
No major data quality issues stand out with the additional tweet data in df_additional. Data completeness and tidiness is assessed below.
#visual assessment of the dataframe in Pandas - also done in separate spreadsheet software
df_archive
#looking at missing fields and if data types are appropriate
df_archive.info()
Data are missing in a large number of rows for the fields relating to replies (only 78 non-null in_reply_to_status_id), retweets (181 non-null retweeted_status_id) and expanded urls (2297 non-null expanded_urls).
The following data types are not appropriate: the reply and retweet ID columns were read in as floats, and the timestamp columns as strings.
#replies
df_archive[df_archive.in_reply_to_status_id.notnull()][['tweet_id','in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp']].head()
#retweets
df_archive[df_archive.retweeted_status_id.notnull()][['tweet_id','retweeted_status_id','retweeted_status_user_id', 'retweeted_status_timestamp']].head()
Twitter uses the same process to generate tweet (status) IDs and user IDs. These are unique IDs based on time https://developer.twitter.com/en/docs/basics/twitter-ids.html
Their purpose is to create a unique identifier rather than to serve mathematical calculations. However, when IDs are encoded as floats, they are displayed in scientific notation, obscuring most of the digits - undermining their purpose of serving as unique identifiers.
Unlike the other ID variables, the tweet_id field has the integer data type and is displayed correctly. The reason the other ID variables were read in as floats by read_csv is that NaN can't be handled by the integer data type (see https://stackoverflow.com/a/11548224). (NB: the latest version of pandas, 0.24, has introduced experimental support for NA in an integer data type: https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support)
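As a sketch of how that newer nullable integer dtype could be used (this assumes a pandas version where float-to-'Int64' conversion works directly; older versions may need an intermediate step, and it is not used for the cleaning below):
#sketch only: convert the float ID columns to the nullable 'Int64' dtype so
#missing values can coexist with integer IDs (not used in the cleaning below)
id_cols = ['in_reply_to_status_id', 'in_reply_to_user_id',
           'retweeted_status_id', 'retweeted_status_user_id']
df_archive[id_cols].astype('Int64').head()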
#test if tweet_id really is unique
sum(df_archive.tweet_id.duplicated())
sum(df_archive.text.duplicated())
#check a few sample tweet texts
df_archive.text.sample(5)
#using for loop to print the whole text of tweets
for text in df_archive.text.sample(10):
    print(text)
The standard format is to start a tweet with "This is ... (name of dog)", where the name is known, and to end with a rating out of 10, followed by a photo of the dog.
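To illustrate this format with a hypothetical example (a sketch only - the cleaning steps below use their own extraction patterns), the name and rating can typically be pulled out of the text with a single regular expression:
#illustrative sketch on a made-up tweet text in the standard format
example = pd.Series(["This is Bella. She hopes her smile made you smile. 13/10 would pet"])
example.str.extract(r'^This is (?P<name>\w+)\..*?(?P<numerator>\d+)/(?P<denominator>\d+)')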
#checking the ratings numerator and denominator - how many unique values?
df_archive.rating_numerator.nunique(), df_archive.rating_denominator.nunique()
#frequency of different values of the numerator
df_archive.rating_numerator.value_counts()
#checking tweet text where the numerator is 15 or more
for tweet in df_archive[df_archive.rating_numerator>14].text.sample(10):
    print(tweet)
#checking tweet text where the numerator is less than 10
for tweet in df_archive[df_archive.rating_numerator<10].text.sample(10):
    print(tweet)
Most values of rating_numerator are between 10 and 14. Values outside of that range are often from tweets that do not follow this account's standard dog rating format.
Based on a sample, values well above 14 are often for pics with several dogs, and can be taken to be a multiple of an actual rating.
Based on a sample, values below 10 are often not pictures of a dog but of other animals, but can also be of somewhat ugly dogs.
#checking the values of the ratings denominator
df_archive.rating_denominator.value_counts()
for text in df_archive.query("rating_denominator != 10").text:
    if "/10" in text:
        print(text)
for text in df_archive.query("rating_denominator != 10").text:
    if "/10" not in text:
        print(text)
The rating_denominator is almost always 10.
In cases where it isn't 10, but the tweet text contains a fraction out of 10, the rating was extracted incorrectly due to the presence of another fraction in the text. This is an inaccuracy that can be fixed since we still have the tweet text available. In these cases the numerator is also wrong and can be cleaned.
In cases where it isn't 10 and the tweet text does not contain a fraction out of 10 but a multiple of 10, the rating is a deviation from the normal dog rating format, applied to multiple dogs - again this can be fixed, for both the numerator and denominator, as illustrated below.
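As a worked example of the second correction (hypothetical numbers): a group rating such as 44/40 implies 40/10 = 4 dogs, so dividing both parts by 4 recovers a standard per-dog rating of 11/10.
#worked example of scaling a multi-dog rating back to the standard format
numerator, denominator = 44, 40
n_dogs = denominator // 10
numerator / n_dogs, denominator // n_dogs #-> (11.0, 10)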
#Checking the name variable
df_archive.name.value_counts()
#checking all words in the names column that are not capitalised
df_archive[df_archive.name.str.islower()].name.value_counts()
#for tweet_text in df_archive[df_archive.name == 'a'].text.sample(5):
for tweet_text in df_archive[df_archive.name.str.islower()].text.sample(10):
    print(tweet_text)
It appears that the name was extracted using a rather simplistic method: looking for the construction "This is ..." and assuming the next word would be a dog's name. The incorrect cases are easily identified by the absence of a capitalised name.
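A toy illustration of this failure mode (made-up tweet texts): taking the word after "This is" works when a real name follows, but captures an article like "a" otherwise.
#toy sketch of the extraction failure mode described above
pd.Series(["This is Charlie. 12/10 good boy",
           "This is a very rare dog. 10/10"]).str.extract(r'This is (\w+)')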
#create a mask to select records where the name is either None or lower-case
mask = (df_archive.name == 'None') | (df_archive.name.str.islower())
#where there is no correct name, check for names after a different construction:
for tweet_text in df_archive.loc[mask].text.sample(100):
    if ("named" in tweet_text) | ("name is" in tweet_text):
        print(tweet_text)
Some records with a misattributed lower-case name or with "None" do still contain a dog's name - these can be extracted from the original tweet.
#checking the 4 last columns
df_archive.iloc[:,13:].head()
df_archive.doggo.value_counts()
df_archive.floofer.value_counts()
df_archive.pupper.value_counts()
df_archive.puppo.value_counts()
df_archive[(df_archive.doggo!="None") & (df_archive.pupper!="None")].iloc[:,13:]
The four last columns represent dog stage. Usually they have at most one dog stage, but sometimes there is more than one. The structure of these columns is not consistent with the principles of data tidiness: the column names are values - there should be a single variable instead.
df_archive.source.value_counts()
df_archive.expanded_urls.value_counts()
The usefulness of some columns is not clear, e.g. the source column and expanded_urls
df_predictions
df_predictions.info()
Data types are fine and there are no apparent missing fields; however, the column names are not very descriptive.
df_predictions.img_num.nunique()
df_predictions.img_num.value_counts()
df_predictions[df_predictions.img_num>1].sample(5)
sum(df_predictions.tweet_id.duplicated())
#check duplicated images
sum(df_predictions.jpg_url.duplicated()), sum(df_predictions.jpg_url.duplicated(keep=False))
df_predictions[df_predictions.jpg_url.duplicated(keep=False)].sort_values(by=['jpg_url']).head(10)
df_predictions[(df_predictions.p1_dog==False) & (df_predictions.p2_dog==False) & (df_predictions.p3_dog==False)].sample(10)
len(df_predictions[(df_predictions.p1_dog==False) & (df_predictions.p2_dog==False) & (df_predictions.p3_dog==False)])
For 324 records, no dog has been predicted. Looking at the pictures in question, this is often accurate, but at other times they are pictures in which dogs are very hard to predict because they occupy only a small proportion of the photo, or are somehow disguised, e.g. wearing glasses, wigs etc.
#for how many is confidence of prediction 1 higher than 2, and 2 higher than 3?
sum(df_predictions.p1_conf > df_predictions.p2_conf), sum(df_predictions.p2_conf > df_predictions.p3_conf)
The predictions are ranked according to their confidence.
The number of rows differs between the three dataframes:
df_archive.shape, df_additional.shape, df_predictions.shape
#df_additional was constructed using tweet_ids from df_archive, yet it has fewer rows
#sample of tweets missing from df_additional, using isin() and tilde for boolean indexing
df_archive[~df_archive.tweet_id.isin(df_additional.tweet_id)].sample(5) #https://stackoverflow.com/a/19960116
None of these tweets is accessible anymore - they appear to have been deleted: https://twitter.com/dog_rates/status/802247111496568832, https://twitter.com/dog_rates/status/861769973181624320, https://twitter.com/dog_rates/status/837012587749474308, https://twitter.com/dog_rates/status/842892208864923648, https://twitter.com/dog_rates/status/812747805718642688
#make a pd.Series of the deleted tweets and check if it explains the difference in number of records
deleted_tweets=df_archive[~df_archive.tweet_id.isin(df_additional.tweet_id)].tweet_id
deleted_tweets.size+df_additional.shape[0]==df_archive.shape[0]
#are the deleted tweets in the image prediction?
df_predictions[df_predictions.tweet_id.isin(deleted_tweets)]
#number of deleted tweets in image predictions, and total number of deleted tweets
df_predictions[df_predictions.tweet_id.isin(deleted_tweets)].shape[0],deleted_tweets.size
9 of the 17 deleted tweets are also in df_predictions.
What about the difference in record numbers between df_predictions and the other two dataframes?
df_predictions[~df_predictions.tweet_id.isin(df_archive.tweet_id)].shape[0]
len(df_archive[~df_archive.tweet_id.isin(df_predictions.tweet_id)])
df_additional[~df_additional.tweet_id.isin(df_predictions.tweet_id)][['tweet_id','is_retweet','is_quote','is_reply']].head()
To sum up:
Why is there no prediction for 281 tweets?
#calculate the tweet IDs that are missing for which there is also no expanded_url - i.e. no image
missing = (df_archive.expanded_urls.isnull()) & (~df_archive.tweet_id.isin(df_predictions.tweet_id))
sum(missing)
#list of Tweet IDs not present in predictions
missing_predictions = df_archive[~df_archive.tweet_id.isin(df_predictions.tweet_id)].tweet_id.tolist()
#IDs of tweets that are either a retweet, quote, or reply
not_original_tweet = df_additional[df_additional.is_retweet | df_additional.is_quote | df_additional.is_reply].tweet_id.tolist()
#how many IDs are a retweet, quote, or reply, i.e. in the intersection of missing_predictions and not_original_tweet
len(set(missing_predictions).intersection(set(not_original_tweet)))
Many tweets for which no image prediction is available either did not have a photo attached, or they are a reply/retweet/quote tweet - such tweets tend not to follow the standard format of the WeRateDogs dog rating tweets.
Data quality is assessed along four dimensions: completeness, validity, accuracy and consistency.
Data quality
df_additional (tweepy):
df_archive:
image predictions:
Data tidiness
##make copies of all dataframes to be cleaned
df_archive_copy = df_archive.copy()
df_predictions_copy = df_predictions.copy()
df_additional_copy = df_additional.copy()
17 rows in df_archive are missing from df_additional because the corresponding tweets have since been deleted, and 9 of these are also present in df_predictions. There are several ways of dealing with this.
One possibility is to reconstruct records for the missing tweets from df_archive and add those to df_additional. However, the main purpose of obtaining additional data via the Twitter API (in df_additional) is to complement df_archive, and that additional data is unrecoverable for the deleted tweets. Nothing would be gained with this approach.
Therefore a better choice would be to delete the rows representing deleted tweets in df_archive and df_predictions.
Define:
Code:
df_archive_copy = df_archive_copy[df_archive_copy.tweet_id.isin(df_additional.tweet_id)]
df_predictions_copy = df_predictions_copy[df_predictions_copy.tweet_id.isin(df_additional.tweet_id)]
Test:
#check nb of rows has decreased
df_archive.shape, df_archive_copy.shape, df_archive.shape > df_archive_copy.shape
df_predictions.shape, df_predictions_copy.shape, df_predictions.shape > df_predictions_copy.shape
#check row number of cleaned df_archive is the same as df_additional
df_archive_copy.shape[0] == df_additional.shape[0]
#make dataframe of tweets absent from df_additional
deleted_tweets = df_archive[~df_archive.tweet_id.isin(df_additional.tweet_id)]
#check all the tweet_ids in the cleaned dataframes are absent from the deleted tweets dataframe
assert sum(df_archive_copy.tweet_id.isin(deleted_tweets.tweet_id)) == 0
assert sum(df_predictions_copy.tweet_id.isin(deleted_tweets.tweet_id)) == 0
#compute the number of Tweet IDs for which there is no image prediction after removing missing tweets
df_additional_copy.shape[0] - sum(df_additional_copy.tweet_id.isin(df_predictions_copy.tweet_id))
Define
Four columns represent "dog stage" i.e. the column names are values that represent a single variable of dog-stage. However, there are more than 4 different values which this variable can take, as many have no dog stage associated, and in some cases there are several dogs in the picture. This makes the task more complex.
Code
#check head of 4 last columns ie. dog stage
df_archive_copy.iloc[:1,13:]
#create columns with boolean values for each dog stage
df_archive_copy['doggo_b'] = df_archive_copy.iloc[:,13]!="None"
df_archive_copy['floofer_b'] = df_archive_copy.iloc[:,14]!="None"
df_archive_copy['pupper_b'] = df_archive_copy.iloc[:,15]!="None"
df_archive_copy['puppo_b'] = df_archive_copy.iloc[:,16]!="None"
#calculate sum of boolean values for dog stage
#new column - will be 0 when there is none, and 2,3 or 4 if there is more than 1 dog stage value
df_archive_copy['several'] = df_archive_copy['doggo_b'].astype(int) + df_archive_copy['floofer_b'].astype(int) + df_archive_copy['pupper_b'].astype(int) + df_archive_copy['puppo_b'].astype(int)
#new column with boolean to say when there has been no dog stage defined
df_archive_copy['none'] = df_archive_copy['several']==0
#convert column to boolean - True = there are several stages
df_archive_copy['several'] = df_archive_copy['several']>1
#Look at records with several dog stages
mask=df_archive_copy.several==True
columns=['doggo', 'floofer', 'pupper', 'puppo']
df_archive_copy.loc[mask,columns]
#need to ensure the final dog stage variable can only take one value
#where there are several dog stages, overwrite values in the original columns with "None"
df_archive_copy.loc[mask,columns] = df_archive_copy.loc[mask,columns].replace(to_replace=['doggo','floofer','pupper','puppo'],
value='None')
#check
df_archive_copy.loc[mask,columns]
#Now change boolean values in the 'several' and 'none' columns to strings
#so that they can be used by merge in the same way as the other columns
df_archive_copy.loc[:,'several'].replace(True,"several dogs", inplace=True)
df_archive_copy.loc[:,'none'].replace(True,"no stage", inplace=True)
df_archive_copy.loc[:,'several'].replace(False,"None", inplace=True)
df_archive_copy.loc[:,'none'].replace(False,"None", inplace=True)
#drop the columns with booleans no longer needed
df_archive_copy.drop([ 'doggo_b', 'floofer_b', 'pupper_b', 'puppo_b'],axis=1, inplace=True)
#check values are either None or dog stage in single column
df_archive_copy.iloc[:15,13:]
#use melt to create temporary dataframe column for dog stage variable
df_dog_stages = pd.melt(df_archive_copy, id_vars='tweet_id',
value_vars=['doggo','floofer','pupper','puppo','several','none'],
var_name = 'stages', value_name = 'dog_stage')
#check
df_dog_stages.head(15)
#there should be duplication due to 5 "None" values for each record
df_dog_stages.dog_stage.value_counts()
#delete rows with "None", drop redundant column
df_dog_stages=df_dog_stages[df_dog_stages.dog_stage!="None"]
df_dog_stages.drop('stages', axis=1, inplace=True)
#convert to category
df_dog_stages.dog_stage = df_dog_stages.dog_stage.astype('category')
#check 'None' is removed
df_dog_stages.dog_stage.value_counts()
#check df_dog_stages and df_archive_copy have same length,
df_dog_stages.shape[0], df_archive_copy.shape[0], df_dog_stages.shape[0] == df_archive_copy.shape[0]
df_dog_stages.head()
#merge into df_archive_copy, using tweet_id, removing now redundant old dog stage columns
df_archive_copy = pd.merge(df_archive_copy.iloc[:,:13],df_dog_stages,on='tweet_id', how='left')
Test:
#check column structure
list(df_archive_copy)
#dtype should be categorical
df_archive_copy.info()
#check values
df_archive_copy.dog_stage.value_counts()
Define
df_additional has several columns with boolean values stating if a tweet is a retweet, reply or quote. These can be better expressed as a single variable of different exclusive categories of tweet: Original tweet, retweet of own tweet (SelfRT), retweet of tweet from another source (OtherRT), reply, or quote tweet.
Code
#using boolean indexing for self-retweets
df_additional_copy[(df_additional_copy.retweet_user.notnull()) & (df_additional_copy.retweet_user == df_additional_copy.user_name)].head()
#new column with boolean to state if a tweet is a self-retweet
df_additional_copy['is_self_retweet'] = (df_additional_copy.retweet_user.notnull()) & (df_additional_copy.retweet_user == df_additional_copy.user_name)
#using boolean indexing for retweets from other source
df_additional_copy[(df_additional_copy.retweet_user.notnull()) & (df_additional_copy.retweet_user != df_additional_copy.user_name)].head()
#new column with boolean to state if a tweet is a retweet from other source
df_additional_copy['is_other_retweet'] = (df_additional_copy.retweet_user.notnull()) & (df_additional_copy.retweet_user != df_additional_copy.user_name)
#new column to state if tweet is original tweet, i.e. those where is_retweet, is_quote, is_reply are all False
df_additional_copy['is_original'] = (df_additional_copy.is_retweet | df_additional_copy.is_quote | df_additional_copy.is_reply) == False
#each tweet should belong to one of the following categories: 'is_quote', 'is_reply', 'is_self_retweet',
#'is_other_retweet', 'is_original'
#therefore their sum should be the same as the number of rows.
df_additional_copy.shape[0] == sum(df_additional_copy.is_quote) + sum(df_additional_copy.is_reply) + sum(df_additional_copy.is_self_retweet) + sum(df_additional_copy.is_other_retweet) + sum(df_additional_copy.is_original)
#Replace True with the corresponding string
df_additional_copy.is_quote = df_additional_copy.is_quote.replace(True,"Quote")
df_additional_copy.is_reply = df_additional_copy.is_reply.replace(True,"Reply")
df_additional_copy.is_self_retweet = df_additional_copy.is_self_retweet.replace(True,"SelfRT")
df_additional_copy.is_other_retweet = df_additional_copy.is_other_retweet.replace(True,"OtherRT")
df_additional_copy.is_original = df_additional_copy.is_original.replace(True,"Original")
#Replace bool False with str
df_additional_copy[['is_quote', 'is_reply', 'is_self_retweet', 'is_other_retweet', 'is_original']] = df_additional_copy[['is_quote', 'is_reply', 'is_self_retweet', 'is_other_retweet', 'is_original']].replace(False,"False")
#use melt, also dropping variables no longer needed by leaving them out from id_vars
df_additional_copy = pd.melt(df_additional_copy, id_vars=['tweet_id', 'retweet_count', 'favorite_count'],#['tweet_id', 'user_name', 'retweet_count', 'favorite_count', 'is_retweet', 'retweet_user', 'quote_user'],
value_vars=[ 'is_quote', 'is_reply', 'is_self_retweet', 'is_other_retweet', 'is_original'],
var_name = 'is_type', value_name = 'tweet_type')
#check result
df_additional_copy.head()
#drop rows with False tweet_type
df_additional_copy = df_additional_copy[df_additional_copy.tweet_type!='False']
#convert to category
df_additional_copy.tweet_type = df_additional_copy.tweet_type.astype('category')
#drop unneeded column
df_additional_copy.drop('is_type', axis=1, inplace=True)
Test
#check cleaned dataframe structure
df_additional_copy.head()
#check data type
df_additional_copy.info()
#check there are no False values left
df_additional_copy.tweet_type.value_counts()
#check no records have been lost: the row count matches and all tweet_ids of the cleaned df are in the original df
(df_additional_copy.shape[0] == df_additional.shape[0]) and df_additional_copy.tweet_id.isin(df_additional.tweet_id).all()
According to the principles of data tidiness, each table should correspond to a single observational unit. The current structure of our data contradicts this principle, because it is a relatively heterogeneous data set, relating both to the tweets themselves (the archive plus the additional data from the API) and to the image predictions derived from them.
There are different ways of restructuring the data to better follow tidiness principles:
One possibility is to merge the three dataframes on the basis of the unique tweet IDs. In this case the observational unit would be the data associated with each tweet. However, there is no image prediction for 273 tweets, so these records would need to be removed, even though we may be interested in analysing them.
A second possibility is to re-structure the data into two dataframes, on the basis of two observational units : (1) tweets from the WeRateDogs Twitter user, including data provided by Twitter ; (2) dog predictions derived from the content of WeRateDogs tweets, in particular, predictions from images posted by the WeRateDogs user, as well as ratings, dog names and dog stages extracted from tweet text.
Only the first option is deemed correct by the project reviewer, although it implies losing the data on tweets for which there are no image predictions, which will then not be available for later analysis.
Define
Code:
#first merge step
df_archive_copy = pd.merge(df_archive_copy,df_additional_copy,on='tweet_id')
#before moving to second step, check the number of rows is unchanged
df_archive_copy.shape[0] == df_additional_copy.shape[0]
#second merge step - only keep rows that are in df_predictions_copy
df_archive_copy = pd.merge(df_archive_copy, df_predictions_copy,on='tweet_id', how='right')
Test:
#check the number of rows is now the same as df_predictions_copy
df_archive_copy.shape[0] == df_predictions_copy.shape[0]
#check all the column names of df_additional_copy and df_predictions_copy are still in df_archive_copy
assert set(df_additional_copy).intersection(set(df_archive_copy)) == set(df_additional_copy)
assert set(df_predictions_copy).intersection(set(df_archive_copy)) == set(df_predictions_copy)
#check the new dataframe structure
df_archive_copy.info()
#find out number of rows, and how many were lost
df_archive_copy.shape[0], df_additional_copy.shape[0] - df_archive_copy.shape[0]
#which types of tweets were deleted?
df_additional_copy[~df_additional_copy.tweet_id.isin(df_archive_copy.tweet_id)].tweet_type.value_counts()
#which types of tweets remain?
df_archive_copy.tweet_type.value_counts()
Define:
Two columns with timestamps - of when a tweet was posted, or when a retweet was originally posted - were read in as strings.
Code:
df_archive_copy.timestamp = pd.to_datetime(df_archive_copy.timestamp)
df_archive_copy.retweeted_status_timestamp = pd.to_datetime(df_archive_copy.retweeted_status_timestamp)
Test
df_archive_copy.info()
df_archive_copy[['timestamp','retweeted_status_timestamp']].sample(10)
Note that null values are handled correctly.
Define
Twitter uses unique numeric IDs for tweets and users. read_csv assigned inconsistent data types: float where values are missing, and integer for the others. Floats are displayed in scientific notation, which obscures their only utility of being unique identifiers.
To achieve a consistent data type, all of these columns can be converted to strings.
Code
#check current data types and column names
df_archive_copy.iloc[:,[0,1,2,6,7]].info()
#convert to strings, replace missing with "none"
df_archive_copy[['tweet_id','in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id',
'retweeted_status_user_id']] = df_archive_copy[['tweet_id','in_reply_to_status_id','in_reply_to_user_id',
'retweeted_status_id',
'retweeted_status_user_id']].fillna(0).astype(int).astype(str).replace("0","none")
Test
df_archive_copy.iloc[:,[0,1,2,6,7]].info()
df_archive_copy[['tweet_id','in_reply_to_status_id', 'in_reply_to_user_id',
'retweeted_status_id', 'retweeted_status_user_id']].sample(10)
Define:
Some dog names have not been extracted properly. Some are not names at all - identifiable by being lower-case - and some names may have been missed in fields with the value "None".
Code:
#create a mask to select records where the name is either None or lower-case
mask = (df_archive_copy.name == 'None') | (df_archive_copy.name.str.islower())
sum(mask)
df_archive_copy[mask].name.value_counts()
#re-extract names as the first word that follows the construction "named" or "name is", and fill null values with "None"
df_archive_copy.loc[mask,'name'] = df_archive_copy[mask].text.str.extract(r'.*named?\s?i?s?\s(?P<name>\w+).*.*', expand=True).fillna("None")
#check for lower-case "names"
df_archive_copy[df_archive_copy.name.str.islower()].name
#replace with "None"
mask_lower=df_archive_copy.name.str.islower()
df_archive_copy.loc[mask_lower,"name"] = "None"
Test
#check name values that were previously None or wrong
df_archive_copy[mask].name.value_counts()
#check if there are any lower-case names in the dataframe
sum(df_archive_copy.name.str.islower())
Define:
The rating_denominator is almost always 10 - where it isn't, it is indicative of either a mistake while extracting the rating (inaccurate), or of a non-standard rating being used (invalid). In both cases the numerator is also affected.
Code:
#define mask
mask = df_archive_copy.rating_denominator != 10
#create temporary dataframe
wrong_ratings = df_archive_copy.loc[mask,['rating_numerator','rating_denominator','text']]
wrong_ratings
#extract ratings that follow the pattern numerator/10
wrong_ratings[['corr_numerator','corr_denominator']] = wrong_ratings.text.str.extract(r'.*\s(?P<corr_numerator>\d+)\/(?P<corr_denominator>10).*.*', expand=True)
wrong_ratings
#fix ratings that are a multiple of 10 - assume multiple is the number of dogs, redefine numerator and denominator
#sampling the pics of corrected records for the presence of multiple dogs
for i, denominator in zip(wrong_ratings.index, wrong_ratings.rating_denominator):
    if (denominator % 10 == 0) and (denominator != 0):
        n_dogs = denominator / 10
        wrong_ratings.loc[i,'corr_numerator'] = wrong_ratings.loc[i,'rating_numerator'] / n_dogs
        wrong_ratings.loc[i,'corr_denominator'] = wrong_ratings.loc[i,'rating_denominator'] / n_dogs
        print("index: ", i)
        print("Tweet text:", wrong_ratings.loc[i,'text']) #check some of the URLs
        print("nb of dogs: {}, corrected numerator : {}, corrected denominator: {}".format(n_dogs, wrong_ratings.loc[i,'corr_numerator'], wrong_ratings.loc[i,'corr_denominator']))
wrong_ratings
#check the text of remaining tweets for ratings, make list of tweet_ids to delete
to_delete = []
for i in wrong_ratings[wrong_ratings.corr_denominator.isnull()].index:
    print("index: ", i)
    print("Tweet text:", wrong_ratings.loc[i,'text'])
    to_delete.append(df_archive_copy.loc[i,'tweet_id']) #get tweet_id from df_archive_copy as the index is the same
print(to_delete)
#check tweets in df_archive_copy for which no rating can be extracted
df_archive_copy[df_archive_copy.tweet_id.isin(to_delete)]
wrong_ratings.info()
The corrected numerator and denominator are strings. Since integer columns can't hold NaN, replace NaN values with 0, convert to integers, insert them into the original dataframe, then delete the records without an extractable rating.
wrong_ratings[['corr_numerator', 'corr_denominator']] = wrong_ratings[['corr_numerator', 'corr_denominator']].fillna(0).astype(int)
wrong_ratings.info()
#replace re-extracted correct ratings in df_archive_copy
df_archive_copy.loc[mask,'rating_numerator'] = wrong_ratings.corr_numerator
df_archive_copy.loc[mask,'rating_denominator'] = wrong_ratings.corr_denominator
#delete rows relating to tweets without an extractable rating
df_archive_copy = df_archive_copy[~df_archive_copy.tweet_id.isin(to_delete)]
Test
df_archive_copy.rating_denominator.value_counts()
sum(df_archive_copy.rating_denominator.isnull())
df_archive_copy[['rating_numerator','rating_denominator']].info()
Define:
All rating numerators were extracted as integers, but some ratings in the tweet text are actually decimals, so the extracted integer is inaccurate for these.
Code:
#extract decimal ratings into new column
df_archive_copy['decimal_numerator'] = df_archive_copy.text.str.extract(r'.*\s(\d+\.\d+)\/10.*.*', expand=False).astype(float)
#extraction results
df_archive_copy[df_archive_copy.decimal_numerator.notnull()]
#define mask of extracted decimal ratings:
mask = df_archive_copy.decimal_numerator.notnull()
df_archive_copy.loc[mask,['tweet_id','text','decimal_numerator']]
#print text to check extraction
for t in df_archive_copy[mask].text:
    print(t)
#confirm original extraction was inaccurate
df_archive_copy.loc[mask,['rating_numerator','decimal_numerator']]
#convert rating_numerator to floats
df_archive_copy['rating_numerator'] = df_archive_copy['rating_numerator'].astype(float)
#replace original numerators
df_archive_copy.loc[mask,'rating_numerator'] = df_archive_copy.loc[mask,'decimal_numerator']
#drop unnecessary decimal_numerator column
df_archive_copy.drop('decimal_numerator',axis=1, inplace=True)
Test:
#check for presence of decimal numerators
df_archive_copy.rating_numerator.value_counts()
df_archive_copy.loc[mask,'rating_numerator']
#check correct dtype and unnecessary column has been dropped
df_archive_copy.info()
Define
Replace the column names with more descriptive names :
Code
#img_num is the number of the image that corresponded to the most
#confident prediction (numbered 1 to 4, since tweets can have up to four images)
df_archive_copy = df_archive_copy.rename (columns={'jpg_url':'image_used_url',
'img_num':'image_used_num',
'p1': 'prediction_p1',
'p1_conf': 'confidence_p1',
'p1_dog': 'is_dog_p1',
'p2': 'prediction_p2',
'p2_conf': 'confidence_p2',
'p2_dog': 'is_dog_p2',
'p3': 'prediction_p3',
'p3_conf': 'confidence_p3',
'p3_dog': 'is_dog_p3'})
Test
list(df_archive_copy)
Define:
Remove underscores, capitalise all dog breed names
Code:
df_archive_copy.prediction_p1 = df_archive_copy.prediction_p1.str.replace("_", " ")
df_archive_copy.prediction_p2 = df_archive_copy.prediction_p2.str.replace("_", " ")
df_archive_copy.prediction_p3 = df_archive_copy.prediction_p3.str.replace("_", " ")
#capitalise the dog breed names only, using is_dog_px for boolean indexing
df_archive_copy.loc[df_archive_copy.is_dog_p1,'prediction_p1'] = df_archive_copy[df_archive_copy.is_dog_p1].prediction_p1.str.title()
df_archive_copy.loc[df_archive_copy.is_dog_p2,'prediction_p2'] = df_archive_copy[df_archive_copy.is_dog_p2].prediction_p2.str.title()
df_archive_copy.loc[df_archive_copy.is_dog_p3,'prediction_p3'] = df_archive_copy[df_archive_copy.is_dog_p3].prediction_p3.str.title()
Test
df_archive_copy[df_archive_copy.is_dog_p1][['prediction_p1','prediction_p2','prediction_p3']].sample(5)
Define:
The df_archive_copy dataframe contains 3 columns for each of the 3 predictions: one for the prediction name, one for whether it is a dog, and one for the confidence of the prediction. For the present analysis, this level of detail about the machine learning process is unnecessary, as we are only interested in the most confident prediction that is of a dog.
Code:
#Get dog prediction from best prediction of a dog
mask = df_archive_copy.is_dog_p1
df_archive_copy.loc[mask,'top_prediction'] = df_archive_copy.loc[mask,'prediction_p1']
mask = (df_archive_copy.is_dog_p1==False) & df_archive_copy.is_dog_p2
df_archive_copy.loc[mask,'top_prediction'] = df_archive_copy.loc[mask,'prediction_p2']
mask = (df_archive_copy.is_dog_p1==False) & (df_archive_copy.is_dog_p2==False) & df_archive_copy.is_dog_p3
df_archive_copy.loc[mask,'top_prediction'] = df_archive_copy.loc[mask,'prediction_p3']
#fill in rows for which there is no predicted dog with "No dog predicted"
df_archive_copy.top_prediction = df_archive_copy.top_prediction.fillna("None")
#remove redundant columns
df_archive_copy.drop(['prediction_p1', 'confidence_p1', 'is_dog_p1', 'prediction_p2', 'confidence_p2',
'is_dog_p2', 'prediction_p3', 'confidence_p3', 'is_dog_p3'],axis=1, inplace=True)
Test:
#check remaining columns
list(df_archive_copy)
Define:
There are also a number of duplicated dog predictions: the same image was used, with an identical prediction, though the unique tweet_id is different. These may be the result of WeRateDogs retweeting their own tweets. The project description suggests that retweets should be removed.
Code:
#variable to record self-retweeting
df_archive_copy['was_retweeted'] = df_archive_copy.tweet_id.isin(df_archive_copy.retweeted_status_id)
#check how many duplicates are self-retweets
df_archive_copy[df_archive_copy.image_used_url.duplicated(keep=False)].tweet_type.value_counts()
df_archive_copy.tweet_type.value_counts()
#remove retweets
df_archive_copy = df_archive_copy[~df_archive_copy.tweet_type.isin(['SelfRT','OtherRT'])]
#remove unnecessary columns related to retweets and replies
df_archive_copy.drop(['retweeted_status_id', 'retweeted_status_user_id',
'retweeted_status_timestamp', 'in_reply_to_status_id', 'in_reply_to_user_id'],axis=1, inplace=True)
Test:
#check new variable
df_archive_copy['was_retweeted'].value_counts()
list(df_archive_copy)
#ensure no duplicated images remain
sum(df_archive_copy.image_used_url.duplicated())
#check tweet_types remaining
df_archive_copy.tweet_type.value_counts()
Given the relatively small size of the data set, the most straightforward way to store the cleaned data is as a csv file:
df_archive_copy.to_csv('twitter_archive_master.csv', index=False)
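Note that a plain csv does not preserve pandas dtypes. Below is a sketch of how the master file could be read back in with the datetime and category columns restored (column names as produced by the cleaning above); it is not needed for the rest of this notebook.
#sketch: re-load the master file and restore the dtypes lost in the csv round-trip
df_master = pd.read_csv('twitter_archive_master.csv', parse_dates=['timestamp'])
for col in ['dog_stage', 'tweet_type']:
    df_master[col] = df_master[col].astype('category')
df_master.info()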
On Twitter, favourite and retweet counts are an important metric of audience engagement. Here I want to use them to analyse the popularity of tweets and of dogs. To get a better idea of how they behave, I will look at their distributions first.
#distribution of retweet and favourite counts
n_rt= df_archive_copy.retweet_count
n_fav= df_archive_copy.favorite_count
fig = plt.figure(figsize=(12,3))
ax1 = fig.add_subplot(121)
stats.probplot(n_rt, plot=plt) #https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html
ax1.set_title("Distribution of Retweet count")
ax2 = fig.add_subplot(122)
stats.probplot(n_fav, plot=plt)
ax2.set_title("Distribution of Favourite count");
#calculate logarithm for non-zero counts
log_rt= np.log(df_archive_copy[df_archive_copy.retweet_count>0].retweet_count)
log_fav= np.log(df_archive_copy[df_archive_copy.favorite_count>0].favorite_count)
fig = plt.figure(figsize=(12,3))
ax3 = fig.add_subplot(121)
stats.probplot(log_rt, plot=plt)
ax3.set_title("Distribution of Retweet count Logarithm")
ax4 = fig.add_subplot(122)
stats.probplot(log_fav, plot=plt);
ax4.set_title("Distribution of Favourite count Logarithm");
The counts of favourites and of retweets are approximately log-normally distributed (though retweets deviate somewhat at very low counts).
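To complement the visual probability plots, a quick sketch of a formal normality test on the log-transformed counts (scipy's D'Agostino-Pearson test; with samples this large the test flags even small deviations from normality, so it is indicative only):
#sketch: D'Agostino-Pearson normality test on the log-transformed counts
print("log retweet count  :", stats.normaltest(log_rt))
print("log favourite count:", stats.normaltest(log_fav))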
Seaborn's jointplot function is a good way to look at the relationship between retweet and favourite count
#only consider non-zero values
x=(df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.retweet_count>0)].favorite_count)
y=(df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.retweet_count>0)].retweet_count)
corr_coeff= stats.pearsonr(x,y) #https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
ax=sns.jointplot(x,y,kind="reg")
ax.fig.text(x=0.3,y=0.7, s="Pearson: r={0:.2f}, p={1:.1f}".format(corr_coeff[0],corr_coeff[1]), ha="center", va="center")
#decimals https://stackoverflow.com/a/8940627
ax.fig.subplots_adjust(top=0.93) #title https://stackoverflow.com/a/29814281
ax.fig.suptitle('Correlation of favourites vs retweets', fontsize=14);
Plot the log values of retweet count and of favourite count, to reduce overplotting and get a better view of the distribution of the counts.
x=np.log(df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.retweet_count>0)].favorite_count)
y=np.log(df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.retweet_count>0)].retweet_count)
corr_coeff= stats.pearsonr(x,y)
ax=sns.jointplot(x,y,kind="reg")
ax.fig.text(x=0.3,y=0.7, s="Pearson: r={0:.2f}, p={1:.1f}".format(corr_coeff[0],corr_coeff[1]), ha="center", va="center")
ax.fig.subplots_adjust(top=0.93)
ax.fig.suptitle('Favourites vs retweets (log)', fontsize=14);
There is a very strong correlation (r=0.93) between retweet count and favourite count. This is easily understood, as both retweets and favourites are expressions of audience engagement, the former being a stronger form of engagement. There is also a mutual reinforcement mechanism at play: tweets that are retweeted are shown to a wider audience, increasing the pool of users who may favourite the tweet; and popular tweets, as expressed by a higher favourite count, are given higher prominence in the Twitter timeline. This feedback mechanism is likely the basis for the log-normal distribution observed.
With nearly 8 million followers, @dog_rates (WeRateDogs) is a very successful Twitter account. How did it get there?
The plot below shows the evolution of audience engagement over the first 20 months. Only original tweets are considered - i.e. replies, quote-tweets and retweets are left out, as audience engagement is fundamentally different with them.
The retweet and favourite counts are shown for each tweet as a point in the scatter plot, on a logarithmic y-axis, with lines showing a rolling median over a window of 50 tweets.
fig, ax = plt.subplots(figsize=(14,6))
x = df_archive_copy[df_archive_copy.tweet_type=='Original'].timestamp
y1 = df_archive_copy[df_archive_copy.tweet_type=='Original'].favorite_count
y1m = y1.rolling(50).median()
y2 = df_archive_copy[df_archive_copy.tweet_type=='Original'].retweet_count
y2m = y2.rolling(50).median()
ax.plot_date(x,y1, color='b', markersize = 1)
ax.plot_date(x,y2, color='y', markersize = 1)
ax.plot(x,y1m, color='k')
ax.plot(x,y2m,color='r')
plt.yscale("log")
ax.set_title("Popularity of @dog_rates (WeRateDogs) over 2 years")
ax.set_ylabel("retweet / favourite count (log axis)")
ax.set_xlabel("date")
ax.legend(['Favourites','Retweets','Favourites - Rolling median','Retweets - Rolling median']);
During the first month or two, there is a rapid increase in popularity - from tens to thousands of retweets. Thereafter, the slope of the curve becomes much less steep but increases steadily until the end of the period studied. The approximately linear aspect of the rolling medians expresses an exponential increase, since the y-axis is logarithmic. The distance between the lines for favourites and retweets is relatively consistent but widens slowly.
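To put an approximate number on that growth, the sketch below (reusing the x and y1 series defined for the plot above) fits log favourite count against time in days; the slope of such a fit corresponds to an average daily growth rate.
#sketch: quantify the roughly exponential growth of favourite counts over time
days = (x - x.min()).dt.total_seconds() / 86400 #days since the first tweet in the sample
valid = y1 > 0 #log requires positive counts
slope, intercept, r, p, se = stats.linregress(days[valid], np.log(y1[valid]))
print("average favourite-count growth: {:.1f}% per day (r={:.2f})".format(100 * (np.exp(slope) - 1), r))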
WeRateDogs commonly states the name of the dog in the picture, as well as the "dog stage". Does this make the tweets more popular?
#create new variable stating if name and dog stage are used in text
df_archive_copy['has_name'] = df_archive_copy.name != "None"
df_archive_copy['has_stage_defined']=df_archive_copy.dog_stage != "no stage"
data = df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.tweet_type=='Original')]
#correlation https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pointbiserialr.html
corr_name = stats.pointbiserialr(data.has_name,data.favorite_count)
corr_stage = stats.pointbiserialr(data.has_stage_defined,data.favorite_count)
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(12,8))
sns.boxplot(y='favorite_count',x='has_name',data= data,ax=ax1).set_title('Favourite count: without/with named dogs')
fig.get_axes()[0].set_yscale('log')
sns.boxplot(y='favorite_count',x='has_stage_defined',data= data,ax=ax2).set_title('Favourite count: without/with dog stage')
fig.get_axes()[1].set_yscale('log')
ax1.text(x=0.6,y=65, s="Point-biserial: r={0:.2f}, p={1:.2f}".format(corr_name[0],corr_name[1]), ha="center", va="center")
ax2.text(x=0.6,y=65, s="Point-biserial: r={0:.2f}, p={1:.2f}".format(corr_stage[0],corr_stage[1]), ha="center", va="center");
Tweets that include a named dog or mention a dog stage are more popular - however, the difference is marginal. The correlation coefficient and p-value for named dogs (r=0.02, p=0.28) suggest the difference is not significant, and the correlation for tweets with a dog stage defined is only slightly stronger (r=0.08, p=0).
Is there a discernible effect of retweeting, and of the type of tweet?
data = df_archive_copy[df_archive_copy.favorite_count>0]
corr_coeff = stats.pointbiserialr(data.was_retweeted,data.favorite_count)
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(12,8))
sns.boxplot(y='favorite_count',x='was_retweeted',data= data,ax=ax1).set_title('Favourite count: retweets')
fig.get_axes()[0].set_yscale('log')
ax1.text(x=0.6,y=60, s="Point-biserial: r={0:.2f}, p={1:.2f}".format(corr_coeff[0],corr_coeff[1]), ha="center", va="center")
sns.boxplot(y='favorite_count',x='tweet_type',data= data,ax=ax2).set_title('Favourite count: Tweet type')
fig.get_axes()[1].set_yscale('log')
The box plot shows that WeRateDogs tweets which are retweeted by the same account have a higher favourite count than those that are not. Thus retweeting is plausibly a good strategy for increasing a tweet's popularity. However, the analysis here can't establish causality, because it is certainly possible that popular tweets are more likely to be chosen for retweeting.
As expected, replies are less popular than original WeRateDogs tweets, but quote tweets are, surprisingly, more popular than the standard original WeRateDogs tweets.
WeRateDogs uses an idiosyncratic rating system for dogs, in which ratings of usually between 10 and 13 out of 10 are used. Lower ratings do exist, as does 14, but ratings above 14 are outside the standard.
Do these ratings express anything meaningful about the dog(s) in question? Below is a regression plot of the rating (numerator) against the dog's popularity (log of favourite count). Only ratings below 15/10 are considered.
data = df_archive_copy[df_archive_copy.rating_numerator<15]
x = data.rating_numerator
y = np.log(data.favorite_count)
pearson = stats.pearsonr(x,y)
spearman = stats.spearmanr(x,y) #https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.spearmanr.html
ax=sns.jointplot(x,y,kind="reg")
ax.fig.text(x=0.17,y=0.77, s="Pearson: r={0:.2f}, p={1:.1f}".format(pearson[0],pearson[1]), ha="left", va="center")
ax.fig.text(x=0.17,y=0.74, s="Spearman: r={0:.2f}, p={1:.1f}".format(spearman[0],spearman[1]), ha="left", va="center")
ax.fig.subplots_adjust(top=0.93)
ax.fig.suptitle('Correlation of Dog rating vs popularity', fontsize=14);
The plot suggests that the rating system is meaningful and there is in fact a relationship between the rating and the popularity of a dog. The Pearson correlation coefficient of r=0.49 is lower than the Spearman correlation coefficient of r=0.60, since Pearson only measures linear relationships. Again, it is not possible to draw any conclusions regarding causality here - it may be that a higher rating biases the audience to favour a dog picture, or it may be that a high/low favourite count and a high/low dog rating are both independently caused by how great a dog picture is.
Are young dogs more popular? Are certain breeds more popular? The box plot below looks at this question.
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(12,8))
x= df_archive_copy.dog_stage
y= np.log(df_archive_copy.favorite_count)
sns.boxplot(x,y,data = df_archive_copy,ax=ax1).set_title('Dog stage and popularity')
top_dogs=df_archive_copy.top_prediction.value_counts().head(8).index.tolist()
data=df_archive_copy[df_archive_copy.top_prediction.isin(top_dogs)]
x= data.top_prediction #predicted breed on the x-axis, matching the plot title
y= np.log(data.favorite_count)
sns.boxplot(x,y,data = data,ax=ax2).set_title('Dog breed and popularity')
plt.xticks(rotation=45);
The most popular dog stages are "puppo", "doggo" and "floofer", whose median popularity is about one unit above that of "pupper" and of photos where no 'dog stage' is mentioned. It is perhaps surprising that "pupper" has the lowest median popularity - this suggests that it isn't simply the fact that a dog stage is mentioned in a tweet that contributes to its popularity.
Amongst the 7 most common dog breeds, Golden Retrievers tend to be the most popular, with Pugs and Chihuahuas at the lowest median popularity - these are roughly equal with pics that have no detected dog breed, which usually do not have a dog in the photo at all.
However, the effect of dog stage and dog breed on the popularity of dog pics is relatively small, considering that the ranges overlap.
Our cleaned data enable us to find out what the most common dog breeds are - below are the most frequently predicted breeds (skipping the most frequent value, "None", i.e. tweets with no dog predicted). This analysis assumes that the breed predictions are reasonably accurate and do not have any systematic bias.
df_archive_copy.top_prediction.value_counts()[1:20].plot.barh(figsize=(8,5))
plt.gca().invert_yaxis()
Similarly, our cleaned data enable us to find out what the most common dog names are (again skipping "None").
df_archive_copy.name.value_counts()[1:20].plot.barh(figsize=(8,5))
plt.gca().invert_yaxis();
Sources used
A number of sources were used for this project - links are provided as comments in the relevant code cells.