WeRateDogs (https://twitter.com/dog_rates) is a Twitter account that rates people's dogs with a humorous comment about the dog. The goal of this project is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. This requires first gathering, then assessing and cleaning the data.
#load Python libraries
import pandas as pd
import numpy as np
import requests
import tweepy
import json
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
#set up matplotlib to work interactively
%matplotlib inline
sns.set() # switch on seaborn defaults
Three different pieces of data are gathered for the project, as described in the Project Details page:
from the course's project description :
The WeRateDogs Twitter archive. I am giving this file to you, so imagine it as a file on hand. Download this file manually by clicking the following link: twitter_archive_enhanced.csv
#this file was downloaded and saved in the same folder as the Jupyter notebook file
# it can simply be read into a dataframe
df_archive = pd.read_csv('twitter-archive-enhanced.csv')
#initial check to see what the dataframe looks like
df_archive.head(3)
from the course's project description :
The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
#load .tsv into working memory using a method from the requests library
df_predictions = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')
#store in working directory
with open('image-predictions.tsv', mode='wb') as file:
    file.write(df_predictions.content)
#read .tsv into dataframe, specifying the tab separator
df_predictions = pd.read_csv('image-predictions.tsv',sep="\t")
#initial check what the dataframe looks like
df_predictions.head(3)
from the course's project description :
Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt file. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.
#I obtained a Twitter dev account and the relevant API keys, secrets
#and tokens are below :
consumer_key = '########################################################'
consumer_secret = '########################################################'
access_token = '########################################################'
access_secret = '########################################################'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
#api = tweepy.API(auth)
#ensure appropriate rate_limit as described here: https://stackoverflow.com/a/44586034
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
#function to look up list of tweets, adapted from https://stackoverflow.com/a/44586034
def lookup_tweets(tweet_IDs):
    '''
    Returns list of tweets as generated by the Twitter API.
    Parameter: Tweet IDs, either as a list of integers, or
    column of a Pandas dataframe;
    handles 100 tweet limit, uses api.statuses_lookup()
    '''
    if isinstance(tweet_IDs, pd.Series): #https://stackoverflow.com/a/18117744
        tweet_IDs = tweet_IDs.tolist() #convert pandas df column to list
    full_tweets = [] #initialise list for list of tweets
    tweet_count = len(tweet_IDs)
    try:
        for i in range((tweet_count // 100) + 1): #handle 100 tweet limit
            # Catch the last group if it is less than 100 tweets
            end_loc = min((i + 1) * 100, tweet_count)
            print(i) #get feedback during downloading
            print("range {} to {}".format(i*100, end_loc))
            full_tweets.extend(
                api.statuses_lookup(tweet_IDs[i * 100:end_loc])
            )
        return full_tweets
    except tweepy.TweepError:
        print("Something went wrong, quitting...")
# The aim is to use the Tweet IDs from WeRateDogs Twitter archive to gather
# additional information using the lookup_tweets() function
# with the tweet_id column from df_archive
results = lookup_tweets(df_archive.tweet_id)
#Next the tweets will be stored in a text file with one line per tweet
#using the _json property which contains JSON serializable response data
#as the tweepy status object itself is not JSON serializable https://stackoverflow.com/a/27901076
for tweet in results:
    with open('tweet_json.txt', 'a', encoding='utf8') as file: #append to text file
        json.dump(tweet._json, file)
        #json.dump() as described in https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/
        file.write("\n") #add newline character
The structure of JSON tweet objects is explained in https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.html
Tweet objects contain information such as the tweet "id", "text" and "user", as well as the retweet_count and favorite_count - the latter two are specified in the project description. I will also use the tweet id (so that it can be matched with the archive) as well as the user name.
The JSON of tweets which are retweets (i.e. re-posting an existing tweet, potentially authored by another user, as is, thereby creating a new tweet) or quote tweets (i.e. re-posting an existing tweet with additional tweet text) contains additional tweet objects: a "retweeted_status" object or a "quoted_status" object, respectively. The "retweeted_status" and "quoted_status" objects also contain information such as the "id", "text" and "user" of the tweet being retweeted/quoted. They are simply absent from tweet objects that are not retweets/quotes. In order to easily distinguish original tweets by WeRateDogs from retweets and quote tweets, I will add columns stating whether a tweet is a retweet or a quote tweet, as well as the author of the original tweet being retweeted/quoted. Similarly, I will add information about whether a tweet is a reply to another tweet, again to distinguish original tweets by WeRateDogs from tweets that are part of an ongoing conversation.
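As a quick illustration of this structure, the sketch below (assuming tweet_json.txt has been written by the cell above) checks which of these optional objects are present in the first stored tweet; testing for key membership is equivalent to the try/except approach used in the next cell.
#minimal sketch: peek at the first stored tweet and check which optional objects are present
with open('tweet_json.txt', 'r', encoding='utf8') as file:
    first_tweet = json.loads(file.readline())
'retweeted_status' in first_tweet, 'quoted_status' in first_tweet, first_tweet['in_reply_to_status_id'] is not None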
#The json file with the tweets will be read in and processed to build a
#dataframe with IDs, retweet count and favourite count
#also check if a tweet is a retweet, and if it is a reply
additional = [] #initialise list that will hold dictionaries
with open('tweet_json.txt','r', encoding='utf8') as file:
    for line in file:
        tweet = json.loads(line)
        #get retweet status, user if they exist
        try:
            retweet_user = tweet['retweeted_status']['user']['name']
            is_retweet = True
        except KeyError:
            retweet_user = None
            is_retweet = False
        #get quote status, user if they exist
        try:
            quote_user = tweet['quoted_status']['user']['name']
            is_quote = True
        except KeyError:
            quote_user = None
            is_quote = False
        additional.append({'tweet_id': tweet['id'],
                           'user_name': tweet['user']['name'],
                           'retweet_count': tweet['retweet_count'],
                           'favorite_count': tweet['favorite_count'],
                           #'is_retweet' : 'retweeted_status' in tweet,
                           #check if it's a retweet, adapted from https://stackoverflow.com/a/18937252
                           'is_retweet': is_retweet,
                           'retweet_user': retweet_user,
                           'is_quote': is_quote,
                           'quote_user': quote_user,
                           'is_reply': tweet['in_reply_to_status_id'] is not None
                           #check if it's a reply, adapted from https://stackoverflow.com/a/49469052
                           })
        # break
df_additional=pd.DataFrame(additional, columns=['tweet_id','user_name','retweet_count','favorite_count','is_retweet','retweet_user','is_quote','quote_user','is_reply'])
#initial check of new dataframe
df_additional.head(3)
df_additional.shape
Data quality needs to be assessed in the context of the questions that analysis of a data set is meant to answer. Before starting the assessment, it's helpful to be aware of what those questions might be. For the present data, questions might include:
df_additional.info()
Data types look OK. None values for retweet_user, quote_user are accurate rather than missing values because these tweets are not retweets/quote tweets. tweet_id could also be a string rather than integer, but integer may be more efficient here.
#visual assessment of the dataframe in Pandas - also done in separate spreadsheet software
df_additional
df_additional.describe()
retweet_count and favorite_count values of 0 seem low compared to their respective median values - what are these?
df_additional[df_additional.favorite_count==0].head(10)
df_additional[df_additional.favorite_count==0].is_retweet.value_counts()
Low favorite_count values are seen on retweets - because favourites are added to the original tweet rather than the retweet. This is true whether the retweet is of another user's tweet or of a tweet by WeRateDogs. These low favorite_count values do not reflect a data quality issue, but they are important for understanding the data.
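This can be checked directly in the stored JSON: for a retweet, the favourites accrue to the original tweet nested inside retweeted_status. A minimal sketch (assuming tweet_json.txt from the gathering step above):
#sketch: for the first retweet found, compare its own favorite_count with
#that of the original tweet nested in retweeted_status
with open('tweet_json.txt', 'r', encoding='utf8') as file:
    for line in file:
        tweet = json.loads(line)
        if 'retweeted_status' in tweet:
            print("retweet favourites :", tweet['favorite_count'])
            print("original favourites:", tweet['retweeted_status']['favorite_count'])
            break #one example is enough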
df_additional[df_additional.retweet_count<10]
Low retweet_count values are seen on tweets which are replies - presumably these are often of lesser interest in isolation and therefore generally not worth retweeting. Again: not reflective of a data quality issue, but important for understanding the data.
df_additional.retweet_user.value_counts()
df_additional.quote_user.value_counts()
WeRateDogs doesn't quote-tweet their own tweets, but commonly retweets their own tweets - this is a popular way for tweeters to boost their recent posts.
No major data quality issues stand out with the additional tweet data in df_additional. Data completeness and tidiness is assessed below.
#visual assessment of the dataframe in Pandas - also done in separate spreadsheet software
df_archive
#looking at missing fields and if data types are appropriate
df_archive.info()
Data are missing in a large number of rows for the fields relating to replies (only 78 non-null in_reply_to_status_id), retweets (181 non-null retweeted_status_id) and expanded urls (2297 non-null expanded_urls).
The following data types are not appropriate: the reply and retweet ID columns were read in as floats, and the timestamp columns as strings.
#replies
df_archive[df_archive.in_reply_to_status_id.notnull()][['tweet_id','in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp']].head()
#retweets
df_archive[df_archive.retweeted_status_id.notnull()][['tweet_id','retweeted_status_id','retweeted_status_user_id', 'retweeted_status_timestamp']].head()
Twitter uses the same process to generate tweet (status) IDs and user IDs. These are unique IDs based on time https://developer.twitter.com/en/docs/basics/twitter-ids.html
Their purpose is to create a unique identifier rather than to serve mathematical calculations. However, when IDs are encoded as floats, they are displayed in scientific notation, obscuring most of the digits - undermining their purpose of serving as unique identifiers.
Unlike the other ID variables, the tweet_id field has the integer data type and is displayed correctly. The reason the other ID variables were read in as floats by read_csv is that NaN can't be handled by the integer data type (see https://stackoverflow.com/a/11548224). (NB: the latest version of pandas, 0.24, has introduced experimental support for NA in an integer data type: https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support)
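As a sketch of how that newer nullable integer dtype could be used (this assumes a pandas version where float-to-'Int64' conversion works directly; older versions may need an intermediate step, and it is not used for the cleaning below):
#sketch only: convert the float ID columns to the nullable 'Int64' dtype so
#missing values can coexist with integer IDs (not used in the cleaning below)
id_cols = ['in_reply_to_status_id', 'in_reply_to_user_id',
           'retweeted_status_id', 'retweeted_status_user_id']
df_archive[id_cols].astype('Int64').head()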
#test if tweet_id really is unique
sum(df_archive.tweet_id.duplicated())
sum(df_archive.text.duplicated())
#check a few sample tweet texts
df_archive.text.sample(5)
#using for loop to print the whole text of tweets
for text in df_archive.text.sample(10):
    print(text)
The standard format is to start a tweet with "This is ... (name of dog)", where the name is known, and to end with a rating out of 10, followed by a photo of the dog.
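To illustrate this format with a hypothetical example (a sketch only - the cleaning steps below use their own extraction patterns), the name and rating can typically be pulled out of the text with a single regular expression:
#illustrative sketch on a made-up tweet text in the standard format
example = pd.Series(["This is Bella. She hopes her smile made you smile. 13/10 would pet"])
example.str.extract(r'^This is (?P<name>\w+)\..*?(?P<numerator>\d+)/(?P<denominator>\d+)')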
#checking the ratings numerator and denominator - how many unique values?
df_archive.rating_numerator.nunique(), df_archive.rating_denominator.nunique()
#frequency of different values of the numerator
df_archive.rating_numerator.value_counts()
#checking tweet text where the numerator is 15 or more
for tweet in df_archive[df_archive.rating_numerator>14].text.sample(10):
    print(tweet)
#checking tweet text where the numerator is less than 10
for tweet in df_archive[df_archive.rating_numerator<10].text.sample(10):
    print(tweet)
Most values of rating_numerator are between 10 and 14. Values outside of that range are often from tweets that do not follow this account's standard dog rating format.
Based on a sample, values well above 14 are often for pics with several dogs, and can be taken to be a multiple of an actual rating.
Based on a sample, values below 10 are often not pictures of a dog but of other animals, but can also be of somewhat ugly dogs.
#checking the values of the ratings denominator
df_archive.rating_denominator.value_counts()
for text in df_archive.query("rating_denominator != 10").text:
    if "/10" in text:
        print(text)
for text in df_archive.query("rating_denominator != 10").text:
    if "/10" not in text:
        print(text)
The rating_denominator is almost always 10.
In cases where it isn't 10, but the tweet text contains a fraction out of 10, the rating was extracted incorrectly due to the presence of another fraction in the text. This is an inaccuracy that can be fixed since we still have the tweet text available. In these cases the numerator is also wrong and can be cleaned.
In cases where it isn't 10 and the tweet text does not contain a fraction out of 10 but a multiple of 10, the rating is a deviation from the normal dog rating format, applied to multiple dogs - again this can be fixed, for both the numerator and denominator, as illustrated below.
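As a worked example of the second correction (hypothetical numbers): a group rating such as 44/40 implies 40/10 = 4 dogs, so dividing both parts by 4 recovers a standard per-dog rating of 11/10.
#worked example of scaling a multi-dog rating back to the standard format
numerator, denominator = 44, 40
n_dogs = denominator // 10
numerator / n_dogs, denominator // n_dogs #-> (11.0, 10)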
#Checking the name variable
df_archive.name.value_counts()
#checking all words in the names column that are not capitalised
df_archive[df_archive.name.str.islower()].name.value_counts()
#for tweet_text in df_archive[df_archive.name == 'a'].text.sample(5):
for tweet_text in df_archive[df_archive.name.str.islower()].text.sample(10):
    print(tweet_text)
It appears that the name was extracted using a rather simplistic method: looking for the construction "This is ..." and assuming the next word would be a dog's name. The incorrect cases are easily identified by the absence of a capitalised name.
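A toy illustration of this failure mode (made-up tweet texts): taking the word after "This is" works when a real name follows, but captures an article like "a" otherwise.
#toy sketch of the extraction failure mode described above
pd.Series(["This is Charlie. 12/10 good boy",
           "This is a very rare dog. 10/10"]).str.extract(r'This is (\w+)')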
#create a mask to select records where the name is either None or lower-case
mask = (df_archive.name == 'None') | (df_archive.name.str.islower())
#where there is no correct name, check for names after a different construction:
for tweet_text in df_archive.loc[mask].text.sample(100):
    if ("named" in tweet_text) | ("name is" in tweet_text):
        print(tweet_text)
Some records with a misattributed lower-case name or with "None" do still contain a dog's name - these can be extracted from the original tweet.
#checking the 4 last columns
df_archive.iloc[:,13:].head()
df_archive.doggo.value_counts()
df_archive.floofer.value_counts()
df_archive.pupper.value_counts()
df_archive.puppo.value_counts()
df_archive[(df_archive.doggo!="None") & (df_archive.pupper!="None")].iloc[:,13:]
The four last columns represent dog stage. Usually they have at most one dog stage, but sometimes there is more than one. The structure of these columns is not consistent with the principles of data tidiness: the column names are values - there should be a single variable instead.
df_archive.source.value_counts()
df_archive.expanded_urls.value_counts()
The usefulness of some columns is not clear, e.g. the source column and expanded_urls
df_predictions
df_predictions.info()
Data types are fine and there are no apparent missing fields; however, the column names are not very descriptive.
df_predictions.img_num.nunique()
df_predictions.img_num.value_counts()
df_predictions[df_predictions.img_num>1].sample(5)
sum(df_predictions.tweet_id.duplicated())
#check duplicated images
sum(df_predictions.jpg_url.duplicated()), sum(df_predictions.jpg_url.duplicated(keep=False))
df_predictions[df_predictions.jpg_url.duplicated(keep=False)].sort_values(by=['jpg_url']).head(10)
df_predictions[(df_predictions.p1_dog==False) & (df_predictions.p2_dog==False) & (df_predictions.p3_dog==False)].sample(10)
len(df_predictions[(df_predictions.p1_dog==False) & (df_predictions.p2_dog==False) & (df_predictions.p3_dog==False)])
For 324 records, no dog has been predicted. Looking at the pictures in question, this is often accurate, but at other times they are pictures in which dogs are very hard to predict because they occupy only a small proportion of the photo, or are somehow disguised, e.g. wearing glasses, wigs etc.
#for how many is confidence of prediction 1 higher than 2, and 2 higher than 3?
sum(df_predictions.p1_conf > df_predictions.p2_conf), sum(df_predictions.p2_conf > df_predictions.p3_conf)
The predictions are ranked according to their confidence.
The number of rows differs between the three dataframes:
df_archive.shape, df_additional.shape, df_predictions.shape
#df_additional was constructed using tweet_ids from df_archive, yet it has fewer rows
#sample of tweets missing from df_additional, using isin() and tilde for boolean indexing
df_archive[~df_archive.tweet_id.isin(df_additional.tweet_id)].sample(5) #https://stackoverflow.com/a/19960116
None of these tweets is accessible anymore - they appear to have been deleted: https://twitter.com/dog_rates/status/802247111496568832, https://twitter.com/dog_rates/status/861769973181624320, https://twitter.com/dog_rates/status/837012587749474308, https://twitter.com/dog_rates/status/842892208864923648, https://twitter.com/dog_rates/status/812747805718642688
#make a pd.Series of the deleted tweets and check if it explains the difference in number of records
deleted_tweets=df_archive[~df_archive.tweet_id.isin(df_additional.tweet_id)].tweet_id
deleted_tweets.size+df_additional.shape[0]==df_archive.shape[0]
#are the deleted tweets in the image prediction?
df_predictions[df_predictions.tweet_id.isin(deleted_tweets)]
#number of deleted tweets in image predictions, and total number of deleted tweets
df_predictions[df_predictions.tweet_id.isin(deleted_tweets)].shape[0],deleted_tweets.size
9 of the 17 deleted tweets are also in df_predictions.
What about the difference in record numbers between df_predictions and the other two dataframes?
df_predictions[~df_predictions.tweet_id.isin(df_archive.tweet_id)].shape[0]
len(df_archive[~df_archive.tweet_id.isin(df_predictions.tweet_id)])
df_additional[~df_additional.tweet_id.isin(df_predictions.tweet_id)][['tweet_id','is_retweet','is_quote','is_reply']].head()
To sum up:
Why is there no prediction for 281 tweets?
#calculate the tweet IDs that are missing for which there is also no expanded_url - i.e. no image
missing = (df_archive.expanded_urls.isnull()) & (~df_archive.tweet_id.isin(df_predictions.tweet_id))
sum(missing)
#list of Tweet IDs not present in predictions
missing_predictions = df_archive[~df_archive.tweet_id.isin(df_predictions.tweet_id)].tweet_id.tolist()
#IDs of tweets that are either a retweet, quote, or reply
not_original_tweet = df_additional[df_additional.is_retweet | df_additional.is_quote | df_additional.is_reply].tweet_id.tolist()
#how many IDs are a retweet, quote, or reply, i.e. in the intersection of missing_predictions and not_original_tweet
len(set(missing_predictions).intersection(set(not_original_tweet)))
Many tweets for which no image prediction is available either did not have a photo attached, or they are a reply/retweet/quote tweet - such tweets tend not to follow the standard format of the WeRateDogs dog rating tweets.
Data quality is assessed along four dimensions: completeness, validity, accuracy and consistency.
Data quality
df_additional (tweepy):
df_archive:
image predictions:
Data tidiness
##make copies of all dataframes to be cleaned
df_archive_copy = df_archive.copy()
df_predictions_copy = df_predictions.copy()
df_additional_copy = df_additional.copy()
17 rows in df_archive are missing from df_additional because the corresponding tweets have since been deleted, and 9 of these are also present in df_predictions. There are several ways of dealing with this.
One possibility is to reconstruct records for the missing tweets from df_archive and add those to df_additional. However, the main purpose of obtaining additional data via the Twitter API (in df_additional) is to complement df_archive, and that additional data is unrecoverable for the deleted tweets. Nothing would be gained with this approach.
Therefore a better choice would be to delete the rows representing deleted tweets in df_archive and df_predictions.
Define:
Code:
df_archive_copy = df_archive_copy[df_archive_copy.tweet_id.isin(df_additional.tweet_id)]
df_predictions_copy = df_predictions_copy[df_predictions_copy.tweet_id.isin(df_additional.tweet_id)]
Test:
#check nb of rows has decreased
df_archive.shape, df_archive_copy.shape, df_archive.shape > df_archive_copy.shape
df_predictions.shape, df_predictions_copy.shape, df_predictions.shape > df_predictions_copy.shape
#check row number of cleaned df_archive is the same as df_additional
df_archive_copy.shape[0] == df_additional.shape[0]
#make dataframe of tweets absent from df_additional
deleted_tweets = df_archive[~df_archive.tweet_id.isin(df_additional.tweet_id)]
#check all the tweet_ids in the cleaned dataframes are absent from the deleted tweets dataframe
assert sum(df_archive_copy.tweet_id.isin(deleted_tweets.tweet_id)) == 0
assert sum(df_predictions_copy.tweet_id.isin(deleted_tweets.tweet_id)) == 0
#compute the number of Tweet IDs for which there is no image prediction after removing missing tweets
df_additional_copy.shape[0] - sum(df_additional_copy.tweet_id.isin(df_predictions_copy.tweet_id))
Define
Four columns represent "dog stage" i.e. the column names are values that represent a single variable of dog-stage. However, there are more than 4 different values which this variable can take, as many have no dog stage associated, and in some cases there are several dogs in the picture. This makes the task more complex.
Code
#check head of 4 last columns ie. dog stage
df_archive_copy.iloc[:1,13:]
#create columns with boolean values for each dog stage
df_archive_copy['doggo_b'] = df_archive_copy.iloc[:,13]!="None"
df_archive_copy['floofer_b'] = df_archive_copy.iloc[:,14]!="None"
df_archive_copy['pupper_b'] = df_archive_copy.iloc[:,15]!="None"
df_archive_copy['puppo_b'] = df_archive_copy.iloc[:,16]!="None"
#calculate sum of boolean values for dog stage
#new column - will be 0 when there is none, and 2,3 or 4 if there is more than 1 dog stage value
df_archive_copy['several'] = df_archive_copy['doggo_b'].astype(int) + df_archive_copy['floofer_b'].astype(int) + df_archive_copy['pupper_b'].astype(int) + df_archive_copy['puppo_b'].astype(int)
#new column with boolean to say when there has been no dog stage defined
df_archive_copy['none'] = df_archive_copy['several']==0
#convert column to boolean - True = there are several stages
df_archive_copy['several'] = df_archive_copy['several']>1
#Look at records with several dog stages
mask=df_archive_copy.several==True
columns=['doggo', 'floofer', 'pupper', 'puppo']
df_archive_copy.loc[mask,columns]
#need to ensure the final dog stage variable can only take one value
#where there are several dog stages, overwrite values in the original columns with "None"
df_archive_copy.loc[mask,columns] = df_archive_copy.loc[mask,columns].replace(to_replace=['doggo','floofer','pupper','puppo'],
value='None')
#check
df_archive_copy.loc[mask,columns]
#Now change boolean values in the 'several' and 'none' columns to strings
#so that they can be used by merge in the same way as the other columns
df_archive_copy.loc[:,'several'].replace(True,"several dogs", inplace=True)
df_archive_copy.loc[:,'none'].replace(True,"no stage", inplace=True)
df_archive_copy.loc[:,'several'].replace(False,"None", inplace=True)
df_archive_copy.loc[:,'none'].replace(False,"None", inplace=True)
#drop the columns with booleans no longer needed
df_archive_copy.drop([ 'doggo_b', 'floofer_b', 'pupper_b', 'puppo_b'],axis=1, inplace=True)
#check values are either None or dog stage in single column
df_archive_copy.iloc[:15,13:]
#use melt to create temporary dataframe column for dog stage variable
df_dog_stages = pd.melt(df_archive_copy, id_vars='tweet_id',
value_vars=['doggo','floofer','pupper','puppo','several','none'],
var_name = 'stages', value_name = 'dog_stage')
#check
df_dog_stages.head(15)
#there should be duplication due to 5 "None" values for each record
df_dog_stages.dog_stage.value_counts()
#delete rows with "None", drop redundant column
df_dog_stages=df_dog_stages[df_dog_stages.dog_stage!="None"]
df_dog_stages.drop('stages', axis=1, inplace=True)
#convert to category
df_dog_stages.dog_stage = df_dog_stages.dog_stage.astype('category')
#check 'None' is removed
df_dog_stages.dog_stage.value_counts()
#check df_dog_stages and df_archive_copy have same length,
df_dog_stages.shape[0], df_archive_copy.shape[0], df_dog_stages.shape[0] == df_archive_copy.shape[0]
df_dog_stages.head()
#merge into df_archive_copy, using tweet_id, removing now redundant old dog stage columns
df_archive_copy = pd.merge(df_archive_copy.iloc[:,:13],df_dog_stages,on='tweet_id', how='left')
Test:
#check column structure
list(df_archive_copy)
#dtype should be categorical
df_archive_copy.info()
#check values
df_archive_copy.dog_stage.value_counts()
Define
df_additional has several columns with boolean values stating if a tweet is a retweet, reply or quote. These can be better expressed as a single variable of different exclusive categories of tweet: Original tweet, retweet of own tweet (SelfRT), retweet of tweet from another source (OtherRT), reply, or quote tweet.
Code
#using boolean indexing for self-retweets
df_additional_copy[(df_additional_copy.retweet_user.notnull()) & (df_additional_copy.retweet_user == df_additional_copy.user_name)].head()
#new column with boolean to state if a tweet is a self-retweet
df_additional_copy['is_self_retweet'] = (df_additional_copy.retweet_user.notnull()) & (df_additional_copy.retweet_user == df_additional_copy.user_name)
#using boolean indexing for retweets from other source
df_additional_copy[(df_additional_copy.retweet_user.notnull()) & (df_additional_copy.retweet_user != df_additional_copy.user_name)].head()
#new column with boolean to state if a tweet is a retweet from other source
df_additional_copy['is_other_retweet'] = (df_additional_copy.retweet_user.notnull()) & (df_additional_copy.retweet_user != df_additional_copy.user_name)
#new column to state if tweet is original tweet, i.e. those where is_retweet, is_quote, is_reply are all False
df_additional_copy['is_original'] = (df_additional_copy.is_retweet | df_additional_copy.is_quote | df_additional_copy.is_reply) == False
#each tweet should belong to one of the following categories: 'is_quote', 'is_reply', 'is_self_retweet',
#'is_other_retweet', 'is_original'
#therefore their sum should be the same as the number of rows.
df_additional_copy.shape[0] == sum(df_additional_copy.is_quote) + sum(df_additional_copy.is_reply) + sum(df_additional_copy.is_self_retweet) + sum(df_additional_copy.is_other_retweet) + sum(df_additional_copy.is_original)
#Replace True with the corresponding string
df_additional_copy.is_quote = df_additional_copy.is_quote.replace(True,"Quote")
df_additional_copy.is_reply = df_additional_copy.is_reply.replace(True,"Reply")
df_additional_copy.is_self_retweet = df_additional_copy.is_self_retweet.replace(True,"SelfRT")
df_additional_copy.is_other_retweet = df_additional_copy.is_other_retweet.replace(True,"OtherRT")
df_additional_copy.is_original = df_additional_copy.is_original.replace(True,"Original")
#Replace bool False with str
df_additional_copy[['is_quote', 'is_reply', 'is_self_retweet', 'is_other_retweet', 'is_original']] = df_additional_copy[['is_quote', 'is_reply', 'is_self_retweet', 'is_other_retweet', 'is_original']].replace(False,"False")
#use melt, also dropping variables no longer needed by leaving them out from id_vars
df_additional_copy = pd.melt(df_additional_copy, id_vars=['tweet_id', 'retweet_count', 'favorite_count'],#['tweet_id', 'user_name', 'retweet_count', 'favorite_count', 'is_retweet', 'retweet_user', 'quote_user'],
value_vars=[ 'is_quote', 'is_reply', 'is_self_retweet', 'is_other_retweet', 'is_original'],
var_name = 'is_type', value_name = 'tweet_type')
#check result
df_additional_copy.head()
#drop rows with False tweet_type
df_additional_copy = df_additional_copy[df_additional_copy.tweet_type!='False']
#convert to category
df_additional_copy.tweet_type = df_additional_copy.tweet_type.astype('category')
#drop unneeded column
df_additional_copy.drop('is_type', axis=1, inplace=True)
Test
#check cleaned dataframe structure
df_additional_copy.head()
#check data type
df_additional_copy.info()
#check there are no False values left
df_additional_copy.tweet_type.value_counts()
#check no records have been lost: the row count matches and all tweet_ids of the cleaned df are in the original df
(df_additional_copy.shape[0] == df_additional.shape[0]) and df_additional_copy.tweet_id.isin(df_additional.tweet_id).all()
According to the principles of data tidiness, each table should correspond to a single observational unit. The current structure of our data contradicts this principle, because it is a relatively heterogeneous data set, relating both to the tweets themselves (the archive plus the additional data from the API) and to the image predictions derived from them.
There are different ways of restructuring the data to better follow tidiness principles:
One possibility is to merge the three dataframes on the basis of the unique tweet IDs. In this case the observational unit would be the data associated with each tweet. However, there is no image prediction for 273 tweets, so these records would need to be removed, even though we may be interested in analysing them.
A second possibility is to re-structure the data into two dataframes, on the basis of two observational units : (1) tweets from the WeRateDogs Twitter user, including data provided by Twitter ; (2) dog predictions derived from the content of WeRateDogs tweets, in particular, predictions from images posted by the WeRateDogs user, as well as ratings, dog names and dog stages extracted from tweet text.
Only the first option is deemed correct by the project reviewer, although it implies losing the data on tweets for which there are no image predictions, which will then not be available for later analysis.
Define
Code:
#first merge step
df_archive_copy = pd.merge(df_archive_copy,df_additional_copy,on='tweet_id')
#before moving to second step, check the number of rows is unchanged
df_archive_copy.shape[0] == df_additional_copy.shape[0]
#second merge step - only keep rows that are in df_predictions_copy
df_archive_copy = pd.merge(df_archive_copy, df_predictions_copy,on='tweet_id', how='right')
Test:
#check the number of rows is now the same as df_predictions_copy
df_archive_copy.shape[0] == df_predictions_copy.shape[0]
#check all the column names of df_additional_copy and df_predictions_copy are still in df_archive_copy
assert set(df_additional_copy).intersection(set(df_archive_copy)) == set(df_additional_copy)
assert set(df_predictions_copy).intersection(set(df_archive_copy)) == set(df_predictions_copy)
#check the new dataframe structure
df_archive_copy.info()
#find out number of rows, and how many were lost
df_archive_copy.shape[0], df_additional_copy.shape[0] - df_archive_copy.shape[0]
#which types of tweets were deleted?
df_additional_copy[~df_additional_copy.tweet_id.isin(df_archive_copy.tweet_id)].tweet_type.value_counts()
#which types of tweets remain?
df_archive_copy.tweet_type.value_counts()
Define:
Two columns with timestamps - of when a tweet was posted, or when a retweet was originally posted - were read in as strings.
Code:
df_archive_copy.timestamp = pd.to_datetime(df_archive_copy.timestamp)
df_archive_copy.retweeted_status_timestamp = pd.to_datetime(df_archive_copy.retweeted_status_timestamp)
Test
df_archive_copy.info()
df_archive_copy[['timestamp','retweeted_status_timestamp']].sample(10)
Note that null values are handled correctly.
Define
Twitter uses unique numeric IDs for tweets and users. read_csv assigned inconsistent data types: float where values are missing, and integer for the others. Floats are displayed in scientific notation, which obscures their only utility of being unique identifiers.
To achieve a consistent data type, all of these columns can be converted to strings.
Code
#check current data types and column names
df_archive_copy.iloc[:,[0,1,2,6,7]].info()
#convert to strings, replace missing with "none"
df_archive_copy[['tweet_id','in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id',
'retweeted_status_user_id']] = df_archive_copy[['tweet_id','in_reply_to_status_id','in_reply_to_user_id',
'retweeted_status_id',
'retweeted_status_user_id']].fillna(0).astype(int).astype(str).replace("0","none")
Test
df_archive_copy.iloc[:,[0,1,2,6,7]].info()
df_archive_copy[['tweet_id','in_reply_to_status_id', 'in_reply_to_user_id',
'retweeted_status_id', 'retweeted_status_user_id']].sample(10)
Define:
Some dog names have not been extracted properly. Some are not names at all - identifiable by being lower-case - and some names may have been missed in fields with the value "None".
Code:
#create a mask to select records where the name is either None or lower-case
mask = (df_archive_copy.name == 'None') | (df_archive_copy.name.str.islower())
sum(mask)
df_archive_copy[mask].name.value_counts()
#re-extract names as the first word that follows the construction "named" or "name is", and fill null values with "None"
df_archive_copy.loc[mask,'name'] = df_archive_copy[mask].text.str.extract(r'.*named?\s?i?s?\s(?P<name>\w+).*.*', expand=True).fillna("None")
#check for lower-case "names"
df_archive_copy[df_archive_copy.name.str.islower()].name
#replace with "None"
mask_lower=df_archive_copy.name.str.islower()
df_archive_copy.loc[mask_lower,"name"] = "None"
Test
#check name values that were previously None or wrong
df_archive_copy[mask].name.value_counts()
#check if there are any lower-case names in the dataframe
sum(df_archive_copy.name.str.islower())
Define:
The rating_denominator is almost always 10 - where it isn't, it is indicative of either a mistake while extracting the rating (inaccurate), or of a non-standard rating being used (invalid). In both cases the numerator is also affected.
Code:
#define mask
mask = df_archive_copy.rating_denominator != 10
#create temporary dataframe
wrong_ratings = df_archive_copy.loc[mask,['rating_numerator','rating_denominator','text']]
wrong_ratings
#extract ratings that follow the pattern numerator/10
wrong_ratings[['corr_numerator','corr_denominator']] = wrong_ratings.text.str.extract(r'.*\s(?P<corr_numerator>\d+)\/(?P<corr_denominator>10).*.*', expand=True)
wrong_ratings
#fix ratings that are a multiple of 10 - assume multiple is the number of dogs, redefine numerator and denominator
#sampling the pics of corrected records for the presence of multiple dogs
for i, denominator in zip(wrong_ratings.index, wrong_ratings.rating_denominator):
    if (denominator % 10 == 0) and (denominator != 0):
        n_dogs = denominator / 10
        wrong_ratings.loc[i,'corr_numerator'] = wrong_ratings.loc[i,'rating_numerator'] / n_dogs
        wrong_ratings.loc[i,'corr_denominator'] = wrong_ratings.loc[i,'rating_denominator'] / n_dogs
        print("index: ", i)
        print("Tweet text:", wrong_ratings.loc[i,'text']) #check some of the URLs
        print("nb of dogs: {}, corrected numerator : {}, corrected denominator: {}".format(n_dogs, wrong_ratings.loc[i,'corr_numerator'], wrong_ratings.loc[i,'corr_denominator']))
wrong_ratings
#check the text of remaining tweets for ratings, make list of tweet_ids to delete
to_delete = []
for i in wrong_ratings[wrong_ratings.corr_denominator.isnull()].index:
    print("index: ", i)
    print("Tweet text:", wrong_ratings.loc[i,'text'])
    to_delete.append(df_archive_copy.loc[i,'tweet_id']) #get tweet_id from df_archive_copy as the index is the same
print(to_delete)
#check tweets in df_archive_copy for which no rating can be extracted
df_archive_copy[df_archive_copy.tweet_id.isin(to_delete)]
wrong_ratings.info()
The corrected numerator and denominator are strings. Since integer columns can't hold NaN, replace NaN values with 0, convert to integers, insert them into the original dataframe, then delete the records without an extractable rating.
wrong_ratings[['corr_numerator', 'corr_denominator']] = wrong_ratings[['corr_numerator', 'corr_denominator']].fillna(0).astype(int)
wrong_ratings.info()
#replace re-extracted correct ratings in df_archive_copy
df_archive_copy.loc[mask,'rating_numerator'] = wrong_ratings.corr_numerator
df_archive_copy.loc[mask,'rating_denominator'] = wrong_ratings.corr_denominator
#delete rows relating to tweets without an extractable rating
df_archive_copy = df_archive_copy[~df_archive_copy.tweet_id.isin(to_delete)]
Test
df_archive_copy.rating_denominator.value_counts()
sum(df_archive_copy.rating_denominator.isnull())
df_archive_copy[['rating_numerator','rating_denominator']].info()
Define:
All rating numerators were extracted as integers, but some ratings in the tweet text are actually decimals, so the extracted integer is inaccurate for these.
Code:
#extract decimal ratings into new column
df_archive_copy['decimal_numerator'] = df_archive_copy.text.str.extract(r'.*\s(\d+\.\d+)\/10.*.*', expand=False).astype(float)
#extraction results
df_archive_copy[df_archive_copy.decimal_numerator.notnull()]
#define mask of extracted decimal ratings:
mask = df_archive_copy.decimal_numerator.notnull()
df_archive_copy.loc[mask,['tweet_id','text','decimal_numerator']]
#print text to check extraction
for t in df_archive_copy[mask].text:
    print(t)
#confirm original extraction was inaccurate
df_archive_copy.loc[mask,['rating_numerator','decimal_numerator']]
#convert rating_numerator to floats
df_archive_copy['rating_numerator'] = df_archive_copy['rating_numerator'].astype(float)
#replace original numerators
df_archive_copy.loc[mask,'rating_numerator'] = df_archive_copy.loc[mask,'decimal_numerator']
#drop unnecessary decimal_numerator column
df_archive_copy.drop('decimal_numerator',axis=1, inplace=True)
Test:
#check for presence of decimal numerators
df_archive_copy.rating_numerator.value_counts()
df_archive_copy.loc[mask,'rating_numerator']
#check correct dtype and unnecessary column has been dropped
df_archive_copy.info()
Define
Replace the column names with more descriptive names :
Code
#img_num is the number of the image that corresponded to the most
#confident prediction (numbered 1 to 4, since tweets can have up to four images)
df_archive_copy = df_archive_copy.rename (columns={'jpg_url':'image_used_url',
'img_num':'image_used_num',
'p1': 'prediction_p1',
'p1_conf': 'confidence_p1',
'p1_dog': 'is_dog_p1',
'p2': 'prediction_p2',
'p2_conf': 'confidence_p2',
'p2_dog': 'is_dog_p2',
'p3': 'prediction_p3',
'p3_conf': 'confidence_p3',
'p3_dog': 'is_dog_p3'})
Test
list(df_archive_copy)
Define:
Remove underscores, capitalise all dog breed names
Code:
df_archive_copy.prediction_p1 = df_archive_copy.prediction_p1.str.replace("_", " ")
df_archive_copy.prediction_p2 = df_archive_copy.prediction_p2.str.replace("_", " ")
df_archive_copy.prediction_p3 = df_archive_copy.prediction_p3.str.replace("_", " ")
#capitalise the dog breed names only, using is_dog_px for boolean indexing
df_archive_copy.loc[df_archive_copy.is_dog_p1,'prediction_p1'] = df_archive_copy[df_archive_copy.is_dog_p1].prediction_p1.str.title()
df_archive_copy.loc[df_archive_copy.is_dog_p2,'prediction_p2'] = df_archive_copy[df_archive_copy.is_dog_p2].prediction_p2.str.title()
df_archive_copy.loc[df_archive_copy.is_dog_p3,'prediction_p3'] = df_archive_copy[df_archive_copy.is_dog_p3].prediction_p3.str.title()
Test
df_archive_copy[df_archive_copy.is_dog_p1][['prediction_p1','prediction_p2','prediction_p3']].sample(5)
Define:
The df_archive_copy dataframe contains 3 columns for each of the 3 predictions: one for the prediction name, one for whether it is a dog, and one for the confidence of the prediction. For the present analysis, this level of detail about the machine learning process is unnecessary, as we are only interested in the most confident prediction that is of a dog.
Code:
#Get dog prediction from best prediction of a dog
mask = df_archive_copy.is_dog_p1
df_archive_copy.loc[mask,'top_prediction'] = df_archive_copy.loc[mask,'prediction_p1']
mask = (df_archive_copy.is_dog_p1==False) & df_archive_copy.is_dog_p2
df_archive_copy.loc[mask,'top_prediction'] = df_archive_copy.loc[mask,'prediction_p2']
mask = (df_archive_copy.is_dog_p1==False) & (df_archive_copy.is_dog_p2==False) & df_archive_copy.is_dog_p3
df_archive_copy.loc[mask,'top_prediction'] = df_archive_copy.loc[mask,'prediction_p3']
#fill in rows for which there is no predicted dog with "No dog predicted"
df_archive_copy.top_prediction = df_archive_copy.top_prediction.fillna("None")
#remove redundant columns
df_archive_copy.drop(['prediction_p1', 'confidence_p1', 'is_dog_p1', 'prediction_p2', 'confidence_p2',
'is_dog_p2', 'prediction_p3', 'confidence_p3', 'is_dog_p3'],axis=1, inplace=True)
Test:
#check remaining columns
list(df_archive_copy)
Define:
There are also a number of duplicated dog predictions: the same image was used, with an identical prediction, though the unique tweet_id is different. These may be the result of WeRateDogs retweeting their own tweets. The project description suggests that retweets should be removed.
Code:
#variable to record self-retweeting
df_archive_copy['was_retweeted'] = df_archive_copy.tweet_id.isin(df_archive_copy.retweeted_status_id)
#check how many duplicates are self-retweets
df_archive_copy[df_archive_copy.image_used_url.duplicated(keep=False)].tweet_type.value_counts()
df_archive_copy.tweet_type.value_counts()
#remove retweets
df_archive_copy = df_archive_copy[~df_archive_copy.tweet_type.isin(['SelfRT','OtherRT'])]
#remove unnecessary columns related to retweets and replies
df_archive_copy.drop(['retweeted_status_id', 'retweeted_status_user_id',
'retweeted_status_timestamp', 'in_reply_to_status_id', 'in_reply_to_user_id'],axis=1, inplace=True)
Test:
#check new variable
df_archive_copy['was_retweeted'].value_counts()
list(df_archive_copy)
#ensure no duplicated images remain
sum(df_archive_copy.image_used_url.duplicated())
#check tweet_types remaining
df_archive_copy.tweet_type.value_counts()
Given the relatively small size of the data set, the most straightforward way to store the cleaned data is as a csv file:
df_archive_copy.to_csv('twitter_archive_master.csv', index=False)
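Note that a plain csv does not preserve pandas dtypes. Below is a sketch of how the master file could be read back in with the datetime and category columns restored (column names as produced by the cleaning above); it is not needed for the rest of this notebook.
#sketch: re-load the master file and restore the dtypes lost in the csv round-trip
df_master = pd.read_csv('twitter_archive_master.csv', parse_dates=['timestamp'])
for col in ['dog_stage', 'tweet_type']:
    df_master[col] = df_master[col].astype('category')
df_master.info()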
On Twitter, favourite and retweet counts are an important metric of audience engagement. Here I want to use them to analyse the popularity of tweets and of dogs. To get a better idea of how they behave, I will look at their distributions first.
#distribution of retweet and favourite counts
n_rt= df_archive_copy.retweet_count
n_fav= df_archive_copy.favorite_count
fig = plt.figure(figsize=(12,3))
ax1 = fig.add_subplot(121)
stats.probplot(n_rt, plot=plt) #https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html
ax1.set_title("Distribution of Retweet count")
ax2 = fig.add_subplot(122)
stats.probplot(n_fav, plot=plt)
ax2.set_title("Distribution of Favourite count");
#calculate logarithm for non-zero counts
log_rt= np.log(df_archive_copy[df_archive_copy.retweet_count>0].retweet_count)
log_fav= np.log(df_archive_copy[df_archive_copy.favorite_count>0].favorite_count)
fig = plt.figure(figsize=(12,3))
ax3 = fig.add_subplot(121)
stats.probplot(log_rt, plot=plt)
ax3.set_title("Distribution of Retweet count Logarithm")
ax4 = fig.add_subplot(122)
stats.probplot(log_fav, plot=plt);
ax4.set_title("Distribution of Favourite count Logarithm");
The counts of favourites and of retweets are approximately log-normally distributed (though retweets deviate somewhat at very low counts).
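To complement the visual probability plots, a quick sketch of a formal normality test on the log-transformed counts (scipy's D'Agostino-Pearson test; with samples this large the test flags even small deviations from normality, so it is indicative only):
#sketch: D'Agostino-Pearson normality test on the log-transformed counts
print("log retweet count  :", stats.normaltest(log_rt))
print("log favourite count:", stats.normaltest(log_fav))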
Seaborn's jointplot function is a good way to look at the relationship between retweet and favourite count
#only consider non-zero values
x=(df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.retweet_count>0)].favorite_count)
y=(df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.retweet_count>0)].retweet_count)
corr_coeff= stats.pearsonr(x,y) #https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
ax=sns.jointplot(x,y,kind="reg")
ax.fig.text(x=0.3,y=0.7, s="Pearson: r={0:.2f}, p={1:.1f}".format(corr_coeff[0],corr_coeff[1]), ha="center", va="center")
#decimals https://stackoverflow.com/a/8940627
ax.fig.subplots_adjust(top=0.93) #title https://stackoverflow.com/a/29814281
ax.fig.suptitle('Correlation of favourites vs retweets', fontsize=14);
Plot the log values of retweet count and of favourite count, to reduce overplotting and get a better view of the distribution of the counts.
x=np.log(df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.retweet_count>0)].favorite_count)
y=np.log(df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.retweet_count>0)].retweet_count)
corr_coeff= stats.pearsonr(x,y)
ax=sns.jointplot(x,y,kind="reg")
ax.fig.text(x=0.3,y=0.7, s="Pearson: r={0:.2f}, p={1:.1f}".format(corr_coeff[0],corr_coeff[1]), ha="center", va="center")
ax.fig.subplots_adjust(top=0.93)
ax.fig.suptitle('Favourites vs retweets (log)', fontsize=14);
There is a very strong correlation (r=0.93) between retweet count and favourite count. This is easily understood, as both retweets and favourites are expressions of audience engagement, the former being a stronger form of engagement. There is also a mutual reinforcement mechanism at play: tweets that are retweeted are shown to a wider audience, increasing the pool of users who may favourite the tweet; and popular tweets, as expressed by a higher favourite count, are given higher prominence in the Twitter timeline. This feedback mechanism is likely the basis for the log-normal distribution observed.
With nearly 8 million followers, @dog_rates (WeRateDogs) is a very successful Twitter account. How did it get there?
The plot below shows the evolution of audience engagement over the first 20 months. Only original tweets are considered - i.e. replies, quote-tweets and retweets are left out, as audience engagement is fundamentally different with them.
The retweet and favourite counts are shown for each tweet as a point in the scatter plot, on a logarithmic y-axis, with lines showing a rolling median over a window of 50 tweets.
fig, ax = plt.subplots(figsize=(14,6))
x = df_archive_copy[df_archive_copy.tweet_type=='Original'].timestamp
y1 = df_archive_copy[df_archive_copy.tweet_type=='Original'].favorite_count
y1m = y1.rolling(50).median()
y2 = df_archive_copy[df_archive_copy.tweet_type=='Original'].retweet_count
y2m = y2.rolling(50).median()
ax.plot_date(x,y1, color='b', markersize = 1)
ax.plot_date(x,y2, color='y', markersize = 1)
ax.plot(x,y1m, color='k')
ax.plot(x,y2m,color='r')
plt.yscale("log")
ax.set_title("Popularity of @dog_rates (WeRateDogs) over 2 years")
ax.set_ylabel("retweet / favourite count (log axis)")
ax.set_xlabel("date")
ax.legend(['Favourites','Retweets','Favourites - Rolling median','Retweets - Rolling median']);
During the first month or two, there is a rapid increase in popularity - from tens to thousands of retweets. Thereafter, the slope of the curve becomes much less steep but increases steadily until the end of the period studied. The approximately linear aspect of the rolling medians expresses an exponential increase, since the y-axis is logarithmic. The distance between the lines for favourites and retweets is relatively consistent but widens slowly.
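To put an approximate number on that growth, the sketch below (reusing the x and y1 series defined for the plot above) fits log favourite count against time in days; the slope of such a fit corresponds to an average daily growth rate.
#sketch: quantify the roughly exponential growth of favourite counts over time
days = (x - x.min()).dt.total_seconds() / 86400 #days since the first tweet in the sample
valid = y1 > 0 #log requires positive counts
slope, intercept, r, p, se = stats.linregress(days[valid], np.log(y1[valid]))
print("average favourite-count growth: {:.1f}% per day (r={:.2f})".format(100 * (np.exp(slope) - 1), r))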
WeRateDogs commonly states the name of the dog in the picture, as well as the "dog stage". Does this make the tweets more popular?
#create new variable stating if name and dog stage are used in text
df_archive_copy['has_name'] = df_archive_copy.name != "None"
df_archive_copy['has_stage_defined']=df_archive_copy.dog_stage != "no stage"
data = df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.tweet_type=='Original')]
#correlation https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pointbiserialr.html
corr_name = stats.pointbiserialr(data.has_name,data.favorite_count)
corr_stage = stats.pointbiserialr(data.has_stage_defined,data.favorite_count)
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(12,8))
sns.boxplot(y='favorite_count',x='has_name',data= data,ax=ax1).set_title('Favourite count: without/with named dogs')
fig.get_axes()[0].set_yscale('log')
sns.boxplot(y='favorite_count',x='has_stage_defined',data= data,ax=ax2).set_title('Favourite count: without/with dog stage')
fig.get_axes()[1].set_yscale('log')
ax1.text(x=0.6,y=65, s="Point-biserial: r={0:.2f}, p={1:.2f}".format(corr_name[0],corr_name[1]), ha="center", va="center")
ax2.text(x=0.6,y=65, s="Point-biserial: r={0:.2f}, p={1:.2f}".format(corr_stage[0],corr_stage[1]), ha="center", va="center");
Tweets that include a named dog or mention a dog stage are more popular - however, the difference is marginal. The correlation coefficient and p-value for named dogs (r=0.02, p=0.28) suggest the difference is not significant, and the correlation for tweets with a dog stage defined is only slightly stronger (r=0.08, p=0).
Is there a discernible effect of retweeting, and of the type of tweet?
data = df_archive_copy[df_archive_copy.favorite_count>0]
corr_coeff = stats.pointbiserialr(data.was_retweeted,data.favorite_count)
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(12,8))
sns.boxplot(y='favorite_count',x='was_retweeted',data= data,ax=ax1).set_title('Favourite count: retweets')
fig.get_axes()[0].set_yscale('log')
ax1.text(x=0.6,y=60, s="Point-biserial: r={0:.2f}, p={1:.2f}".format(corr_coeff[0],corr_coeff[1]), ha="center", va="center")
sns.boxplot(y='favorite_count',x='tweet_type',data= data,ax=ax2).set_title('Favourite count: Tweet type')
fig.get_axes()[1].set_yscale('log')
The box plot shows that WeRateDogs tweets which are retweeted by the same account have a higher favourite count than those that are not. Thus retweeting is plausibly a good strategy for increasing a tweet's popularity. However, the analysis here can't establish causality, because it is certainly possible that popular tweets are more likely to be chosen for retweeting.
As expected, replies are less popular than original WeRateDogs tweets, but quote tweets are, surprisingly, more popular than the standard original WeRateDogs tweets.
WeRateDogs uses an idiosyncratic rating system for dogs, in which ratings of usually between 10 and 13 out of 10 are used. Lower ratings do exist, as does 14, but ratings above 14 are outside the standard.
Do these ratings express anything meaningful about the dog(s) in question? Below is a regression plot of the rating (numerator) against the dog's popularity (log of favourite count). Only ratings below 15/10 are considered.
data = df_archive_copy[df_archive_copy.rating_numerator<15]
x = data.rating_numerator
y = np.log(data.favorite_count)
pearson = stats.pearsonr(x,y)
spearman = stats.spearmanr(x,y) #https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.spearmanr.html
ax=sns.jointplot(x,y,kind="reg")
ax.fig.text(x=0.17,y=0.77, s="Pearson: r={0:.2f}, p={1:.1f}".format(pearson[0],pearson[1]), ha="left", va="center")
ax.fig.text(x=0.17,y=0.74, s="Spearman: r={0:.2f}, p={1:.1f}".format(spearman[0],spearman[1]), ha="left", va="center")
ax.fig.subplots_adjust(top=0.93)
ax.fig.suptitle('Correlation of Dog rating vs popularity', fontsize=14);
The plot suggests that the rating system is meaningful and there is in fact a relationship between the rating and the popularity of a dog. The Pearson correlation coefficient of r=0.49 is lower than the Spearman correlation coefficient of r=0.60, since Pearson only measures linear relationships. Again, it is not possible to draw any conclusions regarding causality here - it may be that a higher rating biases the audience to favour a dog picture, or it may be that a high/low favourite count and a high/low dog rating are both independently caused by how great a dog picture is.
Are young dogs more popular? Are certain breeds more popular? The box plot below looks at this question.
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(12,8))
x= df_archive_copy.dog_stage
y= np.log(df_archive_copy.favorite_count)
sns.boxplot(x,y,data = df_archive_copy,ax=ax1).set_title('Dog stage and popularity')
top_dogs=df_archive_copy.top_prediction.value_counts().head(8).index.tolist()
data=df_archive_copy[df_archive_copy.top_prediction.isin(top_dogs)]
x= data.top_prediction #predicted breed on the x-axis, matching the plot title
y= np.log(data.favorite_count)
sns.boxplot(x,y,data = data,ax=ax2).set_title('Dog breed and popularity')
plt.xticks(rotation=45);
The most popular dog stages are "puppo", "doggo" and "floofer", whose median popularity is about one unit above that of "pupper" and of photos where no 'dog stage' is mentioned. It is perhaps surprising that "pupper" has the lowest median popularity - this suggests that it isn't simply the fact that a dog stage is mentioned in a tweet that contributes to its popularity.
Amongst the 7 most common dog breeds, Golden Retrievers tend to be the most popular, with Pugs and Chihuahuas at the lowest median popularity - these are roughly equal with pics that have no detected dog breed, which usually do not have a dog in the photo at all.
However, the effect of dog stage and dog breed on the popularity of dog pics is relatively small, considering that the ranges overlap.
Our cleaned data enable us to find out what the most common dog breeds are - below are the most frequently predicted breeds (skipping the most frequent value, "None", i.e. tweets with no dog predicted). This analysis assumes that the breed predictions are reasonably accurate and do not have any systematic bias.
df_archive_copy.top_prediction.value_counts()[1:20].plot.barh(figsize=(8,5))
plt.gca().invert_yaxis()
Similarly, our cleaned data enable us to find out what the most common dog names are (again skipping "None").
df_archive_copy.name.value_counts()[1:20].plot.barh(figsize=(8,5))
plt.gca().invert_yaxis();
Sources used
A number of sources were used for this project - links are provided as comments in the relevant code cells.