DAND Term 2, Project 3:

Wrangle and Analyse Data: WeRateDogs Twitter activity

WeRateDogs (https://twitter.com/dog_rates) is a Twitter account that rates people's dogs with a humorous comment about the dog. The goal of this project is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. This requires the prior gathering assessing and cleaning of the data.

In [1]:
#load Python libraries
import pandas as pd
import numpy as np
import requests
import tweepy
import json
import matplotlib.pyplot as plt 
import seaborn as sns
from scipy import stats
In [2]:
#set up matplotlib to work interactively
%matplotlib inline  
sns.set()           # switch on seaborn defaults

2 - Gathering

There different pieces of data are gathered for the project as described in the Project Details page :

WeRateDogs Twitter archive

from the course's project description :

The WeRateDogs Twitter archive. I am giving this file to you, so imagine it as a file on hand. Download this file manually by clicking the following link: twitter_archive_enhanced.csv

In [3]:
#this file was downloded and saved in the same folder as the Jupyter notebook file
# it can simply be read into a dataframe
df_archive = pd.read_csv('twitter-archive-enhanced.csv')
In [4]:
#initial check to see what the dataframe looks like
df_archive.head(3)
Out[4]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None

Tweet image predictions

from the course's project description :

The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

In [5]:
#load .tsv into working memory using a method from the requests library
df_predictions  = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')

#store in working direcory
with open('image-predictions.tsv',mode='wb') as file:
              file.write(df_predictions.content)
In [6]:
#read .tsv into dataframe, specifying the tab separator
df_predictions = pd.read_csv('image-predictions.tsv',sep="\t")
In [7]:
#initial check what the dataframe looks like
df_predictions.head(3)
Out[7]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True

Additional gathering from Twitter API using Tweepy library

from the course's project description :

Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt file. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.

In [8]:
#I obtained a Twitter dev account and the relevant API keys, secrets
#and tokens are below :

consumer_key = '########################################################'
consumer_secret = '########################################################'
access_token = '########################################################'
access_secret = '########################################################'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

#api = tweepy.API(auth)
#ensure appropriate rate_limit as described here: https://stackoverflow.com/a/44586034
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
In [9]:
#function to look up list of tweets, adapted from https://stackoverflow.com/a/44586034
def lookup_tweets(tweet_IDs):
    '''
    Returns list of tweets as generated by Twitter API
    Parameter: Tweet IDs, either as a list of integers, or 
    column of a Pandas dataframe ; 
    handles 100 tweet limit, uses api.statuses_lookup()
    '''    
    if isinstance(tweet_IDs, pd.Series): #https://stackoverflow.com/a/18117744
        tweet_IDs = tweet_IDs.tolist() #convert pandas df column to list
        
    full_tweets = [] #initialise list for list of tweets
    tweet_count = len(tweet_IDs)
    try:
        for i in range((tweet_count // 100) + 1): #handle 100 tweet limit
                        # Catch the last group if it is less than 100 tweets
            end_loc = min((i + 1) * 100, tweet_count)
            print(i)    #get feedback during downloading
            print("range {} to {}".format(i*100,end_loc))
            full_tweets.extend(   
                api.statuses_lookup(tweet_IDs[i * 100:end_loc])
            )
        return full_tweets
    except tweepy.TweepError:
        print("Something went wrong, quitting...")
In [10]:
# The aim is to use the Tweet IDs from WeRateDogs Twitter archive to gather 
# additional information using the lookup_tweets() function
# with the tweet_id column from df_archive
results = lookup_tweets(df_archive.tweet_id)
0
range 0 to 100
1
range 100 to 200
2
range 200 to 300
3
range 300 to 400
4
range 400 to 500
5
range 500 to 600
6
range 600 to 700
7
range 700 to 800
8
range 800 to 900
9
range 900 to 1000
10
range 1000 to 1100
11
range 1100 to 1200
12
range 1200 to 1300
13
range 1300 to 1400
14
range 1400 to 1500
15
range 1500 to 1600
16
range 1600 to 1700
17
range 1700 to 1800
18
range 1800 to 1900
19
range 1900 to 2000
20
range 2000 to 2100
21
range 2100 to 2200
22
range 2200 to 2300
23
range 2300 to 2356
In [11]:
#Next the tweets will be stored in a text file with one line per tweet
#using the _json property which contains JSON serializable response data
#as the tweepy status object itself is not JSON serializable https://stackoverflow.com/a/27901076
for tweet in results:
    with open('tweet_json.txt', 'a', encoding='utf8') as file: #append to text file
        json.dump(tweet._json,file)
            #json.dump()) as described in https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/
        file.write("\n") #add newline character 

The structure of JSON tweet objects is explained in https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.html

Tweet objects contain information such as the tweet "id", "text", "user", as well as the retweet_count favorite_count - the latter two are specified in the projection description. I will also use the tweet id (so that it can be matched with the archive) as well as the user name.

JSON of tweets which are retweets (i.e. simply re-posting an existing tweet potentially authored by another user, as is but creating a new tweet) or quote tweets (i.e. reposting an existing tweet with additional tweet text) contain additional tweet objects : a "retweeted_status" object or a "quoted_status" object, respectively. The "retweeted_status" and "quoted_status" objects also contain information such as "id", "text", "user" of the the tweet being retweeted/quoted. They are simply absent from tweet obects that aren't retweeted/quoted. In order to be able to easily distinguish between original tweets by WeRateDogs, I will add columns for whether a tweet is a retweet, a quote tweet, as well as the author of the original tweets being retweeted/quoted. Similarly I will add information about whether a tweet is a reply to another tweet, again to be able to easily distinguish between original tweets by WeRateDogs and tweets that are part of an ongoing conversation.

In [8]:
#The json file with the tweets will be read in and processed to build a 
#dataframe with IDs, retweet count and favourite count
#also check if a tweet is a retweet, and if it is a reply


additional = [] #initialise list that will hold dicitionaries
with open('tweet_json.txt','r', encoding='utf8') as file:
    for line in file:
        tweet=json.loads(line)

        #get retweet status, user if they exist
        try:
            retweet_user=tweet['retweeted_status']['user']['name']
            is_retweet = True
        except:
            retweet_user = None
            is_retweet = False         

            #get quote status, user if they exist        
        try:
            quote_user=tweet['quoted_status']['user']['name']
            is_quote = True
        except:
            quote_user = None
            is_quote = False         
        
        
        
        additional.append({'tweet_id': tweet['id'],
                           'user_name':tweet['user']['name'],
                          'retweet_count': tweet['retweet_count'],
                           'favorite_count' : tweet['favorite_count'],
                           #'is_retweet' : 'retweeted_status' in tweet, 
                                 #check if it's a retweet, adapted from https://stackoverflow.com/a/18937252
                           'is_retweet' : is_retweet,
                           'retweet_user' : retweet_user,
                           'is_quote' : is_quote,
                           'quote_user': quote_user,
                           'is_reply' : tweet['in_reply_to_status_id'] is not None 
                                 #check if it's a reply, adapted from https://stackoverflow.com/a/49469052
                          })
       # break
        
df_additional=pd.DataFrame(additional, columns=['tweet_id','user_name','retweet_count','favorite_count','is_retweet','retweet_user','is_quote','quote_user','is_reply'])
In [9]:
#initial check of new dataframe
df_additional.head(3)
Out[9]:
tweet_id user_name retweet_count favorite_count is_retweet retweet_user is_quote quote_user is_reply
0 876484053909872640 WeRateDogs™ 2327 18309 False None False None False
1 886366144734445568 WeRateDogs™ 3100 20637 False None False None False
2 877611172832227328 WeRateDogs™ 78 0 True Rachel Baker False None False
In [10]:
df_additional.shape
Out[10]:
(2339, 9)

3 - Assess

Data quality needs to be assessed in the context of the questions that analysis of a data set is meant to answer. Before starting the assessment, it's helpful to be aware of what those questions might be. For the present data, questions might include:

  • is the idiosyncratic rating system meaningful?
  • which dogs are the most popular?
  • how is the popularity of tweets from this Twitter account developing over time?

Assessing additional data gathered using Tweepy (df_additional)

In [11]:
df_additional.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2339 entries, 0 to 2338
Data columns (total 9 columns):
tweet_id          2339 non-null int64
user_name         2339 non-null object
retweet_count     2339 non-null int64
favorite_count    2339 non-null int64
is_retweet        2339 non-null bool
retweet_user      167 non-null object
is_quote          2339 non-null bool
quote_user        24 non-null object
is_reply          2339 non-null bool
dtypes: bool(3), int64(3), object(3)
memory usage: 116.6+ KB

Data types look OK. None values for retweet_user, quote_user are accurate rather than missing values because these tweets are not retweets/quote tweets. tweet_id could also be a string rather than integer, but integer may be more efficient here.

In [12]:
#visual assessment of the dataframe in Pandas - also done in separate spreadsheet software

df_additional
Out[12]:
tweet_id user_name retweet_count favorite_count is_retweet retweet_user is_quote quote_user is_reply
0 876484053909872640 WeRateDogs™ 2327 18309 False None False None False
1 886366144734445568 WeRateDogs™ 3100 20637 False None False None False
2 877611172832227328 WeRateDogs™ 78 0 True Rachel Baker False None False
3 888078434458587136 WeRateDogs™ 3378 21193 False None False None False
4 891689557279858688 WeRateDogs™ 8373 41028 False None False None False
5 879492040517615616 WeRateDogs™ 3088 22844 False None False None False
6 878604707211726852 WeRateDogs™ 6996 29473 False None True BBC News (World) False
7 879674319642796034 WeRateDogs™ 10 301 False None False None True
8 884876753390489601 WeRateDogs™ 5444 27185 False None False None False
9 882992080364220416 WeRateDogs™ 3804 23328 False None False None False
10 876120275196170240 WeRateDogs™ 4561 27275 False None False None False
11 889880896479866881 WeRateDogs™ 4820 27085 False None False None False
12 885518971528720385 WeRateDogs™ 3634 19915 False None False None False
13 879862464715927552 WeRateDogs™ 3394 21785 False None False None False
14 877556246731214848 WeRateDogs™ 3696 22174 False None False None False
15 890971913173991426 WeRateDogs™ 1989 11530 False None False None False
16 890006608113172480 WeRateDogs™ 7097 29888 False None False None False
17 873580283840344065 WeRateDogs™ 3847 23674 False None False None False
18 880935762899988482 WeRateDogs™ 2704 16616 False None False None False
19 891815181378084864 WeRateDogs™ 4020 24391 False None False None False
20 877201837425926144 WeRateDogs™ 5433 26475 False None False None False
21 886267009285017600 WeRateDogs™ 4 116 False None False None True
22 873337748698140672 WeRateDogs™ 1549 0 True WeRateDogs™ False None False
23 875747767867523072 WeRateDogs™ 4178 24593 False None False None False
24 889278841981685760 WeRateDogs™ 5203 24588 False None False None False
25 879415818425184262 WeRateDogs™ 43040 103165 False None False None False
26 882268110199369728 WeRateDogs™ 11241 43774 False None False None False
27 880465832366813184 WeRateDogs™ 6098 27800 False None False None False
28 887101392804085760 WeRateDogs™ 5775 29787 False None False None False
29 884162670584377345 WeRateDogs™ 2890 19836 False None False None False
... ... ... ... ... ... ... ... ... ...
2309 666020888022790149 WeRateDogs™ 497 2519 False None False None False
2310 666421158376562688 WeRateDogs™ 111 310 False None False None False
2311 666983947667116034 WeRateDogs™ 979 2528 False None False None False
2312 666063827256086533 WeRateDogs™ 213 463 False None False None False
2313 666447344410484738 WeRateDogs™ 19 102 False None False None False
2314 666691418707132416 WeRateDogs™ 47 185 False None False None False
2315 666337882303524864 WeRateDogs™ 88 192 False None False None False
2316 666273097616637952 WeRateDogs™ 76 170 False None False None False
2317 667044094246576128 WeRateDogs™ 50 182 False None False None False
2318 666411507551481857 WeRateDogs™ 321 432 False None False None False
2319 666786068205871104 WeRateDogs™ 484 750 False None False None False
2320 666428276349472768 WeRateDogs™ 81 162 False None False None False
2321 666353288456101888 WeRateDogs™ 71 214 False None False None False
2322 666826780179869698 WeRateDogs™ 97 251 False None False None False
2323 666804364988780544 WeRateDogs™ 92 236 False None False None False
2324 666817836334096384 WeRateDogs™ 249 512 False None False None False
2325 666396247373291520 WeRateDogs™ 83 161 False None False None False
2326 666644823164719104 WeRateDogs™ 79 228 False None False None False
2327 666649482315059201 WeRateDogs™ 617 1105 False None False None False
2328 666055525042405380 WeRateDogs™ 235 427 False None False None False
2329 666051853826850816 WeRateDogs™ 828 1193 False None False None False
2330 666094000022159362 WeRateDogs™ 72 160 False None False None False
2331 666430724426358785 WeRateDogs™ 194 311 False None False None False
2332 666287406224695296 WeRateDogs™ 63 142 False None False None False
2333 666437273139982337 WeRateDogs™ 47 122 False None False None False
2334 666099513787052032 WeRateDogs™ 66 150 False None False None False
2335 666268910803644416 WeRateDogs™ 35 100 False None False None False
2336 666362758909284353 WeRateDogs™ 559 756 False None False None False
2337 666835007768551424 WeRateDogs™ 80 211 False None False None False
2338 666104133288665088 WeRateDogs™ 6355 13987 False None False None False

2339 rows × 9 columns

In [13]:
df_additional.describe()
Out[13]:
tweet_id retweet_count favorite_count
count 2.339000e+03 2339.000000 2339.000000
mean 7.421620e+17 2897.746045 7884.252672
std 6.828729e+16 4886.564270 12218.674683
min 6.660209e+17 1.000000 0.000000
25% 6.783378e+17 582.500000 1365.500000
50% 7.186133e+17 1353.000000 3430.000000
75% 7.986907e+17 3377.500000 9653.500000
max 8.924206e+17 82838.000000 162741.000000

retweet_count and favorite_count of 0 seem low compared to their respective median values- what are these?

In [14]:
df_additional[df_additional.favorite_count==0].head(10)
Out[14]:
tweet_id user_name retweet_count favorite_count is_retweet retweet_user is_quote quote_user is_reply
2 877611172832227328 WeRateDogs™ 78 0 True Rachel Baker False None False
22 873337748698140672 WeRateDogs™ 1549 0 True WeRateDogs™ False None False
37 874434818259525634 WeRateDogs™ 14349 0 True WeRateDogs™ False None False
50 879130579576475649 WeRateDogs™ 6620 0 True WeRateDogs™ False None False
53 878404777348136964 WeRateDogs™ 1250 0 True WeRateDogs™ False None False
54 886054160059072513 WeRateDogs™ 105 0 True Ramón Laureano’s Arm 💪🏽 False None False
60 885311592912609280 WeRateDogs™ 17974 0 True WeRateDogs™ False None False
85 878316110768087041 WeRateDogs™ 6464 0 True WeRateDogs™ False None False
98 858860390427611136 WeRateDogs™ 8229 0 True WeRateDogs™ False None False
99 868639477480148993 WeRateDogs™ 2069 0 True WeRateDogs™ False None False
In [15]:
df_additional[df_additional.favorite_count==0].is_retweet.value_counts()
Out[15]:
True    167
Name: is_retweet, dtype: int64

Low favorite_count values are seen on retweets - because favourites are added to the original tweet rather than the retweet. This is true whether the retweet is of another user's tweet or a tweet by WeRateDogs. These low favorite_count values are not reflective of a data quality issue - but are important to understand the data.

In [16]:
df_additional[df_additional.retweet_count<10]
Out[16]:
tweet_id user_name retweet_count favorite_count is_retweet retweet_user is_quote quote_user is_reply
21 886267009285017600 WeRateDogs™ 4 116 False None False None True
38 881633300179243008 WeRateDogs™ 7 124 False None False None True
187 870726314365509632 WeRateDogs™ 3 117 False None False None True
232 840698636975636481 WeRateDogs™ 2 183 False None False None True
252 838085839343206401 WeRateDogs™ 1 147 False None False None True
293 832088576586297345 WeRateDogs™ 2 69 False None False None True
1030 738891149612572673 WeRateDogs™ 6 112 False None False None True
1266 707983188426153984 WeRateDogs™ 2 51 False None False None True

Low retweet_count values are seen on tweets which are replies and presumably often of lesser interest in isolation therfore generally not worth retweeting. Again: not reflective of a data quality issue - but are important to understand the data.

In [17]:
df_additional[df_additional.favorite_count==0].head(10)
Out[17]:
tweet_id user_name retweet_count favorite_count is_retweet retweet_user is_quote quote_user is_reply
2 877611172832227328 WeRateDogs™ 78 0 True Rachel Baker False None False
22 873337748698140672 WeRateDogs™ 1549 0 True WeRateDogs™ False None False
37 874434818259525634 WeRateDogs™ 14349 0 True WeRateDogs™ False None False
50 879130579576475649 WeRateDogs™ 6620 0 True WeRateDogs™ False None False
53 878404777348136964 WeRateDogs™ 1250 0 True WeRateDogs™ False None False
54 886054160059072513 WeRateDogs™ 105 0 True Ramón Laureano’s Arm 💪🏽 False None False
60 885311592912609280 WeRateDogs™ 17974 0 True WeRateDogs™ False None False
85 878316110768087041 WeRateDogs™ 6464 0 True WeRateDogs™ False None False
98 858860390427611136 WeRateDogs™ 8229 0 True WeRateDogs™ False None False
99 868639477480148993 WeRateDogs™ 2069 0 True WeRateDogs™ False None False
In [18]:
df_additional.retweet_user.value_counts()
Out[18]:
WeRateDogs™                           144
We Rate Dog Ratings                     2
Jon 🇬🇧                                  1
Rachel Baker                            1
tally                                   1
Ramón Laureano’s Arm 💪🏽                 1
Rachael                                 1
Jeannette                               1
Patreon                                 1
♪{•.•}* • banditø 8/3/19                1
Ellen M                                 1
Jenna Marbles                           1
Lin-Manuel Miranda                      1
Kiba                                    1
Eddie                                   1
Twitter                                 1
Kate Gray                               1
katie                                   1
Allison Schafer                         1
Gia 👩🏼‍🍳                                1
That wasn't very cash money of you      1
Emilie Shepherd                         1
Aaron Chewning                          1
Name: retweet_user, dtype: int64
In [19]:
df_additional.quote_user.value_counts()
Out[19]:
NBC News                   1
Shomari Stone              1
Codeine Bryant             1
Yahoo News                 1
Animal Cognition           1
Tower Hamlets Police       1
Bellevue, WA Police        1
ABC News                   1
Thug on a thot tin roof    1
Mark Halperin              1
WSAZ NewsChannel 3         1
FOX Deportes               1
Badlands Nat'l Park        1
NBC Sports Capitals        1
KTLA                       1
Micah Grimes               1
StickerGrub                1
WGN TV News                1
Deadspin                   1
Kai Johnson                1
BBC News (World)           1
The Telegraph              1
KING 5 News                1
CHP San Francisco          1
Name: quote_user, dtype: int64

WeRateDogs doesn't quote-tweet their own tweets, but commonly retweets their own tweets - this is a popular way for tweeters to boost their recent posts.

No major data quality issues stand out with the additional tweet data in df_additional. Data completeness and tidiness is assessed below.

Assessing the WeRateDogs Twitter archive (df_archive)

In [20]:
#visual assessment of the dataframe in Pandas - also done in separate spreadsheet software
df_archive
Out[20]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12 10 Franklin None None None None
5 891087950875897856 NaN NaN 2017-07-29 00:08:17 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a majestic great white breaching ... NaN NaN NaN https://twitter.com/dog_rates/status/891087950... 13 10 None None None None None
6 890971913173991426 NaN NaN 2017-07-28 16:27:12 +0000 <a href="http://twitter.com/download/iphone" r... Meet Jax. He enjoys ice cream so much he gets ... NaN NaN NaN https://gofundme.com/ydvmve-surgery-for-jax,ht... 13 10 Jax None None None None
7 890729181411237888 NaN NaN 2017-07-28 00:22:40 +0000 <a href="http://twitter.com/download/iphone" r... When you watch your owner call another dog a g... NaN NaN NaN https://twitter.com/dog_rates/status/890729181... 13 10 None None None None None
8 890609185150312448 NaN NaN 2017-07-27 16:25:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Zoey. She doesn't want to be one of th... NaN NaN NaN https://twitter.com/dog_rates/status/890609185... 13 10 Zoey None None None None
9 890240255349198849 NaN NaN 2017-07-26 15:59:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Cassie. She is a college pup. Studying... NaN NaN NaN https://twitter.com/dog_rates/status/890240255... 14 10 Cassie doggo None None None
10 890006608113172480 NaN NaN 2017-07-26 00:31:25 +0000 <a href="http://twitter.com/download/iphone" r... This is Koda. He is a South Australian decksha... NaN NaN NaN https://twitter.com/dog_rates/status/890006608... 13 10 Koda None None None None
11 889880896479866881 NaN NaN 2017-07-25 16:11:53 +0000 <a href="http://twitter.com/download/iphone" r... This is Bruno. He is a service shark. Only get... NaN NaN NaN https://twitter.com/dog_rates/status/889880896... 13 10 Bruno None None None None
12 889665388333682689 NaN NaN 2017-07-25 01:55:32 +0000 <a href="http://twitter.com/download/iphone" r... Here's a puppo that seems to be on the fence a... NaN NaN NaN https://twitter.com/dog_rates/status/889665388... 13 10 None None None None puppo
13 889638837579907072 NaN NaN 2017-07-25 00:10:02 +0000 <a href="http://twitter.com/download/iphone" r... This is Ted. He does his best. Sometimes that'... NaN NaN NaN https://twitter.com/dog_rates/status/889638837... 12 10 Ted None None None None
14 889531135344209921 NaN NaN 2017-07-24 17:02:04 +0000 <a href="http://twitter.com/download/iphone" r... This is Stuart. He's sporting his favorite fan... NaN NaN NaN https://twitter.com/dog_rates/status/889531135... 13 10 Stuart None None None puppo
15 889278841981685760 NaN NaN 2017-07-24 00:19:32 +0000 <a href="http://twitter.com/download/iphone" r... This is Oliver. You're witnessing one of his m... NaN NaN NaN https://twitter.com/dog_rates/status/889278841... 13 10 Oliver None None None None
16 888917238123831296 NaN NaN 2017-07-23 00:22:39 +0000 <a href="http://twitter.com/download/iphone" r... This is Jim. He found a fren. Taught him how t... NaN NaN NaN https://twitter.com/dog_rates/status/888917238... 12 10 Jim None None None None
17 888804989199671297 NaN NaN 2017-07-22 16:56:37 +0000 <a href="http://twitter.com/download/iphone" r... This is Zeke. He has a new stick. Very proud o... NaN NaN NaN https://twitter.com/dog_rates/status/888804989... 13 10 Zeke None None None None
18 888554962724278272 NaN NaN 2017-07-22 00:23:06 +0000 <a href="http://twitter.com/download/iphone" r... This is Ralphus. He's powering up. Attempting ... NaN NaN NaN https://twitter.com/dog_rates/status/888554962... 13 10 Ralphus None None None None
19 888202515573088257 NaN NaN 2017-07-21 01:02:36 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: This is Canela. She attempted s... 8.874740e+17 4.196984e+09 2017-07-19 00:47:34 +0000 https://twitter.com/dog_rates/status/887473957... 13 10 Canela None None None None
20 888078434458587136 NaN NaN 2017-07-20 16:49:33 +0000 <a href="http://twitter.com/download/iphone" r... This is Gerald. He was just told he didn't get... NaN NaN NaN https://twitter.com/dog_rates/status/888078434... 12 10 Gerald None None None None
21 887705289381826560 NaN NaN 2017-07-19 16:06:48 +0000 <a href="http://twitter.com/download/iphone" r... This is Jeffrey. He has a monopoly on the pool... NaN NaN NaN https://twitter.com/dog_rates/status/887705289... 13 10 Jeffrey None None None None
22 887517139158093824 NaN NaN 2017-07-19 03:39:09 +0000 <a href="http://twitter.com/download/iphone" r... I've yet to rate a Venezuelan Hover Wiener. Th... NaN NaN NaN https://twitter.com/dog_rates/status/887517139... 14 10 such None None None None
23 887473957103951883 NaN NaN 2017-07-19 00:47:34 +0000 <a href="http://twitter.com/download/iphone" r... This is Canela. She attempted some fancy porch... NaN NaN NaN https://twitter.com/dog_rates/status/887473957... 13 10 Canela None None None None
24 887343217045368832 NaN NaN 2017-07-18 16:08:03 +0000 <a href="http://twitter.com/download/iphone" r... You may not have known you needed to see this ... NaN NaN NaN https://twitter.com/dog_rates/status/887343217... 13 10 None None None None None
25 887101392804085760 NaN NaN 2017-07-18 00:07:08 +0000 <a href="http://twitter.com/download/iphone" r... This... is a Jubilant Antarctic House Bear. We... NaN NaN NaN https://twitter.com/dog_rates/status/887101392... 12 10 None None None None None
26 886983233522544640 NaN NaN 2017-07-17 16:17:36 +0000 <a href="http://twitter.com/download/iphone" r... This is Maya. She's very shy. Rarely leaves he... NaN NaN NaN https://twitter.com/dog_rates/status/886983233... 13 10 Maya None None None None
27 886736880519319552 NaN NaN 2017-07-16 23:58:41 +0000 <a href="http://twitter.com/download/iphone" r... This is Mingus. He's a wonderful father to his... NaN NaN NaN https://www.gofundme.com/mingusneedsus,https:/... 13 10 Mingus None None None None
28 886680336477933568 NaN NaN 2017-07-16 20:14:00 +0000 <a href="http://twitter.com/download/iphone" r... This is Derek. He's late for a dog meeting. 13... NaN NaN NaN https://twitter.com/dog_rates/status/886680336... 13 10 Derek None None None None
29 886366144734445568 NaN NaN 2017-07-15 23:25:31 +0000 <a href="http://twitter.com/download/iphone" r... This is Roscoe. Another pupper fallen victim t... NaN NaN NaN https://twitter.com/dog_rates/status/886366144... 12 10 Roscoe None None pupper None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2326 666411507551481857 NaN NaN 2015-11-17 00:24:19 +0000 <a href="http://twitter.com/download/iphone" r... This is quite the dog. Gets really excited whe... NaN NaN NaN https://twitter.com/dog_rates/status/666411507... 2 10 quite None None None None
2327 666407126856765440 NaN NaN 2015-11-17 00:06:54 +0000 <a href="http://twitter.com/download/iphone" r... This is a southern Vesuvius bumblegruff. Can d... NaN NaN NaN https://twitter.com/dog_rates/status/666407126... 7 10 a None None None None
2328 666396247373291520 NaN NaN 2015-11-16 23:23:41 +0000 <a href="http://twitter.com/download/iphone" r... Oh goodness. A super rare northeast Qdoba kang... NaN NaN NaN https://twitter.com/dog_rates/status/666396247... 9 10 None None None None None
2329 666373753744588802 NaN NaN 2015-11-16 21:54:18 +0000 <a href="http://twitter.com/download/iphone" r... Those are sunglasses and a jean jacket. 11/10 ... NaN NaN NaN https://twitter.com/dog_rates/status/666373753... 11 10 None None None None None
2330 666362758909284353 NaN NaN 2015-11-16 21:10:36 +0000 <a href="http://twitter.com/download/iphone" r... Unique dog here. Very small. Lives in containe... NaN NaN NaN https://twitter.com/dog_rates/status/666362758... 6 10 None None None None None
2331 666353288456101888 NaN NaN 2015-11-16 20:32:58 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a mixed Asiago from the Galápagos... NaN NaN NaN https://twitter.com/dog_rates/status/666353288... 8 10 None None None None None
2332 666345417576210432 NaN NaN 2015-11-16 20:01:42 +0000 <a href="http://twitter.com/download/iphone" r... Look at this jokester thinking seat belt laws ... NaN NaN NaN https://twitter.com/dog_rates/status/666345417... 10 10 None None None None None
2333 666337882303524864 NaN NaN 2015-11-16 19:31:45 +0000 <a href="http://twitter.com/download/iphone" r... This is an extremely rare horned Parthenon. No... NaN NaN NaN https://twitter.com/dog_rates/status/666337882... 9 10 an None None None None
2334 666293911632134144 NaN NaN 2015-11-16 16:37:02 +0000 <a href="http://twitter.com/download/iphone" r... This is a funny dog. Weird toes. Won't come do... NaN NaN NaN https://twitter.com/dog_rates/status/666293911... 3 10 a None None None None
2335 666287406224695296 NaN NaN 2015-11-16 16:11:11 +0000 <a href="http://twitter.com/download/iphone" r... This is an Albanian 3 1/2 legged Episcopalian... NaN NaN NaN https://twitter.com/dog_rates/status/666287406... 1 2 an None None None None
2336 666273097616637952 NaN NaN 2015-11-16 15:14:19 +0000 <a href="http://twitter.com/download/iphone" r... Can take selfies 11/10 https://t.co/ws2AMaNwPW NaN NaN NaN https://twitter.com/dog_rates/status/666273097... 11 10 None None None None None
2337 666268910803644416 NaN NaN 2015-11-16 14:57:41 +0000 <a href="http://twitter.com/download/iphone" r... Very concerned about fellow dog trapped in com... NaN NaN NaN https://twitter.com/dog_rates/status/666268910... 10 10 None None None None None
2338 666104133288665088 NaN NaN 2015-11-16 04:02:55 +0000 <a href="http://twitter.com/download/iphone" r... Not familiar with this breed. No tail (weird).... NaN NaN NaN https://twitter.com/dog_rates/status/666104133... 1 10 None None None None None
2339 666102155909144576 NaN NaN 2015-11-16 03:55:04 +0000 <a href="http://twitter.com/download/iphone" r... Oh my. Here you are seeing an Adobe Setter giv... NaN NaN NaN https://twitter.com/dog_rates/status/666102155... 11 10 None None None None None
2340 666099513787052032 NaN NaN 2015-11-16 03:44:34 +0000 <a href="http://twitter.com/download/iphone" r... Can stand on stump for what seems like a while... NaN NaN NaN https://twitter.com/dog_rates/status/666099513... 8 10 None None None None None
2341 666094000022159362 NaN NaN 2015-11-16 03:22:39 +0000 <a href="http://twitter.com/download/iphone" r... This appears to be a Mongolian Presbyterian mi... NaN NaN NaN https://twitter.com/dog_rates/status/666094000... 9 10 None None None None None
2342 666082916733198337 NaN NaN 2015-11-16 02:38:37 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a well-established sunblockerspan... NaN NaN NaN https://twitter.com/dog_rates/status/666082916... 6 10 None None None None None
2343 666073100786774016 NaN NaN 2015-11-16 01:59:36 +0000 <a href="http://twitter.com/download/iphone" r... Let's hope this flight isn't Malaysian (lol). ... NaN NaN NaN https://twitter.com/dog_rates/status/666073100... 10 10 None None None None None
2344 666071193221509120 NaN NaN 2015-11-16 01:52:02 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a northern speckled Rhododendron.... NaN NaN NaN https://twitter.com/dog_rates/status/666071193... 9 10 None None None None None
2345 666063827256086533 NaN NaN 2015-11-16 01:22:45 +0000 <a href="http://twitter.com/download/iphone" r... This is the happiest dog you will ever see. Ve... NaN NaN NaN https://twitter.com/dog_rates/status/666063827... 10 10 the None None None None
2346 666058600524156928 NaN NaN 2015-11-16 01:01:59 +0000 <a href="http://twitter.com/download/iphone" r... Here is the Rand Paul of retrievers folks! He'... NaN NaN NaN https://twitter.com/dog_rates/status/666058600... 8 10 the None None None None
2347 666057090499244032 NaN NaN 2015-11-16 00:55:59 +0000 <a href="http://twitter.com/download/iphone" r... My oh my. This is a rare blond Canadian terrie... NaN NaN NaN https://twitter.com/dog_rates/status/666057090... 9 10 a None None None None
2348 666055525042405380 NaN NaN 2015-11-16 00:49:46 +0000 <a href="http://twitter.com/download/iphone" r... Here is a Siberian heavily armored polar bear ... NaN NaN NaN https://twitter.com/dog_rates/status/666055525... 10 10 a None None None None
2349 666051853826850816 NaN NaN 2015-11-16 00:35:11 +0000 <a href="http://twitter.com/download/iphone" r... This is an odd dog. Hard on the outside but lo... NaN NaN NaN https://twitter.com/dog_rates/status/666051853... 2 10 an None None None None
2350 666050758794694657 NaN NaN 2015-11-16 00:30:50 +0000 <a href="http://twitter.com/download/iphone" r... This is a truly beautiful English Wilson Staff... NaN NaN NaN https://twitter.com/dog_rates/status/666050758... 10 10 a None None None None
2351 666049248165822465 NaN NaN 2015-11-16 00:24:50 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a 1949 1st generation vulpix. Enj... NaN NaN NaN https://twitter.com/dog_rates/status/666049248... 5 10 None None None None None
2352 666044226329800704 NaN NaN 2015-11-16 00:04:52 +0000 <a href="http://twitter.com/download/iphone" r... This is a purebred Piers Morgan. Loves to Netf... NaN NaN NaN https://twitter.com/dog_rates/status/666044226... 6 10 a None None None None
2353 666033412701032449 NaN NaN 2015-11-15 23:21:54 +0000 <a href="http://twitter.com/download/iphone" r... Here is a very happy pup. Big fan of well-main... NaN NaN NaN https://twitter.com/dog_rates/status/666033412... 9 10 a None None None None
2354 666029285002620928 NaN NaN 2015-11-15 23:05:30 +0000 <a href="http://twitter.com/download/iphone" r... This is a western brown Mitsubishi terrier. Up... NaN NaN NaN https://twitter.com/dog_rates/status/666029285... 7 10 a None None None None
2355 666020888022790149 NaN NaN 2015-11-15 22:32:08 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a Japanese Irish Setter. Lost eye... NaN NaN NaN https://twitter.com/dog_rates/status/666020888... 8 10 None None None None None

2356 rows × 17 columns

In [21]:
#looking at missing fields and if data types are appropriate
df_archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB

No data are available in a large number of fields relating to replies (e.g. 78 in_reply_to_status_id), retweets (181 retweeted_status_id), expanded urls (2297 expanded_urls).

The following data types are not appropriate:

  1. timestamp and retweeted_status_timestamp should not be strings (object) but datetime
  • in_reply_to_status_id, retweeted_status_id should probably not be floats -
  • the same is true for in_reply_to_user_id, retweeted_status_user_id
In [22]:
#replies 
df_archive[df_archive.in_reply_to_status_id.notnull()][['tweet_id','in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp']].head()
Out[22]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp
30 886267009285017600 8.862664e+17 2.281182e+09 2017-07-15 16:51:35 +0000
55 881633300179243008 8.816070e+17 4.738443e+07 2017-07-02 21:58:53 +0000
64 879674319642796034 8.795538e+17 3.105441e+09 2017-06-27 12:14:36 +0000
113 870726314365509632 8.707262e+17 1.648776e+07 2017-06-02 19:38:25 +0000
148 863427515083354112 8.634256e+17 7.759620e+07 2017-05-13 16:15:35 +0000
In [23]:
#retweets
df_archive[df_archive.retweeted_status_id.notnull()][['tweet_id','retweeted_status_id','retweeted_status_user_id', 'retweeted_status_timestamp']].head()
Out[23]:
tweet_id retweeted_status_id retweeted_status_user_id retweeted_status_timestamp
19 888202515573088257 8.874740e+17 4.196984e+09 2017-07-19 00:47:34 +0000
32 886054160059072513 8.860537e+17 1.960740e+07 2017-07-15 02:44:07 +0000
36 885311592912609280 8.305833e+17 4.196984e+09 2017-02-12 01:04:29 +0000
68 879130579576475649 8.780576e+17 4.196984e+09 2017-06-23 01:10:23 +0000
73 878404777348136964 8.782815e+17 4.196984e+09 2017-06-23 16:00:04 +0000

Twitter uses the same process to generate tweet (status) IDs and user IDs. These are unique IDs based on time https://developer.twitter.com/en/docs/basics/twitter-ids.html

Their purpose is to create a unique identifier rather than serve mathematical calculations. However when IDs are encoded as floats, they are displayed in scientific notation, obscuring most of the digits - undermining the pupose of serving as unique identifiers.

Unlike the other ID variables, the tweet_id field has the data type of integer and is displayed correctly. The reason the other ID variables were read in as floats by read_csv is that NaN can't be handled by the integer data type (see https://stackoverflow.com/a/11548224) (NB. The latest version of Pandas, 0.24, has introduced experimental support for NA in integer data type https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support

In [24]:
#test if tweet_id really is unique
sum(df_archive.tweet_id.duplicated())
Out[24]:
0
In [25]:
sum(df_archive.text.duplicated())
Out[25]:
0
In [26]:
#check a few sample tweet texts
df_archive.text.sample(5)
Out[26]:
1703    Meet Griswold. He's dapper as hell. Already pu...
1847    Here's a handful of sleepy puppers. All look u...
179                          @Marc_IRL pixelated af 12/10
944     Nothing better than a doggo and a sunset. 10/1...
2251    This is Filup. He is overcome with joy after f...
Name: text, dtype: object
In [27]:
#using for loop to print the whole text of tweets
for text in df_archive.text.sample(10):
        print(text)
This is Olivia. She just saw an adult wearing crocs. 11/10 poor pupper. No one should witness such a thing https://t.co/yJVTi1DjJc
My oh my. This is a rare blond Canadian terrier on wheels. Only $8.98. Rather docile. 9/10 very rare https://t.co/yWBqbrzy8O
Who leaves the last cupcake just sitting there? 9/10 https://t.co/PWMqAoEx2a
This dog can't see its haters. 11/10 https://t.co/35BcGFdEAK
"Yes hello I'ma just snag this here toasted bagel real quick. carry on." 9/10 https://t.co/Cuz0Osnekp
This is Ozzie. He was doing fine until he lost traction in those festive socks. Now he's tired. 9/10 still killin it https://t.co/u4FYdIRKnY
This is Livvie. Someone should tell her it's been 47 years since Woodstock. Magical eyes tho 11/10 would stare into https://t.co/qw07vhVHuO
Say hello to Rorie. She's zen af. Just enjoying a treat in the sunlight. 10/10 would immediately trade lives with https://t.co/yctnFptdQ1
This pupper just descended from heaven. 12/10 can probably fly https://t.co/X6X9wM7NuS
"AND IIIIIIIIIIIEIIIIIIIIIIIII WILL ALWAYS LOVE YOUUUUU" 11/10 https://t.co/rSNCEiTtfI

The standard format of tweets is to start tweets with "This is ...(name of dog)", where the name is known, and ending in a rating out of 10, followed by a photo of the dog.

In [28]:
#checking the ratings numerator and denominator - how many unqiue values?
df_archive.rating_numerator.nunique(), df_archive.rating_denominator.nunique()
Out[28]:
(40, 18)
In [29]:
#frequency of different values of the numerator
df_archive.rating_numerator.value_counts()
Out[29]:
12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64
In [30]:
#checking tweet text where the numerators is 15 or more
for tweet in df_archive[df_archive.rating_numerator>14].text.sample(10):
    print(tweet)
RT @KibaDva: I collected all the good dogs!! 15/10 @dog_rates #GoodDogs https://t.co/6UCGFczlOI
This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh
@markhoppus 182/10
@roushfenway These are good dogs but 17/10 is an emotional impulse rating. More like 13/10s
This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS
Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho
From left to right:
Cletus, Jerome, Alejandro, Burp, &amp; Titson
None know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK
Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. 
Keep Sam smiling by clicking and sharing this link:
https://t.co/98tB8y7y7t https://t.co/LouL5vdvxx
After so many requests... here you go.

Good dogg. 420/10 https://t.co/yfAAo1gdeY
In [31]:
#checking tweet text where the numerators is less than 10
for tweet in df_archive[df_archive.rating_numerator<10].text.sample(10):
    print(tweet)
This is Gus. He's super stoked about being an elephant. Couldn't be happier. 9/10 for elephant pupper https://t.co/gJS1qU0jP7
This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI
Say hello to Oliver. He thought what was inside the pillow should be outside the pillow. Blurry since birth. 8/10 https://t.co/lFU9W31Fg9
This is Billl. He's trying to be a ghost but he's not very good at it. 6/10 c'mon Billl https://t.co/ero0XfdGtY
Two obedient dogs here. Left one has extra leg sticking out of its back. They each get 9/10. Would pet both at once https://t.co/RGcNPsmAfY
This is Philippe from Soviet Russia. Commanding leader. Misplaced other boot. Hung flag himself. 9/10 charismatic af https://t.co/5NhPV8E45i
Army of water dogs here. None of them know where they're going. Have no real purpose. Aggressive barks. 5/10 for all https://t.co/A88x73TwMN
Say hello to Penny &amp; Gizmo. They are practicing their caroling. The ambition in the room is tangible. 9/10 for both https://t.co/aqBHjjh5VD
Meet Stanley. He's an inverted Uzbekistani water pup. Hella exotic. Floats around all day. 8/10 I want to be Stanley https://t.co/XpYMBQ1FD8
This is Keurig. He's a rare dog. Laughs like an idiot tho. Head is basically a weapon. Poorly maintained goatee 4/10 https://t.co/xOrUyj7K30

Most values of the rating_numerator are between 10 and 14. Values outside of that range are often from tweets that do not follow the standard dog ratings format of this accounts.

Based on a sample, values much above 15 are often for pics with several dogs, and can be taken to be a multiple of an actual rating.

Based on a sample, values below 10 are often not pictures of a dog but other animals, but can also be of somewhat ugly dogs.

In [32]:
#checking the values of the ratings denominator
df_archive.rating_denominator.value_counts()
Out[32]:
10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
In [33]:
for text in df_archive.query("rating_denominator != 10").text:
    if "/10" in text:
        print(text)
@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho
RT @dog_rates: After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https:/…
After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ
Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a
This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq
This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5
This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv
In [34]:
for text in df_archive.query("rating_denominator != 10").text:
    if "/10" not in text:
        print(text)
@docmisterio account started on 11/15/15
The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd
Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. 
Keep Sam smiling by clicking and sharing this link:
https://t.co/98tB8y7y7t https://t.co/LouL5vdvxx
Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv
Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1
Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12
From left to right:
Cletus, Jerome, Alejandro, Burp, &amp; Titson
None know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK
Here is a whole flock of puppers.  60/50 I'll take the lot https://t.co/9dpcw6MdWa
Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ
Yes I do realize a rating of 4/20 would've been fitting. However, it would be unjust to give these cooperative pups that low of a rating
Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3
Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55
I'm aware that I could've said 20/16, but here at WeRateDogs we are very professional. An inconsistent rating scale is simply irresponsible
IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq
Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw

The rating_denominator is almost always 10.

In cases where it isn't 10, but the tweet text contains a fraction out of 10, it is a mistake with how the ratings were extracted due to the presence of another faction. This is an inaccuracy that can be fixed since we still have the tweet text available. In these cases the numerator is also wrong and can be cleaned.

In cases where it isn't 10, but the tweet text does not contain a fraction out of 10 but a multiple of 10, the rating is a deviation from the normal dog rating format applied to multiple dogs - again this can be fixed, for both the numerator and denominator.

In [35]:
#Checking the name variable
df_archive.name.value_counts()
Out[35]:
None         745
a             55
Charlie       12
Cooper        11
Oliver        11
Lucy          11
Penny         10
Lola          10
Tucker        10
Bo             9
Winston        9
Sadie          8
the            8
Bailey         7
Buddy          7
an             7
Daisy          7
Toby           7
Stanley        6
Dave           6
Rusty          6
Oscar          6
Milo           6
Bella          6
Koda           6
Leo            6
Scout          6
Jack           6
Jax            6
Oakley         5
            ... 
Durg           1
space          1
Furzey         1
Sunshine       1
Maxwell        1
Jonah          1
Chloe          1
Divine         1
Ginger         1
Tug            1
Kara           1
Julio          1
Vinnie         1
Gordon         1
Bonaparte      1
Brandi         1
Lipton         1
Crawford       1
Blanket        1
Ace            1
Brooks         1
Milky          1
Clifford       1
Kathmandu      1
Rudy           1
Jameson        1
Bauer          1
Boston         1
Ferg           1
Lucia          1
Name: name, Length: 957, dtype: int64
In [36]:
#checking all words in the names column that are not capitalised
df_archive[df_archive.name.str.islower()].name.value_counts()
Out[36]:
a               55
the              8
an               7
very             5
just             4
quite            4
one              4
mad              2
not              2
getting          2
actually         2
unacceptable     1
such             1
officially       1
incredibly       1
my               1
all              1
life             1
old              1
his              1
this             1
infuriating      1
light            1
space            1
by               1
Name: name, dtype: int64
In [37]:
#for tweet_text in df_archive[df_archive.name == 'a'].text.sample(5):
for tweet_text in df_archive[df_archive.name.str.islower()].text.sample(10):
    print(tweet_text)
Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af 
(IG: puffie_the_chow) https://t.co/ghXBIIeQZF
This is a very rare Great Alaskan Bush Pupper. Hard to stumble upon without spooking. 12/10 would pet passionately https://t.co/xOBKCdpzaa
This is the newly formed pupper a capella group. They're just starting out but I see tons of potential. 8/10 for all https://t.co/wbAcvFoNtn
This is quite the dog. Gets really excited when not in water. Not very soft tho. Bad at fetch. Can't do tricks. 2/10 https://t.co/aMCTNWO94t
I've yet to rate a Venezuelan Hover Wiener. This is such an honor. 14/10 paw-inspiring af (IG: roxy.thedoxy) https://t.co/20VrLAA8ba
This is an Irish Rigatoni terrier named Berta. Completely made of rope. No eyes. Quite large. Loves to dance. 10/10 https://t.co/EM5fDykrJg
Guys this is getting so out of hand. We only rate dogs. This is a Galapagos Speed Panda. Pls only send dogs... 10/10 https://t.co/8lpAGaZRFn
This is a baby Rand Paul. Curls for days. 11/10 would cuddle the hell out of https://t.co/xHXNaPAYRe
This is space pup. He's very confused. Tries to moonwalk at one point. Super spiffy uniform. 13/10 I love space pup https://t.co/SfPQ2KeLdq
This is the happiest pupper I've ever seen. 10/10 would trade lives with https://t.co/ep8ATEJwRb

It appears that the name was extracted by using a rather simplistic method of looking for the construction "This is"..., assuming the next word would be a dog's name. These cases are easily identified by looking for the absence of a capitalised named.

In [38]:
#create a mask to select records where the name is either None or lower-case
mask = (df_archive.name == 'None') | (df_archive.name.str.islower())

#where there is no correct name, check for names after a different construction:
for tweet_text in df_archive.loc[mask].text.sample(100):
    if ("named" in tweet_text) | ("name is" in tweet_text) :
        print(tweet_text)
This is a Birmingham Quagmire named Chuk. Loves to relax and watch the game while sippin on that iced mocha. 10/10 https://t.co/HvNg9JWxFt
This pup's name is Sabertooth (parents must be cool). Ears for days. Jumps unannounced. 9/10 would pet diligently https://t.co/iazoiNUviP
This is a Sizzlin Menorah spaniel from Brooklyn named Wylie. Lovable eyes. Chiller as hell. 10/10 and I'm out.. poof https://t.co/7E0AiJXPmI
We normally don't rate bears but this one seems nice. Her name is Thea. Appears rather fluffy. 10/10 good bear https://t.co/fZc7MixeeT
This is a Tuscaloosa Alcatraz named Jacob (Yacōb). Loves to sit in swing. Stellar tongue. 11/10 look at his feet https://t.co/2IslQ8ZSc7
This is a southwest Coriander named Klint. Hat looks expensive. Still on house arrest :(
9/10 https://t.co/IQTOMqDUIe

Some records with a misattributed lower-case name or with "None" do still contain a dog's name - these can be extracted from the original tweet.

In [39]:
#checking the 4 last columns
df_archive.iloc[:,13:].head()
Out[39]:
doggo floofer pupper puppo
0 None None None None
1 None None None None
2 None None None None
3 None None None None
4 None None None None
In [40]:
df_archive.doggo.value_counts()
Out[40]:
None     2259
doggo      97
Name: doggo, dtype: int64
In [41]:
df_archive.floofer.value_counts()
Out[41]:
None       2346
floofer      10
Name: floofer, dtype: int64
In [42]:
df_archive.pupper.value_counts()
Out[42]:
None      2099
pupper     257
Name: pupper, dtype: int64
In [43]:
df_archive.puppo.value_counts()
Out[43]:
None     2326
puppo      30
Name: puppo, dtype: int64
In [44]:
df_archive[(df_archive.doggo!="None") & (df_archive.pupper!="None")].iloc[:,13:]
Out[44]:
doggo floofer pupper puppo
460 doggo None pupper None
531 doggo None pupper None
565 doggo None pupper None
575 doggo None pupper None
705 doggo None pupper None
733 doggo None pupper None
778 doggo None pupper None
822 doggo None pupper None
889 doggo None pupper None
956 doggo None pupper None
1063 doggo None pupper None
1113 doggo None pupper None

The four last columns represent dog stage. Usually they have at most one dog stage, but sometimes there is more than one. The structure of these columns is not consistent with the principles of data tidiness: the column names are values - there should be a single variable instead.

In [45]:
df_archive.source.value_counts()
Out[45]:
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64
In [46]:
df_archive.expanded_urls.value_counts()
Out[46]:
https://twitter.com/dog_rates/status/753375668877008896/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/775733305207554048/photo/1                                                                                                                                                                                                                                                                                                                                    2
http://www.gofundme.com/bluethewhitehusky,https://twitter.com/dog_rates/status/831650051525054464/photo/1,https://twitter.com/dog_rates/status/831650051525054464/photo/1,https://twitter.com/dog_rates/status/831650051525054464/photo/1,https://twitter.com/dog_rates/status/831650051525054464/photo/1                                                                                          2
https://twitter.com/dog_rates/status/839549326359670784/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/762464539388485633/photo/1,https://twitter.com/dog_rates/status/762464539388485633/photo/1,https://twitter.com/dog_rates/status/762464539388485633/photo/1,https://twitter.com/dog_rates/status/762464539388485633/photo/1                                                                                                                                    2
https://twitter.com/dog_rates/status/817827839487737858/video/1                                                                                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/681523177663676416/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/878057613040115712/photo/1,https://twitter.com/dog_rates/status/878057613040115712/photo/1                                                                                                                                                                                                                                                                    2
https://www.gofundme.com/3ti3nps,https://twitter.com/dog_rates/status/868552278524837888/photo/1,https://twitter.com/dog_rates/status/868552278524837888/photo/1                                                                                                                                                                                                                                   2
https://twitter.com/dog_rates/status/774314403806253056/photo/1,https://twitter.com/dog_rates/status/774314403806253056/photo/1,https://twitter.com/dog_rates/status/774314403806253056/photo/1,https://twitter.com/dog_rates/status/774314403806253056/photo/1                                                                                                                                    2
https://twitter.com/dog_rates/status/829501995190984704/photo/1,https://twitter.com/dog_rates/status/829501995190984704/photo/1                                                                                                                                                                                                                                                                    2
https://www.gofundme.com/help-lorenzo-beat-cancer,https://twitter.com/dog_rates/status/860563773140209665/photo/1,https://twitter.com/dog_rates/status/860563773140209665/photo/1                                                                                                                                                                                                                  2
https://twitter.com/dog_rates/status/694669722378485760/photo/1,https://twitter.com/dog_rates/status/694669722378485760/photo/1                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/820749716845686786/photo/1,https://twitter.com/dog_rates/status/820749716845686786/photo/1                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/700747788515020802/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/667152164079423490/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://www.gofundme.com/helpquinny,https://twitter.com/dog_rates/status/863062471531167744/photo/1,https://twitter.com/dog_rates/status/863062471531167744/photo/1,https://twitter.com/dog_rates/status/863062471531167744/photo/1,https://twitter.com/dog_rates/status/863062471531167744/photo/1                                                                                                2
https://twitter.com/dog_rates/status/809920764300447744/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/718631497683582976/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/866450705531457537/photo/1,https://twitter.com/dog_rates/status/866450705531457537/photo/1                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/781524693396357120/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/750719632563142656/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://www.gofundme.com/3hgsuu0,https://twitter.com/dog_rates/status/840632337062862849/photo/1                                                                                                                                                                                                                                                                                                   2
https://twitter.com/dog_rates/status/833124694597443584/photo/1,https://twitter.com/dog_rates/status/833124694597443584/photo/1,https://twitter.com/dog_rates/status/833124694597443584/photo/1                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/680055455951884288/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/819227688460238848/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/759923798737051648/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/786963064373534720/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/759447681597108224/photo/1                                                                                                                                                                                                                                                                                                                                    2
https://twitter.com/dog_rates/status/768193404517830656/photo/1                                                                                                                                                                                                                                                                                                                                    2
                                                                                                                                                                                                                                                                                                                                                                                                  ..
https://twitter.com/dog_rates/status/674045139690631169/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/744971049620602880/photo/1,https://twitter.com/dog_rates/status/744971049620602880/photo/1,https://twitter.com/dog_rates/status/744971049620602880/photo/1                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/712717840512598017/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/693280720173801472/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/687818504314159109/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/795076730285391872/photo/1,https://twitter.com/dog_rates/status/795076730285391872/photo/1,https://twitter.com/dog_rates/status/795076730285391872/photo/1,https://twitter.com/dog_rates/status/795076730285391872/photo/1,https://twitter.com/dog_rates/status/795076730285391872/photo/1,https://twitter.com/dog_rates/status/795076730285391872/photo/1    1
https://twitter.com/dog_rates/status/680889648562991104/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/884162670584377345/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/669972011175813120/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/814153002265309185/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1                                                                                                                                                                                                                                                                    1
https://twitter.com/ChinoChinako/status/800065028116385792/photo/1,https://twitter.com/ChinoChinako/status/800065028116385792/photo/1,https://twitter.com/ChinoChinako/status/800065028116385792/photo/1                                                                                                                                                                                           1
https://twitter.com/dog_rates/status/687704180304273409/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/674754018082705410/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/718460005985447936/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/668226093875376128/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/860524505164394496/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/886258384151887873/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/884562892145688576/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/856282028240666624/photo/1,https://twitter.com/dog_rates/status/856282028240666624/photo/1,https://twitter.com/dog_rates/status/856282028240666624/photo/1,https://twitter.com/dog_rates/status/856282028240666624/photo/1                                                                                                                                    1
https://twitter.com/dog_rates/status/671159727754231808/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/739606147276148736/photo/1,https://twitter.com/dog_rates/status/739606147276148736/photo/1,https://twitter.com/dog_rates/status/739606147276148736/photo/1                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/742385895052087300/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/737678689543020544/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/813112105746448384/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/rolltidered/status/832434358292209665/photo/1                                                                                                                                                                                                                                                                                                                                  1
https://twitter.com/dog_rates/status/677530072887205888/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/883482846933004288/photo/1,https://twitter.com/dog_rates/status/883482846933004288/photo/1                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/667393430834667520/photo/1                                                                                                                                                                                                                                                                                                                                    1
https://twitter.com/dog_rates/status/692901601640583168/photo/1                                                                                                                                                                                                                                                                                                                                    1
Name: expanded_urls, Length: 2218, dtype: int64

The usefulness of some columns is not clear, e.g. the source column and expanded_urls

Assessing image predictions (df_predictions)

In [47]:
df_predictions
Out[47]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
5 666050758794694657 https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg 1 Bernese_mountain_dog 0.651137 True English_springer 0.263788 True Greater_Swiss_Mountain_dog 0.016199 True
6 666051853826850816 https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg 1 box_turtle 0.933012 False mud_turtle 0.045885 False terrapin 0.017885 False
7 666055525042405380 https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg 1 chow 0.692517 True Tibetan_mastiff 0.058279 True fur_coat 0.054449 False
8 666057090499244032 https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg 1 shopping_cart 0.962465 False shopping_basket 0.014594 False golden_retriever 0.007959 True
9 666058600524156928 https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg 1 miniature_poodle 0.201493 True komondor 0.192305 True soft-coated_wheaten_terrier 0.082086 True
10 666063827256086533 https://pbs.twimg.com/media/CT5Vg_wXIAAXfnj.jpg 1 golden_retriever 0.775930 True Tibetan_mastiff 0.093718 True Labrador_retriever 0.072427 True
11 666071193221509120 https://pbs.twimg.com/media/CT5cN_3WEAAlOoZ.jpg 1 Gordon_setter 0.503672 True Yorkshire_terrier 0.174201 True Pekinese 0.109454 True
12 666073100786774016 https://pbs.twimg.com/media/CT5d9DZXAAALcwe.jpg 1 Walker_hound 0.260857 True English_foxhound 0.175382 True Ibizan_hound 0.097471 True
13 666082916733198337 https://pbs.twimg.com/media/CT5m4VGWEAAtKc8.jpg 1 pug 0.489814 True bull_mastiff 0.404722 True French_bulldog 0.048960 True
14 666094000022159362 https://pbs.twimg.com/media/CT5w9gUW4AAsBNN.jpg 1 bloodhound 0.195217 True German_shepherd 0.078260 True malinois 0.075628 True
15 666099513787052032 https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg 1 Lhasa 0.582330 True Shih-Tzu 0.166192 True Dandie_Dinmont 0.089688 True
16 666102155909144576 https://pbs.twimg.com/media/CT54YGiWUAEZnoK.jpg 1 English_setter 0.298617 True Newfoundland 0.149842 True borzoi 0.133649 True
17 666104133288665088 https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg 1 hen 0.965932 False cock 0.033919 False partridge 0.000052 False
18 666268910803644416 https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg 1 desktop_computer 0.086502 False desk 0.085547 False bookcase 0.079480 False
19 666273097616637952 https://pbs.twimg.com/media/CT8T1mtUwAA3aqm.jpg 1 Italian_greyhound 0.176053 True toy_terrier 0.111884 True basenji 0.111152 True
20 666287406224695296 https://pbs.twimg.com/media/CT8g3BpUEAAuFjg.jpg 1 Maltese_dog 0.857531 True toy_poodle 0.063064 True miniature_poodle 0.025581 True
21 666293911632134144 https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg 1 three-toed_sloth 0.914671 False otter 0.015250 False great_grey_owl 0.013207 False
22 666337882303524864 https://pbs.twimg.com/media/CT9OwFIWEAMuRje.jpg 1 ox 0.416669 False Newfoundland 0.278407 True groenendael 0.102643 True
23 666345417576210432 https://pbs.twimg.com/media/CT9Vn7PWoAA_ZCM.jpg 1 golden_retriever 0.858744 True Chesapeake_Bay_retriever 0.054787 True Labrador_retriever 0.014241 True
24 666353288456101888 https://pbs.twimg.com/media/CT9cx0tUEAAhNN_.jpg 1 malamute 0.336874 True Siberian_husky 0.147655 True Eskimo_dog 0.093412 True
25 666362758909284353 https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg 1 guinea_pig 0.996496 False skunk 0.002402 False hamster 0.000461 False
26 666373753744588802 https://pbs.twimg.com/media/CT9vZEYWUAAlZ05.jpg 1 soft-coated_wheaten_terrier 0.326467 True Afghan_hound 0.259551 True briard 0.206803 True
27 666396247373291520 https://pbs.twimg.com/media/CT-D2ZHWIAA3gK1.jpg 1 Chihuahua 0.978108 True toy_terrier 0.009397 True papillon 0.004577 True
28 666407126856765440 https://pbs.twimg.com/media/CT-NvwmW4AAugGZ.jpg 1 black-and-tan_coonhound 0.529139 True bloodhound 0.244220 True flat-coated_retriever 0.173810 True
29 666411507551481857 https://pbs.twimg.com/media/CT-RugiWIAELEaq.jpg 1 coho 0.404640 False barracouta 0.271485 False gar 0.189945 False
... ... ... ... ... ... ... ... ... ... ... ... ...
2045 886366144734445568 https://pbs.twimg.com/media/DE0BTnQUwAApKEH.jpg 1 French_bulldog 0.999201 True Chihuahua 0.000361 True Boston_bull 0.000076 True
2046 886680336477933568 https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg 1 convertible 0.738995 False sports_car 0.139952 False car_wheel 0.044173 False
2047 886736880519319552 https://pbs.twimg.com/media/DE5Se8FXcAAJFx4.jpg 1 kuvasz 0.309706 True Great_Pyrenees 0.186136 True Dandie_Dinmont 0.086346 True
2048 886983233522544640 https://pbs.twimg.com/media/DE8yicJW0AAAvBJ.jpg 2 Chihuahua 0.793469 True toy_terrier 0.143528 True can_opener 0.032253 False
2049 887101392804085760 https://pbs.twimg.com/media/DE-eAq6UwAA-jaE.jpg 1 Samoyed 0.733942 True Eskimo_dog 0.035029 True Staffordshire_bullterrier 0.029705 True
2050 887343217045368832 https://pbs.twimg.com/ext_tw_video_thumb/88734... 1 Mexican_hairless 0.330741 True sea_lion 0.275645 False Weimaraner 0.134203 True
2051 887473957103951883 https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg 2 Pembroke 0.809197 True Rhodesian_ridgeback 0.054950 True beagle 0.038915 True
2052 887517139158093824 https://pbs.twimg.com/ext_tw_video_thumb/88751... 1 limousine 0.130432 False tow_truck 0.029175 False shopping_cart 0.026321 False
2053 887705289381826560 https://pbs.twimg.com/media/DFHDQBbXgAEqY7t.jpg 1 basset 0.821664 True redbone 0.087582 True Weimaraner 0.026236 True
2054 888078434458587136 https://pbs.twimg.com/media/DFMWn56WsAAkA7B.jpg 1 French_bulldog 0.995026 True pug 0.000932 True bull_mastiff 0.000903 True
2055 888202515573088257 https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg 2 Pembroke 0.809197 True Rhodesian_ridgeback 0.054950 True beagle 0.038915 True
2056 888554962724278272 https://pbs.twimg.com/media/DFTH_O-UQAACu20.jpg 3 Siberian_husky 0.700377 True Eskimo_dog 0.166511 True malamute 0.111411 True
2057 888804989199671297 https://pbs.twimg.com/media/DFWra-3VYAA2piG.jpg 1 golden_retriever 0.469760 True Labrador_retriever 0.184172 True English_setter 0.073482 True
2058 888917238123831296 https://pbs.twimg.com/media/DFYRgsOUQAARGhO.jpg 1 golden_retriever 0.714719 True Tibetan_mastiff 0.120184 True Labrador_retriever 0.105506 True
2059 889278841981685760 https://pbs.twimg.com/ext_tw_video_thumb/88927... 1 whippet 0.626152 True borzoi 0.194742 True Saluki 0.027351 True
2060 889531135344209921 https://pbs.twimg.com/media/DFg_2PVW0AEHN3p.jpg 1 golden_retriever 0.953442 True Labrador_retriever 0.013834 True redbone 0.007958 True
2061 889638837579907072 https://pbs.twimg.com/media/DFihzFfXsAYGDPR.jpg 1 French_bulldog 0.991650 True boxer 0.002129 True Staffordshire_bullterrier 0.001498 True
2062 889665388333682689 https://pbs.twimg.com/media/DFi579UWsAAatzw.jpg 1 Pembroke 0.966327 True Cardigan 0.027356 True basenji 0.004633 True
2063 889880896479866881 https://pbs.twimg.com/media/DFl99B1WsAITKsg.jpg 1 French_bulldog 0.377417 True Labrador_retriever 0.151317 True muzzle 0.082981 False
2064 890006608113172480 https://pbs.twimg.com/media/DFnwSY4WAAAMliS.jpg 1 Samoyed 0.957979 True Pomeranian 0.013884 True chow 0.008167 True
2065 890240255349198849 https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg 1 Pembroke 0.511319 True Cardigan 0.451038 True Chihuahua 0.029248 True
2066 890609185150312448 https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg 1 Irish_terrier 0.487574 True Irish_setter 0.193054 True Chesapeake_Bay_retriever 0.118184 True
2067 890729181411237888 https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg 2 Pomeranian 0.566142 True Eskimo_dog 0.178406 True Pembroke 0.076507 True
2068 890971913173991426 https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg 1 Appenzeller 0.341703 True Border_collie 0.199287 True ice_lolly 0.193548 False
2069 891087950875897856 https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg 1 Chesapeake_Bay_retriever 0.425595 True Irish_terrier 0.116317 True Indian_elephant 0.076902 False
2070 891327558926688256 https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg 2 basset 0.555712 True English_springer 0.225770 True German_short-haired_pointer 0.175219 True
2071 891689557279858688 https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg 1 paper_towel 0.170278 False Labrador_retriever 0.168086 True spatula 0.040836 False
2072 891815181378084864 https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg 1 Chihuahua 0.716012 True malamute 0.078253 True kelpie 0.031379 True
2073 892177421306343426 https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg 1 Chihuahua 0.323581 True Pekinese 0.090647 True papillon 0.068957 True
2074 892420643555336193 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg 1 orange 0.097049 False bagel 0.085851 False banana 0.076110 False

2075 rows × 12 columns

In [48]:
df_predictions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB

data types are fine, there are no apparent missing fields, however the column names are not very descriptive

In [49]:
df_predictions.img_num.nunique()
Out[49]:
4
In [50]:
df_predictions.img_num.value_counts()
Out[50]:
1    1780
2     198
3      66
4      31
Name: img_num, dtype: int64
In [51]:
df_predictions[df_predictions.img_num>1].sample(5)
Out[51]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
810 692187005137076224 https://pbs.twimg.com/media/CZskaEIWIAUeTr5.jpg 2 Siberian_husky 0.810592 True malamute 0.119745 True Eskimo_dog 0.029265 True
609 680085611152338944 https://pbs.twimg.com/media/CXAiiHUWkAIN_28.jpg 3 pillow 0.778113 False apron 0.095023 False wallet 0.049326 False
976 707021089608753152 https://pbs.twimg.com/media/Cc_XtkRW8AEE7Fn.jpg 2 cocker_spaniel 0.559658 True golden_retriever 0.314673 True Pekinese 0.066672 True
997 708356463048204288 https://pbs.twimg.com/media/CdSWcc1XIAAXc6H.jpg 2 pug 0.871283 True French_bulldog 0.041820 True bath_towel 0.015228 False
1332 757725642876129280 https://pbs.twimg.com/media/CoP7c4bWcAAr55g.jpg 2 seat_belt 0.425176 False Labrador_retriever 0.128128 True Siamese_cat 0.091241 False
In [52]:
sum(df_predictions.tweet_id.duplicated())
Out[52]:
0
In [53]:
#check duplicated images
sum(df_predictions.jpg_url.duplicated()), sum(df_predictions.jpg_url.duplicated(keep=False))
Out[53]:
(66, 132)
In [54]:
df_predictions[df_predictions.jpg_url.duplicated(keep=False)].sort_values(by=['jpg_url']).head(10)
Out[54]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
480 675354435921575936 https://pbs.twimg.com/ext_tw_video_thumb/67535... 1 upright 0.303415 False golden_retriever 0.181351 True Brittany_spaniel 0.162084 True
1297 752309394570878976 https://pbs.twimg.com/ext_tw_video_thumb/67535... 1 upright 0.303415 False golden_retriever 0.181351 True Brittany_spaniel 0.162084 True
1864 842892208864923648 https://pbs.twimg.com/ext_tw_video_thumb/80710... 1 Chihuahua 0.505370 True Pomeranian 0.120358 True toy_terrier 0.077008 True
1641 807106840509214720 https://pbs.twimg.com/ext_tw_video_thumb/80710... 1 Chihuahua 0.505370 True Pomeranian 0.120358 True toy_terrier 0.077008 True
1703 817181837579653120 https://pbs.twimg.com/ext_tw_video_thumb/81596... 1 Tibetan_mastiff 0.506312 True Tibetan_terrier 0.295690 True otterhound 0.036251 True
1691 815966073409433600 https://pbs.twimg.com/ext_tw_video_thumb/81596... 1 Tibetan_mastiff 0.506312 True Tibetan_terrier 0.295690 True otterhound 0.036251 True
1705 817423860136083457 https://pbs.twimg.com/ext_tw_video_thumb/81742... 1 ice_bear 0.336200 False Samoyed 0.201358 True Eskimo_dog 0.186789 True
1858 841833993020538882 https://pbs.twimg.com/ext_tw_video_thumb/81742... 1 ice_bear 0.336200 False Samoyed 0.201358 True Eskimo_dog 0.186789 True
1715 819004803107983360 https://pbs.twimg.com/media/C12whDoVEAALRxa.jpg 1 standard_poodle 0.351308 True toy_poodle 0.271929 True Tibetan_terrier 0.094759 True
1718 819015337530290176 https://pbs.twimg.com/media/C12whDoVEAALRxa.jpg 1 standard_poodle 0.351308 True toy_poodle 0.271929 True Tibetan_terrier 0.094759 True
  • some images are duplicated, even though the tweet IDs are not
  • each is duplicated only once, the records are not consecutive
  • this may result from retweeting or sth similar
In [55]:
df_predictions[(df_predictions.p1_dog==False) & (df_predictions.p2_dog==False) & (df_predictions.p3_dog==False)].sample(10)
Out[55]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
1393 767191397493538821 https://pbs.twimg.com/media/CqWcgcqWcAI43jm.jpg 1 patio 0.708665 False boathouse 0.110056 False pier 0.039532 False
701 684880619965411328 https://pbs.twimg.com/media/CYEvSaRWwAAukZ_.jpg 1 clog 0.081101 False spindle 0.066957 False agama 0.060884 False
230 670408998013820928 https://pbs.twimg.com/media/CU3FbQgVAAACdCQ.jpg 1 ping-pong_ball 0.999945 False tennis_ball 0.000018 False racket 0.000015 False
1696 816450570814898180 https://pbs.twimg.com/media/C1SddosXUAQcVR1.jpg 1 web_site 0.352857 False envelope 0.060107 False nail 0.031291 False
1818 834167344700198914 https://pbs.twimg.com/media/C5OOxY6WAAAxERz.jpg 1 ox 0.991682 False bison 0.005335 False water_buffalo 0.001130 False
390 673363615379013632 https://pbs.twimg.com/media/CVhEoq4WcAE8pBm.jpg 1 ox 0.193431 False warthog 0.123827 False bison 0.111177 False
6 666051853826850816 https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg 1 box_turtle 0.933012 False mud_turtle 0.045885 False terrapin 0.017885 False
537 676948236477857792 https://pbs.twimg.com/media/CWUA1GFW4AAowiq.jpg 1 guenon 0.611603 False macaque 0.135176 False squirrel_monkey 0.083247 False
990 707995814724026368 https://pbs.twimg.com/media/CdNOb17WwAA5z4A.jpg 1 agama 0.172087 False Gila_monster 0.126978 False lumbermill 0.050400 False
533 676897532954456065 https://pbs.twimg.com/media/CWTSt0UW4AALMNB.jpg 1 hamster 0.628255 False guinea_pig 0.318646 False macaque 0.013058 False
In [56]:
len(df_predictions[(df_predictions.p1_dog==False) & (df_predictions.p2_dog==False) & (df_predictions.p3_dog==False)])
Out[56]:
324

For 324 records, no dog has been predicted. Looking at the pictures in question, this is often accurate, but at other times they are pictures in which dogs are very hard to predict because they only occupy a small proportion of the photo, or are somehow disguised, eg. wearing glasses, wigs etc.

In [57]:
#for how many is confidence of prediction 1 higher than 2, and 2 higher than 3?
sum(df_predictions.p1_conf > df_predictions.p2_conf), sum(df_predictions.p2_conf > df_predictions.p3_conf)
Out[57]:
(2075, 2075)

The predictions are ranked according the confidence

Assessing dataframe tidiness

the number of rows differ between the 3 dataframes:

In [58]:
df_archive.shape, df_additional.shape, df_predictions.shape
Out[58]:
((2356, 17), (2339, 9), (2075, 12))
In [59]:
#df_additional was constructed using tweet_ids from df_archive, yet it has fewer rows
#sample of tweets missing from df_additional, using isin() and tilde for boolean indexing

df_archive[~df_archive.tweet_id.isin(df_additional.tweet_id)].sample(5) #https://stackoverflow.com/a/19960116
Out[59]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
818 770743923962707968 NaN NaN 2016-08-30 22:04:05 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: Here's a doggo blowing bubbles.... 7.392382e+17 4.196984e+09 2016-06-04 23:31:25 +0000 https://twitter.com/dog_rates/status/739238157... 13 10 None doggo None None None
247 845459076796616705 NaN NaN 2017-03-25 02:15:26 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: Here's a heartwarming scene of ... 7.562885e+17 4.196984e+09 2016-07-22 00:43:32 +0000 https://twitter.com/dog_rates/status/756288534... 12 10 None None None None None
95 873697596434513921 NaN NaN 2017-06-11 00:25:14 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: This is Walter. He won't start ... 8.688804e+17 4.196984e+09 2017-05-28 17:23:24 +0000 https://twitter.com/dog_rates/status/868880397... 14 10 Walter None None None None
506 812747805718642688 NaN NaN 2016-12-24 19:52:31 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: Meet Sammy. At first I was like... 6.800555e+17 4.196984e+09 2015-12-24 16:00:30 +0000 https://twitter.com/dog_rates/status/680055455... 10 10 Sammy None None None None
101 872668790621863937 NaN NaN 2017-06-08 04:17:07 +0000 <a href="http://twitter.com/download/iphone" r... RT @loganamnosis: Penelope here is doing me qu... 8.726576e+17 1.547674e+08 2017-06-08 03:32:35 +0000 https://twitter.com/loganamnosis/status/872657... 14 10 None None None None None
In [60]:
#make a pd.Series of the deleted tweets and check if it explains the difference in number of records
deleted_tweets=df_archive[~df_archive.tweet_id.isin(df_additional.tweet_id)].tweet_id
deleted_tweets.size+df_additional.shape[0]==df_archive.shape[0]
Out[60]:
True
In [61]:
#are the deleted tweets in the image prediction?
df_predictions[df_predictions.tweet_id.isin(deleted_tweets)]
Out[61]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
607 680055455951884288 https://pbs.twimg.com/media/CW-ZRC_WQAAyFrL.jpg 1 Samoyed 0.995466 True Great_Pyrenees 0.001834 True Pomeranian 0.000667 True
1309 754011816964026368 https://pbs.twimg.com/media/CnbJuPoXEAAjcVF.jpg 1 French_bulldog 0.600985 True Boston_bull 0.273176 True boxer 0.056772 True
1615 802247111496568832 https://pbs.twimg.com/media/Cs_DYr1XEAA54Pu.jpg 1 Chihuahua 0.721188 True toy_terrier 0.112943 True kelpie 0.053365 True
1835 837012587749474308 https://pbs.twimg.com/media/C52pYJXWgAA2BEf.jpg 1 toilet_tissue 0.186387 False cowboy_hat 0.158555 False sombrero 0.149470 False
1864 842892208864923648 https://pbs.twimg.com/ext_tw_video_thumb/80710... 1 Chihuahua 0.505370 True Pomeranian 0.120358 True toy_terrier 0.077008 True
1944 861769973181624320 https://pbs.twimg.com/media/CzG425nWgAAnP7P.jpg 2 Arabian_camel 0.366248 False house_finch 0.209852 False cocker_spaniel 0.046403 True
1985 872261713294495745 https://pbs.twimg.com/media/DBrlZk2UQAAfAkd.jpg 2 Labrador_retriever 0.972019 True flat-coated_retriever 0.008178 True Chesapeake_Bay_retriever 0.007359 True
1992 873697596434513921 https://pbs.twimg.com/media/DA7iHL5U0AA1OQo.jpg 1 laptop 0.153718 False French_bulldog 0.099984 True printer 0.077130 False
2055 888202515573088257 https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg 2 Pembroke 0.809197 True Rhodesian_ridgeback 0.054950 True beagle 0.038915 True
In [62]:
#number of deleted tweets in image predictions, and total number of deleted tweets
df_predictions[df_predictions.tweet_id.isin(deleted_tweets)].shape[0],deleted_tweets.size
Out[62]:
(9, 17)

9 of the 17 deleted tweets are also in df_predictions.

What about the difference in record numbers between df_predictions and the other two dataframes?

In [63]:
df_predictions[~df_predictions.tweet_id.isin(df_archive.tweet_id)].shape[0]
Out[63]:
0
In [64]:
len(df_archive[~df_archive.tweet_id.isin(df_predictions.tweet_id)])
Out[64]:
281
In [65]:
df_additional[~df_additional.tweet_id.isin(df_predictions.tweet_id)][['tweet_id','is_retweet','is_quote','is_reply']].head()
Out[65]:
tweet_id is_retweet is_quote is_reply
6 878604707211726852 False True False
7 879674319642796034 False False True
12 885518971528720385 False False False
21 886267009285017600 False False True
22 873337748698140672 True False False

to sum up:

  • 17 tweets are missing from df_additional compared to df_archive
  • 9 of these tweets missing from df_additional are present in df_predictions
  • all tweet ids in df_predictions are also present in df_archive
  • however there is no image prediction for 281 tweets present in df_additional - these are commonly either a reply, retweet or quote-tweet
  • 66 image predictions are duplicated images - though the tweet IDs are unique
  • 324 of the image predictions don't contain a dog

Why is there no prediction for 281 tweets?

In [66]:
#calculate the tweed ids that are missing for which there is also no expanded_url - i.e. no image
missing = (df_archive.expanded_urls.isnull())  & (~df_archive.tweet_id.isin(df_predictions.tweet_id)) 
sum(missing)
Out[66]:
59
In [67]:
#list of Tweet IDs not present in predictions
missing_predictions = df_archive[~df_archive.tweet_id.isin(df_predictions.tweet_id)].tweet_id.tolist()

#IDs of tweets that are either a retweet, quote, or reply
not_original_tweet = df_additional[df_additional.is_retweet | df_additional.is_quote | df_additional.is_reply].tweet_id.tolist()

#how many IDs are a retweet, quote, or reply, i.e. in the intersection of missing_predictions and not_original_tweet
len(set(missing_predictions).intersection(set(not_original_tweet)))
Out[67]:
170

Many tweets for which there is no image prediction available either did not have a photo attached that is available, or they are a reply/retweet/quote-tweet -- these tweets tend to not follow the standard format of the WeRateDogs dog rating tweets.

In [ ]:
 

Summary

discuss 4 data quality dimensions: missing, accuracy etc

Data quality

df_additional (tweepy):

  1. missing rows compared to df_archive, due to deleted tweets

df_archive:

  1. timestamp and retweeted_status_timestamp are strings not datetime
  • tweet_id is integer but other unique identifier variables (retweeted_status_id,retweeted_status_user_id, in_reply_to_status_id and in_reply_to_user_id) are floats (wrong dtype)
  • some dog names are not names, some have been missed
  • some dog ratings are invalid or inaccurate, usually affecting rating_denominator and rating_numerator

image predictions:

  1. image predictions: column names are not descriptive
  • image predictions: inconsistent capitalisation of breed names
  • duplicated images lead to identical predictions, despite unique tweet_id

Data tidyness

  1. dog stage variable is spread over several columns (doggo,floofer,pupper, puppo)
  • df_additional: is_reply, is_retweet, is_quote could be same variable
  • tweet each table an observational unit : df_archive and df_additional are same observational unit (tweet), while dog predictions could potentially regarded as separate obsevrational unit

4 - Data cleaning

In [68]:
##make copies of all dataframes to be cleaned
df_archive_copy = df_archive.copy()
df_predictions_copy = df_predictions.copy()
df_additional_copy = df_additional.copy()

4.1 - Data quality : Incompleteness - dealing with missing rows in df_additional

17 rows in df_archive are missing from df_additional because the corresponding tweets have since been deleted, and 9 of these are also present in df_predictions. There are several ways of dealing with this.

One possibility is to reconstruct data about the missing tweets from df_archive and add those to df_additional. However, the main purpose of obtaining additional data using the Twitter API (in df_additional) is to complement df_archive, and this is unrecoverable for the missing tweets. Nothing would be gained with this approach.

Therefore a better choice would be to delete the rows representing deleted tweets in df_archive and df_predictions.

Define:

  • remove 17 rows from df_archive
  • remove 9 rows from df_predictions

Code:

In [69]:
df_archive_copy = df_archive_copy[df_archive_copy.tweet_id.isin(df_additional.tweet_id)]
df_predictions_copy = df_predictions_copy[df_predictions_copy.tweet_id.isin(df_additional.tweet_id)]

Test:

In [70]:
#check nb of rows has decreased
df_archive.shape, df_archive_copy.shape, df_archive.shape > df_archive_copy.shape
Out[70]:
((2356, 17), (2339, 17), True)
In [71]:
df_predictions.shape, df_predictions_copy.shape, df_predictions.shape > df_predictions_copy.shape
Out[71]:
((2075, 12), (2066, 12), True)
In [72]:
#check row number of cleaned df_archive is the same as df_additional
df_archive_copy.shape[0] == df_additional.shape[0]
Out[72]:
True
In [73]:
#make dataframe of tweets absent from df_additional
deleted_tweets = df_archive[~df_archive.tweet_id.isin(df_additional.tweet_id)]

#check all the tweet_ids in the cleaned dataframes are absent from the deleted tweets dataframe
assert sum(df_archive_copy.tweet_id.isin(deleted_tweets.tweet_id)) == 0
assert sum(df_predictions_copy.tweet_id.isin(deleted_tweets.tweet_id)) == 0
In [74]:
assert sum(df_archive_copy.tweet_id.isin(deleted_tweets.tweet_id)) == 0
In [75]:
assert sum(df_predictions_copy.tweet_id.isin(deleted_tweets.tweet_id)) == 0
In [76]:
#compute the number of Tweet IDs for which there is no image prediction after removing missing tweets
df_additional_copy.shape[0] - sum(df_additional_copy.tweet_id.isin(df_predictions_copy.tweet_id))
Out[76]:
273

4.2 - Data tidying: dog stage

Define

Four columns represent "dog stage" i.e. the column names are values that represent a single variable of dog-stage. However, there are more than 4 different values which this variable can take, as many have no dog stage associated, and in some cases there are several dogs in the picture. This makes the task more complex.

  1. Use a system where the variable dog stage takes on one of 6 different values: "doggo", "floofer", "pupper", "puppo", "several dogs", "no stage".
    • Create columns with booleans stating whether or not the tweet has the dog stage of the column name
    • Then, from the four columns with booleans, create a new column to state whether there are several dog stages, and a further column stating whether there is no dog stage
    • ensure there's only one dog stage in each of the 6 columns, or "None"
  • use merge to create temporary table with a column for the variable "dog stage" and tweet_id
    • merge leads to duplicated tweet_ids due to "None" values, so drop rows with "None"
    • convert type to category
    • check there are no duplicated and no missing tweetids
  • merge dog stage variable from temporary df back into df_archive_copy, omitting redundant columns

Code

In [77]:
#check head of 4 last columns ie. dog stage
df_archive_copy.iloc[:1,13:]
Out[77]:
doggo floofer pupper puppo
0 None None None None
In [78]:
#create columns with boolean values for each dog stage
df_archive_copy['doggo_b'] = df_archive_copy.iloc[:,13]!="None"
df_archive_copy['floofer_b'] = df_archive_copy.iloc[:,14]!="None"
df_archive_copy['pupper_b'] = df_archive_copy.iloc[:,15]!="None"
df_archive_copy['puppo_b'] = df_archive_copy.iloc[:,16]!="None"
In [79]:
#calculate sum of boolean values for dog stage
#new column - will be 0 when there is none, and 2,3 or 4 if there is more than 1 dog stage value
df_archive_copy['several'] = df_archive_copy['doggo_b'].astype(int) + df_archive_copy['floofer_b'].astype(int) + df_archive_copy['pupper_b'].astype(int) + df_archive_copy['puppo_b'].astype(int) 

#new column with boolean to say when there has been no dog stage  defined
df_archive_copy['none'] = df_archive_copy['several']==0 

#convert column to boolean - True = there are several stages
df_archive_copy['several'] = df_archive_copy['several']>1
In [80]:
#Look at records with several dog stages
mask=df_archive_copy.several==True
columns=['doggo', 'floofer', 'pupper', 'puppo']
df_archive_copy.loc[mask,columns]
Out[80]:
doggo floofer pupper puppo
191 doggo None None puppo
200 doggo floofer None None
460 doggo None pupper None
531 doggo None pupper None
565 doggo None pupper None
575 doggo None pupper None
705 doggo None pupper None
733 doggo None pupper None
778 doggo None pupper None
822 doggo None pupper None
889 doggo None pupper None
956 doggo None pupper None
1063 doggo None pupper None
1113 doggo None pupper None
In [81]:
#need to ensure the final dog stage variable can only take one value
#where there are several dog stages, overwrite values in the original columns with "None"
df_archive_copy.loc[mask,columns] = df_archive_copy.loc[mask,columns].replace(to_replace=['doggo','floofer','pupper','puppo'],
                                                                 value='None')
#check
df_archive_copy.loc[mask,columns]
Out[81]:
doggo floofer pupper puppo
191 None None None None
200 None None None None
460 None None None None
531 None None None None
565 None None None None
575 None None None None
705 None None None None
733 None None None None
778 None None None None
822 None None None None
889 None None None None
956 None None None None
1063 None None None None
1113 None None None None
In [82]:
#Now change boolean values in 'several' and 'no stage' to strings 
#so that they can be used by merge in the same way as the other columns
df_archive_copy.loc[:,'several'].replace(True,"several dogs", inplace=True)
df_archive_copy.loc[:,'none'].replace(True,"no stage", inplace=True)
df_archive_copy.loc[:,'several'].replace(False,"None", inplace=True)
df_archive_copy.loc[:,'none'].replace(False,"None", inplace=True)

#drop the columns with booleans no longer needed
df_archive_copy.drop([ 'doggo_b', 'floofer_b', 'pupper_b', 'puppo_b'],axis=1, inplace=True)
In [83]:
#check values are either None or dog stage in single column
df_archive_copy.iloc[:15,13:]
Out[83]:
doggo floofer pupper puppo several none
0 None None None None None no stage
1 None None None None None no stage
2 None None None None None no stage
3 None None None None None no stage
4 None None None None None no stage
5 None None None None None no stage
6 None None None None None no stage
7 None None None None None no stage
8 None None None None None no stage
9 doggo None None None None None
10 None None None None None no stage
11 None None None None None no stage
12 None None None puppo None None
13 None None None None None no stage
14 None None None puppo None None
In [84]:
#use melt to create temporary dataframe column for dog stage variable
df_dog_stages = pd.melt(df_archive_copy, id_vars='tweet_id',
        value_vars=['doggo','floofer','pupper','puppo','several','none'],
        var_name = 'stages', value_name = 'dog_stage')
#check 
df_dog_stages.head(15)
Out[84]:
tweet_id stages dog_stage
0 892420643555336193 doggo None
1 892177421306343426 doggo None
2 891815181378084864 doggo None
3 891689557279858688 doggo None
4 891327558926688256 doggo None
5 891087950875897856 doggo None
6 890971913173991426 doggo None
7 890729181411237888 doggo None
8 890609185150312448 doggo None
9 890240255349198849 doggo doggo
10 890006608113172480 doggo None
11 889880896479866881 doggo None
12 889665388333682689 doggo None
13 889638837579907072 doggo None
14 889531135344209921 doggo None
In [85]:
#there should be duplication due to 5 "None" values for each record
df_dog_stages.dog_stage.value_counts()
Out[85]:
None            11695
no stage         1961
pupper            244
doggo              82
puppo              29
several dogs       14
floofer             9
Name: dog_stage, dtype: int64
In [86]:
#delete rows with "None", drop redundant column
df_dog_stages=df_dog_stages[df_dog_stages.dog_stage!="None"]
df_dog_stages.drop('stages', axis=1, inplace=True)
#convert to category
df_dog_stages.dog_stage = df_dog_stages.dog_stage.astype('category')
#check 'None' is removed
df_dog_stages.dog_stage.value_counts()
Out[86]:
no stage        1961
pupper           244
doggo             82
puppo             29
several dogs      14
floofer            9
Name: dog_stage, dtype: int64
In [87]:
#check df_dog_stages and df_archive_copy have same length, 
df_dog_stages.shape[0],  df_archive_copy.shape[0], df_dog_stages.shape[0] == df_archive_copy.shape[0]
Out[87]:
(2339, 2339, True)
In [88]:
df_dog_stages.head()
Out[88]:
tweet_id dog_stage
9 890240255349198849 doggo
42 884162670584377345 doggo
97 872967104147763200 doggo
104 871515927908634625 doggo
106 871102520638267392 doggo
In [89]:
#merge into df_archive_copy, using tweet_id, removing now redundant old dog stage columns
df_archive_copy = pd.merge(df_archive_copy.iloc[:,:13],df_dog_stages,on='tweet_id', how='left')

Test:

In [90]:
#check column structure
list(df_archive_copy)
Out[90]:
['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'dog_stage']
In [91]:
#dtype should be categorical
df_archive_copy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2339 entries, 0 to 2338
Data columns (total 14 columns):
tweet_id                      2339 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2339 non-null object
source                        2339 non-null object
text                          2339 non-null object
retweeted_status_id           167 non-null float64
retweeted_status_user_id      167 non-null float64
retweeted_status_timestamp    167 non-null object
expanded_urls                 2280 non-null object
rating_numerator              2339 non-null int64
rating_denominator            2339 non-null int64
name                          2339 non-null object
dog_stage                     2339 non-null category
dtypes: category(1), float64(4), int64(3), object(6)
memory usage: 258.3+ KB
In [92]:
#check values
df_archive_copy.dog_stage.value_counts()
Out[92]:
no stage        1961
pupper           244
doggo             82
puppo             29
several dogs      14
floofer            9
Name: dog_stage, dtype: int64

4.3 - Data quality: Type of tweet - reply, retweet, quote tweet, or original tweet

Define

df_additional has several columns with boolean values stating if a tweet is a retweet, reply or quote. These can be better expressed as a single variable of different exclusive categories of tweet: Original tweet, retweet of own tweet (SelfRT), retweet of tweet from another source (OtherRT), reply, or quote tweet.

Code

In [93]:
#using boolean indexing for self-retweets
df_additional_copy[(df_additional_copy.retweet_user.notnull()) & (df_additional_copy.retweet_user == df_additional_copy.user_name)].head()
Out[93]:
tweet_id user_name retweet_count favorite_count is_retweet retweet_user is_quote quote_user is_reply
22 873337748698140672 WeRateDogs™ 1549 0 True WeRateDogs™ False None False
37 874434818259525634 WeRateDogs™ 14349 0 True WeRateDogs™ False None False
50 879130579576475649 WeRateDogs™ 6620 0 True WeRateDogs™ False None False
53 878404777348136964 WeRateDogs™ 1250 0 True WeRateDogs™ False None False
60 885311592912609280 WeRateDogs™ 17974 0 True WeRateDogs™ False None False
In [94]:
#new column with boolean to state if a tweet is a self-retweet
df_additional_copy['is_self_retweet'] = (df_additional_copy.retweet_user.notnull()) & (df_additional_copy.retweet_user == df_additional_copy.user_name)
In [95]:
#using boolean indexing for retweets from other source
df_additional_copy[(df_additional_copy.retweet_user.notnull()) & (df_additional_copy.retweet_user != df_additional_copy.user_name)].head()
Out[95]:
tweet_id user_name retweet_count favorite_count is_retweet retweet_user is_quote quote_user is_reply is_self_retweet
2 877611172832227328 WeRateDogs™ 78 0 True Rachel Baker False None False False
54 886054160059072513 WeRateDogs™ 105 0 True Ramón Laureano’s Arm 💪🏽 False None False False
102 856330835276025856 WeRateDogs™ 689 0 True Jenna Marbles False None False False
113 860924035999428608 WeRateDogs™ 823 0 True tally False None False False
153 855138241867124737 WeRateDogs™ 46 0 True ♪{•.•}* • banditø 8/3/19 False None False False
In [96]:
#new column with boolean to state if a tweet is a retweet from other source
df_additional_copy['is_other_retweet'] = (df_additional_copy.retweet_user.notnull()) & (df_additional_copy.retweet_user != df_additional_copy.user_name)
In [97]:
#new column to state if tweet is original tweet, i.e. those where is_retweet, is_quote, is_reply are all False
df_additional_copy['is_original'] = (df_additional_copy.is_retweet | df_additional_copy.is_quote | df_additional_copy.is_reply) == False
In [98]:
#each tweet should belong to one of the following categories: 'is_quote', 'is_reply', 'is_self_retweet', 
#'is_other_retweet', 'is_original'
#therefore their sum should be the same as the number of rows.
df_additional_copy.shape[0] == sum(df_additional_copy.is_quote) + sum(df_additional_copy.is_reply) + sum(df_additional_copy.is_self_retweet) + sum(df_additional_copy.is_other_retweet) + sum(df_additional_copy.is_original)
Out[98]:
True
In [99]:
#Replace True with the corresponding string
df_additional_copy.is_quote = df_additional_copy.is_quote.replace(True,"Quote")
df_additional_copy.is_reply = df_additional_copy.is_reply.replace(True,"Reply")
df_additional_copy.is_self_retweet = df_additional_copy.is_self_retweet.replace(True,"SelfRT")
df_additional_copy.is_other_retweet = df_additional_copy.is_other_retweet.replace(True,"OtherRT")
df_additional_copy.is_original = df_additional_copy.is_original.replace(True,"Original")

#Replace bool False with str
df_additional_copy[['is_quote', 'is_reply', 'is_self_retweet', 'is_other_retweet', 'is_original']] = df_additional_copy[['is_quote', 'is_reply', 'is_self_retweet', 'is_other_retweet', 'is_original']].replace(False,"False")
In [100]:
#use melt, also dropping variables no longer needed by leaving them out from id_vars
df_additional_copy = pd.melt(df_additional_copy, id_vars=['tweet_id', 'retweet_count', 'favorite_count'],#['tweet_id', 'user_name', 'retweet_count', 'favorite_count', 'is_retweet', 'retweet_user', 'quote_user'],
        value_vars=[ 'is_quote', 'is_reply', 'is_self_retweet', 'is_other_retweet', 'is_original'],
       var_name = 'is_type', value_name = 'tweet_type')
#check result
df_additional_copy.head()
Out[100]:
tweet_id retweet_count favorite_count is_type tweet_type
0 876484053909872640 2327 18309 is_quote False
1 886366144734445568 3100 20637 is_quote False
2 877611172832227328 78 0 is_quote False
3 888078434458587136 3378 21193 is_quote False
4 891689557279858688 8373 41028 is_quote False
In [101]:
#drop rows with False tweet_type
df_additional_copy = df_additional_copy[df_additional_copy.tweet_type!='False']
#convert to category
df_additional_copy.tweet_type = df_additional_copy.tweet_type.astype('category')
#drop unneeded column
df_additional_copy.drop('is_type', axis=1, inplace=True)

Test

In [102]:
#check cleaned dataframe structure
df_additional_copy.head()
Out[102]:
tweet_id retweet_count favorite_count tweet_type
6 878604707211726852 6996 29473 Quote
68 876537666061221889 4519 22980 Quote
75 875097192612077568 5903 26755 Quote
93 884247878851493888 19881 71022 Quote
114 866720684873056260 4775 19841 Quote
In [103]:
#check data type
df_additional_copy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2339 entries, 6 to 11694
Data columns (total 4 columns):
tweet_id          2339 non-null int64
retweet_count     2339 non-null int64
favorite_count    2339 non-null int64
tweet_type        2339 non-null category
dtypes: category(1), int64(3)
memory usage: 75.6 KB
In [104]:
#check there are no False values left
df_additional_copy.tweet_type.value_counts()
Out[104]:
Original    2071
SelfRT       144
Reply         77
Quote         24
OtherRT       23
Name: tweet_type, dtype: int64
In [105]:
#check no records have been lost by checking all tweet_ids of cleaned df are in original df
df_additional_copy.tweet_id.isin(df_additional.tweet_id).shape[0] == df_additional.shape[0]
Out[105]:
True

4.4 - Data tidying: each observational unit a table

According to the principles of data tidiness, each table should match an observational unit. The current structure of our data is in contradiction with this principle, because it is a relatively heterogeneous data set, relating to:

  • tweets, with information on author, whether retweet, reply, quote tweet, its popularity,
  • predictions based on the tweet content, i.e. tweet text (for the name of dog, dog "stage", rating) and tweeted images (whether it's a dog, dog breed)

There are different ways of restructuring the data to better follow tidiness principles :

One possibility is to merge the three dataframes on the basis of the unique tweet IDs. In this case the observational unit would be the data associated with each tweet. However there is no image prediction for 273 tweets, so these records would need to be removed, and we may be interested in analysing these tweets.

A second possibility is to re-structure the data into two dataframes, on the basis of two observational units : (1) tweets from the WeRateDogs Twitter user, including data provided by Twitter ; (2) dog predictions derived from the content of WeRateDogs tweets, in particular, predictions from images posted by the WeRateDogs user, as well as ratings, dog names and dog stages extracted from tweet text.

Only the first option is deemed to be correct by the project reviewer, although it implies losing data on tweets for which there are no image predictions and which will not be available for later analysis.

Define

  1. merge df_archive_copy and df_additional_copy - both dataframes should have the same IDs due to the previous cleaning step (4.1)
  • merge df_archive_copy and df_predictions_copy - this will lead to the removal of several hundred tweets

Code:

In [106]:
#first merge step
df_archive_copy = pd.merge(df_archive_copy,df_additional_copy,on='tweet_id')
#before moving to second step, check the number of rows is unchanged
df_archive_copy.shape[0] == df_additional_copy.shape[0]
Out[106]:
True
In [107]:
#second merge step - only keep rows that are in df_predictions_copy
df_archive_copy = pd.merge(df_archive_copy, df_predictions_copy,on='tweet_id', how='right')

Test:

In [108]:
#check the number of rows is now the  same as df_predictions_copy
df_archive_copy.shape[0] == df_predictions_copy.shape[0]
Out[108]:
True
In [109]:
#check all the column names of df_additional_copy and df_predictions_copy are still in df_archive_copy
assert set(df_additional_copy).intersection(set(df_archive_copy)) == set(df_additional_copy)
assert set(df_predictions_copy).intersection(set(df_archive_copy)) == set(df_predictions_copy)
In [110]:
#check the new dataframe structure
df_archive_copy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2066 entries, 0 to 2065
Data columns (total 28 columns):
tweet_id                      2066 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2066 non-null object
source                        2066 non-null object
text                          2066 non-null object
retweeted_status_id           75 non-null float64
retweeted_status_user_id      75 non-null float64
retweeted_status_timestamp    75 non-null object
expanded_urls                 2066 non-null object
rating_numerator              2066 non-null int64
rating_denominator            2066 non-null int64
name                          2066 non-null object
dog_stage                     2066 non-null category
retweet_count                 2066 non-null int64
favorite_count                2066 non-null int64
tweet_type                    2066 non-null category
jpg_url                       2066 non-null object
img_num                       2066 non-null int64
p1                            2066 non-null object
p1_conf                       2066 non-null float64
p1_dog                        2066 non-null bool
p2                            2066 non-null object
p2_conf                       2066 non-null float64
p2_dog                        2066 non-null bool
p3                            2066 non-null object
p3_conf                       2066 non-null float64
p3_dog                        2066 non-null bool
dtypes: bool(3), category(2), float64(7), int64(6), object(10)
memory usage: 397.9+ KB
In [111]:
#find out number of rows, and how many were lost
df_archive_copy.shape[0], df_additional_copy.shape[0] - df_archive_copy.shape[0]
Out[111]:
(2066, 273)
In [112]:
#which types of tweets were deleted?
df_additional_copy[~df_additional_copy.tweet_id.isin(df_archive_copy.tweet_id)].tweet_type.value_counts()
Out[112]:
Original    103
SelfRT       83
Reply        54
Quote        24
OtherRT       9
Name: tweet_type, dtype: int64
In [113]:
#which types of tweets remain?
df_archive_copy.tweet_type.value_counts()
Out[113]:
Original    1968
SelfRT        61
Reply         23
OtherRT       14
Quote          0
Name: tweet_type, dtype: int64

4.5 - Data quality: format of timestamps

Define:

Two columns with timestamps of when a tweet was posted,or a retweet was originally posted, were read in as strings.

  • Convert timestamp and retweeted_status_timestamp to datetime.

Code:

In [114]:
df_archive_copy.timestamp = pd.to_datetime(df_archive_copy.timestamp)
df_archive_copy.retweeted_status_timestamp = pd.to_datetime(df_archive_copy.retweeted_status_timestamp)

Test

In [115]:
df_archive_copy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2066 entries, 0 to 2065
Data columns (total 28 columns):
tweet_id                      2066 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2066 non-null datetime64[ns]
source                        2066 non-null object
text                          2066 non-null object
retweeted_status_id           75 non-null float64
retweeted_status_user_id      75 non-null float64
retweeted_status_timestamp    75 non-null datetime64[ns]
expanded_urls                 2066 non-null object
rating_numerator              2066 non-null int64
rating_denominator            2066 non-null int64
name                          2066 non-null object
dog_stage                     2066 non-null category
retweet_count                 2066 non-null int64
favorite_count                2066 non-null int64
tweet_type                    2066 non-null category
jpg_url                       2066 non-null object
img_num                       2066 non-null int64
p1                            2066 non-null object
p1_conf                       2066 non-null float64
p1_dog                        2066 non-null bool
p2                            2066 non-null object
p2_conf                       2066 non-null float64
p2_dog                        2066 non-null bool
p3                            2066 non-null object
p3_conf                       2066 non-null float64
p3_dog                        2066 non-null bool
dtypes: bool(3), category(2), datetime64[ns](2), float64(7), int64(6), object(8)
memory usage: 397.9+ KB
In [116]:
df_archive_copy[['timestamp','retweeted_status_timestamp']].sample(10)
Out[116]:
timestamp retweeted_status_timestamp
313 2017-01-27 01:49:15 2016-11-06 01:33:58
2030 2015-11-17 02:06:42 NaT
607 2016-09-20 01:12:28 NaT
539 2016-10-20 16:15:26 NaT
1576 2015-12-12 03:47:46 NaT
1536 2015-12-15 17:11:09 NaT
880 2016-06-04 23:31:25 NaT
1605 2015-12-10 03:30:58 NaT
1617 2015-12-09 20:40:38 NaT
1254 2016-01-28 02:12:04 NaT

note that null values are handled correctly

4.6 - Data quality: status IDs should not be floats

Define

Twitter uses unique IDs for user and tweet id. read_csv assigned inconsistent data types : float where values are missing, and integer to the others. Floats are displayed in scientific notation, which obscures their only utility of being unique identifiers.

To achieve a consistent data type, all of these could be converted to strings.

  • replace missing values by 0
  • convert IDs to strings

Code

In [117]:
#check current data types and column names
df_archive_copy.iloc[:,[0,1,2,6,7]].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2066 entries, 0 to 2065
Data columns (total 5 columns):
tweet_id                    2066 non-null int64
in_reply_to_status_id       23 non-null float64
in_reply_to_user_id         23 non-null float64
retweeted_status_id         75 non-null float64
retweeted_status_user_id    75 non-null float64
dtypes: float64(4), int64(1)
memory usage: 96.8 KB
In [118]:
#convert to strings, replace missing with "none"
df_archive_copy[['tweet_id','in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 
            'retweeted_status_user_id']] =  df_archive_copy[['tweet_id','in_reply_to_status_id','in_reply_to_user_id',
                                                        'retweeted_status_id',
                                                        'retweeted_status_user_id']].fillna(0).astype(int).astype(str).replace("0","none")

Test

In [119]:
df_archive_copy.iloc[:,[0,1,2,6,7]].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2066 entries, 0 to 2065
Data columns (total 5 columns):
tweet_id                    2066 non-null object
in_reply_to_status_id       2066 non-null object
in_reply_to_user_id         2066 non-null object
retweeted_status_id         2066 non-null object
retweeted_status_user_id    2066 non-null object
dtypes: object(5)
memory usage: 96.8+ KB
In [120]:
df_archive_copy[['tweet_id','in_reply_to_status_id', 'in_reply_to_user_id', 
            'retweeted_status_id', 'retweeted_status_user_id']].sample(10)
Out[120]:
tweet_id in_reply_to_status_id in_reply_to_user_id retweeted_status_id retweeted_status_user_id
46 882268110199369728 none none none none
277 830583320585068544 none none none none
559 785872687017132033 none none none none
944 726935089318363137 none none none none
44 882762694511734784 none none none none
696 762316489655476224 none none none none
447 803276597545603072 none none none none
690 763837565564780549 none none none none
1262 691675652215414786 none none none none
931 728751179681943552 none none none none

4.7 - Data quality: Wrong dog names

Define:

Some dog names have not been extracted properly. Some are not names - identifiably in lower-case - and some may have been missed in fields with the value "None".

  • Where names are wrong or missing in , attempt to re-extract names from tweet text

Code:

In [121]:
#create a mask to select records where the name is either None or lower-case
mask = (df_archive_copy.name == 'None') | (df_archive_copy.name.str.islower())
sum(mask)
Out[121]:
675
In [122]:
df_archive_copy[mask].name.value_counts()
Out[122]:
None            575
a                55
the               7
an                7
very              4
just              4
one               4
quite             3
getting           2
not               1
space             1
my                1
incredibly        1
light             1
all               1
this              1
actually          1
his               1
infuriating       1
by                1
unacceptable      1
officially        1
such              1
Name: name, dtype: int64
In [123]:
#re-extract names as the first word that follows the construction "named" or "name is", and fill null values with "None"
df_archive_copy.loc[mask,'name'] = df_archive_copy[mask].text.str.extract(r'.*named?\s?i?s?\s(?P<name>\w+).*.*', expand=True).fillna("None")
In [124]:
#check for lower-case "names"
df_archive_copy[df_archive_copy.name.str.islower()].name
Out[124]:
121      to
1800     of
2016    are
Name: name, dtype: object
In [125]:
#replace with "None"
mask_lower=df_archive_copy.name.str.islower()
df_archive_copy.loc[mask_lower,"name"] = "None"

Test

In [126]:
#check name values that were previously None or wrong
df_archive_copy[mask].name.value_counts()
Out[126]:
None          645
Tickles         2
Zoey            2
Berta           1
Zeus            1
Wylie           1
Rufus           1
Johm            1
Guss            1
Alphred         1
Pepe            1
Sabertooth      1
Cheryl          1
Klint           1
Alfonso         1
Jessiga         1
Leroi           1
Kohl            1
Spork           1
Daryl           1
Cherokee        1
Thea            1
Big             1
Hemry           1
Kip             1
Jacob           1
Octaviath       1
Chuk            1
Alfredo         1
Name: name, dtype: int64
In [127]:
#check if there are any lower-case names in the dataframe
sum(df_archive_copy.name.str.islower())
Out[127]:
0

4.8 - Data quality: inaccurate/invalid ratings (rating_denominator and rating_numerator)

Define:

The rating_denominator is almost always 10 - where it isn't, it is indicative of either a mistake while extracting the rating (inaccurate), or of a non-standard rating being used (invalid). In both cases the numerator is also affected.

  1. Select rows in which the denominator is different from 10 to make a temporary dataframe
    • where the tweet text contains a rating with a denominator of 10, extract the actual rating
    • where the denominator is divisble by 10, it is due to several dogs being present in the picture - both the denominator and numerator can be corrected
  • Delete records without rating
    • assess the remaining tweets with inaccurate denominators and remove them if they don't follow the standard WRD rating scheme

Code:

In [128]:
#define mask
mask = df_archive_copy.rating_denominator != 10
#create temporary dataframe
wrong_ratings = df_archive_copy.loc[mask,['rating_numerator','rating_denominator','text']]
wrong_ratings
Out[128]:
rating_numerator rating_denominator text
340 84 70 The floofs have been released I repeat the flo...
410 24 7 Meet Sam. She smiles 24/7 &amp; secretly aspir...
729 165 150 Why does this never happen at my front door......
870 9 11 After so many requests, this is Bretagne. She ...
918 204 170 Say hello to this unbelievably well behaved sq...
961 4 20 Happy 4/20 from the squad! 13/10 for all https...
995 50 50 This is Bluebert. He just saw that both #Final...
1016 99 90 Happy Saturday here's 9 puppers on a bench. 99...
1041 80 80 Here's a brigade of puppers. All look very pre...
1059 45 50 From left to right:\nCletus, Jerome, Alejandro...
1125 60 50 Here is a whole flock of puppers. 60/50 I'll ...
1201 44 40 Happy Wednesday here's a bucket of pups. 44/40...
1373 143 130 Two sneaky puppers were not initially seen, mo...
1374 121 110 Someone help the girl is being mugged. Several...
1399 7 11 This is Darrel. He just robbed a 7/11 and is i...
1505 144 120 IT'S PUPPERGEDDON. Total of 144/120 ...I think...
1564 88 80 Here we have an entire platoon of puppers. Tot...
2045 1 2 This is an Albanian 3 1/2 legged Episcopalian...
In [129]:
#extract ratings that follow the pattern numerator/10
wrong_ratings[['corr_numerator','corr_denominator']] = wrong_ratings.text.str.extract(r'.*\s(?P<corr_numerator>\d+)\/(?P<corr_denominator>10).*.*', expand=True)
wrong_ratings
Out[129]:
rating_numerator rating_denominator text corr_numerator corr_denominator
340 84 70 The floofs have been released I repeat the flo... NaN NaN
410 24 7 Meet Sam. She smiles 24/7 &amp; secretly aspir... NaN NaN
729 165 150 Why does this never happen at my front door...... NaN NaN
870 9 11 After so many requests, this is Bretagne. She ... 14 10
918 204 170 Say hello to this unbelievably well behaved sq... NaN NaN
961 4 20 Happy 4/20 from the squad! 13/10 for all https... 13 10
995 50 50 This is Bluebert. He just saw that both #Final... 11 10
1016 99 90 Happy Saturday here's 9 puppers on a bench. 99... NaN NaN
1041 80 80 Here's a brigade of puppers. All look very pre... NaN NaN
1059 45 50 From left to right:\nCletus, Jerome, Alejandro... NaN NaN
1125 60 50 Here is a whole flock of puppers. 60/50 I'll ... NaN NaN
1201 44 40 Happy Wednesday here's a bucket of pups. 44/40... NaN NaN
1373 143 130 Two sneaky puppers were not initially seen, mo... NaN NaN
1374 121 110 Someone help the girl is being mugged. Several... NaN NaN
1399 7 11 This is Darrel. He just robbed a 7/11 and is i... 10 10
1505 144 120 IT'S PUPPERGEDDON. Total of 144/120 ...I think... NaN NaN
1564 88 80 Here we have an entire platoon of puppers. Tot... NaN NaN
2045 1 2 This is an Albanian 3 1/2 legged Episcopalian... 9 10
In [130]:
#fix ratings that are a multiple of 10 - assume multiple is the number of dogs, redefine numerator and denominator
#sampling the pics of corrected records for the presence of multiple dogs
for i,denominator in zip(wrong_ratings.index, wrong_ratings.rating_denominator):
    if (denominator % 10 == 0) and (denominator != 0):
        n_dogs = denominator / 10
        wrong_ratings.loc[i,'corr_numerator'] = wrong_ratings.loc[i,'rating_numerator'] / n_dogs
        wrong_ratings.loc[i,'corr_denominator'] = wrong_ratings.loc[i,'rating_denominator'] / n_dogs
        print("index: ", i)
        print("Tweet text:",wrong_ratings.loc[i,'text']) #check some of the URLs
        print("nb of dogs: {}, corrected numerator : {}, corrected denominator: {}".format(n_dogs,wrong_ratings.loc[i,'corr_numerator'], wrong_ratings.loc[i,'corr_denominator']))
index:  340
Tweet text: The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd
nb of dogs: 7.0, corrected numerator : 12.0, corrected denominator: 10.0
index:  729
Tweet text: Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
nb of dogs: 15.0, corrected numerator : 11.0, corrected denominator: 10.0
index:  918
Tweet text: Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv
nb of dogs: 17.0, corrected numerator : 12.0, corrected denominator: 10.0
index:  961
Tweet text: Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a
nb of dogs: 2.0, corrected numerator : 2.0, corrected denominator: 10.0
index:  995
Tweet text: This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq
nb of dogs: 5.0, corrected numerator : 10.0, corrected denominator: 10.0
index:  1016
Tweet text: Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1
nb of dogs: 9.0, corrected numerator : 11.0, corrected denominator: 10.0
index:  1041
Tweet text: Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12
nb of dogs: 8.0, corrected numerator : 10.0, corrected denominator: 10.0
index:  1059
Tweet text: From left to right:
Cletus, Jerome, Alejandro, Burp, &amp; Titson
None know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK
nb of dogs: 5.0, corrected numerator : 9.0, corrected denominator: 10.0
index:  1125
Tweet text: Here is a whole flock of puppers.  60/50 I'll take the lot https://t.co/9dpcw6MdWa
nb of dogs: 5.0, corrected numerator : 12.0, corrected denominator: 10.0
index:  1201
Tweet text: Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ
nb of dogs: 4.0, corrected numerator : 11.0, corrected denominator: 10.0
index:  1373
Tweet text: Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3
nb of dogs: 13.0, corrected numerator : 11.0, corrected denominator: 10.0
index:  1374
Tweet text: Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55
nb of dogs: 11.0, corrected numerator : 11.0, corrected denominator: 10.0
index:  1505
Tweet text: IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq
nb of dogs: 12.0, corrected numerator : 12.0, corrected denominator: 10.0
index:  1564
Tweet text: Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw
nb of dogs: 8.0, corrected numerator : 11.0, corrected denominator: 10.0
In [131]:
wrong_ratings
Out[131]:
rating_numerator rating_denominator text corr_numerator corr_denominator
340 84 70 The floofs have been released I repeat the flo... 12 10
410 24 7 Meet Sam. She smiles 24/7 &amp; secretly aspir... NaN NaN
729 165 150 Why does this never happen at my front door...... 11 10
870 9 11 After so many requests, this is Bretagne. She ... 14 10
918 204 170 Say hello to this unbelievably well behaved sq... 12 10
961 4 20 Happy 4/20 from the squad! 13/10 for all https... 2 10
995 50 50 This is Bluebert. He just saw that both #Final... 10 10
1016 99 90 Happy Saturday here's 9 puppers on a bench. 99... 11 10
1041 80 80 Here's a brigade of puppers. All look very pre... 10 10
1059 45 50 From left to right:\nCletus, Jerome, Alejandro... 9 10
1125 60 50 Here is a whole flock of puppers. 60/50 I'll ... 12 10
1201 44 40 Happy Wednesday here's a bucket of pups. 44/40... 11 10
1373 143 130 Two sneaky puppers were not initially seen, mo... 11 10
1374 121 110 Someone help the girl is being mugged. Several... 11 10
1399 7 11 This is Darrel. He just robbed a 7/11 and is i... 10 10
1505 144 120 IT'S PUPPERGEDDON. Total of 144/120 ...I think... 12 10
1564 88 80 Here we have an entire platoon of puppers. Tot... 11 10
2045 1 2 This is an Albanian 3 1/2 legged Episcopalian... 9 10
In [132]:
#check the text of remaining tweets for ratings, make list of tweet_ids to delete
to_delete=[]
for i in wrong_ratings[wrong_ratings.corr_denominator.isnull()].index:
    print("index: ", i)
    print("Tweet text:",wrong_ratings.loc[i,'text'])
    to_delete.append(df_archive_copy.loc[i,'tweet_id']) #get tweetd_id from df_archive_copy as index is the same
print(to_delete)
index:  410
Tweet text: Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. 
Keep Sam smiling by clicking and sharing this link:
https://t.co/98tB8y7y7t https://t.co/LouL5vdvxx
['810984652412424192']
In [133]:
#check tweets in df_archive_copy for which no rating can be extracted 
df_archive_copy[df_archive_copy.tweet_id.isin(to_delete)]
Out[133]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls ... img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
410 810984652412424192 none none 2016-12-19 23:06:23 <a href="http://twitter.com/download/iphone" r... Meet Sam. She smiles 24/7 &amp; secretly aspir... none none NaT https://www.gofundme.com/sams-smile,https://tw... ... 1 golden_retriever 0.871342 True Tibetan_mastiff 0.036708 True Labrador_retriever 0.025823 True

1 rows × 28 columns

In [134]:
wrong_ratings.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18 entries, 340 to 2045
Data columns (total 5 columns):
rating_numerator      18 non-null int64
rating_denominator    18 non-null int64
text                  18 non-null object
corr_numerator        17 non-null object
corr_denominator      17 non-null object
dtypes: int64(2), object(3)
memory usage: 1.5+ KB

The corrected numerator and denominator are strings. Since integers can't handle NaN, replace NaN values with 0, convert to integers, insert into original dataframe, then delete the records.

In [135]:
wrong_ratings[['corr_numerator', 'corr_denominator']] = wrong_ratings[['corr_numerator', 'corr_denominator']].fillna(0).astype(int)
In [136]:
wrong_ratings.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18 entries, 340 to 2045
Data columns (total 5 columns):
rating_numerator      18 non-null int64
rating_denominator    18 non-null int64
text                  18 non-null object
corr_numerator        18 non-null int64
corr_denominator      18 non-null int64
dtypes: int64(4), object(1)
memory usage: 1.5+ KB
In [137]:
#replace re-extracted correct ratings in df_archive_copy
df_archive_copy.loc[mask,'rating_numerator'] = wrong_ratings.corr_numerator
df_archive_copy.loc[mask,'rating_denominator'] = wrong_ratings.corr_denominator
In [138]:
#delete rows relating to tweets without an extractable rating
df_archive_copy = df_archive_copy[~df_archive_copy.tweet_id.isin(to_delete)]

Test

In [139]:
df_archive_copy.rating_denominator.value_counts()
Out[139]:
10    2065
Name: rating_denominator, dtype: int64
In [140]:
sum(df_archive_copy.rating_denominator.isnull())
Out[140]:
0
In [141]:
df_archive_copy[['rating_numerator','rating_denominator']].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2065 entries, 0 to 2065
Data columns (total 2 columns):
rating_numerator      2065 non-null int64
rating_denominator    2065 non-null int64
dtypes: int64(2)
memory usage: 48.4 KB

4.9 - Data quality: Inaccurate rating numerators

Define:

All ratings numerators were extracted as integers, but some are the decimal of a rating.

  • re-extract ratings denominator that follow a decimal pattern, replace after converting column to float

Code:

In [142]:
#extract decimal ratings into new column
df_archive_copy['decimal_numerator'] = df_archive_copy.text.str.extract(r'.*\s(\d+\.\d+)\/10.*.*', expand=False).astype(float)
#extraction results
df_archive_copy[df_archive_copy.decimal_numerator.notnull()]
Out[142]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls ... p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog decimal_numerator
40 883482846933004288 none none 2017-07-08 00:28:19 <a href="http://twitter.com/download/iphone" r... This is Bella. She hopes her smile made you sm... none none NaT https://twitter.com/dog_rates/status/883482846... ... golden_retriever 0.943082 True Labrador_retriever 0.032409 True kuvasz 0.005501 True 13.50
552 786709082849828864 none none 2016-10-13 23:23:56 <a href="http://twitter.com/download/iphone" r... This is Logan, the Chow who lived. He solemnly... none none NaT https://twitter.com/dog_rates/status/786709082... ... Pomeranian 0.467321 True Persian_cat 0.122978 False chow 0.102654 True 9.75
608 778027034220126208 none none 2016-09-20 00:24:34 <a href="http://twitter.com/download/iphone" r... This is Sophie. She's a Jubilant Bush Pupper. ... none none NaT https://twitter.com/dog_rates/status/778027034... ... clumber 0.946718 True cocker_spaniel 0.015950 True Lhasa 0.006519 True 11.27
1445 680494726643068929 none none 2015-12-25 21:06:00 <a href="http://twitter.com/download/iphone" r... Here we have uncovered an entire battalion of ... none none NaT https://twitter.com/dog_rates/status/680494726... ... kuvasz 0.438627 True Samoyed 0.111622 True Great_Pyrenees 0.064061 True 11.26

4 rows × 29 columns

In [143]:
#define mask of extracted decimal ratings:
mask = df_archive_copy.decimal_numerator.notnull()
df_archive_copy.loc[mask,['tweet_id','text','decimal_numerator']]
Out[143]:
tweet_id text decimal_numerator
40 883482846933004288 This is Bella. She hopes her smile made you sm... 13.50
552 786709082849828864 This is Logan, the Chow who lived. He solemnly... 9.75
608 778027034220126208 This is Sophie. She's a Jubilant Bush Pupper. ... 11.27
1445 680494726643068929 Here we have uncovered an entire battalion of ... 11.26
In [144]:
#print text to check extraction
for t in df_archive_copy[mask].text:
    print(t)
This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948
This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS
This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq
Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD
In [145]:
#confirm original extraction was inaccurate
df_archive_copy.loc[mask,['rating_numerator','decimal_numerator']]
Out[145]:
rating_numerator decimal_numerator
40 5 13.50
552 75 9.75
608 27 11.27
1445 26 11.26
In [146]:
#convert rating_numerator to floats
df_archive_copy['rating_numerator'] = df_archive_copy['rating_numerator'].astype(float)
#replace original numerators
df_archive_copy.loc[mask,'rating_numerator'] = df_archive_copy.loc[mask,'decimal_numerator']
#drop unnecessary decimal_numerator column
df_archive_copy.drop('decimal_numerator',axis=1, inplace=True)

Test:

In [147]:
#check for presence of decimal numerators
df_archive_copy.rating_numerator.value_counts()
Out[147]:
12.00      477
10.00      431
11.00      419
13.00      279
9.00       152
8.00        95
7.00        51
14.00       39
5.00        33
6.00        32
3.00        19
4.00        15
2.00        10
1.00         4
0.00         2
9.75         1
15.00        1
11.26        1
13.50        1
11.27        1
420.00       1
1776.00      1
Name: rating_numerator, dtype: int64
In [148]:
df_archive_copy.loc[mask,'rating_numerator']
Out[148]:
40      13.50
552      9.75
608     11.27
1445    11.26
Name: rating_numerator, dtype: float64
In [149]:
#check correct dtype and unnecessary column has been dropped
df_archive_copy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2065 entries, 0 to 2065
Data columns (total 28 columns):
tweet_id                      2065 non-null object
in_reply_to_status_id         2065 non-null object
in_reply_to_user_id           2065 non-null object
timestamp                     2065 non-null datetime64[ns]
source                        2065 non-null object
text                          2065 non-null object
retweeted_status_id           2065 non-null object
retweeted_status_user_id      2065 non-null object
retweeted_status_timestamp    75 non-null datetime64[ns]
expanded_urls                 2065 non-null object
rating_numerator              2065 non-null float64
rating_denominator            2065 non-null int64
name                          2065 non-null object
dog_stage                     2065 non-null category
retweet_count                 2065 non-null int64
favorite_count                2065 non-null int64
tweet_type                    2065 non-null category
jpg_url                       2065 non-null object
img_num                       2065 non-null int64
p1                            2065 non-null object
p1_conf                       2065 non-null float64
p1_dog                        2065 non-null bool
p2                            2065 non-null object
p2_conf                       2065 non-null float64
p2_dog                        2065 non-null bool
p3                            2065 non-null object
p3_conf                       2065 non-null float64
p3_dog                        2065 non-null bool
dtypes: bool(3), category(2), datetime64[ns](2), float64(4), int64(4), object(13)
memory usage: 397.7+ KB

4.10 - Data quality: dog prediction column names

Define

Replace the column names with more descriptive names :

  • jpg_url:this is the URL of the image used in the prediction
  • img_num : out of a maximum of 4 images, the number of the image used in the prediction
  • p1,p2,p3 :make clear they are predictions
  • conf : confidence
  • dog : use is_dog, consistent with is_tweet, is_reply, is_quote

Code

In [150]:
#img_num and the image number that corresponded to the most
#confident prediction (numbered 1 to 4 since tweets can have up to four images
df_archive_copy = df_archive_copy.rename (columns={'jpg_url':'image_used_url',
                                                   'img_num':'image_used_num',
                                                   'p1': 'prediction_p1',
                                                   'p1_conf': 'confidence_p1',
                                                   'p1_dog': 'is_dog_p1',
                                                   'p2': 'prediction_p2',
                                                   'p2_conf': 'confidence_p2',
                                                   'p2_dog': 'is_dog_p2',
                                                   'p3': 'prediction_p3',
                                                   'p3_conf': 'confidence_p3',
                                                   'p3_dog': 'is_dog_p3'})

Test

In [151]:
list(df_archive_copy)
Out[151]:
['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'dog_stage',
 'retweet_count',
 'favorite_count',
 'tweet_type',
 'image_used_url',
 'image_used_num',
 'prediction_p1',
 'confidence_p1',
 'is_dog_p1',
 'prediction_p2',
 'confidence_p2',
 'is_dog_p2',
 'prediction_p3',
 'confidence_p3',
 'is_dog_p3']

4.11 - Data quality: dog prediction breed names are not consistent

Define:

Remove underscores, capitalise all dog breed names

Code:

In [152]:
df_archive_copy.prediction_p1 = df_archive_copy.prediction_p1.str.replace("_", " ")
df_archive_copy.prediction_p2 = df_archive_copy.prediction_p2.str.replace("_", " ")
df_archive_copy.prediction_p3 = df_archive_copy.prediction_p3.str.replace("_", " ")
In [153]:
#capitalise the dog breed names only, using is_dog_px for boolean indexing
df_archive_copy.loc[df_archive_copy.is_dog_p1,'prediction_p1'] = df_archive_copy[df_archive_copy.is_dog_p1].prediction_p1.str.title()
df_archive_copy.loc[df_archive_copy.is_dog_p2,'prediction_p2'] = df_archive_copy[df_archive_copy.is_dog_p2].prediction_p2.str.title()
df_archive_copy.loc[df_archive_copy.is_dog_p3,'prediction_p3'] = df_archive_copy[df_archive_copy.is_dog_p3].prediction_p3.str.title()

Test

In [154]:
df_archive_copy[df_archive_copy.is_dog_p1][['prediction_p1','prediction_p2','prediction_p3']].sample(5)
Out[154]:
prediction_p1 prediction_p2 prediction_p3
526 Malamute jean Keeshond
788 Golden Retriever llama Labrador Retriever
319 Cairn snorkel Norfolk Terrier
1407 Vizsla paddle Rhodesian Ridgeback
776 Golden Retriever Afghan Hound Chow

4.12 - Data quality: remove column redundancy dog predictions

Define:

The df_archive_copy dataframe contains 3 columns each for the 3 preductions, one is the prediction name, one whether it is a dog, and one of the confidence of the prediction. For the present analysis, this level of detailed information about the machine learning process is unnecessary as we are only interested in the top prediction if it is of a dog.

  1. Create column for the best prediction of a dog (they are ranked according to confidence).
  • remove redundant columns

Code:

In [155]:
#Get dog prediction from best prediction of a dog
mask = df_archive_copy.is_dog_p1
df_archive_copy.loc[mask,'top_prediction'] = df_archive_copy.loc[mask,'prediction_p1']
mask = (df_archive_copy.is_dog_p1==False) & df_archive_copy.is_dog_p2
df_archive_copy.loc[mask,'top_prediction'] = df_archive_copy.loc[mask,'prediction_p2']
mask = (df_archive_copy.is_dog_p1==False) & (df_archive_copy.is_dog_p2==False) & df_archive_copy.is_dog_p3
df_archive_copy.loc[mask,'top_prediction'] = df_archive_copy.loc[mask,'prediction_p3']
#fill in rows for which there is no predicted dog with "No dog predicted"
df_archive_copy.top_prediction = df_archive_copy.top_prediction.fillna("None")
In [156]:
#remove redundant columns
df_archive_copy.drop(['prediction_p1', 'confidence_p1', 'is_dog_p1', 'prediction_p2', 'confidence_p2', 
                      'is_dog_p2', 'prediction_p3', 'confidence_p3', 'is_dog_p3'],axis=1, inplace=True)

Test:

In [157]:
#check remaining columns
list(df_archive_copy)
Out[157]:
['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'dog_stage',
 'retweet_count',
 'favorite_count',
 'tweet_type',
 'image_used_url',
 'image_used_num',
 'top_prediction']

Define:

There are also a number of duplicated dog predictions: the same image was used, with an identical prediction, though the unique tweet_id is different. These may be the result of WeRataDogs retweeting their own tweets. The project description suggests that retweeted images should be removed.

  1. Record in new variable if a WeRateDogs tweet was retweeted by WeRateDogs
  • remove rows with retweets - predictions are the result of retweeted images
  • remove redundant columns related to retweets

Code:

In [158]:
#variable to record self-retweeting
df_archive_copy['was_retweeted'] = df_archive_copy.tweet_id.isin(df_archive_copy.retweeted_status_id)
In [159]:
#check how many duplicates are self-retweets
df_archive_copy[df_archive_copy.image_used_url.duplicated(keep=False)].tweet_type.value_counts()
Out[159]:
SelfRT      61
Original    61
Reply        0
Quote        0
OtherRT      0
Name: tweet_type, dtype: int64
In [160]:
df_archive_copy.tweet_type.value_counts()
Out[160]:
Original    1967
SelfRT        61
Reply         23
OtherRT       14
Quote          0
Name: tweet_type, dtype: int64
In [161]:
#remove retweets
df_archive_copy = df_archive_copy[~df_archive_copy.tweet_type.isin(['SelfRT','OtherRT'])]
In [162]:
#remove unnecessary columns related to retweets and replies
df_archive_copy.drop(['retweeted_status_id', 'retweeted_status_user_id', 
                      'retweeted_status_timestamp', 'in_reply_to_status_id', 'in_reply_to_user_id'],axis=1, inplace=True)

Test:

In [163]:
#check new variable
df_archive_copy['was_retweeted'].value_counts()
Out[163]:
False    1944
True       46
Name: was_retweeted, dtype: int64
In [164]:
list(df_archive_copy)
Out[164]:
['tweet_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'dog_stage',
 'retweet_count',
 'favorite_count',
 'tweet_type',
 'image_used_url',
 'image_used_num',
 'top_prediction',
 'was_retweeted']
In [165]:
#ensure no duplicated images remain
sum(df_archive_copy.image_used_url.duplicated())
Out[165]:
0
In [166]:
#check tweet_types remaining
df_archive_copy.tweet_type.value_counts()
Out[166]:
Original    1967
Reply         23
SelfRT         0
Quote          0
OtherRT        0
Name: tweet_type, dtype: int64

5 - Storage of cleaned data

Given the relatively small size of the data set, the most straighforward way to store the cleaned data is as csv files :

In [167]:
df_archive_copy.to_csv('twitter_archive_master.csv', index=False)

6 - Analysis and visualisation

The counts of favourites, retweets approximate a log-normal distribution

On Twitter, favourite and retweet counts are an important metric of audience engagement. Here I want to use them to to analyse the popularity of tweets and of dogs. To get a better idea of how they behave, I want to look at their distribution first.

In [168]:
#distribution of retweet and favourite counts
n_rt= df_archive_copy.retweet_count
n_fav= df_archive_copy.favorite_count

fig = plt.figure(figsize=(12,3))

ax1 = fig.add_subplot(121)
stats.probplot(n_rt, plot=plt) #https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html
ax1.set_title("Distribution of Retweet count")

ax2 = fig.add_subplot(122)
stats.probplot(n_fav, plot=plt)
ax2.set_title("Distribution of Favourite count");

#calculate logarithm for non-zero counts
log_rt= np.log(df_archive_copy[df_archive_copy.retweet_count>0].retweet_count)
log_fav= np.log(df_archive_copy[df_archive_copy.favorite_count>0].favorite_count)

fig = plt.figure(figsize=(12,3))

ax3 = fig.add_subplot(121)
stats.probplot(log_rt, plot=plt)
ax3.set_title("Distribution of Retweet count Logarithm")

ax4 = fig.add_subplot(122)
stats.probplot(log_fav, plot=plt);
ax4.set_title("Distribution of Favourite count Logarithm");

The counts of favourites and of retweets are approximately log-normally distributed (though retweets, deviate somewhat at very low counts).

Favourites and retweets are strongly correlated

Seaborn's jointplot function is a good way to look at the relationship between retweet and favourite count

In [169]:
#only consider non-zero values
x=(df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.retweet_count>0)].favorite_count)
y=(df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.retweet_count>0)].retweet_count)

corr_coeff= stats.pearsonr(x,y) #https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

ax=sns.jointplot(x,y,kind="reg")
ax.fig.text(x=0.3,y=0.7, s="Pearson: r={0:.2f}, p={1:.1f}".format(corr_coeff[0],corr_coeff[1]), ha="center", va="center")
                                    #decimals https://stackoverflow.com/a/8940627
ax.fig.subplots_adjust(top=0.93) #title https://stackoverflow.com/a/29814281
ax.fig.suptitle('Correlation of favourites vs retweets', fontsize=14);

Plot the log values of retweet count and of favourite count - to reduce overplotting and get a better distribution of the counts

In [170]:
x=np.log(df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.retweet_count>0)].favorite_count)
y=np.log(df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.retweet_count>0)].retweet_count)

corr_coeff= stats.pearsonr(x,y)

ax=sns.jointplot(x,y,kind="reg")

ax.fig.text(x=0.3,y=0.7, s="Pearson: r={0:.2f}, p={1:.1f}".format(corr_coeff[0],corr_coeff[1]), ha="center", va="center")
ax.fig.subplots_adjust(top=0.93)
ax.fig.suptitle('Fvourites vs retweets (log)', fontsize=14);

There is a very stong correlation (r=0.93) between retweet count and favourite count. This can easily be understood as both retweets and favourite are expressions of audience engagement, the former being a stronger form of engagement. There is also a mutual re-inforcement mechanism at play, as tweets that are retweeted are shown to a wider audience, increasing the pool of tweets who may favourite the tweet ; and popular tweets, as expressed by a higher favorite count are given higher prominence in the Twitter timeline. This feedback mechanism is likely to be the basis for the log-normal distribution observed.

The popularity of WeRateDogs increases steadily over a 20 month period

With nearly 8 million followers, @dog_rates (WeRateDogs) is a very successful Twitter account. How did it get there?

The plot below shows the evolution of audience engagement over the first 20 months. Only original tweets are considered - i.e. replies, quote-tweets and retweets are left out, as audience engagement is fundamentally different with them.

The retweet and favourite counts are shown for each tweet as a point in the scatter plot, on a logarithmic y-axis, with a line shown a rolling mean of a window size of 50 tweets.

In [171]:
fig, ax = plt.subplots(figsize=(14,6))

x = df_archive_copy[df_archive_copy.tweet_type=='Original'].timestamp

y1 = df_archive_copy[df_archive_copy.tweet_type=='Original'].favorite_count
y1m = y1.rolling(50).median()

y2 = df_archive_copy[df_archive_copy.tweet_type=='Original'].retweet_count
y2m = y2.rolling(50).median()

ax.plot_date(x,y1, color='b', markersize = 1)
ax.plot_date(x,y2, color='y', markersize = 1)

ax.plot(x,y1m, color='k')
ax.plot(x,y2m,color='r')
plt.yscale("log")

ax.set_title("Popularity of @dog_rates (WeRateDogs) over 2 years")
ax.set_ylabel("retweet / favourite count (log axis)")
ax.set_xlabel("date")
ax.legend(['Favourites','Retweets','Favourites - Rolling mean','Retweets - Rolling mean']);

During the first month or two, there's a rapid increase in popularity - from tens to thousands of retweets. Thereafter, the slope of the curve become much less steep but increases steadily until the end of the period studied. The approximately linear aspect of the rolling means expresses an exponential increase since the y-axis is logarithmic. The distance between the lines for favourites and retweets is relatively consistent but widens slowly.

Tweet popularity is associated with naming of dogs, dog "stage", and retweeting

WeRateDogs commonly states the name of the dog in the picture, as well as the "dog stage". Does this make the tweets more popular?

In [172]:
#create new variable stating if name and dog stage are used in text
df_archive_copy['has_name'] = df_archive_copy.name != "None"
df_archive_copy['has_stage_defined']=df_archive_copy.dog_stage != "no stage"

data = df_archive_copy[(df_archive_copy.favorite_count>0) & (df_archive_copy.tweet_type=='Original')]

#correlation https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pointbiserialr.html
corr_name = stats.pointbiserialr(data.has_name,data.favorite_count)
corr_stage = stats.pointbiserialr(data.has_stage_defined,data.favorite_count)

fig, (ax1, ax2) = plt.subplots(1,2,figsize=(12,8))
sns.boxplot(y='favorite_count',x='has_name',data= data,ax=ax1).set_title('Favourite count: without/with named dogs')
fig.get_axes()[0].set_yscale('log')

sns.boxplot(y='favorite_count',x='has_stage_defined',data= data,ax=ax2).set_title('Favourite count: without/with dog stage')
fig.get_axes()[1].set_yscale('log')

ax1.text(x=0.6,y=65, s="Point-biserial: r={0:.2f}, p={1:.2f}".format(corr_name[0],corr_name[1]), ha="center", va="center")
ax2.text(x=0.6,y=65, s="Point-biserial: r={0:.2f}, p={1:.2f}".format(corr_stage[0],corr_stage[1]), ha="center", va="center");

Tweets that include a named dog or mention a dog stage are more popular - however the the difference is marginal. The correlation coefficient (r=0.02) and p-value (p=0.28) suggest the difference is not significant for named dogs, and it is only slightly higher for unnamed vs named dogs (r=0.08, p=0)

Is there a discernible effect of retweeting, and of the type of tweet?

In [173]:
data = df_archive_copy[df_archive_copy.favorite_count>0]

corr_coeff = stats.pointbiserialr(data.was_retweeted,data.favorite_count)

fig, (ax1, ax2) = plt.subplots(1,2,figsize=(12,8))
sns.boxplot(y='favorite_count',x='was_retweeted',data= data,ax=ax1).set_title('Favourite count: retweets')
fig.get_axes()[0].set_yscale('log')
ax1.text(x=0.6,y=60, s="Point-biserial: r={0:.2f}, p={1:.2f}".format(corr_coeff[0],corr_coeff[1]), ha="center", va="center")

sns.boxplot(y='favorite_count',x='tweet_type',data= data,ax=ax2).set_title('Favourite count: Tweet type')
fig.get_axes()[1].set_yscale('log')

The box-plot shows that WeRateDogs tweets which are retweeted by the same account have a higher favourite count than those that are not. Thus retweeting is plausibly a good strategy for increasing a tweets popularity.However, the analysis here can't establish causality, because it's certainly possible that popular tweets are more likely to be chosen for retweeting.

As expected replies are less popular than original WeRateDog tweets, but quote tweets are surprisingly more popular than the standard original WeRateDog tweets.

The WeRateDogs ratings system is meaningful

WeRateDogs uses an ideosyncratic rating system for dogs, in which ratings of usually between 10 to 13 out of 10 are used. Lower rating do exist as well as 14, but ratings above 14 are outside of the standard used.

Do these ratings express anything meanful about the dog(s) in question? Below is a regression plot of the rating (numerator) against the dog's popularity (log of favorite count). Only ratings below 15/10 are considered.

In [174]:
data = df_archive_copy[df_archive_copy.rating_numerator<15]

x = data.rating_numerator
y = np.log(data.favorite_count)

pearson = stats.pearsonr(x,y)
spearman = stats.spearmanr(x,y) #https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.spearmanr.html

ax=sns.jointplot(x,y,kind="reg")
ax.fig.text(x=0.17,y=0.77, s="Pearson: r={0:.2f}, p={1:.1f}".format(pearson[0],pearson[1]), ha="left", va="center")
ax.fig.text(x=0.17,y=0.74, s="Spearman: r={0:.2f}, p={1:.1f}".format(spearman[0],spearman[1]), ha="left", va="center")

ax.fig.subplots_adjust(top=0.93)
ax.fig.suptitle('Correlation of Dog rating vs popularity', fontsize=14);

The plot suggests that the rating system is meaningful and there is in fact a relationship between the rating and the popularity of a dog. The Pearson correlation coefficient of r=0.49 is lower than the Spearman correlation coefficient r=0.60, since Pearson only measure linear relationships. Again, it is not possible to draw any conclusions re. causality here - it may be that a higher rating biases the audience to favour a dog picture, or it may be that a high/low favourite count and a high/low dog rating are both independently caused by how great a dog picture is.

Dog stage and breed have little impact on popularity

Are young dogs more popular? Are certain breeds more popular? The box plot below looks at this question.

In [175]:
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(12,8))

x= df_archive_copy.dog_stage
y= np.log(df_archive_copy.favorite_count)
sns.boxplot(x,y,data = df_archive_copy,ax=ax1).set_title('Dog stage and popularity')

top_dogs=df_archive_copy.top_prediction.value_counts().head(8).index.tolist()

data=df_archive_copy[df_archive_copy.top_prediction.isin(top_dogs)]
x= data.dog_stage
y= np.log(data.favorite_count)

sns.boxplot(x,y,data = data,ax=ax2).set_title('Dog breed and popularity')
plt.xticks(rotation=45);

The most popular dog stages are "puppo", "doggo" and "floofer", whose median popularity is about one unit above that of "pupper" and photos of dogs where no 'dog stage' is mentioned. It is perhaps surprising that "pupper" has the lowest median popularity rating - this suggests that it isn't simply the fact that a dog staage is mentioned in a tweet that contributes to its popularity.

Amongst the 7 most common dog breeds, Golden Retrievers tend to be the most popular, with Pugs and Chihuahuas at the lowest median popularity - these are roughly equal with pics that have no detected dog breed, which usually do not have a dog in the photo at all.

However, the effect of dog stage and dog breed on the popularity of dog pics is relatively small, considering that the ranges overlap.

The most common breed is Golden Retriever

Our cleaned data enable us to find out what the most common dog breeds are - here are the top 20 breeds. This analysis assumes that the breed predictions are quite accurate and don't have any systematic bias.

In [176]:
df_archive_copy.top_prediction.value_counts()[1:20].plot.barh(figsize=(8,5))
plt.gca().invert_yaxis()

The most common dog names Cooper, Lucy, Charlie and Oliver

Similarly, our cleaned data enable us to find out what the most common dog names are.

In [177]:
df_archive_copy.name.value_counts()[1:20].plot.barh(figsize=(8,5))
plt.gca().invert_yaxis();