DAND Project: Investigate no-show medical appointments

- Introduction

This is a project for the Introduction to Data Analysis module in the Udacity Data Analysis Nanodegree. The aim is to "conduct data analysis on a dataset, answering a number of questions and create a file to share the findings. Since inferential statistics or machine learning are not used, those findings will be tentative". The dataset chosen for analysis regards medical appointments:

This dataset collects information from 100k medical in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row. (from the Udacity project description)

How come patients miss scheduled medical appointments?

This is the central question I will try to gain a better understanding of. A preponderance of missed medical appointments make the medical system less efficient, potentially wasting precious resources. Identifying factors associated with missed appointments may hint at measures that could be taken to reduce the level of no-shows.

My approach will be to distinguish between patient characteristics, and factors related to the scheduling of medical appointments. The latter may point towards simple organisational steps that could be implemented to reduce no-shows. By contrast, patient characteristics - such as health or socio-economic status, gender or age - are not liable to intervention by the health system, and would point towards changes in the level of support given to patient groups identified in the analysis.

In [1]:
#import statements for packages to be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#set up matplotlib to work interactively
%matplotlib inline  
sns.set()           # switch on seaborn defaults
In [2]:
import matplotlib
pd.__version__, np.__version__, matplotlib.__version__, sns.__version__
Out[2]:
('0.22.0', '1.13.3', '2.2.2', '0.8.1')

Data Wrangling

General Properties

We'll start by trying to get an insight into the dataset and its structure

In [3]:
#Load the data from the csv file into the dataframe
#The data file was obtained from Kaggle: https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd2e9a_noshowappointments-kagglev2-may-2016/noshowappointments-kagglev2-may-2016.csv
df=pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.shape   # attribute telling us the number of rows and columns
Out[3]:
(110527, 14)
In [4]:
df.head()  # checking the first few rows, including column names
Out[4]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No
1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 No
2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 No
3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 0 0 0 0 No
4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 1 0 0 0 No
In [5]:
df.info()  # checking data types and identify any missing values NaN
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
In [6]:
sum(df.duplicated())   # function counting duplicated rows
Out[6]:
0

This initial inspection of data reveals that we have :

  • 110527 records with 14 columns
  • there are no duplicated rows
  • there are no missing data in any of the rows (NaN)
  • column names contain both capitalised and lower case letters, as well as some typos. Typos should be corrected, and column names may be converted to all lower case.

Next I will examine the different columns.

ID numbers for patients and appointments

In [7]:
#Let's study the first two columns, their type and unique values
df[['PatientId','AppointmentID']].nunique()
Out[7]:
PatientId         62299
AppointmentID    110527
dtype: int64

There are 110527 unique appointment IDs, the same as the number of rows in the dataframe - consistent with the description of the dataset, each row represents a medical appointment.

There are 62299 unique patient IDs - this implies that some patients have had several appointments.

From the result of df.info(), we can see that, although both appointment and patient IDs are numeric, Pandas converted the former to integer but the latter to floats - but one would expect them to take the same format.

In [8]:
df[['PatientId','AppointmentID']].head()
Out[8]:
PatientId AppointmentID
0 2.987250e+13 5642903
1 5.589978e+14 5642503
2 4.262962e+12 5642549
3 8.679512e+11 5642828
4 8.841186e+12 5642494
In [9]:
df['PatientId'].astype(int).head() #display the patient IDs by formating as integers (potentially removing decimals)
Out[9]:
0     29872499824296
1    558997776694438
2      4262962299951
3       867951213174
4      8841186448183
Name: PatientId, dtype: int64
In [10]:
df[['PatientId','AppointmentID']].describe()
Out[10]:
PatientId AppointmentID
count 1.105270e+05 1.105270e+05
mean 1.474963e+14 5.675305e+06
std 2.560949e+14 7.129575e+04
min 3.921784e+04 5.030230e+06
25% 4.172614e+12 5.640286e+06
50% 3.173184e+13 5.680573e+06
75% 9.439172e+13 5.725524e+06
max 9.999816e+14 5.790484e+06

Appointment IDs take the format of a 7 digit integer starting with 5 as the first digit .

Most patient IDs are distributed over 2 or 3 orders of magnitude, but a few apparent outliers are much smaller. It is unknown how many are integers, as one might expect from an ID number

In [11]:
#using numpy to test if most patient IDs are integers through an elementwise comparison of arrays of patientid with  patientid converted to integers 
np.array_equal(df['PatientId'], df['PatientId'].astype(int)) #inspired by https://stackoverflow.com/a/49249910
Out[11]:
False
In [12]:
#count the rows in which patient ID is not an integer
sum(df['PatientId'] != df['PatientId'].astype(int)) 
Out[12]:
5
In [13]:
#using conditional indexing to display the 5 rows in which patient IDs are not integers
df[df['PatientId']!=df['PatientId'].astype(int)]
Out[13]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
3950 93779.52927 5712759 F 2016-05-18T09:12:29Z 2016-05-18T00:00:00Z 33 CENTRO 0 0 0 0 0 0 No
73228 537615.28476 5637728 F 2016-04-29T07:19:57Z 2016-05-06T00:00:00Z 14 FORTE SÃO JOÃO 0 0 0 0 0 1 No
73303 141724.16655 5637648 M 2016-04-29T07:13:36Z 2016-05-02T00:00:00Z 12 FORTE SÃO JOÃO 0 0 0 0 0 0 No
100517 39217.84439 5751990 F 2016-05-31T10:56:41Z 2016-06-03T00:00:00Z 44 PRAIA DO SUÁ 0 0 0 0 0 0 No
105430 43741.75652 5760144 M 2016-06-01T14:22:58Z 2016-06-01T00:00:00Z 39 MARIA ORTIZ 0 0 1 0 0 0 No

Almost all patient IDs are integers, with the exception of 5 records in which a data entry error appears to have taken place with a . inserted. These 5 rows could either be dropped, or, given that the error seems obvious, manually corrected.

Patient gender and age

In [14]:
df['Gender'].nunique()   #display unique values for the gender column
Out[14]:
2
In [15]:
df['Gender'].value_counts()  #count rows for each gender
Out[15]:
F    71840
M    38687
Name: Gender, dtype: int64
In [16]:
df['Gender'].value_counts(normalize=True)   #normalised to obtain proportion for each gender
Out[16]:
F    0.649977
M    0.350023
Name: Gender, dtype: float64

Patient gender is recorded as a string taking the value of either F or M. 65% of rows represent appointments for female patients. A binary gender system is used, implying that the present data are not suitable for gaining insight into the health needs specific to trans or non-binary patients.

In [17]:
df['Age'].nunique() #unique age values
Out[17]:
104
In [18]:
df['Age'].describe() #descriptive stats on age
Out[18]:
count    110527.000000
mean         37.088874
std          23.110205
min          -1.000000
25%          18.000000
50%          37.000000
75%          55.000000
max         115.000000
Name: Age, dtype: float64
In [19]:
df['Age'].value_counts().head() #check the most common values
Out[19]:
0     3539
1     2273
52    1746
49    1652
53    1651
Name: Age, dtype: int64
In [20]:
df['Age'].value_counts().tail() #check the least common values - some could be errors
Out[20]:
 115    5
 100    4
 102    2
 99     1
-1      1
Name: Age, dtype: int64
In [21]:
df['Age'].hist(); #visualise the age distribution with a histogram
In [22]:
df[df.Age<0] #checking records for appointments with patients aged below 0 
Out[22]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
99832 4.659432e+14 5775010 F 2016-06-06T08:58:13Z 2016-06-06T00:00:00Z -1 ROMÃO 0 0 0 0 0 0 No
In [23]:
df[df.Age>100] #checking records for appointments with patients aged aboce 100 years 
Out[23]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
58014 9.762948e+14 5651757 F 2016-05-03T09:14:53Z 2016-05-03T00:00:00Z 102 CONQUISTA 0 0 0 0 0 0 No
63912 3.196321e+13 5700278 F 2016-05-16T09:17:44Z 2016-05-19T00:00:00Z 115 ANDORINHAS 0 0 0 0 1 0 Yes
63915 3.196321e+13 5700279 F 2016-05-16T09:17:44Z 2016-05-19T00:00:00Z 115 ANDORINHAS 0 0 0 0 1 0 Yes
68127 3.196321e+13 5562812 F 2016-04-08T14:29:17Z 2016-05-16T00:00:00Z 115 ANDORINHAS 0 0 0 0 1 0 Yes
76284 3.196321e+13 5744037 F 2016-05-30T09:44:51Z 2016-05-30T00:00:00Z 115 ANDORINHAS 0 0 0 0 1 0 No
90372 2.342836e+11 5751563 F 2016-05-31T10:19:49Z 2016-06-02T00:00:00Z 102 MARIA ORTIZ 0 0 0 0 0 0 No
97666 7.482346e+14 5717451 F 2016-05-19T07:57:56Z 2016-06-03T00:00:00Z 115 SÃO JOSÉ 0 1 0 0 0 1 No

Summary of patient age data:

  • Patient age is given as a an integer
  • since this doesn't allow for months to be recorded, newborns and babies under 1 year of age are apparently recorded as age 0 - this is the mode of the age distribution
  • the median age is 37 years, meaning that in 50% of appointments, the patient is aged less than 37 years. In 25% of appointments, the patient is a minor (below 18) and in 25% over 55 years.
  • there are a surprising number of rows with the age of 115, but these are multiple appointments with the same patient
  • there is one appointment recorded as age -1. It is not clear what this means but it may represent an unconventional way of recording a prenatal intervention although the expectant mother's age ought to be recordeded since she is the patient. Alternatively, it could be a mistaken way of recording a premature birth. In either case, one may consider removing this record.

No further data cleaning is required for the age column.

Dates

Investigating the two date columns:

In [24]:
df[['ScheduledDay','AppointmentDay']].dtypes
Out[24]:
ScheduledDay      object
AppointmentDay    object
dtype: object
In [25]:
df[['ScheduledDay','AppointmentDay']].head() 
Out[25]:
ScheduledDay AppointmentDay
0 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z
1 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z
2 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z
3 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z
4 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z
In [26]:
df[['ScheduledDay','AppointmentDay']].tail() 
Out[26]:
ScheduledDay AppointmentDay
110522 2016-05-03T09:15:35Z 2016-06-07T00:00:00Z
110523 2016-05-03T07:27:33Z 2016-06-07T00:00:00Z
110524 2016-04-27T16:03:52Z 2016-06-07T00:00:00Z
110525 2016-04-27T15:09:23Z 2016-06-07T00:00:00Z
110526 2016-04-27T13:30:56Z 2016-06-07T00:00:00Z

Date columns:

  • According to the description, ‘ScheduledDay’ tells us on what day the patient set up their appointment.
  • Currently both dates have the data type of strings - they should be converted to datetime.
  • It appears that the time is recorded for the day on which the appointment was set up but not of the appointment date itself. This is potentially problematic - for same-day appointments, if the same granularity is used for ScheduledDay and AppointmentDay and day times used, the appointment appears to take place prior to it being scheduled
  • Appointments apparently took place between 29. April and 7 of June 2016

Neighbourhood / hospital

According to the description, ‘Neighborhood’ indicates the location of the hospital at which the apointment took place.

In [27]:
df['Neighbourhood'].nunique() #Number of different hospitals
Out[27]:
81
In [28]:
df['Neighbourhood'].value_counts() #counts of appointments in each hospital
Out[28]:
JARDIM CAMBURI                 7717
MARIA ORTIZ                    5805
RESISTÊNCIA                    4431
JARDIM DA PENHA                3877
ITARARÉ                        3514
CENTRO                         3334
TABUAZEIRO                     3132
SANTA MARTHA                   3131
JESUS DE NAZARETH              2853
BONFIM                         2773
SANTO ANTÔNIO                  2746
SANTO ANDRÉ                    2571
CARATOÍRA                      2565
JABOUR                         2509
SÃO PEDRO                      2448
ILHA DO PRÍNCIPE               2266
NOVA PALESTINA                 2264
ANDORINHAS                     2262
DA PENHA                       2217
ROMÃO                          2215
GURIGICA                       2018
SÃO JOSÉ                       1977
BELA VISTA                     1907
MARUÍPE                        1902
FORTE SÃO JOÃO                 1889
ILHA DE SANTA MARIA            1885
SÃO CRISTÓVÃO                  1836
REDENÇÃO                       1553
SÃO BENEDITO                   1439
JOANA D´ARC                    1427
                               ... 
SANTOS REIS                     547
ESTRELINHA                      538
SANTA CLARA                     506
SOLON BORGES                    469
PIEDADE                         452
SANTA CECÍLIA                   448
SANTA LÚCIA                     438
SANTA LUÍZA                     428
BARRO VERMELHO                  423
DO MOSCOSO                      413
MÁRIO CYPRESTE                  371
BOA VISTA                       312
COMDUSA                         310
DE LOURDES                      305
ARIOVALDO FAVALESSA             282
ANTÔNIO HONÓRIO                 271
FRADINHOS                       258
ENSEADA DO SUÁ                  235
SANTA HELENA                    178
HORTO                           175
UNIVERSITÁRIO                   152
SEGURANÇA DO LAR                145
NAZARETH                        135
MORADA DE CAMBURI                96
PONTAL DE CAMBURI                69
ILHA DO BOI                      35
ILHA DO FRADE                    10
AEROPORTO                         8
ILHAS OCEÂNICAS DE TRINDADE       2
PARQUE INDUSTRIAL                 1
Name: Neighbourhood, Length: 81, dtype: int64
In [29]:
df['Neighbourhood'].value_counts().sort_index().index #sorted to identify if there are any typos 
Out[29]:
Index(['AEROPORTO', 'ANDORINHAS', 'ANTÔNIO HONÓRIO', 'ARIOVALDO FAVALESSA',
       'BARRO VERMELHO', 'BELA VISTA', 'BENTO FERREIRA', 'BOA VISTA', 'BONFIM',
       'CARATOÍRA', 'CENTRO', 'COMDUSA', 'CONQUISTA', 'CONSOLAÇÃO',
       'CRUZAMENTO', 'DA PENHA', 'DE LOURDES', 'DO CABRAL', 'DO MOSCOSO',
       'DO QUADRO', 'ENSEADA DO SUÁ', 'ESTRELINHA', 'FONTE GRANDE',
       'FORTE SÃO JOÃO', 'FRADINHOS', 'GOIABEIRAS', 'GRANDE VITÓRIA',
       'GURIGICA', 'HORTO', 'ILHA DAS CAIEIRAS', 'ILHA DE SANTA MARIA',
       'ILHA DO BOI', 'ILHA DO FRADE', 'ILHA DO PRÍNCIPE',
       'ILHAS OCEÂNICAS DE TRINDADE', 'INHANGUETÁ', 'ITARARÉ', 'JABOUR',
       'JARDIM CAMBURI', 'JARDIM DA PENHA', 'JESUS DE NAZARETH', 'JOANA D´ARC',
       'JUCUTUQUARA', 'MARIA ORTIZ', 'MARUÍPE', 'MATA DA PRAIA', 'MONTE BELO',
       'MORADA DE CAMBURI', 'MÁRIO CYPRESTE', 'NAZARETH', 'NOVA PALESTINA',
       'PARQUE INDUSTRIAL', 'PARQUE MOSCOSO', 'PIEDADE', 'PONTAL DE CAMBURI',
       'PRAIA DO CANTO', 'PRAIA DO SUÁ', 'REDENÇÃO', 'REPÚBLICA',
       'RESISTÊNCIA', 'ROMÃO', 'SANTA CECÍLIA', 'SANTA CLARA', 'SANTA HELENA',
       'SANTA LUÍZA', 'SANTA LÚCIA', 'SANTA MARTHA', 'SANTA TEREZA',
       'SANTO ANDRÉ', 'SANTO ANTÔNIO', 'SANTOS DUMONT', 'SANTOS REIS',
       'SEGURANÇA DO LAR', 'SOLON BORGES', 'SÃO BENEDITO', 'SÃO CRISTÓVÃO',
       'SÃO JOSÉ', 'SÃO PEDRO', 'TABUAZEIRO', 'UNIVERSITÁRIO', 'VILA RUBIM'],
      dtype='object')

'Neighbourhood' column:

  • There are apparently 81 different hospitals, with a very wide spread of appointments (ranging from single figures to several thousand)
  • Hospital names are all capitalised strings, including some non-standard letters - this should not pose a problem.

There is no immediately obvious need for data cleaning in this column.

Remaining patient characteristics

Checking the remaining columns related to patient characteristics:

In [30]:
df[['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap']].nunique() #unique values in each column
Out[30]:
Scholarship     2
Hipertension    2
Diabetes        2
Alcoholism      2
Handcap         5
dtype: int64
In [31]:
df[['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap']].dtypes #data types in each column
Out[31]:
Scholarship     int64
Hipertension    int64
Diabetes        int64
Alcoholism      int64
Handcap         int64
dtype: object
In [32]:
df['Scholarship'].value_counts() #counts for each value
Out[32]:
0    99666
1    10861
Name: Scholarship, dtype: int64
In [33]:
df['Hipertension'].value_counts()
Out[33]:
0    88726
1    21801
Name: Hipertension, dtype: int64
In [34]:
df['Diabetes'].value_counts()
Out[34]:
0    102584
1      7943
Name: Diabetes, dtype: int64
In [35]:
df['Alcoholism'].value_counts()
Out[35]:
0    107167
1      3360
Name: Alcoholism, dtype: int64
In [36]:
df['Handcap'].value_counts()
Out[36]:
0    108286
1      2042
2       183
3        13
4         3
Name: Handcap, dtype: int64

Several patient characteristics are recorded as integers. Scholarship, "hipertension", diabetes and alcoholism only take the values 0 and 1, and could therefore be converted into booleans. "Handcap" takes values from 0 to 4, apparently representing levels of increasingly severe disability. The numbers of patients recorded at levels 3 and 4 are very small. It may be useful to add a column with boolean values expressing whether or not a patient is disabled - this new column would make it possible to treat disability in the same way as the previous characteristics.

Remaining appointment characteristics

In [37]:
df[['SMS_received', 'No-show']].nunique()
Out[37]:
SMS_received    2
No-show         2
dtype: int64
In [38]:
df[['SMS_received', 'No-show']].dtypes
Out[38]:
SMS_received     int64
No-show         object
dtype: object
In [39]:
df['SMS_received'].value_counts()
Out[39]:
0    75045
1    35482
Name: SMS_received, dtype: int64
In [40]:
df['No-show'].value_counts()
Out[40]:
No     88208
Yes    22319
Name: No-show, dtype: int64
In [41]:
df['No-show'].value_counts(normalize=True)
Out[41]:
No     0.798067
Yes    0.201933
Name: No-show, dtype: float64

Appointments in which the patient received an SMS are recorded as integers taking values of 0 or 1 - this could be converted into booleans. No-show appointments are recorded as strings of either Yes (if the patient missed the appointment) or No, if the patient showed up. Overall 20.19% of appointments are missed.

Summary of initial inspection:

This is a dataset of 110527 medical appointments, apparently ordered chronologically, which took place between 29. April and 7 of June 2016. Each appointment has a unique ID, and there are 62299 patients with unique IDs. 14 columns record information relating to the appointment - the date it was scheduled, the date the appointment took place, the hospital, whether an SMS was received, and, importantly, whether the patient showed up for the appointment- as well as information relating to the patient - gender, age, enrolment in a welfare program ("scholarship"), hypertension, disability.

Data Cleaning

The data are quite clean in that there are no duplicated rows nor missing data in any of the columns. Only the following data cleaning will be undertaken:

  • correct malformed patient id numbers and convert the whole column into the integer data type
  • convert dates into the datetime data type for columns ScheduledDay, AppointmentDay
  • column names: typos will be corrected, all names changed to lower case, hyphen replaced by underscore
  • convert to boolean the columns 'Scholarship', 'Hipertension, Diabetes', 'Alcoholism', 'SMS_received'
  • add a new column with boolean values for disability to make it compatible with other patient characteristics
  • drop single record of appointment with patient with a negative age

Column names

In [42]:
df.columns # view current column names
Out[42]:
Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
       'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hipertension',
       'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show'],
      dtype='object')
In [43]:
#correct typos in column names,replace hyphen and convert them to all lower-case for consistency

df=df.rename(columns={'Hipertension':'Hypertension','Handcap':'Disability_int', 'No-show':'No_show'}) #rename columns, remove hyphen
df=df.rename(columns=str.lower) # convert all column names to lower case
df.head(1)  # check changes
Out[43]:
patientid appointmentid gender scheduledday appointmentday age neighbourhood scholarship hypertension diabetes alcoholism disability_int sms_received no_show
0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No

Correct patient ids

Several rows contain malformed patient ids. Given the small number of rows, this can be done by replacing them manually.

In [44]:
#obtain row numbers and patient ids that are malformed patient ids:

(df[df['patientid']!=df['patientid'].astype(int)])['patientid']  # get rows that are not integers
Out[44]:
3950       93779.52927
73228     537615.28476
73303     141724.16655
100517     39217.84439
105430     43741.75652
Name: patientid, dtype: float64
In [45]:
#Rows that need correcting: [3950, 73228, 73303, 100517, 105430]
#Corrected patient ids for these rows: [9377952927, 53761528476, 14172416655, 3921784439, 4374175652]
#Simple loop to replace the malformed patient ids :

for (row_nb,pid) in [(3950,9377952927),(73228,53761528476,),(73303,14172416655,),(100517,3921784439,),(105430,4374175652)]:
    df.iloc[row_nb,0]=pid
In [46]:
#test to verify all patient ids are now integer, to confirm it's safe to confirm the dtype from float to integer
np.array_equal(df['patientid'], df['patientid'].astype(int))  #should now be true
Out[46]:
True
In [47]:
#since all patient ids are integers, convert data type to integer
df['patientid']=df['patientid'].astype(int)
df['patientid'].dtype # check change
Out[47]:
dtype('int64')
In [48]:
df.iloc[[3950, 73228, 73303, 100517, 105430]] # check corrected rows
Out[48]:
patientid appointmentid gender scheduledday appointmentday age neighbourhood scholarship hypertension diabetes alcoholism disability_int sms_received no_show
3950 9377952927 5712759 F 2016-05-18T09:12:29Z 2016-05-18T00:00:00Z 33 CENTRO 0 0 0 0 0 0 No
73228 53761528476 5637728 F 2016-04-29T07:19:57Z 2016-05-06T00:00:00Z 14 FORTE SÃO JOÃO 0 0 0 0 0 1 No
73303 14172416655 5637648 M 2016-04-29T07:13:36Z 2016-05-02T00:00:00Z 12 FORTE SÃO JOÃO 0 0 0 0 0 0 No
100517 3921784439 5751990 F 2016-05-31T10:56:41Z 2016-06-03T00:00:00Z 44 PRAIA DO SUÁ 0 0 0 0 0 0 No
105430 4374175652 5760144 M 2016-06-01T14:22:58Z 2016-06-01T00:00:00Z 39 MARIA ORTIZ 0 0 1 0 0 0 No

Convert to boolean type

Convert the data type of the columns scholarship, hypertension, diabetes, alcoholism and sms_received to boolean, and add a column with bolean values for disability:

In [49]:
df[['scholarship', 'hypertension', 'diabetes', 'alcoholism','sms_received']]=df[[ 'scholarship', 'hypertension', 'diabetes', 'alcoholism','sms_received']].astype(bool)
df['disability'] = df['disability_int']>0 #add disability column with boolean values
df.head() # check changes
Out[49]:
patientid appointmentid gender scheduledday appointmentday age neighbourhood scholarship hypertension diabetes alcoholism disability_int sms_received no_show disability
0 29872499824296 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA False True False False 0 False No False
1 558997776694438 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA False False False False 0 False No False
2 4262962299951 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA False False False False 0 False No False
3 867951213174 5642828 F 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI False False False False 0 False No False
4 8841186448183 5642494 F 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA False True True False 0 False No False

Convert dates to datetime

convert scheduledday and appointmentday to a datetime data type:

In [50]:
df['scheduledday']=pd.to_datetime(df['scheduledday'])
df['appointmentday']=pd.to_datetime(df['appointmentday'])
In [51]:
# As pointed out earlier, the different granularity of the two dates is potentially problematic - let's test for 
# how many the appointment dates precedes the date on which it was scheduled:
df.query("scheduledday>appointmentday").shape
Out[51]:
(38568, 15)
In [52]:
df.query("scheduledday>appointmentday").head()  # only checking a few rows - not  38K records  
Out[52]:
patientid appointmentid gender scheduledday appointmentday age neighbourhood scholarship hypertension diabetes alcoholism disability_int sms_received no_show disability
0 29872499824296 5642903 F 2016-04-29 18:38:08 2016-04-29 62 JARDIM DA PENHA False True False False 0 False No False
1 558997776694438 5642503 M 2016-04-29 16:08:27 2016-04-29 56 JARDIM DA PENHA False False False False 0 False No False
2 4262962299951 5642549 F 2016-04-29 16:19:04 2016-04-29 62 MATA DA PRAIA False False False False 0 False No False
3 867951213174 5642828 F 2016-04-29 17:29:31 2016-04-29 8 PONTAL DE CAMBURI False False False False 0 False No False
4 8841186448183 5642494 F 2016-04-29 16:07:23 2016-04-29 56 JARDIM DA PENHA False True True False 0 False No False

In 38568 rows, the appointment appears to take place prior to the appointment being scheduled. In fact this anomaly is likely due to same-day appointments, since the time that an appointement was scheduled is recorded, but not the time of the actual appointment. To verify that this is the case, we can repeat the above query but discarding the time information by rounding 'scheduledday' down :

In [53]:
#Repeating the query, disregarding the time appointments were scheduled 
#by rounding down scheduledday using dt.floor, from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.floor.html
df.query("scheduledday.dt.floor('1d')>appointmentday") 
Out[53]:
patientid appointmentid gender scheduledday appointmentday age neighbourhood scholarship hypertension diabetes alcoholism disability_int sms_received no_show disability
27033 7839272661752 5679978 M 2016-05-10 10:51:53 2016-05-09 38 RESISTÊNCIA False False False False 1 False Yes True
55226 7896293967868 5715660 F 2016-05-18 14:50:41 2016-05-17 19 SANTO ANTÔNIO False False False False 1 False Yes True
64175 24252258389979 5664962 F 2016-05-05 13:43:58 2016-05-04 22 CONSOLAÇÃO False False False False 0 False Yes False
71533 998231581612122 5686628 F 2016-05-11 13:49:20 2016-05-05 81 SANTO ANTÔNIO False False False False 0 False Yes False
72362 3787481966821 5655637 M 2016-05-04 06:50:57 2016-05-03 7 TABUAZEIRO False False False False 0 False Yes False

Using a strict 'greater than' in the query removes all same-day appointments, leaving only 5 appointments with the date anomaly. These are likely data entry errors and it does not seem possible to correct this error. The most likely explanation is that the two dates were mixed up. The best way to handle these records is to remove them from the dataframe.

In [54]:
drop_list = df.query("scheduledday.dt.floor('1d')>appointmentday").index  # list of records to be dropped
In [55]:
df = df.drop(drop_list)   # drop records with anomalous dates
In [56]:
df.query("scheduledday.dt.floor('1d')>appointmentday") # verify the records have been dropped
Out[56]:
patientid appointmentid gender scheduledday appointmentday age neighbourhood scholarship hypertension diabetes alcoholism disability_int sms_received no_show disability

Patient age

In [57]:
#get record with negative age patient
df[df['age']<0]
Out[57]:
patientid appointmentid gender scheduledday appointmentday age neighbourhood scholarship hypertension diabetes alcoholism disability_int sms_received no_show disability
99832 465943158731293 5775010 F 2016-06-06 08:58:13 2016-06-06 -1 ROMÃO False False False False 0 False No False
In [58]:
df = df.drop(99832) #drop record
df[df['age']<0] #check it's gone
Out[58]:
patientid appointmentid gender scheduledday appointmentday age neighbourhood scholarship hypertension diabetes alcoholism disability_int sms_received no_show disability

Exploratory Data Analysis

Approach used

Our datatset contains 110 medical appointments that have either been missed or not, collected together with data on patient health status (diabetes, hypertension, alcoholism, disability) as well as other patient characteristics (gender, age, welfare recipient status. The dataset also records information related to the appointment, such as appointment dates and location, a unique appointment ID and the patient ID.

The analysis approach will be to first try to identify associations between no-show appointments and factors of how medical appointments are organised (Part I). More specifically, I will try to answer the following questions: is the level of no-shows associated with the time patients need to wait? Is it associated with receiving an SMS? Is it associated with the day of the week of the appointment?

The second part (Part II) will aim to identify associations between no-shows and patient characteristics. More specifically, I will try to answer the following questions: is the level of no-shows associated with patient health status (diabetes, hypertension, alcoholism) ? Is it associated with gender, age or enrolement in a walfare program? A particular characteristics more common in the patients that most frequently miss appointments?

Finally, Part III regards differences between the 81 different hospitals in the dataset. Hospitals may differ in the rate of no-show appointments, as well as in how appointments are scheduled and in patient composition. Thus the final question is: is there a correlation between a hospital's level of no-show appointments and factors identified in parts 1 and 2?

Part I: Identify scheduling factors associated with no-show appointments

Question I (a) : Is the level of no-shows associated with appointment wait duration?

One might suspect that the longer a patient needs to wait for their appointment, the less likely they are to turn up. This wait will here be defined as the number of days a patient waits after scheduling their appointment.

This wait (in days) can be calculated as the difference between the date of the medical appointment and the date on which it was scheduled.

We'll define a new wait column that is the result of subtracting scheduledday from appointmentday, with day time removed from the former using dt.floor('1d'). Since the result of this operation has the data type of timedelta64 which cannot be used by certain functions (such as hist()), the result is divided by np.timedelta() which transforms the date into a numeric value of days. This is then added as a new column called 'wait'.

In [59]:
df['wait']=((df.appointmentday - df.scheduledday.dt.floor('1d')) / np.timedelta64(1, 'D')) #inspired by https://stackoverflow.com/a/37840872
df['wait'].describe()   #descriptive statistics summary of number of days patients wait for an appointment
Out[59]:
count    110521.000000
mean         10.184345
std          15.255153
min           0.000000
25%           0.000000
50%           4.000000
75%          15.000000
max         179.000000
Name: wait, dtype: float64

The median wait is 4 days, but there is a wide spread, with a maximum of 179 days. 75% of patients wait 15 days or less.

To visualise the spread, let's look at histograms for wait times.

In [60]:
df['wait'].plot(title="Histogram of wait duration (days)", kind='hist'); #histogram of wait days

Given the wide spread, let's look more closely at the first 15 days, and at the long tail:

In [61]:
f, (ax1, ax2) = plt.subplots(1, 2,  figsize=(15,5)) #figure with two adjacent subplots
ax1.hist(df[df['wait']<=15].wait, bins=15); #15 bins between 0 and 15 days - each bar is one day
ax1.set_ylabel('frequency')
ax1.set_xlabel('wait duration (days)')
ax1.set_title('wait duration up to 15 days')
ax1.text(0.8, 0.8,'each bar is one day', ha='center', va='center', transform=ax1.transAxes) #from https://stackoverflow.com/a/8482667

ax2.hist(df[df['wait']>=74].wait, bins=15); #15*7=105, so 15 bins between 74 and 179 days - each bar is one week
ax2.set_title('wait duration 74 days to 6 months')
ax2.set_ylabel('frequency')
ax2.set_xlabel('wait duration (days)')
ax2.text(0.8, 0.8,'each bar is one week', ha='center', va='center', transform=ax2.transAxes);
In [62]:
df['wait'].value_counts().head()
Out[62]:
0.0    38562
2.0     6725
4.0     5290
1.0     5213
7.0     4906
Name: wait, dtype: int64

The distribution of days patients have to wait for their appointment has a broad range and is strongly right-skewed. This is indicated by the large difference between mean (10.18 days) and median (4 days). Over a third (38.563) of appointments occur on the same day they are scheduled, then the wait duration drops from several thousand per day in the first week, to well below 50 a week in the range of 3-6 months wait. However the maximum wait durations are not outliers but part of a long tail.

Given this broad and skewed distribution of appointment wait duration, it is best to study no-shows after dividing the appointments into a small number of categories of wait durations that are more similar in size, using the Pandas cut() method:

In [63]:
wait_bins=[0,1,3,8,22,180] #define the edges of the categories
wait_labels=['same_day','2_days','same_week','3_weeks','6_months'] #category names

#create a new column categorising appointments according to wait duration
df['wait_duration']=pd.cut(df['wait'],wait_bins,labels=wait_labels,include_lowest=True,right = False) 
df.wait_duration.value_counts() #check number of appointments in each category
Out[63]:
same_day     38562
3_weeks      20899
same_week    20247
6_months     18875
2_days       11938
Name: wait_duration, dtype: int64

Here the appointments have been grouped into 5 different levels of waiting time :

  • same_day for appointments on the same day as they are scheduled - this group is significantly larger than the others but it can't be subdivided
  • 2_days for appointments after the day when they were scheduled but within the first 2 days ;
  • same_week for appointments after the third day when they were scheduled but within the first week ;
  • 3_weeks for appointments after the first week when they were scheduled but within the first 3 weeks ;
  • 6_months for appointments after the first 3 weeks when they were scheduled - all falling within 6 months. Although these 5 groups vary in size, they are large enough and similar enough to be studied further.

Next we'll calculate the number of no-show appointments by grouping appointments according to the wait time level and normalising the value_counts in the no-show category:

In [64]:
#the unstack function creates seperate columns for missed and kept appointments
df.groupby('wait_duration').no_show.value_counts().unstack()
Out[64]:
no_show No Yes
wait_duration
same_day 36770 1792
2_days 9223 2715
same_week 15190 5057
3_weeks 14374 6525
6_months 12650 6225
In [65]:
#normalising the values converts them to the proportions of missed and kept appointments:
df.groupby('wait_duration').no_show.value_counts(normalize=True).unstack() 
Out[65]:
no_show No Yes
wait_duration
same_day 0.953529 0.046471
2_days 0.772575 0.227425
same_week 0.750235 0.249765
3_weeks 0.687784 0.312216
6_months 0.670199 0.329801

For the remainder of the notebook, there will frequently be a need to calculate the proportion of missed appointments in different categories. This can be calculated using groupby(), normalised value_counts(), and unstack()['Yes]. To simplify this calculation, I'll use a function called NSP for no-show proportions:

In [66]:
###Added function following reviewer's comments
def NSP(group,dtframe=df,normalize=True,missed_only=True):
    """
    Returns the proportion of no-show appointments in different categories defined by
    the group parameter. 
    Optional paramters:
    dtframe :     dataframe to use, df by default
    normalize :   can be set to False if number of no-shows is required
    missed_only : can be set to False if both no-shows and kept appointments are required ; if 
                  set to False, a dataframe will be returned, otherwise a pandas series by default 
    """
    if missed_only: #no-shows only, by default
        result=dtframe.groupby(group).no_show.value_counts(normalize=normalize).unstack()['Yes']
    else: # return 'Yes' and 'No' columns for no_show
        result=dtframe.groupby(group).no_show.value_counts(normalize=normalize).unstack()
    return result
In [67]:
##df.groupby('wait_duration').no_show.value_counts(normalize=True).unstack() 
NSP('wait_duration')
Out[67]:
wait_duration
same_day     0.046471
2_days       0.227425
same_week    0.249765
3_weeks      0.312216
6_months     0.329801
Name: Yes, dtype: float64
In [68]:
#calculating the overall no-show rate:
df.no_show.value_counts(normalize=True)
Out[68]:
No     0.798102
Yes    0.201898
Name: no_show, dtype: float64

Let's calculate the proportion of missed appointments in each wait category and visualise it in a bar chart. I'll combine groupby() and value_counts() with the normalize parameter set to True to obtain proportions of missed ("Yes") appointments for each category. (This way of calculating proportions of missed appointments will be used in future.)

In [69]:
#we'll only consider the proportion of missed appointments by selecting the 'Yes' column after calling unstack():
##noshows_by_wait = df.groupby('wait_duration').no_show.value_counts(normalize=True).unstack()['Yes']
noshows_by_wait=NSP('wait_duration')

#calculate the mean no-show rate across all appointments:
overall_noshows = df.no_show.value_counts(normalize=True)['Yes']

plt.bar(wait_labels,noshows_by_wait) # bar plot for each category
plt.title('Proportion of no-shows and wait duration')
plt.xlabel("wait duration")
plt.ylabel("proportion of no-show appointments")
plt.axhline(y=overall_noshows, color='r', linestyle='--'); #line for mean proportion of missed appointments
plt.text(4.7, 0.21, "mean no-shows" , rotation=0, verticalalignment='center');

Given the overall percentage of no-shows of 20.19%, missed appointments are particularly low for same-day appointments at 4.64%. It then rises continuously from 22.74% for appointments within the first 3 days to 32.98% for appointments between 3 weeks and 6 months.

Let's look how significant this difference in no-show rates by wait times is for the bulk of appointments, by visualising the proportions of all missed appointments, as well as the total appointments as represented by the different groups of wait duration. This is best represented as a pie chart :

In [70]:
#using documentation https://matplotlib.org/gallery/pie_and_polar_charts/pie_features.html
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,6)) #create two adjacent pie charts with same y axis
explode=(0.1,0.1,0,0,0)  #emphasise the two first wedges

ax1.pie(df.groupby('wait_duration').no_show.value_counts().unstack()['Yes'], explode=explode, autopct='%.2f') #pie for no-shows, labelled with proportion of all no-shows
ax1.set_title('% of missed appointments')
ax1.legend(wait_labels,
          title="wait duration",
          loc="lower left",
          bbox_to_anchor=(1, 0.6, 0.4, 1))           
ax1.axis('equal')

ax2.pie(df.wait_duration.value_counts(), explode=explode, autopct='%.2f') #pie for all appointments, labelled with proportion of all appointments
ax2.set_title('% of all appointments')
ax2.axis('equal');

As an alternative to pie charts, below is a visualisation comparing the impact on total and no-show appointments using a stacked horizontal bar chart:

In [71]:
###Added following reviewers comments:
#create a dataframe with number of no-shows in each wait category
wait_appt_props=NSP('wait_duration',normalize=False).to_frame()
wait_appt_props=wait_appt_props/(wait_appt_props.sum()) #convert to proportions of no-shows
wait_appt_props['all appointments']=df.wait_duration.value_counts(normalize=True) #add total appointments, as proportions
wait_appt_props=wait_appt_props.transpose() #switch rows/columns for display in bar chart
wait_appt_props #check dataframe
Out[71]:
wait_duration same_day 2_days same_week 3_weeks 6_months
Yes 0.080308 0.121672 0.226629 0.292417 0.278973
all appointments 0.348911 0.108016 0.183196 0.189095 0.170782
In [72]:
#generate horizontal bar chart to compare wait category proportions of total and no-show appointments
ax=wait_appt_props.plot.barh(stacked=True, legend=None,figsize=(16,5))

#generate and format labels for stacked bar segments, adapted from https://stackoverflow.com/a/39280239
labels = [] 
for j in wait_appt_props.columns: #loop to format label text
    for i in wait_appt_props.index:
        label = "{0}:\n{1:.1f}%".format(j,wait_appt_props.loc[i][j]*100)
        labels.append(label)
patches = ax.patches
for label, rect in zip(labels, patches): #loop to position labels
    width = rect.get_width()
    if width > 0:
        x = rect.get_x()
        y = rect.get_y()
        height = rect.get_height()
        ax.text(x + width/2., y + height/2., label, ha='center', va='center',fontsize=14)

plt.xlim((0,1)) # 0 - 100%
plt.text(0.44,1.3,"All appointments",fontsize=16) #label whole bar in figure
plt.text(0.42,0.32,"Missed appointments",fontsize=16)
ax.set_yticklabels([""]) #don't label bar on y axis
ax.set_title('Comparison of wait time impact on total and missed appointments');

The two categories with the shortest wait - ie. same-day appointments and those with a wait of 1 or 2 days - represent about one fifth of all missed appointments, but they represent a majority of all appointments. The last two categories (3weeks and 6 months) account for a minority of appointments but a majority of missed appointments.

In conclusion, the time a patient has to wait for their appointment is associated with no-show rates. For same-day appointments, no-show rates are particularly low. By contrast, wait times of 2-3 weeks, or 3 weeks to 6 months, have a higher rate of missed appointments - and these two categories together are responsible for a majority of missed appointments, although they make up about a quarter of all appointments.

Question I (b): Do SMS reminders lower no-show rates?

In [73]:
df.sms_received.value_counts() #display the number of SMS reminders 
Out[73]:
False    75039
True     35482
Name: sms_received, dtype: int64

In 35.482 cases, SMS messages were received by patients, reminding them of their appointment. Such reminders would be expected to lower the proportion of missed appointments. Let's check the effect :

In [74]:
#display number of no-shows, grouped by whether an SMS reminder was sent
##df.groupby('sms_received').no_show.value_counts()
NSP('sms_received', normalize=False, missed_only=False)
Out[74]:
no_show No Yes
sms_received
False 62509 12530
True 25698 9784
In [75]:
#no-shows grouped by SMS, converted to proportions
##df.groupby('sms_received').no_show.value_counts(normalize=True) 
NSP('sms_received', missed_only=False)
Out[75]:
no_show No Yes
sms_received
False 0.833020 0.166980
True 0.724255 0.275745
In [76]:
##ax = df.groupby('sms_received').no_show.value_counts(normalize=True).unstack().plot.barh()
ax = NSP('sms_received', missed_only=False).plot.barh()

ax.set_xlabel('Proportion of no-show appointments')
ax.set_title('SMS reminders and no-show appointments')
plt.axvline( x=0.201933, color='r', linestyle='--') #mean proportion of missed appointments
handles, labels = ax.get_legend_handles_labels() # make legend and bar order consistent - from https://stackoverflow.com/a/34142297
ax.legend(reversed(handles), reversed(labels), loc='upper right') # reverse to keep order consistent
plt.xlim((0,1))
plt.legend(["overall no-show rate","not missed", "no-show"]);

Surprisingly, appointments in which an SMS has been received have a higher percentage of no-shows (27.57%) than appointments in which no reminder was received (16.69%).

Perhaps SMS reminders are not used for all appointments?

In [77]:
df.groupby('wait').sms_received.value_counts().head(19)
Out[77]:
wait  sms_received
0.0   False           38562
1.0   False            5213
2.0   False            6725
3.0   False            1831
      True              906
4.0   True             3054
      False            2236
5.0   True             2054
      False            1223
6.0   True             2684
      False            1353
7.0   True             2850
      False            2056
8.0   True             1440
      False             892
9.0   True              997
      False             608
10.0  True              698
      False             693
Name: sms_received, dtype: int64

It turns out that SMS reminders are not sent for appointments occurring in the first 2 days. Since these are also the appointments with the lowest no-show rate, this could distort the effect of SMS reminders.

Let's look at the different categories of appointment wait times, first how many SMS reminders were received vs not received, then at the proportion of no-shows associated with SMS reminders being received or not :

In [78]:
df.groupby('wait_duration').sms_received.value_counts() # total numbers of appointments with SMS received or not
Out[78]:
wait_duration  sms_received
same_day       False           38562
2_days         False           11938
same_week      True            11548
               False            8699
3_weeks        True            12235
               False            8664
6_months       True            11699
               False            7176
Name: sms_received, dtype: int64
In [79]:
#proportion of no-shows in categories defined by wait duration and SMS reminder
##df.groupby(['wait_duration', 'sms_received']).no_show.value_counts(normalize=True).unstack()['Yes'] 
NSP(['wait_duration', 'sms_received'])
Out[79]:
wait_duration  sms_received
same_day       False           0.046471
2_days         False           0.227425
same_week      False           0.265893
               True            0.237617
3_weeks        False           0.346722
               True            0.287781
6_months       False           0.377090
               True            0.300795
Name: Yes, dtype: float64
In [80]:
#create a dataframe with no-show rates as a proportion
##df_smsreminders=df.groupby(['wait_duration', 'sms_received']).no_show.value_counts(normalize=True).unstack()['Yes']
df_smsreminders=NSP(['wait_duration', 'sms_received'])

#use unstack to get one column for appointments with, and one without, SMS reminder 
df_smsreminders=df_smsreminders.unstack()
#make column names informative
df_smsreminders=df_smsreminders.rename(columns={False:'No reminder', True:'SMS reminder'})
df_smsreminders
Out[80]:
sms_received No reminder SMS reminder
wait_duration
same_day 0.046471 NaN
2_days 0.227425 NaN
same_week 0.265893 0.237617
3_weeks 0.346722 0.287781
6_months 0.377090 0.300795
In [81]:
ax = df_smsreminders.plot.barh()  #create plot with horizontal bars 

ax.set_xlabel('Proportion of no-show appointments')
ax.set_title('SMS reminders and No-show appointments')
plt.gca().invert_yaxis() #preserve chronological order of wait groups, from https://stackoverflow.com/a/34094001

plt.axvline( x=0.2019, color='r', linestyle='--') #mean proportion of missed appointments
#handles, labels = ax.get_legend_handles_labels() # make legend and bar order consistent - from https://stackoverflow.com/a/34142297
#ax.legend(reversed(handles), reversed(labels), loc='lower right') # reverse to keep bar and label order consistent
plt.xlim((0,0.4));

SMS reminders lower the rate of missed appointments from 26.59% to 23.76% for the same_week category ; 34.67% to 28.78% in the 3 weeks category and 37.7% to 30.08% in the 6 months category. However these no-show rates are all higher than for 2_days and same-day appointments.

In those categories of appointments where they are sent, SMS reminders are associated with significantly lower no-show rates. This effect is stronger the longer patients wait for their appointment.

Question I (c): Are no-show rates associated with public holidays or the day of the week?

During the period studied, there were two public holidays in Brazil: International Workers Day on 1. May, which fell on a Sunday in 2016, and the optional holiday of Corpus Christi on 26. May, which was on a Thursday in 2016 (from https://www.timeanddate.com/holidays/brazil/2016)

Let's first look at the number of appointments on each day with appointments:

In [82]:
df.appointmentday.value_counts().sort_index() #sort_index() to have consecutive dates
Out[82]:
2016-04-29    3235
2016-05-02    4376
2016-05-03    4255
2016-05-04    4167
2016-05-05    4272
2016-05-06    3879
2016-05-09    4519
2016-05-10    4308
2016-05-11    4474
2016-05-12    4394
2016-05-13    3987
2016-05-14      39
2016-05-16    4613
2016-05-17    4371
2016-05-18    4373
2016-05-19    4270
2016-05-20    3828
2016-05-24    4009
2016-05-25    3909
2016-05-30    4514
2016-05-31    4279
2016-06-01    4464
2016-06-02    4310
2016-06-03    4090
2016-06-06    4691
2016-06-07    4416
2016-06-08    4479
Name: appointmentday, dtype: int64
In [83]:
#descriptive statistics of daily number of appointments
df.appointmentday.value_counts().describe()
Out[83]:
count      27.00000
mean     4093.37037
std       864.05743
min        39.00000
25%      4049.50000
50%      4308.00000
75%      4440.00000
max      4691.00000
Name: appointmentday, dtype: float64
In [84]:
df.appointmentday.value_counts().size #count number of days with appointments
Out[84]:
27

Although data were collected between 29. April and 8. June, i.e. over a period of 40 days, appointments only took place on 27 days. One of these days is an outlier, 14. May, a Saturday with only 39 appointments, while the median number of appointments is 4308 on each day with appointments. There are no appointments on Corpus Christi (26. May) or the following day, leading to a period of 4 days (26., 27., 28., 29. May) without appointments. The first day with appointments following the public holidays (2. May, 30. May) have more appointments than the median, but this could be because they are Mondays, which on which numbers of appointments are commonly in the top quartile i.e. above 4440 (9. May, 16. May, 30. May, 6. June).

Next we'll look at the proportion of daily no-show appointments (on days with appointments):

In [85]:
##df.groupby('appointmentday').no_show.value_counts(normalize=True).unstack()['Yes']
NSP('appointmentday')
Out[85]:
appointmentday
2016-04-29    0.195672
2016-05-02    0.196755
2016-05-03    0.195065
2016-05-04    0.233261
2016-05-05    0.188670
2016-05-06    0.204950
2016-05-09    0.225271
2016-05-10    0.230269
2016-05-11    0.197810
2016-05-12    0.190487
2016-05-13    0.226988
2016-05-14    0.230769
2016-05-16    0.227401
2016-05-17    0.213681
2016-05-18    0.203522
2016-05-19    0.208899
2016-05-20    0.234848
2016-05-24    0.202295
2016-05-25    0.194167
2016-05-30    0.196721
2016-05-31    0.179247
2016-06-01    0.181900
2016-06-02    0.186079
2016-06-03    0.196822
2016-06-06    0.186101
2016-06-07    0.184783
2016-06-08    0.172806
Name: Yes, dtype: float64
In [86]:
#descriptive statistics of proportion of daily no-show appointments
##df.groupby('appointmentday').no_show.value_counts(normalize=True).unstack()['Yes'].describe()
NSP('appointmentday').describe()
Out[86]:
count    27.000000
mean      0.203157
std       0.018386
min       0.172806
25%       0.189579
50%       0.196822
75%       0.219476
max       0.234848
Name: Yes, dtype: float64
In [87]:
#Plot how the proportion of missed appointments varies daily during the period studied
#days without appointments are not considered, to obtain a continuous line

fig, ax = plt.subplots(figsize=(8,6))
##ax.plot(df.groupby('appointmentday').no_show.value_counts(normalize=True).unstack()['Yes']) #overall proportion of no-shows
ax.plot(NSP('appointmentday')) #overall proportion of no-shows
##ax.plot(df[df['wait']>0].groupby('appointmentday').no_show.value_counts(normalize=True).unstack()['Yes']) #prop. no-shows excluding same-day appointments
ax.plot(NSP('appointmentday',dtframe=df[df['wait']>0]))

fig.autofmt_xdate() # rotate and align the tick labels, as per docs - https://matplotlib.org/gallery/recipes/common_date_problems.html
plt.axhline(y=0.2019, color='r', linestyle='--') #mean proportion of missed appointments
#plt.axvline(x=2016-05-02)
plt.axvline(pd.Timestamp('2016-05-02'), color='b', linestyle=':') #line for first day after IWD https://stackoverflow.com/a/52666020
plt.text(pd.Timestamp('2016-05-02'),0.171,'Labour Day',rotation=0)

plt.axvline(pd.Timestamp('2016-05-30'),color='b', linestyle=':') # line for first day after Corpus Christi
plt.text(pd.Timestamp('2016-05-30'),0.171,'Corpus Christi',rotation=0)

ax.legend(['overall no-shows','excl. same-day appt.'],loc='upper right')

ax.set_ylabel('Proportion of no-show appointments')
ax.set_title('Daily variations in no-show rates');

The proportion of no-shows varies significantly each day. This is not resulting from different proportions of same-day appointments, as excluding same-day appointments has a similar patter.

In May, there are four peaks - could this be a weekly pattern? To test this, we'll calculate no-show rates for different days of the week.

In [88]:
#convert to day of the week as per https://stackoverflow.com/a/30222759
df['appointment_dotw'] = df['appointmentday'].dt.dayofweek #numeric keeps order https://pandas.pydata.org/pandas-docs/version/0.22/api.html#datetimelike-properties

#calculate no-show rates for each day of the week:
##df.groupby('appointment_dotw').no_show.value_counts(normalize=True).unstack()['Yes']
NSP('appointment_dotw')
Out[88]:
appointment_dotw
0    0.206446
1    0.200874
2    0.196861
3    0.193494
4    0.212261
5    0.230769
Name: Yes, dtype: float64
In [89]:
##df[df['wait']>0].groupby('appointment_dotw').no_show.value_counts(normalize=True).unstack()['Yes']
NSP('appointment_dotw',dtframe=df[df['wait']>0])
Out[89]:
appointment_dotw
0    0.302105
1    0.286660
2    0.271474
3    0.272671
4    0.293544
5    0.290323
Name: Yes, dtype: float64
In [90]:
#Create bar chart for no-shows on different days of the week
plt.figure(figsize=(15,5))  
plt.subplot(121)  #create first subplot

#calculate no-show rate for each day of the week, theen plot on bar chart
##df.groupby('appointment_dotw').no_show.value_counts(normalize=True).unstack()['Yes'].plot(kind='bar')
NSP('appointment_dotw').plot(kind='bar')

plt.ylabel("No-show rate")
plt.xlabel("Day of the week")
plt.title("No-show rates for days of the week");

plt.subplot(122)  #create second bar chart
#same, but exclude same-day appointments
##df[df['wait']>0].groupby('appointment_dotw').no_show.value_counts(normalize=True).unstack()['Yes'].plot(kind='bar')
NSP('appointment_dotw',dtframe=df[df['wait']>0]).plot(kind='bar')
plt.ylabel("No-show rate")
plt.xlabel("Day of the week");
plt.title("No-show rates excluding same-day appointments for days of the week");

Over the period studied, the proportion of daily no-shows varies by over 6%, from 17% to 23%. The peaks do not have a weekly periodicity as there isn't a clear association with the day of the week. Although the mean no-show proportion varies by day of the week, the range is only around 2%.

Part II: Identifying patient characteristics associated with no-show appointments

Question II (a): Are patient health and socio-economic characteristics associated with missed appointments ?

The data contain several columns relating to the health and socio-economic status of the patient: scholarship, hypertension, alcoholism, disability.

Let's look if there's an association between missed appointments are and whether the patient belongs to one of these.

One might for example expect that more appointments are missed by patients with alcoholism. Let's compare the numbers of missed appointments (no_show - Yes) of patients with alcoholism (alcoholism - True) or without (False):

In [91]:
df.groupby('alcoholism').no_show.value_counts() #rows grouped by alcoholism, then the values of no_show are counted
Out[91]:
alcoholism  no_show
False       No         85524
            Yes        21637
True        No          2683
            Yes          677
Name: no_show, dtype: int64

when the normalize parameter is set to True, value_counts displays the proportions of kept ('No') vs missed ('Yes') appointments:

In [92]:
df.groupby('alcoholism').no_show.value_counts(normalize=True)
Out[92]:
alcoholism  no_show
False       No         0.798089
            Yes        0.201911
True        No         0.798512
            Yes        0.201488
Name: no_show, dtype: float64
In [93]:
##df.groupby('alcoholism').no_show.value_counts(normalize=True)
NSP('alcoholism',missed_only=False)
Out[93]:
no_show No Yes
alcoholism
False 0.798089 0.201911
True 0.798512 0.201488

The rate of missed appointments with patients affected by alcoholism is 20.15%, with unaffected patients it is 20.19% - there is almost no difference.

Let's look at all available patient characteristics and plot them to compare their effect. I will calculate the proportion of missed appointments for patients with vs without a particular characteristic.

The proportion of kept appointments is not informative since, together with the proportion of missed appointments, it will always add up to 1. In the following, we will only consider the proportion of missed appointments (no_show is 'Yes').

In [94]:
#using a loop to examine characteristics with boolean type, adding proportions of missed appointments to lists
#below unstack()['Yes'] is used to discard the proportion of kept appointments

characteristics = ['scholarship', 'hypertension','diabetes', 'alcoholism','disability'] #list of characteristics considered
yes = [] #list for proportion of missed appointments with patients that possess a characteristic
no = []  #list for proportion of missed appointments with patients that do not possess a characteristic

for characteristic in characteristics:
    ##result=df.groupby(characteristic).no_show.value_counts(normalize=True).unstack()['Yes']
    result=NSP(characteristic)
    print("proportion no-show appointments, patients with {} : {} ; without: {}".format(characteristic,result.loc[True],result.loc[False]))
    yes.append(result.loc[True])
    no.append(result.loc[False])
proportion no-show appointments, patients with scholarship : 0.2373630420771568 ; without: 0.19803331326510135
proportion no-show appointments, patients with hypertension : 0.17301958625751113 ; without: 0.2089945897204689
proportion no-show appointments, patients with diabetes : 0.18003273322422259 ; without: 0.20359141336348924
proportion no-show appointments, patients with alcoholism : 0.20148809523809524 ; without: 0.20191114304644414
proportion no-show appointments, patients with disability : 0.18088432335864224 ; without: 0.20233279769490775

Now we'll create a dataframe from the above lists and plot the proportions of missed appointments depending on whether the appointment was for patients with or without the specified characteristic:

In [95]:
noshows_characteristics = pd.DataFrame({"with":yes, "without":no}, index=characteristics)
noshows_characteristics
Out[95]:
with without
scholarship 0.237363 0.198033
hypertension 0.173020 0.208995
diabetes 0.180033 0.203591
alcoholism 0.201488 0.201911
disability 0.180884 0.202333
In [96]:
ax = noshows_characteristics.plot.barh(figsize=(8,5))
ax.set_xlabel('Proportion of no-show appointments')
ax.set_title('Appointments and patient characteristics')
plt.axvline( x=0.2019, color='r', linestyle='--') #mean proportion of missed appointments
handles, labels = ax.get_legend_handles_labels() # make legend and bar order consistent - from https://stackoverflow.com/a/34142297
ax.legend(reversed(handles), reversed(labels), loc='upper right') # reverse to keep order consistent
plt.xlim((0,0.3));

Missed medical appointments are less prevalent when they are with patients with a disability, with diabetes, or with hypertension. Appointments of patients with these health conditions are 2-3% less likely to be missed. The proportion of missed appointments is unaffected by whether patients have alcoholism or not.

The patient characteristic with the largest difference in missed appointments is in relation to the welfare program Bolsa Família : appointments with patients that are enrolled are 4% more likely to be missed.

On the whole, the effect of patient characteristics on the proportion of missed appointments appears to be relatively small and its significance is uncertain. One may also be skeptical about comparing groups of patients with certain health conditions to groups with no such health conditions: patients with diabetes or hypertension are likely to have more frequent appointments, than patients without these health conditions, which in turn may result in a confounding variable impacting the prevalence of no-shows.

For these reasons it may be instructive to study the prevalence of missed appointments depending on patient characteristics in a second way.

Question II (b): What are the characteristics of patients who frequently miss appointments ?

For many events, a large part of an observed effect is due to a relatively small minority of causative constituent events - this is known as the Pareto principle or the 80/20 rule. In this section I will try to analyse if there are patient groups that cause such a disproportionate amount of missed appointments.

A second approach to studying the relation between no-shows and patient characteristics is to identify a group of patients that are disproportionately missing appointments, and then looking at the composition of that group in terms of patient characteristics. Are patients with multiple missed appointments more or less likely to have a certain health condition or to be a welfare recipient?

This second approach requires grouping patients by the number of their appointments and missed appointments. However, in the current dataframe, each row represents an appointment, and there is no straightforward way of answering this question.

Therefore, based on unique patient IDs, we will create a dataframe in which each row represents a patient, and determine the number of their missed and total medical appointments. Then we will add the patient characteristics to this new dataframe, called df_patients:

In [97]:
#create new df on basis of unique patient ID and columns with kept (No) and missed (Yes) appointments
df_patients = df.groupby(['patientid']).no_show.value_counts().unstack() 
df_patients.head()
Out[97]:
no_show No Yes
patientid
5628261 NaN 1.0
11831856 1.0 NaN
22638656 2.0 NaN
52168938 2.0 NaN
54223998 1.0 NaN
In [98]:
#create new df on basis of unique patient ID and columns with kept (No) and missed (Yes) appointments
df_patients = df.groupby(['patientid']).no_show.value_counts().unstack() 
#df_patients = NSP('patientid', missed_only=False)
df_patients.head()
Out[98]:
no_show No Yes
patientid
5628261 NaN 1.0
11831856 1.0 NaN
22638656 2.0 NaN
52168938 2.0 NaN
54223998 1.0 NaN
In [99]:
#Now we'll tidy up the dataframe: remove NaN, calculate the total number of appointments,
#rename columns and drop unneeded columns

df_patients.fillna(value=0, inplace=True) # missing values are in fact 0
df_patients['total_appt']= (df_patients['No'] + df_patients['Yes']).astype(int) # calculating total number of appointments
df_patients.reset_index(inplace=True) #reset the index 
df_patients.rename(columns={'Yes':'missed_appt'}, inplace=True) # renaming columns with missed appointments
df_patients.drop('No', axis=1,  inplace=True) #kept appointments not needed
df_patients['missed_appt'] = df_patients['missed_appt'].astype(int) #appointments are integers

df_patients.head()
Out[99]:
no_show patientid missed_appt total_appt
0 5628261 1 1
1 11831856 0 1
2 22638656 0 2
3 52168938 0 2
4 54223998 0 1
In [100]:
#To get the patient characteristics, I'll create a copy of the appointments dataframe, dropping unneeded columns
df_patients_char = df.copy().drop(['appointmentid', 'scheduledday','appointmentday', 'sms_received', 'no_show','disability_int','appointment_dotw'], axis=1)

#Next, we're dropping rows with duplicated patientid - this should result in each row representing a unique patient
df_patients_char.drop_duplicates(subset='patientid',inplace=True)
df_patients_char.head()
Out[100]:
patientid gender age neighbourhood scholarship hypertension diabetes alcoholism disability wait wait_duration
0 29872499824296 F 62 JARDIM DA PENHA False True False False False 0.0 same_day
1 558997776694438 M 56 JARDIM DA PENHA False False False False False 0.0 same_day
2 4262962299951 F 62 MATA DA PRAIA False False False False False 0.0 same_day
3 867951213174 F 8 PONTAL DE CAMBURI False False False False False 0.0 same_day
4 8841186448183 F 56 JARDIM DA PENHA False True True False False 0.0 same_day
In [101]:
#the two patient dataframes should have the same number of rows - this will allow them to be merged cleanly:
df_patients.shape, df_patients_char.shape
Out[101]:
((62298, 3), (62298, 11))
In [102]:
#merging the two dataframes on the basis of patientid 
df_patients = df_patients.merge(df_patients_char, how='inner', on='patientid')
df_patients.shape #check number of unique patient ids
df_patients.head()
Out[102]:
patientid missed_appt total_appt gender age neighbourhood scholarship hypertension diabetes alcoholism disability wait wait_duration
0 5628261 1 1 M 13 PARQUE MOSCOSO False False False False False 3.0 same_week
1 11831856 0 1 M 16 SANTO ANTÔNIO False False False False False 0.0 same_day
2 22638656 0 2 F 22 INHANGUETÁ False False False False False 19.0 3_weeks
3 52168938 0 2 F 28 JARDIM DA PENHA False False False False False 27.0 6_months
4 54223998 0 1 M 32 ITARARÉ False False False False False 16.0 3_weeks

The new dataframe has columns specifying the number of total appointments and of missed appointments each unique patient has had.To check the new dataframe, let's start by looking at patients with alcoholism, calculating the mean rates (per patient) of missed appointments and of total appointments :

In [103]:
#Calculate the mean number of missed appointements for patients with and without alcoholism:
df_patients[df_patients['alcoholism']==True].missed_appt.mean(), df_patients[df_patients['alcoholism']==False].missed_appt.mean()
Out[103]:
(0.449535192563081, 0.35591854191340966)
In [104]:
#Calculate the mean of total appointements for patients with and without alcoholism:
df_patients[df_patients['alcoholism']==True].total_appt.mean(), df_patients[df_patients['alcoholism']==False].total_appt.mean()
Out[104]:
(2.231075697211155, 1.7627483879457824)
In [105]:
#Calculate the mean proportion of missed appointements for patients with and without alcoholism 
#this should be the same as the overall proportion of missed appointments with 
#patients with vs with alcoholism previously determined:

df_patients[df_patients['alcoholism']==True].missed_appt.mean() / df_patients[df_patients['alcoholism']==True].total_appt.mean(), df_patients[df_patients['alcoholism']==False].missed_appt.mean() / df_patients[df_patients['alcoholism']==False].total_appt.mean()
Out[105]:
(0.20148809523809524, 0.20191114304644414)

Looking at the patient characteristic of alcoholism, on average each patient with alcoholism has more missed appointments (mean of 0.45 vs 0.36 missed appointments), but on average each patient with alcoholism also has more medical appointments (mean of 2.23 vs 1.76 medical appointments). This means that the likelihood a patient with alcoholism to miss an appointment (20.15%) is virtually indistingushable from a patient without alcoholism (20.19%) - consistent with the previous finding.

Next we will try to group patients together on the basis of how many appointments they miss.

In [106]:
#number of patients grouped by the number of missed appointments:
df_patients.missed_appt.value_counts()
Out[106]:
0     44637
1     14438
2      2415
3       516
4       162
5        58
6        33
7        13
8         9
10        4
11        4
9         3
16        1
12        1
13        1
14        1
15        1
18        1
Name: missed_appt, dtype: int64

A large majority of patients never missed an appointment within the timeframe of the study, and only small numbers of patients repeatedly missed their medical appointment

In [107]:
#now we'll create 4 groups of patients according to the number of missed 
#appointments and label each patient using a new column called reliability:
missed_bins=[0,1,2,4,18]
missed_labels=['never','once','a_few','repeatedly']
df_patients['reliability']=pd.cut(df_patients['missed_appt'],missed_bins,right = False,labels=missed_labels)
df_patients[['missed_appt','total_appt', 'reliability']].head(15) #verifying the new column
Out[107]:
missed_appt total_appt reliability
0 1 1 once
1 0 1 never
2 0 2 never
3 0 2 never
4 0 1 never
5 0 1 never
6 0 1 never
7 1 2 once
8 0 1 never
9 0 2 never
10 0 1 never
11 1 2 once
12 0 1 never
13 0 1 never
14 2 2 a_few
In [108]:
#number of patients in each reliability group:
###df_patients.groupby('reliability').sum()   -change made following reviewers comments:
df_patients.groupby('reliability')['patientid'].count()
Out[108]:
reliability
never         44637
once          14438
a_few          2931
repeatedly      291
Name: patientid, dtype: int64
In [109]:
### added following reviewers comments
#number of patients with specific characteristics in each reliability group (summing boolean values):
df_patients.groupby('reliability')[['scholarship','hypertension','diabetes','alcoholism','disability']].sum()
Out[109]:
scholarship hypertension diabetes alcoholism disability
reliability
never 3873.0 9223.0 3266.0 986.0 856.0
once 1481.0 2497.0 956.0 408.0 215.0
a_few 388.0 469.0 177.0 104.0 48.0
repeatedly 46.0 53.0 17.0 8.0 13.0

The numbers of patients in the group who repeatedly miss appointments are quite small

In [110]:
#now we will look at the number and proportion of patients with alcoholism in each of the reliability groups
df_patients.groupby('reliability')['alcoholism'].value_counts(normalize=False).unstack()[True]
Out[110]:
reliability
never         986
once          408
a_few         104
repeatedly      8
Name: True, dtype: int64
In [111]:
#same but expressed as proportions:
df_patients.groupby('reliability')['alcoholism'].value_counts(normalize=True).unstack()[True]
Out[111]:
reliability
never         0.022089
once          0.028259
a_few         0.035483
repeatedly    0.027491
Name: True, dtype: float64
In [112]:
#loop to calculate the proportion of patients in reliability groups with a particular characteristics
characteristics = ['scholarship', 'hypertension','diabetes', 'alcoholism','disability']
never, once, a_few, repeatedly = [], [], [], [] #lists for calculated proportions
means=[] #list for proportion in whole population
for characteristic in characteristics:
    result=df_patients.groupby('reliability')[characteristic].value_counts(normalize=True).unstack()[True]
    char_never=result.loc['never']
    char_once=result.loc['once']
    char_a_few=result.loc['a_few']
    char_repeatedly=result.loc['repeatedly']
    print("proportion of patients with {} : never - {:.1%}, once - {:.1%}, a few: {:.1%}, repeatedly: {:.1%}".format(characteristic,char_never,char_once,char_a_few,char_repeatedly))
    never.append(char_never)
    once.append(char_once)
    a_few.append(char_a_few)
    repeatedly.append(char_repeatedly)
    means.append(df_patients[characteristic].value_counts(normalize=True)[True]) # average proportion of characteristic
proportion of patients with scholarship : never - 8.7%, once - 10.3%, a few: 13.2%, repeatedly: 15.8%
proportion of patients with hypertension : never - 20.7%, once - 17.3%, a few: 16.0%, repeatedly: 18.2%
proportion of patients with diabetes : never - 7.3%, once - 6.6%, a few: 6.0%, repeatedly: 5.8%
proportion of patients with alcoholism : never - 2.2%, once - 2.8%, a few: 3.5%, repeatedly: 2.7%
proportion of patients with disability : never - 1.9%, once - 1.5%, a few: 1.6%, repeatedly: 4.5%

We will now consider gender as another patient characteristic that can be added to the other characteristics. First, let's check if there are gender differences re.missing appointments:

In [113]:
df_patients.groupby('gender').missed_appt.mean() #mean no-show appointments for female and male patients
Out[113]:
gender
F    0.364365
M    0.347054
Name: missed_appt, dtype: float64
In [114]:
df_patients.groupby('gender').total_appt.mean() #mean number of appointments for female and male patients
Out[114]:
gender
F    1.793882
M    1.738417
Name: total_appt, dtype: float64
In [115]:
#calculate proportion of missed appointments for female and male patients
df_patients.groupby('gender').missed_appt.mean() / df_patients.groupby('gender').total_appt.mean()
Out[115]:
gender
F    0.203115
M    0.199638
dtype: float64

Female patients have slightly more appointments than male patients, but the no-show rate is very similar for the two genders

In [116]:
#we will add gender as a patient characteristic - unlike the other patient characteristics it's a 
#string (F or M) not a boolean, so the above loop couldn't be used, but is easily adapted
characteristics.append('male')
result=df_patients.groupby('reliability')['gender'].value_counts(normalize=True).unstack()['M']
char_never=result.loc['never']
char_once=result.loc['once']
char_a_few=result.loc['a_few']
char_repeatedly=result.loc['repeatedly']
print("proportion of male patients : never - {:.1%}, once - {:.1%}, a few: {:.1%}, repeatedly: {:.1%}".format(char_never,char_once,char_a_few,char_repeatedly))
never.append(char_never)
once.append(char_once)
a_few.append(char_a_few)
repeatedly.append(char_repeatedly)
means.append(df_patients['gender'].value_counts(normalize=True)['M']) # average proportion of characteristic
proportion of male patients : never - 36.1%, once - 35.4%, a few: 31.9%, repeatedly: 35.7%

Only around one-third of patients in the study are male. Next we'll create a dataframe with the statistics of patient characteristics for visualisation.

In [117]:
#create a dataframe with the proportion of patients with the different 
#characteristics studied in each reliability group
reliability_characteristics = pd.DataFrame({"never" : never, "once": once, "a few": a_few,"repeatedly":repeatedly}, index=characteristics)
reliability_characteristics=reliability_characteristics[['never', 'once', 'a few', 'repeatedly']] # control column order
reliability_characteristics
Out[117]:
never once a few repeatedly
scholarship 0.086767 0.102577 0.132378 0.158076
hypertension 0.206622 0.172946 0.160014 0.182131
diabetes 0.073168 0.066214 0.060389 0.058419
alcoholism 0.022089 0.028259 0.035483 0.027491
disability 0.019177 0.014891 0.016377 0.044674
male 0.360643 0.354204 0.319345 0.357388

Now we will visualise these proportions in a bar chart. For each characteristic, a horizontal line shows the overall population mean of each characteristic, and the bars represent how common a particular characteristic is in each reliability group.

In [118]:
ax = reliability_characteristics.plot.bar(figsize=(10,5))
ax.set_ylabel('%patients in reliability groups')
ax.set_title('Proportion of patients')
ax.legend(loc='upper left')

vals = ax.get_yticks()
ax.set_yticklabels(['{:,.0%}'.format(x) for x in vals]); #format as percentage
for i in range(len(means)): #add horizontal lines of mean proportion for each characteristic
    plt.axhline(y=means[i], xmin=i/6, xmax=(i+1)/6,color='black',linestyle='--')

    

Looking at the composition of the groups of patients who miss their appointments repeatedly or a few times, there is no over-representation of patients with hypertension, diabetes, alcoholism or in relation to patient gender. There is a higher proportion of disabled patients amongst those who miss their appointments repeatedly, but this is due to only 13 disabled patients in this group, and probably insignificant.

The only characteristic for which there appears to be a trend is for patients who are recipients of the Bolsa Familia scholarship. The proportion of these scholarship recipients rises in each category with decreasing reliability of turining up for medical appointment, and there is a small overrepresentation of these patients in the groups that miss their appointments a few times or repeatedly.

To conclude this part of the investigation, we will look at the proportion of no-shows in each reliability group, and at the proportions of appointments and no-shows in each reliability group.

In [119]:
#calculate the total number of no-shows in different reliability groups
df_patients.groupby('reliability').missed_appt.sum() 
Out[119]:
reliability
never             0
once          14438
a_few          6378
repeatedly     1480
Name: missed_appt, dtype: int64
In [120]:
#calculate the total number of appointments in different reliability groups
df_patients.groupby('reliability').total_appt.sum()
Out[120]:
reliability
never         67490
once          28508
a_few         11894
repeatedly     2611
Name: total_appt, dtype: int64
In [121]:
#calculating the proportion of no-shows for the four reliability groups
missed_proportion = (df_patients.groupby('reliability').missed_appt.sum() / df_patients.groupby('reliability').total_appt.sum())
missed_proportion
Out[121]:
reliability
never         0.000000
once          0.506454
a_few         0.536237
repeatedly    0.566833
dtype: float64
In [122]:
#Bar chart of the proportion of no-shows in reliability groups
ax = missed_proportion.plot.bar(rot=45)
ax.set_ylabel('Proportion missed in reliability groups')
ax.set_title('Proportion of missed appointments and reliability')
ax.legend(loc='upper left');

There is little difference between reliabity groups in the proportion of missed appointments. For patients who have missed any appointments during the period over which data has been recorded, the likelihood that they miss an appointment is on average a slightly above 50%.

In [123]:
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,6)) #create two adjacent plots with same y axis
ax1.pie(df_patients.groupby('reliability').total_appt.sum(), autopct='%.1f', explode=[0,0.2,0,0]) #pie chart 'once' wedge detached for emphasis
ax1.set_title('% of total appointments')
ax1.axis('equal')
ax1.legend(df_patients.groupby('reliability').total_appt.sum().index,
          title="missing appts",
          loc="lower left",
          bbox_to_anchor=(1, 0.6, 0, 1))
#explode=[0,0,0.2,0.1]


ax2.pie(df_patients.groupby('reliability').missed_appt.sum(), autopct='%.1f', explode=[0,0.2,0,0]) #second pie chart, 'once' wedge detached for emphasis
ax2.set_title('% of all missed appointments')
ax2.axis('equal');

As an alternative to pie charts, below the same data are showsn as a stacked horizontal bar chart:

In [124]:
reliability_appt_props=df_patients.groupby('reliability').missed_appt.sum().to_frame()
reliability_appt_props=reliability_appt_props/reliability_appt_props.sum()
reliability_appt_props['total']=df_patients.groupby('reliability').total_appt.sum()/df_patients.groupby('reliability').total_appt.sum().sum()
reliability_appt_props=reliability_appt_props.transpose()
reliability_appt_props
Out[124]:
reliability never once a_few repeatedly
missed_appt 0.000000 0.647560 0.286060 0.066380
total 0.610753 0.257984 0.107635 0.023628
In [125]:
#generate horizontal bar chart to compare wait category proportions of total and no-show appointments
ax=reliability_appt_props.plot.barh(stacked=True, legend=None,figsize=(16,5))

#generate and format labels for stacked bar segments, adapted from https://stackoverflow.com/a/39280239
labels = [] 
for j in reliability_appt_props.columns: #loop to format label text
    for i in reliability_appt_props.index:
        label = "{0}:\n{1:.1f}%".format(j,reliability_appt_props.loc[i][j]*100)
        labels.append(label)
patches = ax.patches
for label, rect in zip(labels, patches): #loop to position labels
    width = rect.get_width()
    if width > 0:
        x = rect.get_x()
        y = rect.get_y()
        height = rect.get_height()
        ax.text(x + width/2., y + height/2., label, ha='center', va='center',fontsize=14)

plt.xlim((0,1)) # 0 - 100%
plt.text(0.44,1.3,"All appointments",fontsize=16) #label whole bar in figure
plt.text(0.42,0.32,"Missed appointments",fontsize=16)
ax.set_yticklabels([""]) #don't label bar on y axis
ax.set_title('Reliability and no shows: comparison of total and missed appointments');

The pie chart shows that patients who repeatedly miss appointments are only responsible for a small minority of the overall no-shows. 61.1% of patients never miss an appointment during the period studied. around 1/4 of all appointments are with patients who have missed one appointment over the period studied, and the same group are responsible for nearly 2/3 of all missed appointments. The two pie charts show that, the pattern of no-shows does not follow the Pareto 80/20 rule, as it is not the case that a large majority of missed appointments are due to a minority of patients who disproportionately miss appointments. An important caveat is that this observation is only true for the period studied, which only covers 40 days.

Question II (c): Is patient age associated with no-show appointments?

As previously noted, age is recorded as integers up to a maximum of 115 years. To gain an initial overview of the age distibution of patients and their use of medical resources, we'll represent patient ages as a histogram as well as the sum of total appointments for each age. In a second graph, we'll display the mean number of appointments per patient at each patient age, as well as the mean number of no-show appointments per patient at each age.

In [126]:
f, (ax1, ax2) = plt.subplots(1, 2,  figsize=(15,5)) #create figure with two columns
ax1.hist(df_patients.age, bins=24) #histogram of patient ages, ages up to 115 years, 24 bins => 1 bar = 5 years
ax1.plot(df_patients.groupby('age').total_appt.sum()) #sum of appointments for each patient age
ax1.set_title('Histogram of patient age, and total appointments')
ax1.set_ylabel('Frequency in age group / sum of appointments')
ax1.set_xlabel('patient age')
ax1.legend(['Total appointments','Age'])
ax2.plot(df_patients.groupby('age').total_appt.mean()[1:81]) #mean appointment number per patient for each year
ax2.plot(df_patients.groupby('age').missed_appt.mean()[1:81]) #mean no-show appointment per patient for each year
ax2.set_title('Mean appointments and no-shows by patient age')
ax2.set_ylabel('Mean appointments / mean no-shows per patient')
ax2.set_xlabel('patient age')
plt.xlim((0,80))
ax2.legend(['Total appointments - mean for patient age','Missed appointments - mean for patient age']);

As the histogram on the left shows, the distribution of patient age is relatively flat until approx. 65 years, after which it drops off. Patient and appointment numbers above age 80 are relatively low and will not be considered in some figures. The sum of total appointments for each age group approximately follows the age distribution. The mode of the distribution is 0-5 years, which also has a spike in total appointments for age 0 - this likely reflects births and associated medical interventions.

The graph on the left shows that appointment per patient and no-show appointments are relatively flat across patient age, the former varying between 1.5 and 2 appointments per patient, the latter varying between 0.25 and 0.5 per patient. The graph excludes the first data point (due to a single appointment recorded as age -1) and excludes data above age 80 where patient numbers are rather low.

To test if patient age is associated with no-show appointments, we'll calculate the proportion of missed appointments for each patient age - this is essentially the ratio between the two lines in the chart on the right above

In [127]:
#calculate no-show proportion for each age and store in pd series
noshow_rate_age=(df_patients.groupby('age').missed_appt.mean() / df_patients.groupby('age').total_appt.mean())

noshow_rate_age[1:81].plot(kind='line', figsize=(10,5)) #only consider ages up to 80
plt.title('Missed appointment rate by patient age')
plt.ylabel('Proportion no-show appointments')
plt.xlabel('patient age (years)')
plt.xlim((0,80))
plt.axhline(y=0.201933, color='r', linestyle='--') #mean proportion of missed appointments
plt.text(65, 0.21, "mean no-shows" , rotation=0, verticalalignment='center');

The proportion of no-show appointments rises up to the age of 20, then declines in continuously up to the age of 80. The relationship looks close to linear, so let's use Seaborn's regplot() for the two segments :

In [128]:
#convert to dataframe with appropriate columns for regplot()
noshow_rate_age=noshow_rate_age.reset_index().rename(columns={0:'no-shows'})
In [129]:
fig, (ax1, ax2) = plt.subplots(ncols=2,figsize=(15,5),sharey=True) #create figure with two adjacent subplots
sns.regplot(data = noshow_rate_age[1:22], x = 'age', y = 'no-shows', ax=ax1)#regression plot ages 0-20 years
ax1.set_title('No-shows increase with patient age up to age 20')
ax1.set_ylabel("no-show rate")
ax1.set_xlabel("patient age")

sns.regplot(data = noshow_rate_age[26:82], x = 'age', y = 'no-shows', ax=ax2) #regression plot ages 25-80 years
ax2.set_title('No-shows decrease with patient age 25-80 years')
ax2.set_ylabel("no-show rate")
ax2.set_xlabel("patient age");

The proportion of no-show appointments rises in an almost linear fashion up to the age of about 20, then declines in an linearly from the age 25 for the remaining decades.

To further analyse the association of no-show rates and age, we'll divide patients into the following age groups: infant (0-2 years), child (3-10 years), adolescent(11-17), young adult(18-24), adult(25-44), middle age(45-64), old age(65-115).

In [130]:
age_bins=[0,3,11,18,25,45,65,115]
age_labels=['infant','child','adolescent','young adult', 'adult', 'middle age', 'old age']
df_patients['age_groups'] = pd.cut(df_patients['age'],age_bins,right = False,labels=age_labels)
df['age_groups'] = pd.cut(df['age'],age_bins,right = False,labels=age_labels)

The following figure will visualise the average no-show rates for each age group, and how the no-show rates change with age group for same-day appointments, as well as appointments for which patients have to wait a day or more after scheduling the appointment.

In [131]:
#Proportion of no-show appointments in different age groups:
##no_show_agegroups=df.groupby('age_groups').no_show.value_counts(normalize=True).unstack()['Yes']
no_show_agegroups=NSP('age_groups')

#Proportion of no-show appointments in different age groups - for same-day appointments only:
##noshow_sameday=df[df['wait']==0].groupby('age_groups').no_show.value_counts(normalize=True).unstack()['Yes']
noshow_sameday=NSP('age_groups',dtframe=df[df['wait']==0])

#Proportion of no-show appointments in different age groups 
# - only appointments for which patients wait for at least one day:
##noshow_notsameday=df[df['wait']>0].groupby('age_groups').no_show.value_counts(normalize=True).unstack()['Yes']
noshow_notsameday=NSP('age_groups',dtframe=df[df['wait']>0])

plt.figure(figsize=(16,5))  
plt.subplot(121)  #create first subplot
plt.bar(age_labels, no_show_agegroups)
plt.xlabel("Age group")
plt.ylabel("No-show rate")
plt.xticks(rotation=45)
plt.axhline(y=0.201933, color='r', linestyle='--') #mean proportion of missed appointments
plt.text(5, 0.21, "mean no-shows" , rotation=0, verticalalignment='center')
plt.title("No-show rates in different patient age groups")

plt.subplot(122)  #create second bar chart
plt.bar(age_labels, noshow_notsameday) #bars for no-shows with appointments in which patients need to wait
plt.bar(age_labels, noshow_sameday) #superimpose bars for no-shows with same-day appointments
plt.xlabel("Age group")
plt.ylabel("No-show rate")
plt.xticks(rotation=45)
plt.legend(['Wait for appointment','Same-day appointment'])
plt.title("No-show rates for same-day and wait appointments");

The no-show rate is particularly elevated for appointments with adolescent patients and young adults. This association of no-shows with adolescent patients and young adults is identifiable for appointments in which patients need to wait for at least one day, and for same-day appointments, the association is most pronounced with adolescenet patients.

However, what is the significance of this association for the total number of no-show appointments? We'll use pie charts to compare the proportions of no-shows and total appointments for each age group.

In [132]:
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,5))
explode=(0,0.1,0.1,0,0,0,0) # emphasise wedges for adolescents and young adults
ax1.pie(df[df['no_show']=='Yes'].groupby('age_groups').age_groups.count(), explode=explode, autopct='%.1f')
ax1.set_title('% of missed appointments')
ax1.legend(age_labels,
          title="Age group",
          loc="lower left",
          bbox_to_anchor=(1, 0.6, 0.4, 1))           
ax1.axis('equal')

ax2.pie(df.groupby('age_groups').age_groups.count(), explode=explode, autopct='%.1f')
ax2.set_title('% of all appointments')
ax2.axis('equal');
In [133]:
age_appt_props=df[df['no_show']=='Yes'].groupby('age_groups').age_groups.count().to_frame()
age_appt_props=age_appt_props/age_appt_props.sum() #convert to proportions
age_appt_props['all_appt']=df.groupby('age_groups').age_groups.count()
age_appt_props['all_appt']=age_appt_props['all_appt']/age_appt_props['all_appt'].sum()
age_appt_props=age_appt_props.transpose()
age_appt_props
Out[133]:
age_groups infant child adolescent young adult adult middle age old age
age_groups 0.058536 0.111201 0.099009 0.110573 0.289543 0.231231 0.099906
all_appt 0.067230 0.102411 0.078088 0.089453 0.261175 0.271382 0.130262
In [134]:
#generate horizontal bar chart to compare age category proportions of total and no-show appointments
ax=age_appt_props.plot.barh(stacked=True, legend=None,figsize=(16,5))

#generate and format labels for stacked bar segments, adapted from https://stackoverflow.com/a/39280239
labels = [] 
for j in age_appt_props.columns: #loop to format label text
    for i in age_appt_props.index:
        label = "{0}:\n{1:.1f}%".format(j,age_appt_props.loc[i][j]*100)
        labels.append(label)
patches = ax.patches
for label, rect in zip(labels, patches): #loop to position labels
    width = rect.get_width()
    if width > 0:
        x = rect.get_x()
        y = rect.get_y()
        height = rect.get_height()
        ax.text(x + width/2., y + height/2., label, ha='center', va='center',fontsize=14)

plt.xlim((0,1)) # 0 - 100%
plt.text(0.44,1.3,"All appointments",fontsize=16) #label whole bar in figure
plt.text(0.42,0.32,"Missed appointments",fontsize=16)
ax.set_yticklabels([""]) #don't label bar on y axis
ax.set_title('Comparison of patient age impact on total and on missed appointments');

As the pie charts show, even though a slightly higher proportion of no-shows are due to adolescent and young adult patients, the impact on the bulk of no-show appointments is relatively insiginificant, because only a relatively small proportion of appointments are with patients in these age groups.

Finally, we'll look at whether the previously identified association of SMS reminders lowering no-shows is still visible across dfferent age groups :

In [135]:
#determine proportion of no-shows grouped by whether an SMS was received or not
#only consider appointments for which reminders are sent, i.e. with over 2 days' wait
agegroups_sms_effect=df[df['wait']>2].groupby(['age_groups','sms_received']).no_show.value_counts(normalize=True).unstack()['Yes']
agegroups_sms_effect=agegroups_sms_effect.unstack().rename(columns={True:'SMS received', False:'No SMS'})#.reset_index()#.rename(columns={0:'longer_wait'})
agegroups_sms_effect
Out[135]:
sms_received No SMS SMS received
age_groups
infant 0.354430 0.288254
child 0.358372 0.290115
adolescent 0.388747 0.329221
young adult 0.382600 0.367656
adult 0.366787 0.314347
middle age 0.280492 0.220243
old age 0.239038 0.196959
In [136]:
#determine proportion of no-shows grouped by whether an SMS was received or not
#only consider appointments for which reminders are sent, i.e. with over 2 days' wait
agegroups_sms_effect=df[df['wait']>2].groupby(['age_groups','sms_received']).no_show.value_counts(normalize=True).unstack()['Yes']

agegroups_sms_effect=agegroups_sms_effect.unstack().rename(columns={True:'SMS received', False:'No SMS'})#.reset_index()#.rename(columns={0:'longer_wait'})
agegroups_sms_effect
Out[136]:
sms_received No SMS SMS received
age_groups
infant 0.354430 0.288254
child 0.358372 0.290115
adolescent 0.388747 0.329221
young adult 0.382600 0.367656
adult 0.366787 0.314347
middle age 0.280492 0.220243
old age 0.239038 0.196959
In [137]:
ax = agegroups_sms_effect.plot.barh(rot=0)
plt.gca().invert_yaxis() #preserve chronological order of age groups, from https://stackoverflow.com/a/34094001
ax.legend(title=None)
ax.set_title('SMS reminders: effect across age groups')
ax.set_ylabel('')
ax.set_xlabel('No-show proportion');

SMS reminders lower the rate of no-shows across all age groups, though the effect is particularly small for young adults

Part III: Variations across hospitals

As stated previously, the data set records 81 different hospitals ('neighbourhood' column) in which appointments take place. The previous two sections have identified factors that appear to be associated with no-show appointments. Thus the final question is to ask if there is a correlation between the above identified factors and variations between hospitals in the rate of no-show appointments.

The rationale is that hospitals may differ in their practices of scheduling medical appointments, as well as in their patient composition, and that such variations may be associated with differences in the mean proportion of no-show appointments in each hospital. Even though the multiplicity of factors involved implies that any correlations are expected to be weak, they may be detectable and, if detected, would increase confidence in the associations identified.

To start with, let's create and visualise a dataframe for the distribution of appointments across hospitals:

In [138]:
hosp_total_appts = df.neighbourhood.value_counts() #counts of total appointments in each hospital
plt.figure(figsize=(17,5))  
plt.subplot(121)  #create first subplot
hosp_total_appts.plot(kind='bar') #first plot of number of appointments in each hospital as bar diagram
plt.title("Total appointments in 81 hospitals")
plt.xticks(fontsize=7)
plt.xlabel("hospital")
plt.ylabel("total appointments")
plt.subplot(122)  #create adjacent second subplot
labels=hosp_total_appts.head(20).index.tolist()+61*[''] # only label largest 20
plt.pie(hosp_total_appts, labels=labels)  #pie diagram to visualise as proportion of all appointments
plt.xlabel("20 largest hospitals are labelled")
plt.axis('equal'); #make sure the pie isn't oval
In [139]:
hosp_total_appts.head() #show 5 largest hospitals
Out[139]:
JARDIM CAMBURI     7717
MARIA ORTIZ        5805
RESISTÊNCIA        4430
JARDIM DA PENHA    3877
ITARARÉ            3514
Name: neighbourhood, dtype: int64
In [140]:
hosp_total_appts.tail() #show 5 smallest hospitals
Out[140]:
ILHA DO BOI                    35
ILHA DO FRADE                  10
AEROPORTO                       8
ILHAS OCEÂNICAS DE TRINDADE     2
PARQUE INDUSTRIAL               1
Name: neighbourhood, dtype: int64

There is a very wide range in the number of appointments per hospital - from 7717 down to single figures. This is problematic for the approach described for identifying correlations as we would be looking at fundamentally different types of instutions but giving each equal weight - the large number of small "hospitals" would likely distort our findings. For that reasons we will only consider the largest hospitals in our analysis :

In [141]:
#calculate the sum and proportion of appointments taking place in the 27 largest hospitals
hosp_total_appts.head(27).sum() , hosp_total_appts.head(27).sum()/hosp_total_appts.sum()
Out[141]:
(78039, 0.70610110295780892)
In [142]:
#get the number of appointments fpr the 27. largest hospital, and calculate the ratio between the largest and 27. largest
hosp_total_appts[26], hosp_total_appts[0]/hosp_total_appts[26]
Out[142]:
(1836, 4.2031590413943354)

We will look at the top 1/3 of largest hospitals. 78040 appointments took place in the 27 largest hospitals, representing 70.6% of all appointments. Thus, while covering most of the appointments recorded, restricting our analysis to the largest 27 will make the set of hospitals much more homogenous, as the smallest hospital considered will only be 4.2 times smaller than the largest, compared to a factor of 7700 if we included all 81 hospitals.

We will now create a new dataframe called df_hospitals and progressively add statistics specific for each hospital :

In [143]:
#create a dataframe in which the first column is the hospital name, the second the number of appointments
df_hospitals=hosp_total_appts.reset_index().rename(columns={'index':'neighbourhood','neighbourhood':'total_appointments'}) #make hospital name a column, rename columns
df_hospitals.head() #row order decreasing with number of appointments, bc value_counts() was originally used 
Out[143]:
neighbourhood total_appointments
0 JARDIM CAMBURI 7717
1 MARIA ORTIZ 5805
2 RESISTÊNCIA 4430
3 JARDIM DA PENHA 3877
4 ITARARÉ 3514
In [144]:
#create series with proportion of no-show values for each hospital
hosp_ns_rate=df.groupby('neighbourhood').no_show.value_counts(normalize=True).unstack()['Yes']
#convert to dataframe with no_show_rate column for each hospital
hosp_ns_rate=hosp_ns_rate.to_frame().reset_index().rename(columns={'Yes':'no_show_rate'}) 
#use merge to add the no-show rate to df_hospitals
df_hospitals=pd.merge(df_hospitals, hosp_ns_rate,on='neighbourhood')
df_hospitals.head() #check
Out[144]:
neighbourhood total_appointments no_show_rate
0 JARDIM CAMBURI 7717 0.189841
1 MARIA ORTIZ 5805 0.209991
2 RESISTÊNCIA 4430 0.204289
3 JARDIM DA PENHA 3877 0.162755
4 ITARARÉ 3514 0.262664
In [145]:
df_hospitals.head(27).describe() #descriptive stats for the largest third of hospitals
Out[145]:
total_appointments no_show_rate
count 27.000000 27.000000
mean 2890.333333 0.203962
std 1318.138956 0.024761
min 1836.000000 0.158416
25% 2116.000000 0.186503
50% 2509.000000 0.201363
75% 3131.000000 0.219706
max 7717.000000 0.262664

Create figure visualising the size and no-show rates for the largest third of hospitals:

In [146]:
hosp_largest27=df_hospitals.head(27) #restricting data to largest third
median_size_27largest=hosp_largest27.total_appointments.median() #median size in largest third of hospitals
mean_noshow_27largest=hosp_largest27.no_show_rate.mean() #mean no-show rate in largest third of hospitals

plt.figure(figsize=(15,5)) # set figure size
plt.subplot(121)  #create first subplot
hosp_largest27.total_appointments.plot(kind='bar')  #create bar plot with appointmente number of each hospital
plt.title("Total appointments in largest third of hospitals")
plt.xticks(fontsize=7)
plt.xlabel("hospital")
plt.ylabel("total appointments")
plt.axhline(y=median_size_27largest, color='black',linestyle='--')

plt.subplot(122)  #create adjacent second subplot
hosp_largest27.no_show_rate.plot(kind='bar')       #bar plot with corresponding no-show rates in each hospital
plt.title("No-show rates in largest third of hospitals")
plt.axhline(y=mean_noshow_27largest, color='black',linestyle='--')
plt.xlabel("hospital")
plt.ylabel("proportion of no-shows");

The sizes of hospitals considered is relatively homogenous. No-show rates in hospitals considered range from 15.8% to 26.3% - this should be sufficient for further analysis. Since mean and median no-show rates are close (20.39% vs 20.13%), the distribution of no-show rates is close to a normal distribution.

We'll redefine df_hospitals to only include the largest 27 hospitals:

In [147]:
df_hospitals=df_hospitals.head(27)
df_hospitals.shape #check
Out[147]:
(27, 3)

Next we will calculate for each hospital (and add to df_hospitals) some statistics on two factors previously found to be associated with no-show rates:

  • the length of time patients have to wait for their appointment after they schedule it - as the proportion of same-day appointments (column called same_day, and the proportion of appointments for which patients wait over a week up to 6 months (called longer_wait);
  • and whether or not patients receive an SMS reminder, particularly if they wait for a longer time for their appointment - we'll calculate the proportion of appointments receive an SMS reminder after waiting over 1 week (called SMS>7), or over 3 weeks (SMS>21), respectively.
In [148]:
#calculate the proportion of same-day appointments, add them to df_hospitals as a column called 'same_day'
hosp_wait_short=df.groupby(['neighbourhood']).wait_duration.value_counts(normalize=True).unstack()['same_day']
hosp_wait_short=hosp_wait_short.to_frame().reset_index() #convert to dataframe suitable for merging
df_hospitals=pd.merge(df_hospitals, hosp_wait_short, on='neighbourhood') #merge dataframes using hospital names

#calculate the proportion of appointments in the categories in which patients wait up to 3 weeks and up to 6 months,
#then add them to df_hospitals as a column called 'long_wait'
hosp_wait_longer=df.groupby(['neighbourhood']).wait_duration.value_counts(normalize=True).unstack()['3_weeks'] + df.groupby(['neighbourhood']).wait_duration.value_counts(normalize=True).unstack()['6_months']
hosp_wait_longer=hosp_wait_longer.to_frame().reset_index().rename(columns={0:'longer_wait'})
df_hospitals=pd.merge(df_hospitals, hosp_wait_longer, on='neighbourhood')

#calculate the proportion of appointments in which patients receive an SMS reminder when they wait for 
#over 7 days for their appointment, add this to df_hospitals as a column called 'SMS>7'
sms_1week = df[df['wait']>7].groupby(['neighbourhood']).sms_received.value_counts(normalize=True).unstack()[True]
sms_1week=sms_1week.to_frame().reset_index().rename(columns={True:'SMS>7'})
df_hospitals=pd.merge(df_hospitals, sms_1week, on='neighbourhood')

#calculate the proportion of appointments in which patients receive an SMS reminder when they wait for 
#over 21 days for their appointment, add this to df_hospitals as a column called 'SMS>21'
sms_3weeks = df[df['wait']>21].groupby(['neighbourhood']).sms_received.value_counts(normalize=True).unstack()[True]
sms_3weeks=sms_3weeks.to_frame().reset_index().rename(columns={True:'SMS>21'})
df_hospitals=pd.merge(df_hospitals, sms_3weeks, on='neighbourhood')
In [149]:
df_hospitals.head() #check new columns
Out[149]:
neighbourhood total_appointments no_show_rate same_day longer_wait SMS>7 SMS>21
0 JARDIM CAMBURI 7717 0.189841 0.324478 0.384735 0.655776 0.639836
1 MARIA ORTIZ 5805 0.209991 0.357450 0.445823 0.658037 0.668148
2 RESISTÊNCIA 4430 0.204289 0.363883 0.363883 0.604218 0.582439
3 JARDIM DA PENHA 3877 0.162755 0.315192 0.446737 0.641455 0.734426
4 ITARARÉ 3514 0.262664 0.322425 0.431133 0.592739 0.555409
In [150]:
df_hospitals.shape
Out[150]:
(27, 7)

To calculate the Pearson's correlation coefficient between different hospital's no-show rate and other columns, we can use the corr() method:

In [151]:
df_hospitals.corr()['no_show_rate']
Out[151]:
total_appointments   -0.106041
no_show_rate          1.000000
same_day             -0.295072
longer_wait           0.155906
SMS>7                -0.506361
SMS>21               -0.433703
Name: no_show_rate, dtype: float64

Next we'll use Seaborn's regplot() method to visualise the relationship between hospitals' no-show rates and the proportion of same-day and longer wait appointments. regplot() also visualises the 95% confidence interval for the regression line.Correlation coefficients calculated by the Pandas corr() method will be displayed on the chart.

In [152]:
corr_same_day=df_hospitals.corr()['no_show_rate']['same_day'] #get correlation coefficient for same_day column
corr_same_day="correlation:\n r = {}".format(corr_same_day.round(decimals=2)) #format for display
corr_longer_wait=df_hospitals.corr()['no_show_rate']['longer_wait'] #get correlation coefficient for longer wait
corr_longer_wait="correlation:\n r = {}".format(corr_longer_wait.round(decimals=2))

fig, (ax1, ax2) = plt.subplots(ncols=2,figsize=(15,5)) #create figure with two adjacent subplots
sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["same_day"], ax=ax1) #regression plot with same-day appointments
ax1.set_title('Correlation with same-day appointments')
ax1.set_ylabel("no-show rate")
ax1.set_xlabel("prop. same-day appointments")
ax1.text(x=0.85,y=0.9, s=corr_same_day, ha="center", va="center", transform=ax1.transAxes) #display correlation coefficient top right

sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["longer_wait"], ax=ax2) #regression plot with longer waits
ax2.set_title('Correlation with longer wait appointments')
ax2.set_ylabel("no-show rate")
ax2.set_xlabel("prop. longer wait appointments")
ax2.text(x=0.85,y=0.9, s=corr_longer_wait, ha="center", va="center", transform=ax2.transAxes); #display correlation coefficient

There is a moderate (r=-0.29) inverse correlation between the proportion of same-day appointments and no-show rates. This means that hospitals with a low no-show rate tend to have a high proportion of same-day appointments.

There is a weak (r=0.16) correlation between the proportion of longer-wait appointments and no-show rates. This means that hospitals with a high no-show rate tend to have a high proportion of longer-wait appointments, but this is only a weak tendency.

Next we'll use Seaborn's regplot() function to look at the association between SMS reminders sent for appointments with over 7 days, and over 21 days of waiting time :

In [153]:
corr_sms7=df_hospitals.corr()['no_show_rate']['SMS>7'] #get correlation coefficient for SMS>7
corr_sms7="correlation:\n r = {}".format(corr_sms7.round(decimals=2)) #format for display
corr_sms21=df_hospitals.corr()['no_show_rate']['SMS>21'] #get correlation coefficient for SMS>21
corr_sms21="correlation:\n r = {}".format(corr_sms21.round(decimals=2))

fig, (ax1, ax2) = plt.subplots(ncols=2,figsize=(15,5), sharey=True, sharex=True) #two adjacent plots with same axes
sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["SMS>7"], ax=ax1)
ax1.set_title('Correlation with SMS reminders wait >7days ')
ax1.set_ylabel("no-show rate")
ax1.set_xlabel("prop. SMS reminders wait >7days")
ax1.text(x=0.85,y=0.9, s=corr_sms7, ha="center", va="center", transform=ax1.transAxes)

sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["SMS>21"], ax=ax2)
ax2.set_title('Correlation with SMS reminders wait >21days')
ax2.set_ylabel("no-show rate")
ax2.set_xlabel("prop. SMS reminders wait >21days")
ax2.yaxis.set_tick_params(reset=True)
ax2.text(x=0.85,y=0.9, s=corr_sms21, ha="center", va="center", transform=ax2.transAxes);

There is an inverse correlation between the proportion of SMS reminders sent for wait periods over 7 days (and 21 days) and no-show rates. This means that hospitals with a high no-show rate tend to have a lower proportion of SMS reminders. Given the number of factors involved, the correlation coefficients of r=-0.51 and r=-0.43, for appointment wait times above 7 and 21 days, respectively, should be considered relatively strong correlations.

Next we will consider two patient-related characteristics that are associated with no-show rates :

  • welfare, by calculating the proportion of patients in receipt of welfare (named "scholarship")

  • and patient age, by calculating the mean patient age for each hospital (age), the proportion of older patients (old), and the proportion of younger patients (young).

In [154]:
#calculate the proportion of appointments with welfare recipient for each hospital and add to the df_hospitals dataframe
hosp_scholarship=df.groupby('neighbourhood').scholarship.value_counts(normalize=True).unstack()[True]
hosp_scholarship=hosp_scholarship.to_frame().reset_index().rename(columns={True:'scholarship'})
df_hospitals=pd.merge(df_hospitals, hosp_scholarship, on='neighbourhood')

#calculate the mean patient age across appointments for each hospital and add to df_hospitals
hosp_mean_age=df.groupby('neighbourhood').age.mean()
hosp_mean_age=hosp_mean_age.to_frame().reset_index()
df_hospitals=pd.merge(df_hospitals, hosp_mean_age, on='neighbourhood')

#for each hospital, calculate the proportion of appointments with patients in old or middle age group and add to the df_hospitals dataframe
hosp_old = df.groupby('neighbourhood').age_groups.value_counts(normalize=True).unstack()['old age'] + df.groupby('neighbourhood').age_groups.value_counts(normalize=True).unstack()['middle age']
hosp_old = hosp_old.to_frame().reset_index().rename(columns={0:'old'})
df_hospitals=pd.merge(df_hospitals, hosp_old, on='neighbourhood')

#for each hospital, calculate the proportion of appointments with patients in child, adolescent or young adult age group and add to the df_hospitals dataframe
hosp_young = df.groupby('neighbourhood').age_groups.value_counts(normalize=True).unstack()['adolescent'] + df.groupby('neighbourhood').age_groups.value_counts(normalize=True).unstack()['young adult']+ df.groupby('neighbourhood').age_groups.value_counts(normalize=True).unstack()['child'] 
hosp_young = hosp_young.to_frame().reset_index().rename(columns={0:'young'})
df_hospitals=pd.merge(df_hospitals, hosp_young, on='neighbourhood')
In [155]:
df_hospitals.shape
Out[155]:
(27, 11)
In [156]:
df_hospitals.head() #check new columns
Out[156]:
neighbourhood total_appointments no_show_rate same_day longer_wait SMS>7 SMS>21 scholarship age old young
0 JARDIM CAMBURI 7717 0.189841 0.324478 0.384735 0.655776 0.639836 0.020604 43.731502 0.552676 0.187897
1 MARIA ORTIZ 5805 0.209991 0.357450 0.445823 0.658037 0.668148 0.020500 35.275797 0.377089 0.288717
2 RESISTÊNCIA 4430 0.204289 0.363883 0.363883 0.604218 0.582439 0.105643 32.918059 0.309481 0.296840
3 JARDIM DA PENHA 3877 0.162755 0.315192 0.446737 0.641455 0.734426 0.002837 44.986588 0.532886 0.211504
4 ITARARÉ 3514 0.262664 0.322425 0.431133 0.592739 0.555409 0.088503 35.826693 0.373933 0.290268

Regression plot to check if the proportion of welfare recipients correlates with no-show rates:

In [157]:
corr_scholarship=df_hospitals.corr()['no_show_rate']['scholarship'] #get correlation coefficient
corr_scholarship="correlation:\n r = {}".format(corr_scholarship.round(decimals=2))

sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["scholarship"])
plt.title('Correlation with proportion scholarship recipients')
plt.ylabel("no-show rate")
plt.xlabel("prop. scholarship recipients")
plt.text(x=0.85,y=0.9, s=corr_scholarship, ha="center", va="center", transform=ax.transAxes); #show correlation coefficient

There is a moderate correlation (r=0.27) between the proportion of appointment with welfare recipients and no-show rates in hospitals. Next we'll use a regression plot to look for correlation between the proportion of older, and of younger patients, with no-show rates:

In [158]:
corr_old=df_hospitals.corr()['no_show_rate']['old']
corr_old="correlation:\n r = {}".format(corr_old.round(decimals=2))
corr_young=df_hospitals.corr()['no_show_rate']['young']
corr_young="correlation:\n r = {}".format(corr_young.round(decimals=2))

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15,5))
sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["old"], ax=ax1)
ax1.set_title('Correlation with old age group ')
ax1.set_ylabel("no-show rate")
ax1.set_xlabel("prop. patients in old age group")
ax1.text(x=0.85,y=0.9, s=corr_old, ha="center", va="center", transform=ax1.transAxes)

sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["young"], ax=ax2)
ax2.set_title('Correlation with young age group')
ax2.set_ylabel("no-show rate")
ax2.set_xlabel("prop. patients in young age group")
ax2.yaxis.set_tick_params(reset=True)
ax2.text(x=0.85,y=0.9, s=corr_young, ha="center", va="center", transform=ax2.transAxes);

There is a moderate (r=-0.32) inverse correlation between the proportion of appointments with older patients and no-show rates. This means that hospitals with a higher proportion of older patients tend to have a lower no-show rate.

Conversely there is a moderate (r=0.32) correlation between the proportion of appointments with younger patients (infants are excluded from this group) and no-show rates. This means that hospitals with a higher proportion of younger patients tend to have a higher no-show rate.

Finally, we'll use a regression plot to look at mean patient age and no-shows:

In [159]:
corr_age=df_hospitals.corr()['no_show_rate']['age']
corr_age="correlation:\n r = {}".format(corr_age.round(decimals=2))

plt.subplots(figsize=(7,5))
sns.regplot(x=df_hospitals["no_show_rate"], y=df_hospitals["age"])
plt.title('Correlation with mean patient age')
plt.xlabel("no-show rate")
plt.ylabel("mean patient age")
plt.text(x=0.7,y=0.7, s=corr_age, ha="center", va="center", transform=ax.transAxes);

Finally, the correlation co-efficients for all the variables considered:

In [160]:
df_hospitals.corr()['no_show_rate'].sort_values() #sorted by increasing value
Out[160]:
SMS>7                -0.506361
SMS>21               -0.433703
old                  -0.324760
same_day             -0.295072
age                  -0.285774
total_appointments   -0.106041
longer_wait           0.155906
scholarship           0.273214
young                 0.324158
no_show_rate          1.000000
Name: no_show_rate, dtype: float64

In conclusion, variations between hospitals in no-show rates generally correlate with factors found to be associated with no-shows in parts I and II. No-shows tend to be lower in hospitals that have a higher proportion of SMS reminders, of older patients, and of same-day appointments. No-shows tend to be higher in hospitals that have a higher proportion of younger patients and welfare recipients.

Conclusions

In this investigation into medical appointments, I tried to identify factors associated with no-shows. Findings are based on simple inspection, analysis and visualisation of numeric data. Since no inferential statistical methods were used, these findings are tentative.

Interestingly, although data were collected on a variety of patient characteristics, these generally do not appear to be play an important role in predicting no-shows. The overall level of 20% no-shows is reduced by 2-3% in appointments with patients affected by health conditions such as a disability, diabetes, or hypertension. Alcoholism, gender have no discernable effect. There is a small increased of 4% of no-shows patients that are enrolled in the welfare program Bolsa Família.

An attempt to define specific characteristics that are more common in the group of patients that repeatedly miss appointments found that, apart from a small increase of the proportion of welfare recipients in this group (from 9% to 16%) such patients are in fact very similar to the group who never or rarely miss appointments (for the duration of the study). There is a moderate/weak correlation (r=0.27) between hospitals with a higher proportion of patients who are welfare recipients and a higher proportion of no-shows.

However, patients who repeatedly miss appointments only account for 7% of no-shows, in comparison to 65% due to patients who miss a single appointment. Perhaps surprisingly, the likelihood that an appointment with a patient will be missed is very similar with patients who miss a single appointment (51%), a few appointments (54%) and repeatedly (57%). Thus the observed pattern does no follow the Pareto 80/20 rule, since the majority of no-shows are due to a large number of patients who miss a single appointment.

However, a very important caveat here is that data were collected over a time-span of 40 days only, which included 27 days with medical appointments. It is possible that a longer-term study would reveal a discernable minority of patients causing the majority of no-shows. In this case, it would become possible to use some additional resources that could be concentrated on this minority to make the health system more efficient.

The only patient characteristic found to be clearly associated with no-shows is age. The rate of no-shows rises up to the age of 20. Then it declines in a linear manner over several decades, from around 25% missed appointments in patients in the mid-twenties to around 15% in patients in their 70s and 80s. However, the overall impact of the increased proportion of no-shows in young patients is minor, as only a minority of no-shows is due to children (11.1%) adolescents (9.9%) and young adults up to age 25 (11.1%). The effect of patient age is consistent with moderate correlations of no-shows in hospitals with a higher proportion of younger patients (r=0.32), with a lower proportion of older patients (r=-0.32), and with a lower mean age (r=-0.29).

Factors related to how medical appointments are organised appear to play a more important effect on the probability patients miss their appointment. The time a patient has to wait is strongly associated with no-shows: the overall rate of 20% no-shows drops over four-fold for same day appointments, to below 5% missed appointments. No-shows for appointments that occur 1-2 days after scheduling are 23%, then rise with increased waiting time up to 33% for appointment which patients wait for between over 3 weeks up to to 6 months.

The impact can be discerned in the bulk of no-shows: the majority of missed appointments are with patients waiting 2-3 weeks (30%) and over 3 weeks to 6months (28%), even though these categories only make up 17% and 11% of all appointments, respectively. There is also a moderate inverse correlation (r=-0.30) between a hospital's proportion of no-show appointments and its proportion of same-day appointments.

Although there are significant daily variations in no-shows, with a minimum 17% and a maximum 23% over the period studied, these do not appear to be associated with day of the week or public holidays.

Finally, SMS reminders, which are not sent out for appointments with wait of 2 or fewer days, appear to reduce the levels of no-shows by several percentage points. The apparent effect is strongest the longer patients wait for their appointment, for instance in the category in which patients wait over 3 weeks, no-show rates drop to 30% after receiving an SMS compared to 38% when patients don't receive an SMS.

It is important to stress that the associations identified do not reveal causal relationships. For instance the reduced level of no-shows after patients receive an SMS may be caused by the reminders counteracting forgetfulness. On the other hand the ability of receiving SMS messages may simply be a marker of a higher socio-economic status which may be associated with lower no-show rates. Or in the case of the much reduced rate of no-shows with same-day appointments, this could be due to them responding to particularly important/urgent medical needs rather than the fact of them occurring on the same day they are scheduled. In other words, scheduling less important appointments to occur on the same day my not lead to the same reduction in no-show rates.

However the findings hint at some practices that could potentially lower no-shows. If reminders do effectively reduce no-shows, these could be performed using different forms of commucation, such as a latter or phone call. A second organisational change could be to test the impact of changing more routine appointments into same-day appointments, particularly if this is applied to patient groups (younger patients, those who have previously missed an appointment) or appointment types (4 weeks - 6 months) with much higher no-show rates. For instance, rather than giving a patient a specific date for an appointment 3 months in advance, they could be required to make the appointment in a specific week 3 months in the future.