This is a project for the Introduction to Data Analysis module in the Udacity Data Analysis Nanodegree. The aim is to "conduct data analysis on a dataset, answering a number of questions and create a file to share the findings. Since inferential statistics or machine learning are not used, those findings will be tentative". The dataset chosen for analysis regards medical appointments:
This dataset collects information from 100k medical in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row. (from the Udacity project description)
How come patients miss scheduled medical appointments?
This is the central question I will try to gain a better understanding of. A preponderance of missed medical appointments make the medical system less efficient, potentially wasting precious resources. Identifying factors associated with missed appointments may hint at measures that could be taken to reduce the level of no-shows.
My approach will be to distinguish between patient characteristics, and factors related to the scheduling of medical appointments. The latter may point towards simple organisational steps that could be implemented to reduce no-shows. By contrast, patient characteristics - such as health or socio-economic status, gender or age - are not liable to intervention by the health system, and would point towards changes in the level of support given to patient groups identified in the analysis.
#import statements for packages to be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#set up matplotlib to work interactively
%matplotlib inline
sns.set() # switch on seaborn defaults
import matplotlib
pd.__version__, np.__version__, matplotlib.__version__, sns.__version__
We'll start by trying to get an insight into the dataset and its structure
#Load the data from the csv file into the dataframe
#The data file was obtained from Kaggle: https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd2e9a_noshowappointments-kagglev2-may-2016/noshowappointments-kagglev2-may-2016.csv
df=pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.shape # attribute telling us the number of rows and columns
df.head() # checking the first few rows, including column names
df.info() # checking data types and identify any missing values NaN
sum(df.duplicated()) # function counting duplicated rows
This initial inspection of data reveals that we have :
- 110527 records with 14 columns
- there are no duplicated rows
- there are no missing data in any of the rows (NaN)
- column names contain both capitalised and lower case letters, as well as some typos. Typos should be corrected, and column names may be converted to all lower case.
Next I will examine the different columns.
#Let's study the first two columns, their type and unique values
df[['PatientId','AppointmentID']].nunique()
There are 110527 unique appointment IDs, the same as the number of rows in the dataframe - consistent with the description of the dataset, each row represents a medical appointment.
There are 62299 unique patient IDs - this implies that some patients have had several appointments.
From the result of df.info(), we can see that, although both appointment and patient IDs are numeric, Pandas converted the former to integer but the latter to floats - but one would expect them to take the same format.
df[['PatientId','AppointmentID']].head()
df['PatientId'].astype(int).head() #display the patient IDs by formating as integers (potentially removing decimals)
df[['PatientId','AppointmentID']].describe()
Appointment IDs take the format of a 7 digit integer starting with 5 as the first digit .
Most patient IDs are distributed over 2 or 3 orders of magnitude, but a few apparent outliers are much smaller. It is unknown how many are integers, as one might expect from an ID number
#using numpy to test if most patient IDs are integers through an elementwise comparison of arrays of patientid with patientid converted to integers
np.array_equal(df['PatientId'], df['PatientId'].astype(int)) #inspired by https://stackoverflow.com/a/49249910
#count the rows in which patient ID is not an integer
sum(df['PatientId'] != df['PatientId'].astype(int))
#using conditional indexing to display the 5 rows in which patient IDs are not integers
df[df['PatientId']!=df['PatientId'].astype(int)]
Almost all patient IDs are integers, with the exception of 5 records in which a data entry error appears to have taken place with a . inserted. These 5 rows could either be dropped, or, given that the error seems obvious, manually corrected.
df['Gender'].nunique() #display unique values for the gender column
df['Gender'].value_counts() #count rows for each gender
df['Gender'].value_counts(normalize=True) #normalised to obtain proportion for each gender
Patient gender is recorded as a string taking the value of either F or M. 65% of rows represent appointments for female patients. A binary gender system is used, implying that the present data are not suitable for gaining insight into the health needs specific to trans or non-binary patients.
df['Age'].nunique() #unique age values
df['Age'].describe() #descriptive stats on age
df['Age'].value_counts().head() #check the most common values
df['Age'].value_counts().tail() #check the least common values - some could be errors
df['Age'].hist(); #visualise the age distribution with a histogram
df[df.Age<0] #checking records for appointments with patients aged below 0
df[df.Age>100] #checking records for appointments with patients aged aboce 100 years
Summary of patient age data:
- Patient age is given as a an integer
- since this doesn't allow for months to be recorded, newborns and babies under 1 year of age are apparently recorded as age 0 - this is the mode of the age distribution
- the median age is 37 years, meaning that in 50% of appointments, the patient is aged less than 37 years. In 25% of appointments, the patient is a minor (below 18) and in 25% over 55 years.
- there are a surprising number of rows with the age of 115, but these are multiple appointments with the same patient
- there is one appointment recorded as age -1. It is not clear what this means but it may represent an unconventional way of recording a prenatal intervention although the expectant mother's age ought to be recordeded since she is the patient. Alternatively, it could be a mistaken way of recording a premature birth. In either case, one may consider removing this record.
No further data cleaning is required for the age column.
Investigating the two date columns:
df[['ScheduledDay','AppointmentDay']].dtypes
df[['ScheduledDay','AppointmentDay']].head()
df[['ScheduledDay','AppointmentDay']].tail()
Date columns:
- According to the description, ‘ScheduledDay’ tells us on what day the patient set up their appointment.
- Currently both dates have the data type of strings - they should be converted to datetime.
- It appears that the time is recorded for the day on which the appointment was set up but not of the appointment date itself. This is potentially problematic - for same-day appointments, if the same granularity is used for ScheduledDay and AppointmentDay and day times used, the appointment appears to take place prior to it being scheduled
- Appointments apparently took place between 29. April and 7 of June 2016
According to the description, ‘Neighborhood’ indicates the location of the hospital at which the apointment took place.
df['Neighbourhood'].nunique() #Number of different hospitals
df['Neighbourhood'].value_counts() #counts of appointments in each hospital
df['Neighbourhood'].value_counts().sort_index().index #sorted to identify if there are any typos
'Neighbourhood' column:
- There are apparently 81 different hospitals, with a very wide spread of appointments (ranging from single figures to several thousand)
- Hospital names are all capitalised strings, including some non-standard letters - this should not pose a problem.
There is no immediately obvious need for data cleaning in this column.
Checking the remaining columns related to patient characteristics:
df[['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap']].nunique() #unique values in each column
df[['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap']].dtypes #data types in each column
df['Scholarship'].value_counts() #counts for each value
df['Hipertension'].value_counts()
df['Diabetes'].value_counts()
df['Alcoholism'].value_counts()
df['Handcap'].value_counts()
Several patient characteristics are recorded as integers. Scholarship, "hipertension", diabetes and alcoholism only take the values 0 and 1, and could therefore be converted into booleans. "Handcap" takes values from 0 to 4, apparently representing levels of increasingly severe disability. The numbers of patients recorded at levels 3 and 4 are very small. It may be useful to add a column with boolean values expressing whether or not a patient is disabled - this new column would make it possible to treat disability in the same way as the previous characteristics.
df[['SMS_received', 'No-show']].nunique()
df[['SMS_received', 'No-show']].dtypes
df['SMS_received'].value_counts()
df['No-show'].value_counts()
df['No-show'].value_counts(normalize=True)
Appointments in which the patient received an SMS are recorded as integers taking values of 0 or 1 - this could be converted into booleans. No-show appointments are recorded as strings of either Yes (if the patient missed the appointment) or No, if the patient showed up. Overall 20.19% of appointments are missed.
Summary of initial inspection:
This is a dataset of 110527 medical appointments, apparently ordered chronologically, which took place between 29. April and 7 of June 2016. Each appointment has a unique ID, and there are 62299 patients with unique IDs. 14 columns record information relating to the appointment - the date it was scheduled, the date the appointment took place, the hospital, whether an SMS was received, and, importantly, whether the patient showed up for the appointment- as well as information relating to the patient - gender, age, enrolment in a welfare program ("scholarship"), hypertension, disability.
The data are quite clean in that there are no duplicated rows nor missing data in any of the columns. Only the following data cleaning will be undertaken:
- correct malformed patient id numbers and convert the whole column into the integer data type
- convert dates into the datetime data type for columns ScheduledDay, AppointmentDay
- column names: typos will be corrected, all names changed to lower case, hyphen replaced by underscore
- convert to boolean the columns 'Scholarship', 'Hipertension, Diabetes', 'Alcoholism', 'SMS_received'
- add a new column with boolean values for disability to make it compatible with other patient characteristics
- drop single record of appointment with patient with a negative age
df.columns # view current column names
#correct typos in column names,replace hyphen and convert them to all lower-case for consistency
df=df.rename(columns={'Hipertension':'Hypertension','Handcap':'Disability_int', 'No-show':'No_show'}) #rename columns, remove hyphen
df=df.rename(columns=str.lower) # convert all column names to lower case
df.head(1) # check changes
Several rows contain malformed patient ids. Given the small number of rows, this can be done by replacing them manually.
#obtain row numbers and patient ids that are malformed patient ids:
(df[df['patientid']!=df['patientid'].astype(int)])['patientid'] # get rows that are not integers
#Rows that need correcting: [3950, 73228, 73303, 100517, 105430]
#Corrected patient ids for these rows: [9377952927, 53761528476, 14172416655, 3921784439, 4374175652]
#Simple loop to replace the malformed patient ids :
for (row_nb,pid) in [(3950,9377952927),(73228,53761528476,),(73303,14172416655,),(100517,3921784439,),(105430,4374175652)]:
df.iloc[row_nb,0]=pid
#test to verify all patient ids are now integer, to confirm it's safe to confirm the dtype from float to integer
np.array_equal(df['patientid'], df['patientid'].astype(int)) #should now be true
#since all patient ids are integers, convert data type to integer
df['patientid']=df['patientid'].astype(int)
df['patientid'].dtype # check change
df.iloc[[3950, 73228, 73303, 100517, 105430]] # check corrected rows
Convert the data type of the columns scholarship, hypertension, diabetes, alcoholism and sms_received to boolean, and add a column with bolean values for disability:
df[['scholarship', 'hypertension', 'diabetes', 'alcoholism','sms_received']]=df[[ 'scholarship', 'hypertension', 'diabetes', 'alcoholism','sms_received']].astype(bool)
df['disability'] = df['disability_int']>0 #add disability column with boolean values
df.head() # check changes
convert scheduledday and appointmentday to a datetime data type:
df['scheduledday']=pd.to_datetime(df['scheduledday'])
df['appointmentday']=pd.to_datetime(df['appointmentday'])
# As pointed out earlier, the different granularity of the two dates is potentially problematic - let's test for
# how many the appointment dates precedes the date on which it was scheduled:
df.query("scheduledday>appointmentday").shape
df.query("scheduledday>appointmentday").head() # only checking a few rows - not 38K records
In 38568 rows, the appointment appears to take place prior to the appointment being scheduled. In fact this anomaly is likely due to same-day appointments, since the time that an appointement was scheduled is recorded, but not the time of the actual appointment. To verify that this is the case, we can repeat the above query but discarding the time information by rounding 'scheduledday' down :
#Repeating the query, disregarding the time appointments were scheduled
#by rounding down scheduledday using dt.floor, from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.floor.html
df.query("scheduledday.dt.floor('1d')>appointmentday")
Using a strict 'greater than' in the query removes all same-day appointments, leaving only 5 appointments with the date anomaly. These are likely data entry errors and it does not seem possible to correct this error. The most likely explanation is that the two dates were mixed up. The best way to handle these records is to remove them from the dataframe.
drop_list = df.query("scheduledday.dt.floor('1d')>appointmentday").index # list of records to be dropped
df = df.drop(drop_list) # drop records with anomalous dates
df.query("scheduledday.dt.floor('1d')>appointmentday") # verify the records have been dropped
#get record with negative age patient
df[df['age']<0]
df = df.drop(99832) #drop record
df[df['age']<0] #check it's gone
Our datatset contains 110 medical appointments that have either been missed or not, collected together with data on patient health status (diabetes, hypertension, alcoholism, disability) as well as other patient characteristics (gender, age, welfare recipient status. The dataset also records information related to the appointment, such as appointment dates and location, a unique appointment ID and the patient ID.
The analysis approach will be to first try to identify associations between no-show appointments and factors of how medical appointments are organised (Part I). More specifically, I will try to answer the following questions: is the level of no-shows associated with the time patients need to wait? Is it associated with receiving an SMS? Is it associated with the day of the week of the appointment?
The second part (Part II) will aim to identify associations between no-shows and patient characteristics. More specifically, I will try to answer the following questions: is the level of no-shows associated with patient health status (diabetes, hypertension, alcoholism) ? Is it associated with gender, age or enrolement in a walfare program? A particular characteristics more common in the patients that most frequently miss appointments?
Finally, Part III regards differences between the 81 different hospitals in the dataset. Hospitals may differ in the rate of no-show appointments, as well as in how appointments are scheduled and in patient composition. Thus the final question is: is there a correlation between a hospital's level of no-show appointments and factors identified in parts 1 and 2?
One might suspect that the longer a patient needs to wait for their appointment, the less likely they are to turn up. This wait will here be defined as the number of days a patient waits after scheduling their appointment.
This wait (in days) can be calculated as the difference between the date of the medical appointment and the date on which it was scheduled.
We'll define a new wait column that is the result of subtracting scheduledday from appointmentday, with day time removed from the former using dt.floor('1d'). Since the result of this operation has the data type of timedelta64 which cannot be used by certain functions (such as hist()), the result is divided by np.timedelta() which transforms the date into a numeric value of days. This is then added as a new column called 'wait'.
df['wait']=((df.appointmentday - df.scheduledday.dt.floor('1d')) / np.timedelta64(1, 'D')) #inspired by https://stackoverflow.com/a/37840872
df['wait'].describe() #descriptive statistics summary of number of days patients wait for an appointment
The median wait is 4 days, but there is a wide spread, with a maximum of 179 days. 75% of patients wait 15 days or less.
To visualise the spread, let's look at histograms for wait times.
df['wait'].plot(title="Histogram of wait duration (days)", kind='hist'); #histogram of wait days
Given the wide spread, let's look more closely at the first 15 days, and at the long tail:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5)) #figure with two adjacent subplots
ax1.hist(df[df['wait']<=15].wait, bins=15); #15 bins between 0 and 15 days - each bar is one day
ax1.set_ylabel('frequency')
ax1.set_xlabel('wait duration (days)')
ax1.set_title('wait duration up to 15 days')
ax1.text(0.8, 0.8,'each bar is one day', ha='center', va='center', transform=ax1.transAxes) #from https://stackoverflow.com/a/8482667
ax2.hist(df[df['wait']>=74].wait, bins=15); #15*7=105, so 15 bins between 74 and 179 days - each bar is one week
ax2.set_title('wait duration 74 days to 6 months')
ax2.set_ylabel('frequency')
ax2.set_xlabel('wait duration (days)')
ax2.text(0.8, 0.8,'each bar is one week', ha='center', va='center', transform=ax2.transAxes);
df['wait'].value_counts().head()
The distribution of days patients have to wait for their appointment has a broad range and is strongly right-skewed. This is indicated by the large difference between mean (10.18 days) and median (4 days). Over a third (38.563) of appointments occur on the same day they are scheduled, then the wait duration drops from several thousand per day in the first week, to well below 50 a week in the range of 3-6 months wait. However the maximum wait durations are not outliers but part of a long tail.
Given this broad and skewed distribution of appointment wait duration, it is best to study no-shows after dividing the appointments into a small number of categories of wait durations that are more similar in size, using the Pandas cut() method:
wait_bins=[0,1,3,8,22,180] #define the edges of the categories
wait_labels=['same_day','2_days','same_week','3_weeks','6_months'] #category names
#create a new column categorising appointments according to wait duration
df['wait_duration']=pd.cut(df['wait'],wait_bins,labels=wait_labels,include_lowest=True,right = False)
df.wait_duration.value_counts() #check number of appointments in each category
Here the appointments have been grouped into 5 different levels of waiting time :
Next we'll calculate the number of no-show appointments by grouping appointments according to the wait time level and normalising the value_counts in the no-show category:
#the unstack function creates seperate columns for missed and kept appointments
df.groupby('wait_duration').no_show.value_counts().unstack()
#normalising the values converts them to the proportions of missed and kept appointments:
df.groupby('wait_duration').no_show.value_counts(normalize=True).unstack()
For the remainder of the notebook, there will frequently be a need to calculate the proportion of missed appointments in different categories. This can be calculated using groupby(), normalised value_counts(), and unstack()['Yes]. To simplify this calculation, I'll use a function called NSP for no-show proportions:
###Added function following reviewer's comments
def NSP(group,dtframe=df,normalize=True,missed_only=True):
"""
Returns the proportion of no-show appointments in different categories defined by
the group parameter.
Optional paramters:
dtframe : dataframe to use, df by default
normalize : can be set to False if number of no-shows is required
missed_only : can be set to False if both no-shows and kept appointments are required ; if
set to False, a dataframe will be returned, otherwise a pandas series by default
"""
if missed_only: #no-shows only, by default
result=dtframe.groupby(group).no_show.value_counts(normalize=normalize).unstack()['Yes']
else: # return 'Yes' and 'No' columns for no_show
result=dtframe.groupby(group).no_show.value_counts(normalize=normalize).unstack()
return result
##df.groupby('wait_duration').no_show.value_counts(normalize=True).unstack()
NSP('wait_duration')
#calculating the overall no-show rate:
df.no_show.value_counts(normalize=True)
Let's calculate the proportion of missed appointments in each wait category and visualise it in a bar chart. I'll combine groupby() and value_counts() with the normalize parameter set to True to obtain proportions of missed ("Yes") appointments for each category. (This way of calculating proportions of missed appointments will be used in future.)
#we'll only consider the proportion of missed appointments by selecting the 'Yes' column after calling unstack():
##noshows_by_wait = df.groupby('wait_duration').no_show.value_counts(normalize=True).unstack()['Yes']
noshows_by_wait=NSP('wait_duration')
#calculate the mean no-show rate across all appointments:
overall_noshows = df.no_show.value_counts(normalize=True)['Yes']
plt.bar(wait_labels,noshows_by_wait) # bar plot for each category
plt.title('Proportion of no-shows and wait duration')
plt.xlabel("wait duration")
plt.ylabel("proportion of no-show appointments")
plt.axhline(y=overall_noshows, color='r', linestyle='--'); #line for mean proportion of missed appointments
plt.text(4.7, 0.21, "mean no-shows" , rotation=0, verticalalignment='center');
Given the overall percentage of no-shows of 20.19%, missed appointments are particularly low for same-day appointments at 4.64%. It then rises continuously from 22.74% for appointments within the first 3 days to 32.98% for appointments between 3 weeks and 6 months.
Let's look how significant this difference in no-show rates by wait times is for the bulk of appointments, by visualising the proportions of all missed appointments, as well as the total appointments as represented by the different groups of wait duration. This is best represented as a pie chart :
#using documentation https://matplotlib.org/gallery/pie_and_polar_charts/pie_features.html
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,6)) #create two adjacent pie charts with same y axis
explode=(0.1,0.1,0,0,0) #emphasise the two first wedges
ax1.pie(df.groupby('wait_duration').no_show.value_counts().unstack()['Yes'], explode=explode, autopct='%.2f') #pie for no-shows, labelled with proportion of all no-shows
ax1.set_title('% of missed appointments')
ax1.legend(wait_labels,
title="wait duration",
loc="lower left",
bbox_to_anchor=(1, 0.6, 0.4, 1))
ax1.axis('equal')
ax2.pie(df.wait_duration.value_counts(), explode=explode, autopct='%.2f') #pie for all appointments, labelled with proportion of all appointments
ax2.set_title('% of all appointments')
ax2.axis('equal');
As an alternative to pie charts, below is a visualisation comparing the impact on total and no-show appointments using a stacked horizontal bar chart:
###Added following reviewers comments:
#create a dataframe with number of no-shows in each wait category
wait_appt_props=NSP('wait_duration',normalize=False).to_frame()
wait_appt_props=wait_appt_props/(wait_appt_props.sum()) #convert to proportions of no-shows
wait_appt_props['all appointments']=df.wait_duration.value_counts(normalize=True) #add total appointments, as proportions
wait_appt_props=wait_appt_props.transpose() #switch rows/columns for display in bar chart
wait_appt_props #check dataframe
#generate horizontal bar chart to compare wait category proportions of total and no-show appointments
ax=wait_appt_props.plot.barh(stacked=True, legend=None,figsize=(16,5))
#generate and format labels for stacked bar segments, adapted from https://stackoverflow.com/a/39280239
labels = []
for j in wait_appt_props.columns: #loop to format label text
for i in wait_appt_props.index:
label = "{0}:\n{1:.1f}%".format(j,wait_appt_props.loc[i][j]*100)
labels.append(label)
patches = ax.patches
for label, rect in zip(labels, patches): #loop to position labels
width = rect.get_width()
if width > 0:
x = rect.get_x()
y = rect.get_y()
height = rect.get_height()
ax.text(x + width/2., y + height/2., label, ha='center', va='center',fontsize=14)
plt.xlim((0,1)) # 0 - 100%
plt.text(0.44,1.3,"All appointments",fontsize=16) #label whole bar in figure
plt.text(0.42,0.32,"Missed appointments",fontsize=16)
ax.set_yticklabels([""]) #don't label bar on y axis
ax.set_title('Comparison of wait time impact on total and missed appointments');
The two categories with the shortest wait - ie. same-day appointments and those with a wait of 1 or 2 days - represent about one fifth of all missed appointments, but they represent a majority of all appointments. The last two categories (3weeks and 6 months) account for a minority of appointments but a majority of missed appointments.
In conclusion, the time a patient has to wait for their appointment is associated with no-show rates. For same-day appointments, no-show rates are particularly low. By contrast, wait times of 2-3 weeks, or 3 weeks to 6 months, have a higher rate of missed appointments - and these two categories together are responsible for a majority of missed appointments, although they make up about a quarter of all appointments.
df.sms_received.value_counts() #display the number of SMS reminders
In 35.482 cases, SMS messages were received by patients, reminding them of their appointment. Such reminders would be expected to lower the proportion of missed appointments. Let's check the effect :
#display number of no-shows, grouped by whether an SMS reminder was sent
##df.groupby('sms_received').no_show.value_counts()
NSP('sms_received', normalize=False, missed_only=False)
#no-shows grouped by SMS, converted to proportions
##df.groupby('sms_received').no_show.value_counts(normalize=True)
NSP('sms_received', missed_only=False)
##ax = df.groupby('sms_received').no_show.value_counts(normalize=True).unstack().plot.barh()
ax = NSP('sms_received', missed_only=False).plot.barh()
ax.set_xlabel('Proportion of no-show appointments')
ax.set_title('SMS reminders and no-show appointments')
plt.axvline( x=0.201933, color='r', linestyle='--') #mean proportion of missed appointments
handles, labels = ax.get_legend_handles_labels() # make legend and bar order consistent - from https://stackoverflow.com/a/34142297
ax.legend(reversed(handles), reversed(labels), loc='upper right') # reverse to keep order consistent
plt.xlim((0,1))
plt.legend(["overall no-show rate","not missed", "no-show"]);
Surprisingly, appointments in which an SMS has been received have a higher percentage of no-shows (27.57%) than appointments in which no reminder was received (16.69%).
Perhaps SMS reminders are not used for all appointments?
df.groupby('wait').sms_received.value_counts().head(19)
It turns out that SMS reminders are not sent for appointments occurring in the first 2 days. Since these are also the appointments with the lowest no-show rate, this could distort the effect of SMS reminders.
Let's look at the different categories of appointment wait times, first how many SMS reminders were received vs not received, then at the proportion of no-shows associated with SMS reminders being received or not :
df.groupby('wait_duration').sms_received.value_counts() # total numbers of appointments with SMS received or not
#proportion of no-shows in categories defined by wait duration and SMS reminder
##df.groupby(['wait_duration', 'sms_received']).no_show.value_counts(normalize=True).unstack()['Yes']
NSP(['wait_duration', 'sms_received'])
#create a dataframe with no-show rates as a proportion
##df_smsreminders=df.groupby(['wait_duration', 'sms_received']).no_show.value_counts(normalize=True).unstack()['Yes']
df_smsreminders=NSP(['wait_duration', 'sms_received'])
#use unstack to get one column for appointments with, and one without, SMS reminder
df_smsreminders=df_smsreminders.unstack()
#make column names informative
df_smsreminders=df_smsreminders.rename(columns={False:'No reminder', True:'SMS reminder'})
df_smsreminders
ax = df_smsreminders.plot.barh() #create plot with horizontal bars
ax.set_xlabel('Proportion of no-show appointments')
ax.set_title('SMS reminders and No-show appointments')
plt.gca().invert_yaxis() #preserve chronological order of wait groups, from https://stackoverflow.com/a/34094001
plt.axvline( x=0.2019, color='r', linestyle='--') #mean proportion of missed appointments
#handles, labels = ax.get_legend_handles_labels() # make legend and bar order consistent - from https://stackoverflow.com/a/34142297
#ax.legend(reversed(handles), reversed(labels), loc='lower right') # reverse to keep bar and label order consistent
plt.xlim((0,0.4));
SMS reminders lower the rate of missed appointments from 26.59% to 23.76% for the same_week category ; 34.67% to 28.78% in the 3 weeks category and 37.7% to 30.08% in the 6 months category. However these no-show rates are all higher than for 2_days and same-day appointments.
In those categories of appointments where they are sent, SMS reminders are associated with significantly lower no-show rates. This effect is stronger the longer patients wait for their appointment.
During the period studied, there were two public holidays in Brazil: International Workers Day on 1. May, which fell on a Sunday in 2016, and the optional holiday of Corpus Christi on 26. May, which was on a Thursday in 2016 (from https://www.timeanddate.com/holidays/brazil/2016)
Let's first look at the number of appointments on each day with appointments:
df.appointmentday.value_counts().sort_index() #sort_index() to have consecutive dates
#descriptive statistics of daily number of appointments
df.appointmentday.value_counts().describe()
df.appointmentday.value_counts().size #count number of days with appointments
Although data were collected between 29. April and 8. June, i.e. over a period of 40 days, appointments only took place on 27 days. One of these days is an outlier, 14. May, a Saturday with only 39 appointments, while the median number of appointments is 4308 on each day with appointments. There are no appointments on Corpus Christi (26. May) or the following day, leading to a period of 4 days (26., 27., 28., 29. May) without appointments. The first day with appointments following the public holidays (2. May, 30. May) have more appointments than the median, but this could be because they are Mondays, which on which numbers of appointments are commonly in the top quartile i.e. above 4440 (9. May, 16. May, 30. May, 6. June).
Next we'll look at the proportion of daily no-show appointments (on days with appointments):
##df.groupby('appointmentday').no_show.value_counts(normalize=True).unstack()['Yes']
NSP('appointmentday')
#descriptive statistics of proportion of daily no-show appointments
##df.groupby('appointmentday').no_show.value_counts(normalize=True).unstack()['Yes'].describe()
NSP('appointmentday').describe()
#Plot how the proportion of missed appointments varies daily during the period studied
#days without appointments are not considered, to obtain a continuous line
fig, ax = plt.subplots(figsize=(8,6))
##ax.plot(df.groupby('appointmentday').no_show.value_counts(normalize=True).unstack()['Yes']) #overall proportion of no-shows
ax.plot(NSP('appointmentday')) #overall proportion of no-shows
##ax.plot(df[df['wait']>0].groupby('appointmentday').no_show.value_counts(normalize=True).unstack()['Yes']) #prop. no-shows excluding same-day appointments
ax.plot(NSP('appointmentday',dtframe=df[df['wait']>0]))
fig.autofmt_xdate() # rotate and align the tick labels, as per docs - https://matplotlib.org/gallery/recipes/common_date_problems.html
plt.axhline(y=0.2019, color='r', linestyle='--') #mean proportion of missed appointments
#plt.axvline(x=2016-05-02)
plt.axvline(pd.Timestamp('2016-05-02'), color='b', linestyle=':') #line for first day after IWD https://stackoverflow.com/a/52666020
plt.text(pd.Timestamp('2016-05-02'),0.171,'Labour Day',rotation=0)
plt.axvline(pd.Timestamp('2016-05-30'),color='b', linestyle=':') # line for first day after Corpus Christi
plt.text(pd.Timestamp('2016-05-30'),0.171,'Corpus Christi',rotation=0)
ax.legend(['overall no-shows','excl. same-day appt.'],loc='upper right')
ax.set_ylabel('Proportion of no-show appointments')
ax.set_title('Daily variations in no-show rates');
The proportion of no-shows varies significantly each day. This is not resulting from different proportions of same-day appointments, as excluding same-day appointments has a similar patter.
In May, there are four peaks - could this be a weekly pattern? To test this, we'll calculate no-show rates for different days of the week.
#convert to day of the week as per https://stackoverflow.com/a/30222759
df['appointment_dotw'] = df['appointmentday'].dt.dayofweek #numeric keeps order https://pandas.pydata.org/pandas-docs/version/0.22/api.html#datetimelike-properties
#calculate no-show rates for each day of the week:
##df.groupby('appointment_dotw').no_show.value_counts(normalize=True).unstack()['Yes']
NSP('appointment_dotw')
##df[df['wait']>0].groupby('appointment_dotw').no_show.value_counts(normalize=True).unstack()['Yes']
NSP('appointment_dotw',dtframe=df[df['wait']>0])
#Create bar chart for no-shows on different days of the week
plt.figure(figsize=(15,5))
plt.subplot(121) #create first subplot
#calculate no-show rate for each day of the week, theen plot on bar chart
##df.groupby('appointment_dotw').no_show.value_counts(normalize=True).unstack()['Yes'].plot(kind='bar')
NSP('appointment_dotw').plot(kind='bar')
plt.ylabel("No-show rate")
plt.xlabel("Day of the week")
plt.title("No-show rates for days of the week");
plt.subplot(122) #create second bar chart
#same, but exclude same-day appointments
##df[df['wait']>0].groupby('appointment_dotw').no_show.value_counts(normalize=True).unstack()['Yes'].plot(kind='bar')
NSP('appointment_dotw',dtframe=df[df['wait']>0]).plot(kind='bar')
plt.ylabel("No-show rate")
plt.xlabel("Day of the week");
plt.title("No-show rates excluding same-day appointments for days of the week");
Over the period studied, the proportion of daily no-shows varies by over 6%, from 17% to 23%. The peaks do not have a weekly periodicity as there isn't a clear association with the day of the week. Although the mean no-show proportion varies by day of the week, the range is only around 2%.
The data contain several columns relating to the health and socio-economic status of the patient: scholarship, hypertension, alcoholism, disability.
Let's look if there's an association between missed appointments are and whether the patient belongs to one of these.
One might for example expect that more appointments are missed by patients with alcoholism. Let's compare the numbers of missed appointments (no_show - Yes) of patients with alcoholism (alcoholism - True) or without (False):
df.groupby('alcoholism').no_show.value_counts() #rows grouped by alcoholism, then the values of no_show are counted
when the normalize parameter is set to True, value_counts displays the proportions of kept ('No') vs missed ('Yes') appointments:
df.groupby('alcoholism').no_show.value_counts(normalize=True)
##df.groupby('alcoholism').no_show.value_counts(normalize=True)
NSP('alcoholism',missed_only=False)
The rate of missed appointments with patients affected by alcoholism is 20.15%, with unaffected patients it is 20.19% - there is almost no difference.
Let's look at all available patient characteristics and plot them to compare their effect. I will calculate the proportion of missed appointments for patients with vs without a particular characteristic.
The proportion of kept appointments is not informative since, together with the proportion of missed appointments, it will always add up to 1. In the following, we will only consider the proportion of missed appointments (no_show is 'Yes').
#using a loop to examine characteristics with boolean type, adding proportions of missed appointments to lists
#below unstack()['Yes'] is used to discard the proportion of kept appointments
characteristics = ['scholarship', 'hypertension','diabetes', 'alcoholism','disability'] #list of characteristics considered
yes = [] #list for proportion of missed appointments with patients that possess a characteristic
no = [] #list for proportion of missed appointments with patients that do not possess a characteristic
for characteristic in characteristics:
##result=df.groupby(characteristic).no_show.value_counts(normalize=True).unstack()['Yes']
result=NSP(characteristic)
print("proportion no-show appointments, patients with {} : {} ; without: {}".format(characteristic,result.loc[True],result.loc[False]))
yes.append(result.loc[True])
no.append(result.loc[False])
Now we'll create a dataframe from the above lists and plot the proportions of missed appointments depending on whether the appointment was for patients with or without the specified characteristic:
noshows_characteristics = pd.DataFrame({"with":yes, "without":no}, index=characteristics)
noshows_characteristics
ax = noshows_characteristics.plot.barh(figsize=(8,5))
ax.set_xlabel('Proportion of no-show appointments')
ax.set_title('Appointments and patient characteristics')
plt.axvline( x=0.2019, color='r', linestyle='--') #mean proportion of missed appointments
handles, labels = ax.get_legend_handles_labels() # make legend and bar order consistent - from https://stackoverflow.com/a/34142297
ax.legend(reversed(handles), reversed(labels), loc='upper right') # reverse to keep order consistent
plt.xlim((0,0.3));
Missed medical appointments are less prevalent when they are with patients with a disability, with diabetes, or with hypertension. Appointments of patients with these health conditions are 2-3% less likely to be missed. The proportion of missed appointments is unaffected by whether patients have alcoholism or not.
The patient characteristic with the largest difference in missed appointments is in relation to the welfare program Bolsa Família : appointments with patients that are enrolled are 4% more likely to be missed.
On the whole, the effect of patient characteristics on the proportion of missed appointments appears to be relatively small and its significance is uncertain. One may also be skeptical about comparing groups of patients with certain health conditions to groups with no such health conditions: patients with diabetes or hypertension are likely to have more frequent appointments, than patients without these health conditions, which in turn may result in a confounding variable impacting the prevalence of no-shows.
For these reasons it may be instructive to study the prevalence of missed appointments depending on patient characteristics in a second way.
For many events, a large part of an observed effect is due to a relatively small minority of causative constituent events - this is known as the Pareto principle or the 80/20 rule. In this section I will try to analyse if there are patient groups that cause such a disproportionate amount of missed appointments.
A second approach to studying the relation between no-shows and patient characteristics is to identify a group of patients that are disproportionately missing appointments, and then looking at the composition of that group in terms of patient characteristics. Are patients with multiple missed appointments more or less likely to have a certain health condition or to be a welfare recipient?
This second approach requires grouping patients by the number of their appointments and missed appointments. However, in the current dataframe, each row represents an appointment, and there is no straightforward way of answering this question.
Therefore, based on unique patient IDs, we will create a dataframe in which each row represents a patient, and determine the number of their missed and total medical appointments. Then we will add the patient characteristics to this new dataframe, called df_patients:
#create new df on basis of unique patient ID and columns with kept (No) and missed (Yes) appointments
df_patients = df.groupby(['patientid']).no_show.value_counts().unstack()
df_patients.head()
#create new df on basis of unique patient ID and columns with kept (No) and missed (Yes) appointments
df_patients = df.groupby(['patientid']).no_show.value_counts().unstack()
#df_patients = NSP('patientid', missed_only=False)
df_patients.head()
#Now we'll tidy up the dataframe: remove NaN, calculate the total number of appointments,
#rename columns and drop unneeded columns
df_patients.fillna(value=0, inplace=True) # missing values are in fact 0
df_patients['total_appt']= (df_patients['No'] + df_patients['Yes']).astype(int) # calculating total number of appointments
df_patients.reset_index(inplace=True) #reset the index
df_patients.rename(columns={'Yes':'missed_appt'}, inplace=True) # renaming columns with missed appointments
df_patients.drop('No', axis=1, inplace=True) #kept appointments not needed
df_patients['missed_appt'] = df_patients['missed_appt'].astype(int) #appointments are integers
df_patients.head()
#To get the patient characteristics, I'll create a copy of the appointments dataframe, dropping unneeded columns
df_patients_char = df.copy().drop(['appointmentid', 'scheduledday','appointmentday', 'sms_received', 'no_show','disability_int','appointment_dotw'], axis=1)
#Next, we're dropping rows with duplicated patientid - this should result in each row representing a unique patient
df_patients_char.drop_duplicates(subset='patientid',inplace=True)
df_patients_char.head()
#the two patient dataframes should have the same number of rows - this will allow them to be merged cleanly:
df_patients.shape, df_patients_char.shape
#merging the two dataframes on the basis of patientid
df_patients = df_patients.merge(df_patients_char, how='inner', on='patientid')
df_patients.shape #check number of unique patient ids
df_patients.head()
The new dataframe has columns specifying the number of total appointments and of missed appointments each unique patient has had.To check the new dataframe, let's start by looking at patients with alcoholism, calculating the mean rates (per patient) of missed appointments and of total appointments :
#Calculate the mean number of missed appointements for patients with and without alcoholism:
df_patients[df_patients['alcoholism']==True].missed_appt.mean(), df_patients[df_patients['alcoholism']==False].missed_appt.mean()
#Calculate the mean of total appointements for patients with and without alcoholism:
df_patients[df_patients['alcoholism']==True].total_appt.mean(), df_patients[df_patients['alcoholism']==False].total_appt.mean()
#Calculate the mean proportion of missed appointements for patients with and without alcoholism
#this should be the same as the overall proportion of missed appointments with
#patients with vs with alcoholism previously determined:
df_patients[df_patients['alcoholism']==True].missed_appt.mean() / df_patients[df_patients['alcoholism']==True].total_appt.mean(), df_patients[df_patients['alcoholism']==False].missed_appt.mean() / df_patients[df_patients['alcoholism']==False].total_appt.mean()
Looking at the patient characteristic of alcoholism, on average each patient with alcoholism has more missed appointments (mean of 0.45 vs 0.36 missed appointments), but on average each patient with alcoholism also has more medical appointments (mean of 2.23 vs 1.76 medical appointments). This means that the likelihood a patient with alcoholism to miss an appointment (20.15%) is virtually indistingushable from a patient without alcoholism (20.19%) - consistent with the previous finding.
Next we will try to group patients together on the basis of how many appointments they miss.
#number of patients grouped by the number of missed appointments:
df_patients.missed_appt.value_counts()
A large majority of patients never missed an appointment within the timeframe of the study, and only small numbers of patients repeatedly missed their medical appointment
#now we'll create 4 groups of patients according to the number of missed
#appointments and label each patient using a new column called reliability:
missed_bins=[0,1,2,4,18]
missed_labels=['never','once','a_few','repeatedly']
df_patients['reliability']=pd.cut(df_patients['missed_appt'],missed_bins,right = False,labels=missed_labels)
df_patients[['missed_appt','total_appt', 'reliability']].head(15) #verifying the new column
#number of patients in each reliability group:
###df_patients.groupby('reliability').sum() -change made following reviewers comments:
df_patients.groupby('reliability')['patientid'].count()
### added following reviewers comments
#number of patients with specific characteristics in each reliability group (summing boolean values):
df_patients.groupby('reliability')[['scholarship','hypertension','diabetes','alcoholism','disability']].sum()
The numbers of patients in the group who repeatedly miss appointments are quite small
#now we will look at the number and proportion of patients with alcoholism in each of the reliability groups
df_patients.groupby('reliability')['alcoholism'].value_counts(normalize=False).unstack()[True]
#same but expressed as proportions:
df_patients.groupby('reliability')['alcoholism'].value_counts(normalize=True).unstack()[True]
#loop to calculate the proportion of patients in reliability groups with a particular characteristics
characteristics = ['scholarship', 'hypertension','diabetes', 'alcoholism','disability']
never, once, a_few, repeatedly = [], [], [], [] #lists for calculated proportions
means=[] #list for proportion in whole population
for characteristic in characteristics:
result=df_patients.groupby('reliability')[characteristic].value_counts(normalize=True).unstack()[True]
char_never=result.loc['never']
char_once=result.loc['once']
char_a_few=result.loc['a_few']
char_repeatedly=result.loc['repeatedly']
print("proportion of patients with {} : never - {:.1%}, once - {:.1%}, a few: {:.1%}, repeatedly: {:.1%}".format(characteristic,char_never,char_once,char_a_few,char_repeatedly))
never.append(char_never)
once.append(char_once)
a_few.append(char_a_few)
repeatedly.append(char_repeatedly)
means.append(df_patients[characteristic].value_counts(normalize=True)[True]) # average proportion of characteristic
We will now consider gender as another patient characteristic that can be added to the other characteristics. First, let's check if there are gender differences re.missing appointments:
df_patients.groupby('gender').missed_appt.mean() #mean no-show appointments for female and male patients
df_patients.groupby('gender').total_appt.mean() #mean number of appointments for female and male patients
#calculate proportion of missed appointments for female and male patients
df_patients.groupby('gender').missed_appt.mean() / df_patients.groupby('gender').total_appt.mean()
Female patients have slightly more appointments than male patients, but the no-show rate is very similar for the two genders
#we will add gender as a patient characteristic - unlike the other patient characteristics it's a
#string (F or M) not a boolean, so the above loop couldn't be used, but is easily adapted
characteristics.append('male')
result=df_patients.groupby('reliability')['gender'].value_counts(normalize=True).unstack()['M']
char_never=result.loc['never']
char_once=result.loc['once']
char_a_few=result.loc['a_few']
char_repeatedly=result.loc['repeatedly']
print("proportion of male patients : never - {:.1%}, once - {:.1%}, a few: {:.1%}, repeatedly: {:.1%}".format(char_never,char_once,char_a_few,char_repeatedly))
never.append(char_never)
once.append(char_once)
a_few.append(char_a_few)
repeatedly.append(char_repeatedly)
means.append(df_patients['gender'].value_counts(normalize=True)['M']) # average proportion of characteristic
Only around one-third of patients in the study are male. Next we'll create a dataframe with the statistics of patient characteristics for visualisation.
#create a dataframe with the proportion of patients with the different
#characteristics studied in each reliability group
reliability_characteristics = pd.DataFrame({"never" : never, "once": once, "a few": a_few,"repeatedly":repeatedly}, index=characteristics)
reliability_characteristics=reliability_characteristics[['never', 'once', 'a few', 'repeatedly']] # control column order
reliability_characteristics
Now we will visualise these proportions in a bar chart. For each characteristic, a horizontal line shows the overall population mean of each characteristic, and the bars represent how common a particular characteristic is in each reliability group.
ax = reliability_characteristics.plot.bar(figsize=(10,5))
ax.set_ylabel('%patients in reliability groups')
ax.set_title('Proportion of patients')
ax.legend(loc='upper left')
vals = ax.get_yticks()
ax.set_yticklabels(['{:,.0%}'.format(x) for x in vals]); #format as percentage
for i in range(len(means)): #add horizontal lines of mean proportion for each characteristic
plt.axhline(y=means[i], xmin=i/6, xmax=(i+1)/6,color='black',linestyle='--')
Looking at the composition of the groups of patients who miss their appointments repeatedly or a few times, there is no over-representation of patients with hypertension, diabetes, alcoholism or in relation to patient gender. There is a higher proportion of disabled patients amongst those who miss their appointments repeatedly, but this is due to only 13 disabled patients in this group, and probably insignificant.
The only characteristic for which there appears to be a trend is for patients who are recipients of the Bolsa Familia scholarship. The proportion of these scholarship recipients rises in each category with decreasing reliability of turining up for medical appointment, and there is a small overrepresentation of these patients in the groups that miss their appointments a few times or repeatedly.
To conclude this part of the investigation, we will look at the proportion of no-shows in each reliability group, and at the proportions of appointments and no-shows in each reliability group.
#calculate the total number of no-shows in different reliability groups
df_patients.groupby('reliability').missed_appt.sum()
#calculate the total number of appointments in different reliability groups
df_patients.groupby('reliability').total_appt.sum()
#calculating the proportion of no-shows for the four reliability groups
missed_proportion = (df_patients.groupby('reliability').missed_appt.sum() / df_patients.groupby('reliability').total_appt.sum())
missed_proportion
#Bar chart of the proportion of no-shows in reliability groups
ax = missed_proportion.plot.bar(rot=45)
ax.set_ylabel('Proportion missed in reliability groups')
ax.set_title('Proportion of missed appointments and reliability')
ax.legend(loc='upper left');
There is little difference between reliabity groups in the proportion of missed appointments. For patients who have missed any appointments during the period over which data has been recorded, the likelihood that they miss an appointment is on average a slightly above 50%.
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,6)) #create two adjacent plots with same y axis
ax1.pie(df_patients.groupby('reliability').total_appt.sum(), autopct='%.1f', explode=[0,0.2,0,0]) #pie chart 'once' wedge detached for emphasis
ax1.set_title('% of total appointments')
ax1.axis('equal')
ax1.legend(df_patients.groupby('reliability').total_appt.sum().index,
title="missing appts",
loc="lower left",
bbox_to_anchor=(1, 0.6, 0, 1))
#explode=[0,0,0.2,0.1]
ax2.pie(df_patients.groupby('reliability').missed_appt.sum(), autopct='%.1f', explode=[0,0.2,0,0]) #second pie chart, 'once' wedge detached for emphasis
ax2.set_title('% of all missed appointments')
ax2.axis('equal');
As an alternative to pie charts, below the same data are showsn as a stacked horizontal bar chart:
reliability_appt_props=df_patients.groupby('reliability').missed_appt.sum().to_frame()
reliability_appt_props=reliability_appt_props/reliability_appt_props.sum()
reliability_appt_props['total']=df_patients.groupby('reliability').total_appt.sum()/df_patients.groupby('reliability').total_appt.sum().sum()
reliability_appt_props=reliability_appt_props.transpose()
reliability_appt_props
#generate horizontal bar chart to compare wait category proportions of total and no-show appointments
ax=reliability_appt_props.plot.barh(stacked=True, legend=None,figsize=(16,5))
#generate and format labels for stacked bar segments, adapted from https://stackoverflow.com/a/39280239
labels = []
for j in reliability_appt_props.columns: #loop to format label text
for i in reliability_appt_props.index:
label = "{0}:\n{1:.1f}%".format(j,reliability_appt_props.loc[i][j]*100)
labels.append(label)
patches = ax.patches
for label, rect in zip(labels, patches): #loop to position labels
width = rect.get_width()
if width > 0:
x = rect.get_x()
y = rect.get_y()
height = rect.get_height()
ax.text(x + width/2., y + height/2., label, ha='center', va='center',fontsize=14)
plt.xlim((0,1)) # 0 - 100%
plt.text(0.44,1.3,"All appointments",fontsize=16) #label whole bar in figure
plt.text(0.42,0.32,"Missed appointments",fontsize=16)
ax.set_yticklabels([""]) #don't label bar on y axis
ax.set_title('Reliability and no shows: comparison of total and missed appointments');
The pie chart shows that patients who repeatedly miss appointments are only responsible for a small minority of the overall no-shows. 61.1% of patients never miss an appointment during the period studied. around 1/4 of all appointments are with patients who have missed one appointment over the period studied, and the same group are responsible for nearly 2/3 of all missed appointments. The two pie charts show that, the pattern of no-shows does not follow the Pareto 80/20 rule, as it is not the case that a large majority of missed appointments are due to a minority of patients who disproportionately miss appointments. An important caveat is that this observation is only true for the period studied, which only covers 40 days.
As previously noted, age is recorded as integers up to a maximum of 115 years. To gain an initial overview of the age distibution of patients and their use of medical resources, we'll represent patient ages as a histogram as well as the sum of total appointments for each age. In a second graph, we'll display the mean number of appointments per patient at each patient age, as well as the mean number of no-show appointments per patient at each age.
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5)) #create figure with two columns
ax1.hist(df_patients.age, bins=24) #histogram of patient ages, ages up to 115 years, 24 bins => 1 bar = 5 years
ax1.plot(df_patients.groupby('age').total_appt.sum()) #sum of appointments for each patient age
ax1.set_title('Histogram of patient age, and total appointments')
ax1.set_ylabel('Frequency in age group / sum of appointments')
ax1.set_xlabel('patient age')
ax1.legend(['Total appointments','Age'])
ax2.plot(df_patients.groupby('age').total_appt.mean()[1:81]) #mean appointment number per patient for each year
ax2.plot(df_patients.groupby('age').missed_appt.mean()[1:81]) #mean no-show appointment per patient for each year
ax2.set_title('Mean appointments and no-shows by patient age')
ax2.set_ylabel('Mean appointments / mean no-shows per patient')
ax2.set_xlabel('patient age')
plt.xlim((0,80))
ax2.legend(['Total appointments - mean for patient age','Missed appointments - mean for patient age']);
As the histogram on the left shows, the distribution of patient age is relatively flat until approx. 65 years, after which it drops off. Patient and appointment numbers above age 80 are relatively low and will not be considered in some figures. The sum of total appointments for each age group approximately follows the age distribution. The mode of the distribution is 0-5 years, which also has a spike in total appointments for age 0 - this likely reflects births and associated medical interventions.
The graph on the left shows that appointment per patient and no-show appointments are relatively flat across patient age, the former varying between 1.5 and 2 appointments per patient, the latter varying between 0.25 and 0.5 per patient. The graph excludes the first data point (due to a single appointment recorded as age -1) and excludes data above age 80 where patient numbers are rather low.
To test if patient age is associated with no-show appointments, we'll calculate the proportion of missed appointments for each patient age - this is essentially the ratio between the two lines in the chart on the right above
#calculate no-show proportion for each age and store in pd series
noshow_rate_age=(df_patients.groupby('age').missed_appt.mean() / df_patients.groupby('age').total_appt.mean())
noshow_rate_age[1:81].plot(kind='line', figsize=(10,5)) #only consider ages up to 80
plt.title('Missed appointment rate by patient age')
plt.ylabel('Proportion no-show appointments')
plt.xlabel('patient age (years)')
plt.xlim((0,80))
plt.axhline(y=0.201933, color='r', linestyle='--') #mean proportion of missed appointments
plt.text(65, 0.21, "mean no-shows" , rotation=0, verticalalignment='center');
The proportion of no-show appointments rises up to the age of 20, then declines in continuously up to the age of 80. The relationship looks close to linear, so let's use Seaborn's regplot() for the two segments :
#convert to dataframe with appropriate columns for regplot()
noshow_rate_age=noshow_rate_age.reset_index().rename(columns={0:'no-shows'})
fig, (ax1, ax2) = plt.subplots(ncols=2,figsize=(15,5),sharey=True) #create figure with two adjacent subplots
sns.regplot(data = noshow_rate_age[1:22], x = 'age', y = 'no-shows', ax=ax1)#regression plot ages 0-20 years
ax1.set_title('No-shows increase with patient age up to age 20')
ax1.set_ylabel("no-show rate")
ax1.set_xlabel("patient age")
sns.regplot(data = noshow_rate_age[26:82], x = 'age', y = 'no-shows', ax=ax2) #regression plot ages 25-80 years
ax2.set_title('No-shows decrease with patient age 25-80 years')
ax2.set_ylabel("no-show rate")
ax2.set_xlabel("patient age");
The proportion of no-show appointments rises in an almost linear fashion up to the age of about 20, then declines in an linearly from the age 25 for the remaining decades.
To further analyse the association of no-show rates and age, we'll divide patients into the following age groups: infant (0-2 years), child (3-10 years), adolescent(11-17), young adult(18-24), adult(25-44), middle age(45-64), old age(65-115).
age_bins=[0,3,11,18,25,45,65,115]
age_labels=['infant','child','adolescent','young adult', 'adult', 'middle age', 'old age']
df_patients['age_groups'] = pd.cut(df_patients['age'],age_bins,right = False,labels=age_labels)
df['age_groups'] = pd.cut(df['age'],age_bins,right = False,labels=age_labels)
The following figure will visualise the average no-show rates for each age group, and how the no-show rates change with age group for same-day appointments, as well as appointments for which patients have to wait a day or more after scheduling the appointment.
#Proportion of no-show appointments in different age groups:
##no_show_agegroups=df.groupby('age_groups').no_show.value_counts(normalize=True).unstack()['Yes']
no_show_agegroups=NSP('age_groups')
#Proportion of no-show appointments in different age groups - for same-day appointments only:
##noshow_sameday=df[df['wait']==0].groupby('age_groups').no_show.value_counts(normalize=True).unstack()['Yes']
noshow_sameday=NSP('age_groups',dtframe=df[df['wait']==0])
#Proportion of no-show appointments in different age groups
# - only appointments for which patients wait for at least one day:
##noshow_notsameday=df[df['wait']>0].groupby('age_groups').no_show.value_counts(normalize=True).unstack()['Yes']
noshow_notsameday=NSP('age_groups',dtframe=df[df['wait']>0])
plt.figure(figsize=(16,5))
plt.subplot(121) #create first subplot
plt.bar(age_labels, no_show_agegroups)
plt.xlabel("Age group")
plt.ylabel("No-show rate")
plt.xticks(rotation=45)
plt.axhline(y=0.201933, color='r', linestyle='--') #mean proportion of missed appointments
plt.text(5, 0.21, "mean no-shows" , rotation=0, verticalalignment='center')
plt.title("No-show rates in different patient age groups")
plt.subplot(122) #create second bar chart
plt.bar(age_labels, noshow_notsameday) #bars for no-shows with appointments in which patients need to wait
plt.bar(age_labels, noshow_sameday) #superimpose bars for no-shows with same-day appointments
plt.xlabel("Age group")
plt.ylabel("No-show rate")
plt.xticks(rotation=45)
plt.legend(['Wait for appointment','Same-day appointment'])
plt.title("No-show rates for same-day and wait appointments");
The no-show rate is particularly elevated for appointments with adolescent patients and young adults. This association of no-shows with adolescent patients and young adults is identifiable for appointments in which patients need to wait for at least one day, and for same-day appointments, the association is most pronounced with adolescenet patients.
However, what is the significance of this association for the total number of no-show appointments? We'll use pie charts to compare the proportions of no-shows and total appointments for each age group.
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,5))
explode=(0,0.1,0.1,0,0,0,0) # emphasise wedges for adolescents and young adults
ax1.pie(df[df['no_show']=='Yes'].groupby('age_groups').age_groups.count(), explode=explode, autopct='%.1f')
ax1.set_title('% of missed appointments')
ax1.legend(age_labels,
title="Age group",
loc="lower left",
bbox_to_anchor=(1, 0.6, 0.4, 1))
ax1.axis('equal')
ax2.pie(df.groupby('age_groups').age_groups.count(), explode=explode, autopct='%.1f')
ax2.set_title('% of all appointments')
ax2.axis('equal');
age_appt_props=df[df['no_show']=='Yes'].groupby('age_groups').age_groups.count().to_frame()
age_appt_props=age_appt_props/age_appt_props.sum() #convert to proportions
age_appt_props['all_appt']=df.groupby('age_groups').age_groups.count()
age_appt_props['all_appt']=age_appt_props['all_appt']/age_appt_props['all_appt'].sum()
age_appt_props=age_appt_props.transpose()
age_appt_props
#generate horizontal bar chart to compare age category proportions of total and no-show appointments
ax=age_appt_props.plot.barh(stacked=True, legend=None,figsize=(16,5))
#generate and format labels for stacked bar segments, adapted from https://stackoverflow.com/a/39280239
labels = []
for j in age_appt_props.columns: #loop to format label text
for i in age_appt_props.index:
label = "{0}:\n{1:.1f}%".format(j,age_appt_props.loc[i][j]*100)
labels.append(label)
patches = ax.patches
for label, rect in zip(labels, patches): #loop to position labels
width = rect.get_width()
if width > 0:
x = rect.get_x()
y = rect.get_y()
height = rect.get_height()
ax.text(x + width/2., y + height/2., label, ha='center', va='center',fontsize=14)
plt.xlim((0,1)) # 0 - 100%
plt.text(0.44,1.3,"All appointments",fontsize=16) #label whole bar in figure
plt.text(0.42,0.32,"Missed appointments",fontsize=16)
ax.set_yticklabels([""]) #don't label bar on y axis
ax.set_title('Comparison of patient age impact on total and on missed appointments');
As the pie charts show, even though a slightly higher proportion of no-shows are due to adolescent and young adult patients, the impact on the bulk of no-show appointments is relatively insiginificant, because only a relatively small proportion of appointments are with patients in these age groups.
Finally, we'll look at whether the previously identified association of SMS reminders lowering no-shows is still visible across dfferent age groups :
#determine proportion of no-shows grouped by whether an SMS was received or not
#only consider appointments for which reminders are sent, i.e. with over 2 days' wait
agegroups_sms_effect=df[df['wait']>2].groupby(['age_groups','sms_received']).no_show.value_counts(normalize=True).unstack()['Yes']
agegroups_sms_effect=agegroups_sms_effect.unstack().rename(columns={True:'SMS received', False:'No SMS'})#.reset_index()#.rename(columns={0:'longer_wait'})
agegroups_sms_effect
#determine proportion of no-shows grouped by whether an SMS was received or not
#only consider appointments for which reminders are sent, i.e. with over 2 days' wait
agegroups_sms_effect=df[df['wait']>2].groupby(['age_groups','sms_received']).no_show.value_counts(normalize=True).unstack()['Yes']
agegroups_sms_effect=agegroups_sms_effect.unstack().rename(columns={True:'SMS received', False:'No SMS'})#.reset_index()#.rename(columns={0:'longer_wait'})
agegroups_sms_effect
ax = agegroups_sms_effect.plot.barh(rot=0)
plt.gca().invert_yaxis() #preserve chronological order of age groups, from https://stackoverflow.com/a/34094001
ax.legend(title=None)
ax.set_title('SMS reminders: effect across age groups')
ax.set_ylabel('')
ax.set_xlabel('No-show proportion');
SMS reminders lower the rate of no-shows across all age groups, though the effect is particularly small for young adults
As stated previously, the data set records 81 different hospitals ('neighbourhood' column) in which appointments take place. The previous two sections have identified factors that appear to be associated with no-show appointments. Thus the final question is to ask if there is a correlation between the above identified factors and variations between hospitals in the rate of no-show appointments.
The rationale is that hospitals may differ in their practices of scheduling medical appointments, as well as in their patient composition, and that such variations may be associated with differences in the mean proportion of no-show appointments in each hospital. Even though the multiplicity of factors involved implies that any correlations are expected to be weak, they may be detectable and, if detected, would increase confidence in the associations identified.
To start with, let's create and visualise a dataframe for the distribution of appointments across hospitals:
hosp_total_appts = df.neighbourhood.value_counts() #counts of total appointments in each hospital
plt.figure(figsize=(17,5))
plt.subplot(121) #create first subplot
hosp_total_appts.plot(kind='bar') #first plot of number of appointments in each hospital as bar diagram
plt.title("Total appointments in 81 hospitals")
plt.xticks(fontsize=7)
plt.xlabel("hospital")
plt.ylabel("total appointments")
plt.subplot(122) #create adjacent second subplot
labels=hosp_total_appts.head(20).index.tolist()+61*[''] # only label largest 20
plt.pie(hosp_total_appts, labels=labels) #pie diagram to visualise as proportion of all appointments
plt.xlabel("20 largest hospitals are labelled")
plt.axis('equal'); #make sure the pie isn't oval
hosp_total_appts.head() #show 5 largest hospitals
hosp_total_appts.tail() #show 5 smallest hospitals
There is a very wide range in the number of appointments per hospital - from 7717 down to single figures. This is problematic for the approach described for identifying correlations as we would be looking at fundamentally different types of instutions but giving each equal weight - the large number of small "hospitals" would likely distort our findings. For that reasons we will only consider the largest hospitals in our analysis :
#calculate the sum and proportion of appointments taking place in the 27 largest hospitals
hosp_total_appts.head(27).sum() , hosp_total_appts.head(27).sum()/hosp_total_appts.sum()
#get the number of appointments fpr the 27. largest hospital, and calculate the ratio between the largest and 27. largest
hosp_total_appts[26], hosp_total_appts[0]/hosp_total_appts[26]
We will look at the top 1/3 of largest hospitals. 78040 appointments took place in the 27 largest hospitals, representing 70.6% of all appointments. Thus, while covering most of the appointments recorded, restricting our analysis to the largest 27 will make the set of hospitals much more homogenous, as the smallest hospital considered will only be 4.2 times smaller than the largest, compared to a factor of 7700 if we included all 81 hospitals.
We will now create a new dataframe called df_hospitals and progressively add statistics specific for each hospital :
#create a dataframe in which the first column is the hospital name, the second the number of appointments
df_hospitals=hosp_total_appts.reset_index().rename(columns={'index':'neighbourhood','neighbourhood':'total_appointments'}) #make hospital name a column, rename columns
df_hospitals.head() #row order decreasing with number of appointments, bc value_counts() was originally used
#create series with proportion of no-show values for each hospital
hosp_ns_rate=df.groupby('neighbourhood').no_show.value_counts(normalize=True).unstack()['Yes']
#convert to dataframe with no_show_rate column for each hospital
hosp_ns_rate=hosp_ns_rate.to_frame().reset_index().rename(columns={'Yes':'no_show_rate'})
#use merge to add the no-show rate to df_hospitals
df_hospitals=pd.merge(df_hospitals, hosp_ns_rate,on='neighbourhood')
df_hospitals.head() #check
df_hospitals.head(27).describe() #descriptive stats for the largest third of hospitals
Create figure visualising the size and no-show rates for the largest third of hospitals:
hosp_largest27=df_hospitals.head(27) #restricting data to largest third
median_size_27largest=hosp_largest27.total_appointments.median() #median size in largest third of hospitals
mean_noshow_27largest=hosp_largest27.no_show_rate.mean() #mean no-show rate in largest third of hospitals
plt.figure(figsize=(15,5)) # set figure size
plt.subplot(121) #create first subplot
hosp_largest27.total_appointments.plot(kind='bar') #create bar plot with appointmente number of each hospital
plt.title("Total appointments in largest third of hospitals")
plt.xticks(fontsize=7)
plt.xlabel("hospital")
plt.ylabel("total appointments")
plt.axhline(y=median_size_27largest, color='black',linestyle='--')
plt.subplot(122) #create adjacent second subplot
hosp_largest27.no_show_rate.plot(kind='bar') #bar plot with corresponding no-show rates in each hospital
plt.title("No-show rates in largest third of hospitals")
plt.axhline(y=mean_noshow_27largest, color='black',linestyle='--')
plt.xlabel("hospital")
plt.ylabel("proportion of no-shows");
The sizes of hospitals considered is relatively homogenous. No-show rates in hospitals considered range from 15.8% to 26.3% - this should be sufficient for further analysis. Since mean and median no-show rates are close (20.39% vs 20.13%), the distribution of no-show rates is close to a normal distribution.
We'll redefine df_hospitals to only include the largest 27 hospitals:
df_hospitals=df_hospitals.head(27)
df_hospitals.shape #check
Next we will calculate for each hospital (and add to df_hospitals) some statistics on two factors previously found to be associated with no-show rates:
#calculate the proportion of same-day appointments, add them to df_hospitals as a column called 'same_day'
hosp_wait_short=df.groupby(['neighbourhood']).wait_duration.value_counts(normalize=True).unstack()['same_day']
hosp_wait_short=hosp_wait_short.to_frame().reset_index() #convert to dataframe suitable for merging
df_hospitals=pd.merge(df_hospitals, hosp_wait_short, on='neighbourhood') #merge dataframes using hospital names
#calculate the proportion of appointments in the categories in which patients wait up to 3 weeks and up to 6 months,
#then add them to df_hospitals as a column called 'long_wait'
hosp_wait_longer=df.groupby(['neighbourhood']).wait_duration.value_counts(normalize=True).unstack()['3_weeks'] + df.groupby(['neighbourhood']).wait_duration.value_counts(normalize=True).unstack()['6_months']
hosp_wait_longer=hosp_wait_longer.to_frame().reset_index().rename(columns={0:'longer_wait'})
df_hospitals=pd.merge(df_hospitals, hosp_wait_longer, on='neighbourhood')
#calculate the proportion of appointments in which patients receive an SMS reminder when they wait for
#over 7 days for their appointment, add this to df_hospitals as a column called 'SMS>7'
sms_1week = df[df['wait']>7].groupby(['neighbourhood']).sms_received.value_counts(normalize=True).unstack()[True]
sms_1week=sms_1week.to_frame().reset_index().rename(columns={True:'SMS>7'})
df_hospitals=pd.merge(df_hospitals, sms_1week, on='neighbourhood')
#calculate the proportion of appointments in which patients receive an SMS reminder when they wait for
#over 21 days for their appointment, add this to df_hospitals as a column called 'SMS>21'
sms_3weeks = df[df['wait']>21].groupby(['neighbourhood']).sms_received.value_counts(normalize=True).unstack()[True]
sms_3weeks=sms_3weeks.to_frame().reset_index().rename(columns={True:'SMS>21'})
df_hospitals=pd.merge(df_hospitals, sms_3weeks, on='neighbourhood')
df_hospitals.head() #check new columns
df_hospitals.shape
To calculate the Pearson's correlation coefficient between different hospital's no-show rate and other columns, we can use the corr() method:
df_hospitals.corr()['no_show_rate']
Next we'll use Seaborn's regplot() method to visualise the relationship between hospitals' no-show rates and the proportion of same-day and longer wait appointments. regplot() also visualises the 95% confidence interval for the regression line.Correlation coefficients calculated by the Pandas corr() method will be displayed on the chart.
corr_same_day=df_hospitals.corr()['no_show_rate']['same_day'] #get correlation coefficient for same_day column
corr_same_day="correlation:\n r = {}".format(corr_same_day.round(decimals=2)) #format for display
corr_longer_wait=df_hospitals.corr()['no_show_rate']['longer_wait'] #get correlation coefficient for longer wait
corr_longer_wait="correlation:\n r = {}".format(corr_longer_wait.round(decimals=2))
fig, (ax1, ax2) = plt.subplots(ncols=2,figsize=(15,5)) #create figure with two adjacent subplots
sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["same_day"], ax=ax1) #regression plot with same-day appointments
ax1.set_title('Correlation with same-day appointments')
ax1.set_ylabel("no-show rate")
ax1.set_xlabel("prop. same-day appointments")
ax1.text(x=0.85,y=0.9, s=corr_same_day, ha="center", va="center", transform=ax1.transAxes) #display correlation coefficient top right
sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["longer_wait"], ax=ax2) #regression plot with longer waits
ax2.set_title('Correlation with longer wait appointments')
ax2.set_ylabel("no-show rate")
ax2.set_xlabel("prop. longer wait appointments")
ax2.text(x=0.85,y=0.9, s=corr_longer_wait, ha="center", va="center", transform=ax2.transAxes); #display correlation coefficient
There is a moderate (r=-0.29) inverse correlation between the proportion of same-day appointments and no-show rates. This means that hospitals with a low no-show rate tend to have a high proportion of same-day appointments.
There is a weak (r=0.16) correlation between the proportion of longer-wait appointments and no-show rates. This means that hospitals with a high no-show rate tend to have a high proportion of longer-wait appointments, but this is only a weak tendency.
Next we'll use Seaborn's regplot() function to look at the association between SMS reminders sent for appointments with over 7 days, and over 21 days of waiting time :
corr_sms7=df_hospitals.corr()['no_show_rate']['SMS>7'] #get correlation coefficient for SMS>7
corr_sms7="correlation:\n r = {}".format(corr_sms7.round(decimals=2)) #format for display
corr_sms21=df_hospitals.corr()['no_show_rate']['SMS>21'] #get correlation coefficient for SMS>21
corr_sms21="correlation:\n r = {}".format(corr_sms21.round(decimals=2))
fig, (ax1, ax2) = plt.subplots(ncols=2,figsize=(15,5), sharey=True, sharex=True) #two adjacent plots with same axes
sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["SMS>7"], ax=ax1)
ax1.set_title('Correlation with SMS reminders wait >7days ')
ax1.set_ylabel("no-show rate")
ax1.set_xlabel("prop. SMS reminders wait >7days")
ax1.text(x=0.85,y=0.9, s=corr_sms7, ha="center", va="center", transform=ax1.transAxes)
sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["SMS>21"], ax=ax2)
ax2.set_title('Correlation with SMS reminders wait >21days')
ax2.set_ylabel("no-show rate")
ax2.set_xlabel("prop. SMS reminders wait >21days")
ax2.yaxis.set_tick_params(reset=True)
ax2.text(x=0.85,y=0.9, s=corr_sms21, ha="center", va="center", transform=ax2.transAxes);
There is an inverse correlation between the proportion of SMS reminders sent for wait periods over 7 days (and 21 days) and no-show rates. This means that hospitals with a high no-show rate tend to have a lower proportion of SMS reminders. Given the number of factors involved, the correlation coefficients of r=-0.51 and r=-0.43, for appointment wait times above 7 and 21 days, respectively, should be considered relatively strong correlations.
Next we will consider two patient-related characteristics that are associated with no-show rates :
welfare, by calculating the proportion of patients in receipt of welfare (named "scholarship")
and patient age, by calculating the mean patient age for each hospital (age), the proportion of older patients (old), and the proportion of younger patients (young).
#calculate the proportion of appointments with welfare recipient for each hospital and add to the df_hospitals dataframe
hosp_scholarship=df.groupby('neighbourhood').scholarship.value_counts(normalize=True).unstack()[True]
hosp_scholarship=hosp_scholarship.to_frame().reset_index().rename(columns={True:'scholarship'})
df_hospitals=pd.merge(df_hospitals, hosp_scholarship, on='neighbourhood')
#calculate the mean patient age across appointments for each hospital and add to df_hospitals
hosp_mean_age=df.groupby('neighbourhood').age.mean()
hosp_mean_age=hosp_mean_age.to_frame().reset_index()
df_hospitals=pd.merge(df_hospitals, hosp_mean_age, on='neighbourhood')
#for each hospital, calculate the proportion of appointments with patients in old or middle age group and add to the df_hospitals dataframe
hosp_old = df.groupby('neighbourhood').age_groups.value_counts(normalize=True).unstack()['old age'] + df.groupby('neighbourhood').age_groups.value_counts(normalize=True).unstack()['middle age']
hosp_old = hosp_old.to_frame().reset_index().rename(columns={0:'old'})
df_hospitals=pd.merge(df_hospitals, hosp_old, on='neighbourhood')
#for each hospital, calculate the proportion of appointments with patients in child, adolescent or young adult age group and add to the df_hospitals dataframe
hosp_young = df.groupby('neighbourhood').age_groups.value_counts(normalize=True).unstack()['adolescent'] + df.groupby('neighbourhood').age_groups.value_counts(normalize=True).unstack()['young adult']+ df.groupby('neighbourhood').age_groups.value_counts(normalize=True).unstack()['child']
hosp_young = hosp_young.to_frame().reset_index().rename(columns={0:'young'})
df_hospitals=pd.merge(df_hospitals, hosp_young, on='neighbourhood')
df_hospitals.shape
df_hospitals.head() #check new columns
Regression plot to check if the proportion of welfare recipients correlates with no-show rates:
corr_scholarship=df_hospitals.corr()['no_show_rate']['scholarship'] #get correlation coefficient
corr_scholarship="correlation:\n r = {}".format(corr_scholarship.round(decimals=2))
sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["scholarship"])
plt.title('Correlation with proportion scholarship recipients')
plt.ylabel("no-show rate")
plt.xlabel("prop. scholarship recipients")
plt.text(x=0.85,y=0.9, s=corr_scholarship, ha="center", va="center", transform=ax.transAxes); #show correlation coefficient
There is a moderate correlation (r=0.27) between the proportion of appointment with welfare recipients and no-show rates in hospitals. Next we'll use a regression plot to look for correlation between the proportion of older, and of younger patients, with no-show rates:
corr_old=df_hospitals.corr()['no_show_rate']['old']
corr_old="correlation:\n r = {}".format(corr_old.round(decimals=2))
corr_young=df_hospitals.corr()['no_show_rate']['young']
corr_young="correlation:\n r = {}".format(corr_young.round(decimals=2))
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15,5))
sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["old"], ax=ax1)
ax1.set_title('Correlation with old age group ')
ax1.set_ylabel("no-show rate")
ax1.set_xlabel("prop. patients in old age group")
ax1.text(x=0.85,y=0.9, s=corr_old, ha="center", va="center", transform=ax1.transAxes)
sns.regplot(y=df_hospitals["no_show_rate"], x=df_hospitals["young"], ax=ax2)
ax2.set_title('Correlation with young age group')
ax2.set_ylabel("no-show rate")
ax2.set_xlabel("prop. patients in young age group")
ax2.yaxis.set_tick_params(reset=True)
ax2.text(x=0.85,y=0.9, s=corr_young, ha="center", va="center", transform=ax2.transAxes);
There is a moderate (r=-0.32) inverse correlation between the proportion of appointments with older patients and no-show rates. This means that hospitals with a higher proportion of older patients tend to have a lower no-show rate.
Conversely there is a moderate (r=0.32) correlation between the proportion of appointments with younger patients (infants are excluded from this group) and no-show rates. This means that hospitals with a higher proportion of younger patients tend to have a higher no-show rate.
Finally, we'll use a regression plot to look at mean patient age and no-shows:
corr_age=df_hospitals.corr()['no_show_rate']['age']
corr_age="correlation:\n r = {}".format(corr_age.round(decimals=2))
plt.subplots(figsize=(7,5))
sns.regplot(x=df_hospitals["no_show_rate"], y=df_hospitals["age"])
plt.title('Correlation with mean patient age')
plt.xlabel("no-show rate")
plt.ylabel("mean patient age")
plt.text(x=0.7,y=0.7, s=corr_age, ha="center", va="center", transform=ax.transAxes);
Finally, the correlation co-efficients for all the variables considered:
df_hospitals.corr()['no_show_rate'].sort_values() #sorted by increasing value
In conclusion, variations between hospitals in no-show rates generally correlate with factors found to be associated with no-shows in parts I and II. No-shows tend to be lower in hospitals that have a higher proportion of SMS reminders, of older patients, and of same-day appointments. No-shows tend to be higher in hospitals that have a higher proportion of younger patients and welfare recipients.
In this investigation into medical appointments, I tried to identify factors associated with no-shows. Findings are based on simple inspection, analysis and visualisation of numeric data. Since no inferential statistical methods were used, these findings are tentative.
Interestingly, although data were collected on a variety of patient characteristics, these generally do not appear to be play an important role in predicting no-shows. The overall level of 20% no-shows is reduced by 2-3% in appointments with patients affected by health conditions such as a disability, diabetes, or hypertension. Alcoholism, gender have no discernable effect. There is a small increased of 4% of no-shows patients that are enrolled in the welfare program Bolsa Família.
An attempt to define specific characteristics that are more common in the group of patients that repeatedly miss appointments found that, apart from a small increase of the proportion of welfare recipients in this group (from 9% to 16%) such patients are in fact very similar to the group who never or rarely miss appointments (for the duration of the study). There is a moderate/weak correlation (r=0.27) between hospitals with a higher proportion of patients who are welfare recipients and a higher proportion of no-shows.
However, patients who repeatedly miss appointments only account for 7% of no-shows, in comparison to 65% due to patients who miss a single appointment. Perhaps surprisingly, the likelihood that an appointment with a patient will be missed is very similar with patients who miss a single appointment (51%), a few appointments (54%) and repeatedly (57%). Thus the observed pattern does no follow the Pareto 80/20 rule, since the majority of no-shows are due to a large number of patients who miss a single appointment.
However, a very important caveat here is that data were collected over a time-span of 40 days only, which included 27 days with medical appointments. It is possible that a longer-term study would reveal a discernable minority of patients causing the majority of no-shows. In this case, it would become possible to use some additional resources that could be concentrated on this minority to make the health system more efficient.
The only patient characteristic found to be clearly associated with no-shows is age. The rate of no-shows rises up to the age of 20. Then it declines in a linear manner over several decades, from around 25% missed appointments in patients in the mid-twenties to around 15% in patients in their 70s and 80s. However, the overall impact of the increased proportion of no-shows in young patients is minor, as only a minority of no-shows is due to children (11.1%) adolescents (9.9%) and young adults up to age 25 (11.1%). The effect of patient age is consistent with moderate correlations of no-shows in hospitals with a higher proportion of younger patients (r=0.32), with a lower proportion of older patients (r=-0.32), and with a lower mean age (r=-0.29).
Factors related to how medical appointments are organised appear to play a more important effect on the probability patients miss their appointment. The time a patient has to wait is strongly associated with no-shows: the overall rate of 20% no-shows drops over four-fold for same day appointments, to below 5% missed appointments. No-shows for appointments that occur 1-2 days after scheduling are 23%, then rise with increased waiting time up to 33% for appointment which patients wait for between over 3 weeks up to to 6 months.
The impact can be discerned in the bulk of no-shows: the majority of missed appointments are with patients waiting 2-3 weeks (30%) and over 3 weeks to 6months (28%), even though these categories only make up 17% and 11% of all appointments, respectively. There is also a moderate inverse correlation (r=-0.30) between a hospital's proportion of no-show appointments and its proportion of same-day appointments.
Although there are significant daily variations in no-shows, with a minimum 17% and a maximum 23% over the period studied, these do not appear to be associated with day of the week or public holidays.
Finally, SMS reminders, which are not sent out for appointments with wait of 2 or fewer days, appear to reduce the levels of no-shows by several percentage points. The apparent effect is strongest the longer patients wait for their appointment, for instance in the category in which patients wait over 3 weeks, no-show rates drop to 30% after receiving an SMS compared to 38% when patients don't receive an SMS.
It is important to stress that the associations identified do not reveal causal relationships. For instance the reduced level of no-shows after patients receive an SMS may be caused by the reminders counteracting forgetfulness. On the other hand the ability of receiving SMS messages may simply be a marker of a higher socio-economic status which may be associated with lower no-show rates. Or in the case of the much reduced rate of no-shows with same-day appointments, this could be due to them responding to particularly important/urgent medical needs rather than the fact of them occurring on the same day they are scheduled. In other words, scheduling less important appointments to occur on the same day my not lead to the same reduction in no-show rates.
However the findings hint at some practices that could potentially lower no-shows. If reminders do effectively reduce no-shows, these could be performed using different forms of commucation, such as a latter or phone call. A second organisational change could be to test the impact of changing more routine appointments into same-day appointments, particularly if this is applied to patient groups (younger patients, those who have previously missed an appointment) or appointment types (4 weeks - 6 months) with much higher no-show rates. For instance, rather than giving a patient a specific date for an appointment 3 months in advance, they could be required to make the appointment in a specific week 3 months in the future.