Analyzing the Opioid Epidemic: The Impact of Opioid
Prescription Reformulation on Mortality Rates in Indiana
April 15, 2019
Esmé Middaugh
GEOG 540
Final Project
1 Introduction
Reports about the impact of the prescription opioid epidemic on the US are nearly ubiquitous in today's news. A growing problem since the late 1990s, the epidemic forced the Department of Health and Human Services to declare it a public health emergency (https://www.hhs.gov/opioids/about-the-epidemic/index.html). Understanding and addressing the issue is a complex and daunting task, requiring analysis from multiple perspectives to gain a full picture of the many contributing factors, and many universities have been quick to try to fill this need for multi-faceted analysis. Indiana University's 'Addictions Grand Challenge' (AGC) is one such program (https://news.iu.edu/stories/2018/11/iu/releases/08-addictions-grand-challenge-phase-two.html).
I am currently serving on a project funded by the AGC, "Opioid Addictions and the Labor Market: Hiring and Training During an Epidemic," led by Dr. Kosali Simon (IU School of Public and Environmental Affairs) and Dr. Katy Börner (IU School of Informatics, Computing and Engineering). This project "aims to explain both the relationship between opioid prescriptions and mortality, and between opioid use and labor force participation" (Opioid Addictions and the Labor Market: Hiring and Training During an Epidemic Proposal). As my work on the project up to this point has been mostly focused on the relationship between opioid prescriptions and mortality, I decided to focus my paper and analysis on the same question.
My aims for this project are:
1. Cleaning, tidying, and merging my data. I completed some preliminary cleaning earlier in the semester, but there were still a lot of issues with the data, as well as with my code. I wanted to rewrite my code to make it more easily used by others and to end up with a cleaner dataset.
2. Rudimentary exploratory data analysis for Indiana. This includes line graphs and interactive choropleth maps created using plotly, a data visualization library available for Python (and the foundation of Dash). For this report the graphs will be static, but code to make them interactive will be available in the code.
3. Regression analysis of the relationship between opioid prescriptions and drug mortality.
Completing these aims will hopefully result in some interesting preliminary findings and a clean, easy-to-use dataset that will continue to be relevant for my research.
2 Methods
Project Design
To meet the aims outlined above, I chose to focus my energies on the 'Cleaning, Tidying, and Merging' portion of work. There was a lot to do here, so most of the functions and libraries used fall into this area, with a smaller portion in the 'Exploratory Data Analysis' and 'Statistical Methods (Regression Analysis)' sections. To keep things organized, I split my work into a final paper directory (where this document lives), code (where my original Jupyter notebooks are), raw_data, and clean_data.
Methods and Functions Used
Cleaning, Tidying & Merging
The following functions combine, clean, and calculate some additional columns for the dataset. While the calculated columns (annual changes for prescription and mortality data) aren't used in this paper, they are being used on the AGC project, so I included them here.
Libraries:
- import os, pandas, statistics
Exploring / checking data:
- pd.DataFrame.head()
- pd.DataFrame.info()
Creating new variables / cleaning individual variables:
- pd.DataFrame.apply()
- pd.DataFrame.applymap()
- pd.DataFrame.map()
- pd.Index.get_loc() (via df.columns)
- pd.DataFrame.iloc[]
- pd.Series.str.cat() - created one field based on FIPS codes
- lambda functions
- statistics.mean()
Handling Whole Dataset:
- pd.DataFrame.merge()
- pd.DataFrame.drop()
- pd.DataFrame.dropna() - fixing missing data
- pd.DataFrame.groupby().mean() - calculating summary statistics for graphing
- pd.DataFrame.to_csv()
Exploratory Data Analysis
Line Graphs:
- import matplotlib.pyplot as plt
- plt.plot(), plt.ylim(), plt.xlabel(), plt.ylabel(), plt.title(), plt.show(), plt.clf()
Choropleth Maps:
- import plotly.plotly as py
- import plotly.figure_factory as ff
- ff.create_choropleth()
- import jenkspy
- jenkspy.jenks_breaks() - calculate Jenks natural breaks for use in choropleth maps
- conda install -c plotly plotly-orca - necessary for static image export
- pip install psutil - necessary for static image export
- from IPython.display import Image - read images back into the notebook
Regression Analysis
- from sklearn.linear_model import LinearRegression
- from sklearn.model_selection import cross_val_score
- import numpy as np
- np.reshape()
General
- print()
- pip install nbmerge - used to combine all of my jupyter notebooks
Number of Methods, Functions, and Libraries Used
In addition to the ~30 imported methods and functions, I also created many of my own functions for use both within my code and later on in the AGC2 project. Please see the results section for a more in-depth explanation of how they work.
Data
The principal data used for this project comes from the CDC's WONDER tool (https://wonder.cdc.gov/). It was collected by past research assistants for Dr. Kosali Simon, and spans the years 2006-2016. I'm confident that this is an appropriate dataset, as it is mentioned in the proposal for the grant and comes from a reputable source. The dataset covers the entire United States. For the cleaning and merging of the data I used the entire dataset, which I then narrowed down to Indiana for the exploratory data analysis, regression analysis, and the summary/discussion.
For creating the Indiana dataset I also utilized data from the United States Board on Geographic Names (https://geonames.usgs.gov/domestic/download_data.htm) to get longitude and latitude data from FIPS codes. Ultimately I did not use this in my analysis (I was originally thinking of doing k-means clustering with it), but kept it for possible future use; a sketch of what that might look like follows.
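Purely to illustrate that possible future use, here is a minimal sketch of such a clustering (hypothetical: this analysis was not run for this paper; it assumes scikit-learn and the cleaned Indiana CSV produced in Section 3.1, and the choice of five clusters is arbitrary):

# Hypothetical sketch only -- this analysis was not performed for this paper.
import pandas as pd
from sklearn.cluster import KMeans

indiana_df = pd.read_csv('../clean_data/opioid_data_indiana_2006-2016.csv',
                         dtype={'fips': str, 'fips_state': str})
# Cluster the approximate county centers; n_clusters=5 is an arbitrary starting point.
coords = indiana_df[['latitude', 'longitude']].drop_duplicates()
kmeans = KMeans(n_clusters=5, random_state=0).fit(coords)
print(kmeans.cluster_centers_)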
US
Data columns (total 12 columns):
county 30801 non-null object
fips 30801 non-null object
state_abbrv 30801 non-null object
state 30801 non-null object
fips_state 30801 non-null object
year 30801 non-null int64
population 30801 non-null int64
prescription_rate 30801 non-null float64
age_adjusted_mortality_range 30801 non-null object
avg_mortality_rate 30801 non-null float64
change_mortality_rate 27825 non-null object
change_prescription_rate 27825 non-null object
dtypes: float64(2), int64(2), object(8)
memory usage: 4.3+ MB
Indiana
Data columns (total 14 columns):
county 994 non-null object
fips 994 non-null object
state_abbrv 994 non-null object
state 994 non-null object
fips_state 994 non-null object
year 994 non-null int64
population 994 non-null int64
prescription_rate 994 non-null float64
age_adjusted_mortality_range 994 non-null object
avg_mortality_rate 994 non-null float64
change_mortality_rate 903 non-null object
change_prescription_rate 903 non-null object
latitude 994 non-null float64
longitude 994 non-null float64
dtypes: float64(4), int64(2), object(8)
memory usage: 116.5+ KB
3 Results
3.1 Clean and Merge Data
This code merges the yearly CDC prescription data into one CSV, combining the separate files for 2006 through 2016 and appending a column with the appropriate year. It also incorporates latitude and longitude data for Indiana, which is also saved to a separate CSV.
In [1]: def find_avg_mortality(mort_range):
            """
            Calculating the average mortality rate given the Estimated Age-adjusted Death Rate.

            Parameters:
            mort_range - an estimated range of average mortality rates in a county,
                         given in the form '<2', '#-#', or '30+'
            Returns:
            The mean of mort_range, or 30.0 if the range is '30+'
            """
            if '+' in mort_range:
                return 30.0
            elif '<' in mort_range:
                return float(statistics.mean([0, 2]))
            else:
                separated = mort_range.split('-')
                lo, hi = float(separated[0]), float(separated[1])
                return statistics.mean([lo, hi])
        def clean_mortality_data(df):
            '''Takes the mortality data from the CDC and cleans it (using find_avg_mortality).'''
            df['avg_mortality_rate'] = df['Estimated Age-adjusted Death Rate, 16 Categories (in ranges)'] \
                .apply(find_avg_mortality)
            # For my research we only need the years 2006 to 2016
            clean_df = df[df['Year'] >= 2005]
            return clean_df
        def merge_datasets(clean_prescription_df, clean_mortality_df):
            """
            Combines our prescription and mortality datasets and modifies
            the header names to make them more usable.
            """
            ## Creating Merged Data Frame
            df = pd.merge(clean_prescription_df, clean_mortality_df,
                          left_on=['FIPS County Code', 'Year'],
                          right_on=['FIPS', 'Year'])  # Merging based on county code and year
            df = df.drop(['FIPS County Code', 'County_y'], axis=1)  # Dropping duplicates
            # Renaming
            df.columns = ['county', 'state_abbrv', 'prescription_rate',
                          'year', 'fips', 'state', 'fips_state', 'population',
                          'age_adjusted_mortality_range', 'avg_mortality_rate']
            # Reordering the columns
            df = df[['county', 'fips', 'state_abbrv', 'state', 'fips_state',
                     'year', 'population', 'prescription_rate', 'age_adjusted_mortality_range',
                     'avg_mortality_rate']]
            return df
        def yearly_change(source_df, source_col, change_col, sort_col, year_col):
            """ Calculates the change from the prior year for the given column

            Parameters:
            source_df - the pd.DataFrame you want to modify
            source_col - the column you want to calculate annual change for
            change_col - the column to hold these calculations
            sort_col - the column identifying each county (e.g. 'fips')
            year_col - the column holding the year
            Return:
            df - the dataframe with the updated columns
            """
            # Sorting first so the later row-by-row iteration will work
            df = source_df.sort_values([sort_col, year_col], ascending=[True, True])
            # Getting the positional index of each column for use in iloc later
            source_col_index = df.columns.get_loc(source_col)
            fips_index = df.columns.get_loc(sort_col)
            year_index = df.columns.get_loc(year_col)
            df[change_col] = None
            change_col_index = df.columns.get_loc(change_col)
            # Start at 1 so the first row is never compared against the last row
            for row in range(1, len(df)):
                # Checking that this row and the prior row are the same county
                if df.iloc[row, fips_index] == df.iloc[row - 1, fips_index] \
                        and df.iloc[row - 1, year_index] < df.iloc[row, year_index]:
                    df.iloc[row, change_col_index] = df.iloc[row, source_col_index] \
                        - df.iloc[row - 1, source_col_index]
            return df
        def fix_fips(df, fips_col='fips', fips_size=5):
            """
            This is a function to fix the FIPS column by zero-padding codes.
            The fips_size refers to whether you are trying to
            fix a county or state level FIPS;
            pass the size of the FIPS code you are trying to get.
            """
            df[fips_col] = df[[fips_col]].applymap(lambda x: ('000' + str(x))[-fips_size:])
            return df
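A few illustrative calls may help show what these helpers do (a sketch; the expected values follow from the code above, but these lines are not part of the original notebook):

find_avg_mortality('12-13.9')    # -> 12.95 (midpoint of the range)
find_avg_mortality('<2')         # -> 1.0  (mean of 0 and 2)
find_avg_mortality('30+')        # -> 30.0 (capped)
fix_fips(df)                     # pads county FIPS codes, e.g. 2020 -> '02020'
fix_fips(df, 'fips_state', 2)    # pads state FIPS codes, e.g. 2 -> '02'

As a side note, yearly_change() could also be written with pandas' groupby/diff; a vectorized sketch, assuming at most one row per county-year:

# Hypothetical vectorized equivalent of the yearly_change() loop (not used in the project code).
df = df.sort_values(['fips', 'year'])
df['change_mortality_rate'] = df.groupby('fips')['avg_mortality_rate'].diff()
df['change_prescription_rate'] = df.groupby('fips')['prescription_rate'].diff()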
In [2]: import os
        import pandas as pd
        import statistics
In [3]: # Reading in Prescription and Mortality Files
        # Assumes you are in 'code' and 'raw_data' is another folder in the parent directory
        prescription_fname = "../raw_data/cdc_opioid_prescribing_rate.csv"
        with open(prescription_fname) as f:
            prescription_df = pd.read_csv(f, na_values="")

        mortality_fname = "../raw_data/NCHS_-_Drug_Poisoning_Mortality_by_County__United_States.csv"
        with open(mortality_fname) as f:
            mortality_df = pd.read_csv(f)
In [4]: # Preliminary cleaning of the Prescription Data -- simple
        prescription_df = prescription_df.dropna()
        clean_prescription_df = prescription_df[prescription_df['Year'] >= 2005]
        clean_prescription_df.head()

        # Cleaning the mortality_df
        clean_mortality_df = clean_mortality_data(mortality_df)

        df = merge_datasets(clean_prescription_df, clean_mortality_df)
        df.head()
Out[4]:                      county  fips state_abbrv   state fips_state  year  \
        0             Anchorage, AK  2020          AK  Alaska          2  2006
        1  Fairbanks North Star, AK  2090          AK  Alaska          2  2006
        2                Juneau, AK  2110          AK  Alaska          2  2006
        3       Kenai Peninsula, AK  2122          AK  Alaska          2  2006
        4     Ketchikan Gateway, AK  2130          AK  Alaska          2  2006

          population  prescription_rate age_adjusted_mortality_range  avg_mortality_rate
        0    280,085               71.5                      12-13.9               12.95
        1     90,545               95.3                        8-9.9                8.95
        2     30,808               54.7                        8-9.9                8.95
        3     52,253               89.1                      12-13.9               12.95
        4     13,492              144.4                        8-9.9                8.95
In [5]: # Calculating the changes per year using our yearly_change function
        df = yearly_change(df, 'avg_mortality_rate', 'change_mortality_rate', 'fips', 'year')
        df = yearly_change(df, 'prescription_rate', 'change_prescription_rate', 'fips', 'year')

        ## Fix the population data -- currently a string with a comma in it, need it to be an int
        df['population'] = df['population'].map(lambda x: int(x.replace(',', '')))
        df['population'][0]

        # Using our fix_fips() function to make sure that our fips codes are in the right format
        df = fix_fips(df)
        df = fix_fips(df, 'fips_state', 2)

        # Finally, for this project we will just be looking at Indiana
        indiana_df = df[df['state'] == 'Indiana']
In [6]: #Checking our final data frames to see if it looks right:
print(df.info())
print(indiana_df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30801 entries, 8 to 30800
Data columns (total 12 columns):
county 30801 non-null object
fips 30801 non-null object
state_abbrv 30801 non-null object
state 30801 non-null object
fips_state 30801 non-null object
year 30801 non-null int64
population 30801 non-null int64
prescription_rate 30801 non-null float64
age_adjusted_mortality_range 30801 non-null object
avg_mortality_rate 30801 non-null float64
change_mortality_rate 27825 non-null object
change_prescription_rate 27825 non-null object
dtypes: float64(2), int64(2), object(8)
memory usage: 4.3+ MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 994 entries, 693 to 28680
Data columns (total 12 columns):
county 994 non-null object
fips 994 non-null object
state_abbrv 994 non-null object
state 994 non-null object
fips_state 994 non-null object
year 994 non-null int64
population 994 non-null int64
prescription_rate 994 non-null float64
age_adjusted_mortality_range 994 non-null object
avg_mortality_rate 994 non-null float64
change_mortality_rate 903 non-null object
change_prescription_rate 903 non-null object
dtypes: float64(2), int64(2), object(8)
memory usage: 101.0+ KB
None
Process Indiana FIPS to Latitude and Longitude
In [7]: """
        Reading in our data. This solution was suggested at
        https://stackoverflow.com/questions/42868735/how-do-i-convert-from-census-fips-to-lat-l
        and utilizes data from
        https://geonames.usgs.gov/domestic/download_data.htm
        """
        geo_df = pd.read_csv('../raw_data/IN_FedCodes_20190301.txt', delimiter='|')

        # Selecting just our data and renaming
        geo_df = geo_df[['STATE_ALPHA', 'STATE_NUMERIC', 'COUNTY_NUMERIC',
                         'COUNTY_NAME', 'PRIMARY_LATITUDE', 'PRIMARY_LONGITUDE']]
        geo_df.columns = ['state_abbrv', 'fips_state', 'fips', 'county_short',
                          'latitude', 'longitude']
        geo_df = fix_fips(geo_df, 'fips', 3)
        geo_df = fix_fips(geo_df, 'fips_state', 2)
        geo_df['fips'] = geo_df['fips_state'].str.cat(geo_df['fips'])  # Using the Series concatenation to build the full 5-digit FIPS
        geo_df.head()

        # There are many individual place records within each county, so we need to combine them;
        # the resulting longitude and latitude is the approximate center of each county
        geo_df = geo_df.groupby('fips').mean()
        geo_df.head()

        # Merging the two datasets and replacing indiana_df with the full result
        indiana_df = indiana_df.merge(geo_df, on='fips')
        print(indiana_df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 994 entries, 0 to 993
Data columns (total 14 columns):
county 994 non-null object
fips 994 non-null object
state_abbrv 994 non-null object
state 994 non-null object
fips_state 994 non-null object
year 994 non-null int64
population 994 non-null int64
prescription_rate 994 non-null float64
age_adjusted_mortality_range 994 non-null object
avg_mortality_rate 994 non-null float64
change_mortality_rate 903 non-null object
change_prescription_rate 903 non-null object
latitude 994 non-null float64
longitude 994 non-null float64
dtypes: float64(4), int64(2), object(8)
memory usage: 116.5+ KB
None
In [8]: # Writing to CSV files so we can use the clean data later without having to re-run the cleaning
        df.to_csv('../clean_data/opioid_data_2006-2016.csv')
        indiana_df.to_csv('../clean_data/opioid_data_indiana_2006-2016.csv')
3.2 Exploratory Data Analysis
Now that the data is clean, it is helpful to do some exploratory data analysis.
In [9]: # Import statements
        import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        plt.style.use('ggplot')
In [10]: # Reading in the data and double-checking that everything made it through the CSV in proper form
         fname = "../clean_data/opioid_data_indiana_2006-2016.csv"
         df = pd.read_csv(fname, dtype={'fips': str, 'fips_state': str})
         df.sort_values('fips').head()
Out[10]:    Unnamed: 0     county   fips state_abbrv    state fips_state  year  \
         0           0  Adams, IN  18001          IN  Indiana         18  2006
         1           1  Adams, IN  18001          IN  Indiana         18  2007
         2           2  Adams, IN  18001          IN  Indiana         18  2008
         3           3  Adams, IN  18001          IN  Indiana         18  2009
         4           4  Adams, IN  18001          IN  Indiana         18  2010
population prescription_rate age_adjusted_mortality_range \
0 33887 64.1 4-5.9
1 33962 66.3 6-7.9
2 34214 68.4 6-7.9
3 34351 68.6 6-7.9
4 34455 65.7 8-9.9
avg_mortality_rate change_mortality_rate change_prescription_rate \
0 4.95 NaN NaN
1 6.95 2.0 2.2
2 6.95 0.0 2.1
3 6.95 0.0 0.2
4 8.95 2.0 -2.9
latitude longitude
0 40.746298 -84.956258
1 40.746298 -84.956258
2 40.746298 -84.956258
3 40.746298 -84.956258
4 40.746298 -84.956258
In [11]: # Extracting our variables for easier use in the graphs
         year = df['year']
         prescription = df['prescription_rate']
         mortality = df['avg_mortality_rate']

         # Averaging our data for each year so that we get clean line graphs
         annual_prescription = df.groupby('year')['prescription_rate'].mean().values
         annual_mortality = df.groupby('year')['avg_mortality_rate'].mean()
         tiny_year = df.year.unique()
3.2.1 Line Graphs
In [12]: # Plotting Prescription Data
         plt.clf()
         plt.plot(tiny_year, annual_prescription)
         plt.xlabel('Year')
         plt.ylabel('# Opioid Prescriptions / 100 people')
         plt.title('Opioid Prescriptions in Indiana \n2006-2016')
         plt.show()
It looks like there was a sharp decrease in opioid prescriptions in Indiana after 2012. After talking with Dr. Kosali Simon, I learned that this was due to a reformulation of the drug around this time. Given that 2012 marks the peak of the prescription rates, it may be a good point at which to split the dataset for regression analysis. Let's look at opioid-related deaths over the same time period:
In [13]: # Plotting Opioid-Related Deaths
         plt.clf()
         plt.plot(tiny_year, annual_mortality)
         plt.xlabel('Year')
         plt.ylabel('# Opioid-Related Deaths / 100,000 people')
         plt.ylim(0, max(annual_mortality) + 1)
         plt.title('Opioid Mortality Rates in Indiana \n2006-2016')
         plt.show()
3.2.2 Choropleth Maps
In [14]: # pip install jenkspy
         import jenkspy
         # conda install -c plotly plotly-orca
         # pip install psutil
         from plotly.offline import iplot, init_notebook_mode
         import plotly.io as pio
         import plotly.plotly as py
         import plotly.figure_factory as ff
         from IPython.display import Image
In [15]: ## Need to calculate natural breaks for use in the colorbar --
         # see https://github.com/mthh/jenkspy
         prescriptions_2012 = df[df['year'] == 2012]['prescription_rate'].values.tolist()
         fips_2012 = df[df['year'] == 2012]['fips'].values.tolist()
         bins_pres = jenkspy.jenks_breaks(prescriptions_2012, nb_class=5)
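Since this binning step is repeated for each map below, it could be factored into a small helper; a sketch (hypothetical function, not part of the original notebook):

# Hypothetical helper: natural-breaks bins for any column and year.
def natural_break_bins(df, column, year, nb_class=5):
    values = df[df['year'] == year][column].values.tolist()
    return jenkspy.jenks_breaks(values, nb_class=nb_class)

# e.g. natural_break_bins(df, 'prescription_rate', 2012) reproduces bins_pres above:
# a sorted list of nb_class + 1 boundaries (including the min and max), which
# ff.create_choropleth() accepts directly as binning_endpoints.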
In [16]: """
         The following dependencies must be installed for this code to work:
         pip install plotly
         pip install geopandas==0.3.0
         pip install pyshp==1.2.10
         pip install shapely==1.6.3
         You would also have to have a plotly account and API key.
         This code was taken from the plotly tutorial 'Python USA County Choropleth Maps'
         and modified to fit my dataset.
         View the original tutorial at: https://plot.ly/python/county-choropleth/
         """
         # Have to do this so that we can save static images. In my original
         # notebook these were interactive graphs, but LaTeX can't convert those
         init_notebook_mode(connected=True)
         # 2012 Opioid Prescription Rates in Indiana
         fig = ff.create_choropleth(
             fips=fips_2012, values=prescriptions_2012, scope=['IN', 'IL', 'OH'],
             binning_endpoints=bins_pres,
             county_outline={'color': 'rgb(255,255,255)', 'width': 0.5},
             round_legend_values=True,
             legend_title='# Prescriptions per 100 People',
             title='2012 Opioid Prescription Rates in Indiana'
         )
         # Interactive Graph
         # py.iplot(fig, filename='choropleth_2012_opioid_prescriptions_indiana')

         # Static Graph
         img_bytes = pio.to_image(fig, format='png')
         Image(img_bytes)
Out[16]:
In [17]: # Reformatting data for easier use in the maps showing change over 2012-2016
         df.head()
         df_change = df[df['year'] >= 2012]
         df_change = df_change.groupby(['fips'])\
             .agg({'change_prescription_rate': np.sum, 'change_mortality_rate': np.sum})
         print(df_change.head())

         fips_change = df_change.index.values.tolist()
         change_prescription_rate = df_change.change_prescription_rate.values.tolist()
         change_mortality_rate = df_change.change_mortality_rate.values.tolist()

         # Computing the natural breaks for the two datasets
         bins_pres_change = jenkspy.jenks_breaks(change_prescription_rate, nb_class=5)
         bins_mort_change = jenkspy.jenks_breaks(change_mortality_rate, nb_class=5)
       change_prescription_rate  change_mortality_rate
fips
18001                      -4.1                    4.0
18003                     -13.7                    4.0
18005                     -57.3                    4.0
18007                      -5.3                    8.0
18009                     -45.6                    8.0
In [18]: # 2012 to 2016: Change in Opioid Prescriptions
         fig = ff.create_choropleth(
             fips=fips_change,
             values=change_prescription_rate,
             scope=['IN', 'IL', 'OH'],
             binning_endpoints=bins_pres_change,
             county_outline={'color': 'rgb(255,255,255)', 'width': 0.5},
             round_legend_values=True,
             legend_title='Change in # Prescriptions per 100 People',
             title='2012 to 2016: Change in Opioid Prescriptions'
         )
         # For interactive graph:
         # py.iplot(fig, filename='choropleth_change_opioid_prescriptions_indiana')

         # Static Graph
         img_bytes = pio.to_image(fig, format='png')
         Image(img_bytes)
Out[18]:
In [19]: # 2012 to 2016: Change in Opioid Mortality Rates
         fig = ff.create_choropleth(
             fips=fips_change,
             values=change_mortality_rate, scope=['IN', 'IL', 'OH'],
             binning_endpoints=bins_mort_change,
             county_outline={'color': 'rgb(255,255,255)', 'width': 0.5},
             round_legend_values=True,
             legend_title='Change in Opioid-Related Deaths per 100,000 People',
             title='2012 to 2016: Change in Opioid Mortality Rates'
         )
         # For interactive graph:
         # py.iplot(fig, filename='choropleth_change_opioid_mortality_indiana')

         # Static Graph
         img_bytes = pio.to_image(fig, format='png')
         Image(img_bytes)
Out[19]:
From the above choropleth maps, it appears that many of the areas with the steepest decrease in prescription rates actually have the largest increase in opioid-related mortality rates. The maps aren't the easiest to interpret on their own, however, so regression analysis will be helpful in determining whether there actually is a relationship.
3.3 Regression Analysis
Given the mountain-like shape of the prescription rates peaking in 2012, regression analysis across the entire dataset does not seem like it would reveal much; the rise and fall might hide correlations from before and after the reformulation. As such, I split the dataset and ran two regressions, one from 2006 to 2012 and one from 2012 to 2016.
In [22]: # Import LinearRegression
         from sklearn.linear_model import LinearRegression
         from sklearn.model_selection import cross_val_score
         import numpy as np

         def five_fold_regression_analysis(X, y):
             """
             THIS FUNCTION IS ESSENTIALLY COPIED DIRECTLY FROM THE DATACAMP TUTORIAL
             I put it together into one function for reuse, but the code is not mine.
             """
             # Create a linear regression object: reg
             reg = LinearRegression()
             # Compute 5-fold cross-validation scores: cv_scores
             cv_scores = cross_val_score(reg, X, y, cv=5)
             # Print statement taken from
             # https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
             print('Array of scores of the estimator for each run of the cross validation.')
             # Print the 5-fold cross-validation scores
             print(cv_scores)
             # Print the average 5-fold cross-validation score
             print("Average 5-Fold CV Score: {}\n".format(np.mean(cv_scores)))
In [23]: # Splitting our data into pre/post reformulation (2012 is included in both splits)
         pre_2012 = df[df['year'] <= 2012]
         post_2012 = df[df['year'] >= 2012]

         pre_2012_mortality = pre_2012['avg_mortality_rate'].values.reshape(-1, 1)
         pre_2012_prescription = pre_2012['prescription_rate'].values.reshape(-1, 1)
         post_2012_mortality = post_2012['avg_mortality_rate'].values.reshape(-1, 1)
         post_2012_prescription = post_2012['prescription_rate'].values.reshape(-1, 1)

         # We will treat prescription as our independent variable and
         # opioid-related mortality rates as our dependent variable
         print('Pre 2012:')
         five_fold_regression_analysis(pre_2012_prescription, pre_2012_mortality)
         print('Post 2012: ')
         five_fold_regression_analysis(post_2012_prescription, post_2012_mortality)
Pre 2012:
Array of scores of the estimator for each run of the cross validation.
[-0.01699805 0.1991368 0.25072148 0.47599374 0.05145635]
Average 5-Fold CV Score: 0.19206206217475358
Post 2012:
Array of scores of the estimator for each run of the cross validation.
[-0.06966678 0.06733981 -0.06187388 0.08349046 -0.07546054]
Average 5-Fold CV Score: -0.011234185295052046
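Because R² by itself does not carry the direction of the relationship, a quick supplementary check (a sketch; not part of the original notebook) could compute the signed Pearson correlation on each split:

# Hypothetical follow-up using the pre_2012/post_2012 frames defined above.
print(np.corrcoef(pre_2012['prescription_rate'], pre_2012['avg_mortality_rate'])[0, 1])
print(np.corrcoef(post_2012['prescription_rate'], post_2012['avg_mortality_rate'])[0, 1])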
4 Summary and Discussion
The results of the five-fold regression support what the choropleths suggested. Before 2012 there is a positive relationship between opioid prescription rates and mortality rates (an average CV score of about 0.19). After 2012, however, the average score is slightly negative, meaning the linear model explains essentially none of the variation; the earlier positive relationship appears to have broken down. Why is it that counties with the biggest drop in prescriptions actually saw an increase in opioid-related deaths? One possible explanation is that these counties had the highest reliance on opioid prescriptions to begin with, so when that supply diminished it created a vacuum in which people sought out deadlier alternatives such as heroin or fentanyl. While much more analysis would need to be done to confirm this, the combination of the line graphs, choropleth maps, and five-fold regression analysis makes this a hypothesis worth exploring further.
In the course of my analysis there were lots of little areas and a few large areas where I got stuck. Small syntactic Python mistakes cost a lot of time, but in the end they were useful for helping cement Python's (and particularly pandas') rules. One area where I got confused was the regression analysis section; at first I didn't use the five-fold method and got an odd result which took me a while to figure out. It was a helpful reminder about the importance of making sure you are using the right statistical analysis technique. Another part of the project that I got stuck on for a long time was figuring out the ins and outs of plotly, particularly rendering static images instead of interactive plots. This was tricky because a) plotly is definitely made for interactive graphing and b) I didn't realize I would have to do this until I was nearly done with the assignment and tried to convert to a PDF via LaTeX. Had I realized this sooner I might have chosen a different library for my choropleth maps, but in the end I learned a lot and got the static images of the graphs to show up.
5 References
https://www.hhs.gov/opioids/about-the-epidemic/index.html
https://www.fda.gov/drugs/drugsafety/informationbydrugclass/ucm338566.htm
https://news.iu.edu/stories/2018/11/iu/releases/08-addictions-grand-challenge-phase-two.html
https://wonder.cdc.gov/
https://geonames.usgs.gov/domestic/download_data.htm