Part 1 : Pie | Part 2 : Exploration | Part 3 : Factors | Part 4 : Layers | Part 5 : Geoms/Stats | Part 6 : Polishing | Part 7 : Final Thoughts
The goal of this shorter exercise is make use of the factoring capabilities of R to reduce the complications of large data sets. In this case we will be working to determine if counties with higher percentages of poverty show any difference in the ammount of toxic releases.
The EPA Toxic Release Inventory 2008 dataset will again be used for this exercise but other information will be added to it.
Get the modules and the data you needed loaded in to R. We will be using the sqlite database of the data set, which can be created based on instructions in Part 2, or by downloading it from this page
Step 1: Load the sqlite and the ggplot library.
library(RSQLite)
library(ggplo2)
Step 2: Create the database connection.
con <- dbConnect(SQLite(), "C:\\Users\\dnfehren\\Desktop\\epa.sqlite")
Step 1: Create and send a query to the database asking for state, county and total release data.
query <- dbSendQuery(con, statement = "SELECT ST, County, Carcinogen, Total_Releases from tri");
Step 2: Fetch an R dataframe from the query.
return <- fetch(query, n = -1)
Step 3: Reclass some of the data frame columns from character to more useful classes.
replace_tr <- as.numeric(return$Total_Releases)
return$Total_Releases <- replace_tr
rreplace_car <- as.factor(return$Carcinogen)
return$Carcinogen <- replace_car
Step 5: Rename the columns.
colnames(return) <- c("state", "county", "carcinogen", "total_rel")
Step 6: Now the factoring, first thing to do is create points where the data will be cut into factors and the labels for those factors.
cutpoints_c <- c(-1,.3,200,4000,30000,3.305e+08)
cutlabels_c <- c("small","med-small","med","large","very large")
Step 7: Create a new column with the cut data as factors then save that column into the origional data frame.
chem_cut <- cut(return$total_rel, breaks = cutpoints_c, labels = cutlabels_c)
return$total_rel <- chem_cut
County by county poverty data was found through the US Census 2008 data available here. The data comes as an excel spreadsheet but should be edited to clean up column names and re-saved as a .csv file for easier reading in R.
Step 1: Load csv table into a R data frame.
pov <- read.csv("C:\\Users\\dnfehren\\Desktop\\est08ALL_edit.csv")
Step 2: Create cut points and lables.
cutpoints_p <- c(0,11,16,20,30,55)
cutlabels_p <- c("low","med-low","med","med-high","high")
Step 3: Perform cut and replace the old column in data frame
pov_cut <- cut(pov$Poverty_Percent_All_Ages, breaks = cutpoints_p, labels = cutlabels_p)
pov$Poverty_Percent_All_Ages <- pov_cut
Step 4: These are a lot of columns that aren't necessary for us in the poverty data, we'll a smaller data frame of columns using cbind
pov_sum <- cbind(pov[3],pov[4],pov[8])
Step 5: Rename the columns and uppercase all county names to match the EPA data.
colnames(pov_sum) <- c("state", "county", "pov_percent")
pov_sum$county <- toupper(pov_sum$county)
Step 1: Merge the two data frames, this will automatically use columns with the same names (here state and county) to join the frames together
merged_pov_chem <- merge(return, pov_sum)
Step 2: Create a plot based on the merged data showing how often releases of each size occured in counties with each level of poverty.
ggplot(merged_pov_chem, aes(x=pov_percent, fill=total_rel)) + geom_bar(position="dodge")
Click for larger images
Step 3: Create a plot based on merged data showing the prevalence of carcinogenic chemicals released in each county category.
ggplot(merged_pov_chem, aes(x=pov_percent, fill=carcinogen)) + geom_bar(position="dodge")
Click for larger images
Step 4: Create a historgram showing the count of all release events by county category.
qplot(pov_percent, data=merged_pov_chem, geom="histogram")
Click for larger images
Since the counts of the some of the factor categories are so varied, it made sense to try to compare the percentages of each type of release for each type of county.
The following data was not generated using R, the Excel sheet that was used is in the Files section of this page.
Revised code to accomplish this using R will be forthcoming.
_____ | small | med-small | med | large | very-large |
low | 27.07 | 26.14 | 24.52 | 13.33 | 8.94 |
med-low | 25.74 | 24.73 | 25.67 | 13.83 | 10.03 |
med | 22.61 | 25.03 | 24.91 | 15.04 | 12.41 |
med-high | 24.71 | 23.03 | 21.66 | 15.4 | 15.2 |
high | 22.72 | 19.2 | 18.97 | 16.16 | 22.95 |
Though it does not appear in the visualizations based on counting events, when the percentages of the occurrence of toxic release events are calculated the high prevalence of large toxic releases in the poorest counties becomes more clear.
R Script this can be loaded in R and used to reproduce this exercise's commands.
SQLite database file of EPA data.
Zipped .csv file containing a cleaned version of the county poverty data from the US Census.