Front | Resume | Data Analysis and Applications | MSI Course Work | Writings | Contact

Part 1 : Pie | Part 2 : Exploration | Part 3 : Factors | Part 4 : Layers | Part 5 : Geoms/Stats | Part 6 : Polishing | Part 7 : Final Thoughts

: EPA TRI Data Analysis : Part 3

Using Factors

The goal of this shorter exercise is make use of the factoring capabilities of R to reduce the complications of large data sets. In this case we will be working to determine if counties with higher percentages of poverty show any difference in the ammount of toxic releases.

The EPA Toxic Release Inventory 2008 dataset will again be used for this exercise but other information will be added to it.

Section 1 : Preparing the R workspace

Get the modules and the data you needed loaded in to R. We will be using the sqlite database of the data set, which can be created based on instructions in Part 2, or by downloading it from this page

Step 1: Load the sqlite and the ggplot library.



Step 2: Create the database connection.

con <- dbConnect(SQLite(), "C:\\Users\\dnfehren\\Desktop\\epa.sqlite")

Section 2 : Gathering and Preparing the Toxic Release Data

Step 1: Create and send a query to the database asking for state, county and total release data.

query <- dbSendQuery(con, statement = "SELECT ST, County, Carcinogen, Total_Releases from tri");

Step 2: Fetch an R dataframe from the query.

return <- fetch(query, n = -1)

Step 3: Reclass some of the data frame columns from character to more useful classes.

replace_tr <- as.numeric(return$Total_Releases)

return$Total_Releases <- replace_tr

rreplace_car <- as.factor(return$Carcinogen)

return$Carcinogen <- replace_car

Step 5: Rename the columns.

colnames(return) <- c("state", "county", "carcinogen", "total_rel")

Step 6: Now the factoring, first thing to do is create points where the data will be cut into factors and the labels for those factors.

cutpoints_c <- c(-1,.3,200,4000,30000,3.305e+08)

cutlabels_c <- c("small","med-small","med","large","very large")

Step 7: Create a new column with the cut data as factors then save that column into the origional data frame.

chem_cut <- cut(return$total_rel, breaks = cutpoints_c, labels = cutlabels_c)

return$total_rel <- chem_cut

Section 2 : Gathering and Preparing the County Poverty Data

County by county poverty data was found through the US Census 2008 data available here. The data comes as an excel spreadsheet but should be edited to clean up column names and re-saved as a .csv file for easier reading in R.

Step 1: Load csv table into a R data frame.

pov <- read.csv("C:\\Users\\dnfehren\\Desktop\\est08ALL_edit.csv")

Step 2: Create cut points and lables.

cutpoints_p <- c(0,11,16,20,30,55)

cutlabels_p <- c("low","med-low","med","med-high","high")

Step 3: Perform cut and replace the old column in data frame

pov_cut <- cut(pov$Poverty_Percent_All_Ages, breaks = cutpoints_p, labels = cutlabels_p)

pov$Poverty_Percent_All_Ages <- pov_cut

Step 4: These are a lot of columns that aren't necessary for us in the poverty data, we'll a smaller data frame of columns using cbind

pov_sum <- cbind(pov[3],pov[4],pov[8])

Step 5: Rename the columns and uppercase all county names to match the EPA data.

colnames(pov_sum) <- c("state", "county", "pov_percent")

pov_sum$county <- toupper(pov_sum$county)

Section 3 : Merging the Data and Creating Plots

Step 1: Merge the two data frames, this will automatically use columns with the same names (here state and county) to join the frames together

merged_pov_chem <- merge(return, pov_sum)

Step 2: Create a plot based on the merged data showing how often releases of each size occured in counties with each level of poverty.

ggplot(merged_pov_chem, aes(x=pov_percent, fill=total_rel)) + geom_bar(position="dodge")


Click for larger images

Bar chart of number of release events, categorized by size, in counties grouped by poverty level

Step 3: Create a plot based on merged data showing the prevalence of carcinogenic chemicals released in each county category.

ggplot(merged_pov_chem, aes(x=pov_percent, fill=carcinogen)) + geom_bar(position="dodge")


Click for larger images

Bar chart of number of carcinogenic release events, categorized by county poverty level

Step 4: Create a historgram showing the count of all release events by county category.

qplot(pov_percent, data=merged_pov_chem, geom="histogram")


Click for larger images

Histogram showing the count of release events over county poverty levels

Section 4 : Comparing Percentages

Since the counts of the some of the factor categories are so varied, it made sense to try to compare the percentages of each type of release for each type of county.

The following data was not generated using R, the Excel sheet that was used is in the Files section of this page.

Revised code to accomplish this using R will be forthcoming.

_____ small med-small med large very-large

Section 5 : Conclusions

Though it does not appear in the visualizations based on counting events, when the percentages of the occurrence of toxic release events are calculated the high prevalence of large toxic releases in the poorest counties becomes more clear.


R Script this can be loaded in R and used to reproduce this exercise's commands.

SQLite database file of EPA data.

Zipped .csv file containing a cleaned version of the county poverty data from the US Census.

Excel spreadsheet used to determin release percentages.