Front | Resume | Data Analysis and Applications | MSI Course Work | Writings | Contact

Part 1 : Pie | Part 2 : Exploration | Part 3 : Factors | Part 4 : Layers | Part 5 : Geoms/Stats | Part 6 : Polishing | Part 7 : Final Thoughts

: EPA TRI Data Analysis : Part 5

Geom_ and Stat_ Objects

Another look at R and ggplot2 this time with a new, much bigger dataset, and a bit more attention paid to the components of a plot.

So EPA Toxic Release data used in the last 4 exercises help a lot of interesting data, but it was missing something important, any information about time. All of the release events were from 2008, but it was not described when during that year, nor was there any indication of of trends over time.

To remedy this situation I have taken it upon myself to consolidate the last 10 epa TRI datasets into one, very large, database. This new database is only the essential information about each event, the facility information, the location, the chemical and its associated data (if available) and the amounts. A lot of the data that makes the TRI data sets so rich have been removed to save space and to account for the EPA's changing reporting structure each year.

To create the new database, the last 10 datasets were downloaded, extracted using their build in self-extract tools (which the EPA should know, really inconveniences Mac OS and Linux users), edited to remove non-standard columns, combined into one massive .csv file and the imported into a sqlite3 database.

For those looking to repeat the steps to create your own a couple of tips: beware of stray punctuation, chemical names hid all sorts of things that will break your database syntax; keep careful track of changing labels for columns between years, especially for totals and finally, bring some snacks, the database import of the csv, which weighed in around 140 MB, took around 3 hours (each time, I had to do it twice - see tip 1) on a Quad-Core Pentium with 4GB RAM. This is probably more of a problem with the GUI interfaces for SQLite than the database itself, and so it may speed up if you use the command line, but caveat importer.

The new database has around 900,000 rows of toxic release data, and will be available for download in the files section of this page.

The R code for this part may seem a little sparse but the objective is to explore the new time dimension in the data while demonstrating the power of the ggplot geom objects to rapidly show many different views on the data.

Section 1 : Preparing the R workspace


Like earlier exercise the initial steps will involve connecting to a SQLite database and drawing out some specific columns, namely those dealing with release types.

Step 1: Load the ggplot and rsqlite libraries.

library(ggplo2)

library(RSQLite)

Step 2: Connect to database.

con <- dbConnect(SQLite(), "C:\\Users\\dnfehren\\Desktop\\epa.sqlite")

Step 3: Send query and fetch result.

query <- dbSendQuery(con, statement = "SELECT tri_id, year, facility_state FROM tri")<

return <- fetch(query, n = -1)

Section 2 : This is a bit of dodge around what might be more elegant functions in R, but the following code is used to split the data into regions based on the US Census regions (with an additional region for Alaska) and provide a identifying column in the database.


A simple map of the census regions is available here.

This will involve some subsetting the main dataframe based on the states within each region, adding a region column, and then combining all of the parts back together.

Eventually this region information might make it in to the database proper, if it does I will update this page to reflect the change.

Step 1: Subset the dataframe based on the regions in each state, add a new column called "region" with the name of the region.

akl <- subset(ret, ret$facility_state == "AK")

akl$region <- "ALASKA"

pil <- subset(ret, ret$facility_state == "HI" | ret$facility_state == "AS" | ret$facility_state =="GU" | ret$facility_state =="MP" | ret$facility_state =="PR" | ret$facility_state =="VI")

pil$region <- "PACIFIC ISLANDS"

pac <- subset(ret, ret$facility_state == "CA" | ret$facility_state == "OR" | ret$facility_state == "WA")

pac$region <- "PACIFIC"

mtn <- subset(ret, ret$facility_state == "MT" | ret$facility_state == "ID" | ret$facility_state == "NV" | ret$facility_state == "WY" | ret$facility_state == "UT" | ret$facility_state == "CO" | ret$facility_state == "AZ" | ret$facility_state == "NM")

mtn$region <- "MOUNTAIN"

wnc <- subset(ret, ret$facility_state == "ND" | ret$facility_state == "MN" | ret$facility_state == "SD" | ret$facility_state == "IA" | ret$facility_state == "NE" | ret$facility_state == "KS" | ret$facility_state == "MO")

wnc$region <- "WEST NORTH CENTRAL"

wsc <- subset(ret, ret$facility_state == "OK" | ret$facility_state == "AR" | ret$facility_state == "TX" | ret$facility_state == "LA")

wsc$region <- "WEST SOUTH CENTRAL"

enc <- subset(ret, ret$facility_state == "WI" | ret$facility_state == "MI" | ret$facility_state == "IL" | ret$facility_state == "IN" | ret$facility_state == "OH")

enc$region <- "EAST NORTH CENTRAL"

esc <- subset(ret, ret$facility_state == "KY" | ret$facility_state == "TN" | ret$facility_state == "MS" | ret$facility_state == "AL")

esc$region <- "EAST SOUTH CENTRAL"

ngl <- subset(ret, ret$facility_state == "ME" | ret$facility_state == "VT" | ret$facility_state == "NH" | ret$facility_state == "MA" | ret$facility_state == "RI" | ret$facility_state == "CT")

ngl$region <- "NEW ENGLAND"

mat <- subset(ret, ret$facility_state == "NY" | ret$facility_state == "NJ" | ret$facility_state == "PA")

mat$region <- "MID ATLANTIC"

sat <- subset(ret, ret$facility_state == "MD" | ret$facility_state == "DE" | ret$facility_state == "DC" | ret$facility_state == "WV" | ret$facility_state == "VA" | ret$facility_state == "NC" | ret$facility_state == "SC" | ret$facility_state == "GA" | ret$facility_state == "FL")

sat$region <- "SOUTH ATLANTIC"

Step 2: Recombine all of the dataframes by row.

rel_regions <- rbind(akl,pil,pac,mtn,wnc,wsc,enc,esc,ngl,mat,sat)

Section 3 : Plotting with geoms.


Now that we have this huge data set, what does it look like?

Step 1: Set the ggplot object to use the combined data frame as its data source.

plot <- ggplot(rel_regions)

Step 2: Plot: Histogram of the overall dataset, where did the most release events occur in the last 10 year.

plot + geom_histogram(aes(region))

Click for larger images

Histogram of the overall dataset

So it looks like for the 10 reporting sessions up to 2008 the East North Central region had the most release events. There are fairly heavily population Great Lakes states and the homes of a great deal of the US heavy industry, so the count is not surprising. The South Atlantic and West South Central appear to be tied for second. The West South Central includes TX and LA two stats that are heavily involved in the petroleum industry, so the elevated amount of releases is expected. The South Atlantic makes less sense with regard to its prominence and will probably require more research.

Step 3: Plot: Density plot of the full dataset, where did events occur in a great density?

plot + geom_density(aes(region, fill = region))

Click for larger images

Density plot of the full dataset

Several interesting things appear in this plot which uses the stat_density attributes of ggplot to create smoothed frequency plots for the data. The first, odd though inconsequential, is that the densities appear to progress in near alphabetical order, not significant but still. The more surprising element of the graph is the very large spike seen for Alaska which although it has a low number of releases comparatively has them in greater density.

Step 4: Plot: This is sort of a 2d histogram, the count of each combination of region and year in the dataset is plotted using color blocks.

plot + geom_bin2d(aes(factor(year),factor(region)))

This plot does not hold as many surprises as the density plot, but does contain a lot of information about the releases between 1999 and 2008. One important thing to notice is that the deep red indicating the highest number of releases across the ten periods is not for Alaska, but for the East North Central. Though the changes are slight, in this version of the plot it does appear that all regions have had slight decreases in the number of releases. Refinements to this plot should use the stat_bin options to modify the binning procedures that produce each year and regions data points.

Click for larger images

count of each combination of region and year in the dataset

Step 5: Plot: 2d histogram, this time using hex's instead of squares.

coming soon

Step 6: Plot: Line plot.

coming soon

Files

R Script this can be loaded in R and used to reproduce this exercise's commands.

Custom SQLite database file of EPA data from 1999 to 2008 (~160MB).