Front | Resume | Data Analysis and Applications | MSI Course Work | Writings | Contact

Part 1 : Pie | Part 2 : Exploration | Part 3 : Factors | Part 4 : Layers | Part 5 : Geoms/Stats | Part 6 : Polishing | Part 7 : Final Thoughts

: EPA TRI Data Analysis : Part 4


Another look at R and ggplot2 this time to see how companies are releasing toxic materials.

The EPA Toxic Release data set has a lot of data describing how chemicals were releases or during what business process they were released.

Section 1 : Preparing the R workspace

Like earlier exercise the initial steps will involve connecting to a SQLite database and drawing out some specific columns, namely those dealing with release types.

Step 1: Load the ggplot and rsqlite libraries.



Step 2: Connect to database.

con <- dbConnect(SQLite(), "C:\\Users\\dnfehren\\Desktop\\epa.sqlite")

Step 3: Send query and fetch result.

query <- dbSendQuery(con, statement = "SELECT TRI_FACILITY_ID,Fugitive_Air,Stack_Air,Water,Underground_Class_I,Underground_Class_II_V,RCRA_C_Landfills,Other_Landfills,Land_Treatment,Surface_Impoundment,RCRA_C_Surface_Impoundment,Other_Surface_Impoundment,Other_Disposal,POTW_Total_Transfers,M10,M41,M62,M71,M81,M82,M72,M63,M66,M67,M64,M65,M73,M79,M90,M94,M99,M20,M24,M26,M28,M93,M56,M92,M40,M50,M54,M61,M69,M95 FROM tri");<

return <- fetch(query, n = -1)

Section 2 : Reform the data, splitting on and off site release types.

This will involve some subsetting data frames as well as the melt function to create new data structures that can be plotted

Step 1: Grab the on site types of release from the query return and then rename the columns with easier labels.

on_site_wide <- return[c("TRI_Facility_ID","Fugitive_Air","Stack_Air","Water","Underground_Class_I","Underground_Class_II_V","RCRA_C_Landfills","Other_Landfills","Land_Treatment","Surface_Impoundment","RCRA_C_Surface_Impoundment","Other_Surface_Impoundment","Other_Disposal")]

names(on_site_wide) <- c("Facility ID","Air Leak","Air Release", "Water", "Strict Well", "Other Well", "Strict Landfill", "Other Landfill", "Soil", "Surface Holding", "Strict Surface Holding","Other Surface","Other")

Step 2: Grab the off site types of release from the query return and then rename the columns with easier labels.

off_site_wide <- return[c("TRI_Facility_ID","POTW_Total_Transfers","M10","M41","M62","M71","M72","M63","M64","M65","M73","M79","M90","M94","M99")]

names(off_site_wide) <- c("Facility ID","OS POTW","OS Storage","OS Solidification","OS WasteWater Treatment", "OS Well", "OS Old Well Data", "OS Surface", "OS Landfill", "OS Strict Landfill", "OS Soil", "OS Other Surface","OS Other", "Waste Broker", "Unknown")

Step 3: Use the melt() command to reshape the data frames from their "wide" forms to "long" forms where each column is given a single row and variable.

on_site_full <-, id=1)

off_site_full <-, id=1)

Step 4: The data set has a lot of zeros, here the subset() function removes them

on_site <- subset(on_site_full, on_site_full$value != 0)

off_site <- subset(off_site_full, off_site_full$value != 0)

Step 5: Use qplot to create some simple histograms to see the relative counts of release events.

qplot(variable, data=on_site, geom="histogram")

qplot(variable, data=off_site, geom="histogram")


Click for larger images

Histogram of on site release types

Histogram of off site release types


So it appears that most of the on site releases happen into the air, either by leaks and accidents or specific venting events. Off site releases look to most commonly involve shipping materials to publicly owned treatment works or to landfills. These make sense, most businesses do not have waste treatment facilities on site and would need outside organizations to deal with it.

Section 3 : Create new a new dataframe to more directly compare on and off site release events.

The new frames are based on the counts for each variable in the melted dataframes created above and used to create more specific plots than the histograms.

Step 1: Tabulate on and off site variables and then combine the tables into a new data frame.

on_df <-$variable))

off_df <-$variable))

combo_df <- rbind(on_df,off_df)

names(combo_df) <- c("type","count")

Step 2: Pull out the similar release event types from the combined table and add a new column with labels for site type.

similar_methods <- combo_df[c(6,21,7,20,8,22,11,23),]

similar_methods["site"] <- c("on","off","on","off","on","off","on","off")

Section 4 : Create a plot from the similar release type dataframes. This will be done through the creation of multiple layers of the plot rather than the one step qplots shown above.

A bar plot should make it easier to compare the types of releases, and perhaps show if releases on or off site are more frequent.

Step 1: Create the initial ggplot object and define the data that the plot will be built from.

p <- ggplot(similar_methods)

empty plot window

Step 2: There is nothing in that plot, we need to apply the aesthetic layer, which defines the x-axis of the plot as using the factorization of the type column from the dataframe.

p <- p + aes(factor(type))

empty plot window

Step 3: That plot doesn't look right, it is only showing the presence of each variable, we don't need that as we have count data in our frame already. To use the count column the weight parameter of the plot must be set to use count column from the dataframe.

p <- p + aes(factor(type), weight = count)

empty plot window

Step 4: Looks closer but there is no differentiation between on and off site. This can be done by applying a further aesthetic map to the data, using the site column data in the fill parameter to color in the values based on the site type.

p <- p + aes(factor(type), fill = factor(site))

empty plot window

Step 5: That's much better but the stacking makes it hard to read. As a final step we will set set the bar plots position parameter to "dodge" which will place columns with the same factor side by side.

p + geom_bar(position="dodge")


Click for larger images

Bar chart release types that are similar across on and off site


It seems as though a substantial number of release events occur into the air and cannot be controlled by either the organization responsible for the event or an outside partner called in to deal with waste products or toxic materials. Also, the dramatically lower usage numbers for landfills, wells and surface treatment areas that fall under the more strict RCRA Class C category and the high numbers of "unknown" and "other" release events speaks to the relatively uncontrolled nature of toxic releases despite the reporting system.


R Script this can be loaded in R and used to reproduce this exercise's commands.

SQLite database file of EPA data.