The EPA Toxic Release data set has a lot of data describing how chemicals were releases or during what business process they were released.
Like earlier exercise the initial steps will involve connecting to a SQLite database and drawing out some specific columns, namely those dealing with release types.
Step 1: Load the ggplot and rsqlite libraries.
Step 2: Connect to database.
con <- dbConnect(SQLite(), "C:\\Users\\dnfehren\\Desktop\\epa.sqlite")
Step 3: Send query and fetch result.
query <- dbSendQuery(con, statement = "SELECT TRI_FACILITY_ID,Fugitive_Air,Stack_Air,Water,Underground_Class_I,Underground_Class_II_V,RCRA_C_Landfills,Other_Landfills,Land_Treatment,Surface_Impoundment,RCRA_C_Surface_Impoundment,Other_Surface_Impoundment,Other_Disposal,POTW_Total_Transfers,M10,M41,M62,M71,M81,M82,M72,M63,M66,M67,M64,M65,M73,M79,M90,M94,M99,M20,M24,M26,M28,M93,M56,M92,M40,M50,M54,M61,M69,M95 FROM tri");<
return <- fetch(query, n = -1)
This will involve some subsetting data frames as well as the melt function to create new data structures that can be plotted
Step 1: Grab the on site types of release from the query return and then rename the columns with easier labels.
on_site_wide <- return[c("TRI_Facility_ID","Fugitive_Air","Stack_Air","Water","Underground_Class_I","Underground_Class_II_V","RCRA_C_Landfills","Other_Landfills","Land_Treatment","Surface_Impoundment","RCRA_C_Surface_Impoundment","Other_Surface_Impoundment","Other_Disposal")]
names(on_site_wide) <- c("Facility ID","Air Leak","Air Release", "Water", "Strict Well", "Other Well", "Strict Landfill", "Other Landfill", "Soil", "Surface Holding", "Strict Surface Holding","Other Surface","Other")
Step 2: Grab the off site types of release from the query return and then rename the columns with easier labels.
off_site_wide <- return[c("TRI_Facility_ID","POTW_Total_Transfers","M10","M41","M62","M71","M72","M63","M64","M65","M73","M79","M90","M94","M99")]
names(off_site_wide) <- c("Facility ID","OS POTW","OS Storage","OS Solidification","OS WasteWater Treatment", "OS Well", "OS Old Well Data", "OS Surface", "OS Landfill", "OS Strict Landfill", "OS Soil", "OS Other Surface","OS Other", "Waste Broker", "Unknown")
Step 3: Use the melt() command to reshape the data frames from their "wide" forms to "long" forms where each column is given a single row and variable.
on_site_full <- melt.data.frame(on_site_wide, id=1)
off_site_full <- melt.data.frame(off_site_wide, id=1)
Step 4: The data set has a lot of zeros, here the subset() function removes them
on_site <- subset(on_site_full, on_site_full$value != 0)
off_site <- subset(off_site_full, off_site_full$value != 0)
Step 5: Use qplot to create some simple histograms to see the relative counts of release events.
qplot(variable, data=on_site, geom="histogram")
qplot(variable, data=off_site, geom="histogram")
Click for larger images
So it appears that most of the on site releases happen into the air, either by leaks and accidents or specific venting events. Off site releases look to most commonly involve shipping materials to publicly owned treatment works or to landfills. These make sense, most businesses do not have waste treatment facilities on site and would need outside organizations to deal with it.
The new frames are based on the counts for each variable in the melted dataframes created above and used to create more specific plots than the histograms.
Step 1: Tabulate on and off site variables and then combine the tables into a new data frame.
on_df <- as.data.frame(table(on_site$variable))
off_df <- as.data.frame(table(off_site$variable))
combo_df <- rbind(on_df,off_df)
names(combo_df) <- c("type","count")
Step 2: Pull out the similar release event types from the combined table and add a new column with labels for site type.
similar_methods <- combo_df[c(6,21,7,20,8,22,11,23),]
similar_methods["site"] <- c("on","off","on","off","on","off","on","off")
A bar plot should make it easier to compare the types of releases, and perhaps show if releases on or off site are more frequent.
Step 1: Create the initial ggplot object and define the data that the plot will be built from.
p <- ggplot(similar_methods)
Step 2: There is nothing in that plot, we need to apply the aesthetic layer, which defines the x-axis of the plot as using the factorization of the type column from the dataframe.
p <- p + aes(factor(type))
Step 3: That plot doesn't look right, it is only showing the presence of each variable, we don't need that as we have count data in our frame already. To use the count column the
weight parameter of the plot must be set to use count column from the dataframe.
p <- p + aes(factor(type), weight = count)
Step 4: Looks closer but there is no differentiation between on and off site. This can be done by applying a further aesthetic map to the data, using the site column data in the
fill parameter to color in the values based on the site type.
p <- p + aes(factor(type), fill = factor(site))
Step 5: That's much better but the stacking makes it hard to read. As a final step we will set set the bar plots
position parameter to "dodge" which will place columns with the same factor side by side.
p + geom_bar(position="dodge")
Click for larger images
It seems as though a substantial number of release events occur into the air and cannot be controlled by either the organization responsible for the event or an outside partner called in to deal with waste products or toxic materials. Also, the dramatically lower usage numbers for landfills, wells and surface treatment areas that fall under the more strict RCRA Class C category and the high numbers of "unknown" and "other" release events speaks to the relatively uncontrolled nature of toxic releases despite the reporting system.
R Script this can be loaded in R and used to reproduce this exercise's commands.