Front | Resume | Data Analysis and Applications | MSI Course Work | Writings | Contact

Part 1 : Pie | Part 2 : Exploration | Part 3 : Factors | Part 4 : Layers | Part 5 : Geoms/Stats | Part 6 : Polishing | Part 7 : Final Thoughts

: EPA TRI Data Analysis : Part 2

Exploratory Views

The goal of this exercise is to go deeper into R and ggplot2 and to use them to make more interesting exploratory graphics.

I am still using the EPA Toxic Release Inventory 2008 dataset, but will use the exercise to go a bit further than one county like in part 1.

Section 1 : Preparing the R workspace

These are standard steps that will get ggplot loaded and the data into a R data frame, you'll be doing this in every exercise.

Step 1: Load the ggplot library (if you did part one it should be installed).


Step 2: Load the dataset.

Hopefully you still have the dataset in an R-readable form from Part 1, if so you can use the same commands to load the EPA data into R.

chem = read.delim("C:\\Users\\dnfehren\\Desktop\\tri_2008_US_v08.txt")

Section 2 : Which industries had the most release events in 2008?

We'll make a bar chart to figure this out, there are other ways that you could do it but this seems to work nicely.

Step 1: NAICS codes allow for the identification of specific business groups. You can get the official listing from the Census Bureau. Get the 2007 file and use Excel or another spreadsheet program to convert it from a .xls file to a .csv

Step 2: Read the NAICS codes into a data frame.

naics_description <- read.csv("C:\\Users\\dnfehren\\Desktop\\naics07-6.csv")

Step 3: Use R's table() function to get the frequencies of each NAICS code and save that table into a new data frame.

naics_cnt_frm = data.frame(table(chem$Primary.NAICS))

Step 4: Extract a subset of the new data frame, grabbing only rows with NAICS identifiers that have a frequency count greater than 500

naics_cnt_subfrm = subset(naics_cnt_frm, Freq > 250)

Step 5: Using R's merge() function, join the smaller subset with the NAICS list this will add readable industry descriptions for each NAICS code in the subset data frame.

subfrm_with_naics <- merge(naics_description, naics_cnt_subfrm, by.x = "X2007.NAICS.Code", by.y = "Var1")

Step 6: Plot the recently merged subset using more descriptive titles for each axis.

qplot(factor(X2007.NAICS.Code), data=subfrm_with_naics, geom="bar", fill=factor(X2007.NAICS.Title), weight=Freq)+scale_x_discrete("NAICS Industry Code")+scale_y_continuous("Number of Release Events")


Click for larger images

Bar chart of the industries with the most toxic release events

Section 3 : Which industries released the most toxic chemicals in 2008?

We'll make a another bar chart to figure this out, but setting up the data is slightly different.

Step 1: Sort the full 'chem' data frame by the descending amount of total releases into a new data frame.

ordered_tot_release <- chem[ order(-chem$Total.Releases) ,]

Step 2: Take the top 100 largest releases from the ordered data frame and create a new data frame with this selection.

top_releasers <- ordered_tot_release[1:100,]

Step 3: Similar to above merge the top_releasers data frame with the NAICS codes that you loaded earlier.

top_rel_with_naics <- merge(naics_description, top_releasers, by.x = "X2007.NAICS.Code", by.y = "Primary.NAICS")

Step 4: Plot the merged data frame using a bar chart.

qplot(factor(Facility.Name), data=top_rel_with_naics, geom="bar", fill=factor(X2007.NAICS.Title), weight=Total.Releases)


Click for larger images

Bar chart of the facilities with the largest toxic release events, color fill by industy

Section 4 : Where are the facilities that had the most release events?

No more bars for a bit. To see this data we will use a map projection. Getting the data into the proper formed proved a challenge for this section, so as a solution the task of maintaining the data was removed from R and given to a database.

For the following code to work you are going to need to have some extra software working on your computer.

Step 1: Create an SQLite database from your EPA data set

This might be slightly beyond the scope of this exercise, but to take advantage of the rest of the information here you will need a database file.

The easiest way is to import the file using a SQLite GUI utility to create the database,table and load the data from the text file.

It is also possible to import from the sqlite command shell but with over 100 columns in the dataset there is a lot that could go wrong when you are typing column names.

The rest of these steps will assume a database called "epa.sqlite" with one table called "tri."

Step 2: Load the map module, we'll need this for our projections later.


Step 3: Load the rsqlite module library, this will allow for communication between R and an SQLite database.


Step 4: Connect to the database file.

con <- dbConnect(SQLite(), "/users/dnfehren/Desktop/epa.sqlite")

Step 5: Create the query and send it to the database, saving the results in a variable.

event_mapping_query <- dbSendQuery(con, statement = "SELECT count(*) AS count, Facility_Name AS name, TRI_Facility_ID AS id, Latitude AS latitude, Longitude AS longitude FROM tri GROUP BY TRI_Facility_ID HAVING count(*) > 25 ORDER BY count(*) DESC");

Step 6: Get the data out of the query response object and into a dataframe

event_mapping_return <- fetch(event_mapping_query, n = -1)

Step 7: The query returned its data a characters, to use the mapping module we need the relevant data to be numeric.

event_mapping_return$count <- as.numeric(event_mapping_return$count)

event_mapping_return$longitude <-as.numeric(event_mapping_return$longitude)

event_mapping_return$latitude <-as.numeric(event_mapping_return$latitude)

Step 8: Create a scatter plot of the data, with size of the point related to the count of toxic events, then overlay it on a projection of the United States.

ggplot(event_mapping_return, aes(longitude, latitude)) + borders("state") + geom_point(aes(size = count))


Click for larger images

Map of United States overlayed with plot of the facilities with the most frequent toxic releases

Section 5 : Where are the facilities that had the largest release events?

This is similar to the last section, it also uses the database and the map projection, so you will still need those modules loaded.

Step 1: Assuming you are still connected to the database, this query will return the top 100 largest release events as shown in the Total_Releases column.

quant_mapping_query <- dbSendQuery(con, statement = "SELECT Total_Releases AS tot_rel, TRI_Facility_ID AS id, Latitude AS latitude, Longitude AS longitude, ST AS state FROM tri ORDER BY Total_Releases +0.0 DESC LIMIT 100");

Step 2: Get the query data into a data frame.

quant_mapping_return <- fetch(quant_mapping_query, n = -1)

Step 3: Convert text columns into numeric columns, add them back to the data frame.

quant_mapping_return$tot_rel <- as.numeric(quant_mapping_return$tot_rel)

quant_mapping_return$longitude <-as.numeric(quant_mapping_return$longitude)

quant_mapping_return$latitude <-as.numeric(quant_mapping_return$latitude)

Step 4: Plot the data.

ggplot(quant_mapping_return, aes(longitude, latitude)) + borders("state") + geom_point(aes(size = tot_rel))


Click for larger images

Map of US overlayed with dots scaled according to size of toxic release


From the first two charts it looks like both mining and the fossil fuel industry are the biggest culprits when it comes to releasing toxic checmicals into the environment.

Fossil fuel facilities have more frequent releases but mining facilities release larger ammounts of chemicals when they happen.

The geographic distribution is not as clear cut. The prevalance of events on the Gulf Coast probably reflect the high density of oil refineries located there. Also though it is not shown in outline on the map the largest release event of 2008 took place in Alasksa, another center for fossil fuel extraction and processing.

Section 6 : How dangerous are spills as they get bigger?

This is a little more abstract. In this part we will attempt to see what kind of chemicals are dumped and how dangerous they might be.

The EPA provides a couple of useful categories to help determin this, one is the Classification of the chemical PCB, Non-PCB and Dioxin as well as if the chemical is a Carcinogen.

To take a closer look we will plot the frequencies of Classifications and Carcinogens for different categories of total release size

Step 1: Send a query that removes a handful of really huge release events from the top of the dataset and fetch the data back into R.

useful_releases_query <- dbSendQuery(con, statement = "SELECT * FROM tri WHERE Total_Releases+0.0 < 50000000 ORDER BY Total_Releases+0.0 DESC");

return <- fetch(useful_releases_query, n = -1)

Step 2: Reclass the Total_Release column as numeric.

return$Total_Releases <-as.numeric(return$Total_Releases)

Step 3: Create a lable vector to use in the cut fuction

label_holder = c("1","2","3","4","5","6","7","8","9","10")

Step 4: Cut the Total_Releases into 10 parts, use the above vector to lable the new break factors and set ordered to true.

return$Total_Releases <- cut(return$Total_Releases, breaks=10, labels = label_holder, ordered_result=TRUE)

Step 5: Create levels in the Classification and Carcinogen columns based on the values in the column.

levels(return$Classification) <- c("NON-PCB","PCB","DIOXIN")

levels(return$Carcinogen) <- c("YES", "NO")

Step 6: Create a table cross tabluating Classification, Caricenogen and Total Releases and force it into a dataframe.

combo_table <-$Classification, return$Total_Releases, return$Carcinogen))

Step 7: Rename the columns of the dataframe

colnames(combo_table) <- c("classification","release_size","carcinogen","freq")

Step 8: Plot all of the data, jittered to better show the points.

p <- ggplot(combo_table, aes(factor(release_size), freq))

p + geom_point(aes(colour = factor(classification), shape = factor(carcinogen)), position="jitter")


Click for larger images

Point plot of chemical spills by classification/carcinogen and size

Step 9: Most the frequencies are below 15, reducing the scale and replotting shows more useful variation.

p <- ggplot(combo_table, aes(factor(release_size), freq))<\p><\code>

p + geom_point(aes(colour = factor(classification), shape = factor(carcinogen)), position="jitter")+ scale_y_continuous(limits=c(0, 15))<\p><\code>


Click for larger images

Point plot of chemical spills by classification/carcinogen and size


Exploration R Script this can be loaded in R and used to reproduce this exercise's commands.

Zip compressed tab-delimited text file of the EPA data.

NAICS codes and description .csv file.

Zipped EPA data in .csv with column names cleaned for database entry.

Zipped EPA data in tab-delimited .txt with clean(-ish) columns for database.

SQLite database file of EPA data.