Front | Resume | Data Analysis and Applications | MSI Course Work | Writings | Contact

Part 1 : Pie | Part 2 : Exploration | Part 3 : Factors | Part 4 : Layers | Part 5 : Geoms/Stats | Part 6 : Polishing | Part 7 : Final Thoughts

: EPA TRI Data Analysis : Part 6

Polishing Plots

Standard beginning, querying the large EPA databases for a set of columns, factoring it and then loading a new set of data that holds poverty statistics for all US counties.

Step 1: Load the ggplot and rsqlite libraries.



Step 2: Load the function to add region data to the dataframe based on state name abbreviations.

source(file = "C:\\Users\\dnfehren\\Desktop\\final_asm\\regionalize.RData")

Step 3: Connect to database.

con <- dbConnect(SQLite(), "C:\\Users\\dnfehren\\Desktop\\epa.sqlite")

Step 4: Send query and fetch result.

query <- dbSendQuery(con, statement = "SELECT year, facility_state, facility_county, (total_air+total_surface_water+total_underground_water+total_land) AS total_release FROM tri")<

ret <- fetch(query, n = -1)

Step 5: Factor the year, region and facility state columns from the new data frame.

#factor year, states and regions




Step 6: Load the poverty csv into an R data frame.

pov <- read.csv("C:\\Users\\dnfehren\\Desktop\\final_asm\\poverty_data.csv")

Section 2 : Create specific data frame for the state of Michigan

Step 1: Subset the two large dataframes, only grabbing MI rows.

mi_pov <- subset(pov, state == "MI")

mi_rel <- subset(releases, facility_state == "MI")

Step 2: Tabulate the occurences of release by year and county.

mi_tab <-$year, mi_rel$facility_county))

colnames(mi_tab) <- c("year","county", "freq")

Step 3: Merge the tabulated results with the poverty data.

mi_mrg <- merge(mi_tab, mi_pov, by.x = c("county","year"), by.y = c("county_name","year"))

Step 4: Remove the factors that were automatically applied when the frame was tabulated.

mi_mrg$all_age_pov_percent <- as.numeric(as.character(mi_mrg$all_age_pov_percent))

Section 3 : Create the plot, adding polishing elements.

Step 1: Set data frame based on tablulated and merged table.

mi_rel_freq <- ggplot(mi_mrg)

Step 2: Create the initial plot.

mi_rel_freq <- mi_rel_freq + geom_point(aes(all_age_pov_percent,freq, color=year))

Click for larger images


Step 3: Change the colors of the points in the plot to better reflect progression of time.

mi_rel_freq <- mi_rel_freq + scale_colour_manual("Year", values = c("1999"="#F7FBFF", "2000"="#DEEBF7", "2001"="#C6DBEF", "2002"="#9ECAE1", "2003"="#6BAED6", "2004"="#4292C6", "2005"="#2171B5", "2006"="#08519C", "2007"="#08306B", "2008"="#06234D"))

Click for larger images


Step 4: Add titles for the plot, the axes, and edit the title of the legend.

mi_rel_freq <- mi_rel_freq + opts(title = "Connecting poverty rates and number of toxic releases in MI counties") + scale_x_continuous("Percent of Total Population Below Poverty Line") + scale_y_continuous("Number of Toxic Release Events")

Click for larger images



R Script this can be loaded in R and used to reproduce this exercise's commands.

R function load this with the source() command to allow for adding regions to dataframes with state abbreviations.

Custom SQLite database file of EPA data from 1999 to 2008 (~160MB).