The version of the dataset that I downloaded consists of 320,574 crimes reported during this period, and includes information about the nature of the incident (“Homicide”, “Larceny”, “Assault”, etc.), date & time of the report, as well as the location.
In this post, we’ll focus on the geospatial aspect of the data: how crime reports are distributed geographically across Boston.
Cleaning Up the Data
To start, let’s take a closer look at the dataset. Here are some of its columns and the first few records in the table:
We see that the data contains one record for each incident, as well as a number of other structured fields regarding the incident.
To plot the data on a map, we will use the “Location” field. However, we first need to convert this field into a more useful form: in particular, we need to split the latitude and longitude into two separate columns. To do this, we can use the stringr package with a bit of regex like so:
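A minimal sketch of that split, assuming the “Location” field holds strings of the form `(latitude, longitude)`. The `crimes` data frame, its two sample rows, and the `lat`/`lon` column names are hypothetical stand-ins, so adjust the pattern if your copy of the data is formatted differently:

```r
library(stringr)

# Hypothetical two-row sample standing in for the real crime table.
crimes <- data.frame(
  Location = c("(42.34638, -71.10379)", "(42.31684, -71.07458)"),
  stringsAsFactors = FALSE
)

# Capture two (optionally negative) decimal numbers separated by a comma.
# str_match() returns a character matrix: column 1 is the full match,
# columns 2 and 3 are the two capture groups.
coords <- str_match(
  crimes$Location,
  "\\(\\s*(-?[0-9.]+)\\s*,\\s*(-?[0-9.]+)\\s*\\)"
)

crimes$lat <- as.numeric(coords[, 2])
crimes$lon <- as.numeric(coords[, 3])
```

Rows whose “Location” value does not match the pattern end up with `NA` in both new columns, which makes malformed records easy to spot and drop before plotting.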
Exploring the Data with ggmap
With the location field cleaned up, we can now get to plotting. For this, we will use the ggmap package, which uses a ggplot-like grammar for easily plotting geospatial data. As a first exercise, here is the geographical distribution of drug charges in Boston:
If you’re familiar with ggplot, you’ll note that ggmap is conceptually very similar. In this case Bos_map is my base layer map of Boston (centered on the lat-long determined by geocode). I then use geom_point to add the crime data layer to the map. If you want to learn more, a good introduction to ggmap can be found here.
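For reference, here is a rough sketch of how such a plot can be put together. It assumes a data frame of drug charges with numeric `lat` and `lon` columns, and it assumes a Google Maps API key stored in an environment variable I've called `GOOGLE_MAPS_API_KEY` (a hypothetical name), since recent versions of ggmap need a key for geocode and for Google base maps. The zoom level and point styling are illustrative choices, not necessarily the ones behind the figure above:

```r
library(ggmap)
library(ggplot2)

# Sketch only: `drug_charges` is assumed to be a data frame with numeric
# `lat` and `lon` columns. The function name is hypothetical.
plot_drug_charges <- function(drug_charges) {
  # Assumption: the API key lives in this (hypothetical) environment variable.
  register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))

  # Look up the lat-long of Boston and fetch a base map centered on it.
  boston <- geocode("Boston, MA")
  Bos_map <- get_map(
    location = c(lon = boston$lon, lat = boston$lat),
    zoom = 12,
    maptype = "roadmap",
    source = "google"
  )

  # Base layer first, then the crime reports as a point layer on top.
  ggmap(Bos_map) +
    geom_point(
      data = drug_charges,
      aes(x = lon, y = lat),
      colour = "red", alpha = 0.3, size = 1
    )
}
```

Calling `plot_drug_charges(drug_charges)` returns a ggplot object, so further layers or themes can be added with `+` as usual.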
As for the plot itself, a couple of things are of note. For one thing, you may notice that a few regions of the city are conspicuously lacking in crime. In particular, there appear to have been exactly zero drug charges in Cambridge and Somerville over the last few years; likewise for Brookline. Of course this isn’t correct; rather, these areas are not part of the City of Boston proper (they have their own police departments) and hence were not included in the dataset.
To clarify this point, I thought it might make sense to visualize the exact geographic extent of the City of Boston. To do this, I pulled down the Boston Neighborhood Shapefiles from the open data portal. I have no previous experience working with shapefiles, but after a bit of googling and experimentation (and installing some new packages), I was able to plot them with ggmap:
Each of the red-shaded polygons in this image represents a different neighborhood in the City of Boston. If we overlay this map with our previous drug charges plot, we see that, as expected, our dataset is entirely contained in this area (drug charges now represented in black):
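One way to build this kind of overlay is sketched below using the sf package, which is not necessarily the set of packages used for the figures above. The function name, the `shp_path` argument, and the styling values are hypothetical; `base_map` is assumed to be a ggmap base map such as Bos_map, and `drug_charges` a data frame with numeric `lat` and `lon` columns:

```r
library(sf)
library(ggmap)
library(ggplot2)

# Sketch only: the function name and arguments are hypothetical.
plot_neighborhoods <- function(base_map, shp_path, drug_charges = NULL) {
  # Read the neighborhood polygons and reproject them to WGS84 lon/lat
  # so that they line up with the coordinates of the crime reports.
  neighborhoods <- st_transform(st_read(shp_path, quiet = TRUE), crs = 4326)

  # inherit.aes = FALSE keeps geom_sf from picking up ggmap's default
  # lon/lat aesthetics, which the polygon data does not have.
  p <- ggmap(base_map) +
    geom_sf(
      data = neighborhoods,
      fill = "red", colour = "red", alpha = 0.3,
      inherit.aes = FALSE
    )

  # Optionally overlay the drug charges as black points.
  if (!is.null(drug_charges)) {
    p <- p +
      geom_point(
        data = drug_charges,
        aes(x = lon, y = lat),
        colour = "black", size = 0.5
      )
  }
  p
}
```

Adding geom_sf replaces the coordinate system that ggmap sets up, so polygons can sit slightly off from the base-map tiles at some zoom levels; for a quick visual check of the dataset's extent this is usually acceptable.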