Four Methods for Reverse Geocoding
In summary, this post will demonstrate four methods for reverse geocoding:
- ggmap reverse geocoding function
- Extract country and state/province data from shapefile
- Identify closest city
- Identify closest zip code (US Only)
In my years of munging geospatial data, I’ve run into a number of instances where I had a data set with a lat/lon combination but no other spatial reference (i.e. Country, State/Province, City, etc.). If you have a data set with latitude and longitude but no Country, State/Province, or City, then I’d recommend using one of the methods below (or a combination of them) to add those fields to your data. The easiest of these to use is the reverse geocoding function from the ggmap package. It is very easy to use and fairly accurate. However, it is rate limited to 2,500 reverse geocodes per day, and it can take up to 45 minutes to reverse geocode 2,500 lat/lon combinations (depending on your internet connection). Additionally, you will not be able to use this function from an air-gapped network or from an austere location with limited internet connectivity.
The last three methods run much faster, are not rate limited, and work on an air-gapped network (or in an otherwise austere environment). Among them: if you have US-only spatial data, I’d recommend the zip code method below; if you have international data and require Country and State/Province, use the shapefile data extraction method; if you require Country and City data, use the world city data in the third method; and if you require Country, State/Province, and City data, you will have to combine the shapefile data extraction and world city methods.
Using the revgeocode function
The revgeocode function from the ggmap package is intuitive and easy to use. An example is given below:
First, let’s establish some random cities to use as an example:
## Choose random cities to test our algorithms
cities <- data.frame(
lat=c(-33.86785,51.50726,55.75,39.90750,35.46756,-1.28333,52.52330,-22.90278),
lon=c(151.20732,-0.1278328,37.5,116.39723,-97.51643,36.81667,13.41377,-43.2075)
)
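With the cities in hand, here is a minimal sketch of reverse geocoding each point (this assumes an internet connection, and recent versions of ggmap also require a Google API key registered via register_google() before geocoding calls will succeed):
library(ggmap)
## revgeocode takes a c(lon, lat) numeric vector and returns the
## nearest street address as a character string
cities$address <- sapply(seq_len(nrow(cities)), function(i) {
  revgeocode(c(cities$lon[i], cities$lat[i]), output = "address")
})
Each element of cities$address now holds a full address string, from which Country, State/Province, and City can be parsed.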
Extract Data from Shapefile
This method extracts spatial data from a shapefile, most often to enrich your original data with Country and State/Province fields. It is the most accurate of the offline methods for adding Country and State/Province because it uses point-in-polygon calculations, and shapefile boundary data is usually more accurate than the city data used in the later examples.
To manipulate shapefiles, we first need to load the sp and maptools packages.
library(sp)
library(maptools)
Next you will need to download a Level 1 World Shapefile. I’ve found a decent shapefile that Harvard maintains at the following link: Harvard Level 1 World Shapefile.
After downloading and unzipping the shapefile, we read it into our environment below:
world.shp <- readShapePoly("g2008_1.shp")
proj4string(world.shp) <- CRS("+proj=longlat +ellps=WGS84")
This is what our Level 1 shapefile looks like (remember that a Level 1 shapefile is one administrative level below the country boundary for all countries).
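A map like this can be reproduced in one line, since sp provides a plot method for the polygons:
plot(world.shp)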
Now I’m going to read in a list of lat/lon combinations from around the world to test our code on. While these aren’t explicitly named, note that they were created by geocoding the following cities: Sydney, London, Moscow, Beijing, Oklahoma City, Nairobi, Berlin, and Rio de Janeiro.
## Choose random cities to test our algorithms
cities <- data.frame(
lat=c(-33.86785, 51.50726, 55.75000, 39.90750, 35.46756, -1.28333, 52.52330, -22.90278),
lon=c(151.20732, -0.1278328, 37.5, 116.39723, -97.51643, 36.81667, 13.41377, -43.2075)
)
Now let’s see a picture of our cities plotted over our shapefile:
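A sketch for reproducing that picture with base graphics:
plot(world.shp)
points(cities$lon, cities$lat, col = "red", pch = 19)  ## overlay our points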
Next, we need to convert the cities data frame to a SpatialPointsDataFrame. We start by copying it to a separate object called cities.sp, and then promote that copy to a SpatialPointsDataFrame.
cities.sp <- cities ## Copy the original data frame
coordinates(cities.sp) <- ~lon+lat ## Promote to a SpatialPointsDataFrame
proj4string(cities.sp) <- CRS("+proj=longlat +ellps=WGS84") ## Match the shapefile's projection
Next we will use a point-in-polygon calculation to determine which polygon each point falls in. The over function from the sp package allows us to conduct this calculation.
## Conduct point-in-polygon calculation and extract data
extracted.data <- over(cities.sp,world.shp)
extracted.data
## AREA PERIMETER G2008_1_ G2008_1_ID ADM1_CODE ADM0_CODE
## 1 76.60676282 78.223011 32966 32965 470 17
## 2 17.22851535 50.129004 9630 9629 3182 256
## 3 0.14140157 1.653553 9584 9583 2535 204
## 4 1.73808134 7.161621 13166 13165 899 53
## 5 18.00501971 25.756017 14010 14009 3250 259
## 6 0.05658006 1.664051 27161 27160 51328 133
## 7 0.11789101 2.070110 10602 10601 1310 93
## 8 3.83091826 21.150444 32532 32531 683 37
## ADM0_NAME ADM1_NAME
## 1 Australia New South Wales
## 2 U.K. of Great Britain and Northern Ireland England
## 3 Russian Federation Moskva
## 4 China Beijing Shi
## 5 United States of America Oklahoma
## 6 Kenya Nairobi
## 7 Germany Berlin
## 8 Brazil Rio De Janeiro
## LAST_UPDAT CONTINENT REGION STR_YEAR0 STR_YEAR1
## 1 20050415 Oceania Australia and New Zealand 0 0
## 2 20050825 Europe Northern Europe 0 0
## 3 20050415 Europe Eastern Europe 0 0
## 4 20050415 Asia Eastern Asia 0 0
## 5 20060103 Americas Northern America 0 0
## 6 20081111 Africa Eastern Africa 0 0
## 7 20051229 Europe Western Europe 0 0
## 8 20050415 Americas South America 0 0
## EXP_YEAR0 EXP_YEAR1
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 0 0
## 8 0 0
Next we will merge this data with our original data:
## Merge Data
cities$Country <- extracted.data$ADM0_NAME
cities$level_1 <- extracted.data$ADM1_NAME
Below is our final data frame.
| | lat | lon | Country | level_1 |
|---|---|---|---|---|
| 1 | -33.87 | 151.21 | Australia | New South Wales |
| 2 | 51.51 | -0.13 | U.K. of Great Britain and Northern Ireland | England |
| 3 | 55.75 | 37.50 | Russian Federation | Moskva |
| 4 | 39.91 | 116.40 | China | Beijing Shi |
| 5 | 35.47 | -97.52 | United States of America | Oklahoma |
| 6 | -1.28 | 36.82 | Kenya | Nairobi |
| 7 | 52.52 | 13.41 | Germany | Berlin |
| 8 | -22.90 | -43.21 | Brazil | Rio De Janeiro |
Now we have spatial data added to our original data frame and we can easily bin data by country or province.
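For example, a quick count of points per country:
## Bin the points by the extracted Country field
table(cities$Country)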
Reverse Geocode to City Level (with OpenGeoCode Data)
The next method I’ll demonstrate is to find the nearest world city. There are several databases available online for doing this. OpenGeoCode hosts a database (at the following link) that is produced by the National Geospatial-Intelligence Agency (NGA) and is fairly comprehensive. It does contain multiple entries for each city (usually to capture various spellings of the same city). My biggest complaint about this data set is that it does not contain any cities for the US or China. While I can sort of understand not having any US cities, I have no idea why it does not contain Chinese cities.
The other data set that I’ve used before is from Maxmind (link). This data set is not maintained as actively as the NGA data set, but it does cover all countries (including US cities). It is fairly large, containing over 3 million entries.
Both data sets require some cleaning. The NGA data set contains a unique feature ID, which can be used to remove duplicates; ensure that you keep either an English or Latin spelling of each city name. The Maxmind data set does not contain a unique feature ID, so I had to create one by concatenating the lat/lon combination.
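As a rough sketch of the NGA cleaning step (the column names here are assumptions about the download, with UFI as the unique feature ID and NT as the name type, so check your file before running):
## Hypothetical columns: UFI = unique feature ID, NT = name type,
## with "N" assumed to mark the approved romanized spelling
nga <- nga[order(nga$NT != "N"), ]  ## sort approved spellings first
nga <- nga[!duplicated(nga$UFI), ]  ## keep one row per unique feature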
With Maxmind Data
For this blog post, I will only demonstrate using the Maxmind data. First we’ll read in the data:
world.cities <- read.csv("worldcitiespop.txt",as.is=TRUE)
Next we will concatenate the lat/lon combination (with a separator, so that distinct coordinate pairs cannot collapse to the same string) in order to create a unique identifier. Having done this, we remove duplicates:
## Create identifier with lat/lon (the separator prevents distinct
## coordinate pairs from concatenating to the same string)
world.cities$id <- paste(world.cities$Latitude, world.cities$Longitude, sep="_")
## Remove Duplicates
world.cities <- world.cities[!duplicated(world.cities$id),]
Next, we will use the fields package to create a distance matrix and then choose the closest city to each of our points. The matrix has one row per query point and one column per candidate city, so it can get large quickly if you have many query points. Also note that for points near a border, this algorithm may find that the closest city is on the other side of the border.
library(fields)
## Create distance matrix (great-circle distances; rdist.earth returns
## miles by default and expects lon/lat column order)
mat <- rdist.earth(data.matrix(cities[,c(2,1)]), data.matrix(world.cities[,c(7,6)]))
## Add fields to data frame
cities$country <- NA
cities$city <- NA
cities$accent_city <- NA
cities$dist <- NA
## Loop through the data frame and find the closest city, merging in
## the country and city data
for(i in 1:nrow(cities)){
  k <- which.min(mat[i,])
  cities$country[i] <- world.cities$Country[k]
  cities$city[i] <- world.cities$City[k]
  cities$accent_city[i] <- world.cities$AccentCity[k]
  cities$dist[i] <- mat[i,k]
}
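The loop reads clearly, but the same lookup can be vectorized, which is noticeably faster on large query sets:
## Column index of the nearest city for every query point at once
k <- apply(mat, 1, which.min)
cities$country <- world.cities$Country[k]
cities$city <- world.cities$City[k]
cities$accent_city <- world.cities$AccentCity[k]
## Matrix indexing extracts each row's minimum distance
cities$dist <- mat[cbind(seq_len(nrow(cities)), k)]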
Now that this is done, we can print our new data frame to see how it worked.
| | lat | lon | country | city | accent_city | dist |
|---|---|---|---|---|---|---|
| 1 | -33.87 | 151.21 | au | pyrmont | Pyrmont | 0.43 |
| 2 | 51.51 | -0.13 | gb | charing cross | Charing Cross | 0.25 |
| 3 | 55.75 | 37.50 | ru | shelepikha | Shelepikha | 0.65 |
| 4 | 39.91 | 116.40 | cn | beijing | Beijing | 1.55 |
| 5 | 35.47 | -97.52 | us | oklahoma city | Oklahoma City | 0.02 |
| 6 | -1.28 | 36.82 | ke | nairoba | Nairoba | 0.00 |
| 7 | 52.52 | 13.41 | de | berlin | Berlin | 0.74 |
| 8 | -22.90 | -43.21 | br | praia formosa | Praia Formosa | 0.16 |
Reverse Geocode US Zipcodes
Reverse geocoding US zip codes is very similar to the world cities algorithm. Note that the zip code table can also be used in the other direction, to geocode data that has a zip code but no lat/lon. Several open source zip code data files are available online; the one I’m going to use here was accessed at OpenGeoCode.
First, let’s read the data into our environment. Note that in this case, I explicitly declare the field types so that the zip code field is read into R as a character field instead of a numeric field (if read in as numeric, any leading zeros would be dropped).
zip <- read.csv("cityzip.csv", as.is=TRUE,
                colClasses = c("character","character","character","numeric","numeric","numeric"))
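As an aside, the forward direction mentioned above is just a table join. A minimal sketch, assuming your data frame df carries a Postal column and that the zip table names its coordinate columns Latitude and Longitude (an assumption about cityzip.csv; check your download):
## Attach lat/lon to records that only carry a zip code
## (Latitude/Longitude column names are assumptions about the file)
df <- merge(df, zip[, c("Postal", "Latitude", "Longitude")], by = "Postal")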
Here’s the list of US cities that we will use to test our algorithm. These lat/lon combinations represent the city centers of Portland, Salt Lake City, Dallas, Knoxville, Oklahoma City, Orlando, Raleigh, and Boston.
us.cities <- data.frame(
lat=c(45.50786, 40.76078, 32.78306, 35.96064, 35.46756, 28.53834, 35.88904, 42.35843),
lon=c(-122.69079, -111.89105, -96.80667, -83.92074, -97.51643, -81.37924, -82.83104, -71.05977)
)
Next we use an algorithm very similar to the world city algorithm to reverse geocode the lat/lon combinations.
library(fields)
## Distance matrix between our points and every zip code centroid
## (lon/lat column order, distances in miles by default)
mat <- rdist.earth(data.matrix(us.cities[,c(2,1)]), data.matrix(zip[,c(5,4)]))
## Add fields to data frame
us.cities$city <- NA
us.cities$postal <- NA
us.cities$dist <- NA
## Loop through the data frame and find the closest zip code
for(i in 1:nrow(us.cities)){
  k <- which.min(mat[i,])
  us.cities$city[i] <- zip$City[k]
  us.cities$postal[i] <- zip$Postal[k]
  us.cities$dist[i] <- mat[i,k]
}
Let’s check our final product:
| | lat | lon | city | postal | dist |
|---|---|---|---|---|---|
| 1 | 45.51 | -122.69 | Portland | 97201 | 0.45 |
| 2 | 40.76 | -111.89 | Salt Lake City | 84114 | 0.42 |
| 3 | 32.78 | -96.81 | Irving | 75202 | 0.04 |
| 4 | 35.96 | -83.92 | Knoxville | 37902 | 0.21 |
| 5 | 35.47 | -97.52 | Oklahoma City | 73102 | 0.55 |
| 6 | 28.54 | -81.38 | Orlando | 32801 | 0.89 |
| 7 | 35.89 | -82.83 | Hot Springs | 28743 | 0.23 |
| 8 | 42.36 | -71.06 | Boston | 02109 | 0.51 |