Four Methods for Reverse Geocode
In summary, this post will demonstrate four methods for reverse geocoding. These three methods are summarized below:
- ggmap reverse geocoding function
- Extract country and state/province data from shapefile
- Identify closest city
- Identify closest zip code (US Only)
In my years of munging geospatial data, I’ve run into a number of instances when I have a data set with a lat-lon combination, but no other spatial reference (i.e. Country, State/Province, City, etc.). If you have a data set with latitude and longitude but not Country, State/Province, or city, than I’d recommend using one of the methods below (or a combination of them) to add a Country, State/Province, and/or City field to your data. The easiest of these to use is the reverse geocoding function from the ggmap package. This is very easy to use and fairly accurate. However, it is rate limited at 2500 reverse geocodes daily, and can take up to 45 minutes to reverse geocode 2500 lat/lon combinations (depending on internet connection). Additionally, you will not be able to use this function from an air-gapped network or from an austere location with limited internet connectivity. The last three methods work must faster, are not rate limited, and can work in an air-gapped network (or otherwise austere environment). In these cases, if you have US Only spatial data, I’d recommend using the zip code method below. If you have international data, and require Country and State/Province, use the shapefile data extraction method. If you require Country and City data, use the world city data in the second method. If you require Country, State/Province, and City data, you will have to use a combination of both the shapefile data extraction and international city data methods.
Using the revgeocode function
First, let’s establish some random cities to use as an example:
Extract Data from Shapefile
This method will extract spatial data from a shapefile. This is most often conducted to enrich your original data with Country and State/Province data. This method is the most accurate method for adding country and state/province since it uses point-in-polygon calculations. The shapefile data is also usually more accurate than the city data used in the next example.
To manipulate shapefiles, we first need to load the sp and maptools packages.
Next you will need to download a Level 1 World Shapefile. I’ve found a decent shapefile that Harvard maintains at the following link: Harvard Level 1 World Shapefile.
After downloading and unzipping the shapefile, we read it into our environment below:
This is what our Level 1 shapefile looks like (remember that Level 1 Shapefiles are one level below country boundary for all countries).
Now I’m going to read in a list of lat/lon combinations from around the world to test our code on. While these aren’t explicitly named, note that they were created by geocoding the following cities: Sydney, London, Moscow, Beijing, Oklahoma City, Nairobi, Berlin, and Rio de Janero.
Now let’s see a picture of our cities plotted over our shapefile:
Next, we need to convert the cities data frame to a Spatial Points Data Frame. We start by creating a separate object called cities.sp, and then creating the Spatial Points Data Frame.
Next we will use point-in-polygon calculations to determine which polygon each point fall in. The over command in R is a powerful command that allows us to conduct this point-in-polygon calculation.
Next we will merge this data with our original data:
Below is our final data frame.
|1||-33.87||151.21||Australia||New South Wales|
|2||51.51||-0.13||U.K. of Great Britain and Northern Ireland||England|
|5||35.47||-97.52||United States of America||Oklahoma|
|8||-22.90||-43.21||Brazil||Rio De Janeiro|
Now we have spatial data added to our original data frame and we can easily bin data by country or province.
Reverse Geocode to City Level (with OpenGeoCode Data)
The next method I’ll demonstrate is to find the nearest world city. There are several data bases available online for doing this. Open Geo Code maintains a database at the following link. This database is maintaned by the National Geospatial Agency (NGA) and is fairly comprehensive. It does contain multiple entries for each city (usually to capture various spellings of the same city). My biggest complaint of this data set is that it does not contain any cities for the US or China. While I can sort of understand not having any US cities, I have no idea why it does not contain Chinese cities.
The other data set that I’ve used before is from Maxmind link. This data set is not maintained as much as the NGA data set, but does contain all countries (to include US cities). This is a fairly large dataset, containing over 3 million entries.
Both data sets require some cleaning. The NGA dataset contains a unique feature ID, which can be used to remove duplicates. Ensure that you maintain either an english or latin spelling of city names. The Maxmind data set does not contain a unique feature ID, so I had to create one by concatenating the lat/lon combination.
With Maxmind Data
For this blog post, I will only demonstrate using the Maxmind data. First we’ll read in the data:
Next we will concatenate the lat/lon combination in order to create a unique identifier. Having done this, we remove duplicates:
Next, we will use the fields package to create a distance matrix, and then choose the closest city to each of these grids. Note that if you have grids near borders, this algorithm may find that the closest city is on the other side of the border.
Now that this is done, we can print our new data frame to see how it worked.
|2||51.51||-0.13||gb||charing cross||Charing Cross||0.25|
|5||35.47||-97.52||us||oklahoma city||Oklahoma City||0.02|
|8||-22.90||-43.21||br||praia formosa||Praia Formosa||0.16|
Reverse Geocode US Zipcodes
The reverse Geocode of US Zipcodes is very similar to the world cities algorithm. Note that this data frame can also be used to geocode data that has a zip code but does not contain lat/lon. Several open source zip code data files are available online. The one that I’m going to use here was accessed at OpenGeoCode.
First, let’s read the data into our environment. Note that in this case, I explicitly declare the field types so that the zipcode field is read into R as a character field instead of of a numeric field (if read in as a numeric field, it will remove any leading zero’s).
Here’s the list of US cities that we will use to test our algorithm. These lat/lon combinations represent the city center of Portland, Salt Lake City, Dallas, Knowxville, Oklahoma City, Orlando, Raleigh, and Boston.
Next we use an algorithm very similar to the world city algorithm to reverse geocode the lat/lon combinations.
Let’s check our final product:
|2||40.76||-111.89||Salt Lake City||84114||0.42|