Here at Conveyal, we work with US Census data a lot. Historically, retrieving this data has been a bit difficult,
as you have to get the block-level geometries from one place,
data on demographics from another, and data on employment
from a third. You then have to extract all the files and join them in GIS. It’s even
more complicated if you hail from an urban area that encompasses multiple states (like the home of Conveyal: Washington, DC).
Finally, you have to interpret the column codes to map them to something meaningful for analysis (who knew, for
example, which cryptic code meant "number of jobs in manufacturing"?).
To solve this problem, we decided to create a seamless data source for US Census data. We retrieved the 11 million block-level geometries for all US states and territories, as well as LODES data for states where it is available. We merged all of these state-level datasets into a single national file, and then split it up into 63,645 Web Mercator tiles at zoom 11, stored on S3 in GeoBuf format. Each tile includes all blocks whose envelope overlaps that tile. We use our seamless-census tool to perform this processing step. We also gave all those cryptic columns human-readable names; since we’re not using shapefiles, column names are not limited to ten characters.
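The zoom-11 grid follows the standard Web Mercator ("slippy map") tiling scheme, so a block's tile indices follow directly from its longitude and latitude. Here is a minimal sketch of that math (the class and method names are illustrative, not part of seamless-census):

```java
// Standard Web Mercator ("slippy map") tile indexing at zoom 11.
public class TileMath {
    static final int ZOOM = 11;
    static final int N = 1 << ZOOM; // 2048 tiles per axis at zoom 11

    /** Tile column for a longitude in degrees. */
    static int lonToTileX(double lon) {
        return (int) Math.floor((lon + 180.0) / 360.0 * N);
    }

    /** Tile row for a latitude in degrees (rows increase southward). */
    static int latToTileY(double lat) {
        double latRad = Math.toRadians(lat);
        double y = (1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2;
        return (int) Math.floor(y * N);
    }

    public static void main(String[] args) {
        // Downtown Washington, DC is roughly at 38.9 N, 77.0 W.
        System.out.println(lonToTileX(-77.0) + "/" + latToTileY(38.9));
    }
}
```

At zoom 11 the whole world is a 2048 × 2048 grid, which is why only 63,645 tiles (the ones that actually contain US blocks) need to be stored.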
Once we’ve done that, it’s relatively easy to extract data for an arbitrary geographic bounding box (even one that crosses state lines). We just select the tiles that overlap the area of interest, download them, and run the features through a final geographic filter to weed out any overselection. Once that’s done, we can dump the features to a new GeoBuf file. We wrote a tool to do this as well, which is also in the GitHub repository.
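The tile-selection step can be sketched with the same slippy-map math: convert the bounding-box corners to tile indices and take every tile in between (tile rows grow southward, so the north edge of the box gives the smaller row number). This is an illustrative sketch, not the seamless-census code itself:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of selecting the zoom-11 tiles that cover a bounding box.
public class TileSelector {
    static final int ZOOM = 11;
    static final int N = 1 << ZOOM;

    static int tileX(double lon) {
        return (int) Math.floor((lon + 180.0) / 360.0 * N);
    }

    static int tileY(double lat) {
        double r = Math.toRadians(lat);
        return (int) Math.floor((1 - Math.log(Math.tan(r) + 1 / Math.cos(r)) / Math.PI) / 2 * N);
    }

    /** List "x/y" indices for all tiles overlapping the box; these are the tiles to download. */
    static List<String> tilesFor(double minLon, double minLat, double maxLon, double maxLat) {
        List<String> tiles = new ArrayList<>();
        // The north edge (maxLat) maps to the smallest row number.
        for (int x = tileX(minLon); x <= tileX(maxLon); x++) {
            for (int y = tileY(maxLat); y <= tileY(minLat); y++) {
                tiles.add(x + "/" + y);
            }
        }
        return tiles;
    }

    public static void main(String[] args) {
        // A small box around Washington, DC spans just a handful of tiles.
        System.out.println(tilesFor(-77.1, 38.8, -76.9, 39.0));
    }
}
```

Because tiles include every block whose envelope overlaps them, this selection can pull in blocks just outside the box, which is exactly why the final geographic filter is needed.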
There’s no reason why we should keep this to ourselves, either. This is open data and it should be accessible to the world,
so we’ve gone ahead and made the S3 bucket (lodes-data) where we store the tiles public. It’s a requester-pays bucket,
so anyone using it pays the (minuscule) S3 bandwidth costs directly to Amazon. Just use the credentials from your AWS account to access it;
the bandwidth you use will be added to your AWS bill. The data are the 2015 TIGER/Line blocks for every state, and 2013 LODES data
for all segments and job types. Massachusetts, Puerto Rico, and the US Virgin Islands have no LODES data available, and
Kansas uses the 2011 data (rather than 2013) because newer data is not available. We haven’t put demographic data from the
decennial census in yet.
The format we’ve devised isn’t specific to the US Census, either. We could use exactly the same infrastructure to handle extracts from any large dataset that can be represented as vector data, and then it could be accessed using the same tooling.
The extractor is also available as a Java class (see SeamlessSource and its subclasses in the seamless-census repository), so it’s easy to integrate with programs written in Java. The extractor is fairly simple, so it shouldn’t be difficult to port to other languages where a GeoBuf library exists.