Last spring I posted an article discussing some shortcomings of the PBF format and proposing a much simpler OSM interchange format called VEX. I argued that while PBF has major size and speed advantages over the traditional OSM XML format, it is significantly more complex than necessary to achieve these goals.
I shared my proposal with the OSM development list along with some measurements showing file sizes comparable to those of PBF. There was a generally positive response to this proposal. I have updated the VEX specification and our implementation in light of the following feedback I received:
- My discussion focused too much on the size of the files. While compactness is desirable, storage and bandwidth are relatively cheap these days, so speed and flexibility of processing may be more significant.
- My measurements focused too much on extracts of specific regions. Results for the entire planet would be more meaningful because the structure of data, tags, etc. vary from one region to another.
- The VEX format’s block structure did not provide sufficient support for fast seeking and parallelization of OSM processing.
This new revision of the VEX format follows the same basic principles (see the detailed description in the previous post). It differs only in that:
- It is more regular, eliminating all special sentinel values and using simple list structures with explicit element counts throughout, which makes it even simpler to parse.
- Rather than compressing the whole stream file blocks are compressed individually. Block size and entity type declared up front in the block header, outside the compressed section.
Below is a diagram of the new file structure. The colored rows represent the data stream (green is uncompressed and blue is compressed), and the white boxes above them show how the file would be parsed:
These changes provide some major advantages. Because each block contains only one kind of entity (nodes, ways, or relations) and this type is stated outside the compressed section, blocks may be skipped without decompressing them. The consumer may for example skip over all nodes and ways to process only relations, then re-scan the file to process ways or nodes. Such multi-pass processing or seeking within files is common, and has until now required decompressing and decoding the entire file repeatedly.
It is of course also possible to pipeline the decompression, decoding, and compression stages in different threads. Blocks may be distributed to any number of threads for processing in parallel. Compression levels can be easily adjusted to trade off processing time against storage space.
The implementation remains quite simple. See the
DeflatedBlock*.java classes in Conveyal’s osm-lib repository.
File size and speed experiments
My first article on the VEX format reported a reduction in size over PBF, but later experiments did not yield the same result. I discovered that the difference was in fact due to reduced coordinate precision in VEX, resulting in slightly less entropy and therefore more effective compression by our combination of delta coding and zlib.
My implementation of the revised VEX format eliminates this error. Experiments on the full-planet OSM dataset without metadata still show that VEX is comparable in size to PBF, but VEX output from osm-lib is now about 2% larger than equivalent PBF output from the same library using the same level of zlib compression. I consider this difference in file size to be negligible considering the simplicity of the VEX format and the capacity for seeking to specific entity types. It is also entirely possible that a few adjustments or optimizations could eliminate this small size difference.
Anecdotally, reading entire planet in PBF format and writing it back out to PBF is about 40 percent slower than reading from and writing to VEX, but this difference cannot be taken at face value because our VEX handling code is multi-threaded. Additionally, both the VEX and PBF code are unoptimized Java subject to garbage collection etc. If there is continued interest in this format, I plan to perform more rigorous speed tests using optimized low-level code that could eventually be wrapped in bindings for scripting languages. Because of its straightforward file structure I suspect that the VEX implementation could be made extremely fast while remaining readable and maintainable.
In the same way that each block contains only a single entity type which is declared in its header, each block could be restricted to a specific layer of map data (streets, buildings, land use, points of interest, etc.) Individual layers could then be extracted or processed without decompressing the others. OpenStreetMap now contains at least as many buildings as streets in some places, so this feature could provide major speedups in applications where only one or the other is needed.
There are a few tags that are much more common than others. Assigning these common tags small integer constants based on their frequency in the full OSM database significantly reduced storage requirements in our original C implementation of Vanilla Extract. But as others have noted, compactness is not the only goal. This is an effective but somewhat awkward and rigid optimization.
Finally, author and version metadata are necessary for feature parity with XML and PBF, and provisions must be made for expressing changesets in VEX.