In the North American transit market, GTFS-realtime has become the dominant format for publication of real-time transit data by transit operators, easily eclipsing TCIP, SIRI, and a handful of proprietary APIs of varying quality. (In Europe, SIRI and a range of national standards continue to dominate the market.) This has, for the most part, been an enormous step forward in promoting the publication of high-quality real-time transit data in a standardized synoptic format.

However, there has always been a nagging challenge, one which rears its head from time to time: GTFS-rt is packaged in Protocol Buffers, an efficient, binary-encoded format which requires language-specific bindings to serialize and deserialize.
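
For readers who haven't touched the bindings before, here is a minimal sketch of what consuming a feed looks like when the tooling cooperates. It assumes the Python gtfs-realtime-bindings and requests packages, and the feed URL is a placeholder, not a real endpoint:

```python
# Minimal sketch: deserialize a GTFS-rt feed with the Python bindings.
# Assumes the gtfs-realtime-bindings and requests packages are installed;
# FEED_URL is a placeholder, not a real endpoint.
import requests
from google.transit import gtfs_realtime_pb2

FEED_URL = "https://example.com/gtfs-rt/vehicle-positions.pb"  # placeholder

response = requests.get(FEED_URL, timeout=30)
response.raise_for_status()

feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(response.content)  # binary protobuf payload

for entity in feed.entity:
    if entity.HasField("vehicle"):
        vp = entity.vehicle
        print(vp.trip.trip_id, vp.position.latitude, vp.position.longitude)
```

Ten-odd lines, once you have the bindings installed and working. That last clause is where the trouble starts.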

While some programming languages have excellent support for Protocol Buffers and high-quality tooling available, others do not, and certain mailing lists in the transit data world are rife with complaints about the challenges of working with Protocol Buffers. Some folks will drop a URL to a GTFS-rt feed in their browser or point curl at it, and upon getting a screenful of binary nonsense, give up and move on to some other project. This isn’t what we want, and it doesn’t serve the transit data community well.

I am sure that somewhere, in some room in Mountain View (or elsewhere) when Google was first planning to consume real-time data in Google Transit, some engineer sat down to work up a Fermi estimate: “if Google consumes X GTFS-rt feeds, Y times per minute, and it costs Z CPU-seconds to consume an XML feed, then it costs $α/year, but if we adopt Protocol Buffers, which are β% more efficient, then we can save $γ annually”, and so it was. Protocol Buffers are everywhere within Google, and this would not have been a difficult decision to make. The only wrench in the works is that Google is not the only consumer of GTFS-rt feeds, the producers are certainly not Googlers, and the capabilities of and tooling available to the average transit data engineer are very different from those of the average Google engineer. Google’s pervasive use of Protocol Buffers in their internal systems means that they have excellent tooling for working with protobufs, whereas for many developers on the outside, anything that isn’t JSON, CSV, XML, or perhaps YAML or TOML is seen as an imposition.

It’s funny, because while Google’s mission is to “organize the world’s information and make it universally accessible and useful”, this is done, in large part, by consuming unstructured data. There are very few cases where Google explicitly asks us to provide them with data in a format of their choosing. In fact, I thought about it for a bit and the only other widespread example I could come up with of Google specifying a format in which to consume data was Sitemaps, an XML format first developed by Google in 2005.

Ignoring the fact that Protocol Buffers did not exist in the open-source world until 2008, would Sitemaps have had anywhere near the adoption by webmasters that they did if they had to be generated in a relatively arcane, obscure, binary-only format that many developers find challenging to work with?

To be clear, there are very real technical factors in play here. Binary formats like Protocol Buffers are more efficient than their text-based counterparts, and that efficiency has impacts on bandwidth and memory consumption, CPU utilization, and in turn power consumption. We should make data available in efficient formats like Protocol Buffers (and others, such as Parquet) whenever possible, and even when using text-based formats should be mindful of the tradeoffs between them.

For example, JSON can be substantially more efficient to parse than XML. Barbeau et al. found that “[a]verage performance for cold starts (i.e., when the user first starts the mobile app) was over 4 times faster for JSON than XML, with an average time difference of 14 seconds” and that “JSON also had a noticeable performance advantage in warm starts, being an average of 224ms faster than XML”.

It’s possible to see these stats and make an argument that choosing a more efficient binary serialization format is some sort of planetary imperative—that choosing XML over other formats, knowing that XML is more resource-hungry to parse, contributes to global warming and is tantamount to hopping in a gas-guzzling SUV just to run around the block and pick up a coffee.

But is this really the hill we want to die on? We’re not just code golfing here; there is a very real, very practical goal of enabling and encouraging use of public transportation (and, in turn, reducing use of single-occupant vehicles and the many negative externalities associated with same).

Indeed, one of the guiding principles of static GTFS is that “[f]eeds should be easy to create and edit”, followed by “[f]eeds should be easy to parse”. Now, it’s certainly true that the technical considerations around real-time data are somewhat different from those for static data—real-time feeds are generated and parsed on a substantially more frequent basis than their static counterparts, and real-time data is almost always machine-generated, whereas static GTFS data can be hand-edited, at least for small, uncomplicated transit networks. Notwithstanding this, the overarching principle is clear: the GTFS ecosystem is meant to have a low barrier to entry.

Consequently, if producers find GTFS-rt so off-putting to work with that they are dissuaded from opening up their real-time data, or they hack together their own non-standard formats, that is a net loss. Similarly, if developers of consuming applications waste time banging their heads against the wall fighting with protoc, or give up entirely when they can’t find a working Protocol Buffers library for the programming language of their choice, that is a net loss.

And so, to that end, the data model provided by GTFS-rt is far more important than the specific realization of GTFS-rt in Protocol Buffers.

It is not the worst thing in the world if a producer or consumer prefers to deal with data in the GTFS-rt data model through a JSON or XML serialization of that data. Yes, it is true that these formats do not provide the same kind of built-in, mandatory schema validation as Protocol Buffers, but XML Schema and JSON Schema do exist, and support for XML or JSON is far, far more widely available in programming languages (particularly within their standard libraries or built-in functions) than Protocol Buffers.
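
As a sketch of how low that barrier could be, here is the same vehicle-position loop against a hypothetical JSON serialization of the GTFS-rt data model, using nothing but the Python standard library. The URL is a placeholder, and the field names assume the canonical proto3 JSON mapping of the GTFS-rt schema; any formally standardized JSON serialization would need to pin these down:

```python
# Sketch: consume a hypothetical JSON serialization of a GTFS-rt feed with
# only the standard library. The URL is a placeholder and the field names
# assume the proto3 JSON mapping; they are not a published spec.
import json
from urllib.request import urlopen

FEED_URL = "https://example.com/gtfs-rt/vehicle-positions.json"  # placeholder

with urlopen(FEED_URL, timeout=30) as response:
    feed = json.load(response)

for entity in feed.get("entity", []):
    vp = entity.get("vehicle")
    if vp:
        print(vp["trip"]["tripId"],
              vp["position"]["latitude"],
              vp["position"]["longitude"])
```

No third-party packages, no code generation, and the payload is something you can actually read when you point curl or a browser at it.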

To be clear, I like working with Protocol Buffers. When I presented an introduction to GTFS-realtime at TransportationCamp PHL 2019, I put far more time into talking about what is actually in a GTFS-rt feed than talking about how to consume one. My advice was simple: “there are libraries for many common languages; use them!”. But I also know that not all developers share my sentiment. Some developers really struggle with the tooling, and some languages simply don’t have good bindings for Protocol Buffers.

On top of this, there are already GTFS-rt producers, such as the MBTA, that provide feeds using the GTFS-rt data model but serialized to JSON rather than Protocol Buffers, and the world hasn’t ended yet.

At the end of the day, our goal should be to get more transit data in more peoples’ hands, not to engage in serialization snobbery. I am guilty of this, and I say this as a measure of atonement: if Protocol Buffers are in fact a serious barrier to entry for some GTFS-rt producers and consumers, then it is time to formally define standardized alternative serialization formats for GTFS-rt, including JSON and XML schemata, and accept that there may be cases where these are preferable, in addition to providing tools to convert between the various defined serialization formats.