Recently, I discussed some observations from my archive of GTFS-realtime data. The focus of that post was on the behavior of feeds in the corpus, not the mechanics of the archiver itself, but I don’t mean for it to be a black box. As I observed previously, GTFS-rt poses more of a challenge and causes more befuddlement for folks looking to get into transit data analysis than static data (and for good reason!). Storing static GTFS in a relational database for analysis is comparatively easy—each file maps, essentially one-to-one, to a well-structured database table. There are a few exceptions, like turning shapes into actual GIS objects, but the model mostly fits.

Archiving GTFS-rt, though, doesn't lend itself to such a straightforward approach. Sure, you can just fetch the feed from a cron job and store the response on disk, but then how do you perform useful analysis on that dataset? And how does that scale as you accumulate data?

So, in this post, let’s explore the solution space a bit, then discuss some of the angles that the designer of a GTFS-rt archiver may want to consider. Starting off, there are (I think) four reasonable approaches to archiving GTFS-rt data:

  1. Binary protobufs only: This is the most basic, most essential approach, and indeed it doesn’t even require a database. All that is necessary is to fetch the feed and store the binary protobuf data somewhere on disk¹ or in a cloud storage bucket (the first sketch after this list shows a minimal version of this loop). In essence this completely shifts the deserialization cost from fetch-time to query-time. If you know you’ll only ever be querying reasonably time-bounded sets of data, this may be fine! But if you need to be able to quickly and easily run needle-in-a-haystack queries over, say, a month or more of data, having no choice but to do a linear search through the dataset may be problematic.

    It’s also worth thinking about how this plays out over the long run. GTFS-rt data compresses pretty well, especially with modern compression algorithms like Zstandard. You can, of course, compress each fetch result as it happens, but you’ll get better compression performance² by bundling up larger batches of data. So this may lead to a two-stage approach where you hold data uncompressed on disk initially, and then, say every 24 hours, pack up the previous day’s fetches into a compressed tarball. (But, data in a compressed tarball may be even more unwieldy to query!) You can also store the fetches in a database, but this doesn’t necessarily provide much of an advantage over the filesystem or a cloud storage bucket, and some databases perform poorly with large binary data.

    As far as querying is concerned, there are some database engines, such as Spark SQL, which provide functions that will yield a nested table from protobuf data held in a column. This “solves” the querying problem, but again every query is going to require a full table scan plus the overhead of deserialization (unless you can at least bound the query by fetch time, but then you’re still doing a scan over some subset of the table). In other database engines which don’t have native functions for working with protobufs, this can potentially be implemented with a UDF, but that may incur even more overhead. (If you don’t care about writing SQL, you can also just brute-force this in any programming language with a Protocol Buffers library; the scan in the first sketch after this list shows the general shape.)

  2. 3NF object-relational mapping: One of the advantages Protocol Buffers provides GTFS-rt is a clear, well-defined schema. You can take this schema and map it to a relational model without much difficulty. With an ORM like Hibernate, you may not even have to write much code. Just chuck a FeedMessage object into the Hibernate session and let it rip. With that said, it’s worth thinking about the performance implications of this model. We ordinarily eschew joins in OLAP applications, and this approach leads to a lot of joins: a 1:n relationship between FeedMessage and FeedEntity; 1:1 relationships between FeedEntity and each of Alert, TripUpdate, and VehiclePosition; 1:n relationships between TripUpdate and StopTimeUpdate and between Alert and InformedEntity; and so on. If you don’t embed some of the little entities like Position and TripDescriptor where they’re used, you’re going to end up with a sizable number of tables, and the resultant database schema will almost certainly be unpleasant to use and sluggish to query. (The second sketch after this list gives a flavor of how the entity classes pile up.)

  3. Flattened tabular: Here, we still adopt a relational model, but we start making concessions to practicality. Do we need to hold on to all of the fields of the FeedHeader, and do we really need to hold them in their own table? Or is FeedHeader.timestamp really the only field we’re interested in, and if so, can we just embed it in the relevant entities (accepting that yes, we are duplicating data but overall making the schema simpler and easier to query)?

    And do we need a perfunctory table for FeedEntity, when FeedEntity is basically just an inert data-holder? Or can we just skip straight to tables for Alert, TripUpdate, and VehiclePosition? And to the extent that those entities have 1:n relationships, can we model those using the support that many modern databases have for arrays and embedded objects?

    In the end this shakes out as just three tables: one each for Alert, TripUpdate, and VehiclePosition.

  4. Feed contents as JSON: In this perfectly middle-of-the-road option, we deserialize protobufs as they are fetched, reserialize the objects to JSON, and store that JSON in a database table. This somewhat splits the overhead between fetch-time and query-time; it means that the database schema does not have to change to support GTFS-rt extensions (though the extensions must still be known to the archiver at fetch-time), and we can answer arbitrary questions of the data. In database engines with an efficient JSON data type, like ClickHouse, this works pretty well. In Postgres, though, with the jsonb data type, the overhead of TOAST will eventually kill you. (Ask me how I know.)
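
To make approach 1 concrete, here is a minimal sketch of the fetch-and-store loop and the corresponding brute-force scan, written against the Java GTFS-realtime bindings. The feed URL, the archive directory, and the file-naming scheme are all placeholder assumptions, and a real archiver would also want scheduling, retries, and error handling.

```java
import com.google.transit.realtime.GtfsRealtime.FeedEntity;
import com.google.transit.realtime.GtfsRealtime.FeedMessage;

import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.stream.Stream;

public class ProtobufArchiveSketch {

    // Placeholder feed URL and archive directory.
    private static final URI FEED_URL = URI.create("https://example.com/gtfs-rt/vehicle-positions.pb");
    private static final Path ARCHIVE_DIR = Path.of("/var/lib/gtfs-rt-archive");

    // Fetch the feed once and write the raw response body to a timestamped file.
    static void fetchAndStore() throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(FEED_URL).GET().build();
        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());

        Path out = ARCHIVE_DIR.resolve("vehicle-positions-" + Instant.now().getEpochSecond() + ".pb");
        Files.write(out, response.body());
    }

    // The query side of this approach: a linear scan over every stored fetch,
    // deserializing each file and picking out the entities of interest.
    static void findVehicle(String vehicleId) throws Exception {
        try (Stream<Path> files = Files.list(ARCHIVE_DIR)) {
            for (Path file : files.sorted().toList()) {
                try (InputStream in = Files.newInputStream(file)) {
                    FeedMessage feed = FeedMessage.parseFrom(in);
                    for (FeedEntity entity : feed.getEntityList()) {
                        if (entity.hasVehicle()
                                && vehicleId.equals(entity.getVehicle().getVehicle().getId())) {
                            System.out.println(file + ": " + entity.getVehicle().getPosition());
                        }
                    }
                }
            }
        }
    }
}
```

That is the cost structure of approach 1 in a nutshell: writes are nearly free, and every nontrivial question turns into a findVehicle-style loop over the archive.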
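
For approach 2, here is an equally abbreviated sketch of what hand-written JPA/Hibernate entity classes mirroring the protobuf schema might look like. The class names are hypothetical, nearly all fields and entity types are omitted, and the mapping details (cascades, fetch strategies, join columns) are glossed over; the point is simply how quickly the relationships, and therefore the tables and joins, pile up.

```java
import jakarta.persistence.*;
import java.util.List;

// One row per fetch of the feed.
@Entity
class FeedMessageRecord {
    @Id @GeneratedValue Long id;
    long headerTimestamp;

    // 1:n relationship: every fetch fans out into many entities.
    @OneToMany(mappedBy = "feedMessage", cascade = CascadeType.ALL)
    List<FeedEntityRecord> entities;
}

@Entity
class FeedEntityRecord {
    @Id @GeneratedValue Long id;
    String entityId;

    @ManyToOne FeedMessageRecord feedMessage;

    // 1:1 relationships: each entity carries at most one of these.
    @OneToOne(cascade = CascadeType.ALL) TripUpdateRecord tripUpdate;
    @OneToOne(cascade = CascadeType.ALL) VehiclePositionRecord vehiclePosition;
    @OneToOne(cascade = CascadeType.ALL) AlertRecord alert;
}

@Entity
class TripUpdateRecord {
    @Id @GeneratedValue Long id;
    String tripId;
    String routeId;

    // 1:n again; Alert/InformedEntity would look much the same.
    @OneToMany(cascade = CascadeType.ALL)
    List<StopTimeUpdateRecord> stopTimeUpdates;
}

@Entity
class StopTimeUpdateRecord {
    @Id @GeneratedValue Long id;
    String stopId;
    Integer arrivalDelay;
    Integer departureDelay;
}

@Entity class VehiclePositionRecord { @Id @GeneratedValue Long id; /* ... */ }
@Entity class AlertRecord { @Id @GeneratedValue Long id; /* ... */ }
```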

It’s also important to keep in mind that you can use a combination of these approaches. You might choose to stuff raw protobufs into a cloud storage bucket but also consume those into a database for easy querying. You might store protobuf-as-JSON in a database table but also use an insert trigger or ClickHouse incremental materialized view to pull out certain data of interest into another better-structured table. You might not ordinarily store the raw response body except in error cases, when it may be useful for troubleshooting.

So, how do these approaches score? There are plenty of tradeoffs between them, but here are a few that I think are worth considering in particular:

  • Fidelity: Some of these approaches are “destructive”, in that they make it challenging or impossible to completely reconstruct the fetched feed as a protobuf after-the-fact. In some applications, like a feed producer seeking a definitive record of what was served to consumers for diagnostic purposes (similar in principle to an aircheck in broadcasting), there may be no substitute for storing binary protobufs. At the other extreme, someone who is solely interested in the behavior of a single transit route or a handful of transit routes may see fit to discard data for the rest of the network!

  • Ease of querying: The flattened tabular approach is probably the easiest to query. The JSON approach is not much harder, though you end up with some boilerplate like array join response_contents.entity[] as e (in ClickHouse’s SQL dialect) wherever you want to iterate over entities, and it’s certainly not as memory- or CPU-efficient as either of the more relational approaches. The binary protobuf approach, as discussed previously, is almost certainly the hardest to query and basically locks you into a model of “loop over every fetch between X and Y, deserialize, perform analysis”.

    If you want to be able to write queries like “find me every time bus 1234 was within 500 feet of this point over the past month”, the flattened tabular approach is by far the best, and lets you take full advantage of database capabilities like indexes. (This example uses vehicle positions but of course the same goes for TripUpdate and Alert entities.)

  • Handling of errors: Folks who are purely interested in GTFS-rt from the transit perspective probably have no interest in errors. Skip the fetch and try again next time. On the other hand, folks who are approaching the archiving task from the data infrastructure perspective probably have a considerable interest in capturing metadata like HTTP headers, response time, and even the response body when it is invalid.

  • Handling of extensions: Though Protocol Buffers provide GTFS-rt with a well-defined, rigid schema, the GTFS-rt protobuf definition also permits the use of extensions, and they do appear in the wild in some feeds. So the designer of an archiver must consider whether they want or need to support extensions at all, and if so whether they need only to support specific fields from certain pre-determined extensions, or any arbitrary extension (in which case there may be no alternative but to store binary protobufs if the extensions being used are not known in advance).

    If only specific extension fields are of interest, for example, then either of the relational approaches described above can be made to work (treating the extension fields as though they were part of the base GTFS-rt schema). Even if a wider, but still bounded, set of extensions is in play, the relational approach can still be used, with an additional column added in the relevant places to hold extension data (say, as a JSON column, Postgres hstore, or ClickHouse map).

  • Level of infrastructure: If you have limited capacity to run things 24/7, then you may be inclined to choose an approach that yields a lighter-weight archiver implementation, such as just fetching and storing protobufs on disk or in a cloud storage bucket, rather than an approach which requires that you keep a database server running all the time. If you are storing the fetched data in a cloud storage bucket, for example, it becomes easy for users of the dataset to bring their own computing power to bear: perhaps your archiver is running on a little Raspberry Pi or a small cloud VM, but you have a beefy workstation with plenty of RAM and CPU power to run queries. (This is essentially a version of the “separation of storage and compute” discourse that is all the rage in data lake circles.)

After all this, you might be wondering “so what should I actually do”? My own archiver stores protobufs converted to JSON in a ClickHouse JSON column (the schema looks somewhat like this), and then I also run incremental materialized views to pull out slices of data to power specific data products like the BART stringlines. Some of the feeds I consume make significant use of extensions, and during initial design this seemed like the path of least resistance. With the Java protobuf libraries, Jackson, and jackson-datatype-protobuf, the conversion from protobufs to JSON is just a few lines of code. (Yes, the set of supported extensions is defined at build-time and extensions must be enabled in advance on a per-feed basis, but at least no changes to the data model are required.)
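
For the curious, the core of that conversion looks roughly like the sketch below. This is not the actual archiver code; the commented-out extension class is a stand-in for whatever generated extension classes a given feed needs, and error handling is omitted.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.protobuf.ExtensionRegistry;
import com.google.transit.realtime.GtfsRealtime.FeedMessage;
import com.hubspot.jackson.datatype.protobuf.ProtobufModule;

public class FeedToJson {

    // Extensions must be registered up front so that parseFrom() keeps their fields
    // instead of leaving them as unknown fields. The class below is a placeholder for
    // a real generated extension class (built from the relevant agency's .proto file).
    private static final ExtensionRegistry EXTENSIONS = ExtensionRegistry.newInstance();
    static {
        // SomeAgencyExtension.registerAllExtensions(EXTENSIONS);
    }

    // jackson-datatype-protobuf teaches Jackson how to (de)serialize protobuf messages.
    private static final ObjectMapper MAPPER =
            new ObjectMapper().registerModule(new ProtobufModule());

    static String toJson(byte[] responseBody) throws Exception {
        FeedMessage feed = FeedMessage.parseFrom(responseBody, EXTENSIONS);
        return MAPPER.writeValueAsString(feed);
    }
}
```

From there, the resulting JSON string is just another column value to insert into ClickHouse.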

With all of that said, as far as I am concerned the “flattened tabular” model is a very, very close runner-up, and if I were starting from scratch today, or implementing the archiver in a language other than Java, I might adopt that model, even though it requires more effort to support extensions. (I might also stop being fussy about wanting to be able to support arbitrary extensions and just special-case support for the feeds I consume that use extensions.)

  1. if you’re storing fetched feeds straight on the filesystem, it’s worth being mindful of inode exhaustion, as this approach will result in many small files being created over time

  2. if you want to wring every last bit of performance out of Zstd, you may also want to consider using a custom dictionary