Lessons learned from consuming GTFS-realtime at scale, hundreds of billions of StopTimeUpdates later...
Disclaimer: I work for a transit agency and have, at times, worked on passenger information and open data. The views contained in this post are entirely my own and not reflective of or endorsed by any transit agency or other organization.
I have operated a GTFS-realtime archiver (for fun) since early 2022. Why? Originally I had been trying to troubleshoot some feed problems at work, and I added in some other agencies’ feeds to use for peer benchmarking purposes, and then it just sort of took on a life of its own from there. It has evolved into a platform which I’ve used to build a range of tools to explore not only the behavior of these real-time feeds, but also the behavior of the transit networks themselves.
Because I had this archive handy, for example, it was easy for me to build stringlines for BART. When a friend came to me last year and asked “can you get me the track assignment of every MBTA Commuter Rail train at South Station this year”, after a few minutes of SQL, I had an answer. When some other friends wondered about WMATA’s Metrorail fleet allocation, I was able to throw together a quick query to show how WMATA’s allocation of 3000, 6000, and 7000-series cars to each of the five Metrorail lines had changed over the past year. When I recently found myself wondering about the behavior of a particular Muni vehicle, I was able to run a query and find every vehicle which had been within a certain radius of a given point during a particular 30-minute span, then find the complete trajectory for one of those vehicles once I’d picked it out of the data. Just for fun, I can also put more than ten thousand public transit vehicles on a real-time map.
These things were easy precisely because the basic infrastructure already existed, and every time I have one of these successes I am motivated to archive more feeds and do more with the archive. There’s a common theme I’ve heard over and over from folks wanting to get into transit data analysis: working with static data is easy, but real-time data is a lot more challenging. It’s one thing to go fetch an agency’s static GTFS once every few months, or even once a week, but fetching a real-time feed every 30 seconds poses a real obstacle. Even with the ready availability of archiving tools, you need somewhere to run the archiver, with a stable Internet connection, enough storage, and so on. At some point I intend to build a self-service tool to let folks fetch their own extracts from the archive.
In any event, in its present incarnation, the archive contains data for around 170 feeds, going back to the end of 2023 for the feeds I’ve been archiving the longest (though some I’ve only added more recently). In total, as I write this, the archiver’s database contains about 80 million feed fetches, consisting of roughly 16 billion GTFS-rt TripUpdate entities, 350 billion StopTimeUpdate entities, 9 billion VehiclePosition entities, and 750 million Alert entities. (For the latest numbers, you can refer to this dashboard.) The size on disk, thanks to ClickHouse’s considerable compression capabilities, is just over 1.2 TiB.
This is not particularly large, but it’s not small either, and along the way I’ve learned a lot about consuming GTFS-rt, and the behavior of producers. I’ve had some incredibly good experiences—readily discoverable, publicly available feeds which function reliably—and some rather poor experiences. In this post, I’d like to share some of the things I’ve learned, in the hope of improving the ecosystem for all.
Availability
A fairly basic point, but one worth stating: I can’t archive a feed I can’t get to or which doesn’t exist! Unfortunately, there are plenty of agencies out there that still don’t publish real-time data, or don’t publish it in standardized formats like GTFS-realtime, or only provide their data to certain consumers but do not publish it as open data. I can remember making impassioned arguments in favor of open transit data 15+ years ago, when the concept was still somewhat more novel, but today it feels like the kind of thing that should go without saying.
Discoverability
Discoverability is also a larger issue than I would have expected it to be in 2025. I really shouldn’t have to hunt to find a transit agency’s open data. I’d like to see a link somewhere on the homepage—certainly it doesn’t have to be as prominent as links to schedules, fares, maps, route information, and so on, but a link in the footer titled “Open Data” would be warmly appreciated. At the very least, searching for keywords like “GTFS” should be fruitful.
Some transit operators list their real-time data in the data catalog of their parent jurisdiction, and while this is certainly beneficial, it shouldn’t be to the exclusion of a link somewhere on their own Web site.
Similarly, there are a surprising number of agencies whose feeds are only listed in third-party data catalogs such as Interline’s Transitland or MobilityData’s Mobility Database and not mentioned anywhere on the agency’s own Web site. Agencies should understand that these are not authoritative references and that being listed in one of these catalogs does not relieve them of the need to provide this information on their own Web site as well.
Third-party producers
Agencies should take responsibility for their own open data, even if they rely on one or more third-party service providers in their data pipeline. Rather than requiring consumers to get data directly from the third-party service provider, agencies should redistribute it from a platform that they control.
Suppose the third-party service provider chooses to implement usurious terms and conditions for their API, and the agency throws up their hands and says to consumers “well, that’s between you and them”. Or suppose the provider starts passing judgement on who can and can’t get an API key (more on registration later), or on what is or isn’t considered an acceptable use of what is, after all, public data. None of these are acceptable circumstances for open data.
Agencies taking responsibility for their own data also insulates downstream consumers from changes in service providers. If an agency decides someday that they don’t want to renew their contract with Alex’s Bus Predictions and instead contract with Sam’s All-Inclusive Transit SaaS, that shouldn’t require every downstream consumer to get a new API key, reconfigure their consuming applications with a new URL, and so on.
I understand that agencies choose SaaS solutions precisely because they don’t want to maintain a lot of technology in-house. I understand that a tech-light agency that has procured a SaaS CAD/AVL solution might recoil at being told “now you have to go procure an API gateway, too!”. But by analogy to paper timetables, how would riders feel if they showed up at a customer service counter, asked for a map or a timetable, and were told “well, actually, go down the street to the copy shop on the corner, and negotiate directly with them for a map”?
Registration
As far as registration is concerned, I am generally of the view that it should not be required to access open data, including real-time feeds! Agencies will claim a whole host of reasons for requiring registration—that their legal department made them, that they need to know who is using their feeds, that they need to do so in order to combat abuse, and so on.
Most of these reasons do not stand up to close scrutiny. There are plenty of agencies who offer public, unauthenticated access to their static and real-time data, and they manage just fine.
But, let’s say that an agency really wants data consumers to register. The registration process should be entirely self-service, should require only an email address (and perhaps a name), and should immediately yield an API key which is active and ready for use. This much is fine, and indeed perfectly reasonable as a means to bypass anti-abuse measures aimed at unauthenticated users (similar to the approach Wikipedia takes, where users can register an account to contribute from an IP address which would otherwise be blocked).
However, a registration form which requires a local phone number or street address is a deterrent to use, and indeed contrary to the principles of Open Data. A registration form which asks questions like “purpose of use” is even more of a deterrent, and highly contrary to the principles of Open Data. A registration form which requires manual approval for each new account is, for all intents and purposes, not Open Data.
A link titled “Open Data” should not lead to a registration form with fields like “Business Contact Person” and “Business Email Address”. At the risk of being incendiary, this is not open data. Don’t bother putting it behind a link marked “Open Data”. If you only want to offer your data in a B2B context to folks who will mail you a paper copy of a contract with a wet-ink signature (which I have seen required!) then please, don’t call it open data.
Rate limits
Now, about rate limits. A real-time data feed is worthless if it is impossible to fetch it frequently enough to be useful. For most real-time data, this means at least every 30 seconds, and perhaps even every 5 or 10 seconds. (This is simply not hard for modern CDNs to cope with. In fact, it is pretty trivial.) If a feed is rate-limited to the point that it is not possible for consumers to fetch it often enough to keep the data fresh on their end, then it is not useful.
So, for example, if you offer registered users 60 fetches per hour, and you are publishing separate TripUpdate, VehiclePosition, and Alert feeds (as is the case for most GTFS-rt producers), then consumers have to make some hard decisions about what to fetch and when (e.g. do they fetch each feed once every three minutes?), and ultimately their downstream consumers are going to suffer from a poorer experience and stale data.
(Or, your consumers can just register for a couple of keys under assumed names and throwaway e-mail addresses, and do an end-run around your limits.)
Along the same lines, saying “oh, but you can email us and we might increase your rate limit” is not good enough.
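To put that 60-fetches-per-hour scenario in concrete terms, here is a minimal sketch (Python, with hypothetical feed URLs) of the cadence a consumer is forced into when three feeds have to share that budget: the best they can do is one fetch of each feed every three minutes.

```python
import time
import urllib.request

# Hypothetical endpoints; a real consumer would use the producer's published URLs.
FEEDS = [
    "https://example.com/gtfs-rt/trip-updates.pb",
    "https://example.com/gtfs-rt/vehicle-positions.pb",
    "https://example.com/gtfs-rt/alerts.pb",
]

REQUESTS_PER_HOUR = 60  # the producer's rate limit

# Three feeds sharing 60 requests/hour works out to one pass every
# 3600 / (60 / 3) = 180 seconds, i.e. three minutes of staleness at best.
PASS_INTERVAL = 3600 / (REQUESTS_PER_HOUR / len(FEEDS))

while True:
    for url in FEEDS:
        with urllib.request.urlopen(url, timeout=30) as resp:
            payload = resp.read()
        # ...hand payload off to parsing and storage...
    time.sleep(PASS_INTERVAL)
```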
Use of a CDN
It is natural for GTFS-rt producers to want to use a CDN to lighten the load on their backend server. However, it’s important to understand that without tuning, the average CDN is not appropriate for the task. The average CDN, with default settings, implements anti-abuse measures designed to deter or even entirely block automated requests. In some cases this will manifest in the CDN returning a CAPTCHA—a mechanism designed to distinguish humans from bots—which is completely inappropriate for an endpoint meant to be consumed automatically.
Mitigating abuse
Tying together some of the points about registration and rate limits and CDNs—invariably someone will ask “but what about abuse!?”. First, we have to define what abusive behavior is. It’s not 2-3 requests per feed per minute. That is normal GTFS-rt consumer behavior. 2-3 requests per second, on the other hand, is a little more problematic, and hundreds of requests per second is definitely problematic. The usual Internet “background noise” behavior (scanning for well-known vulnerabilities, etc.) is also problematic. These kinds of things should, if persistent, garner an IP block, though not an indefinite one. Block the offending host for a while, then let it through again and see if the behavior persists. This is pretty easy to do with a decent technology stack (such as with HAProxy), or a well-configured CDN. Certainly this shouldn’t require any sort of manual intervention on the part of the producer, and it shouldn’t require heavy-handed measures which harm legitimate consumers.
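Just to illustrate the shape of that policy, here is a rough Python sketch with made-up thresholds (in practice this belongs in something like HAProxy or the CDN, not in application code): count requests per source IP over a short window, temporarily block anything wildly over the line, and let the block expire on its own.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
ABUSE_THRESHOLD = 600    # a sustained 10 requests/second is well beyond normal GTFS-rt polling
BLOCK_SECONDS = 15 * 60  # block for a while, not forever

recent = defaultdict(deque)  # ip -> timestamps of recent requests
blocked_until = {}           # ip -> time at which the block expires

def allow_request(ip: str) -> bool:
    now = time.time()

    # Let a previously-blocked IP back in once its block expires.
    if ip in blocked_until:
        if now < blocked_until[ip]:
            return False
        del blocked_until[ip]

    # Record this request and slide the window forward.
    q = recent[ip]
    q.append(now)
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()

    # Temporarily block an IP that exceeds the abuse threshold.
    if len(q) > ABUSE_THRESHOLD:
        blocked_until[ip] = now + BLOCK_SECONDS
        q.clear()
        return False

    return True
```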
In addition, producers should not use technical measures to control access to their APIs by country or region. Though the idea of “well, we’re in Country X, so why would anyone outside Country X want our data” may be an attractive one, the reality is that it doesn’t hold up. There are many perfectly valid reasons why a consumer may access, or appear to access, GTFS-rt feeds from outside their country of origin, including providing services from geographically distributed infrastructure, research, and the oddities of Internet traffic (for example, IP geolocation is notoriously unreliable).
(My GTFS-rt archiver happens to be hosted at a datacenter in Beauharnois, Québec, because that’s where I could get cheap hosting!)
Other HTTP niceties
A few more points having to do with behavior at the HTTP layer:
- Compression: Though GTFS-rt feeds are packaged in the binary Protocol Buffers format, and this is more efficient than a text-based format, that doesn’t mean they don’t still benefit from compression. However, many Web servers and CDNs will only compress certain content types which they know to be compressible, and in particular this almost always excludes the generic `application/octet-stream` content type which is used to serve a majority of GTFS-rt feeds. (Incidentally, the documentation recommends `application/protobuf` be used as the content type for protobuf messages.) Supporting compression saves everyone bandwidth, and honestly, compression isn’t that CPU-intensive. Particularly if you’re using a cache or CDN which caches the compressed feed (as it should), this can be done in a fairly lightweight way, and Zstandard’s performance is amazing. (See the fetch sketch after this list.)
- Avoid custom headers: Setting aside that I would prefer feeds to be available without an API key, if one is required, it should be possible to pass the key as a query parameter, rather than having to set a custom HTTP request header. This is just a small quality-of-life improvement—it makes it possible to easily retrieve the feed in a browser, for example—and it avoids the need for clients to support setting custom request headers.
- Support HTTP `GET`: I wish I didn’t even have to make this point, but it should be possible to retrieve GTFS-rt feeds with an HTTP `GET` request. There is a GTFS-rt producer out there, for example, whose feed endpoints are only available via HTTP `POST` (and to make matters worse, these endpoints are only accessible with a short-lived token which must be retrieved from another endpoint which is itself heavily rate-limited). This sort of complex and non-standard behavior has no place in an open data dissemination pipeline. I had to implement a custom proxy to be able to archive the feed described here, and that just shouldn’t have been necessary.
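To make those points concrete, here is roughly what consuming a well-behaved feed should look like (a sketch using Python and the requests library; the URL and the api_key parameter name are made up): a plain GET, a key in the query string if one is required at all, and transparent decompression.

```python
import requests

# Hypothetical endpoint and key; a real consumer would substitute the producer's own.
URL = "https://example.com/gtfs-rt/trip-updates.pb"

resp = requests.get(
    URL,
    params={"api_key": "my-key"},        # key as a query parameter, only if one is required at all
    headers={"Accept-Encoding": "gzip"}, # requests sends this by default; shown here for emphasis
    timeout=30,
)
resp.raise_for_status()

# resp.content is the (transparently decompressed) Protocol Buffers payload.
print(len(resp.content), "bytes of", resp.headers.get("Content-Type"))
```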
Errors and other behavior
Now, let’s say that after all this, we actually fetch the feed. We’re done, right? Not quite. While some GTFS-rt producers can go days or weeks without any of their feeds returning an error, some have error rates on the order of 1-2%, and others as high as 5%!
What kinds of errors? Let’s take a look:
| Error type | Percent of errors observed |
|---|---|
| CDN returned a CAPTCHA | 39% |
| HTTP 404 Not Found | 20% |
| Protocol Buffer parsing failure | 15% |
| Other HTTP 500-series | 11% |
| Timeout | 9% |
| DNS resolution | 2% |
| Connection reset or ended unexpectedly | 1% |
| HTTP 403 Forbidden | 1% |
| HTTP 429 Too Many Requests | 1% |
| Failed to connect | 1% |
| HTTP 401 Unauthorized | <1% |
| TLS problem | <1% |
| Other | <1% |
| Other HTTP 400-series | <1% |
Surprisingly, it’s not always DNS! In fact, it’s most commonly misconfigured CDNs returning a CAPTCHA. The 404s somewhat perplex me. I can only assume they stem from naïve implementations which write GTFS-rt to a file which is then served up by a Web server, leading to race conditions where consumers who hit the endpoint at just the right moment find the file missing.
It’s also worth pointing out the fallibility of Protocol Buffers. I started working on GTFS-rt producers over a decade ago, and at the time I firmly believed that the protobuf bindings for my chosen programming language simply would not let me serialize an invalid message (that is, one missing required fields or with invalid field values). Yet I have seen manifestly invalid GTFS-rt feeds in the wild, and for a while I found them puzzling. Where were they coming from? Were there GTFS-rt bindings for some languages which allowed developers to serialize invalid messages? Yes, yes there were, and they were right under my nose!
Even Java’s protobuf bindings, which I thought I was pretty familiar with, have a buildPartial() method. So you know what? Now my archiver has an option to use the parsePartialFrom() method when parsing, for feeds which chronically generate invalid protobufs. But, as a general matter, GTFS-rt producers should not serialize invalid messages. If a message is missing required field values, that is a sign of a problem!
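For completeness, here is roughly what the consumer-side check looks like in Python (a sketch using the gtfs-realtime-bindings package; note that whether a missing required field surfaces as a parse error or only via IsInitialized() varies between protobuf runtimes):

```python
from google.protobuf.message import DecodeError
from google.transit import gtfs_realtime_pb2  # pip install gtfs-realtime-bindings

def parse_feed(payload: bytes):
    """Parse a GTFS-rt payload, tolerating (but flagging) missing required fields."""
    feed = gtfs_realtime_pb2.FeedMessage()
    try:
        feed.ParseFromString(payload)
    except DecodeError as exc:
        # Truly malformed protobuf (and, with some runtimes, missing required fields too).
        print(f"unparseable feed: {exc}")
        return None

    # Explicitly flag messages that are missing required fields such as trip_update.trip.
    if not feed.IsInitialized():
        print("feed parsed, but required fields are missing:", feed.FindInitializationErrors())

    return feed
```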
Anyway, why are we even talking about errors? At the low end, feeds with error rates of 1-2% are probably not worth worrying about. A backend with reasonable retry behavior should be able to try again upon receiving an error, and downstream consumers will probably never notice the glitch. But 5%, on the other hand—one in twenty fetches failing, or once every ten minutes, for a consumer fetching twice a minute—is a sign of a problem. When there are feeds that are able to routinely deliver error rates well under 1% (in fact, well under 0.1%, and in some cases closer to 0.01%), it’s worth looking at best practices.
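By “reasonable retry behavior” I just mean something along these lines (a sketch; the attempt count and delays are illustrative):

```python
import time
import requests

def fetch_with_retries(url: str, attempts: int = 3, base_delay: float = 2.0) -> bytes:
    """Fetch a feed, retrying a few times with increasing delays before giving up."""
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException:
            if attempt == attempts - 1:
                raise
            # Back off a little more before each retry; an isolated failed fetch
            # never has to reach downstream consumers this way.
            time.sleep(base_delay * (2 ** attempt))
```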
In general, the most reliable feeds are the ones which are distributed using a properly-tuned CDN, or where the feed contents are written to a public cloud storage bucket and made available to consumers from there. By contrast, those feeds where it’s obvious that the feed is coming straight from a CAD/AVL backend server (particularly one hosted on-prem by the agency!) tend to have the worst performance, and, in particular, tend to suffer under load (i.e. at peak times).
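As an illustration of the bucket approach (a sketch with boto3 and a hypothetical bucket name; any object store that offers public HTTP access works the same way), the producer simply overwrites one object per feed on every refresh and lets the storage service, or a CDN in front of it, absorb the read traffic:

```python
import boto3

s3 = boto3.client("s3")

def publish_feed(payload: bytes) -> None:
    """Overwrite the public feed object; consumers fetch it straight from the bucket or CDN."""
    s3.put_object(
        Bucket="example-agency-gtfs-rt",     # hypothetical bucket name
        Key="trip-updates.pb",
        Body=payload,
        ContentType="application/protobuf",  # the recommended content type for protobuf
        CacheControl="public, max-age=15",   # let caches hold it briefly between refreshes
    )
```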
Conclusion
I thought this was going to be an easy little post with a few observations, and instead it has turned into a three-thousand-word essay on all manner of feed misbehavior. Honestly, I wish this weren’t necessary. Having campaigned for open transit data for the past fifteen years, I very much wish we lived in a world where it was easy for every transit agency, large and small, to provide robust, high-quality static and real-time data.
Until that time, though, I hope this post serves to outline a clear set of best practices for dissemination of real-time transit data. If this post convinces even one agency to drop their registration requirement, or loosen up a rate-limit, or take control of their feed from a third-party SaaS AVL provider, or improve the discoverability of their feed, or publish GTFS-rt in the first place, then it will have been worth it.