If you work with real-time transit data, you've probably heard the steadily-increasing call for data producers to release their data in open, standardized formats like GTFS-realtime and SIRI. But how do you actually make your data available in those formats? Some AVL vendors are beginning to include standards-compliant APIs in their products, and that's great for agencies considering a new system or major upgrade. But what about the massive installed base of legacy AVL systems which have few open interfaces, if any?

Fortunately, there are ways to get data out of almost any AVL system, whether it was explicitly designed with open interfaces or not. Some of these techniques are more technologically sound than others, and some may require some relatively tricky programming, but if you can find the right software developer, almost any problem is soluble.

Here are five key strategies for extracting information from an AVL system. The first three are strongly recommended, while the last two should only be undertaken if no better interface is available, and if you have adequate technical support to implement a more complex solution.

  • Transform a proprietary API to GTFS-realtime or SIRI: Many AVL systems (both COTS and agency-homegrown) include non-standard APIs which can, with a bit of programming, be transformed into a modern, standards-compliant API. This is the approach I took with wmata-gtfsrealtime, to produce a GTFS-realtime feed from WMATA’s real-time bus data, septa-gtfsrealtime to produce a GTFS-realtime feed from SEPTA’s real-time bus and rail data, and ctatt-gtfsrealtime to produce a GTFS-realtime feed from CTA’s Train Tracker data. This is also the approach taken by onebusaway-gtfs-realtime-from-nextbus-cli, which converts from the NextBus API, and bullrunner-gtfs-realtime-generator, which converts from the Syncromatics API.
  • Query a reporting database: Some AVL systems can be configured to log vehicle positions, predicted arrival times, and other information to a database. Ostensibly these databases are meant to be used for after-the-fact incident analysis, performance reporting, etc., but there's nothing stopping an application from polling the database every 15-30 seconds to get the latest vehicle positions and predicted arrival times. Many GTFS-realtime feed producers take this approach, including ddot-avl, built by Code for America to extract real-time information from DDOT’s TransitMaster installation, HART-GTFS-realtimeGenerator, built by CUTR to extract real-time information from HART’s OrbCAD installation, and live_transit_event_trigger, built by Greenhorne & O'Mara (now part of Stantec) to produce a GTFS-realtime feed from Ride On’s OrbCAD installation.
  • Parse a published text file: Similar to the database approach, some AVL systems can be configured to dump the current state of the transit network to a simple text file (like this file from Hampton Roads Transit). This text file can be read and parsed by a translator which then generates a standards-compliant feed, which is the approach taken by hrt-bus-api, built by Code for Hampton Roads, and onebusaway-sound-transit-realtime.
  • Screen-scrape a passenger-facing Web interface: This is where we get into the less technologically-sound options. While the first three options focused on acquiring data from machine-readable sources, screen scraping involves consuming data from a human-readable source and transforming it back into machine-readable data. In this case, that might mean accessing a passenger-facing Web site with predicted arrival times, extracting the arrival times, and using that to produce a standards-compliant feed. This is the approach taken by this project, which screen-scrapes KCATA’s TransitMaster WebWatch installation to produce a GTFS-realtime feed. Compared to options which involve machine-readable data sources, screen-scraping is more brittle, and may make it more challenging to produce a robust feed, but it can be made to work.
  • Intercept internal AVL system communications: This is the last resort, but if an AVL system has no open interfaces, it may be possible to intercept communications between the components of the AVL system (such as a central server and a dispatch console or system driving signage at transit stops), decode those communications, and use them as the basis for a standards-compliant feed. This is a last resort because it will often require reverse-engineering undocumented protocols, and results in solutions which are brittle and will tend to break in unpredictable ways. But, it can be done, and if it's the only way to get data out of an AVL system, then go for it. This is the approach taken by onebusaway-king-county-metro-legacy-avl-to-siri.

As evidenced by the example links, every one of the strategies mentioned above has been implemented in at least one real-world application. No matter how old your AVL system is, no matter how far out of warranty or how unsupported it is, no matter how obsolete the technology is, some enterprising civic hacker has probably already figured out a way to get data out of the system (or is eager and ready to do so!). Every one of the tools linked in this post is open-source, and if it closely approximates your needs, you can download it today and start hacking (or find a local civic hacker and have them adapt it to meet your needs). And if none of the tools look close? Don't head for your procurement department and have them issue an RFP—instead, post on the Transit Developers Google Group; chances are your post will make its way to someone who can help, whether a local Code for America brigade, or an independent civic hacker, or another transit agency that has already solved the same problem.

Finally, I'd like to thank the participants in the Disrupting Legacy Transit Ops Software (Moving Beyond Trapeze) session at Transportation Camp DC 2015, who inspired me to write this post.