How should transit agencies make their GTFS available?

To many techies, the question of how transit agencies should make their GTFS available might seem like a silly one. They’d reply that obviously the agency should simply post their GTFS to their Web site at a reasonable URL, and make that URL readily available from the agency’s developer resources page.

Unfortunately, it isn’t nearly so simple in the real world. Instead, many agencies hide their GTFS behind a “clickwrap” license, or even require a login to download the feed. In a few particularly bad cases, developers even have to sign an agreement and return it (on paper) to get access to a feed. Some agencies don’t host their own feeds at all, instead depending on sites like the GTFS Data Exchange.

So, what are some best practices for hosting GTFS feeds?

  • Don’t rely on third parties: Think of this in terms of paper maps and schedules. How would riders feel if a transit agency told them to pick up transit maps and timetables not at the agency’s offices or stations, but rather at some unrelated third party? If a transit agency has a Web site (as almost all do), then it should be capable of hosting its own GTFS feed. Sure, some agencies will complain about what their content management system “won’t let them do”, or about some arduous process for uploading new content, but in 2014 running a Web site is a basic competency for almost any organization. Depending on a third-party site introduces additional risk and additional points of failure.
  • Help developers discover feeds: Developers shouldn’t have to hunt for GTFS feeds; there should be a prominent link on every agency’s homepage. Bonus points for participating in any applicable data catalogs, like those operated by ODOT and MassDOT for agencies in their respective states.
  • No login, no clickwrap: GTFS feeds should be downloadable by any Internet user, without having to log in or accept a license agreement. This is a must-have for automating downloads of updated GTFS feeds, an essential part of any large-scale passenger information system. Don’t make it needlessly hard for developers to use your GTFS feed: if it can’t be downloaded with wget, you’re just making work for feed users. The only piece of information a developer should need to use an agency’s GTFS feed is the URL—a clean, simple URL like http://www.bart.gov/dev/schedules/google_transit.zip.
  • Support conditional HTTP GET: GTFS feeds rarely change from day to day, but it’s still important to pick up updates as soon as they’re available, and downloading a large feed (some run to 20 MB or more) every day just to check is wasteful. Feed producers should support conditional HTTP GET, using either the ETag or Last-Modified headers, so feed consumers can stay up-to-date without wasting bandwidth; a sketch of a conditional download follows this list.
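To make this concrete, here is a minimal sketch of a conditional download in Python, using the requests library and BART’s published feed URL. The caller is assumed to have saved the ETag and Last-Modified values from its previous fetch.

```python
import requests

FEED_URL = "http://www.bart.gov/dev/schedules/google_transit.zip"

def fetch_if_changed(url, etag=None, last_modified=None):
    """Download a GTFS feed only if it has changed since the last fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    response = requests.get(url, headers=headers, timeout=60)
    if response.status_code == 304:
        return None  # Not modified; the cached copy is still current.
    response.raise_for_status()

    # Save these validators and send them with the next request.
    return (response.content,
            response.headers.get("ETag"),
            response.headers.get("Last-Modified"))
```

A feed producer that supports neither header forces every consumer to re-download the whole archive on every poll; with either one, the unchanged case costs a few hundred bytes.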

Agencies may balk at some of these recommendations—”But we have to track usage of the feed! But we have to have a signed license agreement!”—but the simple fact is that there are plenty of agencies that get it right. There are plenty of agencies that use a simple, reasonable license, and plenty of agencies that host their GTFS at a stable URL that supports automated downloads. If you demand a signed license agreement, or make developers log in to access the feed, you make it harder for developers to use your data. When you make it hard for developers to use your data in their apps, you make it harder for transit riders to get service information, because many riders’ first stop when they need transit information is a third-party smartphone app.

OneBusAway might be coming to Ride On, maybe?

Today on GitHub I came across this commit. I don’t quite know what’s going on, but it sure looks to me like someone at Greenhorne & O’Mara or Ride On has been experimenting with OneBusAway and Ride On’s data.

This is something in which I am keenly interested. But unlike in other cities, here there seems to be almost no interest in connecting transit agencies with each other and with local developers. There’s great value in doing both—connecting transit agencies together helps reduce duplicated effort and provide riders with harmonized, federated services. But even more importantly, connecting transit agencies with interested developers can provide transit riders with services that might have been cost-prohibitive or otherwise infeasible for those agencies to develop in-house or through conventional procurement methods.

There are a lot of innovative developers out there, with lots of great ideas. It’s unreasonable to expect transit authorities to shoulder the risk of incubating all of those ideas, some of which might fail spectacularly, but it’s quite another thing for transit authorities to, on a best-effort basis, provide those developers with the data they need to bring their ideas to fruition.

I’d have thought that this would fall within the remit of the Mobility Lab, but more than a year after its launch, that still hasn’t happened.

So, while there’s plenty going on, there’s not a whole lot of coordination, whether between agencies or between agencies and the community. Thus we have nearly a half-dozen real-time sign projects going on in the region—and who knows how much more duplicated work is being done, with everyone toiling behind closed doors!

Contrast that to New York City, where the MTA has been working—transparently—to develop MTA Bus Time, based on OneBusAway. When the agency began work on a GTFS-realtime feed for real-time subway arrivals from the IRT ATS system, they once again turned to the community for comments on the proposed specification. MTA developers are active on the agency’s mailing list, responding to questions and bug reports from developers.

In Portland, TriMet worked with OpenPlans to develop OpenTripPlanner, transparently, in full view of the community. OpenTripPlanner has proven to be a huge success, powering a first-of-its-kind regional, intermodal trip planner.

Transit in the Washington, D.C. area isn’t all that different from transit in Portland or New York. Sure, the modes vary from city to city, and the Lexington Avenue Line by itself carries more passengers in one day than the Metrorail and Metrobus systems combined, but at its core, transit is transit. If it worked in New York, if it worked in Portland, it can work here.

This doesn’t have to be hard; in all seriousness, it takes about a minute to create a new Google Group.

When everyone works together, we can all help make transit better.

An object lesson in developer relations

Last week, I posted about some issues with Montgomery County Ride On’s new API.

A few days ago, I tried to restart my test OneBusAway instance, including the translator I’d written to generate a proper GTFS-realtime feed for Ride On. To my surprise, I found that something had changed with the Ride On API, and my translator no longer worked properly. When I investigated, I found that the API endpoint for GTFS-realtime now correctly returned the binary Protocol Buffers format, one of the issues I’d mentioned in my earlier post.

So, I decided to review the API documentation and see what had happened. Since G&O (the contractor that built the system for Montgomery County) keeps the API documentation in a MediaWiki wiki, we can look at the recent changes quite easily.

Aside from a bit of Wiki-spam, we see a flurry of activity on May 22 and 23, documenting the following changes:

  • Release of source code: The code which powers the Ride On API has been released as open source, under the GPLv3.
  • Documentation on use of access token: As documented here, the supplied access token is intended to be included in API calls as the auth_token parameter.
  • GTFS-realtime endpoint fixed: Perhaps most importantly, the GTFS-realtime endpoint now offers a choice between the binary Protocol Buffers format (the default) and the text-based debugging output. As a result, the feed can now be used out-of-the-box with GTFS-realtime tools; a sketch of what that looks like follows this list.
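For instance, a consumer could now fetch and parse the feed with the standard GTFS-realtime bindings. This is a minimal sketch, assuming the documented auth_token query parameter; the endpoint URL and token below are placeholders, not Ride On’s real values.

```python
import requests
from google.transit import gtfs_realtime_pb2  # pip install gtfs-realtime-bindings

# Placeholder endpoint and token; the real values come from the API documentation.
ENDPOINT = "http://example.com/rideon/api/gtfs_realtime"
params = {"auth_token": "YOUR_ACCESS_TOKEN"}

response = requests.get(ENDPOINT, params=params, timeout=30)
response.raise_for_status()

# With the fix, the default response is binary Protocol Buffers, so the
# standard bindings can parse it directly.
feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(response.content)

for entity in feed.entity:
    if entity.HasField("vehicle"):
        position = entity.vehicle.position
        print(entity.vehicle.trip.trip_id, position.latitude, position.longitude)
```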

I am, of course, glad to see that these changes have been made. But where was the developer outreach? I’ve written before about the vital role that effective developer outreach plays in the success of open transit data initiatives, and this is a perfect example.

A quick email to a developer mailing list outlining the changes that had been made would have kept developers in the loop. More importantly, a developer mailing list also provides an excellent feedback mechanism, allowing developers to share their observations in working with the API, and distinguish genuine problems from transient failures or problems with their own code.

For example, right now it seems like the latitude and longitude in vehicle positions are being transposed. This appears to affect both the vehicle_positions and gtfs_realtime API methods, so I suspect it’s an issue with the underlying data, rather than any particular API method, or the clients I’m using.

It’s not clear how best to report this issue, so I’m blogging about it. It’s probably a quick fix—in working with geodata I’ve certainly transposed latitude and longitude fields many times before—but it is still something that needs to be examined. Better developer outreach would make it easier for developers to report these types of issues and get updates on their resolution.
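For what it’s worth, a feed consumer can catch this class of bug with a simple sanity check against a bounding box for the service area. This is a rough sketch; the coordinates are an approximate box around Montgomery County, chosen for illustration.

```python
# Approximate bounding box for Montgomery County, MD (an assumption for illustration).
MIN_LAT, MAX_LAT = 38.9, 39.4
MIN_LON, MAX_LON = -77.6, -76.9

def looks_transposed(lat, lon):
    """Heuristic: flag a position that only makes sense with lat/lon swapped."""
    in_box = MIN_LAT <= lat <= MAX_LAT and MIN_LON <= lon <= MAX_LON
    swapped_in_box = MIN_LAT <= lon <= MAX_LAT and MIN_LON <= lat <= MAX_LON
    return not in_box and swapped_in_box

print(looks_transposed(39.1, -77.2))   # False: a plausible position
print(looks_transposed(-77.2, 39.1))   # True: latitude and longitude swapped
```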

Having said all of that, in case I’ve not adequately emphasized this point, let me be perfectly clear: in the span of a few months, Ride On has gone from providing no real-time data to providing some of the best real-time data in the region. Ride On currently has the only standards-compliant real-time feed in the region, and the only one that properly correlates static and real-time data, by using the same trip IDs across the GTFS schedule data and the GTFS-realtime feed. When it comes to providing data that can be used with existing passenger information systems, trip planners, and other tools, this is huge.

WMATA’s half-hearted open data hurts everyone

I’ve written before about WMATA’s API for train positions and API for bus route information. This time, it’s WMATA’s API for elevator and escalator status that is cause for concern. It’s good that WMATA provides this data in a machine-readable format—in fact, they’re one of only a handful of agencies to do so—but as with WMATA’s other APIs, the implementation is half-hearted at best.

Inconsistent data, the absence of a formal developer relations mechanism, and unexplained, unannounced outages are bad for everyone. They make WMATA look bad, obviously. But more importantly, they make developers look bad, and reduce the incentive for local developers to build applications using WMATA’s data. When someone finds that an app doesn’t work, or that they’re getting stale, incomplete, or inconsistent data, their first instinct is usually to blame the app or the app’s developer, not WMATA.

What’s specifically wrong with the ELES API?

  • 11-day outage, made worse by non-existent developer relations:
    From March 28 to April 9, 2012, the ELES feed returned static data. This outage was never acknowledged publicly by WMATA, in any medium.

    Because WMATA does not provide any public point of contact for developer relations, there was no way for developers to formally report the problem, nor any way for developers to get useful information like an estimated time to resolution.

    An API outage such as this may seem like the sort of thing that would only impact a handful of transit data nerds, but rest assured, there were absolutely real-world impacts: elevator-dependent Metrorail users who relied on mobile applications which used data from the API found themselves stranded at stations where the stale data led them to erroneously believe that an elevator was in service.

    While this may have been a one-time problem, the underlying issue remains: how could a critical service have gone down for 11 days with no public notice?

  • Feed missing information from the Web site:
    Like much of the information in WMATA’s open data initiative, the ELES API presents the same data as is presented on WMATA’s Web site…or at least that’s how it’s supposed to be.

    In reality, while the Web site lists “estimated return to service” dates for each elevator/escalator, that information is omitted from the API. In addition, others have observed that the API feed and Web site don’t always seem to be in sync. This could create considerable confusion for riders who sometimes check the Web site directly and sometimes use an app which gets data from the API.

  • Feed missing information necessary for maximum usefulness:
    Before presenting this point, it’s important to explain how the elevator outage information is used by elevator-dependent riders. When an elevator-dependent rider sees that there’s an elevator outage at a transfer station that will affect them, they generally avoid the outage by transferring at another station (for example, at Fort Totten rather than Gallery Place).

    But if it’s at their origin or destination station, then they can either use another nearby station (like Judiciary Square rather than Gallery Place), or they can call for a shuttle.

    Calling for a shuttle is a difficult, time-consuming process, but in many cases, especially for outlying stations, it’s a necessity.

    Neither WMATA’s Web site nor the API contain a key piece of information needed by elevator-dependent riders: where to go to get a shuttle—which station, which exit at that station, etc. This information is displayed on the PIDS, but is simply not available on the Web in any format.

  • No master list of units:
    As I explained when I wrote about WMATA’s performance monitoring program, including the agency’s Vital Signs Report, only summary statistics are available for WMATA’s elevators and escalators. Want to know which specific units have the best or worst track records? Want to know if a major overhaul has improved a unit’s availability? Want to know how the units at transfer stations hold up, compared to their peers at less-trafficked stations? You can’t, at least not with the data in the Vital Signs Report.

    But that doesn’t mean it’s absolutely impossible to compute those statistics; it just takes more work. You can forget about getting historical data, but if you’re willing to archive data from the ELES API yourself, you can create your own statistics. Store each poll in a database, and over time you’ll build up a record of which units were out of service, and when. Transfer the result into an OLAP cube, and you can slice and dice to your heart’s content. Want a report on units at transfer stations? Done. Want stats on outages specifically at peak hours? Done. Want a report just on your home station? Done. (A sketch of this archiving approach appears after this list.)

    There’s only one piece missing: a list of all elevators and escalators in the Metrorail system. Why is this necessary? In order to compute statistics with the outage data, we have to know how many units there are—in statistical terms, the universe. Of course, we can find out from WMATA’s Web site that there are a total of 588 escalators and 239 elevators, but that’s only good enough for computing the same system-wide metric that the Vital Signs Report provides. Any more detailed analysis (at a per-station level, a per-line level, or any of the examples given above) requires knowing not just how many units there are, but the IDs of those units and their locations, so statistics can be computed on a per-station, or even per-unit, level.

    If WMATA had made a real commitment to transparency and open data, and if there were a developer liaison appointed, I’d imagine it might take a day or two to get such a master list of units made available as a CSV or XML file; somewhere in the 100 TB of data managed by WMATA, there must be a list of these 827 units.

    But there isn’t even anyone to ask for the data. And, to make matters worse, every such request is treated with suspicion and mistrust. There’s no sense of developers working cooperatively with WMATA; it is, from the outset, combative. Yes, some of these data will make WMATA look bad, but some will make the agency look good—especially when it can be shown that a major overhaul, such as is taking place now at Dupont Circle and will soon take place at Bethesda, improves the reliability of the overhauled units. Besides, transparency isn’t about releasing the data that make you look good, it’s about releasing data, period.
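As a sketch of the archiving approach described above, the polling loop below records each outage observation in SQLite. The endpoint URL and field names are placeholders; WMATA’s actual response format would dictate the real ones.

```python
import sqlite3
import time

import requests

# Placeholder URL and field names, for illustration only.
ELES_URL = "http://api.example.com/wmata/eles_incidents.json"

db = sqlite3.connect("eles_history.db")
db.execute("""CREATE TABLE IF NOT EXISTS outages (
    observed_at INTEGER,   -- Unix timestamp of the poll
    unit_id     TEXT,      -- elevator/escalator unit identifier
    station     TEXT,
    unit_type   TEXT       -- 'ELEVATOR' or 'ESCALATOR'
)""")

while True:
    incidents = requests.get(ELES_URL, timeout=30).json()
    observed_at = int(time.time())
    for unit in incidents:
        db.execute("INSERT INTO outages VALUES (?, ?, ?, ?)",
                   (observed_at, unit["UnitId"], unit["Station"], unit["UnitType"]))
    db.commit()
    time.sleep(15 * 60)  # poll every 15 minutes
```

Note, though, how far this gets you without the master list: availability for a station is in-service unit-hours divided by total unit-hours, and while the outage archive gives you the first number, the second depends on knowing every unit at the station, including the ones that never appear in the outage feed.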

What’s the point of all this, then? When General Manager Sarles says that he “[doesn’t] want to hide problems”, or that the Metro Forward campaign is making tangible improvements for riders, I expect to see data to back up those assertions.

When elevator-dependent riders have to cope with yet another outage, I don’t want them to find out for the first time when they get to their destination and the only notice they have is a cone in front of the elevator door. I want timely (and, more importantly, meaningful) information to be available in a wide variety of formats, including a high-quality API that encourages app developers to build tools that further increase the accessibility and further widen the dissemination of that information.

Why do I expect these things? I expect these things because Metrorail is supposed to be “America’s Subway”, a world-class system at the forefront of technological innovation and operational excellence. Right now, it is neither of those things. Instead, it is a system where riders climb up and down stopped escalators in dimly-lit stations and hope that their train does not pass over another poorly-maintained track circuit that fails to detect that it has become occupied, engendering yet another fatal collision. It is a system where secrecy and the maintenance of fiefdoms are the norm, not transparency and cooperation for the good of the riding public.

I don’t claim that open data (and better still, open data that is timely and meaningful) will solve all of those problems, but it is a small step forward, and a step that WMATA could easily take using its existing infrastructure.

To be successful, open data must be quality data

The listing above is the output of WMATA’s API method for getting a list of bus routes. Over a year ago, a developer reported corruption of the data. The corruption is suggestive of the data having been transferred by way of Microsoft Excel, which is something of a frightening prospect in its own right (Excel is no substitute for a real database, and is well known for corrupting data); worse, more than a year after the report was made, the problem remains unresolved.

In fact, the problems with the list of bus routes returned are so severe as to make the data unusable without an inordinate amount of post-processing.
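As an illustration of the kind of post-processing this forces on developers, here is a hedged sketch of a validator that flags route IDs bearing the classic signs of a round trip through a spreadsheet. The patterns are assumptions about what such corruption typically looks like, not a catalog of WMATA’s actual errors.

```python
import re

# Classic spreadsheet artifacts (assumed for illustration): an identifier
# reinterpreted as a date and rendered like "4-Mar", or a long numeric ID
# rendered in scientific notation like "9.00E+02".
DATE_LIKE = re.compile(r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$")
SCI_NOTATION = re.compile(r"^\d+(\.\d+)?E\+\d+$", re.IGNORECASE)

def suspicious_route_ids(route_ids):
    """Return the route IDs that look like they were mangled by a spreadsheet."""
    return [r for r in route_ids
            if DATE_LIKE.match(r) or SCI_NOTATION.match(r)]

print(suspicious_route_ids(["70", "B30", "4-Mar", "9.00E+02"]))
# -> ['4-Mar', '9.00E+02']
```

No consumer of an agency’s data should have to write code like this; the point of open data is lost if every developer must first reverse-engineer the damage.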