To be successful, open data must be quality data

The listing above is the output of WMATA’s API method for getting a list of bus routes. Over a year ago, a developer reported that the data was corrupted. The corruption suggests that the data passed through Microsoft Excel, which is a frightening prospect in its own right (Excel is no substitute for a real database, and is well known for corrupting data). Worse, more than a year after the report was made, the problem remains unresolved.

In fact, the problems with the list of bus routes returned are so severe as to make the data unusable without an inordinate amount of post-processing.

As an example, here’s how Route 64 is returned in the data from WMATA:

    {
      "Name": "64 - 64 FT TOTT-PETWORTH  (43)", 
      "RouteID": "64"
    }, 
    {
      "Name": "64 - 64 FT TOTT-PETWORTH  (43)", 
      "RouteID": "64v1"
    }

Here’s what I’d expect to see from a more developer-friendly API:

    {
      "Name": "Fort Totten-Petworth Line",
      "RouteID": "64",
      "RouteIDAlternate": ["64v1"]
    }

What are the differences?

  • Route names formatted for display: In the current API output, route names are formatted in all-uppercase, heavily abbreviated, and sometimes truncated. Simply put, that’s unacceptable from a usability perspective. Route names should reflect the names used in the list of timetables and on the timetable covers themselves, including mixed case and formatting. Additionally, there’s no need to include the route ID in the route name (twice, in fact!), when that information is readily available in the RouteID field.
  • Internal identifiers removed: Except where they’ve been truncated off, the route names in the API contain internal identifiers in parentheses, which are of no use to developers, and should never have been included in public data.
  • Alternate routes clearly identified: The WMATA API identifies “alternate routes” for many bus lines. These alternates are given identifiers like 7Yc and 7Yv1. Rather than listing each alternate as its own route, they should be listed as attributes of the parent route. This, too, is important for usability. Developers shouldn’t be displaying the alternate routes to riders, because they’re not shown on bus headsigns. The alternate routes are only used when submitting queries to the API, and the design of the API should reflect this fact.
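
Until the API changes, developers are left doing something like the following on the client side. This is only a rough Python sketch of that post-processing, run against the two Route 64 records shown earlier. The function names and both regular expressions are my own assumptions: I’m guessing that an alternate ID is always the parent ID followed by a lowercase suffix (v1, c, and so on), and that an internal identifier always appears as a trailing number in parentheses. Neither guess is documented anywhere.

```python
import re

# The two Route 64 records shown earlier, as Python dictionaries.
raw_routes = [
    {"Name": "64 - 64 FT TOTT-PETWORTH  (43)", "RouteID": "64"},
    {"Name": "64 - 64 FT TOTT-PETWORTH  (43)", "RouteID": "64v1"},
]

# Assumption: a route ID is digits and uppercase letters, and an alternate
# adds a suffix that starts with a lowercase letter ("v1", "c", ...).
ID_PATTERN = re.compile(r"^(?P<parent>[0-9A-Z]+)(?P<suffix>[a-z]\w*)?$")


def clean_name(name, route_id):
    """Strip the trailing internal identifier and the repeated route ID."""
    # Assumption: internal identifiers look like a trailing "(43)".
    name = re.sub(r"\s*\(\d+\)\s*$", "", name)
    # Remove leading repeats of the route ID, e.g. "64 - 64 ".
    prefix = re.compile(r"^(?:%s\b\s*-?\s*)+" % re.escape(route_id))
    name = prefix.sub("", name)
    # Collapse any leftover runs of whitespace.
    return " ".join(name.split())


def group_routes(routes):
    """Fold alternates into their parent route, keeping input order."""
    grouped = {}
    for route in routes:
        route_id = route["RouteID"]
        match = ID_PATTERN.match(route_id)
        parent = match.group("parent") if match else route_id
        entry = grouped.setdefault(
            parent,
            {"Name": None, "RouteID": parent, "RouteIDAlternate": []},
        )
        if entry["Name"] is None:
            entry["Name"] = clean_name(route["Name"], parent)
        if route_id != parent:
            entry["RouteIDAlternate"].append(route_id)
    return list(grouped.values())


routes = group_routes(raw_routes)
```

For the sample input this yields a single entry for Route 64, with 64v1 listed as an alternate. The name is still the abbreviated FT TOTT-PETWORTH, though. No amount of client-side cleanup can recover a proper display name that was never in the data, which is the part only WMATA can fix.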

There are other glaring deficiencies in WMATA’s open data efforts, including the complete separation of the GTFS feed and real-time API. Not only do the two data sources sometimes return different data, there’s no way to go between them—the stop IDs used in the real-time API (which correspond to the stop IDs physically posted on stops) bear no relation to the stop IDs used in the GTFS feed. Populating the stop_code field in the GTFS feed, and ensuring that the feed and the API return consistent data, would make developers’ lives considerably easier. There’s also no way to get a list of bus stops for each direction served by a route—while you can get all of the stops served by a route, there’s no distinction between directions.
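
To show how little it would take on the developer’s side if stop_code were populated, here is a minimal Python sketch of the lookup. The stops.txt sample is entirely invented (IDs, names, and coordinates alike), and it assumes that stop_code would carry the same ID that is posted at the stop and used by the real-time API. The function names are mine.

```python
import csv
import io

# Invented stops.txt sample in which stop_code has been populated with the
# ID posted at the stop. Every value here is made up for illustration.
STOPS_TXT = """stop_id,stop_code,stop_name,stop_lat,stop_lon
5012,1001195,EXAMPLE ST + 1ST ST,38.9000,-77.0000
5013,1001200,EXAMPLE ST + 2ND ST,38.9010,-77.0000
6044,,STOP WITHOUT A POSTED ID,38.9020,-77.0000
"""


def index_by_stop_code(stops_file):
    """Map each non-empty stop_code to its GTFS stop_id."""
    index = {}
    for row in csv.DictReader(stops_file):
        code = (row.get("stop_code") or "").strip()
        if code:
            index[code] = row["stop_id"]
    return index


stop_code_to_gtfs_id = index_by_stop_code(io.StringIO(STOPS_TXT))


def gtfs_stop_id(realtime_stop_id):
    """Translate a real-time API stop ID to a GTFS stop_id, or None."""
    return stop_code_to_gtfs_id.get(realtime_stop_id)
```

With a real feed, the same function would take the open stops.txt file in place of the in-memory sample. Today, every lookup against WMATA’s feed would come back empty.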

WMATA’s GTFS feed has long-standing problems, too, and these problems are even reflected in Google Transit. Who would think that DCDGR or DC98 are Circulator routes? The feed includes shapes, but, as the Transit Near Me team discovered, some of the shapes are just wrong. I ran into similar problems when I attempted to build a transit data bundle for use with OneBusAway from the WMATA GTFS feed. In addition, route information in the GTFS feed is fairly sparse—route_short_name should continue to contain the route number (like 5A, 42, or 16B) as it does now, but the route_long_name should also be populated with the name of the route. At present, route names are only available through the API, and, as described above, the data quality is generally poor.
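
A check for the missing route names is easy to sketch. The routes.txt sample below is invented to illustrate the two cases (only the Route 64 name comes from the example earlier in this post), and the function name is my own. It simply lists the routes whose route_long_name is empty.

```python
import csv
import io

# Invented routes.txt sample: one route with route_long_name populated and
# two without. A route_type of 3 means bus in GTFS.
ROUTES_TXT = """route_id,route_short_name,route_long_name,route_type
64,64,Fort Totten-Petworth Line,3
5A,5A,,3
DC98,DC98,,3
"""


def routes_missing_long_name(routes_file):
    """Return the short names of routes with an empty route_long_name."""
    return [
        row["route_short_name"]
        for row in csv.DictReader(routes_file)
        if not (row.get("route_long_name") or "").strip()
    ]


missing = routes_missing_long_name(io.StringIO(ROUTES_TXT))
```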

What’s really frustrating about this is the fact that there’s just no good way to communicate these types of issues back to WMATA—no dedicated point of contact, no developer mailing list, no developer meetups, nothing.

I have said this before (in fact, just last Friday), and I will say it again: the developer mailing list operated by the New York MTA is an excellent example of how transit agencies (or any organization with open data) should communicate with developers. The MTA’s Aaron Donovan and technical staff from the MTA’s constituent agencies make a real effort to participate on the list, respond to inquiries from developers, and work to improve the quality of their data. Sometimes the data developers need simply isn’t there (detailed GTFS for MTA Bus Company routes, for example), but even then they are frank about what is and isn’t available, why the data isn’t available, and when (or if) the situation can be expected to change.

By contrast, the only official response I’ve gotten from WMATA about data issues has been deafening silence. Unofficially, I’ve heard bits and pieces from different people, and I was even able to get one data quality issue addressed. But that’s no way to run things. One of the core concepts behind open data for transit agencies is that open data indirectly encourages transit use, by giving developers the tools necessary to build apps which improve the transit-riding experience. But how can developers do that if the data they need isn’t there, or isn’t of sufficient quality? And more to the point, what does it say about an agency’s commitment to open data, to transparency, and ultimately, to its riders, if they won’t work to address data quality issues?

One thought on “To be successful, open data must be quality data”

  1. I’ve seen plenty of issues with WMATA’s GTFS, including the malformed shapes and the inclusion of the DC Circulator data. To be honest, I think they should be doing their own feed instead of including it in WMATA’s scheduling database. I think they were doing that at first, but abandoned it. Also, it’s worth noting that REX doesn’t have a short name, and one should be included, since that’s a public-facing route. In WMATA’s API it shows up as the R99, and nobody outside of those who know WMATA’s operations really knows what the R99 is. Plus, there are countless other “routes” included as well, reflecting the OrbCAD feed.

    Also, about the route variants: they should be included as child elements of the parent route instead of being mislabeled as routes themselves. That would make it much easier to build applications, instead of having to include route variants when making a call.

    I’ve also found that the trip IDs used in the API have no general use, to be honest. There is nothing to check them against, whether GTFS data or Nextbus arrivals. The Nextbus arrivals use a different trip ID for the trips, which might reflect a different schedule export that WMATA uses. When using the API for MBTA and LA, you can match the trip IDs, stop numbers, and block numbers used in the predictions with the GTFS data to provide some background for the trip, and it works well for GTFS-realtime for MBTA. For that to work for Metro, they would need to revamp everything to provide better data, with quality control and quality assurance for developers who use the data, because the data is only as good as the provider. And yes, WMATA needs to have some sort of contact who can work with the developer community and provide feedback to developers about fixes, changes, and other things regarding the data that WMATA provides to the community. Twitter and Facebook are a start, but they probably did that just to keep up with the current trend of social media.
