Passively open, actively closed

What do you think of when you hear “open data”? Do you think of hackathons, APIs, data catalogs, or perhaps partnerships with Socrata or Mashery? Do you think of clean data in well-defined formats with ample developer documentation?

Not all open data looks like that. Take Amtrak’s new “interactive train locator map”, for example. You might not know it, but that map is powered by a public dataset stored in Google Maps Engine. As Google’s documentation explains:

There’s an ever-growing number of public datasets available in Google Maps Engine for use by developers in their map or data visualization applications. You may retrieve this data with a simple HTTP request; no authorization is required, and authentication is accomplished through the use of an APIs Console key.

These data, then, are passively open. They are, on a technical level, available for creative reuse, innovation, and incorporation into new transformative projects. But there’s no fancy developer portal, no hackathon, no documentation. The openness of the dataset is more a side effect of having elected to host it in Google Maps Engine than a conscious decision. Once you get the map data, it’s up to you to figure out how to use it—and as for a developer community, well, you’re on your own. It’s not the end of the world, though—in the case of this dataset, it’s mostly self-documenting, and it’s not too hard to build transformative applications with the data.
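As a rough sketch of what “getting the map data” involves: the request is a single HTTP GET with an API key. The endpoint shape below follows Google’s Maps Engine documentation of the era; the table ID and key are placeholders, not real values:

```python
import json
import urllib.request

MAPS_ENGINE_BASE = "https://www.googleapis.com/mapsengine/v1"

def features_url(table_id: str, api_key: str, max_results: int = 50) -> str:
    """Build the URL for a public Maps Engine table's features."""
    return (f"{MAPS_ENGINE_BASE}/tables/{table_id}/features"
            f"?key={api_key}&maxResults={max_results}")

def fetch_features(table_id: str, api_key: str) -> dict:
    """Retrieve features from a public table; no OAuth dance, just a key."""
    with urllib.request.urlopen(features_url(table_id, api_key)) as resp:
        return json.load(resp)
```

That’s the whole “developer portal”: build a URL, issue a GET, parse the JSON. Everything after that is up to you.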

Unfortunately, sometimes datasets which could easily be treated as passively open are instead made actively closed. Take, for example, GO Transit’s GO Tracker application. The Web application is powered by an XML data feed containing the real-time train data, which would make a great example of a passively open dataset. Instead, it is actively closed to innovation, development, and creative reuse. Try accessing the underlying XML feed outside of the GO Tracker application, and you’ll see that they employ technical measures to control access to the feed. While you could spoof the necessary HTTP headers to gain access, that’s not the sort of thing that comports with open data.

Open data doesn’t necessarily require any special effort. Where there are already APIs and data feeds powering Web applications, all that is required is to allow outside developers to access those same resources. In fact, as in the case of GO Transit, it often takes more effort to shut out developers, building access controls around what would otherwise be easily-reusable open data.

Open standards are a force multiplier for civic software

In software engineering, software modularity and reusability are considered best practices. Unfortunately, in the civic software world, these principles are often ignored, because governments and public bodies fail to use open standards and interfaces for their data.

When governments and other public bodies adopt open standards, everyone wins. Consider, for example, the Open311 standard. When a city implements an Open311 endpoint, its citizens suddenly have the option of using any software which has been developed to support the Open311 standard. There’s no need for civic hackers in Chicago to develop one iPhone app, only for another group of civic hackers in New York to implement a substantially similar app, just because the two cities use different APIs.

For those developers, the use of open standards acts as a force multiplier: they don’t have to know anything about the cities where their apps are being used, because every city adheres to the same standard. Civic software is, for the most part, the domain of non-profits and individuals working on their own time. Development resources are far from unlimited, and, simply put, we have neither the time nor the money to spend on niche applications which only have use in a limited geographic area, or with one government’s proprietary API.
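That force multiplier is visible in code: one Open311 GeoReport v2 client works against any conforming city, with only the base URL changing. A minimal sketch (the endpoint URLs below are invented placeholders, not real city endpoints):

```python
import urllib.parse

# Hypothetical endpoint base URLs; each city hosts its own, but the API
# shape is identical. These are invented examples, not real addresses.
ENDPOINTS = {
    "chicago": "https://311api.cityofchicago.example/open311/v2",
    "new_york": "https://311.nyc.example/open311/v2",
}

def requests_url(city: str, service_code: str = "") -> str:
    """Build a GeoReport v2 service-requests query for any Open311 city."""
    base = ENDPOINTS[city]
    if service_code:
        return f"{base}/requests.json?" + urllib.parse.urlencode(
            {"service_code": service_code})
    return f"{base}/requests.json"
```

The same function, and therefore the same app, serves every city in the table; adding a city means adding one dictionary entry, not writing a new client.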

Closer to home, I’ve watched over the past few years as developers have expended needless effort building transit apps for the Washington, D.C. metropolitan area, simply to accommodate local transit authorities’ refusal to publish clean, high-quality data in standard formats.

Arlington County’s Mobility Lab Transit Tech initiative has developed two applications which rely on data from transit authorities. One is a package for driving real-time transit signs, and the other is Transit Near Me, a mobile webapp for mapping transit options.

I want to emphasize that I don’t mean to minimize the work of the Mobility Lab developers—in the end, they did what they needed to in order to be able to ship a working product, given the data they had access to.

Having said that, though, these applications are not all that different from similar transit apps which have already been built. The Mobility Lab’s real-time sign, for example, is (in terms of basic design concepts) not all that different from the OneBusAway sign mode.

Granted, the Mobility Lab’s real-time sign looks more polished, and includes support for transit modes like bike sharing. But imagine if, instead of building a new piece of software from the ground up, the Mobility Lab developers had worked to polish the OneBusAway sign mode and add support for other transit modes.

Had they done so, every city which uses OneBusAway would have been able to benefit immediately from the improvements.

But, there’s a problem. OneBusAway consumes real-time transit information in the GTFS-realtime and SIRI VM formats. Out of the agencies in the region, only Montgomery County Ride On and VRE provide GTFS-realtime data. The other agencies which provide real-time data use proprietary formats which are incompatible with GTFS-realtime. Without detouring too deeply into technical territory, WMATA’s proprietary API actually provides all of the information that would be necessary to construct a GTFS-realtime feed for Metrobus, were it not for the fact that the API uses route, stop, and trip identifiers which are completely different from those in the static GTFS schedule.
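To make the technical point concrete: constructing a GTFS-realtime feed from WMATA’s API would require a hand-maintained crosswalk between the two sets of identifiers. A minimal sketch of the idea (the identifiers are invented for illustration, not real WMATA or GTFS values):

```python
# Hypothetical crosswalk from proprietary API stop IDs to GTFS stop_ids.
# In practice, this table would have to be built and maintained by hand,
# which is exactly the needless work a consistent feed would avoid.
API_TO_GTFS_STOP = {
    "1001234": "STOP_0042",
    "1001235": "STOP_0043",
}

def to_gtfs_stop(api_stop_id: str) -> str:
    """Translate a proprietary stop ID to its GTFS equivalent, if known."""
    if api_stop_id not in API_TO_GTFS_STOP:
        raise KeyError(f"no GTFS mapping for API stop {api_stop_id!r}")
    return API_TO_GTFS_STOP[api_stop_id]
```

Multiply that table by every route, stop, and trip in the system, and keep it synchronized across schedule changes, and you have a sense of the burden the inconsistent identifiers impose.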

The same goes for Transit Near Me; it is, in essence, a mobile version of the OpenTripPlanner system map. Cities around the world have adopted OpenTripPlanner; wouldn’t they also benefit from an interactive system map optimized for mobile devices?

OpenTripPlanner is designed to consume clean, well-constructed GTFS feeds; Transit Near Me, by contrast, must include various workarounds for idiosyncrasies in WMATA’s data: bad shapes which must be replaced with data from shapefiles, stop IDs which are only available in the API and not the GTFS feed, and so on.

I should emphasize again that this isn’t just about the Mobility Lab; their work happens to highlight the problem particularly well, but they’re not the only developers to get caught up in this maelstrom.

What’s the solution? Civic hackers need to stand together with each other, and stand up for good software engineering principles. I doubt that any one developer alone will be able to convince WMATA to get their data in order (goodness knows I’ve tried). But if we stand together and recognize that reinventing the wheel over and over again is not a productive use of our time, we may be able to convince data providers to embrace open standards. When we do, it will have benefits not just locally, but for people around the world who benefit from the work of civic hackers.

Statement to the June 2012 WMATA RAC meeting

Earlier this evening, I delivered the following statement to the WMATA Riders’ Advisory Council during the public comment period of its June 2012 meeting:

As I hope you are aware, last week the NTSB released three reports into incidents on the Metrorail system, one of which resulted in the deaths of track workers Jeff Garrard and Sung Oh. I would hope that the RAC would request a presentation from WMATA on steps taken to improve safety in the wake of these accidents. Even more importantly, though, I would hope that the RAC would ask WMATA to publicly release documents to permit independent verification and oversight of the Authority’s claims on safety.

I understand that the RAC is not accustomed to conducting investigations, but for Jeff Garrard, Sung Oh, their colleagues, and the 1.5 million individuals who use Metro every day, I ask you to consider the vital importance of holding WMATA accountable on such a serious issue as safety.

For reference, the NTSB reports mentioned above are the following:

  • RAB1205 (derailment in Farragut North pocket track)
  • RAB1204 (rear-end collision in West Falls Church yard)
  • RAR1204 (two track workers struck and killed by hi-rail vehicle outside Rockville)

WMATA’s half-hearted open data hurts everyone

I’ve written before about WMATA’s API for train positions and API for bus route information. This time, it’s WMATA’s API for elevator and escalator status that is cause for concern. It’s good that WMATA provides this data in a machine-readable format—in fact, they’re one of only a handful of agencies to do so—but as with WMATA’s other APIs, the implementation is half-hearted at best.

Inconsistent data, the absence of a formal developer relations mechanism, and unexplained, unannounced outages are bad for everyone. They make WMATA look bad, obviously. But more importantly, they make developers look bad, and reduce the incentive for local developers to build applications using WMATA’s data. When someone finds that an app doesn’t work, or that they’re getting stale, incomplete, or inconsistent data, their first instinct is usually to blame the app or the app’s developer, not WMATA.

What’s specifically wrong with the ELES API?

  • 11-day outage, made worse by non-existent developer relations:
    From March 28 to April 9, 2012, the ELES feed returned static data. This outage was never acknowledged publicly by WMATA, in any medium.

    Because WMATA does not provide any public point of contact for developer relations, there was no way for developers to formally report the problem, nor any way for developers to get useful information like an estimated time to resolution.

    An API outage such as this may seem like the sort of thing that would only impact a handful of transit data nerds, but rest assured, there were absolutely real-world impacts: Elevator-dependent Metrorail users who relied on mobile applications which used data from the API found themselves trapped at stations where the stale data led them to erroneously believe that an elevator was in service.

    While this may have been a one-time problem, the underlying issue remains: how could a critical service have gone down for 11 days with no public notice?

  • Feed missing information from the Web site:
    Like much of the information in WMATA’s open data initiative, the ELES API presents the same data as is presented on WMATA’s Web site…or at least that’s how it’s supposed to be.

    In reality, while the Web site lists “estimated return to service” dates for each elevator/escalator, that information is omitted from the API. In addition, others have observed that the API feed and Web site don’t always seem to be in sync. This could create considerable confusion for riders who sometimes check the Web site directly and sometimes use an app which gets data from the API.

  • Feed missing information necessary for maximum usefulness:
    Before presenting this point, it’s important to explain how the elevator outage information is used by elevator-dependent riders. When an elevator-dependent rider sees that there’s an elevator outage at a transfer station that will affect them, they generally avoid the outage by transferring at another station (for example, at Fort Totten rather than Gallery Place).

    But if it’s at their origin or destination station, then they can either use another nearby station (like Judiciary Square rather than Gallery Place), or they can call for a shuttle.

    Calling for a shuttle is a difficult, time-consuming process, but in many cases, especially for outlying stations, it’s a necessity.

    Neither WMATA’s Web site nor the API contains a key piece of information needed by elevator-dependent riders: where to go to get a shuttle—which station, which exit at that station, etc. This information is displayed on the PIDS, but is simply not available on the Web in any format.

  • No master list of units:
    As I explained when I wrote about WMATA’s performance monitoring program, including the agency’s Vital Signs Report, only summary statistics are available for WMATA’s elevators and escalators. Want to know which specific units have the best or worst track records? Want to know if a major overhaul has improved a unit’s availability? Want to know how the units at transfer stations hold up, compared to their peers at less-trafficked stations? You can’t, at least not with the data in the Vital Signs Report.

    But that doesn’t mean it’s impossible to compute those statistics; it just takes more work. You can forget about getting historical data, but if you’re willing to archive data from the ELES API going forward, you can create your own statistics. Store each response in a database, and over time you’ll build up a record of which units were out of service, and when. Transfer the result into an OLAP cube, and you can slice and dice to your heart’s content. Want a report on units at transfer stations? Done. Want stats on outages specifically at peak hours? Done. Want a report just on your home station? Done.

    There’s only one piece missing: a list of all elevators and escalators in the Metrorail system. Why is this necessary? In order to compute statistics with the outage data, we have to know how many units there are—in statistical terms, the universe. Of course, we can find out from WMATA’s Web site that there are a total of 588 escalators and 239 elevators, but that’s only good enough for computing the same system-wide metric that the Vital Signs Report provides. Any more detailed analysis (at a per-station level, at a per-line level, or along any of the dimensions given above) requires knowing not just how many units there are, but the IDs of those units and their locations, so statistics can be computed on a per-station, or even per-unit, level.

    If WMATA had made a real commitment to transparency and open data, and if there were a developer liaison appointed, I’d imagine it might take a day or two to get such a master list of units made available as a CSV or XML file. Surely, somewhere in the 100 TB of data managed by WMATA, there must be a list of these 827 units.

    But there isn’t even anyone to ask for the data. And, to make matters worse, every such request is treated with suspicion and mistrust. There’s no sense of developers working cooperatively with WMATA; it is, from the outset, combative. Yes, some of these data will make WMATA look bad, but some will make the agency look good—especially when it can be shown that a major overhaul, such as is taking place now at Dupont Circle and will soon take place at Bethesda, improves the reliability of the overhauled units. Besides, transparency isn’t about releasing the data that make you look good, it’s about releasing data, period.
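The archiving approach described above can be sketched in a few lines with SQLite. The field names are assumptions about the feed’s shape, not the actual ELES API schema:

```python
import sqlite3

def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the outage archive database."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS outages (
        sampled_at TEXT, unit_id TEXT, station TEXT, unit_type TEXT)""")
    return conn

def record_snapshot(conn, sampled_at, out_of_service_units):
    """Store one poll of the ELES feed: the units reported out of service."""
    conn.executemany(
        "INSERT INTO outages VALUES (?, ?, ?, ?)",
        [(sampled_at, u["unit_id"], u["station"], u["unit_type"])
         for u in out_of_service_units])
    conn.commit()

def outage_records_by_station(conn):
    """One slice of the OLAP cube: how many outage records per station."""
    return dict(conn.execute(
        "SELECT station, COUNT(*) FROM outages GROUP BY station"))
```

Poll the API on a schedule, call `record_snapshot` with each response, and the per-station, per-unit, and per-time-of-day reports all reduce to `GROUP BY` queries over the accumulated table—provided, again, that a master list of units exists to serve as the denominator.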

What’s the point of all this, then? When General Manager Sarles says that he “[doesn’t] want to hide problems”, or that the Metro Forward campaign is making tangible improvements for riders, I expect to see data to back up those assertions.

When elevator-dependent riders have to cope with yet another outage, I don’t want them to find out for the first time when they get to their destination and the only notice they have is a cone in front of the elevator door. I want there to be timely (and, more importantly, meaningful) information available, in a wide variety of formats, including a high-quality API that encourages app developers to build tools that further increase the accessibility and further widen the dissemination of that information.

Why do I expect these things? I expect them because Metrorail is supposed to be “America’s Subway”, a world-class system at the forefront of technological innovation and operational excellence. Right now, it is neither of those things. Instead, it is a system where riders climb up and down stopped escalators in dimly-lit stations and hope that their train does not pass over another poorly-maintained track circuit that fails to detect it has become occupied, causing yet another fatal collision. It is a system where secrecy and the maintenance of fiefdoms are the norm, not transparency and cooperation for the good of the riding public.

I don’t claim that open data (and better still, open data that is timely and meaningful) will solve all of those problems, but it is a small step forward, and a step that WMATA could easily take using its existing infrastructure.

Reconstructing train positions from prediction data

Recently, I’ve been investigating techniques for independently gathering data in order to be able to analyze performance on the Metrorail system. As I’ve previously lamented, the agency releases only summary performance statistics, which makes it impossible to conduct more detailed analyses. Therefore, we must begin with data collection. If WMATA made all of the data captured by AIM available to developers, this would be a much easier task. But, as I’ve noted, only train predictions are released, obscuring the actual number of trains in the system and their positions.

So, we must first sample the prediction data. We know that the predictions are updated by AIM roughly every 20 seconds. It is not known how much delay Mashery introduces, so for simplicity we will just assume that new predictions are made available every 20 seconds. Application of the Shannon-Nyquist sampling theorem therefore tells us that we must sample the data every 10 seconds.

Don’t trust Claude Shannon? Here’s an example to illustrate why we have to sample so frequently:

Suppose that we’re polling the PIDS at Metro Center once per minute. In the peaks, sometimes the interval between trains is less than 60 seconds. So, at T=0, we might sample the PIDS and find an 8-car train to Glenmont boarding. If we sample again at T=60, and once again we see that an 8-car train to Glenmont is boarding, has one train serviced the platform, or two?

We might be able to say with some certainty that two distinct trains had serviced the platform if the observed trains were on different lines, or travelling to different destinations, or if they were different lengths. But if all of the observed characteristics are identical, then we have no way to tell if we saw one train or two, unless we were to have observed, in between the two trains, that the platform was empty (that is, that no train was boarding).
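That counting rule can be written down directly: treat consecutive identical observations as the same train, and count a new train only when the observed characteristics change, or the platform has been seen empty in between. A sketch, in which each sample is a train description or `None` for an empty platform:

```python
def count_trains(samples):
    """Conservative lower bound on distinct trains seen at a platform.

    Consecutive identical observations are assumed to be one train; an
    intervening empty-platform sample (None) resets the comparison. With
    too coarse a sampling interval, this necessarily undercounts.
    """
    count = 0
    prev = None
    for sample in samples:
        if sample is not None and sample != prev:
            count += 1
        prev = sample
    return count
```

Two back-to-back observations of “8-car train to Glenmont” count as one train; only an empty-platform sample between them lets us count two. That is precisely the ambiguity that sampling more frequently resolves.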

Once we accept the need to sample at a particular rate in order to avoid missing a train, how often do we sample the predictions? This is where Claude Shannon comes in. As previously introduced, the sampling theorem states that:

If a function f(t) contains no frequencies higher than W cps, it is completely determined by giving its ordinates at a series of points spaced 1/(2W) seconds apart.

The PIDS update every 20 seconds, or at a rate of 0.05 Hz. Accordingly, we must sample the predictions every 10 seconds. But then what? We’ll have a database of predictions; the sampling rate ensures that we will not miss any. But how do we go from predictions to trains? This remains an open question for me.
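The arithmetic, spelled out (a sketch; the numbers are the ones given above, and `fetch_predictions` is a placeholder for an actual API call):

```python
import time

def required_sample_interval(update_interval_s: float) -> float:
    """Nyquist criterion: sample at twice the source's update rate."""
    update_rate_hz = 1.0 / update_interval_s   # 20 s  -> 0.05 Hz
    sample_rate_hz = 2.0 * update_rate_hz      #       -> 0.1 Hz
    return 1.0 / sample_rate_hz                #       -> 10 s

def poll(fetch_predictions, update_interval_s=20.0, n_samples=3):
    """Poll a prediction source at the Nyquist-dictated interval."""
    interval = required_sample_interval(update_interval_s)
    samples = []
    for _ in range(n_samples):
        samples.append(fetch_predictions())
        time.sleep(interval)
    return samples
```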

Obviously, any time we have a prediction indicating that a train is boarding, we know that there is a train physically at the platform. That’s the only time we don’t have to guess. In all other cases, we have to start guessing. One of the more substantial problems is that it’s hard to figure out where a train is physically, given its arrival time to a station. The WMATA GTFS feed can be used to find the average travel time between two adjacent stations, and the WMATA API can be used to get the distance between those stations. Using that data, you can estimate how many feet away from the station a train is, given the arrival time. But it’s only an estimate, and almost certainly a bad one.
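That estimation procedure, written out as a sketch (it assumes, almost certainly wrongly, that trains travel at constant speed between adjacent stations):

```python
def estimate_distance_from_station(eta_s: float,
                                   avg_travel_s: float,
                                   link_distance_ft: float) -> float:
    """Estimate how far (in feet) a train is from the next station.

    eta_s:            predicted seconds until arrival (from the predictions)
    avg_travel_s:     average travel time over the link (from the GTFS feed)
    link_distance_ft: distance between adjacent stations (from the WMATA API)

    Assumes constant speed over the link, so the fraction of travel time
    remaining equals the fraction of distance remaining.
    """
    fraction_remaining = min(eta_s / avg_travel_s, 1.0)
    return fraction_remaining * link_distance_ft
```

If the average travel time over a mile-long link is 120 seconds and a train is predicted to arrive in 60 seconds, the estimate puts it half a mile out. Dwell times, speed restrictions, and holds all violate the constant-speed assumption, which is why this is a bad estimate.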

Have I mentioned how much easier this would be if there were an API call that would return every train being tracked by AIM and the track circuit being occupied by the head of the train? And have I mentioned the inconsistency inherent in the fact that the API will readily return the position of every Metrobus on the road, straight from OrbCAD, but all we can get from AIM is predictions?

Anyway, suppose we can get an accurate picture of where the trains are; what can we do with that data? When you can see all of the trains at once, you can detect bunching and gaps. In addition, the PIDS only show predictions for trains arriving in the next 20 minutes, and tend to fail miserably when trains are single-tracking. A real feed of train positions might make it possible to offer better information to passengers during track work and disruptions, when the PIDS are often blank or give bad information.

Finally, with the right data, it should be possible to correlate real-time data with the GTFS schedule, and compute on-time performance—not just as the summary metric that WMATA provides, but along a variety of dimensions: by line, by time of day, by day of week, etc. Many questions have been asked about the performance of Metrorail, and ultimately, more data is the only way to answer those questions.
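As a sketch of what such a computation might look like, once scheduled and actual arrival times have been matched up (the five-minute on-time threshold is an assumption for illustration, not WMATA’s actual definition):

```python
from collections import defaultdict

def on_time_performance(observations, threshold_s=300,
                        key=lambda obs: obs["line"]):
    """Percent of arrivals on time, grouped along an arbitrary dimension.

    observations: dicts with 'scheduled_s' and 'actual_s' (seconds past
                  midnight), plus whatever fields the grouping key uses.
    threshold_s:  lateness still counted as on time (assumed: 5 minutes).
    key:          grouping dimension -- by line here, but equally by time
                  of day, day of week, station, etc.
    """
    totals, on_time = defaultdict(int), defaultdict(int)
    for obs in observations:
        group = key(obs)
        totals[group] += 1
        if obs["actual_s"] - obs["scheduled_s"] <= threshold_s:
            on_time[group] += 1
    return {g: 100.0 * on_time[g] / totals[g] for g in totals}
```

Swapping the `key` function swaps the dimension: the same matched data yields per-line, per-hour, or per-station breakdowns, none of which can be recovered from a single system-wide summary metric.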