Reviewing WMATA’s Transparent Metro Data Sets initiative

In August, WMATA released the first part of their Transparent Metro Data Sets initiative: an API for accessing data on the Metrorail system. I’ve finally taken the time to look over the API that WMATA has put together and come to some conclusions:

Overall, it’s an excellent effort, and while I was initially concerned by the partnership with Mashery, it seems to have paid off. The signup process looks straightforward, and when developers asked for JSONP support, WMATA was able to add it without much fuss. However, the API rate limits (10,000 calls per day, and no more than two or five calls per second, depending on where you look) seem to be designed for an architecture where developers cache results on their own servers before distributing them to users. A desktop or mobile app that made API calls directly would hit the limits pretty quickly with just a handful of users. If WMATA intends for developers to implement their own caching infrastructure, it would be nice to see that stated explicitly in the documentation. Otherwise, it would make more sense to apply the rate limits per key and per IP address, so that requests using the same key but from different IP addresses would not count against the same limit.
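To make the caching point concrete, here is a minimal sketch of the kind of server-side cache the rate limits seem to assume. Everything in it is my own illustration rather than anything from WMATA’s documentation: `TTLCache`, the `fetch` callable standing in for a real API request, the `calls_per_day` helper, and the 20-second refresh interval are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple

DAILY_LIMIT = 10_000  # the per-day figure quoted for the API


def calls_per_day(poll_interval_seconds: int) -> int:
    """Calls made in 24 hours by one client polling at a fixed interval."""
    return 86_400 // poll_interval_seconds


class TTLCache:
    """Answer repeated requests from memory for ttl_seconds per key.

    `fetch` is a placeholder for whatever function actually calls the API;
    it is an assumption of this sketch, not a real client library.
    """

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl_seconds: float = 20.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.upstream_calls = 0  # how many times we really hit the API

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no upstream call
        value = self._fetch(key)
        self.upstream_calls += 1
        self._store[key] = (now, value)
        return value


if __name__ == "__main__":
    per_user = calls_per_day(20)
    print(per_user)                    # 4320
    print(3 * per_user > DAILY_LIMIT)  # True
```

Under the assumed 20-second polling interval, a single always-on client calling the API directly uses 4,320 of the 10,000 daily calls, so three such clients would already exceed the limit. Behind the cache, the number of upstream calls per key stays roughly the same no matter how many users are asking.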

In terms of the actual API methods, it would be nice to have a method that returns information on every train being tracked by AIM (or at least every revenue train). I can see this being useful for statistical purposes, as well as for developers who want to build their own train prediction algorithm. That is the same approach WMATA is planning to take with bus data: to avoid having to tangle with NextBus, they’re not releasing bus predictions, just the raw AVL data. The train prediction algorithm used for the PIDS, and now for the API data, has some known limitations, such as poor handling of trains that originate at stations other than the end of the line (such as the Yellow Line, or Red Line short-turns at Silver Spring), and miserable performance when trains are single-tracking. It would be interesting to see if the developer community could do better, given the raw data. Being able to get the position of every train directly would also open up some interesting possibilities in terms of data visualization.

Finally, there’s one data quality issue I noted. Some Yellow Line trains terminate at or originate from Greenbelt, and these are even accounted for in the data used by the trip planner. However, the data returned by the Rail Lines API method does not list E10 as a destination station for the Yellow Line.
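A gap like this is easy to check for mechanically once the line data is in hand. The sketch below assumes the data has already been reduced to a plain mapping from line code to a list of destination station codes. That shape, the `missing_destinations` helper, the `"YL"` key, and the sample codes other than E10 are placeholders of mine, not the actual response format of the Rail Lines method.

```python
from typing import Dict, List


def missing_destinations(
    listed: Dict[str, List[str]],
    expected: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Return, per line, the expected destination codes that are not listed."""
    gaps: Dict[str, List[str]] = {}
    for line, codes in expected.items():
        absent = sorted(set(codes) - set(listed.get(line, [])))
        if absent:
            gaps[line] = absent
    return gaps


if __name__ == "__main__":
    # Illustrative values only, not real API output.
    listed = {"YL": ["C15", "E01"]}
    # Destinations we believe the line serves, e.g. taken from trip planner data.
    expected = {"YL": ["E10"]}
    print(missing_destinations(listed, expected))  # {'YL': ['E10']}
```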