Recently, I’ve been investigating techniques for independently gathering data so that I can analyze performance on the Metrorail system. As I’ve previously lamented, the agency releases only summary performance statistics, which makes it impossible to conduct more detailed analyses. Therefore, we must begin with data collection. If WMATA made all of the data captured by AIM available to developers, this would be a much easier task. But, as I’ve noted, only train predictions are released, obscuring the actual number of trains in the system and their positions.

So, we must first sample the prediction data. We know that the predictions are updated by AIM roughly every 20 seconds. It is not known how much delay Mashery introduces, so for simplicity we will just assume that new predictions are made available every 20 seconds. Applying the Shannon-Nyquist sampling theorem then tells us that we must sample the data at least once every 10 seconds.

Don’t trust Claude Shannon? Here’s an example to illustrate why we have to sample so frequently:

Suppose that we’re polling the PIDS at Metro Center once per minute. In the peaks, sometimes the interval between trains is less than 60 seconds. So, at $latex T=0$, we might sample the PIDS and find an 8-car train to Glenmont boarding. If we sample again at $latex T=60$, and once again we see that an 8-car train to Glenmont is boarding, has one train serviced the platform, or two?

We might be able to say with some certainty that two distinct trains had serviced the platform if the observed trains were on different lines, or travelling to different destinations, or if they were different lengths. But if all of the observed characteristics are identical, then we have no way to tell if we saw one train or two, *unless* we were to have observed, in between the two trains, that the platform was empty (that is, that no train was boarding).
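To make the ambiguity concrete, here is a minimal sketch that counts only the trains we can *prove* were at the platform from a series of samples. The `Observation` tuple, the function name, and the sample data are all made up for illustration; none of this comes from the WMATA API.

```python
from typing import Optional, Sequence, Tuple

# One look at the platform: None means no train was boarding; otherwise the
# (line, destination, cars) shown on the sign. This shape is hypothetical.
Observation = Optional[Tuple[str, str, int]]


def count_trains_lower_bound(samples: Sequence[Observation]) -> int:
    """Count the trains we can prove serviced the platform.

    A train is counted only when a boarding observation follows an empty
    platform, or differs from the boarding observation just before it.
    """
    count = 0
    previous: Observation = None
    for current in samples:
        if current is not None and current != previous:
            count += 1
        previous = current
    return count


glenmont = ("RD", "Glenmont", 8)

# Sampled once per minute: two identical observations, one provable train.
sparse = [glenmont, glenmont]
# Sampled more often: the empty platform in between proves there were two.
dense = [glenmont, glenmont, None, None, glenmont, glenmont]

print(count_trains_lower_bound(sparse))  # 1
print(count_trains_lower_bound(dense))   # 2
```

The result is a lower bound, not a count: with the sparse samples there may well have been two trains, but nothing in the data lets us say so.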

Once we accept the need to sample at a particular rate in order to avoid missing a train, how often *do* we sample the predictions? This is where Claude Shannon comes in. As previously introduced, the sampling theorem states that:

If a function $latex f(t)$ contains no frequencies higher than $latex W$ cps, it is completely determined by giving its ordinates at a series of points spaced $latex 1/(2W)$ seconds apart.

The PIDS update every 20 seconds, or at a rate of 0.05 Hertz. Accordingly, we must sample the predictions every 10 seconds. But then what? We’ll have a database of predictions; the sampling rate ensures that we will not miss any. But how do we go from predictions to trains? This remains an open question for me.
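The arithmetic, plus a fixed-schedule polling loop, might look something like the sketch below. The 20-second update period is the assumption stated earlier; `fetch` is a placeholder for whatever function actually retrieves the predictions (I am not showing a real API call here), and `poll` and its parameters are names I have invented for this example.

```python
import time
from typing import Any, Callable, List, Tuple

UPDATE_PERIOD_S = 20.0  # assumption from the text: new predictions every ~20 s


def sampling_interval(update_period_s: float) -> float:
    """Spacing 1/(2W) from the sampling theorem, taking W = 1/update_period."""
    w_hz = 1.0 / update_period_s
    return 1.0 / (2.0 * w_hz)


def poll(
    fetch: Callable[[], Any],
    interval_s: float,
    n_samples: int,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[float, Any]]:
    """Call fetch() n_samples times on a fixed schedule, timestamping each result."""
    results = []
    start = clock()
    for i in range(n_samples):
        # Wait for the i-th scheduled tick, so that slow fetches do not
        # accumulate into drift over a long run.
        delay = start + i * interval_s - clock()
        if delay > 0:
            sleep(delay)
        results.append((clock(), fetch()))
    return results


print(sampling_interval(UPDATE_PERIOD_S))  # 10.0
```

In use, something like `poll(get_predictions, sampling_interval(UPDATE_PERIOD_S), 360)` would collect an hour of samples, where `get_predictions` is a hypothetical function that queries the prediction feed and returns the parsed result.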

Obviously, any time we have a prediction indicating that a train is boarding, we know that there is a train physically at the platform. That’s the only time we don’t have to guess. In all other cases, we have to start guessing. One of the more substantial problems is that it’s hard to figure out where a train is physically, given its predicted arrival time at a station. The WMATA GTFS feed can be used to find the average travel time between two adjacent stations, and the WMATA API can be used to get the distance between those stations. Using that data, you can *estimate* how many feet away from the station a train is, given the arrival time. But it’s only an estimate, and almost certainly a bad one.
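As a rough illustration of that estimate, the sketch below interpolates linearly along the segment, which amounts to assuming the train moves at a constant speed between stations. That assumption is mine and is exactly why the estimate is a bad one: it ignores acceleration, braking, and holds. The function name and the numbers in the example are made up.

```python
def estimate_distance_ft(
    minutes_to_arrival: float,
    segment_travel_min: float,
    segment_length_ft: float,
) -> float:
    """Estimate how far a train is from the station, in feet.

    Assumes constant speed over the segment. If the prediction is longer
    than the segment's average travel time, the train has probably not
    entered the segment yet, so the result is capped at the segment length.
    """
    if segment_travel_min <= 0:
        raise ValueError("segment_travel_min must be positive")
    fraction = minutes_to_arrival / segment_travel_min
    fraction = min(max(fraction, 0.0), 1.0)
    return fraction * segment_length_ft


# Made-up numbers: a 4,000 ft segment that takes 2 minutes on average,
# with a train predicted to arrive in 1 minute.
print(estimate_distance_ft(1.0, 2.0, 4000.0))  # 2000.0
```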

Have I mentioned how much easier this would be if there were an API call that would return every train being tracked by AIM and the track circuit being occupied by the head of the train? And have I mentioned the inconsistency inherent in the fact that the API will readily return the position of every Metrobus on the road, straight from OrbCAD, but all we can get from AIM is predictions?

Anyway, suppose we can get an accurate picture of where the trains are. What can we do with that data? When you can see all of the trains at once, you can detect bunching and gaps. In addition, the PIDS only show predictions for trains arriving in the next 20 minutes, and tend to fail miserably when trains are single-tracking. A real feed of train positions might make it possible to offer better information to passengers during track work and disruptions, when the PIDS are often blank or give bad information.
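Detecting bunching and gaps could be as simple as comparing observed headways at a station against the scheduled headway. The sketch below does that; the thresholds (half and one-and-a-half times the scheduled headway) are arbitrary values I picked for illustration, and the arrival times are invented.

```python
from typing import List, Sequence, Tuple


def classify_headways(
    arrival_times_s: Sequence[float],
    scheduled_headway_s: float,
    bunch_factor: float = 0.5,  # illustrative threshold, not an agency standard
    gap_factor: float = 1.5,    # illustrative threshold, not an agency standard
) -> List[Tuple[float, str]]:
    """Label each headway between consecutive arrivals at one station.

    Returns (headway_seconds, label) pairs, where label is "bunched",
    "gap", or "ok".
    """
    times = sorted(arrival_times_s)
    labelled = []
    for earlier, later in zip(times, times[1:]):
        headway = later - earlier
        if headway < bunch_factor * scheduled_headway_s:
            label = "bunched"
        elif headway > gap_factor * scheduled_headway_s:
            label = "gap"
        else:
            label = "ok"
        labelled.append((headway, label))
    return labelled


# Invented arrivals (seconds since some reference time), 6-minute schedule.
print(classify_headways([0, 360, 450, 1200], scheduled_headway_s=360))
# [(360, 'ok'), (90, 'bunched'), (750, 'gap')]
```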

Finally, with the right data, it should be possible to correlate real-time data with the GTFS schedule, and compute on-time performance, not just as the summary metric that WMATA provides, but along a variety of dimensions: by line, by time of day, by day of week, etc. Many questions have been asked about the performance of Metrorail, and ultimately, more data is the only way to answer those questions.
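Assuming we eventually had a table of scheduled and observed arrival times, slicing on-time performance along any dimension is straightforward. In the sketch below, the record layout, the `on_time_rate` function, and the two-minute tolerance are all my own assumptions for illustration; in particular, this is not WMATA’s definition of “on time”, and the records are invented.

```python
from collections import defaultdict
from datetime import datetime
from typing import Callable, Dict, Hashable, Iterable, Mapping


def on_time_rate(
    records: Iterable[Mapping],
    key: Callable[[Mapping], Hashable],
    tolerance_s: float = 120.0,  # assumed tolerance, early or late
) -> Dict[Hashable, float]:
    """Fraction of arrivals within tolerance_s of schedule, grouped by key(record)."""
    totals = defaultdict(lambda: [0, 0])  # group -> [on_time_count, total_count]
    for rec in records:
        deviation_s = (rec["actual"] - rec["scheduled"]).total_seconds()
        bucket = totals[key(rec)]
        if abs(deviation_s) <= tolerance_s:
            bucket[0] += 1
        bucket[1] += 1
    return {group: on_time / total for group, (on_time, total) in totals.items()}


# Invented records; each needs "scheduled" and "actual" datetimes.
records = [
    {"line": "RD", "scheduled": datetime(2013, 1, 7, 8, 0), "actual": datetime(2013, 1, 7, 8, 1)},
    {"line": "RD", "scheduled": datetime(2013, 1, 7, 8, 6), "actual": datetime(2013, 1, 7, 8, 10)},
    {"line": "BL", "scheduled": datetime(2013, 1, 7, 17, 0), "actual": datetime(2013, 1, 7, 17, 0, 30)},
]

print(on_time_rate(records, key=lambda r: r["line"]))            # {'RD': 0.5, 'BL': 1.0}
print(on_time_rate(records, key=lambda r: r["scheduled"].hour))  # {8: 0.5, 17: 1.0}
```

Passing a different `key` (for example `lambda r: r["scheduled"].weekday()`) gives the same metric by day of week without changing anything else.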