Automating transit alert selection using fare collection data

Last week, WMATA launched its new MetroAlerts service, which greatly extends and improves the previous alert system, and adds alerts for bus routes. With the addition of bus alerts, the service provides real benefits for riders, allowing them to get targeted updates on the routes they use.

But this service, like others of its kind, still requires riders to manually designate the rail lines, bus routes, and rail stations they use in order to receive targeted alerts. Some systems also allow riders to further customize their selection of alerts by time period. The end result is that riders are presented with a screenful of choices, when all they really want to know is whether they're going to get to work on time.

So, how can we simplify the process? One approach, which I've been considering recently, is to use data from a transit agency's fare collection system to infer a rider's travel patterns, and automatically select alerts which would affect their usual trips.
What kinds of data can we get from a fare collection system? The data available will fall into three broad groups, depending on the type of fare collection used:

Origin and destination locations available: On systems with a zonal or distance-based fare, like Metrorail, the London Underground, or the PATCO Speedline, the fare collection system records both where a rider enters the system and where they exit. In this case, we can use the origin and destination stations, as well as the time and day of week of both entry and exit.
Origin location only available: This case applies to systems like the New York City subway, where we know where the rider entered the system, but nothing more. After a rider enters the subway, they could take any combination of routes to any destination, over the subway's 600-odd track miles. All we will know is that they swiped their MetroCard at a particular station, at a particular time.
Origin route only available: This case applies to most buses, where the fare collection system does not record where a rider boards, but does record the route that the bus was operating at the time.

In all three of these cases, we'll have data on time and day of week (and in the first case, for both entry and exit). Using day of week in the process allows us to account for cases where a rider makes a certain trip every Saturday, or commutes Monday-Friday, for example.
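To make the three cases concrete, here is a minimal sketch of how a single fare record might be represented. The FareEvent class, its field names, and the helper functions are all hypothetical; a real fare collection export will have its own schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FareEvent:
    """One fare transaction. Field names are illustrative assumptions."""
    entered_at: datetime                        # always available
    origin_station: Optional[str] = None        # rail: where the rider entered
    destination_station: Optional[str] = None   # only on zonal/distance-based systems
    exited_at: Optional[datetime] = None        # only when an exit is recorded
    route: Optional[str] = None                 # bus: route in service at boarding


def data_case(event: FareEvent) -> str:
    """Classify a record into one of the three groups described above."""
    if event.origin_station and event.destination_station:
        return "origin_and_destination"
    if event.origin_station:
        return "origin_only"
    if event.route:
        return "route_only"
    raise ValueError("record has neither a station nor a route")


def time_features(event: FareEvent) -> tuple:
    """Day of week (Monday is 0) and minutes after midnight at entry."""
    t = event.entered_at
    return t.weekday(), t.hour * 60 + t.minute


rail_trip = FareEvent(
    entered_at=datetime(2012, 9, 10, 8, 15),
    origin_station="Silver Spring",
    destination_station="Gallery Place",
    exited_at=datetime(2012, 9, 10, 8, 40),
)
bus_trip = FareEvent(entered_at=datetime(2012, 9, 15, 14, 5), route="S9")
```

Whatever the case, the day-of-week and time-of-day features are always available, which is why they anchor the pattern detection that follows.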

In addition, in the first two cases, where we know where an action took place (whether it was an entry or exit), I would argue that the station's actual location should be used in identifying patterns, rather than an opaque identifier.

To give a concrete example, let's say I commuted every day from Silver Spring to a point equidistant between Gallery Place and Metro Center.

Most days, I might get off at Gallery Place, since that would result in a shorter overall trip, but on certain days I might (for example) get off at Metro Center and walk from there in order to stop somewhere for a snack. Should that be treated as two entirely separate commuting patterns, one from Silver Spring to Gallery Place, and one from Silver Spring to Metro Center? No, that wouldn't make a lot of sense. Using geodata in the clustering would make it possible to identify the commuting pattern as going from Silver Spring to a location in the vicinity of the Metro Center and Gallery Place stations, which is a better reflection of the reality.
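As a rough illustration of why location matters, the sketch below computes the great-circle distance between stations with the haversine formula. The coordinates are approximate values I have filled in for illustration, not surveyed station locations, and the function name is my own.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

# Approximate (latitude, longitude) pairs, for illustration only.
STATIONS = {
    "Silver Spring": (38.9939, -77.0311),
    "Metro Center": (38.8983, -77.0281),
    "Gallery Place": (38.8983, -77.0219),
}


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


exits_apart = haversine_km(STATIONS["Metro Center"], STATIONS["Gallery Place"])
commute_length = haversine_km(STATIONS["Silver Spring"], STATIONS["Metro Center"])
```

With these coordinates the two downtown stations come out well under a kilometer apart, while the commute itself is on the order of ten kilometers. A clustering step that works on distances like these will naturally treat the two exits as the same destination, which an opaque station identifier cannot do.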

So, how do we actually perform this identification? I am not an expert in data mining, but I understand that the task I have set out here is a cluster analysis problem. The first step is to get the fare collection data into a database. The data can either be provided by the agency directly, or collected from the agency's Web site using a screen-scraper. Then, a clustering algorithm like DBSCAN or OPTICS can be used to extract clusters from the data.
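To show what the clustering step might look like, here is a small pure-Python sketch of DBSCAN. In practice one would more likely reach for an existing library implementation; this version exists only to make the idea concrete. The trip representation (entry hour, exit position in kilometers on a local grid), the distance function, and the eps and min_pts values are all assumptions chosen for illustration.

```python
from math import hypot

NOISE = -1


def dbscan(points, eps, min_pts, dist):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force neighborhood query; includes the point itself.
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep growing the cluster
    return labels


def trip_distance(a, b):
    """Trips are (entry_hour, exit_x_km, exit_y_km). Thirty minutes or one
    kilometer each count as one unit. Ignores wraparound at midnight."""
    time_gap = abs(a[0] - b[0]) / 0.5
    space_gap = hypot(a[1] - b[1], a[2] - b[2]) / 1.0
    return max(time_gap, space_gap)


trips = [
    (8.0, 0.0, 0.0), (8.1, 0.5, 0.0), (8.2, 0.0, 0.0), (7.9, 0.5, 0.0),  # morning, two nearby exits
    (17.5, 0.3, 10.6), (17.6, 0.3, 10.6), (17.4, 0.3, 10.6),             # evening return
    (13.0, 5.0, 3.0),                                                    # one-off midday trip
]
labels = dbscan(trips, eps=1.0, min_pts=3, dist=trip_distance)
```

In this toy data set the morning trips form one cluster even though they end at two different (but nearby) exits, the evening trips form a second, and the one-off trip is left as noise. A density-based method suits this problem because the number of travel patterns is not known in advance and occasional trips should not be forced into a pattern.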

After the clusters have been identified, we can turn them into alert selections. In any of the three cases described above, selections can be made based on time of day and day of week.

In the first case, we can also narrow down the selected alerts based on the origin and destination stations. Where those stations are not on the same line, some inferences will have to be made about where the rider might transfer; then alerts can be selected which affect only the relevant portions of the relevant lines.

In the second case, where we only know that the rider entered at a particular station, the safest course of action is probably simply to enable all rail alerts for a certain period of time; without destination data or further input from the rider, there's no way to know what alerts might be relevant.

Finally, in the third case, where only route data is available, the solution is simple: enable all alerts for the specified route around the time of boarding.
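The three rules above could be sketched roughly as follows. The pattern dictionaries, the STATION_LINES table, and the one-hour window are assumptions for illustration. The sketch also works at the level of whole lines rather than portions of lines, and it stands in for transfer inference by simply taking both stations' lines when they share none.

```python
# Illustrative station-to-line table; a real system would load this from
# the agency's published schedule data.
STATION_LINES = {
    "Silver Spring": {"Red"},
    "Gallery Place": {"Red", "Green", "Yellow"},
    "Pentagon City": {"Blue", "Yellow"},
}


def select_alerts(pattern, window_minutes=60):
    """Turn one detected travel pattern into an alert selection.

    `pattern` is a dict with 'days' and 'start_minute', plus whichever of
    'origin', 'destination', and 'route' the fare system can supply.
    """
    selection = {
        "days": pattern["days"],
        "from_minute": pattern["start_minute"],
        "to_minute": pattern["start_minute"] + window_minutes,
    }
    origin = pattern.get("origin")
    destination = pattern.get("destination")
    if origin and destination:
        shared = STATION_LINES[origin] & STATION_LINES[destination]
        # No shared line means a transfer; as a crude stand-in for real
        # transfer inference, watch every line serving either station.
        lines = shared or (STATION_LINES[origin] | STATION_LINES[destination])
        selection["rail_lines"] = sorted(lines)
    elif origin:
        selection["rail_lines"] = "all"      # destination unknown
    elif pattern.get("route"):
        selection["bus_routes"] = [pattern["route"]]
    else:
        raise ValueError("pattern has neither a station nor a route")
    return selection


weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri"]
direct = select_alerts({"days": weekdays, "start_minute": 495,
                        "origin": "Silver Spring", "destination": "Gallery Place"})
transfer = select_alerts({"days": weekdays, "start_minute": 495,
                          "origin": "Silver Spring", "destination": "Pentagon City"})
entry_only = select_alerts({"days": weekdays, "start_minute": 495,
                            "origin": "Silver Spring"})
bus = select_alerts({"days": ["Sat"], "start_minute": 845, "route": "S9"})
```

A fuller version would route the trip through the network to find the likely transfer point, and would restrict alerts to the stretch of each line actually traveled.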

In any event, the selection of alerts should not be completely automated. Rather, what I envision is an interface which presents the automatically-determined selections to the rider and allows them to choose the ones they want to keep, edit them if necessary, and also add alert selections which were not automatically identified.

As I see it, the real benefit comes from being able to look at a rider's fare collection data over time, and automatically update the alert selections as their travel patterns change. Has the rider started taking a new bus every Saturday afternoon? When the observation reaches a certain confidence threshold, add it to the list of automatically-selected alerts, and send the rider an email letting them know about the change. Have they quit riding the train to Metro Center every morning, and started going to Pentagon City instead? Again, update the alert selections, so that they receive only the transit alerts most relevant to their trips.
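One simple way to express that confidence threshold is as the share of recent weeks in which a pattern was observed. The sketch below is only one possible rule; the function name, the pattern labels, and both threshold values are assumptions, and a real system would want to tune them.

```python
def update_selections(current, weekly_observations,
                      add_threshold=0.75, drop_threshold=0.25):
    """Return (new_selection, added, removed).

    `current` is the set of pattern labels the rider is subscribed to.
    `weekly_observations` maps a pattern label to a list of booleans, one
    per recent week, saying whether the pattern was seen that week.
    """
    selection = set(current)
    added, removed = [], []
    for pattern, weeks in sorted(weekly_observations.items()):
        if not weeks:
            continue                      # no evidence either way
        share = sum(weeks) / len(weeks)
        if share >= add_threshold and pattern not in selection:
            selection.add(pattern)
            added.append(pattern)         # candidates for the notification email
        elif share <= drop_threshold and pattern in selection:
            selection.discard(pattern)
            removed.append(pattern)
    return selection, added, removed


current = {"weekday AM: Silver Spring to Metro Center"}
observed = {
    "weekday AM: Silver Spring to Metro Center": [False, False, False, False],
    "weekday AM: Silver Spring to Pentagon City": [True, True, True, False],
    "Saturday PM: bus S9": [True, True, True, True],
    "Tuesday midday: one-off trip": [False, False, True, False],
}
selection, added, removed = update_selections(current, observed)
```

Using separate thresholds for adding and dropping keeps a pattern from flickering in and out of the selection when a rider skips a week. The added and removed lists are what would go into the email to the rider.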

Transit alert systems like MetroAlerts are fantastic, but the reality is that there is some burden involved in subscribing and maintaining one's subscription. Many riders will subscribe to an initial set of alerts, and then forget all about the system. When their commute changes, they'll keep on getting alerts tailored to their old commute. They'll probably have forgotten how to log into MetroAlerts to change their alert selections, and so they'll grow increasingly frustrated as the transit authority spams them with alerts about a bus they no longer ride. By using available data, we can increase the intelligence of the alert selection process, and make it more rider-friendly and less labor-intensive.