How should transit agencies make their GTFS available?

To many techies, the question of how transit agencies should make their GTFS available might seem like a silly one. They’d reply that obviously the agency should simply post its GTFS to its Web site at a reasonable URL, and make that URL readily available from the agency’s developer resources page.

Unfortunately, it isn’t nearly so simple in the real world. Instead, many agencies hide their GTFS behind a “clickwrap” license, or even require a login to download the feed. In a few particularly bad cases, developers have to sign an agreement and return it (on paper) just to get access to a feed. Some agencies don’t host their own feeds at all, instead depending on sites like the GTFS Data Exchange.

So, what are some best practices for hosting GTFS feeds?

  • Don’t rely on third parties: Think of this in terms of paper maps and schedules. How would riders feel if a transit agency told them to pick up transit maps and timetables not at the agency’s offices or stations, but from some unrelated third party? If a transit agency has a Web site (as almost all do), then it should be capable of hosting its own GTFS feed. Sure, some agencies will complain about what their content management system “won’t let them do”, or about the arduous process they must go through to upload new content, but in 2014 running a Web site is a basic competency for almost any organization. Depending on a third-party site introduces additional risk and additional points of failure.
  • Help developers discover feeds: Developers shouldn’t have to hunt for GTFS feeds; there should be a prominent link on every agency’s homepage. Bonus points for participating in any applicable data catalogs, like those operated by ODOT and MassDOT for agencies in their respective states.
  • No login, no clickwrap: GTFS feeds should be downloadable by any Internet user, without having to log in or accept a license agreement. This is a must-have for automating downloads of updated GTFS feeds, an essential part of any large-scale passenger information system. Don’t make it needlessly hard for developers to use your GTFS feed: if it can’t be downloaded with wget, you’re just making work for feed users. The only piece of information a developer should need to use an agency’s GTFS feed is its URL, a clean and simple one like http://www.bart.gov/dev/schedules/google_transit.zip.
  • Support conditional HTTP GET: GTFS feeds rarely change from one day to the next, but it’s still important to pick up updates as soon as they’re available. Downloading a large feed (some can be 20 MB or more) every day just to check for changes is wasteful, so how can feed consumers stay up to date without wasting a lot of bandwidth? Feed producers should support conditional HTTP GET, using either the ETag or Last-Modified headers; a sketch of a conditional download follows this list.
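
To illustrate, here’s a minimal sketch of a conditional download in Python, using the requests library. The feed URL and file names are placeholders, and a production downloader would want proper error handling; the same idea works with Last-Modified and If-Modified-Since.

import os
import requests

# Placeholder URL; substitute the agency's actual GTFS URL.
FEED_URL = "https://transit.example.gov/gtfs/google_transit.zip"
FEED_FILE = "google_transit.zip"
ETAG_FILE = "google_transit.etag"

# If we saved an ETag from a previous download, send it back to the server
# so it can reply 304 Not Modified instead of sending the whole feed again.
headers = {}
if os.path.exists(ETAG_FILE):
    with open(ETAG_FILE) as f:
        headers["If-None-Match"] = f.read().strip()

response = requests.get(FEED_URL, headers=headers, timeout=60)

if response.status_code == 304:
    print("Feed unchanged; nothing to download.")
else:
    response.raise_for_status()
    with open(FEED_FILE, "wb") as f:
        f.write(response.content)
    etag = response.headers.get("ETag")
    if etag:
        with open(ETAG_FILE, "w") as f:
            f.write(etag)
    print("Downloaded updated feed.")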

Agencies may balk at some of these recommendations (“But we have to track usage of the feed! But we have to have a signed license agreement!”), but the simple fact is that plenty of agencies get it right: they use a simple, reasonable license, and they host their GTFS at a stable URL that supports automated downloads. If you demand a signed license agreement, or make developers log in to access the feed, you make it harder for developers to use your data. And when you make it hard for developers to use your data in their apps, you make it harder for transit riders to get service information, because many riders’ first stop when they need transit information is a third-party smartphone app.

GTFS-realtime for WMATA buses

I’ve posted many times about the considerable value of open standards for real-time transit data. While it’s always best if a transit authority offers its own feeds using open standards like GTFS-realtime or SIRI, converting available real-time data from a proprietary API into an open format still gets the job done. After a few months of kicking the problem around, I’ve finally written a tool to produce GTFS-realtime StopTimeUpdate, VehiclePosition, and Alert messages for Metrobus, as well as GTFS-realtime Alert messages for Metrorail.

The tool, wmata-gtfsrealtime, isn’t nearly as straightforward as it might be, because while the WMATA API appears to provide all of the information you’d need to create a GTFS-realtime feed, you’ll quickly discover that the route, stop, and trip identifiers returned by the API bear no relation to those used in WMATA’s GTFS feed.

One of the basic tenets of GTFS-realtime is that it is designed to integrate directly with GTFS; for that reason, identifiers must be shared between an agency’s GTFS and GTFS-realtime feeds.
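
As an illustration of what that means in practice, here’s a minimal TripUpdate built with the Python gtfs-realtime-bindings (wmata-gtfsrealtime itself is Java, and the identifier values here are placeholders): the trip_id and stop_id must be exactly the identifiers that appear in the agency’s GTFS trips.txt and stops.txt.

from google.transit import gtfs_realtime_pb2

feed = gtfs_realtime_pb2.FeedMessage()
feed.header.gtfs_realtime_version = "1.0"

entity = feed.entity.add()
entity.id = "trip-update-1"

# Placeholder identifiers; in a real feed they must match trips.txt and
# stops.txt in the agency's GTFS feed exactly.
trip_update = entity.trip_update
trip_update.trip.trip_id = "EXAMPLE_TRIP_ID"

stop_time_update = trip_update.stop_time_update.add()
stop_time_update.stop_id = "EXAMPLE_STOP_ID"
stop_time_update.arrival.delay = 120  # two minutes behind schedule

print(feed)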

In WMATA’s case, this means that it is necessary to first map routes in the API to their counterparts in the GTFS feed, and then, for each vehicle, map its trip to the corresponding trip in the GTFS feed. This is done by querying a OneBusAway TransitDataService (via Hessian remoting) for active trips for the mapped route, then finding the active trip which most closely matches the vehicle’s trip.

Matching is done by constructing a metric space in which each stop time, whether from the API data or from the GTFS feed, is represented as an (x, y, t) tuple—that is, our notion of “distance” covers both space and time. The spatial component fed into the metric is halved, in order to bias the scores towards matching based on time, while still allowing some leeway for stops which are wrongly located in either the GTFS or the real-time data.
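
To make the idea concrete, here’s a sketch of what such a metric might look like in Python. This is not the actual wmata-gtfsrealtime code (which is Java); the 0.5 weight, the units, and the pairing of stop times by position are all illustrative assumptions.

import math

# Illustrative weight used to de-emphasize spatial error relative to time.
SPATIAL_WEIGHT = 0.5

def stoptime_distance(a, b):
    """Distance between two (x, y, t) points, where x and y are projected
    coordinates (say, meters) and t is a time in seconds."""
    ax, ay, at = a
    bx, by, bt = b
    spatial = math.hypot(ax - bx, ay - by) * SPATIAL_WEIGHT
    temporal = abs(at - bt)
    return math.hypot(spatial, temporal)

def trip_score(api_stoptimes, gtfs_stoptimes):
    """Score a candidate GTFS trip against the stop times reported by the
    real-time API: the sum of pairwise distances (lower is better). Pairing
    stop times by position is an assumption made for this sketch."""
    return sum(stoptime_distance(a, b)
               for a, b in zip(api_stoptimes, gtfs_stoptimes))

# The candidate trip with the lowest score is taken as the match:
# best = min(candidate_trips, key=lambda t: trip_score(api_stoptimes, t))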

The resulting algorithm will match all but one or two of the 900-odd vehicles on the road during peak hours. Spot-checking arrivals for stops in OneBusAway against arrivals for the same stops in NextBus shows relatively good agreement; of course, considering that NextBus is a “black box”, unexplained variances in NextBus arrival times are to be expected.

You may wonder why we can’t provide better data for Metrorail; the answer is simple: the API is deficient. As I’ve previously discussed, the rail API only provides the same data you get from looking at the PIDS in stations. Unfortunately, that’s not what we need to produce a GTFS-realtime feed. At a minimum, we would need to be able to get a list of all revenue trains in the system, including their current schedule deviation, and a trip ID which would either match a trip ID in the GTFS feed, or be something we could easily map to a trip ID in the GTFS feed.

This isn’t how it’s supposed to be. Look at this diagram, then, for a reality check, look at this one (both are from a presentation by Jamey Harvey, WMATA’s former Enterprise Architect). WMATA’s data management practices are, to say the least, sorely lacking. For most data, there’s no single source of truth. The problem is particularly acute for bus stops; one database might have the stop in one location and identified with one ID, while another database might have the same physical stop identified with a different number, and coordinates that place it in an entirely different location.

Better data management practices would make it easier for developers to build innovative applications which increase the usability of transit services and, ultimately, improve mobility for the entire region. Isn’t that what it’s supposed to be about, at the end of the day?

WMATA’s open data efforts are good, but could be better

Last week, ReadWriteWeb profiled WMATA’s open data efforts: from the agency’s initial (and ultimately unsuccessful) attempts to monetize the data, through the release of a real-time API and a GTFS feed, to the eventual inclusion of WMATA’s data in Google Transit.

The ReadWriteWeb article paints this as a complete success story, in which, as David Alpert puts it, WMATA “got religion on open data”.

The reality is somewhat different. A GTFS feed and real-time data API may have been a substantial step forward in 2008, when this process started, but today there are many other categories of data WMATA could expose, and open, interoperable formats they could use to do so (particularly for real-time data). In addition, WMATA’s communications with developers could be better. While some agencies have active discussion groups where agency staff communicate freely with developers, at WMATA developers still get a somewhat chilly reception.

How could WMATA’s open data efforts be improved? I have four suggestions; read the full post, WMATA’s open data efforts are good, but could be better, for the details.

Taming MTA New York City Transit’s bus GTFS feeds

If you go to the MTA’s developer resources page, you’ll find that while there is one GTFS feed to download for the subway (and Staten Island Railway), there are five feeds for bus data—one per borough. Your first reaction might be one of annoyance—after all, the agency almost certainly keeps data for all five boroughs in the same system internally, so why not release the data in the same structure?

However, if you look at the files more closely, you’ll soon see why they’re structured the way they are: they are, simply put, massive. The problem is in the stop_times.txt file; the largest, for Brooklyn, is nearly 700 megabytes. Concatenate them together, and you get a 2 gigabyte file containing more than 30 million records. (This is a result of how the feeds are constructed, as dissected in this thread on the developer mailing list.)

Most tools designed for working with GTFS feeds simply can’t handle anything that large (or they choke badly). At the same time, many tools assume that there will be a single feed per agency, so the per-borough feeds (which have some degree of overlap) can be something of a pain to work with.

This leads to a conundrum: you can work with the feeds one borough at a time (although even then with some difficulty, as the individual borough feeds are themselves rather large), but there’s no good way to see the whole city’s bus service at once.

It turns out that with some ingenuity, this problem can be solved, although doing so takes some time and CPU resources. The basic strategy is to first naively merge the feeds together, and then refactor the merged feed, to reduce the number of stop times. The refactoring is described in this post by Brian Ferris.

Actually merging the feeds together isn’t that hard; the agency.txt, calendar.txt, calendar_dates.txt, routes.txt, and shapes.txt files are identical across the five feeds. The stops.txt file has to be merged and then deduplicated, but this can be done with simple command-line tools. For the trips.txt and stop_times.txt files, there’s no other option than to concatenate them together. This does result in a massive stop_times.txt file, but it’s only temporary.

After producing the naively concatenated feed, apply the previously-mentioned OneBusAway GTFS transformer (described in more detail here) to the feed.

The transformer will need about 8 GB of memory to run (so launch the JVM with -Xmx10G, or thereabouts), and on an EC2 large instance, it’ll take about 10 minutes. When it’s done, you’ll have a stop_times.txt file containing around 6 million records, which isn’t quite so bad (considering that the entire merged and refactored feed for the five boroughs ends up being about the same size as the unmodified feed for Brooklyn alone, it’s actually almost good).

As an aside, here’s how I constructed the merged feed; I’m always a fan of solutions which make use of basic Unix tools.

mkdir nyct_bus_merged
cd nyct_bus_merged

# agency.txt, calendar.txt, calendar_dates.txt, routes.txt, and shapes.txt are
# identical across the five feeds, so copy them from any one of them.
cp ../google_transit_manhattan/{agency.txt,calendar.txt,calendar_dates.txt,routes.txt,shapes.txt} .

# Merge stops.txt: strip the header from each borough's file, then deduplicate.
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/stops.txt; do
	tail -n +2 "$file" >> stops_unmerged.txt
done

head -n 1 ../google_transit_manhattan/stops.txt > stops.txt
sort -u stops_unmerged.txt >> stops.txt
rm stops_unmerged.txt

# trips.txt and stop_times.txt are simply concatenated, minus their headers.
head -n 1 ../google_transit_manhattan/trips.txt > trips.txt
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/trips.txt; do
	tail -n +2 "$file" >> trips.txt
done

head -n 1 ../google_transit_manhattan/stop_times.txt > stop_times.txt
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/stop_times.txt; do
	tail -n +2 "$file" >> stop_times.txt
done

# Then zip the feed and apply the GTFS transformer:
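# (The commands below are illustrative: the archive names, the transform file,
# and the transformer jar name are assumptions; see the transformer
# documentation and the post linked above for the exact transform to apply.)
zip ../nyct_bus_merged.zip *.txt
java -Xmx10G -jar onebusaway-gtfs-transformer-cli.jar \
	--transform=transform.txt \
	../nyct_bus_merged.zip ../nyct_bus_refactored.zip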

(Finally, a disclaimer: I haven’t extensively tested the feed produced by the process described in this post. It’s possible that the process has unintended consequences which could affect the feed’s integrity or its usefulness for certain applications.)

Siri and SIRI: ask your phone when the bus is coming

This is going to be a confusing post, because we’re going to talk about SIRI, the Service Interface for Real-Time Information, and Siri, Apple’s new app for the iPhone 4S with an uncanny ability to listen, answer questions, and get things done. But there’s a point here: open standards like SIRI make it easier for software like Siri to work well. Siri knows what it knows because it’s able to tie into services on the Web, like Wikipedia, Wolfram Alpha, and Yelp. It’s not magic; it’s just a matter of having the right data.

Of course, with more data, Siri can do more: imagine being able to stand at a bus stop and ask when the next bus is coming, and get an answer without having to know who runs the local public transit services, or what routes serve that stop, or what the stop code is, and so on.

Building such an app, one that lets you stand at any bus stop and ask “When is the next bus?”, is a lot easier if public transport operators around the world release their data in a common, open format. Doing so means that developers don’t have to write custom software to support every city’s transit system, nor do they have to pay licensing fees to use proprietary protocols.

When every public transport operator uses a common, open format to release their data, developers can build support for a single standard, and know it will work everywhere. It means that they can tap into repositories of open data (like the GTFS Data Exchange) and use every feed they find there, without writing any additional code.

Fortunately, we have those standards, both for static schedule data and real-time data. A huge number of transit authorities release their static schedule data using GTFS, and SIRI is steadily gaining traction for real-time data. Hopefully, MTA New York City Transit’s adoption of SIRI will help it gain further ground in the US, where proprietary formats currently dominate.

The other major aspect to this—and what we lack right now to make this a reality—is a geo-enabled database of real-time feeds, a Web service that would allow you to pass a latitude, longitude, and radius, and get back pointers to feeds for every transit authority in the area. Without this kind of auto-discovery, developers must manually catalog feeds, and either build their own infrastructure for associating feeds with geographic areas, or, more likely, display a huge list of feeds and ask the user to pick their transit authority from the list. This, of course, brings us back to the original problem: what if you know that there are buses where you are, but not who operates them or where to get information?
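
To illustrate, here’s a sketch of what querying such a service might look like in Python; to be clear, the service, its endpoint, its parameters, and the response fields are all hypothetical, invented for the sake of the example.

import requests

# Hypothetical discovery service; no such endpoint actually exists today.
DISCOVERY_URL = "https://feeds.example.org/api/realtime-feeds"

def find_feeds(lat, lon, radius_m):
    """Return descriptors for real-time feeds covering the area around a point."""
    response = requests.get(
        DISCOVERY_URL,
        params={"lat": lat, "lon": lon, "radius": radius_m},
        timeout=30,
    )
    response.raise_for_status()
    # Assumed response shape: a list of {"agency": ..., "format": ..., "url": ...}
    return response.json()

# e.g., feeds covering a one-kilometer radius in downtown Washington, DC:
for feed in find_feeds(38.9, -77.04, 1000):
    print(feed["agency"], feed["format"], feed["url"])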

You might think you could use Google Transit for this, but you actually can’t; Google doesn’t offer an API for the information it stores for Google Transit. It seems like a simple proposition, considering the amount of transit data Google ingests from various transit providers, but despite continuous clamoring from developers, transit data at Google remains a one-way street. However, sites like the GTFS Data Exchange have emerged to fill the void, and I expect that eventually there will be a site providing a similar function for real-time data.

In short, the technology to make public transit information easier to access is all there; Siri is remarkably adept at understanding natural-language queries, and many, many transit authorities have adopted open standards to release their data. All that remains is to put the two together, and then you’ll be able to ask your phone the same kinds of questions you ask of your resident public transit wizards: “What bus do I take to…” or “Where do I catch the bus to…” or, everyone’s favorite, “When is the bus coming?”