For San Francisco, and governments everywhere, technology startups are the perfect partners

A recent article in the New York Times describes SMART Muni, “an Apple iPad app that uses Global Positioning System technology to track all of the city’s buses in real time, allowing transit managers and passengers to monitor problems and delays”. It sounds like the perfect success story: civic coders taking open data (Muni tracks its buses and trains with NextBus, which provides an XML data feed) and using that data to improve operations and create real value for the agency.

Unfortunately, it’s not a success story: the app has never been used in production. As the article explains, “Muni hopes to put the app to good use some day, but the agency is $29 million over budget and cannot afford to buy the iPads required to run the software…[nor] is the city willing to invest $100,000 to run a pilot program.”

The costs involved here—a few hundred dollars each for some iPads, perhaps a few thousand dollars to fund a stipend for a civic coder, even $100,000 for a pilot—pale in comparison to the costs associated with the big-name IT consulting firms that governments are used to dealing with.

In addition, startups, teams of civic coders, and open source projects can often deliver a working prototype or even a completed project much faster than conventional development teams. As the New York Times describes, “a small team of volunteers took just 10 days last summer to create [the app].”

Unfortunately, the City of San Francisco is out of touch with the realities of technology: “‘Start-ups fail at a high rate,’ said Jay Nath, chief innovation officer of San Francisco. ‘As stewards of taxpayer dollars, we need to be thoughtful of using that money wisely and not absorbing too much risk.’” Nath is right about one thing: start-ups do fail at an alarming rate. But that’s not the risk you might think it is, because startups aren’t like conventional development projects.

Unlike conventional projects, startups fail fast. When a startup has an idea that isn’t going anywhere, it winds up quickly instead of wasting years and millions of dollars. Maybe the idea was bad (or even outright infeasible) to begin with, or the startup had the wrong team, or it tried to do too much at once. Maybe the idea has been superseded by a newer, even better technology. Whatever the reason, the startup doesn’t just grind away for years, running up a million-dollar bill; it admits that it can’t deliver, and gets out gracefully.

Consider, for example, the FBI’s Virtual Case File, a five-year, $170-million development effort that never actually delivered any working software. Imagine if the VCF project had failed after three or six months, not five years. Imagine if it had spent less than a million dollars before failing, not $170 million. Of course, the project still wouldn’t be done—but we’d have known that something was wrong up front, instead of finding out five years later, after millions of taxpayer dollars had been wasted on a doomed development effort.

More importantly, startups do have the agility necessary to keep up with the ever-changing technology marketplace. A development effort that takes five or ten years is bound to deliver a product that is obsolete as soon as it arrives, unless major changes are made along the way.

The conventional development practices used by many government agencies and their contractors don’t incorporate that kind of agility. Specifications and requirements are written early in the project’s life, perhaps even before a development team has been selected (if the project must be put out for bids). Even if the requirements are found to be lacking—or flat-out wrong—development marches on. In the end, the team will deliver a product that meets the requirements (thus satisfying the bean-counters) but which is already out-of-date and which doesn’t actually do what users need it to do.

I alluded to these problems in my recent coverage of WMATA’s initiative to install real-time information displays at bus stops. By only considering bids from vendors with “standard, proven products” and “successful existing and fully operational implementations, in multiple transit agencies”, they potentially shut out innovative startups (or even teams of civic coders, like the Mobility Lab).

It’s entirely possible that the first team to tackle a thorny problem may fail—but rather than casting them as “failures that burn holes in the city’s budget”, we’ve got to communicate to governments and taxpayers alike that not all failures are the same. There’s a big difference between a project that runs for years, spends millions of dollars, and has nothing to show for it in the end, and a project that fails after just a few months, has spent well less than a million dollars, and can identify what went wrong, so the next project will be more successful.

When it comes to technology, the best way for governments to be good ‘stewards of taxpayer dollars’ is to adopt successful development practices: small, agile, competent teams that build inexpensive, flexible products, and that fail quickly if they can’t get the job done. The old way—forking over millions and millions to high-priced contractors until they finally declare defeat, then fighting it out in a years-long legal battle—just doesn’t look like good stewardship anymore. Sure, established companies may have a long track record that startups don’t, but what is it a record of? We don’t need any more million-dollar failures. We need smart civic coders developing next-generation solutions like SMART Muni, and we need governments to accept, embrace, and support them.

Taming MTA New York City Transit’s bus GTFS feeds

If you go to the MTA’s developer resources page, you’ll find that while there is one GTFS feed to download for the subway (and Staten Island Railway), there are five feeds for bus data—one per borough. Your first reaction might be one of annoyance—after all, the agency almost certainly keeps data for all five boroughs in the same system internally, so why not release the data in the same structure?

However, if you look at the files more closely, you’ll soon see why they’re structured the way they are: they are, simply put, massive. The problem is in the stop_times.txt file; the largest, for Brooklyn, is nearly 700 megabytes. Concatenate them together, and you get a 2 gigabyte file containing more than 30 million records. (This is a result of how the feeds are constructed, as dissected in this thread on the developer mailing list.)

Most tools designed for working with GTFS feeds simply can’t handle anything that large (or they choke badly). Yet, at the same time, many tools also assume that there will be a single feed per agency, so the per-borough feeds (which have some degree of overlap) can be something of a pain to work with.

This leads to a conundrum: you can work with the feeds one borough at a time (although even then, with some difficulty, as even the individual borough feeds are rather large), but there’s no good way to see the whole city’s bus service at once.

It turns out that, with some ingenuity, this problem can be solved, although doing so takes some time and CPU resources. The basic strategy is to first naively merge the feeds together, and then refactor the merged feed to reduce the number of stop times. The refactoring is described in this post by Brian Ferris.

Actually merging the feeds together isn’t that hard; the agency.txt, calendar.txt, calendar_dates.txt, routes.txt, and shapes.txt files are identical across the five feeds. The stops.txt file has to be merged and then deduplicated, but this can be done with simple command-line tools. For the trips.txt and stop_times.txt files, there’s no option other than to concatenate them. This does result in a massive stop_times.txt file, but it’s only temporary.

After producing the naively concatenated feed, apply the previously-mentioned OneBusAway GTFS transformer (described in more detail here) to the feed.

The transformer will need about 8 GB of memory to run (so launch the JVM with -Xmx10G, or thereabouts), and on an EC2 large instance, it’ll take about 10 minutes. When it’s done, you’ll have a stop_times.txt file which contains around 6 million records, which isn’t quite so bad (considering that the entire merged and refactored feed for the five boroughs ends up being about the same size as the unmodified feed for Brooklyn alone, it’s actually almost good).

As an aside, here’s how I constructed the merged feed; I’m always a fan of solutions which make use of basic Unix tools.

mkdir nyct_bus_merged
cd nyct_bus_merged

# agency.txt, calendar.txt, calendar_dates.txt, routes.txt, and shapes.txt are
# identical across the five feeds, so copy them from any one of them
cp ../google_transit_manhattan/{agency.txt,calendar.txt,calendar_dates.txt,routes.txt,shapes.txt} .

# concatenate the five stops.txt files, skipping each file's header row...
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/stops.txt; do
	tail -n +2 "$file" >> stops_unmerged.txt
done

# ...then deduplicate, putting a single header row back on top
head -n 1 ../google_transit_manhattan/stops.txt > stops.txt
sort stops_unmerged.txt | uniq >> stops.txt
rm stops_unmerged.txt

# trips.txt and stop_times.txt are simply concatenated, again keeping one header row
head -n 1 ../google_transit_manhattan/trips.txt > trips.txt
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/trips.txt; do
	tail -n +2 "$file" >> trips.txt
done

head -n 1 ../google_transit_manhattan/stop_times.txt > stop_times.txt
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/stop_times.txt; do
	tail -n +2 "$file" >> stop_times.txt
done

# then zip the feed and apply the GTFS transformer

(Finally, a disclaimer: I haven’t extensively tested the feed produced by the process described in this post. It’s possible that the process has unintended consequences which could affect the feed’s integrity or its usefulness for certain applications.)

Updates to tph.py

I have made some updates to tph.py, my tool for generating plots of transit service levels from GTFS feeds. Most importantly, these updates fix compatibility issues which kept it from working with certain agencies’ GTFS feeds. To start, here are two plots generated from the BART GTFS feed, using the new version of tph.py.

The first plot’s not actually all that useful, since it mixes up rail service and the AirBART bus shuttle, but it does demonstrate a point, which will be explained later. The second plot is actually useful; it readily demonstrates that the BART system was not designed to provide frequent service to the extremities of the network—while service through the core peaks at 22 trains per hour, most branches on average get four trains per hour.

Anyway, supporting the BART GTFS feed required two major changes: supporting feeds which do not use the direction_id field, and supporting feeds which use the frequencies.txt file (which is used in the BART GTFS feed for AirBART, hence its inclusion above) rather than explicit stop times for every trip. As a result of these changes, tph.py should now support any GTFS feed. However, feeds which do not use the direction_id field do require additional configuration to assign directions to routes and trips; this is all covered in the new documentation on the configuration file format.
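To illustrate the frequencies.txt case: a frequency-based entry gives a start time, an end time, and a headway, so the hourly trip counts have to be derived from those values rather than read from stop_times.txt. Below is a minimal sketch of that idea—it is not tph.py’s actual code, and the function name is mine:

from collections import defaultdict

def trips_per_hour_from_frequency(start_secs, end_secs, headway_secs):
    """Count departures falling in each hour for one frequencies.txt entry.

    start_secs and end_secs are seconds after midnight, and headway_secs is
    the interval between departures, as in the GTFS frequencies.txt fields.
    """
    counts = defaultdict(int)
    t = start_secs
    while t < end_secs:
        counts[t // 3600] += 1
        t += headway_secs
    return dict(counts)

# A hypothetical AirBART-style entry: a departure every 20 minutes from 05:00 to 22:00.
print(trips_per_hour_from_frequency(5 * 3600, 22 * 3600, 20 * 60))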

In addition, tph.py’s innards have had an overhaul; it no longer uses Google’s transitfeed module for parsing GTFS feeds. Instead, it uses a fork of the gtfs module. gtfs imports the feed and stores it in a SQLite database, using SQLAlchemy. This takes time upfront, but makes tph.py a lot faster to run. It also makes the code cleaner; some operations which previously required several nested for loops can now be done with a single SQL query.
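As an example of the sort of query this enables, counting the trips that start in each hour on each route can be done in one SQL statement. The snippet below uses sqlite3 directly, and the table and column names simply follow the GTFS file and field names—they are assumptions, and may not match the schema the gtfs module actually creates:

import sqlite3

# Connect to the SQLite database produced by importing the feed (path is hypothetical).
conn = sqlite3.connect("feed.db")

# Count, for each route, the trips departing in each hour of the service day.
# This assumes stop_sequence numbering starts at 1; note that GTFS times can
# run past midnight, so hours of 24 and above are possible.
query = """
    SELECT trips.route_id,
           CAST(substr(stop_times.departure_time, 1, 2) AS INTEGER) AS hour,
           COUNT(DISTINCT trips.trip_id) AS departures
    FROM stop_times
    JOIN trips ON trips.trip_id = stop_times.trip_id
    WHERE stop_times.stop_sequence = 1
    GROUP BY trips.route_id, hour
    ORDER BY trips.route_id, hour
"""

for route_id, hour, departures in conn.execute(query):
    print(route_id, hour, departures)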

A Python script for visualizing GTFS data

See here for recent updates to tph.py; the software no longer uses Google’s transitfeed module, having switched to the SQLite-backed gtfs library, and some of the restrictions described below have been removed.

Over the weekend, I put together a little Python project which I am now releasing: tph.py, a tool for visualizing transit service levels using data from GTFS feeds. Over at Raschke on Transport, I’ve posted some examples, but here I’d like to discuss the technical underpinnings.

The script uses Google’s transitfeed library to parse the GTFS data. Loading a large GTFS schedule with transitfeed can take several minutes, but I wouldn’t consider that unexpected when dealing with a CSV file that might be close to 200 MB. Still, I’d like to see a uniform way to get GTFS data into (for example) a SQLite database, so that SQL queries can be issued against the dataset. In addition, the script depends on certain optional fields being present in the GTFS dataset—in particular, trips must use direction_id, and either trips or stop times must have a headsign defined (if neither does, a headsign is synthesized from the last stop in the trip). These issues could be worked around, but for now it’s easier to assume that these fields will be present.
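The headsign fallback is straightforward: use the name of the trip’s last stop. Here’s a minimal sketch of that idea, written against the raw stop_times.txt and stops.txt files rather than the transitfeed API; the function name is mine, and it ignores the case where a real headsign is already present:

import csv

def synthesize_headsigns(stop_times_path, stops_path):
    """Map each trip_id to the name of its last stop, for use as a fallback headsign."""
    with open(stops_path) as f:
        stop_names = {row["stop_id"]: row["stop_name"] for row in csv.DictReader(f)}

    last_stop = {}  # trip_id -> (highest stop_sequence seen so far, stop_id)
    with open(stop_times_path) as f:
        for row in csv.DictReader(f):
            seq = int(row["stop_sequence"])
            trip_id = row["trip_id"]
            if trip_id not in last_stop or seq > last_stop[trip_id][0]:
                last_stop[trip_id] = (seq, row["stop_id"])

    return {trip_id: stop_names[stop_id] for trip_id, (_, stop_id) in last_stop.items()}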

After having extracted the hourly service values for each of the target routes, the script uses matplotlib to generate the service graph. The matplotlib API is complex, but even so, I was able to generate the plot I wanted without too much effort.
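For a sense of what’s involved, here’s a stripped-down example with made-up data—the plots tph.py actually produces are considerably more elaborate:

import matplotlib.pyplot as plt

# Illustrative hourly trip counts for a single route and direction (made-up data).
hours = list(range(24))
trips = [0, 0, 0, 0, 1, 3, 6, 10, 12, 8, 6, 6, 6, 6, 7, 9, 12, 12, 9, 6, 4, 3, 2, 1]

fig, ax = plt.subplots()
ax.bar(hours, trips, align="center")
ax.set_xticks(hours)
ax.set_xlabel("Hour of day")
ax.set_ylabel("Trips per hour")
ax.set_title("Example route, weekday service (illustrative data)")
fig.savefig("example_tph.png")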

Because the most time-intensive part of the script is the initial step of loading the GTFS schedule into memory, I designed the script to generate multiple plots in each run. However, I could also see the script being used interactively—matplotlib can generate clickable, interactive plots, and I can envision a future version which would let a user click on a particular hour and drill down to examine the service for that hour in more detail (breaking the runs down by destination, for example).
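matplotlib’s pick events are one plausible way to build that: set picker=True on the bars and register a handler for the 'pick_event' signal. The sketch below (again with made-up data) just prints the hour that was clicked; an interactive tph.py would regenerate a more detailed plot at that point:

import matplotlib.pyplot as plt

hours = list(range(24))
trips = [0, 0, 0, 0, 1, 3, 6, 10, 12, 8, 6, 6, 6, 6, 7, 9, 12, 12, 9, 6, 4, 3, 2, 1]

fig, ax = plt.subplots()
ax.bar(hours, trips, align="center", picker=True)  # picker=True makes each bar clickable
ax.set_xlabel("Hour of day")
ax.set_ylabel("Trips per hour")

def on_pick(event):
    # event.artist is the bar (a Rectangle) that was clicked; recover its hour
    # from the center of the rectangle.
    rect = event.artist
    hour = int(round(rect.get_x() + rect.get_width() / 2))
    print("Hour %d: %d trips" % (hour, trips[hour]))
    # A future version could break this hour's runs down by destination here.

fig.canvas.mpl_connect("pick_event", on_pick)
plt.show()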