Taming MTA New York City Transit’s bus GTFS feeds

If you go to the MTA’s developer resources page, you’ll find that while there is one GTFS feed to download for the subway (and Staten Island Railway), there are five feeds for bus data, one per borough. Your first reaction might be one of annoyance: after all, the agency almost certainly keeps data for all five boroughs in the same system internally, so why not release the bus data as a single feed too?

However, if you look at the files more closely, you’ll soon see why they’re structured the way they are: they are, simply put, massive. The problem lies in the stop_times.txt files; the largest, for Brooklyn, is nearly 700 megabytes. Concatenate them together and you get a 2 gigabyte file containing more than 30 million records. (This is a result of how the feeds are constructed, as dissected in this thread on the developer mailing list.)

Most tools designed for working with GTFS feeds simply can’t handle files that large, or choke badly when they try. At the same time, many tools assume that there will be a single feed per agency, so the per-borough feeds (which overlap to some degree) can be something of a pain to work with.

This leads to a conundrum: you can work with the feeds one borough at a time (though even then with some difficulty, as the individual borough feeds are themselves rather large), but there’s no good way to see the whole city’s bus service at once.

It turns out that with some ingenuity this problem can be solved, although doing so takes some time and CPU resources. The basic strategy is to first naively merge the feeds together, and then refactor the merged feed to reduce the number of stop_times records. The refactoring is described in this post by Brian Ferris.

Actually merging the feeds isn’t that hard: the agency.txt, calendar.txt, calendar_dates.txt, routes.txt, and shapes.txt files are identical across the five feeds. The stops.txt files have to be concatenated and then deduplicated, but this can be done with simple command-line tools. For the trips.txt and stop_times.txt files, there’s no option other than to concatenate them. This does result in a massive stop_times.txt file, but it’s only temporary.

After producing the naively concatenated feed, apply the OneBusAway GTFS transformer (described in more detail here) to the feed; this is the tool that performs the refactoring described above.

The transformer will need about 8 GB of memory to run (so launch the JVM with -Xmx10G, or thereabouts), and on an EC2 large instance, it’ll take about 10 minutes. When it’s done, you’ll have a stop_times.txt file containing around 6 million records, which isn’t so bad; considering that the entire merged and refactored feed for the five boroughs ends up being about the same size as the unmodified feed for Brooklyn alone, it’s actually almost good.

As an aside, here’s how I constructed the merged feed; I’m always a fan of solutions which make use of basic Unix tools.

mkdir nyct_bus_merged
cd nyct_bus_merged
# these files are identical across all five feeds, so any borough's copies will do
cp ../google_transit_manhattan/{agency.txt,calendar.txt,calendar_dates.txt,routes.txt,shapes.txt} .
 
# collect the stop records from all five boroughs, stripping the header line from each
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/stops.txt; do
	tail -n +2 "$file" >> stops_unmerged.txt
done

# write a single header, then append the deduplicated stop records
head -n 1 ../google_transit_manhattan/stops.txt > stops.txt
sort -u stops_unmerged.txt >> stops.txt
rm stops_unmerged.txt


# trips.txt: keep one header, then concatenate the records from all five feeds
head -n 1 ../google_transit_manhattan/trips.txt > trips.txt
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/trips.txt; do
	tail -n +2 "$file" >> trips.txt
done

# stop_times.txt: same treatment; this file will be huge until the transformer runs
head -n 1 ../google_transit_manhattan/stop_times.txt > stop_times.txt
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/stop_times.txt; do
	tail -n +2 "$file" >> stop_times.txt
done
# then zip the feed and apply the GTFS transformer

(Finally, a disclaimer: I haven’t extensively tested the feed that results from the process described in this post. It’s possible that the process has unintended consequences which could affect the feed’s integrity or its usefulness for certain applications.)

Announcing htmlbib, a tool for rendering BibTeX files as interactive HTML

For some time now, I’ve been working on an annotated bibliography of articles on various topics in transportation (particularly the history of automatic fare collection from 1960 to the present, as well as the SelTrac train control system and its origins in Germany). I’ve been compiling the information using BibDesk, and I’d like to be able to share it with a wider audience, in the hope that it might be useful to someone.

At a bare minimum, posting the BibTeX file online somewhere would fulfill my desire to get the information out there. But not everyone who might benefit from the bibliography uses BibTeX. For many people, I fear, a .bib file would be nothing more than unintelligible gibberish; outside of academic circles (and, even within them, outside of the hard sciences), TeX is not particularly well-known.

The next alternative would be to post the bibliography online as a PDF or HTML file. That is considerably more accessible to non-BibTeX users, but it actually makes life harder for people who would like to copy references (as BibTeX source) into their own BibTeX files, a common practice in communities of TeX users. Merely rendering the entire contents of the file also loses some of the metadata: the comments associated with entries, the groups and keywords, and so on.

There are also specialized tools (like bibtex2html) for converting a BibTeX file to HTML. But even there the results fall short; the output is mostly static text. I wanted a tool that would make good use of the keywords entered in BibDesk, and which would provide links between publications and authors. I also wanted a tool which would be equally useful for BibTeX users, who are best served by access to the BibTeX source for each entry, and for non-BibTeX users, who are best served by formatted bibliography entries. I therefore set out to build a tool that would meet my needs; the result is htmlbib.

One particular concern was that the bibliography entries be formatted properly; having taken care to enter the information in BibDesk so that it would render well, I did not want some generic template used to create the HTML for each entry. So I ended up cobbling together an arrangement that actually uses BibTeX and tex4ht to produce HTML for each entry using the desired BibTeX style (in my case, IEEEtran), so that the entries look the same in the preview as they would in an actual publication. This is slow, but the preview results are cached, so subsequent runs are faster.
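
For the curious, the general shape of that arrangement looks something like the sketch below. This is a simplified reconstruction rather than htmlbib’s actual code: the file names, the cache layout, and the exact sequence of htlatex and bibtex runs are my own illustration of the approach, and it assumes a TeX installation with tex4ht and the IEEEtran style available.

import hashlib
import os
import subprocess
import tempfile

TEMPLATE = r"""\documentclass{article}
\begin{document}
\nocite{%(key)s}
\bibliographystyle{IEEEtran}
\bibliography{refs}
\end{document}
"""

def preview_html(key, bib_source, cache_dir="preview-cache"):
    """Render one BibTeX entry to HTML via latex, bibtex, and tex4ht, with caching."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(
        cache_dir, hashlib.sha1(bib_source.encode("utf-8")).hexdigest() + ".html")
    if os.path.exists(cache_file):
        with open(cache_file) as f:      # reuse a previously rendered preview
            return f.read()

    workdir = tempfile.mkdtemp()
    with open(os.path.join(workdir, "refs.bib"), "w") as f:
        f.write(bib_source)              # just the single entry being previewed
    with open(os.path.join(workdir, "entry.tex"), "w") as f:
        f.write(TEMPLATE % {"key": key})

    # htlatex runs latex with the tex4ht configuration; bibtex fills in the
    # bibliography, and the second htlatex pass picks up the generated .bbl file.
    for cmd in (["htlatex", "entry.tex"], ["bibtex", "entry"], ["htlatex", "entry.tex"]):
        subprocess.run(cmd, cwd=workdir, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    with open(os.path.join(workdir, "entry.html")) as f:
        html = f.read()                  # the full page; the real tool pulls out just the entry
    with open(cache_file, "w") as f:
        f.write(html)
    return html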

As for parsing the BibTeX file: since I’m already familiar with scripting BibDesk, I decided to use appscript to call BibDesk from Python. The result is therefore not portable beyond OS X, but it suits my needs. There are BibTeX parsing libraries for Python, so porting to another platform would only require substituting one of those libraries for the calls to BibDesk; the rest is pure Python, with the exception of lxml and the aforementioned preview code, which expects a functioning TeX installation on the system.
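
To give a sense of what such a substitution might look like, here is a rough sketch using the bibtexparser library (version 1.x API); the field names follow bibtexparser’s conventions, and the grouping of entries by keyword is just an illustration, not htmlbib’s internal data model.

import collections

import bibtexparser

# Parse the .bib file directly instead of asking BibDesk for its contents.
with open("refs.bib") as f:
    db = bibtexparser.load(f)

# Each entry is a plain dict; 'ID' holds the cite key and 'ENTRYTYPE' the entry type.
by_keyword = collections.defaultdict(list)
for entry in db.entries:
    for keyword in entry.get("keywords", "").split(","):
        keyword = keyword.strip()
        if keyword:
            by_keyword[keyword].append(entry["ID"])

for keyword, keys in sorted(by_keyword.items()):
    print(keyword + ": " + ", ".join(keys))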

The HTML is produced using Jinja2 templates, which for now are stored in the application egg. The default template is built on Blueprint CSS and jQuery, along with jQuery Tools. It wouldn’t be too hard to add an option for using user-specified templates instead of the built-in one.
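
Here is a sketch of what that option might look like; the package path, template name, and template variables are assumptions for the sake of illustration.

import jinja2

def make_environment(user_template_dir=None):
    # Fall back to the templates shipped inside the package (i.e. the egg); if the
    # user supplies a directory, templates found there take precedence.
    loaders = [jinja2.PackageLoader("htmlbib", "templates")]
    if user_template_dir is not None:
        loaders.insert(0, jinja2.FileSystemLoader(user_template_dir))
    return jinja2.Environment(loader=jinja2.ChoiceLoader(loaders))

env = make_environment()
template = env.get_template("bibliography.html")   # hypothetical template name
print(template.render(entries=[]))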

I’ve uploaded some sample output to demonstrate what htmlbib does.

Google Apps Script: like AppleScript for the Web

I’ve always been a huge fan of AppleScript for automating tasks in scriptable applications and (more importantly) gluing scriptable applications together. Particularly when working with applications designed to take full advantage of AppleScript, like BibDesk, Delicious Library, and XTension, AppleScript makes even complex tasks easy. Unlike macros, which are confined to a single application, AppleScript is built on top of Apple Events, making it easy to target any scriptable application, even on a remote Mac over the network. And AppleScripts aren’t mere macros: they don’t just play back keyboard and mouse events; you get a real object-oriented view of the data being manipulated. But really good scriptable applications are hard to come by, and of course AppleScript does you no good if you’re using cloud-based applications like Google Docs.

Browser automation tools like Selenium and libraries like mechanize help fill the gap somewhat, but they’re far from providing the rich environment that AppleScript does. To give a concrete example, I was recently working on a spreadsheet listing Twitter accounts for the top 50 transit agencies in the US (more on that project here). In the spreadsheet, I’d listed agencies’ accounts by username (that is, @username). But what I really wanted was a link to each account on Twitter (that is, http://twitter.com/username). I could have entered the links by hand, but that would have been needless busywork. If I were using a conventional spreadsheet application on the desktop, I could have used whatever macro or scripting facility it provided, or I could have exported the file to CSV and used sed and awk to get the job done. But I was working in the cloud; I knew there had to be a better way.

Enter Google Apps Script. Google Apps Script provides for Google’s cloud-based applications the same scriptability that AppleScript provides for desktop applications on the Mac. In only a few minutes, after studying the documentation, I was able to produce a script which achieved the desired effect.

Buoyed by my quick success, I decided to try going a step further: what if I could use the Twitter API to automatically set each cell’s comment to the most recent Tweet? Doing so would give viewers a quick preview of the Twitter account’s content, without leaving the spreadsheet. Working off of some sample code from Google, I quickly wrote another script to do the job. I ran into trouble for a while until I found that the “Callback URL” in the Twitter application settings must be set to https://spreadsheets.google.com/macros; once that was done, everything worked perfectly. (Incidentally, the error message given in that case, “unexpected error”, is completely useless, and gives no clue as to the actual problem.) From there, all I had to do was set up a time-based trigger to run the script automatically so the Tweets would update periodically, and I was done.

For me, the real point—and the power of Google Apps Script—is how quickly and easily I was able to not only automate otherwise-tedious processes, but draw in data from disparate sources and display it automatically. I’ve only scratched the surface of what can be done with Google Apps Script; the technology can be made to do a lot more.

Using shortDOIs automatically in BibDesk

I’m quite a fan of using DOIs to refer to online resources when possible. However, some DOIs are a bit ungainly, and particularly for readers working from a printed bibliography, they’re outright inconvenient. Who wants to type in something like 10.1002/(SICI)1097-0258(19980815/30)17:15/16<1661::AID-SIM968>3.0.CO;2-2? Even when working from a digital copy, a string that long is bound to get mangled somewhere if it gets copied and pasted around, sent in emails, etc. You could use a conventional URL-shortening service, but that’s probably not appropriate in the context of a published paper. So, how can you continue to get the benefits of the DOI system without exposing your readers to long, ugly URLs?

The answer is the shortDOI service, which transforms DOIs into shortcuts that are a lot easier for your readers to use. Every shortDOI generated is itself a DOI, so the conventional risk of a URL-shortener shutting down and taking the shortcuts along with it isn’t a problem. As long as the DOI system is functioning, shortDOIs will be resolvable.

For example, the DOI 10.1109/JRPROC.1929.221679 can be dereferenced by using the URL http://dx.doi.org/10.1109/JRPROC.1929.221679. When this is shortened with shortDOI, the result is the DOI 10/bpc. This can be dereferenced with the URL http://dx.doi.org/10/bpc (note that that’s no different than any other DOI), but, more importantly, it can also be dereferenced with the URL http://doi.org/bpc. It’s this last URL that is important for our purposes, as it’s the shortest.

Now, shortDOIs can be manually generated, but why bother, if the process can be automated? I use BibDesk for managing references, and BibDesk is a scriptable application, so an AppleScript was the easiest solution to the problem. I’ve posted the script on GitHub; you can find it here. For every publication in a BibDesk document which has a DOI entered and which does not have a shortDOI shortcut, it will retrieve the shortDOI shortcut for the publication’s DOI, and store it in the URL field.
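
For anyone who doesn’t use BibDesk, the lookup step itself is easy to reproduce in any language. Here’s a minimal Python sketch; note that the ?format=json parameter and the ShortDOI response key reflect my understanding of the shortDOI service’s interface, so check shortdoi.org for the authoritative details.

import json
import urllib.parse
import urllib.request

def short_doi(doi):
    # Query the shortDOI service; the ?format=json parameter and the 'ShortDOI'
    # key are assumptions about the service's JSON interface (see shortdoi.org).
    url = "http://shortdoi.org/" + urllib.parse.quote(doi) + "?format=json"
    with urllib.request.urlopen(url) as response:
        return json.load(response)["ShortDOI"]

print(short_doi("10.1109/JRPROC.1929.221679"))   # should print '10/bpc'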

Announcing aerodb

I have, on several occasions, written about using data from the DBpedia project to produce a freely-available database of aerodromes, aerodrome identifiers, and locations. I previously presented a SPARQL query which could be used to perform the necessary extraction from the DBpedia SPARQL endpoint. Now I am releasing aerodb, a Python project which encapsulates that SPARQL query in a command-line tool and provides other utilities for working with the data. The raw data extracted from Wikipedia is noisy; some location identifiers are used in more than one article, resulting in duplicates. aerodb includes a file, produced by manually inspecting the linked Wikipedia articles, which is used to de-duplicate these entries. The README file for the project contains more information (including how Wikipedians can help), so I won’t duplicate all of that here.

The final result contains 8,522 aerodromes (based on the data presently available from DBpedia, which will change over time). You can get the results as a JSON, CSV, or KML file from the GitHub downloads page. (The KML file is rather fun to look at in Google Earth; it gets a bit sluggish and has some rendering issues, but works well otherwise.)

(Note: I use the term ‘aerodrome’ where others might use the more common ‘airport’ because some countries, including Canada, draw a legal distinction between the two, and ‘aerodrome’ is the more widely applicable term. That is to say, aerodb produces a database which contains aerodromes that are, by the Canadian definition, not airports.)

A Python script for visualizing GTFS data

See here for recent updates to tph.py; the software no longer uses Google’s transitfeed module, in favor of the SQLite-backed gtfs library, and some of the restrictions described below have been removed.

Over the weekend, I put together a little Python project which I am now releasing: tph.py, a tool for visualizing transit service levels using data from GTFS feeds. Over at Raschke on Transport, I’ve posted some examples, but here I’d like to discuss the technical underpinnings.

The script uses Google’s transitfeed library to parse the GTFS data. Loading a large GTFS schedule with transitfeed can take several minutes, but I wouldn’t consider that unexpected when dealing with a CSV file that might be close to 200 MB. Still, I’d like to see a uniform way of getting GTFS data into (for example) a SQLite database, so that SQL queries can be issued against the dataset. In addition, the script depends on certain optional fields being present in the GTFS dataset: in particular, trips must use direction_id, and either trips or stop times must have a headsign defined (if neither does, a headsign will be synthesized from the trip’s last stop). These issues could be worked around, but for now it’s easier to assume that these fields will be present.
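
To give a flavor of the kind of extraction involved, here is a rough sketch (not tph.py’s actual code) of counting trips per hour for a single route and direction using transitfeed. Keep in mind that transitfeed is a Python 2 library, that the route and direction values are placeholders, and that this sketch ignores service calendars entirely, so it counts every scheduled trip in the feed rather than the trips on any one service day.

from collections import defaultdict

import transitfeed

schedule = transitfeed.Loader("google_transit.zip").Load()   # slow for large feeds

trips_per_hour = defaultdict(int)
for trip in schedule.GetTripList():
    if trip.route_id != "B63" or trip.direction_id != "0":   # placeholder route/direction
        continue
    stop_times = trip.GetStopTimes()
    if not stop_times:
        continue
    departure = stop_times[0].departure_secs    # seconds after midnight at the first stop
    if departure is not None:
        trips_per_hour[departure // 3600] += 1

for hour in sorted(trips_per_hour):
    print("%02d:00  %d trips" % (hour, trips_per_hour[hour]))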

After extracting the hourly service values for each of the target routes, the script uses matplotlib to generate the service graph. The matplotlib API is complex, but even so I was able to generate the plot I wanted without too much effort.
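
The plotting side can be quite compact; here is a minimal example in the same spirit, with made-up counts, and far simpler than the graphs tph.py actually produces.

import matplotlib.pyplot as plt

hours = list(range(24))
trips = [0, 0, 0, 0, 1, 3, 6, 9, 10, 8, 6, 6, 6, 6, 7, 8, 10, 10, 8, 6, 4, 3, 2, 1]  # illustrative only

fig, ax = plt.subplots(figsize=(10, 3))
ax.bar(hours, trips, align="center")
ax.set_xticks(hours)
ax.set_xlabel("Hour of day")
ax.set_ylabel("Trips per hour")
ax.set_title("Hourly service, one route and direction (illustrative data)")
fig.tight_layout()
fig.savefig("tph_example.png")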

Because the most time-intensive part of the script is the initial step of loading the GTFS schedule into memory, I designed the script to generate multiple plots in each run. However, I could also see the script being used interactively—matplotlib can be used to generate clickable, interactive plots, and I can envision a future version which would allow a user to click on a particular hour and drill down and further examine the service for that hour (breaking the runs down by destination, for example).

Documentation for geolucidate

As promised, I have completed the documentation for geolucidate and uploaded it to PyPI (where it can be viewed here). Aside from documenting the API, the documentation also includes a discussion of the input formats which are accepted. This also merits a new source release, which has been uploaded to PyPI as version 0.2. There are no substantive changes in version 0.2 other than the documentation, though two test cases have been added (they passed when they were added, so no further source changes were required).