Python context managers for CFFI resource management

I have a project at work that involves calling into a third-party C library from Python, using CFFI. Using the library involves working with two different types of “contexts”—things that in an object-oriented language would be classes, but in C are nothing more than structs or void pointers to opaque memory regions.

In pseudo-Python, working with one of these contexts looks like this:

context = ffi.new("ApplicationSpecificStruct *")
lib.do_something_with_context(context, some_data)
lib.free_context(context)  # hypothetical cleanup call; the real name depends on the library

I’ve omitted error handling, but in reality each one of those library calls returns an error code which must be checked and handled after each call (no exceptions in C, remember!).
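That checking can be centralized in a small helper. Here's a minimal sketch, assuming the common C convention that a return value of zero means success and anything nonzero is an error code (the real library's convention may differ):

```python
def check(err):
    """Raise a Python exception when a library call reports failure.

    Assumes 0 means success and any nonzero value is an error code.
    """
    if err != 0:
        raise RuntimeError("library call failed with error code %d" % err)

# Usage (hypothetical call):
# check(lib.do_something_with_context(context, some_data))
```

With a helper like this, the error code from every library call gets turned into an exception at the call site, which is exactly what makes the context-manager approach below worthwhile.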

In order to simplify the process of working with the context, and especially of ensuring that it is still freed if an exception is thrown while it is being used, I have found Python’s context management protocol helpful.

Using contextlib.contextmanager, we can create a simple context manager that encapsulates the process of creating and freeing the C library’s context.

@contextlib.contextmanager
def application_context():
  context = ffi.new("ApplicationSpecificStruct *")
  try:
    yield context
  finally:
    lib.free_context(context)  # hypothetical cleanup call

with application_context() as the_context:
  lib.do_something_with_context(the_context, some_data)

If an exception is thrown inside the with-block, the C library’s context will still get freed.

Now, I suspect some will argue that this is a cop-out—that the more Pythonic thing to do would be to create proper object-oriented wrappers for the library’s context types. There are certain advantages to this approach; principally, it enables cleaner code, in which lib.do_something_with_context(the_context, some_data) becomes simply context.do_something(some_data).

But building fully object-oriented wrappers is both more tedious and more time-consuming, and for what I’m doing the context manager approach is perfectly suitable. Besides, even if I’d implemented fully object-oriented wrappers, I’d still want them to implement the context manager protocol—it’s the Pythonic way to ensure that a resource is closed after it’s been used, as in this example from the Python documentation:

with open("hello.txt") as f:
  for line in f:
    print line,
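For comparison, a hypothetical object-oriented wrapper would implement that same protocol with __enter__ and __exit__. The bodies below are stand-ins, not the real library's API:

```python
class ApplicationContext:
    """Sketch of an object-oriented wrapper over the C library's context.

    The assignments in __enter__/__exit__ are placeholders for the real
    ffi.new(...) allocation and the library's cleanup call.
    """

    def __enter__(self):
        self._context = object()  # placeholder for ffi.new(...)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._context = None      # placeholder for the library's free call
        return False              # never suppress exceptions from the with-block

    def do_something(self, some_data):
        pass  # placeholder for lib.do_something_with_context(self._context, some_data)
```

With this in place, "with ApplicationContext() as context: context.do_something(some_data)" releases the underlying resource even if the block raises, just as the contextmanager version does.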

Taming MTA New York City Transit’s bus GTFS feeds

If you go to the MTA’s developer resources page, you’ll find that while there is one GTFS feed to download for the subway (and Staten Island Railway), there are five feeds for bus data—one per borough. Your first reaction might be one of annoyance—after all, the agency almost certainly keeps data for all five boroughs in the same system internally, so why not release the data in the same structure?

However, if you look at the files more closely, you’ll soon see why they’re structured the way they are: they are, simply put, massive. The problem is the stop_times.txt file; the largest, for Brooklyn, is nearly 700 megabytes. Concatenate them together, and you get a 2 gigabyte file containing more than 30 million records. (This is a result of how the feeds are constructed, as dissected in this thread on the developer mailing list.)

Most tools designed for working with GTFS feeds simply can’t handle anything that large (or they choke badly). Yet, at the same time, many tools also assume that there will be a single feed per agency, so the per-borough feeds (which have some degree of overlap) can be something of a pain to work with.

This leads to a conundrum: you can work with the feeds one borough at a time (although even then, with some difficulty, as even the individual borough feeds are rather large), but there’s no good way to see the whole city’s bus service at once.

It turns out that with some ingenuity, this problem can be solved, although doing so takes some time and CPU resources. The basic strategy is to first naively merge the feeds together, and then refactor the merged feed, to reduce the number of stop times. The refactoring is described in this post by Brian Ferris.

Actually merging the feeds together isn’t that hard; the agency.txt, calendar.txt, calendar_dates.txt, routes.txt, and shapes.txt files are identical across the five feeds. The stops.txt file has to be merged and then deduplicated, but this can be done with simple command-line tools. For the trips.txt and stop_times.txt files, there’s no other option than to concatenate them together. This does result in a massive stop_times.txt file, but it’s only temporary.

After producing the naively concatenated feed, apply the previously-mentioned OneBusAway GTFS transformer (described in more detail here) to the feed.

The transformer will need about 8 GB of memory to run (so launch the JVM with -Xmx10G, or thereabouts), and on an EC2 large instance, it’ll take about 10 minutes. When it’s done, you’ll have a stop_times.txt file which contains around 6 million records, which isn’t quite so bad (considering that the entire merged and refactored feed for the five boroughs ends up being about the same size as the unmodified feed for Brooklyn alone, it’s actually almost good).

As an aside, here’s how I constructed the merged feed; I’m always a fan of solutions which make use of basic Unix tools.

mkdir nyct_bus_merged
cd nyct_bus_merged
cp ../google_transit_manhattan/{agency.txt,calendar.txt,calendar_dates.txt,routes.txt,shapes.txt} .
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/stops.txt; do
	tail -n +2 "$file" >> stops_unmerged.txt
done

head -n 1 ../google_transit_manhattan/stops.txt > stops.txt
sort stops_unmerged.txt | uniq >> stops.txt
rm stops_unmerged.txt

head -n 1 ../google_transit_manhattan/trips.txt > trips.txt
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/trips.txt; do
	tail -n +2 "$file" >> trips.txt
done

head -n 1 ../google_transit_manhattan/stop_times.txt > stop_times.txt
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/stop_times.txt; do
	tail -n +2 "$file" >> stop_times.txt
done
#then zip the feed and apply the GTFS transformer

(Finally, a disclaimer: I haven’t extensively tested the feed which is the result of the process described in this post. It’s possible that this process has unintended consequences which could affect its integrity or usefulness for certain applications.)

Announcing htmlbib, a tool for rendering BibTeX files as interactive HTML

For some time now, I’ve been working on an annotated bibliography of articles on various topics in transportation (particularly the history of automatic fare collection from 1960 to the present, as well as the SelTrac train control system and its origins in Germany). I’ve been compiling the information using BibDesk, and I’d like to be able to share it with a wider audience, in the hope that it might be useful to someone.

At a bare minimum, posting the BibTeX file online somewhere would fulfill my desire to get the information out there. But not everyone out there who might benefit from the bibliography uses BibTeX. For many people, I fear a .bib file would be nothing more than unintelligible gibberish; outside of academic circles (and even then, outside of the hard sciences), TeX is not particularly well-known.

The next alternative would be to post the bibliography online as a PDF or HTML file. This alternative is considerably more accessible to non-BibTeX users, but actually makes life harder for people who would like to be able to copy references (as BibTeX source) to use in their own BibTeX files (common practice in communities of TeX users). Merely rendering the entire contents of the file also loses some of the metadata—the comments associated with entries, the groups and keywords, etc.

There are also specialized tools (like bibtex2html) for converting a BibTeX file to HTML. But even there, the results fall short: the output is mostly static text. I wanted a tool that would make good use of the keywords entered in BibDesk, and which would provide links between publications and authors. I also wanted a tool which would be equally useful for BibTeX users, who would be helped by having access to the BibTeX source for each entry, and for non-BibTeX users, who would be helped by having formatted bibliography entries. I therefore set out to build a tool that would meet my needs; the result is htmlbib.

One of the items of concern for me was that the bibliography entries be formatted properly; after having taken care to make sure that the information was added to BibDesk so that it would be rendered well, I did not want to have some generic template used to create HTML for each entry. So, I ended up cobbling together an arrangement that actually uses BibTeX and tex4ht to produce HTML for each entry using the desired BibTeX style (in my case, IEEEtran), so that the entries look the same in the preview as they would in an actual publication. This is slow, but the preview results are cached, so subsequent runs are faster.
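The caching can be as simple as keying the rendered HTML on a hash of each entry's BibTeX source, so the slow bibtex + tex4ht step only runs when an entry changes. This is a sketch of the idea, not htmlbib's actual code; the cache directory and function names are made up:

```python
import hashlib
import os

CACHE_DIR = "preview_cache"  # hypothetical cache location

def cached_preview(entry_source, render):
    """Return rendered HTML for one BibTeX entry, calling render (the
    slow bibtex + tex4ht step) only when the entry's source is new."""
    key = hashlib.sha1(entry_source.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".html")
    if os.path.exists(path):
        with open(path) as cached:
            return cached.read()
    html = render(entry_source)
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as out:
        out.write(html)
    return html
```

Because the key is derived from the entry's source text, editing an entry in BibDesk automatically invalidates its cached preview on the next run.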

As for parsing the BibTeX file, since I’m already familiar with scripting BibDesk, I decided to use appscript to call BibDesk from Python. The result is therefore not portable from OS X, but it suits my needs. There are BibTeX parsing libraries for Python, so porting to another platform would only require substituting one of those libraries of the calls to BibDesk; the rest is pure Python, with the exception of lxml, and the aforementioned preview code, which expects a functioning TeX installation on the system.

The HTML is produced using Jinja2 templates, which for now are stored in the application egg. The default, built-in template is built on Blueprint CSS and jQuery along with jQuery Tools. It wouldn’t be too hard to provide an option for using user-specified templates instead of the built-in template.

I’ve uploaded some sample output to demonstrate what htmlbib does.

Google Apps Script: like AppleScript for the Web

I’ve always been a huge fan of AppleScript for automating tasks in scriptable applications and (more importantly) gluing scriptable applications together. Particularly when working with applications which are designed to take full advantage of AppleScript, like BibDesk, Delicious Library, and XTension, AppleScript makes even complex tasks easy. Unlike macros, which are confined to a single application, AppleScript is built on top of Apple Events, making it easy to target any scriptable application, even on a remote Mac over the network. And AppleScripts aren’t macros; they don’t just play back keyboard and mouse events; you get a real object-oriented view of the data being manipulated. But really good scriptable applications are hard to come by, and of course AppleScript does you no good if you’re using cloud-based applications like Google Docs.

Browser automation tools like Selenium, and libraries like mechanize, help fill the gap somewhat, but they’re far from providing the same rich environment that AppleScript does. To give a concrete example, I was recently working on a spreadsheet listing Twitter accounts for the top 50 transit agencies in the US (more on that project here). In the spreadsheet, I’d listed agencies’ accounts by username (that is, @username). But what I really wanted was a link to each account on Twitter (that is,`username`). I could have entered the links manually, but that would have required needless manual work. If I were using a conventional spreadsheet application on the desktop, I could have used whatever macro or scripting facility it provided, or I could have exported the file to CSV and used sed and awk to get the job done. But I was working in the cloud; I knew there had to be a better way.

Enter Google Apps Script. Google Apps Script provides for Google’s cloud-based applications the same scriptability that AppleScript provides for desktop applications on the Mac. In only a few minutes, after studying the documentation, I was able to produce a script which achieved the desired effect.

Buoyed by my quick success, I decided to try going a step further: what if I could use the Twitter API to automatically set each cell’s comment to the most recent Tweet? Doing so would give viewers a quick preview of the Twitter account’s content, without leaving the spreadsheet. Working off of some sample code from Google, I quickly wrote another script to do the job. I ran into trouble for a while until I found that the “Callback URL” in the Twitter application settings must be set; once that was done, everything worked perfectly. (Incidentally, the error message given in that case, “unexpected error”, is completely useless, and gives no clue as to the actual problem.) From there, all I had to do was set up a time-based trigger to run the script automatically so the Tweets would update periodically, and I was done.

For me, the real point—and the power of Google Apps Script—is how quickly and easily I was able to not only automate otherwise-tedious processes, but draw in data from disparate sources and display it automatically. I’ve only scratched the surface of what can be done with Google Apps Script; the technology can be made to do a lot more.

Using shortDOIs automatically in BibDesk

I’m quite a fan of using DOIs to refer to online resources when possible. However, some DOIs are a bit ungainly, and particularly for readers working from a printed bibliography, they’re outright inconvenient. Who wants to type in something like 10.1002/(SICI)1097-0258(19980815/30)17:15/16<1661::AID-SIM968>3.0.CO;2-2? Even when working from a digital copy, a string that long is bound to get mangled somewhere if it gets copied and pasted around, sent in emails, etc. You could use a conventional URL-shortening service, but that’s probably not appropriate in the context of a published paper. So, how can you continue to get the benefits of the DOI system without exposing your readers to long, ugly URLs?

The answer is the shortDOI service, which transforms DOIs into shortcuts that are a lot easier for your readers to use. Every shortDOI generated is itself a DOI, so the conventional risk of a URL-shortener shutting down and taking the shortcuts along with it isn’t a problem. As long as the DOI system is functioning, shortDOIs will be resolvable.

For example, the DOI 10.1109/JRPROC.1929.221679 can be dereferenced by using the URL When this is shortened with shortDOI, the result is the DOI 10/bpc. This can be dereferenced with the URL (note that that’s no different than any other DOI), but, more importantly, it can also be dereferenced with the URL It’s this last URL that is important for our purposes, as it’s the shortest.

Now, shortDOIs can be manually generated, but why bother, if the process can be automated? I use BibDesk for managing references, and BibDesk is a scriptable application, so an AppleScript was the easiest solution to the problem. I’ve posted the script on GitHub; you can find it here. For every publication in a BibDesk document which has a DOI entered and which does not have a shortDOI shortcut, it will retrieve the shortDOI shortcut for the publication’s DOI, and store it in the URL field.