Python context managers for CFFI resource management

I have a project at work that involves calling into a third-party C library from Python, using CFFI. Using the library involves working with two different types of “contexts”—things that in an object-oriented language would be classes, but in C are nothing more than structs or void pointers to opaque memory regions.

In pseudo-Python, working with one of these contexts looks like this:

# allocate the opaque struct and have the library initialize it
context = ffi.new("ApplicationSpecificStruct *")
lib.create_context(context)

# use the context, then free it when done
lib.do_something_with_context(context, some_data)
lib.free_context(context)

I’ve omitted error handling, but in reality each one of those library calls returns an error code which must be checked and handled after each call (no exceptions in C, remember!).
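
To keep that checking from swamping the code, one option is a small helper that turns error codes into exceptions. Here is a minimal sketch; the zero-means-success convention and the check() helper are my own assumptions, not part of the actual library:

class LibraryError(Exception):
  """Raised when a C library call reports failure."""

def check(error_code):
  # Assumes the common C convention: 0 on success, a nonzero
  # code on failure. (An assumption, not the real library's
  # documented behavior.)
  if error_code != 0:
    raise LibraryError("library call failed with code %d" % error_code)

check(lib.create_context(context))
check(lib.do_something_with_context(context, some_data))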

In order to simplify the process of working with the context, and especially of ensuring that it is still freed if an exception is raised while it is in use, I have found Python’s context management protocol helpful.

Using contextlib.contextmanager, we can create a simple context manager that encapsulates the process of creating and freeing the C library’s context.

from contextlib import contextmanager

@contextmanager
def application_context():
  context = ffi.new("ApplicationSpecificStruct *")
  lib.create_context(context)

  try:
    yield context
  finally:
    # runs even if the with-block raises, so the context is always freed
    lib.free_context(context)

with application_context() as the_context:
  lib.do_something_with_context(the_context, some_data)

If an exception is raised inside the with-block, the C library’s context will still get freed.

Now, I suspect some will argue that this is a cop-out—that the more Pythonic thing to do would be to create proper object-oriented wrappers for the library’s context types. There are certain advantages to this approach; principally, it enables cleaner code, in which lib.do_something_with_context(the_context, some_data) becomes simply context.do_something(some_data).

But building fully object-oriented wrappers is both more tedious and more time-consuming, and for what I’m doing the context manager approach is perfectly suitable. Besides, even if I’d implemented fully object-oriented wrappers, I’d still want them to implement the context manager protocol—it’s the Pythonic way to ensure that a resource is closed after it’s been used, as in this example from the Python documentation:

with open("hello.txt") as f:
  for line in f:
    print line,
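
For illustration, here is a minimal sketch of what such a wrapper might look like, reusing the pseudo-Python names from above (the ApplicationContext class and its do_something method are my own naming, not the library’s):

class ApplicationContext(object):
  """Object-oriented wrapper around the C library's context."""

  def __init__(self):
    self._context = ffi.new("ApplicationSpecificStruct *")
    lib.create_context(self._context)

  def do_something(self, some_data):
    lib.do_something_with_context(self._context, some_data)

  # Implementing the context management protocol lets the wrapper
  # be used in a with-statement, just like the contextmanager version.
  def __enter__(self):
    return self

  def __exit__(self, exc_type, exc_value, traceback):
    lib.free_context(self._context)
    return False  # do not suppress exceptions from the with-block

with ApplicationContext() as context:
  context.do_something(some_data)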

Announcing htmlbib, a tool for rendering BibTeX files as interactive HTML

For some time now, I’ve been working on an annotated bibliography of articles on various topics in transportation (particularly the history of automatic fare collection from 1960 to the present, as well as the SelTrac train control system and its origins in Germany). I’ve been compiling the information using BibDesk, and I’d like to be able to share it with a wider audience, in the hope that it might be useful to someone.

At a bare minimum, posting the BibTeX file online somewhere would fulfill my desire to get the information out there. But not everyone who might benefit from the bibliography uses BibTeX. For many people, I fear a .bib file would be nothing more than unintelligible gibberish; outside of academic circles (and even there, outside of the hard sciences), TeX is not particularly well-known.

The next alternative would be to post the bibliography online as a PDF or HTML file. This is considerably more accessible to non-BibTeX users, but it actually makes life harder for people who would like to copy references (as BibTeX source) into their own BibTeX files, a common practice in communities of TeX users. Merely rendering the entire contents of the file also loses some of the metadata: the comments associated with entries, the groups and keywords, and so on.

There are also specialized tools (like bibtex2html) for converting a BibTeX file to HTML. But there, too, the results fall short; the output is mostly static text. I wanted a tool that would make good use of the keywords entered in BibDesk, and which would provide links between publications and authors. I also wanted a tool that would be equally useful for BibTeX users (who are helped by having access to the BibTeX source for each entry) and non-BibTeX users (who are helped by having formatted bibliography entries). I therefore set out to build a tool that would meet my needs; the result is htmlbib.

One of my concerns was that the bibliography entries be formatted properly; having taken care to enter the information into BibDesk so that it would render well, I did not want some generic template generating the HTML for each entry. So I ended up cobbling together an arrangement that actually uses BibTeX and tex4ht to produce HTML for each entry using the desired BibTeX style (in my case, IEEEtran), so that the entries look the same in the preview as they would in an actual publication. This is slow, but the preview results are cached, so subsequent runs are faster.
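
The caching side of this is straightforward in principle. A sketch of the idea (render_entry_html here is a hypothetical stand-in for the BibTeX-and-tex4ht pipeline, not htmlbib’s actual code):

import hashlib
import os

CACHE_DIR = "preview_cache"  # hypothetical cache location

def cached_preview(bibtex_source):
  # Key the cache on a hash of the entry's BibTeX source, so an
  # entry is only re-rendered when its source actually changes.
  if not os.path.isdir(CACHE_DIR):
    os.makedirs(CACHE_DIR)
  key = hashlib.sha1(bibtex_source.encode("utf-8")).hexdigest()
  path = os.path.join(CACHE_DIR, key + ".html")

  if os.path.exists(path):
    with open(path) as f:
      return f.read()

  # render_entry_html is a hypothetical stand-in for the actual
  # BibTeX + tex4ht rendering step described above.
  html = render_entry_html(bibtex_source)
  with open(path, "w") as f:
    f.write(html)
  return html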

As for parsing the BibTeX file: since I’m already familiar with scripting BibDesk, I decided to use appscript to call BibDesk from Python. The result is therefore not portable beyond OS X, but it suits my needs. There are BibTeX parsing libraries for Python, so porting to another platform would only require substituting one of those libraries for the calls to BibDesk; the rest is pure Python, with the exception of lxml and the aforementioned preview code, which expects a functioning TeX installation on the system.

The HTML is produced using Jinja2 templates, which for now are stored in the application egg. The default, built-in template uses Blueprint CSS and jQuery along with jQuery Tools. It wouldn’t be too hard to provide an option for using user-specified templates instead of the built-in one.
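
Loading templates out of the egg is the sort of thing Jinja2’s PackageLoader handles; a minimal sketch, where the package and template names are illustrative rather than htmlbib’s actual ones:

from jinja2 import Environment, PackageLoader

# Looks up templates in the 'templates' directory of the installed
# htmlbib package, even when the package is distributed as an egg.
env = Environment(loader=PackageLoader("htmlbib", "templates"))
template = env.get_template("bibliography.html")
html = template.render(entries=[])  # real code would pass the parsed entries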

I’ve uploaded some sample output to demonstrate what htmlbib does.

Updates to tph.py

I have made some updates to tph.py, my tool for generating plots of transit service levels from GTFS feeds. Most importantly, these updates fix compatibility issues which kept it from working with certain agencies’ GTFS feeds. To start, here are two plots generated from the BART GTFS feed, using the new version of tph.py.

The first plot’s not actually all that useful, since it mixes up rail service and the AirBART bus shuttle, but it does demonstrate a point, which will be explained later. The second plot is actually useful; it readily demonstrates that the BART system was not designed to provide frequent service to the extremities of the network—while service through the core peaks at 22 trains per hour, most branches on average get four trains per hour.

Anyway, supporting the BART GTFS feed required two major changes: supporting feeds which do not use the direction_id field, and supporting feeds which use the frequencies.txt file (used in the BART feed for AirBART, hence its inclusion above) rather than explicit stop times for every trip. As a result of these changes, tph.py should now support any GTFS feed. However, feeds which do not use the direction_id field do require additional configuration to assign directions to routes and trips; this is all covered in the new documentation on the configuration file format.
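
Handling frequencies.txt essentially means expanding each (start_time, end_time, headway_secs) entry into hourly trip counts, rather than counting explicit stop times. A sketch of the idea (not tph.py’s actual code):

def trips_per_hour(start_secs, end_secs, headway_secs):
  """Distribute a frequency-based service window over hourly buckets.

  start_secs and end_secs are seconds after midnight, as in
  frequencies.txt; returns a dict mapping hour -> trip count.
  """
  counts = {}
  departure = start_secs
  while departure < end_secs:
    hour = departure // 3600
    counts[hour] = counts.get(hour, 0) + 1
    departure += headway_secs
  return counts

# e.g. a 10-minute headway from 06:00 to 08:00:
print(trips_per_hour(6 * 3600, 8 * 3600, 600))  # {6: 6, 7: 6}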

In addition, tph.py’s innards have had an overhaul; it no longer uses Google’s transitfeed module for parsing GTFS feeds. Instead, it uses a fork of the gtfs module, which imports the feed and stores it in a SQLite database using SQLAlchemy. This takes time upfront, but makes tph.py a lot faster to run. It also makes the code cleaner; some operations which previously required several nested for loops can now be done with a single SQL query.
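
As an example of the kind of thing this enables, counting departures per hour at a stop collapses into one query; the table and column names below follow the GTFS file layout and are only my guess at the gtfs module’s schema:

import sqlite3

conn = sqlite3.connect("feed.db")  # hypothetical database file name

# GTFS departure_time values look like "HH:MM:SS" (and may exceed
# 24:00:00 for service past midnight), so the hour is the leading field.
query = """
    SELECT CAST(substr(departure_time, 1, 2) AS INTEGER) AS hour,
           COUNT(*) AS departures
    FROM stop_times
    WHERE stop_id = ?
    GROUP BY hour
    ORDER BY hour
"""
for hour, departures in conn.execute(query, ("SOME_STOP_ID",)):
  print(hour, departures)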

Announcing aerodb

I have, on several occasions, written about using data from the DBpedia project to produce a freely-available database of aerodromes, aerodrome identifiers, and locations. I previously presented a SPARQL query which could be used to perform the necessary extraction from the DBpedia SPARQL endpoint. Now, I am releasing aerodb, a Python project which encapsulates that SPARQL query in a command-line tool and provides other utilities for working with the data.

The raw data extracted from Wikipedia is noisy; some location identifiers are used in more than one article, resulting in duplicates. aerodb includes a file, produced by manually inspecting the linked Wikipedia articles, which is used to de-duplicate these entries. The README file for the project contains more information (including how Wikipedians can help), so I won’t duplicate all that here.
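
To give a flavor of the extraction, here is an illustrative query in the same spirit, run from Python with SPARQLWrapper; the class and property names are my recollection of the DBpedia ontology and are not necessarily the exact ones aerodb uses:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?aerodrome ?icao ?lat ?long WHERE {
      ?aerodrome a dbo:Airport ;
                 dbo:icaoLocationIdentifier ?icao ;
                 geo:lat ?lat ;
                 geo:long ?long .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
  print(row["icao"]["value"], row["lat"]["value"], row["long"]["value"])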

The final result contains 8,522 aerodromes (based on the data presently available from DBpedia, which will change over time). You can get the results as a JSON, CSV, or KML file from the GitHub downloads page. (The KML file is rather fun to look at in Google Earth; it gets a bit sluggish and has some rendering issues, but works well otherwise.)

(Note: I use the term ‘aerodrome’ where others might use the more common ‘airport’ because there is a legal distinction in some countries, including Canada, where ‘aerodrome’ is the most widely applicable term; that is to say, aerodb produces a database containing aerodromes which are, by the Canadian definition, not airports.)

A Python script for visualizing GTFS data

See here for recent updates to tph.py; the software has since moved from Google’s transitfeed module to the SQLite-backed gtfs library, and some of the restrictions described below have been removed.

Over the weekend, I put together a little Python project which I am now releasing: tph.py, a tool for visualizing transit service levels using data from GTFS feeds. Over at Raschke on Transport, I’ve posted some examples, but here I’d like to discuss the technical underpinnings.

The script uses Google’s transitfeed library to parse the GTFS data. Loading a large GTFS schedule with transitfeed can take several minutes, but I wouldn’t consider that unexpected when dealing with a CSV file that might be close to 200 MB. Still, I’d like to see a uniform way of getting GTFS data into (for example) a SQLite database, so that SQL queries can be issued against the dataset. In addition, the script depends on certain optional fields being present in the GTFS dataset; in particular, trips must use direction_id, and either trips or stop times should have a headsign defined (if neither does, a headsign is synthesized from the last stop of the trip). These issues could be worked around, but for now it’s easier to assume that these fields will be present.
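
The loading and the headsign fallback look roughly like this; a sketch from memory of transitfeed’s API, not the script’s actual code:

import transitfeed

schedule = transitfeed.Loader("feed.zip").Load()

def headsign_for(trip):
  # Prefer the explicit trip_headsign; otherwise fall back to the
  # name of the trip's final stop.
  if trip.trip_headsign:
    return trip.trip_headsign
  return trip.GetStopTimes()[-1].stop.stop_name

for trip in schedule.GetTripList():
  print(trip.trip_id, headsign_for(trip))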

After having extracted the hourly service values for each of the target routes, the script uses matplotlib to generate the service graph. The matplotlib API is complex, but even so I was able to generate the plot I wanted without too much effort.
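
The core of the plotting is simple; a minimal sketch of a service-level bar chart of the kind described (the data here is invented for illustration):

import matplotlib.pyplot as plt

hours = range(24)
trips = [0, 0, 0, 0, 0, 2, 6, 10, 10, 6, 4, 4,
         4, 4, 4, 6, 10, 10, 6, 4, 2, 2, 1, 0]  # trips per hour (made up)

fig, ax = plt.subplots()
ax.bar(hours, trips)
ax.set_xlabel("Hour of day")
ax.set_ylabel("Trips per hour")
ax.set_title("Service levels (illustrative data)")
fig.savefig("service.png")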

Because the most time-intensive part of the script is the initial step of loading the GTFS schedule into memory, I designed the script to generate multiple plots in each run. However, I could also see the script being used interactively—matplotlib can be used to generate clickable, interactive plots, and I can envision a future version which would allow a user to click on a particular hour and drill down and further examine the service for that hour (breaking the runs down by destination, for example).
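
matplotlib’s picking support is roughly what that would take; a minimal sketch of the mechanism, with the actual drill-down left as a stub:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
bars = ax.bar(range(24), [max(0, 10 - abs(h - 8)) for h in range(24)])
for bar in bars:
  bar.set_picker(True)  # make each bar respond to mouse clicks

def on_pick(event):
  # event.artist is the Rectangle that was clicked; the bar's
  # center along the x-axis identifies the hour to drill into.
  hour = int(round(event.artist.get_x() + event.artist.get_width() / 2))
  print("drill down into hour %d" % hour)

fig.canvas.mpl_connect("pick_event", on_pick)
plt.show()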