When some data is better than no data
Last week I learned, by way of Peter Hicks’ recent FOI request to TfL, that the open data formerly provided at timetables.data.tfl.gov.uk had in fact been formally withdrawn, without much fanfare (and without a suitable replacement).
As background, the Common User Format is a specification developed by London Underground to communicate timetables from the scheduling system (a thing called CART, I believe?) to all of the various downstream consumers—presumably including the (semi-manual) preparation of printed Working Timetables and Timetable Notices, the punching of programme machine rolls and, one imagines, the programme machines’ solid-state successors, as well as the loading of CBTC backend systems, such as the SelTrac SRS.
VERY LOUD EXCITED SOUND https://t.co/6gYRXQKHL4 — Kurt Raschke (@kurtraschke) September 10, 2018
I first discovered this dataset in September 2018, and I instantly realized it was the “missing link” I’d been looking for, which I’d deduced had to exist in some form, somewhere. While CUF is very much a “legacy” format, the public dataset is itself fairly new—the first post about it on TfL’s Tech Forum is from January 2018. It’s a fairly simple dataset, too—just an Amazon S3 bucket full of Zip archives of CSV files. Unlike TfL’s Journey Planner dataset, which describes what trains do from a passenger-facing perspective, CUF timetables describe what the trains themselves do, and consequently the data include attributes like train and trip numbers and other operational parameters, as well as including certain non-revenue train movements.
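To give a feel for how simple the packaging was, here is a minimal sketch of consuming such an archive (Python here for brevity, though my own tooling was Java; the file name and columns below are invented for illustration—the real record layouts come from the CUF specification):

```python
import csv
import io
import zipfile

# Stand-in for one archive downloaded from the (former) S3 bucket.
# "TRIPS.csv" and its columns are hypothetical examples, not the real layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("TRIPS.csv", "trip_number,train_number\n101,201\n")

# Enumerate the CSV members and parse their rows.
with zipfile.ZipFile(buf) as zf:
    for name in zf.namelist():
        with zf.open(name) as f:
            rows = list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8")))
        print(name, rows)
# prints: TRIPS.csv [{'trip_number': '101', 'train_number': '201'}]
```

That really is all there is to it: unzip, read CSV, done.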
otherwise, how would the programme machine rolls get punched, how would the SelTrac SRS get loaded, etc., etc.? — Kurt Raschke (@kurtraschke) September 10, 2018
And now we know.
It took a while to really get going (such is the fate of all side projects), but by late 2019 I’d built a Java parser and object model for CUF timetables, discovered the UK bank holidays API, and built a bunch of other scaffolding (such as this mapping of London Underground station codes across systems1). My goal, by the way, was to produce standards-compliant GTFS and GTFS-realtime feeds for the London Underground2, using an amalgamation of the TransXChange Journey Planner dataset, CUF timetables, NaPTAN, the legacy Trackernet API, and a few other things. This has long, long been my “white whale”, so much so that I once started writing code to scrape the PDF versions of Working Timetables before abandoning that in frustration. I’d already built a TransXChange to GTFS converter years ago, for fun, so the new part was matching TransXChange trips to CUF trips, and thence to data from the Trackernet API. It didn’t always work well, but by early 2020 I thought I had a working prototype. Optimistically, I thought I might have something to show off at TransportationCamp New England 20203. Given the long-standing relationship between the MIT Transit Lab and TfL, I thought this would be especially fitting.
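The trip-matching step can be sketched in a few lines (again in Python for illustration; the field names and the matching key here are hypothetical simplifications—the real matching needed more care, as the footnote about the Hammersmith & City line attests):

```python
# Hypothetical join of passenger-facing (TransXChange) journeys to
# operational (CUF) trips, keyed on origin station and departure time.
# Station codes, IDs, and times below are all invented for illustration.
txc_trips = [
    {"journey_id": "VJ1", "origin": "OXC", "departs": "08:02"},
    {"journey_id": "VJ2", "origin": "OXC", "departs": "08:05"},
]
cuf_trips = [
    {"train_number": "201", "trip_number": "7", "origin": "OXC", "departs": "08:02"},
]

# Index the operational trips, then look each journey up by the shared key.
cuf_by_key = {(t["origin"], t["departs"]): t for t in cuf_trips}
matched = {
    j["journey_id"]: cuf_by_key.get((j["origin"], j["departs"]))
    for j in txc_trips
}
print(matched["VJ1"]["train_number"])  # -> 201
print(matched["VJ2"])                  # -> None (no operational match found)
```

The matched train and trip numbers are exactly what the Journey Planner dataset lacks, and exactly what you need to line a scheduled trip up against Trackernet observations.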
Then, the pandemic hit. TransportationCamp New England 2020 never happened, and for a while I wanted very much to think about things other than transit when I wasn’t working. But eventually service on the London Underground began to return to normal, and I started picking up side projects again. But there were no recent timetables in the S3 bucket.
I’d had, in the back of my mind, a certain fear that this wonderful open data resource might just have been a casualty of the pandemic. But it wasn’t until Peter Hicks’ recent FOI request came back that we had official confirmation that this dataset was indeed done for. So much so that TfL intends to delete the S3 bucket entirely, to keep people from being misled by its existence.
This website is no longer live or maintained, this has been superseded by data provision through the APIs – see further detail below. (We will be speaking to the relevant team to close this particular site to avoid confusion in the future).
But why? Why take away this useful, valuable open data resource?
I can think of a few reasons why this decision might have been made. Given the timing, the first that comes to mind is that it was deemed “too much work” to keep the S3 bucket up-to-date. This shouldn’t have been the case—it’s just a matter of dropping some .zip files in an S3 bucket, and should have been easy to automate—but perhaps there were other internal technical hurdles to overcome.
The other possibility (the one I worry about more) is that, as a legacy dataset of interest to a minority of developers, the CUF timetables were deemed ill-suited to an enterprise strategy of publishing open data through modern, standards-compliant APIs. Indeed, TfL’s response suggests this, wrongly claiming that the CUF dataset has been superseded by the TransXChange Journey Planner dataset (which it most certainly hasn’t!).
Don’t get me wrong, five-star Linked Open Data is great. But open data is better than no data. In today’s world the CUF format doesn’t have much going for it, but it works. And it’s really not that hard to work with; it didn’t take that long at all for me to build a Java object model and parser, turning the tangle of CSV rows into rich object relationships and enums. And even if I’d been handed a real mess—say, fixed-width records with text in EBCDIC and numbers in COBOL-style packed decimal—well, I’d have muddled through. Heck, I’d have muddled through if it meant reverse-engineering a proprietary format, since doesn’t everyone love an evening spent rummaging around with a hex editor?
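CUF never required anything of the sort, but for flavour, that worst-case packed-decimal decoding would look roughly like this (a sketch of the classic COBOL COMP-3 encoding: two BCD digits per byte, with the final nibble holding the sign):

```python
def unpack_comp3(data: bytes) -> int:
    """Decode a COBOL-style packed-decimal (COMP-3) field.

    Each byte holds two decimal digits as nibbles; the last
    nibble is the sign (0xD = negative, 0xC or 0xF = positive).
    """
    digits = 0
    sign = 1
    for i, b in enumerate(data):
        hi, lo = b >> 4, b & 0x0F
        if i < len(data) - 1:
            digits = digits * 100 + hi * 10 + lo
        else:
            # Final byte: one digit plus the sign nibble.
            digits = digits * 10 + hi
            sign = -1 if lo == 0xD else 1
    return sign * digits

print(unpack_comp3(bytes([0x12, 0x3D])))  # -> -123
print(unpack_comp3(bytes([0x04, 0x2C])))  # -> 42
```

Tedious, yes. Insurmountable, no. Which is rather the point: even the ugliest open format beats a dataset that no longer exists.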
In all seriousness, though, the point remains: while the CUF dataset may have served an obscure set of uses, and while it may not have fit with an enterprise vision of what modern open data looks like, it was a valuable, irreplaceable dataset.
Now, there is a possibility I should acknowledge: what if there’s some new internal timetable format, some bright, shiny, new replacement for CUF? It’s certainly possible, but the fact that TfL responded to an FOI request with a fresh set of CUF timetables suggests that the format still exists internally and is still in use.
In any case, the reality is that neither TfL’s Unified API nor the Journey Planner dataset completely replaces the CUF dataset for static timetables, in the same way that the Unified API does not fully replace the legacy Trackernet API (which TfL has also been talking about deprecating for a few years now).
I recognize that the use cases for which we really need the CUF dataset are indeed obscure ones—and if there were some better replacement, or if keeping it updated were some serious burden (and not just shoveling some .zip files into an S3 bucket, an eminently automatable task), I might not be as annoyed. But this seems to be a case of “this dataset is complicated, use the shiny new one!” (although in this case the Journey Planner dataset isn’t actually new either) without ensuring that the new dataset genuinely meets all of the use cases of the old one.
In the absence of further explanation from TfL, it all feels a bit antithetical to the principles of open data.
Sadly, not updated for Battersea Power Station Station. ↩
I’m aware that there are some data brokers who already provide this service—but I want to do it in the open, with publicly-available open data and open source software, because otherwise, what’s the point? ↩
In reality, there were tons of problems, many having to do with the Hammersmith & City line, but at least at the time they seemed surmountable, with enough fiddling and swearing. ↩