Announcing aerodb

I have, on several occasions, written about using data from the DBpedia project to produce a freely-available database of aerodromes, aerodrome identifiers, and locations. I previously presented a SPARQL query which could be used to perform the necessary extraction from the DBpedia SPARQL endpoint. Now, I am releasing aerodb, a Python project which encapsulates the SPARQL query in a command-line tool and provides other utilities for working with the data. The raw data extracted from Wikipedia is noisy; some location identifiers are used in more than one article, resulting in duplicates. aerodb includes a de-duplication file, produced by manually inspecting the linked Wikipedia articles, which is used to resolve these entries. The README file for the project contains more information (including how Wikipedians can help), so I won’t duplicate all that here.
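
The de-duplication step can be pictured with a short sketch. This is an illustration only, not aerodb’s actual code: the field names (`ident`, `uri`) and the shape of the override mapping are assumptions.

```python
# Hypothetical sketch of de-duplication. `keep` maps a multiply-used
# identifier to the one article (a DBpedia resource URI) that should retain
# it, as determined by manual inspection. Records that point the same
# identifier at any other article are dropped.
def deduplicate(records, keep):
    result = []
    for rec in records:
        ident, uri = rec["ident"], rec["uri"]
        if ident in keep and uri != keep[ident]:
            continue  # identifier claimed by a different article; drop it
        result.append(rec)
    return result
```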

The final result contains 8,522 aerodromes (based on the data presently available from DBpedia, which will change over time). You can get the results as a JSON, CSV, or KML file from the GitHub downloads page. (The KML file is rather fun to look at in Google Earth; it gets a bit sluggish and has some rendering issues, but works well otherwise.)
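
For the curious, a minimal KML document is easy to generate from records like these. The following sketch is not aerodb’s actual export code (the record field names are assumptions); it just shows the essential structure. Note that KML wants coordinates in longitude,latitude order.

```python
# Sketch: render a list of {'name', 'lat', 'lon'} dicts as minimal KML.
from xml.sax.saxutils import escape

def to_kml(aerodromes):
    placemarks = "".join(
        "<Placemark><name>{}</name>"
        "<Point><coordinates>{},{}</coordinates></Point>"
        "</Placemark>".format(escape(a["name"]), a["lon"], a["lat"])
        for a in aerodromes
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
            + placemarks + "</Document></kml>")
```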

(Note: I use the term ‘aerodrome’ where others might use the more common ‘airport’ because there is a legal distinction in some countries, including Canada, where ‘aerodrome’ is the most widely applicable term. That is to say, aerodb produces a database which contains aerodromes which are, by the Canadian definition, not airports.)

First airports, now airlines

Last month, I wrote about using the DBpedia project to extract data about airports from Wikipedia articles. The next logical extension is to use DBpedia to try to generate a similar data set for airlines, including their IATA and ICAO designators. Unfortunately, it looks like it won’t be nearly as easy to extract airline data as it was to extract airport data. Here’s why:

  • DBpedia does not extract airline ICAO and IATA codes: While DBpedia’s taxonomy includes a class for airlines, the infobox mapping does not extract ICAO and IATA codes. Based on my understanding of DBpedia’s architecture, this can be fixed easily.
  • Many airlines don’t actually have a Wikipedia article: Curiously, that doesn’t mean their information isn’t in Wikipedia; it just means it’s only in the list of airlines, a page which lists many defunct airlines, and is probably only machine-parsable with some effort. Either way, that means there’s no {{Infobox airline}}, and so nothing for the DBpedia mapper to extract.
  • Duplicate codes: While ICAO airline designators are guaranteed to be unique, IATA airline designators are not. There’s no good way to automatically disambiguate usage of a multiply-assigned code. This is not a fault of DBpedia or Wikipedia, though; the IATA created this mess by deciding to reuse codes from defunct airlines, and assign the same code to airlines operating in geographically separate areas. No matter where you get your data from—Wikipedia or straight from the IATA—you’ll have to contend with this problem.
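
To make the third point concrete, here is a small sketch (with made-up airline records) of how multiply-assigned IATA designators surface as soon as you group a data set by code:

```python
# Group airlines by IATA designator and report any code assigned to more
# than one carrier. The records here are illustrative, not real data.
from collections import defaultdict

def find_duplicate_iata(airlines):
    by_code = defaultdict(list)
    for airline in airlines:
        by_code[airline["iata"]].append(airline["name"])
    return {code: names for code, names in by_code.items() if len(names) > 1}
```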

More free airport data

I have continued work on my project to extract free airport data from Wikipedia using DBpedia’s SPARQL endpoint. My initial approach proved to be naïve, although not bad for a first SPARQL query. The current version of the query is more complex, but also generates considerably more useful data. So, what have I done to bloat a 15-line query to almost 70 lines? Here are the issues I addressed in revising the query:

  • Use geo:lat and geo:long rather than grs:point: I was originally using the grs:point predicate to get geographic data, but I later discovered that some airports with geographic data do not have a grs:point, while all of them have the geo:lat and geo:long predicates. However, there’s a further complication: some airports have their latitude and longitude stored as xsd:double values, others as xsd:float values, and some have both. We only need one or the other, not both.
  • Capture FAA, TC LIDs: A lot of things that look like ICAO and IATA codes actually turn out to be Transport Canada or FAA location identifiers. So, the query now has additional clauses which return the FAA or Transport Canada LID. However, the FAA or TC LID will only be returned if the LID differs from the airport’s ICAO or IATA code (or the airport does not have an ICAO or IATA code), and the LID does not conflict with another airport’s ICAO or IATA code. Some airports have been assigned a LID by Transport Canada or the FAA which actually conflicts with another airport’s ICAO or IATA code, and in those cases we prefer the internationally-assigned code.
  • Request data in batches: It’s not nice to scrape data in huge batches; besides I think there may be an upper bound on the size of a result set the DBpedia SPARQL endpoint will return. The revised query uses the OFFSET and LIMIT modifiers to fetch results in small batches.
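
The batching can be sketched in a few lines of Python. The function names here are illustrative, not aerodb’s actual API; `run_query` stands in for whatever callable talks to the SPARQL endpoint and returns a list of result rows.

```python
# Fetch results in batches by appending OFFSET/LIMIT to the query, stopping
# when a batch comes back short of the batch size.
def paged_queries(query, batch_size=500):
    offset = 0
    while True:
        yield "{} OFFSET {} LIMIT {}".format(query, offset, batch_size)
        offset += batch_size

def fetch_all(run_query, query, batch_size=500):
    results = []
    for paged in paged_queries(query, batch_size):
        batch = run_query(paged)  # e.g. an HTTP request to the endpoint
        results.extend(batch)
        if len(batch) < batch_size:
            break
    return results
```

One caveat: without an ORDER BY clause, SPARQL makes no guarantee that OFFSET paging is stable across requests, so a query paged this way should impose an ordering.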

Cheap airport data

Let’s say you wanted to have a database of airports, including each airport’s ICAO and IATA codes, plus the airport’s name and location. You could pay ICAO $185.00 for a paper copy of Document 7910, Location Indicators. That gets you a 252-page tome which is probably not that useful for programmatic lookup. Instead, you could pay $945.00 for an annual subscription to the online version of Document 7910.

Or, you could execute a fifteen-line SPARQL query against the DBpedia SPARQL endpoint:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX grs: <http://www.georss.org/georss/>
SELECT ?name ?icao ?iata ?coordinates ?airport
WHERE {
    ?airport rdf:type <http://dbpedia.org/ontology/Airport> .
    ?airport dbo:icaoLocationIdentifier ?icao .
    FILTER regex(?icao, "^[A-Z0-9]{4}$")
    ?airport dbo:iataLocationIdentifier ?iata .
    FILTER regex(?iata, "^[A-Z0-9]{3}$")
    OPTIONAL {
        ?airport rdfs:label ?name
        FILTER ( lang(?name) = "en" )
    }
    ?airport grs:point ?coordinates .
}
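
Running the query amounts to sending it to the endpoint as an ordinary HTTP request. Here is a minimal sketch using only the standard library; the `format` parameter requesting JSON results is what the public Virtuoso-based DBpedia endpoint accepts, but treat the details as assumptions rather than a definitive client.

```python
# Build a GET URL asking the DBpedia SPARQL endpoint for JSON results.
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

def build_request_url(query):
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)
```

The resulting URL can then be fetched with `urllib.request.urlopen` and the body parsed with the `json` module.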

That gets you around 2,000 airports, their ICAO and IATA codes, and their locations. The usual disclaimers concerning Wikipedia data apply, but it’s not bad for a free dataset. It would be great if ICAO would release their data for free, and even better if they’d subscribe to the W3C’s Linked Open Data principles, but until that happens, this is a good substitute.

Not only that, but by being based on Wikipedia data, end-users can play a direct role in maintaining data quality. See bad data, fix it in Wikipedia, and see the updates in DBpedia and your own applications. For example, some airports have invalid values for their ICAO and/or IATA codes (which necessitates the two FILTER clauses in the SPARQL above). By inverting those two filters, you can generate a list of airports with invalid ICAO and/or IATA codes in their Wikipedia entries. You can then take that list, go back to Wikipedia, and make the necessary updates (or verify that updates have been made since March 2010, when the current DBpedia dataset was extracted from Wikipedia).

It’s my understanding that DBpedia will at some point in the future move to a live-update model, where updates to Wikipedia will be funneled through the extraction process and into DBpedia on a real-time basis.
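
The same pattern checks that the FILTER clauses perform can also be applied locally to flag bad records. A sketch, assuming each record is a dict with `icao` and `iata` keys:

```python
# Flag airports whose ICAO or IATA code fails the same patterns the SPARQL
# FILTER clauses enforce (4 and 3 uppercase alphanumerics, respectively).
import re

ICAO_RE = re.compile(r"^[A-Z0-9]{4}$")
IATA_RE = re.compile(r"^[A-Z0-9]{3}$")

def invalid_codes(airports):
    return [a for a in airports
            if not ICAO_RE.match(a.get("icao", ""))
            or not IATA_RE.match(a.get("iata", ""))]
```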