First airports, now airlines

Last month, I wrote about using the DBpedia project to extract data about airports from Wikipedia articles. The next logical extension is to use DBpedia to try to generate a similar data set for airlines, including their IATA and ICAO designators. Unfortunately, it looks like it won’t be nearly as easy to extract airline data as it was to extract airport data. Here’s why:

  • DBpedia does not extract airline ICAO and IATA codes: While DBpedia’s taxonomy includes a class for airlines, the infobox mapping does not extract ICAO and IATA codes. Based on my understanding of DBpedia’s architecture, this can be fixed easily.
  • Many airlines don’t actually have a Wikipedia article: Curiously, that doesn’t mean their information is missing from Wikipedia; it just means it appears only in the list of airlines, a page that also includes many defunct airlines and is probably machine-parsable only with some effort. Either way, there’s no {{Infobox airline}} for these airlines, and so nothing for the DBpedia mapper to extract.
  • Duplicate codes: While ICAO airline designators are guaranteed to be unique, IATA airline designators are not. There’s no good way to automatically disambiguate a code that has been assigned more than once. This is not a fault of DBpedia or Wikipedia, though; the IATA created this mess by deciding to reuse codes from defunct airlines and to assign the same code to airlines operating in geographically separate areas. No matter where you get your data from, whether Wikipedia or straight from the IATA, you’ll have to contend with this problem.
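
To make the duplicate-code problem concrete, here is a minimal Python sketch of one way a data set could be indexed so that shared IATA codes are at least detected rather than silently overwritten. Everything in it is an assumption for illustration: the `Airline` record, the helper names, and the three sample airlines are made up and are not drawn from DBpedia, Wikipedia, or the IATA.

```python
from collections import defaultdict
from typing import NamedTuple


class Airline(NamedTuple):
    name: str
    icao: str  # three-letter ICAO designator, treated here as unique
    iata: str  # two-character IATA designator, may be shared


# Made-up records for illustration only; not real airline data.
AIRLINES = [
    Airline("Example Air", "EXA", "XA"),
    Airline("Sample Regional", "SMP", "XA"),
    Airline("Placeholder Airways", "PLC", "XB"),
]


def index_by_iata(airlines):
    """Group airlines by IATA code; one code may map to several airlines."""
    index = defaultdict(list)
    for airline in airlines:
        index[airline.iata].append(airline)
    return dict(index)


def ambiguous_codes(index):
    """Return the IATA codes that are assigned to more than one airline."""
    return sorted(code for code, group in index.items() if len(group) > 1)


# ICAO codes can key a plain dict; IATA codes need a list per key.
by_icao = {airline.icao: airline for airline in AIRLINES}
by_iata = index_by_iata(AIRLINES)

print(ambiguous_codes(by_iata))
```

The point of the design is only that an IATA lookup returns a list of candidates. Picking the right one still needs outside context, such as the region or the date of the flight, which this sketch does not attempt.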