I have continued work on my project to extract free airport data from Wikipedia using DBpedia's SPARQL endpoint. My initial approach proved to be naïve, although not bad for a first SPARQL query. The current version of the query is more complex, but also generates considerably more useful data. So, what have I done to bloat a 15-line query to almost 70 lines? Here are the issues I addressed in revising the query:

  • Use geo:lat and geo:long rather than grs:point: I was originally using the grs:point predicate to get geographic data, but I later discovered that while some airports with geographic data do not have a grs:point, they all have the geo:lat and geo:long predicates. However, there's a further complication. Some airports have their latitude and longitude stored as xsd:double values, others as xsd:float values, and some as both. The query only needs to return one value for each coordinate, not both.
  • Capture FAA, TC LIDs: A lot of things that look like ICAO and IATA codes actually turn out to be Transport Canada (TC) or FAA location identifiers (LIDs). So, the query now has additional clauses which return the FAA or Transport Canada LID. However, the LID will only be returned if it differs from the airport's own ICAO or IATA code (or the airport does not have an ICAO or IATA code), and it does not conflict with another airport's ICAO or IATA code. Some airports have been assigned a LID by Transport Canada or the FAA which actually conflicts with another airport's ICAO or IATA code, and in those cases we prefer the internationally assigned code.
  • Request data in batches: It's not nice to scrape data in huge batches; besides, I think there may be an upper bound on the size of the result set that the DBpedia SPARQL endpoint will return. The revised query uses the OFFSET and LIMIT modifiers to fetch results in small batches.
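
To make the three rules above concrete, here is a minimal Python sketch of the logic, not the actual query. Everything named here is hypothetical: BASE_QUERY is a tiny stand-in for the real query, the geo: prefix is assumed to be the W3C Basic Geo vocabulary, run_query is whatever function you use to send a query string to the endpoint and get back a list of rows, and preferring xsd:double over xsd:float is an arbitrary choice. The stand-in query includes an ORDER BY because SPARQL does not guarantee a stable row order without one, which matters when paging with OFFSET.

```python
from typing import Callable, Iterator, Optional

# Hypothetical stand-in for the real (almost 70-line) airport query.
# Assumes geo: is the W3C Basic Geo vocabulary.
BASE_QUERY = """\
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?airport ?lat ?long WHERE {
  ?airport geo:lat ?lat ;
           geo:long ?long .
}
ORDER BY ?airport"""


def pick_coordinate(values: dict) -> Optional[float]:
    """Keep a single value when a coordinate is stored under several datatypes.

    `values` maps a datatype name ("double" or "float") to the literal value.
    Preferring "double" is an assumption made for this sketch.
    """
    for datatype in ("double", "float"):
        if datatype in values:
            return values[datatype]
    return None


def usable_lid(lid: Optional[str], own_codes: set, other_codes: set) -> Optional[str]:
    """Return the FAA/TC LID only if it adds information and causes no clash.

    `own_codes` holds this airport's ICAO/IATA codes (possibly empty);
    `other_codes` holds the ICAO/IATA codes of every other airport.
    """
    if lid is None or lid in own_codes or lid in other_codes:
        return None
    return lid


def paged_query(base_query: str, offset: int, limit: int) -> str:
    """Append LIMIT and OFFSET so that one request fetches one batch."""
    return f"{base_query}\nLIMIT {limit}\nOFFSET {offset}"


def fetch_all(run_query: Callable[[str], list], base_query: str,
              batch_size: int = 100) -> Iterator:
    """Yield every row, requesting `batch_size` rows at a time."""
    offset = 0
    while True:
        rows = run_query(paged_query(base_query, offset, batch_size))
        yield from rows
        if len(rows) < batch_size:  # a short batch means we reached the end
            break
        offset += batch_size
```

In the real query the coordinate and LID rules are expressed as SPARQL clauses rather than as post-processing; the helpers only spell out the decisions being made. A batch size of 100 is just a placeholder, so pick whatever the endpoint tolerates.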