I use ClickHouse for a wide range of data projects, and I recently needed to map geographic coordinates to the identifier of the time zone observed at those coordinates. The set of locations was small enough that I probably could have done it by hand, but I wanted an approach that would remain sustainable as the dataset grew.

Fortunately, Evan Siroky maintains timezone-boundary-builder, a tool which uses data from OpenStreetMap and the IANA time zone database¹ to produce a set of shapefiles of time zone boundaries.

How can we get this data into ClickHouse and query it efficiently? With ClickHouse’s POLYGON() dictionary layout, it turns out to be pretty easy. ClickHouse dictionaries are a kind of in-memory table designed for fast lookups, and polygon dictionaries are specifically designed to support fast point-in-polygon queries using a gridded index (similar to S2 or H3).
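To see the primitive in isolation first: ClickHouse's pointInPolygon() function tests a single point against a single polygon, given as an array of (x, y) vertex tuples. (This example is adapted from the ClickHouse documentation.)

```sql
-- Is the point (3, 3) inside the quadrilateral? Returns 1 (true).
SELECT pointInPolygon((3., 3.), [(6., 0.), (8., 4.), (5., 8.), (0., 2.)]) AS inside;
```

A polygon dictionary effectively answers this question against thousands of polygons at once, using its grid index to cheaply discard the polygons that cannot possibly contain the point.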

We begin by using the unified gdal command-line tool (introduced in GDAL 3.11) to output a GeoParquet file, which ClickHouse natively supports. (We also force the layer to consist uniformly of MultiPolygons, as mixed geometry types will cause problems later.) Because GDAL can read from HTTP sources and unpack Zip archives on the fly, we do not need to download the data first; we can let GDAL fetch it as part of the conversion.

$ gdal vector set-geom-type --multi --lco COMPRESSION=ZSTD /vsizip//vsicurl/https://github.com/evansiroky/timezone-boundary-builder/releases/download/2026a/timezones-1970.shapefile.zip/combined-shapefile-1970.shp timezones.parquet

(If you’re following these steps some time after publication, you may want to update the URL above to point to a more recent data release.)

We can then drop timezones.parquet in our ClickHouse server’s user_files directory. Returning to ClickHouse, we create a new table, a refreshable materialized view to populate that table, and, most importantly, a dictionary. Because ClickHouse natively supports GeoParquet, the WKB geometries in the Parquet file’s geometry column will automatically get translated into ClickHouse-style MultiPolygons. (If this were not the case we could have just called readWKBMultiPolygon() manually.)
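For reference, if the automatic GeoParquet translation were not available, the manual decoding step would look roughly like this sketch (hypothetical; it assumes the geometry column arrives as raw WKB bytes rather than a ready-made MultiPolygon):

```sql
-- Hypothetical fallback: decode WKB geometries by hand.
SELECT tzid, readWKBMultiPolygon(geometry) AS geometry
FROM file('timezones.parquet');
```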

CREATE USER IF NOT EXISTS dict_update NOT IDENTIFIED HOST LOCAL SETTINGS readonly=1;

CREATE DATABASE timezones;

CREATE TABLE timezones.timezones
(
    tzid     String,
    geometry MultiPolygon
)
ENGINE = MergeTree()
ORDER BY tzid;

GRANT SELECT ON timezones.timezones TO dict_update;

CREATE MATERIALIZED VIEW timezones.rmv_timezones
REFRESH EVERY 24 HOURS TO timezones.timezones AS
SELECT tzid, geometry
FROM file('timezones.parquet')
WHERE tzid IN (SELECT time_zone FROM system.time_zones);

CREATE DICTIONARY timezones.dict_timezones
(
    geometry Array(Array(Array(Tuple(Float64, Float64)))),
    tzid     String
)
PRIMARY KEY geometry
SOURCE (CLICKHOUSE(DATABASE 'timezones' TABLE 'timezones' USER 'dict_update'))
LAYOUT (POLYGON(STORE_POLYGON_KEY_COLUMN 1))
LIFETIME (86400);
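With the DDL in place, we can check on the dictionary via the system tables. (By default, dictionaries load lazily on first access, so a status of NOT_LOADED is normal until the first lookup.)

```sql
SELECT name, status, element_count
FROM system.dictionaries
WHERE database = 'timezones';
```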

This may appear somewhat overcomplicated (after all, we could have just inserted timezones.parquet directly into a new table using a local ClickHouse client), but the intention is to create a durable data pipeline which makes it easy to update the data whenever a new dataset is released. Thanks to the refreshable materialized view and the file() table function, all we have to do is drop a new copy of timezones.parquet in place and wait for the view to refresh (or trigger it manually). You may be wondering why we can't skip the table and use the FILE() dictionary source directly. The answer is that it doesn't seem to support reading GeoParquet, and it doesn't support expressions either, so we can't add a call to readWKBMultiPolygon(). So, we use an intermediate table. No great harm done.
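For the record, forcing a refresh by hand is a single statement:

```sql
SYSTEM REFRESH VIEW timezones.rmv_timezones;
```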

Anyway, with all of this in place, we can now perform spatial lookups and get back a timezone identifier:

SELECT dictGet(timezones.dict_timezones, 'tzid', (-74.04, 40.71));
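The same lookup works in bulk. For example, against a handful of inline points (note that the key tuples are (longitude, latitude), matching the shapefile's coordinate order; the expected results assume the 1970-merged dataset):

```sql
SELECT lon, lat, dictGet(timezones.dict_timezones, 'tzid', (lon, lat)) AS tzid
FROM values('lon Float64, lat Float64',
    (-74.04, 40.71),  -- New York area
    (2.35, 48.86),    -- Paris
    (139.69, 35.69)); -- Tokyo
```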

At least on my server, the initial dictionary load is sluggish (around two minutes), but after that the queries fly. Looking up time zones for 100,000 randomly generated points completes in under half a second, whereas I gave up waiting for the equivalent naive query with pointInPolygon() after several minutes.
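For comparison, the naive approach looks something like this sketch: a full scan that runs a point-in-polygon test against every row, with no index to prune candidates. (The exact shape of the lambda may need adjusting for your ClickHouse version.)

```sql
-- Naive full-scan lookup: one pointInPolygon() test per stored polygon.
SELECT tzid
FROM timezones.timezones
WHERE arrayExists(poly -> pointInPolygon((-74.04, 40.71), poly), geometry)
LIMIT 1;
```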

And now, with all that out of the way, I can get back to the actual project which led to this!

  1. which I still have the unfortunate habit of calling “the Olson tz database” after its long-time maintainer, Arthur David Olson