CHUNK BY

chunk_by(chunkkey: tuple, chunksize: int)

  • Parameters
    • chunkkey: tuple - the key or composite key to chunk on (using cityHash64)
    • chunksize: int - size of processing splits, generally no more than the number of CPU cores.
  • Returns: vulkn.datatable.SelectQueryDataTable

CHUNK BY is a Vulkn extension that addresses some performance bottlenecks in ClickHouse in cases where finalizing a function, generally an aggregation, results in utilization of only a single thread. This is only of use when a result or aggregation can be multiplexed by a key.

Example

When using histogram function across unique keys we can accelerate the execution by telling ClickHouse to split the query into multiple chunks/keys and process these independently.

Either of the following queries:

v.table('timeseries_devices').select('key',funcs.agg.histogram(10, bytes)).group_by('key').chunk_by('key',2).s
v.q('select key, histogram(10)(bytes) from timeseries_devices group by key chunk by (key, 2)').s

Will be converted to the following valid ClickHouse SQL:

SELECT key, histogram(10)(bytes) FROM timeseries_devices WHERE cityHash64(key)%2 = 0 GROUP BY key
UNION ALL
SELECT key, histogram(10)(bytes) FROM timeseries_devices WHERE cityHash64(key)%2 = 1 GROUP BY key

Note that CHUNK BY will only work in specific cases and will be deprecated once the ClickHouse core team address any lagging performance issues in this area.