Getting Started
System Requirements
- Any recent Linux supported by ClickHouse and Python
- Latest ClickHouse server and client binaries (>= 19.15.x). See the ClickHouse documentation for details on installing ClickHouse.
- Python (>= 3.7.x). We recommend the latest Anaconda Python distribution.
Installation
Vulkn itself is a pure Python application and can be installed using pip.
Note
VULKИ 19.0.0 has only been used / tested within the latest Ubuntu and Mint LTS environments and is unlikely to function as expected under other distributions.
The following steps have been validated on Ubuntu Xenial and Bionic. Vulkn should also install on other systems running Python 3.7 or later, although those environments are untested.
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install -y python3.7 python3.7-dev python3-pip
sudo python3.7 -m pip install vulkn
You should also ensure that your system's console encoding is set to an encoding that extends beyond Latin/ASCII, such as en_US.UTF-8 (or your preferred local extended character set):
sudo localectl set-locale LANG=en_US.UTF-8
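As a quick sanity check after changing the locale, the following minimal sketch tests whether the console encoding reported by Python can represent the non-ASCII characters Vulkn prints at startup. The helper name can_encode and the SAMPLE string are illustrative only and are not part of Vulkn.

```python
import sys


def can_encode(text, encoding):
    """Return True if `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        # UnicodeEncodeError: a character is not representable.
        # LookupError: the encoding name itself is unknown.
        return False


# Illustrative sample: a Cyrillic letter and box-drawing characters
# of the kind used in the Vulkn startup banner.
SAMPLE = "VULKИ ██╗"

if __name__ == "__main__":
    encoding = sys.stdout.encoding or "ascii"
    status = "OK" if can_encode(SAMPLE, encoding) else "cannot display the Vulkn banner"
    print(f"console encoding {encoding}: {status}")
```

If the check fails, set the locale as shown above and open a new terminal session before trying again.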
Using the VULKN CLI
The Vulkn CLI is a simple command-line interface that enables interactive exploration and analytics on the Linux console. To follow the example below, run vulkn --local from a terminal.
(base) ironman@hulk ~$ vulkn --local
2019.10.11 10:14:28.709940 [ 1 ] {} <Information> : Starting ClickHouse 19.16.1.1467 with revision 54427
2019.10.11 10:14:28.709992 [ 1 ] {} <Information> Application: starting up
2019.10.11 10:14:28.712126 [ 1 ] {} <Trace> Application: Will mlockall to prevent executable memory from being paged out. It may take a few seconds.
2019.10.11 10:14:28.732955 [ 1 ] {} <Trace> Application: The memory map of clickhouse executable has been mlock'ed
2019.10.11 10:14:28.733106 [ 1 ] {} <Debug> Application: Set max number of file descriptors to 1048576 (was 1024).
2019.10.11 10:14:28.733114 [ 1 ] {} <Debug> Application: Initializing DateLUT.
2019.10.11 10:14:28.733117 [ 1 ] {} <Trace> Application: Initialized DateLUT with time zone 'UTC'.
2019.10.11 10:14:28.734904 [ 1 ] {} <Debug> ConfigReloader: Loading config '/tmp/ironman-8155347b-7857-453b-8af7-7199cf49b25a/etc/users.xml'
2019.10.11 10:14:28.735279 [ 1 ] {} <Information> Application: Loading metadata from /tmp/ironman-8155347b-7857-453b-8af7-7199cf49b25a/clickhouse/
2019.10.11 10:14:28.735797 [ 1 ] {} <Debug> Application: Loaded metadata.
2019.10.11 10:14:28.735812 [ 1 ] {} <Information> BackgroundSchedulePool: Create BackgroundSchedulePool with 16 threads
2019.10.11 10:14:28.737570 [ 1 ] {} <Information> Application: Listening http://[::1]:8124
2019.10.11 10:14:28.737641 [ 1 ] {} <Information> Application: Listening for connections with native protocol (tcp): [::1]:9001
2019.10.11 10:14:28.737690 [ 1 ] {} <Information> Application: Listening http://127.0.0.1:8124
2019.10.11 10:14:28.737723 [ 1 ] {} <Information> Application: Listening for connections with native protocol (tcp): 127.0.0.1:9001
2019.10.11 10:14:28.738090 [ 1 ] {} <Information> Application: Available RAM: 62.60 GiB; physical cores: 8; logical cores: 16.
2019.10.11 10:14:28.738103 [ 1 ] {} <Information> Application: Ready for connections.
Добро пожаловать to VULKИ version 2019.1.1!
██╗   ██╗██╗   ██╗██╗     ██╗  ██╗███╗   ██╗
██║   ██║██║   ██║██║     ██║ ██╔╝████╗  ██║
██║   ██║██║   ██║██║     █████╔╝ ██╔██╗ ██║
╚██╗ ██╔╝██║   ██║██║     ██╔═██╗ ██║╚██╗██║
 ╚████╔╝ ╚██████╔╝███████╗██║  ██╗██║ ╚████║
  ╚═══╝   ╚═════╝ ╚══════╝╚═╝  ╚═╝╚═╝  ╚═══╝

The developer friendly real-time analytics engine powered by ClickHouse.

VULKИ entrypoint initialized as 'v'.
Invoking vulkn with the --local option automatically launches a temporary workspace (ClickHouse backend). See the Workspaces section for more detail on what a Workspace is.
Your first Vulkn application
Now it's time to have some fun with Vulkn. The following example uses ArrayVectors to create some sample columns. Enter the following into the Vulkn CLI.
num_rows = 25000000
series_key = ArrayVector.range(1,250000).cast(String).map('concat', String('device-')).take(num_rows).cache()
timestamp = ArrayVector.rand(DateTime('2019-01-01 00:00:00'), DateTime('2019-01-01 23:59:59'), num_rows).cast(DateTime).cache()
temperature = ArrayVector.norm(80,10,int(num_rows/2)).cast(Float32).join(ArrayVector.norm(95,5,int(num_rows/2)).cast(Float32)).cache()
bytes = ArrayVector.rand(1, 8192, num_rows).cast(UInt16).cache()
So what's going on here?
First we're creating a list of 250k integers (UInt64).
series_key = ArrayVector.range(1,250000)
Next we convert the integers to the String type.
series_key = ArrayVector.range(1,250000).cast(String)
Now we can use the concat function to prepend the string 'device-' to each value.
series_key = ArrayVector.range(1,250000).cast(String).map('concat', String('device-'))
We now pull out num_rows (25 million) values from the vector. The take() function repeats or truncates the vector depending on the required length.
series_key = ArrayVector.range(1,250000).cast(String).map('concat', String('device-')).take(num_rows)
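To make the repeat-or-truncate behaviour concrete, here is a minimal pure-Python sketch. It is not Vulkn's implementation: the standalone helper take below is hypothetical, and it assumes that repetition cycles through the values in their original order.

```python
from itertools import cycle, islice


def take(values, n):
    """Return exactly n items, cycling through `values` if it is shorter than n."""
    if n > 0 and not values:
        raise ValueError("cannot take from an empty sequence")
    # cycle() repeats the input endlessly; islice() stops after n items.
    return list(islice(cycle(values), n))


print(take(["device-1", "device-2", "device-3"], 5))
# ['device-1', 'device-2', 'device-3', 'device-1', 'device-2']
print(take([1, 2, 3, 4], 2))
# [1, 2]
```

Under that assumption, taking 25 million values from 250k distinct keys would repeat each key 100 times.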
Finally we cache the vector. Until this point our cast(), map() and take() operations have not been executed. This is called lazy evaluation.
series_key = ArrayVector.range(1,250000).cast(String).map('concat', String('device-')).take(num_rows).cache()
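The idea of lazy evaluation can be illustrated with a toy class that only records operations and runs them when cache() is called. This is a sketch of the general technique, not of how Vulkn builds or executes its queries; LazyVector and to_label are hypothetical names.

```python
class LazyVector:
    """Toy vector that records operations and runs them only on cache()."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Nothing is computed here; the function is only appended to the plan.
        return LazyVector(self._source, self._ops + (fn,))

    def cache(self):
        # Execute the recorded plan once and return the concrete values.
        result = []
        for value in self._source:
            for fn in self._ops:
                value = fn(value)
            result.append(value)
        return result


calls = []  # records every time to_label actually runs


def to_label(n):
    calls.append(n)
    return "device-" + str(n)


pending = LazyVector(range(1, 4)).map(to_label)
print(len(calls))       # 0: to_label has not run yet
print(pending.cache())  # ['device-1', 'device-2', 'device-3']
print(len(calls))       # 3
```

Deferring the work this way lets a whole chain of operations be executed in a single pass when the result is finally needed.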
The timestamp and bytes columns are random values of DateTime and UInt16 types within a range. The temperature column is a bit different and is constructed by joining two separate vectors, each representing a normal distribution (bell) curve with different properties - this approximates morning and afternoon peaks in temperature for a device based on utilisation.
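The shape of the temperature column can be approximated in plain Python. The sketch below assumes that joining two vectors simply concatenates them; the function bimodal_sample and its seed parameter are illustrative and not part of Vulkn.

```python
import random
from statistics import mean


def bimodal_sample(n, seed=0):
    """Concatenate two normal samples: N(80, 10) followed by N(95, 5)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    half = n // 2
    morning = [rng.gauss(80, 10) for _ in range(half)]
    afternoon = [rng.gauss(95, 5) for _ in range(n - half)]
    return morning + afternoon


temps = bimodal_sample(10_000)
print(len(temps))
print(round(mean(temps), 1))
```

Because the two halves are the same size, the overall mean should sit near the midpoint of the two centres, (80 + 95) / 2 = 87.5.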
From Vectors to Tables
The four equal-length vectors can now be combined to form a table. We have to assign names/aliases to both the table and the vectors.
data = v.table.fromVector(
    'default.timeseries_devices',
    (series_key.alias('series_key'),
     timestamp.alias('timestamp'),
     temperature.alias('temp'),
     bytes.alias('bytes')),
    engines.Memory(),
    replace=True)
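Conceptually, building a table from vectors means pairing up the values at each position under their column names, which is why the vectors must be the same length. The following pure-Python sketch illustrates that idea only; table_from_columns is a hypothetical helper and says nothing about how Vulkn or ClickHouse store the data.

```python
def table_from_columns(columns):
    """Combine named, equal-length columns into a list of row dictionaries."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns differ in length: {lengths}")
    names = list(columns)
    # zip(*...) walks all columns in step, yielding one tuple per row.
    return [dict(zip(names, row)) for row in zip(*columns.values())]


rows = table_from_columns({
    "series_key": ["device-1", "device-2"],
    "temp": [82.0, 74.9],
})
print(rows[0])  # {'series_key': 'device-1', 'temp': 82.0}
```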
The fromVector() call returns a DataTable. This is a standard relational 'table' (or query) similar to DataTables found within other data projects. Now we can execute aggregations and queries against this object, including joins and subqueries with other DataTables.
>>> data.count().show()

  row    count()
-----  ---------
    1   25000000

(1 row)

>>> data.select('*').limit(10).show()

  row  series_key    timestamp               temp    bytes
-----  ------------  -------------------  -------  -------
    1  device-1      2019-01-01 21:46:09  82.0007     2104
    2  device-2      2019-01-01 09:49:26  74.9356     3662
    3  device-3      2019-01-01 20:36:41  77.5353     7642
    4  device-4      2019-01-01 00:03:46  77.4789     3219
    5  device-5      2019-01-01 09:00:39  72.738      6559
    6  device-6      2019-01-01 18:41:12  83.689      3478
    7  device-7      2019-01-01 07:54:49  68.2859       34
    8  device-8      2019-01-01 16:34:28  88.9767     6903
    9  device-9      2019-01-01 12:45:35  74.6222     2367
   10  device-10     2019-01-01 09:20:28  63.2291     4436

(10 rows)
You can also issue standard SQL queries against the table.
>>> v.q("SELECT max(timestamp), avg(temp), median(bytes) FROM default.timeseries_devices").show()

  row  max(timestamp)         avg(temp)    median(bytes)
-----  -------------------  -----------  ---------------
    1  2019-01-01 23:59:59      87.5018             4114

(1 row)
Where to next?
Now that you've created a few vectors and a DataTable and executed some queries, it's time to explore other topics such as ASOF joins, SQL extensions and loading data.