Exploring Efficiency in the Cloud: The One Billion Rows Challenge with Snowflake and Databend
BohuTANG · Jan 5, 2024
Gunnar Morling's One Billion Row Challenge, originally posed to Java developers, involves processing a dataset of 1 billion rows of temperature measurements. Introduced in his blog post, The One Billion Row Challenge, it has intrigued many in the database community. While the original challenge was Java-centric, it sparked interest across various databases as a test of their speed and efficiency. Efforts like the 1BRC in SQL with DuckDB and the 1 billion rows challenge in PostgreSQL and ClickHouse demonstrate this widespread fascination.
In this post, let's shift the focus to cloud data warehouses, particularly Snowflake and Databend Cloud, exploring how they handle such extensive datasets. The tests are aimed at understanding how these platforms efficiently sort and group large-scale data, especially when stored in cloud-based object storage. The test involves two main tasks:
- Directly sort and group 1 billion rows of data stored in object storage.
- Load these rows into a cloud data warehouse, then perform sorting and grouping.
The dataset used for this test is in Parquet format, known for its columnar storage efficiency. It is hosted on AWS S3 (s3://wizardbend/1brc/measurements.parquet, 5.1 GB) and publicly accessible, so anyone can replicate these tests on Snowflake and Databend Cloud.
The tests use a MEDIUM (32 vCPU) warehouse in both Snowflake and Databend Cloud, processing a dataset stored in the AWS US East (Ohio) region. At this size, Snowflake costs $8 per hour and Databend Cloud $4 per hour.
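All costs reported below follow directly from elapsed time multiplied by the warehouse's hourly rate. For example, a query that keeps Snowflake's MEDIUM warehouse busy for 21m 35s (1,295 seconds) costs about 1,295 / 3,600 × $8 ≈ $2.88; small differences from the figures reported below come from rounding.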
Snowflake vs. Databend: A Brief Overview
Snowflake is a well-established cloud-based data warehouse that provides a SQL interface for querying data. It employs a columnar database that stores data in cloud-based object storage like AWS S3. Snowflake has gained popularity for its user-friendliness and scalability in data warehousing.
Databend Cloud represents the new wave of cloud data warehouses, designed to handle massive-scale analytics with reduced cost and complexity. It is powered by Databend, an open-source alternative to Snowflake written from scratch in Rust, and offers similar capabilities.
Preparations: Generate Parquet File
The tests follow steps similar to Robin Moffatt's for generating the raw data and exporting it to a Parquet file.
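If you want to reproduce the file yourself, one approach (a minimal sketch, assuming you have already generated measurements.txt with the challenge's data generator and have DuckDB installed) is to convert the semicolon-delimited text file to Parquet:

```sql
-- Read the 1BRC text file (format: <station_name>;<measurement>)
-- and write it out as a single Parquet file.
COPY (
    SELECT *
    FROM read_csv('measurements.txt',
                  delim = ';',
                  header = false,
                  columns = {'station_name': 'VARCHAR',
                             'measurement': 'DOUBLE'})
) TO 'measurements.parquet' (FORMAT PARQUET);
```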
Snowflake vs. Databend: Sort & Group Data in Object Storage
Sort & Group Data in Object Storage with Snowflake
Run the following SQL commands in Snowflake:
```sql
-- Create an external stage pointing at the public S3 bucket,
-- plus a named file format for reading Parquet files.
CREATE OR REPLACE STAGE wizardbend URL='s3://wizardbend/';
CREATE OR REPLACE FILE FORMAT myformat TYPE = 'parquet';
```
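Optionally, confirm that the stage resolves and the file is visible before querying it (`LIST` is standard Snowflake syntax):

```sql
-- Should list 1brc/measurements.parquet along with its size.
LIST @wizardbend/1brc/;
```

Then run the aggregation directly against the staged Parquet file: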
```sql
-- $1 refers to each row of the staged file; $1:station_name and
-- $1:measurement access its columns.
SELECT
    $1:station_name,
    MIN($1:measurement) AS min_measurement,
    AVG($1:measurement) AS mean_measurement,
    MAX($1:measurement) AS max_measurement
FROM
    @wizardbend/1brc/measurements.parquet (file_format => 'myformat')
GROUP BY
    $1:station_name
ORDER BY
    $1:station_name;
```
Result:
| Time | Cost |
| --- | --- |
| 21m 35s | $2.87 |
Sort & Group Data in Object Storage with Databend Cloud
In Databend Cloud, run the following commands:
```sql
-- The bucket is public, so the stage allows anonymous access.
CREATE STAGE IF NOT EXISTS wizardbend
    URL = 's3://wizardbend/'
    CONNECTION = (ALLOW_ANONYMOUS = 'true');
```
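Databend infers the Parquet schema automatically, so an optional quick check is to peek at a few rows and confirm the column names:

```sql
-- station_name and measurement come straight from the Parquet schema.
SELECT * FROM @wizardbend/1brc/measurements.parquet LIMIT 5;
```

Then run the same aggregation, referencing the columns by name: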
```sql
-- No file format object is required here: the schema is inferred.
SELECT station_name,
       MIN(measurement) AS min_measurement,
       AVG(measurement) AS mean_measurement,
       MAX(measurement) AS max_measurement
FROM @wizardbend/1brc/measurements.parquet
GROUP BY station_name
ORDER BY station_name;
```
Result:
| Time | Cost |
| --- | --- |
| 9.8s | $0.01 |
Snowflake vs. Databend: Load & Process Data in Cloud Data Warehouses
Next, we'll examine the efficiency of Snowflake and Databend Cloud by loading, sorting, and grouping 1 billion rows of data.
Load 1 Billion Rows into Snowflake
```sql
-- Target table for the raw measurements.
CREATE TABLE onebrc (
    station_name VARCHAR NULL,
    measurement DOUBLE NULL
);

-- Load the staged Parquet file, casting each column on the way in.
COPY INTO onebrc
FROM (
    SELECT $1:station_name::VARCHAR,
           $1:measurement::DOUBLE
    FROM @wizardbend/1brc/measurements.parquet
    (FILE_FORMAT => 'myformat')
);
```
Result:
| Time | Cost |
| --- | --- |
| 26m 7s | $3.48 |
Load 1 Billion Rows into Databend Cloud
```sql
-- The same statements work in Databend Cloud. The only addition is
-- the named file format: it was not needed for the direct stage
-- query above, but the COPY below references it by name.
CREATE FILE FORMAT IF NOT EXISTS myformat TYPE = PARQUET;

CREATE TABLE onebrc (
    station_name VARCHAR NULL,
    measurement DOUBLE NULL
);

COPY INTO onebrc
FROM (
    SELECT $1:station_name::VARCHAR,
           $1:measurement::DOUBLE
    FROM @wizardbend/1brc/measurements.parquet
    (FILE_FORMAT => 'myformat')
);
```
Result:
| Time | Cost |
| --- | --- |
| 19.1s | $0.02 |
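In either warehouse, a quick sanity check (not counted in the timings above) confirms that all rows arrived:

```sql
-- Expect exactly 1,000,000,000 rows.
SELECT COUNT(*) FROM onebrc;
```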
Sort & Group 1 Billion Rows in Snowflake
To keep the comparison fair, the query is run cold, without the benefit of Snowflake's result cache or local disk cache.
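One way to guarantee a cold run is to disable the result cache for the session via Snowflake's `USE_CACHED_RESULT` parameter:

```sql
-- Force the query below to execute rather than serve a cached result.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```

Then run the aggregation: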
```sql
SELECT station_name,
       MIN(measurement) AS min_measurement,
       AVG(measurement) AS mean_measurement,
       MAX(measurement) AS max_measurement
FROM onebrc
GROUP BY station_name
ORDER BY station_name;
```
Result:
| Time | Cost |
| --- | --- |
| 3.1s | $0.007 |
Sort & Group 1 Billion Rows in Databend Cloud
Again, the query is run without result caching.
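Databend has a comparable session setting for its query result cache (assumed here to be `enable_query_result_cache`; the exact name may vary across versions, so check the Databend docs):

```sql
-- Keep the result cache out of the picture so the query really runs.
SET enable_query_result_cache = 0;
```

Then run the same aggregation: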
```sql
SELECT station_name,
       MIN(measurement) AS min_measurement,
       AVG(measurement) AS mean_measurement,
       MAX(measurement) AS max_measurement
FROM onebrc
GROUP BY station_name
ORDER BY station_name;
```
Result:
| Time | Cost |
| --- | --- |
| 3.6s | $0.004 |
Conclusion
All the comparisons presented can be independently run and verified on both Snowflake and Databend Cloud platforms:
- Snowflake ($400 coupon): https://signup.snowflake.com/
- Databend Cloud ($200 coupon): https://www.databend.com/apply
| Task | Snowflake (Cost & Time) | Databend Cloud (Cost & Time) |
| --- | --- | --- |
| Read Parquet File | $2.87 (21m 35s) | $0.01 (9.8s) |
| Load 1 Billion Rows | $3.48 (26m 7s) | $0.02 (19.1s) |
| Read Native Format | $0.007 (3.1s) | $0.004 (3.6s) |
| Total | $6.357 | $0.034 |
In this benchmark, both platforms handled the 1-billion-row workload well once the data was loaded into their native formats (3.1s vs. 3.6s). The difference shows up when querying Parquet files directly in object storage and when loading them: there, Databend Cloud was roughly two orders of magnitude faster and cheaper than Snowflake. For large datasets that live in object storage like AWS S3, that combination of speed and cost-effectiveness makes it a compelling choice.