Blog

Exploring Efficiency in the Cloud: The One Billion Rows Challenge with Snowflake and Databend

avatarBohuTANGJan 5, 2024
Exploring Efficiency in the Cloud: The One Billion Rows Challenge with Snowflake and Databend

Gunnar Morling's challenge, initially posed to Java users, involved processing a dataset of 1 billion rows, focusing on temperature measurement values. This challenge, highlighted in his blog post, The One Billion Row Challenge, has intrigued many in the database community. While the original challenge was Java-centric, it sparked interest across various databases, examining their speed and efficiency. Efforts like the 1BRC in SQL with DuckDB and 1 billion rows challenge in PostgreSQL and ClickHouse demonstrate this widespread fascination.

In this post, let's shift the focus to cloud data warehouses, particularly Snowflake and Databend Cloud, exploring how they handle such extensive datasets. The tests are aimed at understanding how these platforms efficiently sort and group large-scale data, especially when stored in cloud-based object storage. The test involves two main tasks:

  • Directly sort and group 1 billion rows of data stored in object storage.
  • Load these rows into a cloud data warehouse, then perform sorting and grouping.

The dataset used for this test is in the Parquet format, known for its columnar storage efficiency. Hosted on AWS S3 and publicly accessible here (5.1 GB), it allows anyone to replicate these tests on Snowflake and Databend Cloud.

The tests use a MEDIUM(32vCPU) warehouse size in Snowflake and Databend Cloud to process a dataset stored in the AWS US East (Ohio) region. The cost is $8 per hour for Snowflake and $4 per hour for Databend Cloud.

Snowflake vs. Databend: A Brief Overview

Snowflake is a well-established cloud-based data warehouse that provides a SQL interface for querying data. It employs a columnar database that stores data in cloud-based object storage like AWS S3. Snowflake has gained popularity for its user-friendliness and scalability in data warehousing.

Databend Cloud represents the new wave of cloud data warehouses, designed to handle massive-scale analytics with reduced cost and complexity. It's built on Databend, the open-source alternative to Snowflake, built from scratch in Rust, offering similar capabilities.

Preparations: Generate Parquet File

The tests follow the steps similar to Robin Moffatt for generating the raw data and exporting it to a Parquet file.

Snowflake vs. Databend: Sort & Group Data in Object Storage

Sort & Group Data in Object Storage with Snowflake

Run the following SQL commands in Snowflake:

CREATE OR REPLACE STAGE wizardbend URL='s3://wizardbend/';

CREATE OR REPLACE FILE FORMAT myformat TYPE = 'parquet';

SELECT
$1:station_name,
MIN($1:measurement) AS min_measurement,
AVG($1:measurement) AS mean_measurement,
MAX($1:measurement) AS max_measurement
FROM
@wizardbend/1brc/measurements.parquet (file_format => 'myformat')
GROUP BY
$1:station_name
ORDER BY
$1:station_name;

Result:

Time21m 35s
Cost$2.87
One Billion Row Challenge

Sort & Group Data in Object Storage with Databend Cloud

In Databend Cloud, run the following commands:

CREATE STAGE IF NOT EXISTS wizardbend
URL = 's3://wizardbend/'
CONNECTION = (ALLOW_ANONYMOUS = 'true');

SELECT station_name,
MIN(measurement) AS min_measurement,
AVG(measurement) AS mean_measurement,
MAX(measurement) AS max_measurement
FROM @wizardbend/1brc/measurements.parquet
GROUP BY station_name
ORDER BY station_name;

Result:

Time9.8s
Cost$0.01
One Billion Row Challenge

Snowflake vs. Databend: Load & Process Data in Cloud Data Warehouses

Next, we'll examine the efficiency of Snowflake and Databend Cloud by loading, sorting, and grouping 1 billion rows of data.

Load 1 Billion Rows into Snowflake

CREATE TABLE onebrc (
station_name VARCHAR NULL,
measurement DOUBLE NULL
);

COPY INTO onebrc
FROM (
SELECT $1:station_name::VARCHAR,
$1:measurement::DOUBLE
FROM @wizardbend/1brc/measurements.parquet
(FILE_FORMAT => 'myformat')
);

Result:

Time26m 7s
Cost$3.48
One Billion Row Challenge

Load 1 Billion Rows into Databend Cloud

CREATE TABLE onebrc (
station_name VARCHAR NULL,
measurement DOUBLE NULL
);

COPY INTO onebrc
FROM (
SELECT $1:station_name::VARCHAR,
$1:measurement::DOUBLE
FROM @wizardbend/1brc/measurements.parquet
(FILE_FORMAT => 'myformat')
);

Result:

Time19.1s
Cost$0.02
One Billion Row Challenge

Sort & Group 1 Billion Rows in Snowflake

Excluding the utilization of cache (result cache and local disk cache), the SQL query is as follows:

SELECT station_name,
MIN(measurement) AS min_measurement,
AVG(measurement) AS mean_measurement,
MAX(measurement) AS max_measurement
FROM onebrc
GROUP BY station_name
ORDER BY station_name;

Result:

Time3.1s
Cost$0.007
One Billion Row Challenge

Sort & Group 1 Billion Rows in Databend Cloud

Again, without result caching:

SELECT station_name,
MIN(measurement) AS min_measurement,
AVG(measurement) AS mean_measurement,
MAX(measurement) AS max_measurement
FROM onebrc
GROUP BY station_name
ORDER BY station_name;

Result:

Time3.6s
Cost$0.004
One Billion Row Challenge

Conclusion

All the comparisons presented can be independently run and verified on both Snowflake and Databend Cloud platforms:

TaskSnowflake (Cost & Time)Databend Cloud (Cost & Time)
Read Parquet File$2.87 (21m 35s)$0.01 (9.8s)
Load 1 Billion Rows$3.48 (26m 7s)$0.02 (19.1s)
Read Native Format$0.007 (3.1s)$0.004 (3.6s)
Total$6.357$0.034

In this benchmark comparison between Snowflake and Databend, it's evident that cloud data warehouses, particularly Databend Cloud, excel in efficiently processing large datasets stored in object storage like AWS S3. They offer not only superior speed but also remarkable cost-effectiveness, making them the go-to choice for handling complex data challenges.

Share this post

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!