From Git to Fuse Engine
BohuTANGJul 5, 2022
Databend, developed with Rust, is a new open-source data warehouse architected toward the cloud. It is committed to providing fast elastic expansion capabilities and a pay-as-you-go user experience. GitHub: https://github.com/datafuselabs/databend
Introduction
This post explains the Databend base: Fuse Engine, a powerful columnar storage engine. The engine was designed by the Databend community with the following principles: Powerful performance, simple architecture, and high reliability.
Before we start, check out a challenging task that Databend completed: With the Fuse Engine deployed on AWS S3, a transaction wrote 22.89 TB of raw data in around one and a half hour.
mysql> INSERT INTO ontime_new SELECT * FROM ontime_new;
Query OK, 0 rows affected (1 hour 34 min 36.82 sec)
Read 31619274180 rows, 22.89 TB in 5675.207 sec., 5.57 million rows/sec., 4.03 GB/sec.
Meanwhile, the following conditions were met as well:
- Distributed transactions: Multiple computing nodes can read and write the same data simultaneously (This is the first problem that an architecture that separates storage from compute must solve).
- Snapshot isolation: Different versions of data do not affect each other so you can do Zero-Copy Cloning for tables.
- Retrospective ability: You are allowed to switch to any version of the data, so you can recover with the Time Travel feature.
- Data merging: A new version of data can be generated after merging.
- Simple and robust: Data relationships are described using files, and you can recover entire data system based on these files.
From above, you will find that Fuse Engine is "Git-inspired". Before introducing the design of Fuse Engine, let's take a look at how the bottom layer of Git works.
How Git Works
Git implements data version control (including branch, commit, checkout, and merge) in a distributed environment. Based on Git semantics, it is possible to create a distributed storage engine. There are also some products built on Git-like on the market, such as Nessie - Transactional Catalog for Data Lakes and lakeFS.
To better explore the underlying working mechanism of Git, we use Git semantics to complete a series of "data" operations from the perspective of the database.
- Prepare a file named with the content:
cloud.txt
2022/05/06, Databend, Cloud
- Commit the file to Git.
cloud.txt
git commit -m "Add cloud.txt"
- Git generates a snapshot (Commit ID: ).
7d972c7ba9213c2a2b15422d4f31a8cbc9815f71
git log
commit 7d972c7ba9213c2a2b15422d4f31a8cbc9815f71 (HEAD)
Author: BohuTANG <[email protected]>
Date: Fri May 6 16:44:21 2022 +0800
Add cloud.txt
- Prepare another file named .
warehouse.txt
2022/05/07, Databend, Warehouse
- Commit the file to Git.
warehouse.txt
git commit -m "Add warehouse.txt"
- Git generates another snapshot (Commit ID: ).
15af34e4d16082034e1faeaddd0332b3836f1424
commit 15af34e4d16082034e1faeaddd0332b3836f1424 (HEAD)
Author: BohuTANG <[email protected]>
Date: Fri May 6 17:41:43 2022 +0800
Add warehouse.txt
commit 7d972c7ba9213c2a2b15422d4f31a8cbc9815f71
Author: BohuTANG <[email protected]>
Date: Fri May 6 16:44:21 2022 +0800
Add cloud.txt
Git now keeps two versions of the data:
ID 15af34e4d16082034e1faeaddd0332b3836f1424,Version2
ID 7d972c7ba9213c2a2b15422d4f31a8cbc9815f71,Version1
We can switch between versions by the Commit ID, which implements the functions of Time Travel and Table Zero-Copy. How does Git make it possible in the bottom layer? It's not rocket science. Git introduces these types of object files to describe the relationship:
- Commit: Describes tree object information
- Tree: Describes blob object information
- Blob: Describes file information
HEAD File
First, we need to know the HEAD pointer:
cat .git/HEAD
15af34e4d16082034e1faeaddd0332b3836f1424
Commit File
The Commit file records metadata related to the commit, such as the current tree and parent, as well as the committer, etc.
File path:
.git/objects/15/af34e4d16082034e1faeaddd0332b3836f1424
File content:
git cat-file -p 15af34e4d16082034e1faeaddd0332b3836f1424
tree 576c63e580846fa6df2337c1f074c8d840e0b70a
parent 7d972c7ba9213c2a2b15422d4f31a8cbc9815f71
author BohuTANG <[email protected]> 1651830103 +0800
committer BohuTANG <[email protected]> 1651830103 +0800
Add warehouse.txt
Tree File
The Tree file records all the files of the current version.
File path:
.git/objects/57/6c63e580846fa6df2337c1f074c8d840e0b70a
File content:
git cat-file -p 576c63e580846fa6df2337c1f074c8d840e0b70a
100644 blob 688de5069f9e873c7e7bd15aa67c6c33e0594dde cloud.txt
100644 blob bdea812b9602ed3c6662a2231b3f1e7b52dc1ccb warehouse.txt
Blob File
The Blob files are raw data files. You can view the file content using
git cat-file
git cat-file -p 688de5069f9e873c7e7bd15aa67c6c33e0594dde
2022/05/06, Databend, Cloud
git cat-file -p bdea812b9602ed3c6662a2231b3f1e7b52dc1ccb
2022/05/07, Databend, Warehouse
Fuse Engine
Databend's Fuse Engine was designed in a way similar to Git. It introduces three description files:
- Snapshot: Describes segment object information.
- Segment: Describes block object information.
- Block: Describes parquet file information.
Let's repeat the operations we just did with Git in Fuse Engine.
CREATE TABLE git(file VARCHAR, content VARCHAR);
-
Write
to Fuse Engine.cloud.txt
INSERT INTO git VALUES('cloud.txt', '2022/05/06, Databend, Cloud');
-
Fuse Engine generates a snapshot (Snapshot ID:
).6450690b09c449939a83268c49c12bb2
CALL system$fuse_snapshot('default', 'git');
*************************** 1. row ***************************
snapshot_id: 6450690b09c449939a83268c49c12bb2
snapshot_location: 53/133/_ss/6450690b09c449939a83268c49c12bb2_v1.json
format_version: 1
previous_snapshot_id: NULL
segment_count: 1
block_count: 1
row_count: 1
bytes_uncompressed: 68
bytes_compressed: 351 -
Write
to Fuse Engine.warehouse.txt
INSERT INTO git VALUES('warehouse.txt', '2022/05/07, Databend, Warehouse');
-
Fuse Engine generates another snapshot (Snapshot ID
) that is linked to the previous one (Snapshot ID:efe2687fd1fc48f8b414b5df2cec1e19
).6450690b09c449939a83268c49c12bb2
CALL system$fuse_snapshot('default', 'git');
*************************** 1. row ***************************
snapshot_id: efe2687fd1fc48f8b414b5df2cec1e19
snapshot_location: 53/133/_ss/efe2687fd1fc48f8b414b5df2cec1e19_v1.json
format_version: 1
previous_snapshot_id: 6450690b09c449939a83268c49c12bb2
segment_count: 2
block_count: 2
row_count: 2
*************************** 2. row ***************************
snapshot_id: 6450690b09c449939a83268c49c12bb2
snapshot_location: 53/133/_ss/6450690b09c449939a83268c49c12bb2_v1.json
format_version: 1
previous_snapshot_id: NULL
segment_count: 1
block_count: 1
row_count: 1Fuse Engine now keeps two versions of the data:
ID efe2687fd1fc48f8b414b5df2cec1e19, Version2
ID 6450690b09c449939a83268c49c12bb2, Version1That's very similar to Git. Right?
HEAD
Git needs a HEAD as an entry. So does Fuse Engine. Check the HEAD of Fuse Engine:
SHOW CREATE TABLE git\G;
*************************** 1. row ***************************
Table: git
Create Table: CREATE TABLE `git` (
`file` VARCHAR,
`content` VARCHAR
) ENGINE=FUSE SNAPSHOT_LOCATION='53/133/_ss/efe2687fd1fc48f8b414b5df2cec1e19_v1.json'
SNAPSHOT_LOCATION
efe2687fd1fc48f8b414b5df2cec1e19
6450690b09c449939a83268c49c12bb2
CALL system$fuse_snapshot('default', 'git')\G;
*************************** 1. row ***************************
snapshot_id: efe2687fd1fc48f8b414b5df2cec1e19
snapshot_location: 53/133/_ss/efe2687fd1fc48f8b414b5df2cec1e19_v1.json
format_version: 1
previous_snapshot_id: 6450690b09c449939a83268c49c12bb2
segment_count: 2
block_count: 2
row_count: 2
*************************** 2. row ***************************
snapshot_id: 6450690b09c449939a83268c49c12bb2
snapshot_location: 53/133/_ss/6450690b09c449939a83268c49c12bb2_v1.json
format_version: 1
previous_snapshot_id: NULL
segment_count: 1
block_count: 1
row_count: 1
Then create a new table (git_v1) and point
SNAPSHOT_LOCATION
CREATE TABLE git_v1(`file` VARCHAR, `content` VARCHAR) SNAPSHOT_LOCATION='53/133/_ss/6450690b09c449939a83268c49c12bb2_v1.json';
SELECT * FROM git_v1;
+-----------+-----------------------------+
| file | content |
+-----------+-----------------------------+
| cloud.txt | 2022/05/06, Databend, Cloud |
+-----------+-----------------------------+
Snapshot File
Stores the segment information.
File path:
53/133/_ss/efe2687fd1fc48f8b414b5df2cec1e19_v1.json
File content:
{
"format_version":1,
"snapshot_id":"efe2687f-d1fc-48f8-b414-b5df2cec1e19",
"prev_snapshot_id":[
"6450690b-09c4-4993-9a83-268c49c12bb2",
1
],
"segments":[
[
"53/133/_sg/df56e911eb26446b9f8fac5acc65a580_v1.json"
],
[
"53/133/_sg/d0bff902b98846469480b23c2a8f93d7_v1.json"
]
]
... ...
}
Segment File
Stores block information.
File path:
53/133/_sg/df56e911eb26446b9f8fac5acc65a580_v1.json
File content:
{
"format_version":1,
"blocks":[
{
"row_count":1,
"block_size":76,
"file_size":360,
"location":[
"53/133/_b/ba4a60013e27479e856f739aefeadfaf_v0.parquet",
0
],
"compression":"Lz4Raw"
}
]
... ...
}
Block File
The underlying data of Fuse Engine uses Parquet format, and each file is composed of multiple blocks.
Summary
In the early design period (October 2021) of Databend's Fuse Engine, the requirements were very clear, but the solution selection didn't go smoothly. At that time, the Databend community investigated a large number of Table Format solutions (such as Iceberg) on the market. The challenge was to choose between using an existing solution and building a new one. Finally, we decided to develop a simple and suitable Storage Engine that uses the Parquet standard as the storage format. Fuse Engine stores the Parquet Footer separately to reduce unnecessary Seek operations, and introduces a more flexible indexing mechanism, for example, operations such as Aggregation and Join can have their own indexes for acceleration.
Feel free to deploy Fuse Engine with your object storage to have a different experience on the big data analysis: https://docs.databend.com/doc/deploy
Databend on GitHub: https://github.com/datafuselabs/databend
Subscribe to our newsletter
Stay informed on feature releases, product roadmap, support, and cloud offerings!