This Week in Databend #102
PsiACEJul 16, 2023
Databend is a modern cloud data warehouse, serving your massive-scale analytics needs at low cost and complexity. Open source alternative to Snowflake. Also available in the cloud: https://app.databend.com .
What's On In Databend
Stay connected with the latest news about Databend.
Creating Bloom Indexes for Specified Columns
Creating bloom indexes consumes a significant amount of CPU resources. For wide tables, where only a few columns may require point queries or data ingestion performance is more important, creating bloom indexes for all columns may not be a good idea.
The
bloom_index_columns
To create a table with bloom indexes:
CREATE TABLE table_name (
column_name1 column_type1,
column_name2 column_type2,
...
) ... bloom_index_columns='columnName1[, ...]'.
To create or modify bloom indexes for an existing table:
After modifying the Bloom index options, Databend will not create indexes for existing data. The changes will only affect the subsequent data.ALTER TABLE <db.table_name> SET OPTIONS(bloom_index_columns='columnName1[, ...]');
To disable the bloom indexing:
ALTER TABLE <db.table_name> SET OPTIONS(bloom_index_columns='');
If you are interested in learning more, please check out the resources listed below.
Databend SQL Conformance
Databend aims to conform to the SQL standard, with particular support for ISO/IEC 9075:2011, also known as SQL:2011. Databend incorporates many features required by the SQL standard, often with slight differences in syntax or function.
We have summarized the level of conformity of Databend to the SQL:2011 standard, hoping it can help you further understand Databend's SQL Conformance.
If you are interested in learning more, please check out the resources listed below.
Code Corner
Discover some fascinating code snippets or projects that showcase our work or learning journey.
Understanding Databend Recluster Pipeline
A well-clustered table may become chaotic in some storage blocks negatively affecting the query performance. For example, the table continues to have DML operations (INSERT / UPDATE / DELETE).
The re-clustering operation does not cluster the table from the ground up. It selects and reorganizes the most chaotic existing storage blocks by calculating based on the clustering algorithm.
The recluster pipeline is as follows:
┌──────────┐ ┌───────────────┐ ┌─────────┐
│FuseSource├────►│CompoundBlockOp├────►│SortMerge├────┐
└──────────┘ └───────────────┘ └─────────┘ │
┌──────────┐ ┌───────────────┐ ┌─────────┐ │ ┌──────────────┐ ┌─────────┐
│FuseSource├────►│CompoundBlockOp├────►│SortMerge├────┤────►│MultiSortMerge├────►│Resize(N)├───┐
└──────────┘ └───────────────┘ └─────────┘ │ └──────────────┘ └─────────┘ │
┌──────────┐ ┌───────────────┐ ┌─────────┐ │ │
│FuseSource├────►│CompoundBlockOp├────►│SortMerge├────┘ │
└──────────┘ └───────────────┘ └─────────┘ │
┌──────────────────────────────────────────────────────────────────────────────────────────────┘
│ ┌──────────────┐
│ ┌───►│SerializeBlock├───┐
│ │ └──────────────┘ │
│ │ ┌──────────────┐ │ ┌─────────┐ ┌────────────────┐ ┌─────────────────┐ ┌──────────┐
└───►│───►│SerializeBlock├───┤───►│Resize(1)├───►│SerializeSegment├────►│TableMutationAggr├────►│CommitSink│
│ └──────────────┘ │ └─────────┘ └────────────────┘ └─────────────────┘ └──────────┘
│ ┌──────────────┐ │
└───►│SerializeBlock├───┘
If you are interested in learning more, please check out the resources listed below:
Highlights
We have also made these improvements to Databend that we hope you will find helpful:
- Added support for distributed COPY INTO.
- Read document Docs | ATTACH TABLE to learn how to attach an existing table to another one.
- Read documents Docs | Deepnote and Docs | MindsDB to learn how Databend can better collaborate with your data science projects.
- Read documents Docs | Window Functions and Docs | Bitmap Functions to fully understand the BITMAP and window functions supported by Databend.
What's Up Next
We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.
Accelerate CTE through Materialization
Inlining Common Table Expression (CTE) is a good idea, but if the CTE is particularly heavy, such as in TPCH Q15, the cost can be prohibitively expensive. In this case, it's better to introduce materialization for the expensive CTE.
--- TPCH Q15
WITH revenue AS
(SELECT l_suppkey AS supplier_no,
sum(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM lineitem
WHERE l_shipdate >= TO_DATE ('1996-01-01')
AND l_shipdate < TO_DATE ('1996-04-01')
GROUP BY l_suppkey)
SELECT s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
FROM supplier,
revenue
WHERE s_suppkey = supplier_no
AND total_revenue =
(SELECT max(total_revenue)
FROM revenue)
ORDER BY s_suppkey;
Issue #12067 | Feature: speed up CTE by materialization
Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.com/i-m-feeling-lucky to get started.
Changelog
You can check the changelog of Databend Nightly for details about our latest developments.
Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.14-nightly...v1.2.25-nightly
Subscribe to our newsletter
Stay informed on feature releases, product roadmap, support, and cloud offerings!