
From LLM to Data Warehousing: How to Achieve AI-Driven Data Processing and Analysis

DatabendLabs · Feb 24, 2025

Discussions around Large Language Models (LLMs) have surged again, especially with the introduction of the Chain of Thought (CoT) mechanism in many advanced models. By simulating human reasoning, CoT enables models to break down complex tasks step by step, producing more accurate and reliable results. This breakthrough has significantly improved LLM performance in areas like mathematics and logical reasoning.

At the same time, DeepSeek has gained rapid traction in the market due to its low-cost and high-performance advantages. By adopting innovative architectures and training methods, DeepSeek has significantly reduced inference costs, lowering the barriers to AI adoption. This disruptive advancement has drawn widespread industry attention and discussion. Meanwhile, as more LLMs emerge, market competition continues to intensify.

The Impact of LLMs on the Data Industry

In the database industry, LLM applications are driving intelligent advancements in data processing while also fueling innovation in data warehouse and database technologies. In the AI era, data has become the key resource for enterprise success. Although companies can access the same foundational LLMs, those that effectively utilize their own data to build business-driven LLM applications will stand out in the competition.

The widespread adoption of LLMs is transforming data management, storage, and processing. As LLM technology matures, traditional database systems face unprecedented challenges. More companies are now exploring the fusion of LLMs with database technologies to redefine data querying, analysis, and processing methods. To meet the needs of LLM-powered data workloads, many database vendors are actively innovating in areas such as:

  • Enhancing data processing performance
  • Improving model inference capabilities
  • Optimizing data storage and access efficiency

Databend's Exploration of LLM Integration

As a cloud-native lakehouse provider, Databend recognizes the crucial role of data processing technologies in enterprise success. Since 2022, we have been exploring the integration of LLMs with the lakehouse architecture to enhance the intelligence of data warehouses.

After ChatGPT launched in late 2022, our database engineering team initiated the first phase of exploration. We integrated OpenAI's API into the data warehouse and leveraged vector embeddings and Retrieval-Augmented Generation (RAG) to help users improve query efficiency and enable intelligent data processing.

During this phase, we introduced AskBend, which allows users to store their knowledge bases in Databend and conduct intelligent Q&A using embedding-based queries combined with RAG. This solution enables users to query Databend and receive AI-generated responses based on stored documents, showcasing LLMs' immense potential in data warehousing.
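The AskBend flow can be sketched in SQL roughly as follows (the table and column names are hypothetical, and the exact signatures are illustrative of the embedding and distance functions Databend has exposed, such as ai_embedding_vector and cosine_distance):

```sql
-- Hypothetical knowledge-base table: document chunks plus their embeddings
CREATE TABLE doc_store (
    id INT,
    content STRING,
    embedding ARRAY(FLOAT32)
);

-- Retrieve the chunks most relevant to a question by vector similarity;
-- the retrieved text is then passed to the LLM as RAG context
SELECT content
FROM doc_store
ORDER BY cosine_distance(
    embedding,
    ai_embedding_vector('How do I create a table in Databend?')
)
LIMIT 5;
```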

However, this approach also revealed some key issues, particularly in data privacy and cost control. Enterprises had to upload their data to external platforms for processing, leading to potential privacy risks and high service costs.

Challenges of Data Privacy and Cost Control

Although LLM technology provides users with convenient intelligent services, the traditional LLM application model requires users to upload their data to an external platform for processing. This both weakens data privacy protection and makes LLM services expensive. With OpenAI, for example, API call fees can be a significant expense for enterprises doing large-scale data processing; once data volumes reach hundreds of thousands of records or more, the cumulative cost of API calls and token usage climbs quickly.

To address these challenges, Databend advanced its LLM exploration further in 2024, integrating open-source LLMs, such as open models hosted on Hugging Face, to tackle data privacy and cost control. With this approach, user data stays entirely local and never needs to be uploaded to a cloud platform, preserving privacy. At the same time, open-source LLMs significantly reduce inference costs, letting enterprises apply LLM technology more flexibly as their data processing needs grow. This year, with the explosive popularity of DeepSeek, we have also begun experimenting with its integration via the DeepSeek API.

In data analysis scenarios, the powerful capabilities of LLMs are particularly evident, especially in cases with small datasets. For many users, especially those unfamiliar with SQL or data warehouse operations, LLM technology can automatically generate query scripts and perform data analysis through natural language, significantly lowering the barrier to use.

For example, a user only needs to ask:

"Please analyze last year's sales data and identify the fastest-growing product."

The LLM will automatically generate an SQL query, execute the analysis in the data warehouse, and finally return the analysis results along with a visualized presentation.
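For the question above, the generated query might look something like this (the sales table and its columns are hypothetical):

```sql
-- Compare each product's second-half revenue to its first-half revenue for last year
SELECT
    product_name,
    SUM(CASE WHEN sale_date >= '2024-07-01' THEN amount ELSE 0 END)
      / NULLIF(SUM(CASE WHEN sale_date < '2024-07-01' THEN amount ELSE 0 END), 0)
      AS growth_ratio
FROM sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY product_name
ORDER BY growth_ratio DESC
LIMIT 1;
```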

However, as data volume grows, LLMs face certain limitations when processing large-scale datasets. Particularly in real-time data processing scenarios, traditional data warehouse systems still play an indispensable role. Therefore, Databend is further exploring ways to integrate LLM inference with real-time data processing technology to achieve intelligent analysis and processing of big data.

Exploration of LLM Applications

In practical applications, an increasing number of enterprises are successfully automating data analysis and enhancing the intelligence of business decision-making by leveraging LLM + data services. For example, e-commerce companies can integrate LLM technology with data warehouses to conduct in-depth analysis of user behavior, enabling precise decision-making for advertising placement and product recommendations. Additionally, LLMs can be applied in customer service, where they work in conjunction with data warehouses to build personalized recommendation systems based on users' historical behavior and preferences, ultimately improving customer satisfaction and loyalty.

Currently, Databend integrates its SQL-based data processing capabilities with the natural language processing and data understanding strengths of LLMs like DeepSeek, enabling users to efficiently process data and extract valuable insights. This approach has already been applied in internal quality assurance systems, AI function services, and unstructured data processing, significantly reducing manual analysis time and workload.

Internal Quality Assurance System

To minimize the impact on enterprise users during upgrades, we have developed a set of smoke tests built on DeepSeek. The core of the test data generation engine relies on DeepSeek's data processing capabilities: given a table's SQL schema, it generates data distributions closer to the user's real business scenarios, with an emphasis on test data likely to trigger boundary issues. This approach improves test coverage, identifies potential system risks more effectively, and gives enterprise users more reliable quality assurance.

AI Functions

Databend offers users a range of AI Functions for data ETL, allowing users to call them directly in SQL and leverage AI capabilities to extract greater value from their data. Initially, all of these services were backed by OpenAI's API; due to high costs and privacy concerns, we have since replaced OpenAI with open-source large models in certain scenarios.

Currently, based on AI functions, we have primarily implemented the following capabilities:

  • ai_text_similarity: Text similarity scoring.
  • ai_mask: Data masking to protect sensitive information such as addresses and phone numbers. Doing this manually requires significant manpower, especially at large data volumes; AI greatly improves efficiency here.
  • ai_extract: Entity extraction, which identifies and pulls specific entity information out of text. For instance, if your data contains entities like addresses and gender, this function can extract them.
  • ai_sentiment: Sentiment analysis (positive/negative/mixed) to determine the emotional tendency of text. In e-commerce, for example, it can assess the sentiment of product reviews.
  • ai_classify: Classification, which sorts text into predefined categories.
  • ai_translate: Translation, which converts text from one language to another.
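Because these are ordinary SQL functions, they compose naturally with the rest of a query. A sketch of how several of them might be combined (the table and the exact argument shapes are assumptions, since the functions are not yet publicly available):

```sql
-- Clean and enrich raw product reviews in a single pass
SELECT
    ai_mask(review_text)                                          AS masked_text,
    ai_sentiment(review_text)                                     AS sentiment,
    ai_classify(review_text, ['complaint', 'praise', 'question']) AS category,
    ai_translate(review_text, 'en')                               AS in_english
FROM product_reviews;
```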

These capabilities essentially fall under data cleansing, where AI can replace manual labor and significantly reduce its cost. While implementing this solution, we made several optimizations:

  • We modified the UDF Server to replace inefficient row-by-row processing with a batch processing model, greatly improving data processing throughput.
  • We revamped the original model with vectorization technology so it runs well on low-spec GPUs, lowering operational costs.
  • We implemented a fine-grained billing and monitoring mechanism to keep resource usage and costs transparent.

These capabilities are not yet available for public testing but will be opened up shortly.

Databend's AI functions are designed to be user-friendly, even for those who are not familiar with machine learning or natural language processing. With Databend, you can quickly and easily integrate powerful AI features into your SQL queries, elevating your data analysis to a new level.

Unlike before, we have changed how large models are integrated. Instead of running an AI cluster inside Databend's Query nodes, we now connect to open-source large models through Databend's external function (UDF Server) model, where Databend only needs to define an API for the function. Once deployed, users can automatically connect to the large model in the cloud. This setup makes it easy to switch to better open-source models in the future, with all data processing occurring on the user's machine. It addresses data privacy and compliance concerns while keeping costs under control. If users require it, we can even provide a private deployment.
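Based on Databend's external function syntax, registering such a function might look roughly like this (the handler name and address are illustrative; the model server runs wherever the user deploys it):

```sql
-- Point a SQL-callable function at a locally deployed model server
CREATE FUNCTION ai_sentiment (STRING)
    RETURNS STRING
    LANGUAGE python
    HANDLER = 'sentiment'
    ADDRESS = 'http://127.0.0.1:8815';
```

Because only the function's API is defined on the Databend side, the backing model can be swapped without touching the SQL that calls it.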

Unstructured Data Processing

Many Databend users need to extract entity information from unstructured data and convert it into structured data to unlock its value in real business scenarios. Databend utilizes DeepSeek's data processing and analysis capabilities to extract this information efficiently and output it in JSON format.

Here is an example implemented using DeepSeek V3:

For instance, we input the following text:

“Please send an email to [email protected] to contact us, or visit our office at 401 RYLAND ST. STE 200-A, Reno, NV 89502, USA.”

DeepSeek V3 outputs:

{
  "email": "[email protected]",
  "address": "401 RYLAND ST. STE 200-A, Reno, NV 89502, USA"
}
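Inside the warehouse, this kind of extraction could be driven from SQL along these lines (the table, column, and the exact shape of the ai_extract call are assumptions):

```sql
-- Turn free-text notes into structured JSON entities
SELECT ai_extract(note, ['email', 'address']) AS entities
FROM support_tickets;
```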

With the continuous development of LLM and lakehouse technologies, future data processing will become more intelligent and automated. Enterprises will be able to automatically generate data models, analysis reports, and even perform real-time optimization and adjustments on data using LLM technology, thereby enhancing decision-making efficiency and accuracy. Whether it's real-time data analysis or cross-domain data integration, LLM will become an important driving force in data technology.

From an industry development perspective, the future integration of AI and data will not be limited to traditional data warehouse and lakehouse systems but will gradually expand into more fields. Enterprises will be able to leverage LLM technology to gain better insights into market changes, optimize products and services, and provide users with a more personalized experience.

Conclusion: Exploration and Practice

It is worth mentioning that through a permissive open-source strategy, DeepSeek attracted tens of millions of users and broad attention in just one month. DeepSeek's success not only demonstrates the significant role of open-source strategies in popularizing new technologies but also offers an effective path to challenge industry giants. This open approach lowers technical barriers, accelerates technology adoption, and drives overall industry development.

In the database industry, open-source strategies have also proven to be an effective path. Taking Databend as an example, as an open-source cloud-native data lakehouse product, Databend has attracted a large number of developers and users through its open-source model, quickly establishing an active community and ecosystem. This open-source strategy has allowed Databend to rapidly become an open-source alternative to the database giant Snowflake, offering users a more cost-effective big data solution.

Currently, the integration of LLMs and data is still evolving. For database practitioners and technology developers, two directions call for continued exploration: how to better leverage LLM technology to raise the intelligence of data warehouses, and how to deepen the integration of LLMs and data while ensuring data privacy and cost control.

Do you have better insights and practical experiences? Do you see more efficient ways to integrate LLMs and data, or technical solutions from your own practice that could further advance this trend? Share your thoughts and experiences with us on Slack or through other channels as we explore the future of data technology together.
