Introduction

What if querying your database was as easy as asking ChatGPT a question? Data access and analysis are essential to the success of any business. However, exploiting data effectively is often hindered by the need for advanced technical skills, particularly in SQL or BI tools, and a deep understanding of the company's data schema. This technical barrier limits access to information for the majority of employees, creating a bottleneck in some decision-making processes.

Interest in accessible and simple AI interfaces is growing. We can imagine a world where corporate databases are no longer accessible only to a small fraction of employees, but to anyone, through a conversational experience as simple as ChatGPT.

Conversation with an LLM that knows its database

Thus, this use case aims to meet two main objectives:

  1. Democratize access to data: allow all profiles, technical or not, to easily interact with data.
2. Optimize teams' time: reduce the time spent on data extraction and manipulation so that teams can focus on high value-added analyses.

These objectives are accompanied by an essential priority: guaranteeing secure and reliable use of the database, thus ensuring the protection of sensitive information.

Generative AI performance on the BIRD benchmark

The BIRD benchmark, published in 2023, evaluates AI performance on text-to-SQL tasks: the ability of a model to translate a question formulated in natural language into a reliable, correct SQL query, enabling it to interact programmatically with a database.

The BIRD dataset centralizes 95 databases (33.4 GB) from 37 different professional domains (represented below). It contains 12,751 question-answer pairs: each question formulated in natural language is paired with its answer expressed in SQL.

BIRD benchmark business domains

The BIRD benchmark stands out for its relevance, as it reflects a complexity and quality of data similar to that encountered in a real company. It thus offers a realistic framework for evaluating performance. As a reference, human performance on this benchmark, achieved by data engineers, is 92.9%. Below is the AI performance on this benchmark at the time of writing this article:

Results of the BIRD benchmark: a gap remains

Since 2023, a real race to match human performance has been under way, but despite spectacular advances in the field of LLMs, AI researchers have not yet reached this level.

If you wish to discover the technical approach adopted by one of the winning teams, you will find more details in the references cited in this article.

Aqsone case study for procurement

Driven by a desire to explore this innovative use case, a team from the Aqsone Lab was formed to conduct a series of experiments. Its objective: to evaluate the feasibility of conversational data exploration by relying on existing solutions and approaches, without directly participating in the BIRD benchmark.

Here are the key steps of our approach:

1. Analysis of existing solutions

We reviewed existing solutions, whether they were off-the-shelf tools like DataGPT and DataChat or technical approaches such as SQL code generation from natural language (text-to-SQL) or Python code generation from natural language.

The qualitative evaluation was based on a series of criteria, including features, cost, data confidentiality, compatibility with multi-table databases, ease of use, ability to generate data visualizations, and the overall conversational experience.

Our analysis highlighted major limitations of some existing tools, particularly in terms of transparency about how they work (and what they cost), the quality of their documentation, the guarantees offered on data confidentiality and security, and their ability to handle multi-table databases.

2. Creation of a reference dataset and evaluation dataset

To reflect real business data as closely as possible, we generated an SAP purchasing database. This dataset, designed to be accessible and understandable by a typical buyer, served as the basis for our experiments.

Data schema from procurement

We also created an evaluation dataset composed of 30 question-answer pairs: each pair combines a question formulated in natural language with its answer expressed in SQL.

Q&A table evaluation
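For illustration, one such pair could look like the following (the table names, columns, and question here are hypothetical examples, not entries from our actual dataset):

# One illustrative question-answer pair (hypothetical schema)
pair = {
    "question": "How many purchase orders did supplier 'ACME' receive in 2023?",
    "ground_truth_sql": (
        "SELECT COUNT(*) FROM purchase_orders AS po "
        "JOIN suppliers AS s ON s.supplier_id = po.supplier_id "
        "WHERE s.supplier_name = 'ACME' "
        "AND strftime('%Y', po.order_date) = '2023';"
    ),
}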

3. Definition of the approaches to experiment with

We implemented and tested three different approaches based on LLMs and existing frameworks, such as PandasAI, LangChain, LlamaIndex and Vanna AI: these tools have the advantage of being preconfigured for our use case, and therefore require very little customization.

Approach | How does it work? | Pros | Cons | Framework examples
Agent | The LLM acts as a "brain" that directs a sequence of operations to complete the task. | Flexible, autonomous, capable of managing errors in real-time. | Can be complex to configure. | langchain_community.agent_toolkits.sql.toolkit, VannaDefault, PandasAI
Chain | The LLM follows a predefined workflow, often simple and sequential. | Practical, fast, suited for repetitive processes. | Less flexible for complex tasks. | langchain_experimental.sql.SQLDatabaseChain
Semantic | The LLM understands the meaning of questions and maps them to SQL structures. | Performs well with complex and ambiguous queries. | Computationally expensive. | llama_index (SQLDatabaseLlama, NLSQLRetrieve, ChainOfTableQueryEngine), PandasAI v2
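As an illustration, here is roughly how the agent and chain approaches can be wired up with LangChain. This is a sketch assuming a local SQLite copy of the database; exact imports and signatures depend on the LangChain version.

from langchain_openai import ChatOpenAI
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent
from langchain_experimental.sql import SQLDatabaseChain

db = SQLDatabase.from_uri("sqlite:///procurement.db")  # hypothetical local copy
llm = ChatOpenAI(model="gpt-4", temperature=0)

# Agent: the LLM chooses its own tool calls (list tables, inspect schemas,
# run queries) and can retry after an execution error.
agent = create_sql_agent(llm=llm, db=db, agent_type="openai-tools", verbose=True)
agent.invoke({"input": "What is the total order value per supplier in 2023?"})

# Chain: a fixed question -> SQL -> execution -> answer pipeline.
chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
chain.invoke({"query": "What is the total order value per supplier in 2023?"})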

4. Evaluation of the approaches

The "Execution Accuracy" score is calculated based on the initially created evaluation dataset. This score compares the execution results of the ground-truth SQL query and the predicted SQL query on the database content. Here are the results obtained:

Results tables of Aqsone Benchmark on procurement dataset

Thus, we conclude that a more recent language model (GPT-4), combined with either an Agent approach leveraging metadata and few-shot examples or a native LangChain chain approach, delivers the best results on our dataset of 30 question-answer pairs.
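For reference, here is a minimal sketch of how such an execution-accuracy score can be computed, assuming a SQLite database and evaluation pairs shaped like the example above, with a predicted_sql field holding the model's output:

import sqlite3

def execution_accuracy(pairs, db_path="procurement.db"):
    """Fraction of questions where the predicted SQL returns the same rows
    as the ground-truth SQL when both are run against the database."""
    conn = sqlite3.connect(db_path)
    correct = 0
    for pair in pairs:
        try:
            truth = conn.execute(pair["ground_truth_sql"]).fetchall()
            pred = conn.execute(pair["predicted_sql"]).fetchall()
            # Compare result sets, ignoring row order.
            if sorted(truth) == sorted(pred):
                correct += 1
        except sqlite3.Error:
            pass  # a query that fails to execute counts as incorrect
    conn.close()
    return correct / len(pairs)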

5. Development of an intuitive user interface

To facilitate interaction with the LLM and the database, we designed a dedicated user interface based on Streamlit.
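Here is a minimal sketch of such a chat interface; answer_question is a hypothetical placeholder standing in for any of the text-to-SQL approaches above:

import streamlit as st

def answer_question(question: str) -> tuple[str, str]:
    """Placeholder backend: plug in any of the agent/chain approaches above."""
    sql = "SELECT COUNT(*) FROM purchase_orders;"  # dummy generated query
    return f"(demo answer to: {question})", sql

st.title("Chat with your data")

# Keep the conversation history across Streamlit reruns.
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay previous turns so users can review the interaction history.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if question := st.chat_input("Ask a question about the procurement data"):
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.write(question)
    answer, sql = answer_question(question)
    with st.chat_message("assistant"):
        st.write(answer)
        with st.expander("Show the generated SQL query"):
            st.code(sql, language="sql")
    st.session_state.messages.append({"role": "assistant", "content": answer})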

Chat with your data

We integrated a feature that lets users view the SQL query used to answer a question. Here is an example:

Chat with your data

Conclusions

Our experiments have demonstrated the promising potential of LLMs to facilitate data access in enterprises. Approaches based on chains or agents stood out for their performance, achieving an execution accuracy of 70% on our experimental dataset. With GPT-4, these approaches exhibited impressive native knowledge of SAP data and offered a satisfactory level of explainability by displaying the generated SQL queries.

These results encourage us to believe that AI could soon surpass data analysis experts in terms of speed and efficiency, particularly in the Text-to-SQL domain. The existence of the BIRD benchmark, dedicated to evaluating AI performance on text-to-SQL tasks, is a major asset for tracking progress and setting performance standards for this use case.

Despite these promising advances, here are the three main challenges that remain:

1. Ensuring reliability and accuracy of LLM responses

Possible solutions:

  • Provide high-quality and standardized data (e.g., SAP) as input
  • Use prompt engineering techniques to:
    • Include contextual information about the database, such as descriptions, schema, DDL, primary keys, foreign keys, attributes, metadata, jargon, business context, types of joins, filters, date formats, etc. The more context the LLM has about the database, the more likely it is to provide an accurate answer.
    • Encourage the LLM to rephrase the user’s question and outline the approach to answer it
    • Prompt the LLM to formulate correct SQL queries that follow best SQL practices (preventing syntax issues, etc.)
    • Apply advanced methods like multiple selection (self-consistency), where multiple answers are generated and the most coherent one is chosen
  • Verify the SQL query after generation to ensure it runs on the database, and ask the LLM to correct errors if necessary (see the sketch after this list)
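To make the last point concrete, here is a minimal sketch of such a verification loop; generate_sql stands in for any LLM call and is a hypothetical function, not part of a specific framework:

import sqlite3

def query_with_retry(question, generate_sql, db_path="procurement.db", max_tries=3):
    """Execute LLM-generated SQL, feeding errors back for self-correction.

    generate_sql(question, error) is any LLM-backed function returning a SQL
    string; on retries it receives the previous error message as context."""
    error = None
    for _ in range(max_tries):
        sql = generate_sql(question, error)
        try:
            with sqlite3.connect(db_path) as conn:
                return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)  # injected into the next prompt
    raise RuntimeError(f"No valid SQL after {max_tries} attempts: {error}")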

2. Ensuring a good user experience

The interaction with the tool should be intuitive, much like ChatGPT. Best practices include:

  • Allow users to rephrase questions and review interaction history
  • Enable the LLM to seek clarifications if the question is complex or incomplete
  • Improve the explainability of the LLM's responses by displaying either the generated SQL query or, ideally, a detailed explanation of the reasoning behind it

3. Ensuring database protection

It’s crucial to guarantee secure interactions, especially at a time when data, often sensitive, plays a central role in decision-making. The use of external APIs like OpenAI raises legitimate concerns regarding data security and privacy. Therefore, robust solutions must be implemented to protect sensitive information:

  • Using Azure OpenAI can offer stronger privacy guarantees. If necessary, an open-source LLM can be run locally, though this option can be costly and sometimes limiting, as the most performant models are rarely available open source.
  • Ensure the security of interactions with the database by rigorously controlling access and write permissions to maintain data integrity (a minimal sketch follows below).
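As a minimal sketch of the second point, assuming a SQLite database, the connection can be opened read-only and incoming queries screened before execution. The keyword blocklist here is a coarse safeguard; production systems should rely on database-level roles and permissions.

import sqlite3

def open_read_only(db_path):
    # Read-only mode: LLM-generated queries can never modify the data.
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

FORBIDDEN = {"insert", "update", "delete", "drop", "alter", "create", "truncate"}

def run_safely(conn, sql):
    # Defence in depth: reject obviously mutating statements up front,
    # even though the read-only connection would refuse them anyway.
    if FORBIDDEN & set(sql.lower().split()):
        raise PermissionError("Only read queries are allowed")
    return conn.execute(sql).fetchall()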

The use of LLMs for conversational data exploration is a major innovation that could transform how businesses use and exploit their data. The progress made so far is encouraging and opens the door to an exciting future.

References

  1. BIRD benchmark
  2. Technical approach of one of the winning teams on the BIRD benchmark:
    • Using GPT-4 from OpenAI
    • Iterative fine-tuning of GPT-4 on specific Text-to-SQL tasks, using a dataset of triplets (natural language question, corresponding SQL query, and associated schema elements)
    • Maximizing model context by providing complete schema information to the model (instead of traditional "schema linking")
    • Augmenting context with detailed column descriptions, query hints, and expected result ordering and aggregations
    • Self-consistency: generating multiple answers and selecting the most relevant one
    • Self-correction: simulating a database administrator LLM to identify and correct errors in generated SQL queries if needed
  3. LangChain: LangChain documentation, SQL use case
  4. Vanna AI
  5. PandasAI: official website of PandasAI
  6. DataDM: Devpost post
  7. Similar use case implemented with LangChain and GPT-4: Quantmetry blog post
  8. Comparison of the LLM agent approach and the chain approach: blog post link
  9. Francesco Puppini's talk "SQL with LLMs: chat with your data": YouTube link
