What if querying your database were as easy as asking ChatGPT a question? Data access and analysis are essential to the success of any business. However, making effective use of data is often hindered by the need for advanced technical skills, particularly in SQL or BI tools, and for a deep understanding of the company's data schema. This technical barrier limits access to information for the large majority of employees, creating a bottleneck in some decision-making processes.
Interest in simple, accessible AI interfaces is growing. We can imagine a world where corporate databases are no longer reserved for a small fraction of employees but open to anyone, through a conversational experience as simple as ChatGPT.
Thus, this use case aims to meet two main objectives:
These objectives are accompanied by an essential priority: guaranteeing secure and reliable use of the database, thus ensuring the protection of sensitive information.
The BIRD benchmark, published in 2023, focuses on evaluating AI performance on text-to-SQL tasks: the ability of an AI model to translate a question formulated in natural language into a correct, reliable SQL query, allowing it to interact programmatically with a database.
The BIRD dataset brings together 95 databases (33.4 GB) from 37 different professional domains (represented below). It contains 12,751 question-answer pairs: each question formulated in natural language is paired with its answer expressed in SQL.
The BIRD benchmark stands out for its relevance, as it reflects a complexity and quality of data similar to that encountered in a real company. It thus offers a realistic framework for evaluating performance. As a reference, human performance on this benchmark, achieved by data engineers, is 92.9%. Below is the AI performance on this benchmark at the time of writing this article:
Since 2023, a real race has begun to match human performance, but despite spectacular advances in the field of LLMs, AI researchers have not yet been able to reach this level.
If you wish to discover the technical approach adopted by one of the winning teams, you will find more details in the references cited in this article.
Driven by a desire to explore this innovative use case, a team from the Aqsone Lab was formed to conduct a series of experiments. Its objective: to evaluate the feasibility of conversational data exploration by relying on existing solutions and approaches, without directly participating in the BIRD benchmark.
Here are the key steps of our approach:
We reviewed existing solutions, whether off-the-shelf tools like DataGPT and DataChat or technical approaches such as generating SQL (text-to-SQL) or Python code from natural language.
The qualitative evaluation was based on a series of criteria, including features, cost, data confidentiality, compatibility with multi-table databases, ease of use, ability to generate data visualizations, and the overall conversational experience.
Our analysis highlighted major limitations of some existing tools, particularly regarding transparency about how they work (and their pricing), the quality of their documentation, the guarantees they offer on data confidentiality and security, and their ability to handle multi-table databases.
To best reflect real business data, we generated an SAP purchasing database. This dataset, designed to be accessible and understandable by a standard purchaser, was used as the basis for our experiments.
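To give an idea of what such a dataset can look like, here is a minimal, purely illustrative sketch of a simplified purchasing table in SQLite; the table and column names are assumptions for this article, not the actual schema we generated.

```python
import sqlite3

# Illustrative only: a simplified purchasing table, not our actual SAP schema.
conn = sqlite3.connect("purchasing.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS purchase_orders (
    po_id       INTEGER PRIMARY KEY,   -- purchase order number
    supplier    TEXT,                  -- supplier name
    material    TEXT,                  -- purchased material
    order_date  TEXT,                  -- ISO date, e.g. '2024-03-15'
    quantity    INTEGER,               -- ordered quantity
    net_amount  REAL                   -- net order value
);
""")
conn.commit()
conn.close()
```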
An evaluation dataset was also created. It is composed of 30 question-answer pairs: a question formulated in natural language and its corresponding answer written in SQL.
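To make the format concrete, here is an illustrative pair (not taken from our actual evaluation dataset), expressed against the simplified schema sketched above:

```python
# Hypothetical evaluation pair: natural-language question + reference SQL answer.
evaluation_pair = {
    "question": "What was the total purchase amount per supplier in 2024?",
    "sql": (
        "SELECT supplier, SUM(net_amount) AS total_amount "
        "FROM purchase_orders "
        "WHERE order_date LIKE '2024%' "
        "GROUP BY supplier"
    ),
}
```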
We implemented and tested three different approaches based on LLMs and existing frameworks such as PandasAI, LangChain, LlamaIndex and Vanna AI: these tools have the advantage of being preconfigured for our use case and therefore require very little customization.
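As an illustration, here is a minimal sketch of the chain approach with LangChain, assuming the langchain, langchain-community and langchain-openai packages are installed and an OpenAI API key is configured; exact APIs may vary between library versions, and the database is the illustrative SQLite one from the previous snippets.

```python
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI
from langchain.chains import create_sql_query_chain

# Connect to the (illustrative) purchasing database and the LLM.
db = SQLDatabase.from_uri("sqlite:///purchasing.db")
llm = ChatOpenAI(model="gpt-4", temperature=0)

# The chain reads the schema and turns a natural-language question into a SQL query.
chain = create_sql_query_chain(llm, db)
sql_query = chain.invoke({"question": "Which supplier received the most orders in 2024?"})

# The generated query can then be executed against the database.
print(sql_query)
print(db.run(sql_query))
```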
The "Execution Accuracy" score is calculated based on the initially created evaluation dataset. This score compares the execution results of the ground-truth SQL query and the predicted SQL query on the database content. Here are the results obtained:
Thus, we conclude that using a more recent language model (GPT-4) combined with either an agent approach leveraging metadata and few-shot examples, or a native LangChain chain approach, delivers the best results on our dataset of 30 question-answer pairs.
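To make the metric concrete, here is a minimal sketch of how execution accuracy can be computed, assuming a SQLite database and treating two queries as equivalent when they return the same set of rows (the dictionary keys are illustrative):

```python
import sqlite3

def execution_accuracy(pairs, db_path="purchasing.db"):
    """Share of pairs where the predicted SQL returns the same rows as the ground truth."""
    conn = sqlite3.connect(db_path)
    correct = 0
    for pair in pairs:
        try:
            gold = set(map(tuple, conn.execute(pair["gold_sql"]).fetchall()))
            pred = set(map(tuple, conn.execute(pair["predicted_sql"]).fetchall()))
            correct += int(gold == pred)
        except sqlite3.Error:
            pass  # a predicted query that fails to execute counts as incorrect
    conn.close()
    return correct / len(pairs)
```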
To facilitate interaction with the LLM and the database, we have designed a dedicated user interface based on Streamlit.
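Here is a minimal sketch of what such an interface can look like; the generate_sql helper is hypothetical (it stands in for the text-to-SQL chain or agent described above), and the database is the illustrative SQLite one used in the previous snippets.

```python
import sqlite3
import pandas as pd
import streamlit as st

def generate_sql(question: str) -> str:
    # Hypothetical text-to-SQL step: this is where the chain or agent would be called.
    return "SELECT supplier, SUM(net_amount) AS total FROM purchase_orders GROUP BY supplier"

st.title("Chat with your purchasing data")

question = st.chat_input("Ask a question about the data")
if question:
    st.chat_message("user").write(question)
    sql_query = generate_sql(question)
    with sqlite3.connect("purchasing.db") as conn:
        result = pd.read_sql_query(sql_query, conn)
    with st.chat_message("assistant"):
        st.dataframe(result)                          # the answer, as a table
        with st.expander("Show the generated SQL query"):
            st.code(sql_query, language="sql")        # explainability: the SQL behind the answer
```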
We integrated a feature that displays the SQL query used to answer a question. Here is an example:
Our experiments demonstrated the promising potential of LLMs to facilitate data access in enterprises. Approaches based on chains or agents stood out for their performance, achieving an execution accuracy of 70% on our experimental dataset. With GPT-4, these approaches exhibited impressive native knowledge of SAP data and offer a satisfactory level of explainability through the display of the generated SQL queries.
These results encourage us to believe that AI could soon surpass data analysis experts in terms of speed and efficiency, particularly in the Text-to-SQL domain. The existence of the BIRD benchmark, dedicated to evaluating AI performance on text-to-SQL tasks, is a major asset for tracking progress and setting performance standards for this use case.
Despite these promising advances, here are the three main challenges that remain:
1. Ensuring reliability and accuracy of LLM responses
Possible solutions:
2. Ensuring a good user experience
The interaction with the tool should be intuitive, much like ChatGPT. Best practices include:
3. Ensuring database protection
It’s crucial to guarantee secure interactions, especially at a time when data, often sensitive, plays a central role in decision-making. The use of external APIs like OpenAI raises legitimate concerns regarding data security and privacy. Therefore, robust solutions must be implemented to protect sensitive information:
The use of LLMs for conversational data exploration is a major innovation that could transform how businesses use and exploit their data. The progress made so far is encouraging and opens the door to an exciting future.