Translating Natural Language Queries into Structured Query Language (SQL) Using Large Language Models (LLMs)
(Project No: 125E386)

·        SPONSOR: Scientific and Technological Research Council of Turkey - TÜBİTAK

ABSTRACT: Relational databases are essential for the structured and secure storage of large amounts of data, and they offer quick access to this information. Companies, researchers, and organizations rely on these databases to get the most out of their extensive data. However, accessing and querying this data usually requires knowledge of a structured query language such as SQL, which can be a barrier for users without technical expertise. Translating natural language queries into SQL (Text-to-SQL) is therefore a critical step in making database access available to a broader audience.

The significance of Text-to-SQL systems lies in enabling users to interact with large databases and retrieve meaningful insights without needing to learn SQL. However, the task faces several challenges: complex database schemas, ambiguous user queries, and the difficulty of accurately interpreting user intent. Despite ongoing research, no existing system matches human-level performance on this task.

This project aims to tackle these challenges by utilizing large language models (LLMs) to improve the accuracy and effectiveness of translating natural language queries into SQL.

The project will focus on the following key areas:

1.     Iterative Query Enrichment with Large Language Models: To improve the ability of large language models to understand database schemas and generate correct SQL queries, natural language queries will be enriched iteratively with relevant database components (such as tables, columns, and conditions) using LLMs. This enrichment process will enable the models to produce more accurate and relevant SQL statements.

2.     Improving Schema Linking with Iterative Reasoning Feedback: The performance of the schema linking mechanism, which is one of the subsystems used in converting natural language queries into structured query language, will be improved through a feedback mechanism. Existing datasets will be modified to generate a new dataset for this feedback mechanism. Using this dataset, parameter-efficient fine-tuning will be performed on open-source large language models, resulting in two separate models. One of these models will be trained to select the necessary database elements (filtering) for a given user query, while the other will be trained to provide feedback on the correctness of the selected database elements. By iterating through the database filtering and feedback loop, schema linking will be improved, enabling the system to generate more accurate SQL queries from natural language queries.

3.     Fast Schema Filtering: Schema filtering (the selection of the database elements needed for a query) is a crucial part of converting natural language queries into structured query language. To make it faster and cheaper, the database will be pre-processed to generate candidate sub-schemas, and vector representations of these sub-schemas will be computed using large language models. When a natural language query needs to be translated into SQL, the vector computed for the query will be compared with the precomputed sub-schema vectors, and the most relevant sub-schema will be selected for query translation. Because the sub-schema vectors are computed ahead of time, only the query vector and the comparisons have to be computed at query time, which keeps schema filtering fast and inexpensive.
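
The lookup step described under Fast Schema Filtering can be illustrated with a minimal Python sketch. It is not the project's implementation: the sub-schemas, the names `SUB_SCHEMAS` and `select_sub_schema`, and the use of cosine similarity are assumptions made for illustration, and the `embed` function is a toy bag-of-words stand-in for the LLM-computed vectors mentioned above.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an LLM embedding (assumption): a bag-of-words count vector."""
    # Lowercase and treat every non-alphanumeric character as a separator,
    # so "employee_id" yields the tokens "employee" and "id".
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical candidate sub-schemas, produced by pre-processing the database.
SUB_SCHEMAS = {
    "sales": "orders(order_id, customer_id, total, order_date) "
             "customers(customer_id, name, city)",
    "hr": "employees(employee_id, name, salary, department_id) "
          "departments(department_id, name)",
}

# Offline step: compute one vector per sub-schema, once.
SUB_SCHEMA_VECTORS = {name: embed(ddl) for name, ddl in SUB_SCHEMAS.items()}


def select_sub_schema(question: str) -> str:
    """Online step: embed the question and return the closest sub-schema name."""
    query_vector = embed(question)
    return max(
        SUB_SCHEMA_VECTORS,
        key=lambda name: cosine(query_vector, SUB_SCHEMA_VECTORS[name]),
    )


if __name__ == "__main__":
    print(select_sub_schema("What is the average salary of employees in each department?"))
```

With these two toy sub-schemas, the salary question selects `hr`. In a real system the bag-of-words vectors would be replaced by model embeddings, and the selected sub-schema would be passed to the SQL generation step in place of the full schema.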

By developing these approaches, the project will advance the ability to translate natural language queries into SQL accurately and efficiently. This innovative work will not only contribute to the research community but also provide practical solutions for industries looking to make data access and analysis more accessible.

 

·        DURATION: November 2025 - November 2027

·        PRINCIPAL INVESTIGATOR: Özgür Ulusoy

·        GRADUATE STUDENTS: Hasan Alp Caferoğlu

·        BUDGET: 2,264,440 TL (~$54,000)