ABSTRACT: Relational databases are essential for the structured and secure storage of large amounts of data, and they offer fast access to that data. Companies, researchers, and organizations rely on these databases to extract value from their extensive data. However, accessing and querying this data typically requires knowledge of a structured query language such as SQL, which is a barrier for users without technical expertise. Translating natural language questions into SQL (Text-to-SQL) is therefore a critical step in making database access more user-friendly and available to a broader audience.
The significance of Text-to-SQL systems lies in enabling users to interact with large databases and retrieve meaningful insights without having to learn SQL. Several challenges stand in the way: complex database schemas, ambiguous user questions, and the difficulty of accurately interpreting user intent. Despite ongoing research, no existing system matches human-level performance on this task.
This project aims to tackle these challenges by using large language models (LLMs) to improve the accuracy and effectiveness of translating natural language queries into SQL. The project will focus on the following key areas:
1. Iterative Query Enrichment with Large Language Models: To improve the ability of LLMs to understand database schemas and generate correct SQL, natural language queries will be enriched iteratively with the relevant database components (tables, columns, and conditions) using LLMs. This enrichment will enable the models to produce more accurate and relevant SQL statements (a sketch of such an enrichment loop is given after this list).
2. Improving Schema Linking with Iterative Reasoning Feedback: Schema linking, one of the subsystems used in converting natural language queries into SQL, will be improved through a feedback mechanism. Existing datasets will be modified to produce a new dataset for this mechanism, and parameter-efficient fine-tuning of open-source LLMs on this dataset will yield two separate models: one trained to select the database elements needed for a given user query (filtering), the other trained to give feedback on the correctness of the selected elements. By iterating through this filtering and feedback loop, schema linking will be improved, enabling the system to generate more accurate SQL queries from natural language questions (the second sketch after this list illustrates the loop).
3. Fast Schema Filtering: To speed up and reduce the cost of schema filtering (the selection of the necessary database elements), a crucial part of converting natural language queries into SQL, the database will be pre-processed to generate candidate sub-schemas, and the vectors representing these sub-schemas will be computed with large language models. When a natural language query is to be translated into SQL, its vector will be compared with the precomputed sub-schema vectors, and the most relevant sub-schema will be selected for query translation. This keeps schema filtering fast and efficient (the third sketch after this list shows the core similarity search).
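As a rough illustration of the first focus area, the Python sketch below shows one way the iterative enrichment loop could be structured. The `llm` callable, the prompt wording, the function name `enrich_query`, and the stopping condition are illustrative assumptions, not the final design.

```python
from typing import Callable

def enrich_query(question: str, schema: str, llm: Callable[[str], str],
                 max_rounds: int = 3) -> str:
    """Iteratively enrich a natural language question with schema elements.

    `llm` is any text-in/text-out completion function (a placeholder here);
    each round asks the model to restate the question with the tables,
    columns, and filter conditions it believes are relevant.
    """
    enriched = question
    for _ in range(max_rounds):
        prompt = (
            f"Database schema:\n{schema}\n\n"
            f"Question: {enriched}\n\n"
            "Rewrite the question, making explicit the tables, columns, "
            "and filter conditions needed to answer it. "
            "If nothing can be added, repeat the question unchanged."
        )
        candidate = llm(prompt).strip()
        if candidate == enriched:  # no further enrichment found, stop early
            break
        enriched = candidate
    return enriched
```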
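The filtering/feedback loop of the second focus area could, under similar assumptions, look like the sketch below. `filter_model` and `feedback_model` stand in for the two parameter-efficient fine-tuned models; their interfaces, the "ok" acceptance signal, and the name `link_schema` are assumptions for illustration.

```python
from typing import Callable, List

def link_schema(question: str,
                schema_elements: List[str],
                filter_model: Callable[[str, List[str]], List[str]],
                feedback_model: Callable[[str, List[str]], str],
                max_iters: int = 3) -> List[str]:
    """Select schema elements for a question via a filter/feedback loop.

    `filter_model` proposes tables/columns from the candidate elements;
    `feedback_model` returns "ok" or a textual critique, which is appended
    to the question for the next filtering round.
    """
    context = question
    selected: List[str] = []
    for _ in range(max_iters):
        selected = filter_model(context, schema_elements)
        feedback = feedback_model(question, selected)
        if feedback.strip().lower().startswith("ok"):
            break  # feedback model accepts the current selection
        context = f"{question}\nFeedback on previous selection: {feedback}"
    return selected
```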
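For the third focus area, the core computation is a nearest-neighbour search over precomputed sub-schema embeddings. The sketch below uses cosine similarity with NumPy; the `embed` callable stands in for whatever LLM-based embedding model is chosen, and the offline sub-schema construction step is assumed to have already produced the `subschemas` texts.

```python
from typing import Callable, Dict
import numpy as np

def build_subschema_index(subschemas: Dict[str, str],
                          embed: Callable[[str], np.ndarray]) -> Dict[str, np.ndarray]:
    """Offline step: compute one embedding vector per candidate sub-schema."""
    return {name: embed(text) for name, text in subschemas.items()}

def select_subschema(question: str,
                     index: Dict[str, np.ndarray],
                     embed: Callable[[str], np.ndarray]) -> str:
    """Return the sub-schema whose vector is most similar (by cosine
    similarity) to the embedded question."""
    q = embed(question)
    q = q / (np.linalg.norm(q) + 1e-12)  # normalise the query vector
    best_name, best_score = None, float("-inf")
    for name, vec in index.items():
        score = float(q @ (vec / (np.linalg.norm(vec) + 1e-12)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```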
By developing these approaches, the project will advance the ability to translate natural language queries into SQL accurately and efficiently. This work will not only contribute to the research community but also provide practical solutions for industries seeking to make data access and analysis easier.