Seminar in Computer Engineering

Bilkent University
Department of Computer Engineering
M.S. THESIS PRESENTATION

Improving Text-to-SQL Translation Through Direct Schema Linking and Synthetic Data Driven Fine-Tuning

Hasan Alp Caferoğlu
Master's Student
(Supervisor: Prof. Dr. Özgür Ulusoy)

Computer Engineering Department
Bilkent University

Abstract: Translating natural language queries into Structured Query Language (Text-to-SQL or NLQ-to-SQL) is a fundamental problem at the intersection of natural language processing and databases, with the goal of enabling natural language interfaces to databases and making data more accessible to non-expert users. Although recent advances in large language models have substantially improved Text-to-SQL systems, major challenges remain, including handling complex database schemas, resolving ambiguity in user queries, generating SQL queries with intricate structures that accurately reflect user intent, and providing high-quality supervision for database-specific adaptation. These challenges arise both at inference time, where the model must accurately align the user question to the underlying database structure, and at training time, where effective adaptation requires reliable and sufficiently diverse supervision tailored to the target database. To address these two complementary aspects of the Text-to-SQL problem, two lines of research on large language model-based Text-to-SQL systems are presented. The first, E-SQL, investigates direct schema linking through a pipeline composed of candidate SQL generation, candidate predicate generation, question enrichment, and SQL refinement. In this pipeline, the natural language question is reformulated with schema-aware information such as relevant tables, columns, values, possible predicates, and SQL construction cues, in order to better align user intent with database structure and reduce ambiguity over complex schemas. Experimental results show that this approach achieves competitive performance on standard benchmarks, particularly on complex queries, while also indicating that a basic schema filtering technique may degrade performance when advanced proprietary large language models are used.
The second study, SING-SQL, introduces a fully automated two-stage framework for generating high-quality synthetic Text-to-SQL data for a target relational database without relying on SQL logs or manual annotation. The framework first hierarchically partitions the database schema into diverse sub-schemas for systematic coverage, and then synthesizes SQL-text pairs across controlled complexity levels through a quality-aware process involving validation, executability checks, automatic repair, reasoning trace generation, and column-focused balancing. The generated synthetic data is then used to fine-tune compact models for database-specific adaptation, and the results demonstrate that such models can achieve strong in-domain performance under database-specialized evaluation settings. Together, these studies address both schema-aware SQL generation at inference time and scalable supervision for in-domain adaptation, advancing Text-to-SQL systems across both general-purpose and database-specialized settings.
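As a rough illustration of what an executability check on synthetic SQL-text pairs might look like, the sketch below runs each candidate query against a database and keeps only the pairs that execute without error. It is not the SING-SQL implementation: the function names, the toy `student` schema, and the use of an in-memory SQLite database are all assumptions made for this example.

```python
import sqlite3


def check_executable(conn, sql):
    """Try to run `sql` on `conn`; return (ok, error_message)."""
    try:
        conn.execute(sql).fetchall()
        return True, ""
    except sqlite3.Error as exc:
        # Any SQLite error (bad syntax, missing table or column) marks the query as failed.
        return False, str(exc)


def filter_executable(conn, pairs):
    """Keep only the (question, sql) pairs whose SQL executes without error."""
    return [(q, s) for q, s in pairs if check_executable(conn, s)[0]]


# Toy in-memory database standing in for the target database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, gpa REAL)")
conn.execute("INSERT INTO student VALUES (1, 'Ada', 3.9)")

# Hypothetical synthetic pairs; the second query references a column that does not exist.
candidates = [
    ("Which students have a GPA above 3.5?", "SELECT name FROM student WHERE gpa > 3.5"),
    ("What is each student's age?", "SELECT age FROM student"),
]
kept = filter_executable(conn, candidates)
```

In a fuller pipeline, the error message returned for a failing query could be passed to a repair step instead of discarding the pair; that step is omitted here.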

Date: June 12, Friday @ 13:30

Place: EA 409