Bilkent University
Department of Computer Engineering
CS 590/690 SEMINAR
Agentic Evaluation of Software Engineering Artifacts via a Platform Testbed
Hesam Matinpouya
Ph.D. Student
(Supervisor: Asst. Prof. Anıl Koyuncu)
Computer Engineering Department
Bilkent University
Abstract: Software engineering research often depends on human evaluation of artifacts in tasks such as code summarization, commit annotation, semantic similarity of functions, code review, and UML/design evaluation, yet this process remains a critical bottleneck. Such studies take long to run; human annotation is expensive, slow, and difficult to scale, and even expert annotators often produce inconsistent judgments. These challenges limit the reliability and reproducibility of empirical software engineering studies. Meanwhile, large language models (LLMs) have recently shown strong performance on many software engineering tasks, which raises an important question: can they also be used to evaluate software artifacts, either by acting as human annotators do or by assisting them during the evaluation process? In this work, I propose a platform-based agentic framework for systematically evaluating software artifacts with LLMs. Unlike previous approaches that simply ask an LLM for a single direct answer, this framework treats evaluation as an interactive process in which agents are assigned roles, and it permits experimentation with a range of evaluation methods. It supports three modes of evaluation: (1) LLMs acting as independent evaluators, (2) human evaluators assisted by LLMs, and (3) multi-agent systems with predefined roles that deliberate before reaching a decision. Task standardization, role conditioning, and traceable interactions throughout the framework support reproducibility and fine-grained analysis. This research presents a novel set of experiments that analyzes the ability of LLMs to act as evaluators across various software engineering tasks, contributing empirical evidence on whether agentic LLM-based systems can emulate, or even improve upon, human judgment.
Date: Monday, April 20 @ 15:50    Place: EA 502