LivingMeta AI in Research

Large language models for fact-checking over tables

Thu, 10 Oct 2024 00:00:00 GMT

Master's degree thesis (Laurea Magistrale)

Design of a modular evaluation framework for retrieval-augmented generation systems

Thu, 03 Apr 2025 00:00:00 GMT

Master's degree thesis (Laurea Magistrale)

Building trustworthy AI from small DNNs to large language models: a software engineering perspective

Thu, 16 Jan 2025 00:00:00 GMT

As Artificial Intelligence (AI) software becomes increasingly prevalent across industries, concerns about its trustworthiness and reliability have come to the forefront. Although the trustworthiness of traditional software is regulated by Software Engineering (SE) practices, these practices have not been well integrated into AI model development because of the significant differences between traditional software development and AI model development. Motivated by this gap, this thesis aims to address trustworthiness systematically by regulating the AI development process through the lens of SE practices. Specifically, I focus on the key phases in the regulation of traditional software (development, execution, and testing) and identify the corresponding phases in AI model development: training, inference, and testing. These phases are crucial for ensuring the trustworthiness and reliability of AI models, and my study aims to improve them. My primary approach mirrors traditional software practice: first debug these phases, then implement repairs. Moreover, because large language models (LLMs) are revolutionizing the software industry, I explore the debugging and repair of AI software across all three phases (training, inference, and testing), covering both small Deep Neural Networks (DNNs) and LLMs.

Trustworthy AI: Ensuring Reliability and Accountability from Models to Agents

Tue, 09 Jun 2026 00:00:00 GMT

In this thesis, we develop algorithms with theoretical guarantees for ensuring the reliability and accountability of Machine Learning (ML) systems. As ML systems evolve from predictive models to generative models and autonomous agents, the landscape of trustworthy AI has shifted. This thesis introduces tools grounded in information theory, optimization, and statistical learning to mitigate bias, reduce arbitrary decisions, ensure content provenance, and evaluate LLM-driven agents in autonomous settings. To mitigate bias and arbitrariness in traditional ML models, we introduce a kernel-based method that achieves multiaccuracy across complex subpopulations that traditional demographic categories may overlook. We also develop methods to address predictive multiplicity, where equally accurate models yield conflicting individual predictions. We ensure accountability in generative AI by watermarking large language models (LLMs). We characterize the information-theoretic trade-off between watermark detection and text distortion and derive optimal watermarking strategies by leveraging optimal transport and coding theory. Empirical evaluations show that our watermarks achieve a superior detection-quality trade-off across language generation and coding tasks. Finally, we evaluate autonomous LLM agents in multi-agent environments through the first simulator of a fully LLM-driven supply chain. While agents can outperform human experts, reducing costs by up to 67%, we identify systemic risks such as costly tail events.

Escaping the Delphic Trap: Providing Variation Affordances to Foster Agency and Resilience in AI-Mediated Sensemaking

Thu, 18 Sep 2025 00:00:00 GMT

With the emergent capabilities of generative AI, many systems have been eager to adopt features providing synthesis affordances: properties that enable and entice users to engage with AI-synthesized information. There are contexts where offering synthesis affordances may be appropriate, perhaps even useful. However, the approaches that popular design patterns take to providing them are often not resilient: they fail upon breakdowns stemming from AI limitations, human factors, or information quality issues. More concerning, these design patterns distribute agency inappropriately between humans and AI by granting excessive agency to AI systems. Such configurations lure users into what we term the "Delphic Trap," where users are enticed to satisfice with suboptimal information practices. Drawing on theories from cognitive and ecological psychology and work in Human-Computer Interaction, we conceptualize variation affordances, defining them as system properties that invite users to engage with the inherent variation within information collections. Systems can offer these affordances through steerable controls that support and invite users to engage in productive friction during information seeking and sensemaking. We argue that one way to instantiate a more appropriate delegation of agency between humans and AI systems in this context is to provide variation affordances; doing so may help users engage in more intentional information actions. We design and evaluate several ways to provide these affordances. Chapter 1 establishes the motivating background for this work. Chapter 2 reviews structure-mapping theory literature to inform approaches for helping users utilize variation. Chapter 3 presents an eye-tracking ablation study (n=24) examining how participants interact with different feature sets providing various variation affordances, and discusses functional aspects to consider when implementing these affordances.
Chapter 4 examines how common design patterns offering synthesis affordances lack resilience and may cause users to fall into the Delphic Trap. To address these risks, we propose a design intervention that provides variation affordances through "AI Highlighters". Chapter 5 presents a formative study (n=24) assessing user interactions with our design intervention and its variation affordances.

A Unified Framework for Collaborative Knowledge Graph Construction, Editing, and Distribution

Wed, 04 Mar 2026 00:00:00 GMT

Knowledge graphs (KGs) have emerged as a critical technology for grounding artificial intelligence systems in structured facts, offering a solution to the hallucination and reliability issues plaguing large language models (LLMs). Despite their utility, the infrastructure required to construct, store, version, and collaboratively edit large-scale KGs remains fragmented. Previous work has addressed individual aspects of graph management but has failed to provide a unified, version-controlled ecosystem that supports the property-rich graphs required by modern applications. To address this infrastructure gap, this thesis introduces a comprehensive framework comprising four integrated systems: Optimus, a reproducible pipeline for graph construction; Diamond, a novel lossless binary compression format; GitGraph, a semantic version control system; and GraphEnv, an environment for multi-agent collaboration. We implemented this framework to enable the end-to-end lifecycle of graph development, from initial data ingestion to downstream applications. We utilized Optimus to construct OptimusKG, a biomedical KG with 192,307 nodes, 21.5M edges, and 88.6M properties, demonstrating a 56.5% reduction in build time through parallel execution. To address storage bottlenecks, we developed the Diamond algorithm, which we benchmarked against standard formats, achieving a 34× compression ratio on the popular PrimeKG dataset while preserving all node and edge properties. Furthermore, we formalized the theory of graph versioning by developing a three-way merge algorithm that allows for semantic, structure-aware conflict resolution, enabling true distributed collaboration. Finally, we integrated these tools into GRENCE, a clinical decision support application that uses our infrastructure to ground LLM reasoning in verifiable medical data.
This work establishes a robust software engineering foundation for KGs, transforming them from static artifacts into dynamic, evolving knowledge stores that can be efficiently maintained by hybrid teams of human experts and autonomous agents.
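To make the idea of a three-way merge concrete, here is a minimal sketch. It is not GitGraph's actual algorithm, which the abstract does not spell out. It assumes a graph can be represented as a dictionary from an element identifier (a node id or an edge tuple) to its properties; the name `three_way_merge` and this representation are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Hashable, List, Tuple

Graph = Dict[Hashable, Any]  # hypothetical: element id -> properties
_MISSING = object()          # sentinel: element absent in this version


def three_way_merge(
    base: Graph, ours: Graph, theirs: Graph
) -> Tuple[Graph, List[Hashable]]:
    """Merge two edited versions of a graph against their common ancestor.

    Returns the merged graph and the keys whose edits conflict.
    """
    merged: Graph = {}
    conflicts: List[Hashable] = []
    for key in set(base) | set(ours) | set(theirs):
        b = base.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        if o == t:        # both sides agree (same edit, or both untouched)
            value = o
        elif o == b:      # only "theirs" changed this element
            value = t
        elif t == b:      # only "ours" changed this element
            value = o
        else:             # both changed it differently: report a conflict
            conflicts.append(key)
            value = b     # keep the ancestor's value until resolved
        if value is not _MISSING:
            merged[key] = value
    return merged, sorted(conflicts, key=repr)


base = {"n1": {"name": "aspirin"}, ("n1", "n2"): {"type": "treats"}}
ours = {"n1": {"name": "Aspirin"}, ("n1", "n2"): {"type": "treats"}}
theirs = {"n1": {"name": "aspirin"}}  # this side deleted the edge
merged, conflicts = three_way_merge(base, ours, theirs)
```

In this example the rename from one side and the edge deletion from the other combine cleanly, because each element was changed by only one side. A semantic, structure-aware merge as described in the abstract would presumably go further, for example by checking that surviving edges still point at existing nodes.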

How Socially Aware Are Large Language Models? A Scoping Review With Implications for Education

Thu, 01 Jan 2026 00:00:00 GMT

This scoping review examines social reasoning and interaction in relation to Yang et al.’s (2025) framework of social awareness in Large Language Models (LLMs) and Natural Language Processing (NLP) research. Education is framed as a language-mediated system of evaluation through which LLMs expose institutional reliance on discursive outputs rather than access to cognition. A total of 1,312 records (2021–2025) were screened across four databases (ERIC, Scopus, IEEE Xplore, and ACM Digital Library), yielding 18 studies involving multi-turn or multi-agent interaction. Using Yang et al.’s (2025) framework of social factors, interaction, and implication, the analysis suggests that social awareness is not a unified construct but varies across roles, environments, and mechanisms. Theory of Mind appears as one pathway for operationalizing social reasoning and is inconsistently applied. Within simulations, socially relevant behavior is primarily evaluated through task-oriented metrics, with limited attention to interactional quality. While social factors and interaction are commonly represented across studies, downstream societal and institutional consequences receive comparatively limited attention. Overall, the field lacks a unified framework, and evaluation practices raise questions about how interactional claims are assessed as LLMs increasingly shape educational evaluation, institutional legitimacy, and pedagogical authority. Keywords: Large Language Models, social awareness, multi-agent systems, education

Problems in High-Dimensional Estimation and Large Language Models

Wed, 04 Mar 2026 00:00:00 GMT

This dissertation investigates critical problems at the intersection of high-dimensional statistics and the rapidly advancing field of large language models (LLMs), forging a narrative that bridges foundational theory with state-of-the-art applications. The work is presented in two interconnected parts, unified by the theme that principles of high-dimensional estimation provide a powerful framework for addressing key challenges in modern artificial intelligence. The first part establishes a rigorous theoretical foundation for high-dimensional estimation. We present a sharp asymptotic analysis of a spectral method, inspired by Principal Hessian Directions, for learning multi-index models from nonlinear measurements. In a high-dimensional regime where data and signal dimensions grow proportionally, our analysis reveals a distinct phase transition phenomenon. We derive a set of deterministic fixed-point equations that precisely characterize the method's performance, offering an exact quantification of the alignment between the estimated and true subspaces. This theoretical contribution extends prior work from single-signal to multi-signal recovery, deepening our understanding of learning and signal processing in high-dimensional spaces. The second part of this dissertation transitions from theory to practice, demonstrating how the mathematical rigor developed in the first part can be leveraged to solve pressing challenges in the development and deployment of LLMs. We introduce three novel frameworks. First, we propose a principled method for the Selection of LLM Fine-Tuning Data based on Orthogonal Rules, which uses the Determinantal Point Process (DPP) to select a diverse and non-redundant set of data quality metrics. This approach, grounded in the concept of orthogonality, significantly improves the efficiency and performance of model fine-tuning across multiple domains. 
Second, we introduce RuleAdapter, a dynamic framework for training multi-attribute reward models in Reinforcement Learning from Human Feedback (RLHF). Motivated by information theory, RuleAdapter adaptively selects the most critical safety rules for each context, leading to state-of-the-art safety performance and demonstrably more trustworthy LLMs. Third, we propose Semantic Volume, a novel, unsupervised geometric measure for quantifying and detecting both internal (model-based) and external (query-based) uncertainty in LLMs. By linking this measure to differential entropy, we provide a robust and interpretable method to enhance model reliability and mitigate hallucinations. Collectively, this dissertation demonstrates that a deep understanding of high-dimensional systems is not merely a theoretical pursuit but an essential tool for building more robust, trustworthy, and efficient large language models. The presented research offers new theoretical insights into high-dimensional learning and delivers practical, mathematically-grounded methodologies that advance the state-of-the-art in the responsible development of artificial intelligence.

Toward In-Context Teaching

Wed, 21 Aug 2024 00:00:00 GMT

When a teacher provides examples for a student to study, these examples must be informative, enabling a student to progress from their current state toward a target concept or skill. Good teachers must therefore simultaneously infer what students already know and adapt their teaching to students’ changing state of knowledge. There is increasing interest in using computational models, particularly large language models, as pedagogical tools. As students, language models in particular have shown a remarkable ability to adapt to new tasks given small numbers of examples. But how effectively can these models adapt as teachers to students of different types? To study this question, we introduce a suite of models and evaluation methods we call AdapT. AdapT has two components: (1) a collection of simulated Bayesian student models that can be used for evaluation of automated teaching methods; (2) a platform for evaluation with human students, to characterize the real-world effectiveness of these methods. We additionally introduce (3) AToM, a new probabilistic method for adaptive teaching that jointly infers students’ past beliefs and optimizes for the correctness of future beliefs. In evaluations of simulated students across three learning domains (fraction arithmetic, English morphology, function learning), AToM systematically outperforms LLM-based and standard Bayesian teaching models. In human experiments, both AToM and LLMs outperform non-adaptive random example selection. Our results highlight both the difficulty of the adaptive teaching task and the potential of learned adaptive models for solving it.

THE ROLE OF TEAM CHARACTERISTICS AND BUSINESS FUNDAMENTALS IN IDENTIFYING HIGH-GROWTH ENTREPRENEURIAL COMPANIES: A MACHINE LEARNING APPROACH

Mon, 30 Jun 2025 00:00:00 GMT

Using data from China’s New Third Board and machine learning, this study examines the value of team characteristics (“jockey”) and business fundamentals (“horse”) in identifying high-growth entrepreneurial firms. We define team features using management resumes and business fundamentals using financial statements. To measure team characteristics from unstructured resumes, we employ (i) manual data collection and (ii) ChatGPT-based automatic feature extraction, including text embeddings or human-defined features. Our results show that models using manual collection and ChatGPT-extracted features perform similarly, but both surpass text embedding models. While both team features and business fundamentals predict high growth, business fundamentals add no incremental value once team features are included. Additionally, our best machine learning models outperform VC investors in identifying high-growth firms. These findings contribute to the “jockey vs. horse” debate, showing that team characteristics are stronger predictors of high-growth ventures, but they appear to be still underweighted in VC investment decisions.

AI-generated feedback in education: a scoping review of its alignment with effective feedback strategies

Wed, 01 Jan 2025 00:00:00 GMT

In recent years, the integration of artificial intelligence (AI) in educational contexts has made considerable progress, particularly through the use of AI-generated feedback systems to support writing skills and language acquisition in secondary and higher education. This scoping review examines the types of feedback generated by Large Language Models (LLMs) and systems based on Natural Language Processing (NLP), and assesses the extent to which they align with pedagogically grounded feedback strategies as described by Narciss (2008) and Van der Kleij (2019). A systematic analysis of 29 peer-reviewed studies published between 2021 and 2025 shows that AI-generated feedback is predominantly formative and corrective in nature, and effectively fulfils cognitive functions by flagging errors and suggesting textual improvements. However, the integration of metacognitive and motivational feedback components, which are essential for fostering self-regulation and student motivation, proves inconsistent. In addition, the review points to widespread use of AI feedback in online learning environments, with varying degrees of adaptivity to individual learning needs. These results show the potential of AI systems to improve technical aspects of writing, but also underline shortcomings in supporting deep, reflective learning processes. Practical implications include the need to extend AI systems with metacognitive prompts and motivational elements, as well as professional development for teachers so that they can implement these tools critically. This study contributes to the educational discourse by offering a comprehensive synthesis of AI feedback applications and their pedagogical value, and provides evidence-based recommendations for future development and implementation.

Designing and evaluating prompting strategies for LLM-based data preparation

Wed, 10 Dec 2025 00:00:00 GMT

Master's degree thesis (Laurea Magistrale)

Context Clues: Probing Proactive LLM Decision-Making in Ambiguous, Socially Contextualized Multi-Turn Interactions

Thu, 18 Sep 2025 00:00:00 GMT

This study investigates the behavior of large language models (LLMs) in dynamic, multi-turn decision-making scenarios under conditions of ambiguity and social contextualization. Using a dual-agent framework, LLMs assume the roles of both questioner and answerer in two tasks: a 20 Questions game and a medical diagnosis simulation. We analyze how gender cues and target ambiguity, quantified via semantic entropy, affect reasoning efficiency and conversational success. Results show that ChatGPT-4o and Gemini 2.0 Flash exhibit sensitivity to social context and ambiguity, with ChatGPT-4o demonstrating superior reasoning in high-stakes settings. The findings highlight implicit biases and reasoning inefficiencies introduced by demographic context and underscore the limitations of single-turn evaluation metrics. This work emphasizes the need for interaction-aware evaluation frameworks for LLMs acting as proactive agents in socially sensitive domains.
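As a rough illustration of how ambiguity can be quantified as an entropy over meanings, the sketch below computes the Shannon entropy of a set of sampled answers after they have been grouped into meaning clusters. The abstract does not describe the authors' exact procedure. The function name `semantic_entropy` and the assumption that every sampled answer already carries a cluster label are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def semantic_entropy(cluster_labels: Iterable[Hashable]) -> float:
    """Shannon entropy (in bits) of the distribution over meaning clusters.

    Each label names the meaning cluster of one sampled answer; how answers
    are clustered is outside this sketch and is assumed to be done upstream.
    """
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one sampled answer")
    # sum of p * log2(1 / p) over clusters, with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


unambiguous = semantic_entropy(["cat", "cat", "cat", "cat"])
ambiguous = semantic_entropy(["cat", "dog", "fox", "owl"])
```

Under these assumptions, a target whose sampled answers all fall into one cluster has zero entropy, while answers spread evenly over four clusters give two bits, so higher values indicate a more ambiguous target.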

DafnyBench

Fri, 12 Jul 2024 00:00:00 GMT

It is said that “program testing can be used to show the presence of bugs, but never to show their absence”. As Large Language Models are increasingly used to generate code, it is important that we verify the absence of bugs in their outputs. For this reason, we introduce DafnyBench, a novel benchmark for evaluating Large Language Models' ability to write code in Dafny, a verification-aware programming language. With over 1,000 sample programs, DafnyBench is the largest benchmark of its kind to date. We argue that this benchmark is an important piece of infrastructure for an emerging field: AI-assisted formal verification. The benefits of having Large Language Models write verification-aware code cut in two directions: the Dafny verifier provides a rich stream of feedback and a clear learning signal for what constitutes correct code with respect to a set of pre- and postconditions (helping machine learning research efforts), and a greater prevalence of Dafny code generally leads to more safe software in the world.

Advancing knowledge-augmented reasoning and safety towards reliable large language models

Thu, 31 Jul 2025 00:00:00 GMT

In recent years, large language models (LLMs) have exhibited exceptional abilities in producing coherent and human-like text, capturing a wide range of linguistic patterns. Scaling the parameters of LLMs into the billions has significantly increased their versatility and adaptability, allowing them to handle new tasks with minimal or no additional fine-tuning. This advancement has opened up new possibilities for their applications in various domains. However, LLMs are fundamentally probabilistic models, which leads to unavoidable challenges in their reliability. LLMs often hallucinate facts, falter on tasks that require complex reasoning, and sometimes exhibit undesirable behavioural traits. This thesis explores a range of innovative methods and frameworks aimed at addressing these challenges, with the goal of creating artificial intelligence (AI) systems that are not only capable of leveraging knowledge to solve complex reasoning tasks, but also uphold human values. To mitigate hallucinations and strengthen reasoning capability, this thesis first advances the field of knowledge-augmented reasoning. Knowledge is often distributed across diverse sources and appears in varied formats. To harness heterogeneous knowledge, we propose the Chain-of-Knowledge (CoK) framework. CoK integrates unstructured text, knowledge graphs, and tabular data into a progressively revised Chain-of-Thought, thereby improving factual consistency in open-domain question answering. Recognising that successful problem solving also requires strategic control over retrieval, Critic-Guided Planning with Retrieval Augmentation (CR-Planner) couples a generative model with lightweight critic networks that decide when, how, and what to retrieve and that evaluate intermediate solutions; this architecture delivers substantial gains on tasks combining intensive knowledge access with complex reasoning, such as competitive programming, theorem proving, and complex domain retrieval.
Furthermore, another significant source of hallucination for LLMs lies in temporal reasoning. Current LLMs frequently misinterpret dates and numerical ordering. Our proposed TempLogic mitigates this weakness by distilling time-relevant context, extracting structured event-date tuples and executing a logic interpreter, thereby producing consistent, temporally accurate answers. Finally, Parallel In-Context Learning (ParaICL) extends the knowledge-augmentation theme to demonstration-based information. Treating few-shot examples as task-specific micro-knowledge, ParaICL clusters demonstrations by semantic similarity, processes each cluster within a manageable context window, and then aggregates the partial probability distributions. This design lets the model draw on all available external examples without breaching token limits and consistently lifts accuracy on reasoning benchmarks. Improving LLMs' reasoning capabilities is only half of the reliability equation; the same system must also behave safely when interacting with users. Conventional safety checks concentrate on overtly toxic sentences and often miss deeper behavioural tendencies. To fill that gap, we design a psychological-safety framework that probes LLMs with validated personality and well-being inventories. The analysis uncovers elevated dark-triad traits in several state-of-the-art LLMs, prompting a lightweight preference-based fine-tuning procedure that significantly attenuates those traits. This behavioural audit and mitigation highlight the need for safety evaluations that extend beyond surface-level toxicity to systemic patterns of model behaviour. In summary, this thesis addresses the dual challenge of enhancing the reasoning capabilities and safety of LLMs. While LLMs demonstrate remarkable text generation abilities, they suffer from hallucinations, reasoning failures, and undesirable behavioral traits. 
The research contributes novel frameworks for knowledge-augmented reasoning, including CoK for integrating heterogeneous knowledge sources, CR-Planner for strategic retrieval control, TempLogic for temporal reasoning accuracy, and ParaICL for demonstration-based reasoning. Beyond reasoning improvements, the work introduces a psychological-safety framework that identifies and mitigates problematic behavioral patterns in LLMs through personality-based evaluations and preference-based fine-tuning. Together, these contributions advance the development of AI systems that are both intellectually capable and behaviorally trustworthy, addressing fundamental reliability concerns in LLMs.
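The aggregation step described for ParaICL (combining the partial probability distributions obtained from each demonstration cluster) can be sketched as a weighted average. This is only an illustration under stated assumptions: the abstract does not say how clusters are weighted, so the equal-weight default, the function name `aggregate_distributions`, and the dictionary representation of a distribution are all hypothetical.

```python
from typing import Dict, List, Optional


def aggregate_distributions(
    cluster_probs: List[Dict[str, float]],
    weights: Optional[List[float]] = None,
) -> Dict[str, float]:
    """Combine per-cluster answer distributions into one normalized distribution."""
    if not cluster_probs:
        raise ValueError("need at least one cluster distribution")
    if weights is None:
        weights = [1.0] * len(cluster_probs)  # assumption: equal weighting
    if len(weights) != len(cluster_probs):
        raise ValueError("one weight per cluster is required")
    combined: Dict[str, float] = {}
    for probs, w in zip(cluster_probs, weights):
        for answer, p in probs.items():
            combined[answer] = combined.get(answer, 0.0) + w * p
    total = sum(combined.values())
    if total <= 0:
        raise ValueError("weights and probabilities must sum to a positive value")
    return {answer: score / total for answer, score in combined.items()}


# Two clusters of demonstrations, each yielding its own answer distribution.
final = aggregate_distributions([{"A": 0.6, "B": 0.4}, {"A": 0.2, "B": 0.8}])
best = max(final, key=final.get)
```

With equal weights the two clusters above average to roughly 0.4 for "A" and 0.6 for "B", so "B" is selected. A weighting based on, say, each cluster's similarity to the query would plug in through the `weights` argument.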

Can ChatGPT Help Students Feign ADHD? A Simulation Study on AI Coaching

Wed, 01 Jan 2025 00:00:00 GMT

This study examined the effects of AI-generated coaching, using ChatGPT, on university students’ ability to simulate Attention-Deficit/Hyperactivity Disorder (ADHD) during a standardized assessment. An AI-based coaching guideline was created by submitting questions about feigning ADHD from 21 students to ChatGPT, which generated a coaching document. In a simulation study, 122 university students were randomly allocated to one of three groups: AI-coached (AIC; n = 40), symptom-coached (SC; n = 42), and honest responding (HR; n = 40). All participants completed an assessment battery including Conners’ Adult ADHD Rating Scales (CAARS; with two embedded symptom validity tests, SVTs: Inconsistency Index, INC; and Infrequency Index, CII), the Weiss Functional Impairment Rating Scale (WFIRS), a computerized selective attention test from the Vienna Test System (WAFS), and the Reliable Digit Span (RDS), an embedded performance validity test (PVT). Both coached groups showed significantly higher symptom levels and greater impairments than the HR group (p < .001, r = -.34 to -.85). The AIC group exhibited more subtle symptom and impairment presentations than the SC group, but the differences remained nonsignificant with small effect sizes (p = .011 to .895, r = .01 to .28). Compared to the SC group, the AIC group showed lower detection rates on the SVTs (INC: 12.5% vs. 19%; CII: 17.5% vs. 38.1%) but similar rates on the PVT (RDS: 65% vs. 64.3%). Our results suggest AI coaching may facilitate more convincing feigning of ADHD compared to symptom coaching. However, perceived feigning success did not differ between AIC and SC groups. These findings highlight the potential threat AI-driven coaching poses to the validity of ADHD assessments and underscore the need for updated protocols to safeguard psychological testing. Keywords: ADHD assessment, malingering, artificial intelligence, ChatGPT, psychological test security

Disambiguating Large Language Model Performance on the Ambiguities of Law, Reasoning, and the Future

Wed, 17 Sep 2025 00:00:00 GMT

Over the past year, two separate cases of legal actors using LLM-based products to submit fictitious materials to American courtrooms have raised significant concern. In this thesis, I tackle one facet of this legal LLM usage problem, in arguing normatively for the need to teach LLMs legal reasoning. Drawing from legal theory but developing new definitions for the LLM context, I establish that legal reasoning requires reasoning over a central ambiguity in the application of general statutes to specific fact patterns — an inescapable ambiguity that I find traditional LLM reasoning research paradigms unequipped to handle. Thus, I contribute back to LLM reasoning research an experimental methodology that directly targets this central ambiguity and intuition in reasoning, as well as an evaluative framework that holistically captures true reasoning ability. I rework two prior legal ML datasets into benchmarks for a novel legal ambiguity identification task (available on HuggingFace). Conducting a preliminary experimental exploration of GPT-3.5 and GPT-4 models on this task, I discover some true ability at identifying legal ambiguity. Though it is currently weak and somewhat misaligned with legal practice, this ability shows promise for improvement, especially through model milestoning and fine-tuning. Overall, this thesis analyzes one component of ideal legal LLM usage from an interdisciplinary legal theory and computational perspective, toward progress in principled legal LLM implementations and more productive law and CS collaboration.

GENERATIVE AI IN HIGHER EDUCATION: A WOLF IN SHEEP'S CLOTHING?

Mon, 01 Jan 2024 00:00:00 GMT

This study focuses on identifying the advantages and disadvantages that teachers and students attribute to the use of Generative Artificial Intelligence (GenAI) in higher education, as well as the information skills they need to use it responsibly. It is a qualitative study with a phenomenological approach, intended to understand the experiences and meanings that participants attach to the use of GenAI. A total of 28 participants were interviewed, including 12 students and 12 teachers from various higher education institutions. Participants were recruited through convenience sampling and snowball sampling. The data were analyzed using thematic analysis to identify patterns and meaning in the qualitative data. The main findings suggest that GenAI can benefit learning and teaching processes and can have consequences for the development of students' academic skills. Finally, critical thinking, prompting, and system and subject knowledge are cited as the most important skills for using GenAI responsibly for educational purposes. Keywords: Generative AI, Integrated model of behavioral prediction, advantages and disadvantages, information skills, higher education

Leveraging Passive User Context For Human-AI Collaboration

Thu, 18 Sep 2025 00:00:00 GMT

With the rapid advancement of Artificial Intelligence (AI)-powered tools, particularly Large Language Models (LLMs), understanding user intent has emerged as a fundamental challenge for creating effective, user-centric tools. Although users can articulate their goals explicitly, traditional approaches to eliciting intent (such as detailed prompts, additional examples, or formal specifications) often impose a high cognitive burden and fail to fully capture the subtleties of users’ evolving needs. This dissertation argues that augmenting context awareness, particularly the passive capture of environmental and interaction cues, can reduce ambiguity in user intent, leading to more intuitive and efficient human-AI collaboration. We demonstrate the value of passive context awareness across three domains. In Chapter 1, we apply pragmatic reasoning to regular expression synthesis, reducing the need for exhaustive user examples by treating the examples the user did not provide as contextual cues. Chapter 2 introduces DynaVis, a dynamic interface for visualization editing that combines natural language input with dynamically generated UI widgets, showcasing how local workflow and task context can streamline iterative edits. Chapter 3 discusses MagicCopy, an AI-driven copy-and-paste tool that infers user intent by analyzing source and target applications alongside user instructions to automate cross-application data transformations. Finally, we envision the future of AI design workflows, emphasizing the importance of two-way grounding where systems not only interpret user context but also reveal their reasoning and capabilities. Together, these contributions highlight the potential of passive context-awareness to transform interactive AI by delivering more seamless, contextually informed assistance.

Guided Proof Search Using Large Language Models and Lemma Extraction in Coq

Fri, 12 Jul 2024 00:00:00 GMT

Interactive theorem provers are powerful tools for formalizing mathematics and verifying the correctness of software. However, they require significant background and effort to use due to the tedious nature of writing formal proofs and have not seen widespread adoption among either mathematicians or software engineers. Automated theorem provers aim to address this problem by automating the search for proofs, reducing the amount of human effort required. Recent advances in machine learning have shown that large language models can generate high-quality outputs in a variety of domains, including that of formal proofs. Most previous approaches that use language models, however, have focused on generating individual proof steps and using them in conjunction with an expensive search algorithm to find proofs. In 2023, First et al. introduced Baldur, a system that uses language models to generate entire proofs at once, instead of step-by-step, for theorems in the Isabelle proof assistant. This thesis studies the feasibility of a similar whole-proof generation procedure for the Coq proof assistant and introduces a novel approach to automated theorem proving that recursively extracts lemmas at failure points in the proof generation process, allowing the system to break complex theorems down into simpler subproblems. We evaluate these approaches on a dataset of 724 theorems from the Software Foundations textbook and show that GPT-4 can generate whole proofs for 66.44% of the theorems. Additionally, when augmented with our lemma extraction method, GPT-4 sees a 19.54% relative improvement to achieve a success rate of 79.42%, thus marginally outperforming CoqHammer, a state-of-the-art automated reasoning tool, which proves 78.73% of the theorems. We also evaluate the much smaller open-source model Phind CodeLlama, which shows a 103.23% improvement over its baseline when utilizing lemma extraction. 
We release our Coq playground that contains an implementation of this procedure along with the dataset and evaluation results through an open-source repository to encourage further research in this area.
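The improvement figures quoted in this abstract read as relative gains over the baseline rather than percentage-point differences. A minimal Python sketch of that arithmetic follows; the function name `relative_improvement` is hypothetical and chosen only for illustration, and the two rates are the ones stated in the abstract.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, expressed in percent."""
    return (improved - baseline) / baseline * 100.0


# Success rates (%) as reported in the abstract.
baseline_rate = 66.44  # GPT-4, whole-proof generation only
improved_rate = 79.42  # GPT-4 with lemma extraction

gain = relative_improvement(baseline_rate, improved_rate)
print(f"relative improvement: {gain:.2f}%")  # prints "relative improvement: 19.54%"
```

The absolute gain is about 13 percentage points; dividing it by the 66.44% baseline gives the 19.54% quoted above.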

Advancing Text Evaluation in the Era of LLMs: Toward Robust Metrics, Consistency, and Human Alignment

Sun, 28 Jun 2026 14:19:26 GMT

The rapid advancement of Large Language Models (LLMs) has redefined the landscape of natural language generation (NLG), yet their evaluation remains one of the most persistent and underdeveloped challenges in computational linguistics. Traditional evaluation methods, rooted in reference-based similarity metrics or limited human judgment, fail to capture the deeper structural, rhetorical, and logical dimensions of text quality. This thesis addresses these limitations of modern text evaluation from several perspectives, including proposing evaluation algorithms, identifying evaluation dimensions, and designing assessment frameworks. The thesis unfolds across six chapters. It begins by situating the problem within the broader context of NLG’s evolution and the emergent role of LLMs as evaluators. Through an extensive literature review, it bridges the gap between classical discourse analysis and contemporary evaluation paradigms, revealing the need for models that can assess text beyond surface-level fluency or lexical overlap. The first contribution is a novel, reference-free metric, Positional Discourse Divergence (PDD), that quantifies structural coherence by analyzing the distributional patterns of discourse elements across a text. PDD offers a principled means of capturing structural quality, demonstrating superior sensitivity to narrative and argumentative flow compared to existing metrics. In the third chapter, the thesis introduces a scalable comparative evaluation framework that enhances the human alignment and computational efficiency of LLM-based evaluators. By leveraging pairwise preference judgments and an innovative rank aggregation method, this approach yields more reliable and interpretable assessments of text quality. Finally, the thesis addresses the logical consistency of LLM-based evaluators, an often-overlooked yet essential criterion for AI systems. 
It presents a formal framework for diagnosing violations of rational axioms such as transitivity and commutativity, and proposes a data refinement and augmentation method that substantially improves consistency without sacrificing evaluative flexibility. Collectively, these contributions establish new paradigms for evaluating text generated by humans and machines alike. The thesis concludes by outlining a vision for the next generation of evaluation systems: robust, interpretable, and aligned with the evolving epistemics of artificial intelligence research.
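To make the consistency axioms concrete, here is a minimal Python sketch of how violations could be detected in a set of pairwise judgments. It is an illustration under stated assumptions, not the thesis's framework: the data layout (`prefers[(a, b)]` is `True` when the evaluator prefers `a` over `b` with `a` presented first), the function names, and the reading of "commutativity" as invariance to presentation order are all assumptions made for this example.

```python
from itertools import permutations


def transitivity_violations(prefers):
    """Triples (a, b, c) where a beats b and b beats c, yet a loses to c.

    `prefers[(a, b)]` is True when the evaluator prefers a over b
    (hypothetical data layout chosen for this sketch).
    """
    items = sorted({x for pair in prefers for x in pair})
    violations = []
    for a, b, c in permutations(items, 3):
        if prefers.get((a, b)) and prefers.get((b, c)) and prefers.get((a, c)) is False:
            violations.append((a, b, c))
    return violations


def commutativity_violations(prefers):
    """Pairs whose verdict is not mirrored when the presentation order is swapped."""
    violations = []
    for (a, b), first_wins in prefers.items():
        # A consistent evaluator gives prefers[(b, a)] == not prefers[(a, b)].
        if a < b and (b, a) in prefers and prefers[(b, a)] == first_wins:
            violations.append((a, b))
    return violations


# Toy judgments: A > B and B > C, but C > A, and the (A, C) pair
# is also judged inconsistently across the two presentation orders.
judgments = {
    ("A", "B"): True, ("B", "A"): False,
    ("B", "C"): True, ("C", "B"): False,
    ("A", "C"): False, ("C", "A"): False,
}
print(transitivity_violations(judgments))   # [('A', 'B', 'C')]
print(commutativity_violations(judgments))  # [('A', 'C')]
```

Counting such violations over many sampled triples and pairs gives a simple consistency rate that can be tracked before and after any refinement step.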

Detecting fact-conflicting hallucinations through the use of large language models

Thu, 10 Oct 2024 00:00:00 GMT

LAUREA MAGISTRALE

When AI Talks About Nature: Ideological Bias in ChatGPT’s Environmental Discourse Across Priming Conditions

Wed, 01 Jan 2025 00:00:00 GMT

Biodiversity conservation is a global imperative, yet debates over how to balance economic and ecological priorities remain deeply polarised. Large language models (LLMs) like ChatGPT now play a significant role in shaping public discourse, raising concerns that their outputs may reinforce ideological divisions through biased or primed responses (Kaneko et al., 2024). While prior research has addressed LLM biases in domains such as gender, race, and politics, there appears to have been no systematic investigation yet into how prompt priming influences LLM outputs in biodiversity-related discussions. This study examines the extent to which ChatGPT-generated responses reflect or amplify political and ideological biases in biodiversity discourse, with a focus on the effects of prompt priming. Using a controlled experimental design, both GPT-4.1 and GPT-4o models were prompted under five ideological conditions. Responses to the validated 24-item Likert-scale Environmental Attitudes Inventory (Milfont & Duckitt, 2010) and corresponding open-ended questions were analysed using a combination of quantitative (ANOVA, Kruskal-Wallis, regression) and linguistic (LIWC-22) methods. Results reveal robust, systematic effects of both priming direction and intensity on model outputs, affecting not only stated attitudes but also linguistic features such as analytic style, emotional tone, and social framing. Furthermore, model architecture influenced the degree and nature of these shifts, with notable differences between GPT-4.1 and GPT-4o. These findings highlight the sensitivity of LLMs to prompt context and underscore the importance of transparency and bias mitigation in their deployment for public-facing environmental communication. The study contributes to ongoing discussions about the ethical and political implications of generative AI in shaping environmental and policy debates.

CLOSED-LOOP SCALING: AUTONOMOUS IMPROVEMENT OF LLM AND LVLM REASONING

Tue, 02 Jun 2026 00:00:00 GMT

As human-curated data approaches exhaustion, sustaining the improvement of large language models (LLMs) and large vision-language models (LVLMs) demands a paradigm shift. This thesis proposes automatic scaling: a closed-loop framework in which models autonomously improve through their own computation via three layers. Inference-time scaling treats reasoning as search guided by self-evaluation. Training-time scaling internalizes search-discovered knowledge into parameters through iterative preference alignment. Architectural grounding provides structural foundations for sustainable scaling. Through critical analysis, we identify the coherence-correctness gap as a systemic limitation of self-referential scaling and present MVP-Bench, a diagnostic benchmark revealing significant deficits in multi-level visual perception. We propose future directions in dynamic evaluation, agentic scaling, pre-linguistic reasoning foundations, and native multimodal scaling, delineating both the promise and boundaries of autonomous model improvement.

USING AI REDUCES THE USER’S PERCEIVED VALUE OF HARD WORK

Sun, 31 Mar 2024 00:00:00 GMT

This project examines how using generative AI tools, such as ChatGPT, can affect the perceived value of hard work for users by reducing the uniqueness of their work. In the first study, I analyze archival data from the United States, uncovering a negative correlation between the appreciation of algorithms and the perceived worth of labor. Studies 2 and 3 involve online experiments with working adults in which participants are randomly assigned to two conditions: writing a 200-word essay with or without the assistance of ChatGPT. The results indicate that participants using AI assistance to write their essays tend to perceive hard work as less valuable. This effect is further explained by a serial mediation process through reduced competence fulfillment and a sense of diminished uniqueness in the essay. Notably, third-party evaluators rate the AI-assisted essays as objectively superior to those composed without AI, even though the writers themselves do not subjectively differentiate them. Study 4 replicates the experiment with participants aged 12-15, but the main effects are not replicated in this younger age group. Overall, the study's findings suggest that while AI is a valuable technological tool, its usage may diminish the value individuals place on their hard work.

Machine Learning Applications Supporting Large Scale Programming Education

Wed, 10 Apr 2024 00:00:00 GMT

Providing effective individualized education at scale has been a widely explored topic in education research, and the advancement of recent machine learning methods has made it possible to develop increasingly effective adaptive and intelligent learning systems. In particular, the emergence of deep learning models, and most recently large language models, has propelled the educational field forward, providing both new challenges and opportunities for educators. This dissertation addresses some of these challenges and opportunities, focusing on machine learning methods as a means to enhance large scale programming education. We first present methodological considerations for identifying learners at risk of dropping out, and an empirical evaluation of modern machine learning approaches for evaluating student mastery of skills. Then, we analyse features that relate to students continuing in a series of open online courses for introductory programming. Given the constant need to produce new learning materials to keep course content relevant in the rapidly evolving landscape of programming and computer science, and the fact that producing such materials with appropriate quality can be a highly time-consuming task for educators, we propose and evaluate a novel approach that leverages large language models to create learning materials, particularly programming exercises and code explanations, which can be personalized for student needs and interests for increased engagement. The approach shows promising results in generating diverse, coherent, and relevant content. Most of the generated exercises were considered sensible, novel, and adhering to given themes and concepts. Further, we evaluate automatically generated code explanations in real educational settings and show that students tend to rate automatically generated explanations as useful for their learning, even more useful than those of their peers. 
As a means to help students, this dissertation looks into improving the timeliness of feedback, a key aspect of its effectiveness. This is done by proposing a framework that includes a machine learning step for speeding up automated assessment, which consequently speeds up assessment feedback, and by constructing annotated datasets of when and how experts provide feedback and hints to learning programmers, which can be used as a reference on when and how future machine learning models or other automated methods should provide feedback. As a whole, the scope of the dissertation encompasses much of the educational process, spanning from (1) identifying learners' needs and those who would benefit from additional assistance, to (2) educators designing content for learning and practice, to (3) helping learners through timely and meaningful feedback. The results in this dissertation showcase both methodological issues and new avenues for enhancing large scale computing education through machine learning methods.

FACT-CHECKING COMPLEX CLAIMS WITH LARGE LANGUAGE MODELS

Mon, 19 Jan 2026 00:00:00 GMT

Fact-checking (FC) aims to verify claims against evidence, but existing automated FC systems struggle with complex claims requiring multi-hop reasoning, numerical analysis, structured data interpretation, and domain-specific verification. Large Language Models (LLMs) show strong reasoning abilities and explainability, making them promising for FC. However, they suffer from inconsistent reasoning, imprecise numerical operations, and difficulty processing structured evidence, limiting their reliability in complex verification. This thesis systematically evaluates LLMs in complex FC through SCITAB, a benchmark covering tasks requiring structured multi-hop reasoning, numerical validation, and domain-specific interpretation. SCITAB reveals three major limitations: 1) inconsistent multi-hop reasoning, where models fail to aggregate information across sources; 2) imprecise numerical reasoning, leading to errors in arithmetic computations; and 3) poor table understanding, where LLMs misinterpret row-column relationships and retrieve misaligned evidence. To address these issues, this thesis develops three frameworks: (1) TART, a tool-augmented reasoning framework integrating LLMs with external computational tools, enabling precise numerical reasoning, structured data interpretation, and explainability for table-based claims. (2) QACHECK, which guides reasoning via question-driven decomposition, improving logical consistency and enhancing explainability through explicit intermediate steps. (3) PROGRAMFC, which enhances multi-hop reasoning through structured, programmatic decomposition, enabling more reliable and transparent verification.

Language Models as Mirrors and Bridges for Intergroup Communication

Wed, 10 Dec 2025 00:00:00 GMT

This dissertation explores how large language models (LLMs) can serve dual roles in intergroup communication: as mirrors that reflect intergroup differences, and as bridges that facilitate communication across group boundaries. Intergroup communication refers to interactions between individuals from different social groups, such as political, cultural, or professional communities, where divergent perspectives often lead to misunderstandings, unequal access to information, and social fragmentation. The first part of the dissertation presents LLMs as mirrors that reveal intergroup differences. We first introduce CommunityLM, a novel framework for probing public opinion by fine-tuning LLMs on social media posts from specific communities. Our case study comparing Republican and Democratic groups reveals that model predictions align well with human survey responses, substantially outperforming established baselines. Building on this foundation, we develop PersonaLLM to investigate whether prompt-based LLM agents can generate content aligned with assigned personas, which has emerged as a popular approach for modeling the behaviors of social groups. Through automated and human evaluations, we demonstrate that these agents can complete personality tests and write stories that reflect the distinctive behavioral patterns of specific personality profiles. Together, these complementary projects illustrate how LLMs can effectively capture and simulate the unique perspectives and behaviors that characterize diverse social groups. The second part of the dissertation presents LLMs as bridges that facilitate communication across group boundaries. First, we introduce Bridging Dictionary, an interactive tool that uses retrieval-augmented generation (RAG) techniques with LLMs to identify polarized language and suggest more inclusive alternatives. 
In collaboration with PBS Frontline, we demonstrate the potential of LLMs to reduce misunderstanding in journalism and political communication. Second, we present Legal Storytelling, a human-LLM collaboration framework that generates accessible narratives to explain complex legal concepts to non-experts. Through randomized controlled trials (RCTs), we find that LLM-generated narratives can improve legal literacy and help bridge communication gaps between experts and laypeople, particularly among non-native English speakers. Third, we develop FaciliTrain, a voice-based, LLM-powered system that enables facilitators to learn and practice intergroup dialogue skills with multiple LLM agents representing diverse social backgrounds and personas in a small-group setting. User studies with campus participants show encouraging early results, suggesting that LLMs can effectively support the development of communication skills essential for constructive intergroup dialogue. Together, these projects illustrate how LLMs can actively foster mutual understanding across social divides by promoting more inclusive, accessible, and constructive communication.

The Influence of AI-Like Text on Responses To Disclosure: Evidence From AI Detection Models

Fri, 15 Aug 2025 00:00:00 GMT

The rise of ChatGPT and other generative AI models has revolutionized machine-generated text. One potential application of this technology is helping craft firms’ narrative disclosures. Using two highly rated, commercially available AI detection models, I create novel measures of AI-like text in disclosure based on AI detection models’ classification that the text was generated either wholly or partly by AI. Using these measures, I study changes in disclosure surrounding the release of ChatGPT-4.0 in early 2023 and document a significant increase in the incidence of AI-like text in earnings conference call prepared remarks but not in managers’ responses to questions. Further evidence suggests that AI-like text in disclosure is more common among smaller, younger firms, and, on average, exhibits more positive tone, less uncertainty, and more forward-looking statements than non-AI-like disclosure text. I then compare the market responses to linguistic measures from AI-like disclosure text and non-AI-like disclosure text. Contrary to other studies that find generative AI text to be of higher quality and more persuasive than human text, my evidence suggests that tone in non-AI-like disclosure text is more strongly associated with returns. Overall, my results suggest that AI-like text may mute responses to information in disclosure.

Practical Considerations For the Deployment of Clinical NLP Systems

Wed, 21 Aug 2024 00:00:00 GMT

Although recent advances in scaling large language models (LLMs) have resulted in improvements on many NLP tasks, it remains unclear whether these models trained primarily with general web text are the right tool in highly specialized, safety critical domains such as healthcare. A healthcare system attempting to automate a clinical task must weigh all approaches with respect to safety, efficacy, and efficiency. This thesis investigates the challenges and implications of implementing LLMs in clinical settings, focusing on the three considerations listed above: safety, efficacy, and efficiency. We first explore the potential biases that might be introduced in downstream patient safety by using LLMs in a zero or few-shot setting and find that LLMs can propagate, or even amplify, harmful societal biases in a number of clinical tasks. Then, we examine the privacy considerations of pretraining a language model on protected health information (PHI) bearing clinical text and find that simple probing methods are unable to meaningfully extract sensitive information from an encoder-only language model pretrained on non-deidentified electronic health record (EHR) notes. Finally, we conduct an extensive empirical analysis of 12 language models, ranging from 220M to 175B parameters, measuring their performance on 3 different clinical tasks that test their ability to parse and reason over electronic health records. We show that relatively small specialized clinical models are substantially more effective than larger models trained on general text used through in-context learning. Further, we find that pretraining on clinical tokens allows for smaller, more parameter-efficient models that either match or outperform much larger language models trained on general text. We argue that using a clinical text-specific pretrained language model allows for an efficient, effective, and privacy-conscious approach, enabling a tailored and ethically responsible application of AI in healthcare.

Towards Automated Healthcare: Deep Vision and Large Language Models for Radiology Report Generation

Thu, 06 Jul 2023 00:00:00 GMT

Automatic radiology report generation has the potential to improve patient care and reduce diagnosis delays. Deep learning approaches have shown promising progress but are still not accurate enough for clinical deployment. In this thesis, we investigate and develop two approaches for report generation, one retrieval-based and one generation-based, both of which leverage deep vision-language pre-training. Our retrieval-based method uses a multimodal encoder and contrastive loss to learn pre-trained radiology image and text representations, followed by a learned image-text matching similarity metric for retrieval. This method achieves state-of-the-art results on clinical accuracy and natural language metrics including CheXpert vector disease profile similarity and BLEU2 score. We also conduct an expert evaluation study on a subset of samples, where we collect radiologists' error annotations on our generated reports, a baseline method's generated reports, and human-written reports. The study confirms that our method improves significantly upon the baseline, and we will release the dataset of error annotations to aid future research into types of generated report errors and alignment of evaluation metrics with human radiologists' assessment. For our generation-based method, we use a querying transformer module for modality alignment between an image encoder and a text decoder. We also investigate a novel prompting method to generate both impression and findings report sections with the same model to increase efficiency. The model is trained on a mixed report section dataset and can be prompted to generate both report sections with similar performance to separate single-section models. Finally, we study the impact of different pre-training methods for the querying transformer and find that unlocking the image encoder during pre-training helps with domain adaptation and clinical accuracy but not natural language metrics.

Augmenting medical image classifiers with synthetic data across populations

Sat, 01 Jun 2024 00:00:00 GMT

Rapid improvements in the capabilities of generative artificial intelligence (AI) models to produce and interpret images have created new possibilities to address persistent challenges in medical machine learning including data scarcity, annotation costs, and model biases. This dissertation primarily explores the potential opportunities and limitations of using synthetic images created by large generative AI models to improve the performance and generalizability of medical image classifiers across populations. We also evaluate the ability of consumer-facing vision-language models to classify dermatology and chest X-ray images under different prompts. We designed an image-generation pipeline by fine-tuning diffusion-based models to create synthetic skin disease images using both generative fill and text-to-image methods. With this pipeline, we generated 500,000 synthetic dermatology images (which we publicly released for future research) representing 12 diseases across diverse skin tones. We then systematically evaluated the performance of AI disease classifiers when including or excluding synthetic images in model training. We show that in data-limited settings (in which there are few real images of a disease or skin-tone), synthetic data can improve classifier performance, but that these gains saturate once sufficient quantities of real images are available. We find that the biggest driver of model improvements is the quantity of real images. We also observed a correlation between the physician-assessed photorealism of synthetic images and gains in model performance. Collectively, these findings suggest that synthetic data presents a complementary tool to training disease classifiers and can be useful as an advanced augmentation method or a way to share features of a data distribution without sharing the data itself. 
However, efforts must still be focused on collecting more high quality, diverse, real data to train the next generation of fair, robust, and generalizable AI systems. We also evaluated the capabilities of consumer-facing and general-purpose vision-language AI models in interpreting chest X-rays and dermatology images. We found that these systems, which have not been specifically trained for medical image diagnoses, can perform at or near-human level on selected metrics, and that model performance and behavior can be influenced using the text prompt and task formulation. Our analysis suggests that evaluations of large language and vision-language models should carefully consider the prompt context and other inputs. Overall, this dissertation provides a systematic analysis of the opportunities, pitfalls, and open challenges regarding the use of synthetic data and generative AI for improving medical imaging across all populations.

Domain Adaptation of LLMs for Materials Science: Dataset Curation, Fine-Tuning, and Evaluation Benchmark

Thu, 21 May 2026 00:00:00 GMT

This thesis addresses the gap between the capabilities of general-purpose large language models (LLMs) and the specialized needs of materials science. While LLMs have shown transformative potential in many fields, their application in materials science remains limited due to the lack of domain-specific natural language datasets and evaluation benchmarks. To overcome this challenge, the thesis introduces a curated instruction-tuning dataset composed of diverse question-answer (QA) pairs drawn from various materials science sources such as textbooks, property databases, and expert forums. This dataset was used to fine-tune the LLaMA-3-8B-Instruct model to create a materials science domain expert. However, the experiments revealed limitations in both the fine-tuning dataset and the evaluation. These findings led to a refined, higher-quality second pipeline focused on building an evaluation benchmark rather than a full training corpus. The final benchmark includes 2,100 QA pairs across a wide range of materials science topics, with a subset validated by human experts. The thesis adopts an LLM-as-judge framework to evaluate models on this benchmark, where GPT-4o is used to compare model answers against gold responses using a rubric. We are conducting a human expert annotation study to show that GPT-4o, as a judge, is reliable. Ultimately, this work contributes a complete pipeline, from dataset construction to evaluation, for developing and benchmarking domain-specific LLMs in materials science. It lays the groundwork for future efforts toward using LLMs in scientific research.

Automating Meta-Analysis for Clinical Researchers

Mon, 08 Jun 2026 00:00:00 GMT

The ever-growing volume of clinical research poses a significant challenge for researchers striving to perform timely and effective literature reviews. This thesis presents the development of a full-stack web platform designed to automate the meta-analysis process for clinical researchers. Leveraging OpenAI’s ChatGPT for natural language processing and semantic analysis, the platform enables users to extract key attributes from PubMed articles and view the results in a customizable tabular format. Iterative design and feedback from Emory University researchers guided the interface toward a user-friendly experience tailored to clinician needs. Technical achievements such as asynchronous data processing, prompt engineering for better output, and visualizations of queried data contributed to the platform's success. Results indicated a substantial reduction in research effort and time, with the system capable of handling queries with up to 10,000 articles and decreasing research time by an estimated 65%. This work highlights the potential of integrating large language models into clinical research workflows to expedite medical discoveries.

Luna: A Game-Based Rating System for Artificial Intelligence

Tue, 26 Mar 2019 00:00:00 GMT

Research in Artificial Intelligence (AI) is driven by standardized tests and benchmarks. The level of success of a model on a popular benchmark can determine the amount of funding and attention from academia that the model receives. Despite this emphasis on testing, there are currently no widely accepted practical benchmarks for general AI. The Turing Test has long occupied this void in theory, but it has proven to be a poor practical guide for research, prompting a recent push in the research community to move "beyond the Turing Test". In this thesis, I put forth the Luna Rating System as a practical benchmark for AI. The system takes inspiration from chess ratings; humans and machines participate in two-player language-based games called Luna Games, and "Smarts Ratings" are assigned to both players based on the outcomes. The Smarts Rating of a machine player is indicative of its proximity to AI. After presenting the Luna Rating System and defining the Luna Game, I evaluate the robustness of the system to likely human player strategies. I then describe the three machine learning problems implicit in the Luna Game: Question Generation, Question Answering, and a third previously uncharacterized problem that I call Luna Rating Prediction. Finally, I introduce a web-based implementation of the Luna Rating System and recruit over 1200 human participants. The complete thesis amounts to a comprehensive introduction and evaluation of Luna as a practical test for AI.
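For readers unfamiliar with chess-style ratings, the following Python sketch shows the standard Elo update after a two-player game. It is offered only as background: the abstract says the Luna Rating System takes inspiration from chess ratings but does not state its update rule, so the use of Elo here, the function names, and the K-factor of 32 are assumptions, not the thesis's method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo expected score of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k (assumed value) controls how far a single game moves the ratings.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two equally rated players; A wins, so A gains what B loses.
print(elo_update(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

In a scheme of this kind, beating a higher-rated opponent moves a rating more than beating a lower-rated one, which is the property that lets ratings of humans and machines be compared on one scale.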

Exploring human-centered AI assistance in interview moderation

Thu, 26 Mar 2026 00:00:00 GMT

LAUREA MAGISTRALE

Role of ChatGPT-4 as a Tool to Assist Human Researchers in Qualitative Coding

Tue, 24 Sep 2024 00:00:00 GMT

Background: Qualitative research is essential for understanding patient experiences and healthcare processes. This study explores the use of ChatGPT-4, an AI language model, to assist human researchers in coding interview transcripts and conducting thematic analysis, aiming to improve the efficiency and accuracy of qualitative research. Methods: The study used existing datasets from two qualitative studies on Overactive Bladder (OAB) and Recurrent Urinary Tract Infections (RUTI). ChatGPT and human researchers independently coded the same interview transcripts. ChatGPT coded each study twice. The identified codes and themes from both rounds were compared with the codes assigned by human researchers through Venn diagrams and qualitative analysis to assess similarity, correctness and comprehensiveness. Results: ChatGPT identified additional themes and codes not recognized by human researchers, suggesting its potential to uncover deeper insights. Refining the prompts in the second round by integrating more prompt engineering principles, along with manually checking the generated codes, improved ChatGPT's performance compared to round one. This shows that ChatGPT’s robustness varies and that users can achieve higher levels of performance by acquiring a deeper understanding of how ChatGPT works. However, ChatGPT's stochastic nature led to variability in outputs, and because ChatGPT is incapable of critical thinking, human oversight is necessary. Conclusion: ChatGPT shows promise as a supplementary tool in qualitative research, capable of identifying significant themes and codes while reducing manual labour. However, human oversight remains crucial for interpreting AI-generated data and ensuring the applicability of findings to clinical practice.

Towards Globally Inclusive Multilingual Dialogue Systems for Real-World Applications

Sun, 28 Jun 2026 14:19:26 GMT

With the advent of large language models (LLMs), dialogue systems have become the primary interface for accessing advances in natural language processing (NLP); yet existing research remains largely English-centric, text-based, and benchmark-driven, limiting both global inclusivity and real-world applications. To move beyond this narrow focus, the thesis broadens the scope of multilingual dialogue system research through the creation of new datasets and evaluation methods. It introduces Multi3WOZ, a large-scale, multi-parallel dataset for task-oriented dialogue in Arabic, English, French, and Turkish. It also presents HEALTHDIAL, the first large-scale, speech-first dataset for health communication, spanning diverse language varieties across Arabic, Chinese, English, and Spanish. HEALTHDIAL modernises task-oriented dialogue system design by replacing traditional parsing-based approaches with a retrieval-augmented generation pipeline that more effectively leverages the capabilities of LLMs. In addition, the thesis proposes the first framework for the quantitative measurement of cross-lingual disparities, capturing both those arising during system development and those intrinsic to LLMs. The conventional dataset--model--benchmarking pipeline has driven much of the progress in dialogue system research, but it remains insufficient for informing real-world applications. To bridge this gap, the thesis extends the pipeline in both directions. Upstream, it applies systematic review methodology in combination with global health frameworks to identify user needs, and map the state of NLP for public health in Africa. Downstream, it develops and releases open-source toolkits for multilingual data collection, system development, deployment, and human evaluation, thereby lowering barriers to real-world applications. 
Beyond its technical contributions, this thesis offers a methodological reflection on how NLP, and dialogue system research in particular, can move beyond benchmarks to generate evidence with real-world relevance. It distils three guiding principles for equitable NLP: research should be evidence-based, grounding decisions in systematic evidence; human-centric, ensuring that development and evaluation reflect the needs and values of the communities served; and context-adaptive, responding to the resources and constraints of diverse linguistic and cultural contexts. Together, these principles outline a framework for developing dialogue systems that are both globally inclusive and socially impactful.

Large language models for biological prediction and design

Wed, 13 Mar 2024 00:00:00 GMT

Predicting the functional impact of changes to biological sequences is a central challenge in genetics and biology. Beyond genetics, sequence-to-function mapping has key applications in the design of sequences for use as molecular tools, catalysts, and biotherapeutics. Fueled by decades of exponential increases in sequencing, experimental data, and computing power, generative modeling has emerged as a leading approach for both mutation effect prediction and protein design. Approaches originating in the natural language processing field such as large language models have shown particular usefulness as sequence models. In this thesis, I build generative models of biological sequences and demonstrate their application to problems in protein design and human genetics. In Chapter 1, I discuss how deep autoregressive models can be applied to predict mutation effects and design sequences that are challenging for alignment-based models, including indels, disordered proteins, and the highly variable complementarity determining regions of antibodies. In Chapter 2, I demonstrate how the combination of protein family-agnostic large language models with family-specific sequence models results in state-of-the-art predictive performance at mutation effect prediction. In Chapter 3, I show how to apply generative models at a proteome and population scale to identify pathogenicity among rare human genetic variants. In Chapter 4, I explore how antibody libraries designed by generative models can be improved with respect to desired features such as diversity and specificity. These results show how sequence models can predict, design, and optimize the functionality of biomolecules.

ADVANCES IN CONVERSATIONAL AND TIME-SENSITIVE QUESTION ANSWERING

Tue, 31 Mar 2026 00:00:00 GMT

As the demand for intelligent dialogue and real-time information retrieval grows, the need for robust question answering (QA) systems is increasingly critical. While large language models have advanced QA capabilities, they continue to face significant challenges in maintaining coherent conversations and adapting to rapidly evolving real-world facts. This thesis investigates methods to overcome these limitations across two key dimensions: conversational QA and time-sensitive QA. For conversational systems, this research explores novel techniques, including cross-lingual transfer and a semi-supervised learning framework, to accurately detect and mitigate dialogue breakdowns. For time-sensitive QA, this work addresses the limitations of static pre-training by introducing a dynamic evaluation benchmark and a reinforcement learning method that effectively balances parametric knowledge with up-to-date retrieved context. Together, these efforts provide effective solutions to enhance the adaptability, reliability, and overall performance of QA systems in dynamic real-world environments.

A WORLD IMAGINED BY AI - GEOSPATIAL DATA QUALITY ENHANCEMENT WITH DEEP GENERATIVE MODEL

Fri, 31 Oct 2025 00:00:00 GMT

High-quality geospatial data is limited by the "Geospatial Data Trilemma"—a persistent trade-off between resolution, scale, and cost. This creates vast data deserts, hindering our ability to address global challenges. This thesis introduces Geospatial Data Translation (GDT), a novel unsupervised generative AI framework that resolves this trilemma. GDT learns to translate between different data types to generate new high-fidelity data without needing perfectly matched datasets. Through three studies, the framework generated accurate building footprints from street networks, enhanced coarse 30m global elevation models to a 2m resolution with 52% lower error than traditional methods, and created a U.S. flood map that identified 11 million people in previously unmapped flood zones. These results validate a pathway for creating high-resolution, broad-scale, and low-cost data. GDT democratizes access to critical environmental information, providing a more complete and equitable digital representation of our world for better global management.

Automated and Flexible Stress-Testing for the Robustification of Large Language Model Systems

Thu, 18 Sep 2025 00:00:00 GMT

Large language models are incredibly powerful but incredibly brittle and unreliable computing objects. The same models that can convincingly generate rap lyrics in the style of Kanye West also struggle to consistently perform basic arithmetic accurately. The same models that can pass the interview bar at Amazon also struggle to ground their responses in facts. These models are a walking contradiction. As these systems become more embedded into our everyday lives, the key question becomes if we can properly trust and robustify these systems. The perspective that we adopt in this thesis is that, in order to prevent these systems from failing in high-stakes settings, we must preemptively discover all the ways in which they can fail. That is, we desire very powerful evaluation, red-teaming, and stress-testing technologies and tooling for language models. To this end, this thesis introduces the project of automated and flexible stress-testing and develops 1) REALM, a comprehensive robustness benchmark and publicly hosted leaderboard with twenty-six hosted models in partnership with the biggest machine learning platform company in the world; 2) a suite of three novel and tailored red-teaming algorithms alongside three case studies of their successful application against in-the-wild LLM use cases; and 3) a multiagent stress-testing framework that universally jailbreaks five state-of-the-art large language models that have been specifically finetuned to prevent jailbreaks. It is our hope that the technology developed here can enhance the conversation around safe, responsible, and secure AI development with regards to practical failure modes and correspondingly the methods to prevent them. In particular, with the technology developed here, all language model researchers, developers, users, businesses, and stakeholders will be able to discover failure cases before they arise in production. This brings us one step closer to the dream of robust language model systems.

Large Language Models for Automated Evaluation of Radiology Reports with Fine-Grained Scoring

Wed, 17 Sep 2025 00:00:00 GMT

The current gold standard for evaluating generated chest x-ray (CXR) reports is through radiologist annotations. However, this process can be extremely time-consuming, especially if there are large numbers of reports to evaluate. In this work, we present a Large Language Model (LLM)-based automated evaluation metric for generated CXR reports called FineRadScore. Given a candidate and a ground-truth report, FineRadScore gives the minimum number of line-by-line corrections required to go from the candidate to the ground-truth report. Additionally, FineRadScore assigns a severity rating for each correction and generates comments regarding why the correction was needed. We demonstrate that FineRadScore is able to generate the corrections in a way that aligns with radiologists and has an understanding of how clinically meaningful each error is. We also demonstrate that, when used to get a sense of the quality of the report as a whole, it aligns with radiologists at a similar level to current state-of-the-art automated CXR evaluation metrics. Finally, we analyze FineRadScore's shortcomings to pave the way for future work.
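To make the phrase "minimum number of line-by-line corrections" concrete, the sketch below counts the fewest line insertions, deletions, and replacements needed to turn one report into another. This is a purely lexical stand-in chosen for illustration (a Levenshtein distance over lines), not the FineRadScore method, which according to the abstract uses an LLM and also produces severity ratings and comments. The function name `min_line_corrections` and the sample report lines are hypothetical.

```python
def min_line_corrections(candidate: str, reference: str) -> int:
    """Count the fewest line-level edits turning candidate into reference.

    Assumption: lines are compared by exact string equality after
    trimming, so two lines that say the same thing in different words
    count as a correction. An LLM-based metric would judge meaning instead.
    """
    cand = [line.strip() for line in candidate.splitlines() if line.strip()]
    ref = [line.strip() for line in reference.splitlines() if line.strip()]

    # prev[j] = edits needed to turn the first i-1 candidate lines
    # into the first j reference lines (classic dynamic programming).
    prev = list(range(len(ref) + 1))
    for i, cand_line in enumerate(cand, start=1):
        curr = [i]  # turning i candidate lines into nothing: i deletions
        for j, ref_line in enumerate(ref, start=1):
            replace_cost = 0 if cand_line == ref_line else 1
            curr.append(min(
                prev[j] + 1,                # delete a candidate line
                curr[j - 1] + 1,            # insert a reference line
                prev[j - 1] + replace_cost  # keep or replace a line
            ))
        prev = curr
    return prev[-1]


# Invented report text, for illustration only.
candidate = """Heart size is normal.
No pleural effusion.
Lungs are clear."""

reference = """Heart size is normal.
Small left pleural effusion.
Lungs are clear.
No pneumothorax."""

print(min_line_corrections(candidate, reference))  # 2
```

Here the two corrections are one replaced line (the effusion finding) and one inserted line (the pneumothorax statement). A raw count like this treats every correction as equally important, which is the gap the per-correction severity rating described in the abstract is meant to address.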

From AI to A+? The Effect of ChatGPT on Learning Performance in Dutch Higher Education

Wed, 01 Jan 2025 00:00:00 GMT

Background: ChatGPT is used by the majority of students in Dutch higher education, but research into the effects of ChatGPT use on the academic performance of students in the Netherlands is lacking. This study therefore examined to what extent ChatGPT use while studying influences the learning gain of bachelor's psychology students in the Netherlands. Method: In an online between-subjects experiment, first-year bachelor's psychology students at the Rijksuniversiteit Groningen (n = 153) took a pre-test to establish their prior knowledge of Eye Movement Desensitisation Reprocessing (EMDR). They then all read the same text about EMDR. Participants were randomly assigned to the experimental condition (ChatGPT use permitted while studying) or the control condition (ChatGPT use not permitted while studying). All participants then took the same post-test to assess their knowledge of the text and indicated whether and how they had used ChatGPT. The outcome variable, learning gain, was operationalised as the difference between the grades on the two tests. Results: Only a small, not statistically significant difference in learning gain was found between the two conditions. The difference in learning gain between participants who used ChatGPT for summarising and participants who used ChatGPT for explanation was likewise small and not statistically significant. Conclusion: These findings suggest that ChatGPT use, regardless of the specific form of use, has little or no effect on the learning performance of bachelor's psychology students in the Netherlands. Given the limitations of this study and the scarcity of research in the Dutch context, follow-up research into this relationship is needed.

BRIDGING GRAPH-STRUCTURED DATA AND LARGE LANGUAGE MODELS

Tue, 31 Mar 2026 00:00:00 GMT

This thesis proposes a unified framework that integrates graph-structured data with Large Language Models (LLMs), combining the structural learning strengths of Graph Neural Networks (GNNs) with the semantic and reasoning capabilities of LLMs. We introduce a three-stage pipeline for bidirectional graph–language integration. First, TAPE enriches graph nodes and edges with LLM-generated textual semantics to provide semantic priors for graph learning. Second, GraphViT decomposes graphs into informative substructures and represents them as sequences compatible with Transformer architectures. Third, G-Retriever retrieves task-relevant subgraphs based on natural language queries, encodes them with GNNs, and injects the resulting representations into LLMs as soft prompts for structure-aware generation. Together, this framework enables scalable, task-agnostic integration of graph and language modalities, improving both graph-based learning and LLM reasoning over structured data.

SYNthia: An Interface Concept for Writing With Large Language Models

Thu, 18 Sep 2025 00:00:00 GMT

Artificial intelligence (AI)-infused systems can offer valuable assistance to writers, but they may also produce imperfect or unsatisfactory suggestions that require efficient correction. Word choice presents a challenge for writers that can be addressed by several tools, but these systems typically require users to switch browser tabs or tools and break their flow of thinking, or otherwise fail to incorporate the context associated with users' writing or their intentions, leaving them with subpar or unrelated suggestions. We present SYNthia, a word-suggestion interface that allows users to be directly involved in the suggestion generation process by providing natural language feedback. We performed two pilot qualitative studies, finding that SYNthia provided users with a more practical interface that (1) allowed them to receive their target word more efficiently, (2) eliminated the need for users to switch contexts (e.g. switching tabs or devices), and (3) improved users' perceived quality of writing. In addition, we performed a formal user study comparing how novice and expert writers interact with SYNthia differently, ultimately concluding that the writing level had no quantitatively significant impact on interactions with the tool, raising more questions for further study. However, the qualitative study surfaced several interesting observations regarding how writers interact with an AI-powered thesaurus, making progress towards the greater goal of integrating AI in the writing process while maintaining human agency and ownership. All code for this project can be found at the GitHub repository: https://github.com/AEst2002/word-suggester/tree/thesis.

Exploring the potential for generative AI to facilitate medical students’ use of evidence-based learning strategies

Fri, 08 May 2026 00:00:00 GMT

Objective Effective learning strategies, such as retrieval practice, spacing and interleaved practice, support meaningful knowledge construction and long-term retention. However, students often do not use these strategies because of misconceptions, perceived time and effort. This study aimed to explore whether and how an AI review module for a pathology course, integrated in a learning management system, influences third-year preclinical students’ learning strategies at Tsinghua University. Design This study used a sequential explanatory mixed-methods design. For the quantitative strand, pre-course and post-course questionnaires assessed students’ self-reported learning strategies and voluntary use of the AI review module. Based on the usage log data, students were categorized into high- and low-usage groups for quantitative and qualitative analysis. Non-parametric statistical analyses were conducted when appropriate. For the qualitative strand, interviewees were selected using maximum variation sampling based on their AI review module usage. Semi-structured one-on-one interviews were conducted and analyzed using a qualitative descriptive approach. Results Fifty students completed the pre-course questionnaire and took the final exam. The most frequently used learning strategies were non-evidence-based approaches. Thirty-seven (74%) students used the AI review module and collectively answered 1804 questions. Two-sample Wilcoxon rank-sum tests revealed no statistically significant associations between AI module usage and changes in learning strategies or pathology final exam scores. In the interviews, a few students recognized and intentionally applied the evidence-based learning strategies embedded in the AI module, whereas some other students still used this AI tool for rote memorization and cramming before exams, which was inconsistent with evidence-based learning strategies.
Conclusion The AI-based review module, although designed according to evidence-based learning strategies, did not increase the use of evidence-based strategies or improve exam scores at the class level. Our findings suggest that, as AI tools become more widespread in medical education, explicit attention to learning sciences and learning strategies is essential, because students are likely to use AI according to their pre-existing study habits, and AI may magnify the effects, either beneficial or detrimental, of those habits.

Debugging and Help-seeking with Chatbots in CS1

Wed, 14 Jan 2026 00:00:00 GMT

For many beginner programmers, encountering errors in code can be frustrating and disheartening, leading some to question their belonging in computer science (CS). In these moments, timely debugging help is essential to sustain motivation and foster learning. While students have traditionally turned to peers or teaching assistants for guidance, many now seek debugging support from conversational Large Language Models (LLMs). These chatbots offer promise in providing immediate help, but their ability to generate full-code solutions raises concerns about learning and over-reliance. As these tools become more prevalent, it is important to understand how they can be used to support students in their debugging and how students seek help with chatbots. This dissertation explores how students interact with chatbots in introductory computer science courses (CS1) and opportunities to support debugging. The research is presented in a three-paper format. The first paper examines past debugging interventions before the rise of LLMs, identifying gaps that these tools could potentially address. The second paper presents findings from student interviews about their experiences using a course-integrated chatbot, highlighting how they engage with the debugging assistance throughout the semester and their evolving beliefs about appropriate chatbot use. The third paper analyzes naturalistic chat data and survey responses in another CS1 course to investigate how students' goal-orientation and beliefs associate with their help-seeking behaviors. The findings from this dissertation offer insights into designing course chatbots and instructional framing around chatbot use to support students' debugging and learning.

Prompt engineering's influence on result reliability in large language models

Tue, 09 Apr 2024 00:00:00 GMT

LAUREA MAGISTRALE

Can Large Language Models Make Reading a Book More Engaging?

Thu, 18 Sep 2025 00:00:00 GMT

American literacy is struggling, with only 31% of eighth graders reading at or above grade level. This problem is largely one of engagement: students struggle with reading primarily because they don't read enough to develop proficiency. Poor reading skills lead to avoidance of reading, creating a negative cycle which further diminishes literacy development. This thesis contributes two novel LLM-based interventions designed to increase student engagement with assigned texts: LLM-Clarifications, which provide just-in-time support when students encounter obstacles while reading, facilitated by a sentence-by-sentence reading mechanism that tracks students' progress and discourages skimming, and LLM-Debates, which allow students to argue with chatbots about characters or themes after reading. Testing with 63 high school students showed that students equipped with these interventions spent 70% more time engaged with their assigned book compared to the control group and thoroughly read 43% more chapters. However, comprehension quiz scores increased only by 2.6%. I found that this discrepancy occurs because LLM interventions excelled at sustaining engagement for students who had already begun reading, but not at initiating engagement for students predisposed to skip reading entirely. In both groups, students skipped approximately half of the assigned chapters, with only one student out of 63 reading all chapters without skimming. These findings demonstrate that the novel LLM interventions successfully solve half of the reading engagement problem: sustaining and deepening engagement once students begin reading. The 70% increase in engagement time and 43% increase in thoroughly-read chapters represent a promising approach to addressing the literacy crisis by keeping students engaged with texts. 
Furthermore, contrary to some educators' concerns that AI primarily enables shortcuts in education, these results suggest that thoughtfully designed LLM interventions can actually deepen student engagement with learning materials rather than diminish it—providing a foundation for reimagining LLMs as tools that maximize, rather than minimize, students' learning and growth.