The annual SLT CDT conference takes place on Monday 23 June 2025
Venue: Lecture Theatre 1, Diamond Building
Note: all research talks will be delivered in person.
Authors: Thomas Pickard (University of Sheffield), Aline Villavicencio (University of Exeter), Maggie Mi (University of Sheffield), Wei He (University of Exeter), Dylan Phelps (University of Sheffield), Marco Idiart (Federal University of Rio Grande do Sul)
Abstract: AdMIRe: Advancing Multimodal Idiomaticity Representation is a shared task at SemEval-2025, combining text and images to evaluate language models' processing of idiomatic language. This talk will present the task itself and the results obtained from both participating systems and human annotators. We will also discuss the practicalities of organising and running a challenge like this and our 'lessons learned' for anyone who might consider organising one in the future.
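To make the task format concrete, the sketch below shows the kind of simple baseline a participating system might start from: ranking candidate images against a context sentence containing a potentially idiomatic expression, using an off-the-shelf CLIP model. The model name (openai/clip-vit-base-patch32), function name, and use of image paths are illustrative assumptions, not the organisers' baseline.

```python
# Illustrative (not the official AdMIRe baseline): rank candidate images for a
# context sentence containing a potentially idiomatic expression using CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_images(context_sentence: str, image_paths: list[str]) -> list[str]:
    """Return image paths sorted by similarity to the sentence (best first)."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[context_sentence], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text has shape (1, n_images): similarity of the sentence to each image
    scores = outputs.logits_per_text.squeeze(0)
    order = torch.argsort(scores, descending=True)
    return [image_paths[i] for i in order.tolist()]
```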
Authors: Xiaozhou Tan (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: I will give an introduction to the application of diffusion models in speech synthesis, and to what can be done to explore diffusion-like models (models that iteratively refine their output) for speech synthesis.
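As a concrete illustration of the "iteratively refine the output" idea, here is a minimal DDPM-style sampling loop for a mel-spectrogram. The noise-prediction network `eps_model` and the linear beta schedule are assumptions for illustration only; this is not a specific system from the talk.

```python
# Minimal sketch of DDPM-style iterative refinement of a mel-spectrogram.
# `eps_model` (a trained noise-prediction network) and the linear beta schedule
# are illustrative assumptions, not a particular system from the talk.
import torch

def sample_mel(eps_model, shape, num_steps=1000, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, num_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)            # start from pure noise
    for t in reversed(range(num_steps)):
        eps = eps_model(x, torch.tensor([t], device=device))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise      # one refinement step
    return x                                         # denoised mel-spectrogram
```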
Authors: Jason Chan (The University of Sheffield), Robert Gaizauskas (The University of Sheffield), Zhixue Zhao (The University of Sheffield)
Abstract: Formal logic enables computers to reason in natural language by representing sentences in symbolic forms and applying rules to derive conclusions. However, in what our study characterises as "rulebreaker" scenarios, this method can lead to conclusions that are typically not inferred or accepted by humans given their common sense and factual knowledge. Inspired by work in cognitive science, we create RULEBREAKERS, the first dataset for rigorously evaluating the ability of large language models (LLMs) to recognise and respond to rulebreakers (versus non-rulebreakers) in a human-like manner. Evaluating seven LLMs, we find that most models, including GPT-4o, achieve mediocre accuracy on RULEBREAKERS and exhibit some tendency to over-rigidly apply logical rules, unlike what is expected of typical human reasoners. Further analysis suggests that this apparent failure is potentially associated with the models' poor utilisation of their world knowledge and their attention distribution patterns. Whilst revealing a limitation of current LLMs, our study also provides a timely counterbalance to a growing body of recent works that propose methods relying on formal logic to improve LLMs' general reasoning capabilities, highlighting their risk of further increasing divergence between LLMs and human-like reasoning.
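For readers curious how such an evaluation might be set up, the sketch below scores a model separately on rulebreaker and non-rulebreaker items, taking the typical human judgement as the gold label. The item fields, prompt wording, and `ask_model` callable are hypothetical, not the paper's actual protocol or dataset schema.

```python
# Illustrative evaluation loop (not the paper's exact protocol or prompts):
# compare model accuracy on rulebreaker vs non-rulebreaker items, where the
# gold label reflects the typical human judgement rather than formal validity.
from collections import defaultdict

def evaluate(ask_model, items):
    """items: dicts with 'premises', 'conclusion', 'is_rulebreaker', and
    'human_label' ('follows' / 'does not follow'). `ask_model` is any callable
    mapping a prompt string to one of those two labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        prompt = (f"Premises: {item['premises']}\n"
                  f"Conclusion: {item['conclusion']}\n"
                  "Does the conclusion follow? Answer 'follows' or 'does not follow'.")
        answer = ask_model(prompt).strip().lower()
        group = "rulebreaker" if item["is_rulebreaker"] else "non-rulebreaker"
        correct[group] += int(answer == item["human_label"])
        total[group] += 1
    return {g: correct[g] / total[g] for g in total}
```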
Authors: Jinzuomu Zhong (University of Edinburgh), Korin Richmond (University of Edinburgh), Suyuan Liu (University of British Columbia), Dan Wells (University of Edinburgh), Zhiba Su (Independent Researcher), Siqi Sun (University of Edinburgh)
Abstract: While recent Zero-Shot Text-to-Speech (ZS-TTS) models achieve high naturalness and speaker similarity, they fall short in accent fidelity and control - generating hallucinated accents that diverge from the input speech prompt. To address this, we introduce zero-shot accent generation, a new task aimed at synthesising speech with any target content, speaker, and accent. We present AccentBox, the first system capable of this task via a two-stage pipeline. In the first stage, we propose GenAID, a novel Accent Identification model that learns speaker-agnostic accent embeddings, achieving a 0.16 F1-score improvement on unseen speakers. In the second stage, a ZS-TTS model is conditioned on these embeddings, achieving 57.4-70.0% listener preference for accent fidelity compared to strong baselines. We also advance evaluation methodologies for accent generation. Subjectively, we improve listener guidance with transcriptions and accent difference highlighting, with rigorous listener screening. Objectively, we propose pronunciation-sensitive metrics using vowel formant and phonetic posteriorgram distances, providing more reliable evaluation for underrepresented accents. Looking forward, we aim to expand AccentBox's capabilities to more accents via pseudo-labelling of in-the-wild data, and improve accent fidelity via formant-guided generation - moving toward fairer and more inclusive speech synthesis for all accents.
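The sketch below illustrates the general two-stage conditioning idea: a frozen accent identification encoder (a stand-in for GenAID) supplies a speaker-agnostic accent embedding that is combined with the speaker embedding before being passed to the ZS-TTS decoder. Module names and dimensions are illustrative assumptions, not the AccentBox implementation.

```python
# Minimal sketch of accent conditioning: a frozen accent ID encoder provides a
# speaker-agnostic accent embedding, concatenated with the speaker embedding
# and projected for the ZS-TTS decoder. Names and dimensions are illustrative.
import torch
import torch.nn as nn

class AccentConditioner(nn.Module):
    def __init__(self, accent_encoder: nn.Module, spk_dim=256, acc_dim=256, out_dim=512):
        super().__init__()
        self.accent_encoder = accent_encoder.eval()   # frozen accent ID model
        for p in self.accent_encoder.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(spk_dim + acc_dim, out_dim)

    def forward(self, speaker_emb, accent_prompt_wav):
        with torch.no_grad():
            accent_emb = self.accent_encoder(accent_prompt_wav)
        # conditioning vector fed to the (not shown) ZS-TTS decoder
        return self.proj(torch.cat([speaker_emb, accent_emb], dim=-1))
```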
Authors: Hend ElGhazaly (University of Sheffield), Bahman Mirheidari (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield), Heidi Christensen (University of Sheffield)
Abstract: Ensuring fairness in Automatic Speech Recognition (ASR) models requires not only reducing biases but also making sure that fairness improvements generalize beyond the training domain. This challenge is particularly relevant for pre-trained models, which have already been trained on large-scale data and may overfit quickly during fine-tuning. In this work, we investigate contrastive learning as a fairness intervention, introducing a contrastive loss term alongside the standard cross-entropy loss to promote gender-invariant speech representations. Our results show that fairness-aware fine-tuning is highly dependent on training data diversity, with contrastive learning proving effective only when applied to diverse and representative datasets. Simply increasing training data without explicitly enforcing fairness does not ensure bias mitigation. Our findings highlight the need for fairness-aware dataset selection and evaluation beyond in-domain settings to build robust and equitable ASR systems.
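As a rough sketch of how a contrastive term can sit alongside the standard ASR loss, the code below adds an InfoNCE-style loss over pooled encoder representations, treating matched content spoken by speakers of different genders as positive pairs. The pairing scheme, weighting, and function names are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative sketch (not the authors' exact formulation): combine the standard
# cross-entropy ASR loss with an InfoNCE-style contrastive term that pulls
# together pooled encoder representations of matched content spoken by
# different-gender speakers, encouraging gender-invariant representations.
import torch
import torch.nn.functional as F

def fairness_aware_loss(ce_loss, anchor_repr, positive_repr, negative_reprs,
                        temperature=0.1, lam=0.5):
    """anchor/positive: (B, D) pooled representations of matched content from
    speakers of different genders; negative_reprs: (B, K, D) mismatched content."""
    anchor = F.normalize(anchor_repr, dim=-1)
    positive = F.normalize(positive_repr, dim=-1)
    negatives = F.normalize(negative_reprs, dim=-1)

    pos_sim = (anchor * positive).sum(-1, keepdim=True) / temperature      # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives) / temperature  # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    contrastive = F.cross_entropy(logits, targets)   # positive sits at index 0
    return ce_loss + lam * contrastive
```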
Authors: Shaun Cassini (University of Sheffield), Thomas Hain (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Emphasis plays a key role in spoken communication, conveying intent, emotion, and information structure. It is also a useful attribute for a range of speech technology tasks, such as intent prediction, emotion recognition, and punctuation recovery. Self-supervised speech models (S3Ms) learn general-purpose representations of speech, enabling broad transfer to downstream tasks. However, it remains unclear to what extent S3Ms encode emphasis. Existing studies typically detect only acoustic correlates of emphasis, or fine-tune a single model on an emphasis classification task. In this work, we address three open questions: 1) How is emphasis represented across speech foundation models? 2) How can its presence be quantified? 3) Is emphasis information removed, preserved, or enhanced through downstream fine-tuning? We propose a novel, non-parametric, unitless distance measure for quantifying emphasis encoding, and apply it to a diverse set of S3Ms. Our findings show that emphasis is clearly reflected in model representations, and becomes more accessible after fine-tuning on downstream tasks such as automatic speech recognition.
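Since the proposed measure itself is novel, the sketch below shows only an illustrative non-parametric, unitless stand-in: leave-one-out 1-nearest-neighbour accuracy between emphasised and neutral word-level representations, where 0.5 is chance and 1.0 is full separability. It is not the measure introduced in the talk.

```python
# Illustrative stand-in (not the measure proposed in the talk): a non-parametric,
# unitless score for how separably a layer's representations encode emphasis,
# computed as leave-one-out 1-nearest-neighbour accuracy over word-level
# embeddings labelled emphasised (1) or neutral (0).
import torch

def emphasis_separability(reprs: torch.Tensor, labels: torch.Tensor) -> float:
    """reprs: (N, D) word-level representations; labels: (N,) in {0, 1}."""
    dists = torch.cdist(reprs, reprs)                # pairwise distances
    dists.fill_diagonal_(float("inf"))               # exclude self-matches
    nearest = dists.argmin(dim=1)                    # leave-one-out 1-NN
    return (labels[nearest] == labels).float().mean().item()
```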