Venue: Workroom 2, Diamond Building. Note: all poster presentations will be delivered in person.
A number of submissions are also under review at international conferences. As a result, their titles and abstracts have been redacted to conform with the requirements of double-blind review. The titles and abstracts will be published shortly before the SLT CDT Annual Conference.
Authors: Favour Yahdii Aghaebe (University of Sheffield), Tanefa Apekey (University of Sheffield), Elizabeth Williams (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)
Abstract: Opinion and multi-document summarisation often involve genuinely conflicting viewpoints, yet many existing approaches, particularly LLM-based systems, implicitly smooth disagreement and over-represent majority opinions. This limits the faithfulness of generated summaries in opinion-heavy settings. We introduce a disagreement-aware synthesis pipeline that separates belief-level aggregation from language generation. Documents are first represented as structured belief sets and aggregated using distance-based belief merging operators that explicitly model conflict. Large language models are then used only to realise the aggregated beliefs as natural language summaries. We evaluate the approach across multiple model families and scales, comparing it to methods that perform explicit aggregation during generation. Our results show that while sufficiently large models can match belief-level aggregation when aggregation is handled at generation time, this behaviour is not stable across architectures or capacities. In contrast, belief-level aggregation combined with simple prompting yields consistently strong disagreement-aware performance across models, while maintaining fluent and grounded summaries.
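To make the belief-level aggregation step concrete, the sketch below implements one classic distance-based merging operator (summed Hamming distance over propositional stances). The propositions, the toy documents, and the choice of operator are assumptions introduced purely for illustration; this is not the authors' pipeline.

```python
# Illustrative sketch only: a minimal distance-based belief merging step.
# The propositions, documents, and the Hamming/sum operator are assumptions.
from itertools import product

def sigma_merge(belief_sets, propositions):
    """Return interpretations minimising the summed Hamming distance
    to every source belief set (classic distance-based merging)."""
    def hamming(interp, beliefs):
        # distance = number of propositions on which the two disagree
        return sum(interp[p] != beliefs[p] for p in beliefs)

    candidates = [dict(zip(propositions, values))
                  for values in product([True, False], repeat=len(propositions))]
    scored = [(sum(hamming(c, b) for b in belief_sets), c) for c in candidates]
    best = min(score for score, _ in scored)
    return [c for score, c in scored if score == best]

# Three "documents" expressing conflicting stances on two claims.
docs = [{"service_good": True,  "price_fair": True},
        {"service_good": True,  "price_fair": False},
        {"service_good": False, "price_fair": False}]
merged = sigma_merge(docs, ["service_good", "price_fair"])
print(merged)  # aggregated stances an LLM could then verbalise as a summary
```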
Authors: Christopher Bartley (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Redacted
Authors: Jason Chan (University of Sheffield), Zhixue Zhao (University of Sheffield), Robert Gaizauskas (University of Sheffield)
Abstract: Redacted
Authors: Joseph James (University of Sheffield), Chenghao Xiao (Durham University), Yucheng Li (University of Surrey), Nafise Sadat Moosavi (University of Sheffield), Chenghua Lin (University of Manchester)
Abstract: Redacted
Authors: Hezhao Zhang (School of Computer Science, University of Sheffield, United Kingdom), Huang-Cheng Chou (Signal Analysis and Interpretation Laboratory (SAIL), Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089, USA), Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089, USA), Thomas Hain (Department of Computer Science, University of Sheffield, United Kingdom)
Abstract: Redacted
Authors: Jo-Ku Cheng (University of Sheffield), Nikos Aletras (University of Sheffield), Marco Valentino (University of Sheffield)
Abstract: Redacted
Authors: Boxuan Shan (University of Sheffield), Adrián Barahona-Ríos (Sony Interactive Entertainment), Anton Ragni (University of Sheffield)
Abstract: Expressive speech can be influenced by various paralinguistic aspects, including sentiment, emotion, speaker identity, and style. These aspects are widely adopted as control signals in controllable TTS systems, but their effectiveness is not well understood. We therefore present an analysis of how these aspects affect speech, focusing on prosody as a fundamental component of expressive speech. Statistical tests show that emotion, style, and speaker identity produce clear prosodic differences, whereas sentiment yields significant but weaker effects, revealing a challenge for weak control aspects. We introduce a contrastive learning method to encourage the model to respond better to paralinguistic controls. Finally, we present a distributional visualisation that gives further insight into the effectiveness of contrastive learning. Our results highlight the difficulty of modelling and controlling weak paralinguistic aspects and provide insights for future controllable TTS research.
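To illustrate the kind of objective the abstract refers to, here is a minimal supervised contrastive loss over a control attribute such as emotion: utterances sharing the attribute are pulled together, others pushed apart. The loss form, shapes, and temperature are assumptions, not the authors' method.

```python
# Hedged sketch: a generic supervised contrastive loss over control-attribute
# labels (e.g. emotion). Not the paper's loss; values are illustrative.
import torch
import torch.nn.functional as F

def control_contrastive_loss(embeddings, labels, temperature=0.1):
    """embeddings: (N, D) prosody/control embeddings; labels: (N,) attribute ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                      # pairwise cosine similarities
    mask_self = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(mask_self, float("-inf"))  # exclude self-similarity
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~mask_self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average log-probability of positives (same-attribute pairs) per anchor
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()

z = torch.randn(8, 64)                      # e.g. utterance-level embeddings
y = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])  # e.g. emotion labels
print(control_contrastive_loss(z, y))
```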
Authors: Ian Kennedy (University of Sheffield), Nafise S. Moosavi (University of Sheffield)
Abstract: Redacted
Authors: Wing-Zin Leung (University of Sheffield), Heidi Christensen (University of Sheffield), Stefan Goetze (University of Sheffield)
Abstract: Dysarthria is a type of motor speech disorder that reflects abnormalities in the motor movements required for speech production. In clinical practice, identifying characteristic signs and symptoms of the neuropathophysiology underlying a dysarthria is vital for diagnosis and management. The gold standard for dysarthria assessment is auditory-perceptual evaluation by a speech and language therapist for differential diagnosis and management decisions. As the process is time-consuming for clinicians, there is growing interest in automatic dysarthria assessment (ADA). Recent approaches to ADA primarily focus on the classification of broad intelligibility or speech severity labels. However, this has limited clinical utility, and the assessment of communication-relevant parameters does not distinguish between dysarthria types and pathomechanisms. Studies on the classification of dysarthria function or clinical test protocol scores focusing on aspects of dysarthric speech production (such as the Frenchay dysarthria assessment (FDA)) are limited. Therefore, this paper focuses on the preliminary steps towards clinically interpretable ADA, including automatic FDA assessment. The phoneme posteriorgram (PPG) is a time-varying categorical distribution over acoustic speech units, and recent work demonstrates its use as an interpretable pronunciation distance for downstream tasks, e.g. pronunciation reconstruction. This work extends recent advances in posterior-based phoneme research and mispronunciation models to dysarthria assessment, exploring the extent to which dysarthric speech features in the FDA (identified by auditory-perceptual evaluation in clinical practice) are captured by PPG information. To achieve this, FDA aspects are systematically evaluated. The results show that interpretable PPG probabilities can capture dysarthric speech features that are related to motor system dysfunction.
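A minimal sketch of how a phoneme posteriorgram can yield an interpretable pronunciation distance: frame-wise symmetric KL divergence combined with DTW alignment between two PPGs. The distance measure, the alignment, and the toy matrices are assumptions used only to make the idea concrete, not the paper's assessment pipeline.

```python
# Illustrative sketch, not the paper's pipeline: compare two phoneme
# posteriorgrams (time x phonemes probability matrices) with DTW over a
# frame-wise distance, giving an interpretable pronunciation distance.
import numpy as np

def frame_dist(p, q, eps=1e-8):
    # symmetric KL divergence between two phoneme posterior frames
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q) + q * np.log(q / p)))

def ppg_distance(ppg_a, ppg_b):
    """DTW alignment cost between two PPGs of shape (T, num_phonemes)."""
    T1, T2 = len(ppg_a), len(ppg_b)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = frame_dist(ppg_a[i - 1], ppg_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2] / (T1 + T2)   # length-normalised alignment cost

# Toy PPGs over 4 phoneme classes (rows sum to 1).
a = np.array([[0.7, 0.1, 0.1, 0.1], [0.1, 0.8, 0.05, 0.05]])
b = np.array([[0.6, 0.2, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]])
print(ppg_distance(a, b))
```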
Authors: Jasivan Sivakumar (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)
Abstract: This paper presents a rigorous mechanistic analysis of the failure modes in Large Language Model (LLM) numerical reasoning, focusing on the persistent "distraction" caused by prompt-induced biases. Despite attempting multiple mitigation strategies, including weighted rank predictions and bias-reduced decoding, we find that numerical anchors and verbal cues in the prompt exert a "gravitational pull" that is remarkably difficult to override during the decoding stage. We evaluate these dynamics by comparing masked versus unmasked inputs and analysing how prediction confidence often correlates more strongly with superficial prompt patterns than with mathematical accuracy. Our layer-wise diagnostic reveals why these interventions struggle: while MLP and Transformer probes confirm that correct mathematical information exists within the hidden states, this knowledge is frequently "buried" by a dominant rank bias that persists across the majority of the architecture. We detail our findings on rank analysis and early-exit performance, illustrating that the model's internal commitment to a biased answer often occurs early and stabilises regardless of contextual counter-evidence. By documenting these unsuccessful attempts to redirect the model's logic, we provide a detailed map of the structural barriers to fair numerical reasoning and offer critical insights into why standard decoding interventions remain insufficient.
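The layer-wise diagnostic described is broadly in the spirit of a logit-lens analysis; the sketch below projects each layer's hidden state through the output head and tracks the rank of a target answer token across layers. GPT-2, the prompt, and the target token are stand-ins for illustration only; the paper's actual probes are not reproduced here.

```python
# Hedged logit-lens style sketch (not the authors' probes): unembed each
# layer's hidden state and trace the rank of an assumed target answer token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

prompt = "12 + 30 ="
target_id = tok.encode(" 42")[0]          # token whose rank we trace
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

for layer, h in enumerate(out.hidden_states):
    # apply the final layer norm and unembedding to the last position
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    rank = (logits.argsort(descending=True)[0] == target_id).nonzero().item()
    print(f"layer {layer:2d}: rank of target token = {rank}")
```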
Authors: Robert Flynn (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Automatic speech recognition (ASR) models are normally trained to operate over single utterances with a short duration of less than 30 seconds. This choice has been made in part due to computational constraints, but it also reflects a common, and often inaccurate, modelling assumption that treats utterances as independent and identically distributed samples. To work with such systems, long-format audio recordings must first be segmented into short utterances that are processed independently. In this work, we show that due to recent algorithmic and hardware advances this is no longer necessary, and that current attention-based approaches can be used to train ASR systems that operate on sequences of over an hour in length. To gain a better understanding of the relationship between the training/evaluation sequence length and performance, we train ASR models on large-scale data using 10 different sequence lengths, from 10 seconds up to 1 hour. By modifying various architectural components, we find that the method of encoding positional information and the model's width/depth are important factors when working with long sequences. Finally, a series of evaluations using synthetic data are constructed to help analyse the model's use of context.
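One example of the positional-encoding choices the abstract refers to is a relative scheme such as rotary position embeddings, sketched below. Whether this is the configuration used in the paper is not stated here, so treat the code, frame rate, and dimensions as assumptions.

```python
# Hedged sketch of rotary position embeddings (RoPE), one relative-position
# scheme relevant to long-sequence modelling; shapes are illustrative.
import torch

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # rotate each (x1, x2) pair by a position-dependent angle; the positional
    # component of a rotated query-key dot product depends only on the offset
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = rope(torch.randn(180_000, 64))   # ~1 h of 50 Hz encoder frames (assumed rate)
print(q.shape)
```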
Authors: Paul Gering (University of Sheffield), Roger K Moore (University of Sheffield)
Abstract: Redacted
Authors: Constantinos Karouzos (University of Sheffield), Xingwei Tan (University of Sheffield), Nikolaos Aletras (University of Sheffield)
Abstract: Redacted
Authors: Danae Sanchez Villegas (University of Copenhagen), Samuel Lewis-Lim (University of Sheffield), Nikolaos Aletras (University of Sheffield), Desmond Elliott (University of Copenhagen)
Abstract: Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two model families. We track confidence over Chain-of-Thought (CoT), measure reasoning's corrective effect, and evaluate intermediate reasoning steps. We find that models are prone to answer inertia, where early predictions are reinforced rather than revised. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient. Although this influence can appear in the CoT, its detectability varies across models and depends on what is monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer CoTs can still appear visually grounded while following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. These findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.
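A rough sketch of one way to track answer confidence across chain-of-thought steps: re-score the answer options after each partial trace. A text-only causal LM stands in for a VLM here, and the question, reasoning steps, and options are invented for illustration; this is not the paper's evaluation code.

```python
# Illustrative sketch (assumed setup): measure confidence in each answer
# option after every chain-of-thought step to look for "answer inertia".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

question = "Q: Is the traffic light in the image red or green?\n"
cot_steps = ["The image shows an intersection.",
             "The top lamp of the signal is lit.",
             "The lit lamp appears red."]
options = [" red", " green"]

for k in range(len(cot_steps) + 1):
    prefix = question + " ".join(cot_steps[:k]) + "\nAnswer:"
    ids = tok(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = model(ids).logits[0, -1].softmax(dim=-1)
    conf = {o: probs[tok.encode(o)[0]].item() for o in options}
    print(f"after step {k}: {conf}")   # flat curves suggest answer inertia
```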
Authors: Jack Cox (University of Sheffield), Jon Barker (University of Sheffield)
Abstract: Speech foundation models, pre-trained on large corpora of unlabelled speech data, produce general-purpose representations which are useful across tasks. However, these representations encode information about salient speech variables in a distributed manner, while downstream speech tasks rely on only some of this variability. In this work, we propose a post-training refinement approach using interventional contrastive learning. By leveraging an interventional dataset and multi-part contrastive loss, we learn a transformation from the entangled representation space of speech foundation models into separate content and speaker subspaces. We evaluate the learnt representations on speaker verification and keyword spotting tasks, showing improved out-of-domain speaker verification performance and evidence that speaker and content information are separated across the learned subspaces.
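To make the subspace idea concrete, the sketch below maps foundation-model features through a single linear layer whose output is split into speaker and content blocks, with an in-batch contrastive term tying original and intervened utterances together in the content subspace. Dimensions, the loss form, and the intervention pairing are assumptions, not the authors' architecture.

```python
# Minimal sketch (assumed design): project entangled features into separate
# speaker and content subspaces, trained with contrastive terms.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubspaceProjector(nn.Module):
    def __init__(self, in_dim=768, spk_dim=128, content_dim=128):
        super().__init__()
        self.proj = nn.Linear(in_dim, spk_dim + content_dim)
        self.spk_dim = spk_dim

    def forward(self, x):
        z = self.proj(x)
        # first block carries speaker information, second block carries content
        return z[:, :self.spk_dim], z[:, self.spk_dim:]

def nt_xent(anchor, positive, temperature=0.1):
    """Contrastive term: each anchor should match its own positive in the batch."""
    a, p = F.normalize(anchor, dim=1), F.normalize(positive, dim=1)
    logits = a @ p.T / temperature
    targets = torch.arange(len(a))
    return F.cross_entropy(logits, targets)

proj = SubspaceProjector()
x_orig = torch.randn(16, 768)        # original utterances
x_intervened = torch.randn(16, 768)  # same content, speaker changed (intervention)
spk_o, con_o = proj(x_orig)
spk_i, con_i = proj(x_intervened)
# content subspace should be invariant to the speaker intervention;
# a corresponding speaker term would use same-speaker pairs (omitted here).
loss_content = nt_xent(con_o, con_i)
loss_content.backward()
```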
Authors: Yanyi Pu (University of Sheffield), Damian Gonzalez-Salzberg (University of Birmingham), Yuan Zheng (University of Sheffield), Nikos Aletras (University of Sheffield)
Abstract: Redacted
Authors: Minghui Zhao (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Redacted
Authors: Anthony Hughes (University of Sheffield), Alex Goldberg (Carnegie Mellon University), Prince Jha (MBZUAI), Nikos Aletras (University of Sheffield), Niloofar Mireshghallah (Carnegie Mellon University)
Abstract: Redacted
Authors: Fritz Peters (University of Sheffield), Madhurananda Pahar (University of Sheffield), Dorota Braun (University of Sheffield), Caitlin Illingworth (University of Sheffield), Daniel Blackburn (University of Sheffield), Heidi Christensen (University of Sheffield)
Abstract: Redacted
Authors: Michael Whealing (SLT CDT Affiliate), Thomas Hain (Speech and Hearing Research Group), Rob Gaizauskas (Natural Language Processing Research Group)
Abstract: Redacted
Authors: Valeria Pastorino (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)
Abstract: Redacted
Authors: Maggie Mi (University of Sheffield), Golzar Atefi (Berliner Hochschule für Technik), Atsuki Yamaguchi (University of Sheffield), Felix Gers (Berliner Hochschule für Technik), Aline Villavicencio (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)
Abstract: Redacted
Authors: Gerardo Roa-Dabike (University of Sheffield), Jon P. Barker (University of Sheffield), Michael A. Akeroyd (University of Nottingham), Scott Bannister (University of Leeds | University of Manchester), Trevor J. Cox (University of Salford), Bruno Fazenda (University of Salford), Jennifer Firth (University of Nottingham), Simone Graetzer (University of Salford), Alinka Greasley (University of Leeds), Rebecca R. Vos (University of Salford) and William M. Whitmer (University of Nottingham)
Abstract: Understanding the lyrics in music is key to music enjoyment. However, people with hearing loss can have difficulty hearing lyrics clearly and effortlessly. In speech technology, metrics that automatically evaluate intelligibility have driven improvements in speech enhancement, and we wanted to do the same for music with lyrics. To address this gap, we presented the lyric intelligibility challenge. A new dataset, CLIP1, was introduced, comprising audio samples of popular western music paired with listener intelligibility scores. To model diverse listening profiles, samples were processed with no, mild, and moderate simulated hearing loss. A total of 27 systems were submitted by 22 teams. Following the success of CLIP1, we are announcing the launch of CLIP2, the second lyric intelligibility challenge.
Authors: Robert Sutherland (University of Sheffield), Stefan Goetze (University of Sheffield), Jon Barker (University of Sheffield)
Abstract: Redacted
Authors: Meredith Gibbons (University of Sheffield), Xingyi Song (University of Sheffield)
Abstract: Many classification tasks, such as emotion recognition, sentiment analysis, and toxic speech classification, are highly subjective. For example, the same social media post may be perceived as 'toxic' by one user and 'not toxic' by another. This lack of a definitive 'ground truth' label complicates downstream tasks, as binary labels do not provide enough information. We evaluated three LLM families (Llama, Gemma and Ministral) across three subjective datasets, measuring their ability to predict both soft labels and annotator disagreement. Our findings highlight a gap in LLM capabilities, where the models were moderately accurate when predicting the soft label, but exhibited poor performance when predicting annotator disagreement.
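A small sketch of the style of evaluation described: build soft labels from annotator votes, compare them with a model's predicted distribution, and quantify disagreement with normalised entropy. The specific metrics (Jensen-Shannon distance, entropy) and the toy votes are assumptions rather than the paper's exact protocol.

```python
# Hedged sketch of soft-label and disagreement evaluation (assumed metrics).
import numpy as np
from scipy.spatial.distance import jensenshannon

def soft_label(votes, num_classes=2):
    """Annotator votes -> normalised label distribution."""
    counts = np.bincount(votes, minlength=num_classes)
    return counts / counts.sum()

def disagreement(dist):
    """Normalised entropy: 0 = full agreement, 1 = maximal disagreement."""
    p = dist[dist > 0]
    return float(-(p * np.log2(p)).sum() / np.log2(len(dist)))

human = soft_label(np.array([1, 1, 0, 1, 0]))   # e.g. 3 of 5 annotators say "toxic"
model = np.array([0.8, 0.2])                    # model's predicted soft label
print("soft label:", human)
print("JS distance to model:", jensenshannon(human, model, base=2))
print("human disagreement:", disagreement(human))
```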
Authors: Mattias Cross (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Redacted
Authors: Yao Xiao (University of Sheffield), Fritz Peters (University of Sheffield), Madhurananda Pahar (University of Sheffield), Dorota A Braun (University of Sheffield), Caitlin H Illingworth (University of Sheffield), Stefan Goetze (University of Sheffield), Daniel Blackburn (University of Sheffield), Heidi Christensen (University of Sheffield)
Abstract: Redacted
Authors: Xinying Wei (University of Sheffield), Eleni Vasilaki (University of Sheffield), Thomas Hain (University of Sheffield)
Abstract: Redacted