Dataset Creation & Challenge Projects

This funding, awarded in Spring 2026 through the hub’s Dataset Creation & Challenge Projects programme, supports collaborative teams to develop datasets that could help make AI systems safer, more reliable and better aligned with the real world.

  • An AI tool to evaluate the reliability of collision avoidance systems in space

  • Description:

    With more than 40,000 tracked orbital objects, AI systems that reliably interpret surveillance imagery and reason about orbital risks are urgently needed. Currently, no specialised benchmark exists to evaluate generative AI performance on space situational awareness tasks. The “SSA-Language Model Benchmark (SSA-LaMB)” addresses this critical gap by creating the first comprehensive multimodal benchmark for evaluating large language models on space surveillance tasks. The £490bn global space economy infrastructure underpins 18% of UK GDP.

    All components will be released openly through Hugging Face Datasets, Figshare, and GitHub, democratising SSA research for institutions without access to classified data and enabling trustworthy generative AI for safety-critical space operations.

    • Lead researcher: Professor Wai Lok Woo, Head of Data Science and Artificial Intelligence, Northumbria University

    • Co-investigators: Zeyneb Kurt, University of Sheffield; Yulei Li, Northumbria University

    • Collaborators: University of Sheffield, 3S Northumbria Ltd (UK), ExoAnalytic Solutions Inc (USA)

  • A high-quality multimodal dance dataset in collaboration with Studio Wayne McGregor, capturing motion, bio-signals and choreographic data to investigate creative processes and artistic intent using Gen AI techniques.

  • Description:

    This project captures the highest quality dataset of human creative performance ever assembled, recording a wide range of synchronised modalities, including optical motion, volumetric video, ground-reaction forces, audio, choreographic notation, gaze, and biosignals such as heart rate and skin conductance.

    Existing dance datasets are limited in motion quantity, variety, capture frequency, and modalities, leaving fundamental questions about the creative process unanswered. By capturing this breadth of signals under rigorous protocols, the dataset will enable techniques such as causal inference to investigate "artistic intent" - unlocking new avenues for GenAI research into human creativity.

    The dataset will be created at the state-of-the-art CAMERA facility in collaboration with Studio Wayne McGregor, led by the nation's most decorated choreographer, Sir Wayne McGregor. Bespoke dance pieces designed specifically for capture, combined with biosignals and annotation data, will allow researchers to disentangle creative intent from artistic style.

    • Lead researcher: Professor Neill Campbell, Professor of Visual Computing and Machine Learning, University College London

    • Co-investigators: Professor Lourdes Agapito, University College London; Dr Murray Evans, George Fletcher, University of Bath; Professor Yukun Lai, Cardiff University; Professor Sir Wayne McGregor, Studio Wayne McGregor

    • Collaborators: Cardiff University, University of Bath, Studio Wayne McGregor

  • Creation of a high-fidelity dataset of how sound truly behaves in a 3D environment, enabling a future where our devices understand the physical layout of our world as intuitively as we do.

  • Description:

    This project will create a large, high-fidelity dataset recorded in real household settings to teach AI how sound truly behaves in 3D environments. Current AI systems treat real-world sounds, such as the clink of a dropped glass, as mere “noise” because they are trained on simplified or computer-generated audio that does not capture the rich, intricate acoustics of an actual home.

    By making this data open source, the project enables a future where our devices understand the physical layout of our world as intuitively as we do.

    • Lead researcher: Dr Iran R. Roman, Lecturer, Queen Mary University of London

    • Collaborators: Meta Platforms Ltd, Sony AI

  • Multi-DocVerify benchmarks AI's ability to learn from complex real-world documents, including charts, tables, and long texts, and to prioritise new evidence over training data. It will be tested via professional fact-checking scenarios.

    • Lead researcher: Dr Xingyi Song, Lecturer in Computational Media Analysis, University of Sheffield

    • Co-investigator: Dr Carolina Scarton, University of Sheffield

    • Collaborators: Full Fact, Advanced Manufacturing Research Centre (AMRC)

Projects

From improving how AI understands sound and human movement to reducing the risk of satellite collisions and helping systems update their knowledge more reliably, these projects address some of the most pressing challenges facing AI today.