Making ChatGPT & Co. usable for security-relevant applications

On June 4, 2025, the Agentur für Innovation in der Cybersicherheit GmbH (Cyberagentur) launched the call for HEGEMON – a new research program for the development of holistic benchmarks and adapted AI models for safety-critical applications. In a competition-based process, generative foundation models are to be adapted and evaluated as prototypes for complex tasks in the geoinformation sector for the first time – in a direct performance comparison “everyone against everyone”.
Who hasn’t experienced this? You want to generate an image or have the essential content of a long text summarized. The first impulse is to turn to a generative AI service based on a foundation model (often a large multimodal language model such as GPT-4o), which is supposed to produce the desired image or summary from a text prompt. Unfortunately, the results of the first attempts are often unsatisfactory or even incorrect.
The wide range of possible applications of foundation models, and the potential to accelerate processes that comes with them, is also of interest to German security authorities. However, if we transfer the text summarization example to the security context (e.g. a soldier using such a tool to summarize a lengthy order), an error of this kind carries far greater consequences. To date, there are no comprehensive tests (so-called benchmarks) designed specifically for the security context that could be used to evaluate the possible applications of pre-trained foundation models.
Against this backdrop, the Cyberagentur’s HEGEMON research program (“Holistic Evaluation of GEnerative Foundation Models in the Security Context”) was launched on June 4, 2025. Universities, colleges, research institutions, companies and start-ups are invited to participate. The program will be carried out as a multi-year research competition with several phases and high disruptive potential.
“With HEGEMON, we are creating a unique field of experimentation to systematically enable the comprehensive evaluation of generative AI models in the security sector – beyond previous benchmark routines,” says Dr. Daniel Gille, project manager of the program and Head of Artificial Intelligence at the Cyberagentur. “We invite all research-intensive players to take part in this challenge.”
The research program addresses a central gap in current AI development: foundation models such as ChatGPT, Midjourney or Claude are developed internationally, mostly by private companies and mostly in the USA and China. They are often opaque with regard to their training data and model architectures, and they increasingly dominate safety-critical areas. In the German and European context in particular, there is a lack of opportunities to evaluate these models comprehensively, comparatively and comprehensibly, especially in relation to complex, multimodal tasks and applications with high demands on safety and factual accuracy.
The aim of HEGEMON is to develop domain-specific, holistic benchmark sets (consisting of tasks, metrics and test data sets) as well as adapted AI models for defined use cases. This combination of benchmark and model development is tested in a competitive scenario in which all participants evaluate each other’s solutions – an “everyone against everyone” structure with integrated red/blue teaming for robustness testing.
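The “everyone against everyone” idea can be illustrated with a small sketch. It is not the program’s actual evaluation procedure: the names `Task`, `Benchmark`, `exact_match` and `cross_evaluate`, the toy teams and the toy questions are all assumptions made for illustration, and red/blue teaming is not modeled. The sketch only shows how every participant’s model could be scored on every participant’s benchmark, where a benchmark bundles tasks, a metric and test data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A "model" is reduced here to a function from prompt to answer (assumption).
Model = Callable[[str], str]


@dataclass
class Task:
    prompt: str
    reference: str  # expected answer from the test data set


@dataclass
class Benchmark:
    owner: str                            # participant who contributed it
    tasks: List[Task]
    metric: Callable[[str, str], float]   # (output, reference) -> score in [0, 1]


def exact_match(output: str, reference: str) -> float:
    """Toy metric: 1.0 if the answers match, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


def cross_evaluate(
    models: Dict[str, Model], benchmarks: List[Benchmark]
) -> Dict[Tuple[str, str], float]:
    """Score every model on every benchmark, including those of other teams.

    Returns a mapping (model owner, benchmark owner) -> mean task score.
    """
    scores: Dict[Tuple[str, str], float] = {}
    for bench in benchmarks:
        if not bench.tasks:
            raise ValueError(f"benchmark of {bench.owner} has no tasks")
        for name, model in models.items():
            per_task = [bench.metric(model(t.prompt), t.reference) for t in bench.tasks]
            scores[(name, bench.owner)] = sum(per_task) / len(per_task)
    return scores


# Hypothetical participants with lookup-table "models".
models: Dict[str, Model] = {
    "team_a": lambda p: {"Capital of France?": "Paris"}.get(p, "unknown"),
    "team_b": lambda p: {"Capital of France?": "paris",
                         "Capital of Italy?": "Rome"}.get(p, "unknown"),
}
benchmarks = [
    Benchmark("team_a", [Task("Capital of France?", "Paris")], exact_match),
    Benchmark("team_b", [Task("Capital of France?", "Paris"),
                         Task("Capital of Italy?", "Rome")], exact_match),
]

scores = cross_evaluate(models, benchmarks)
for (model_owner, bench_owner), score in sorted(scores.items()):
    print(f"model {model_owner} on benchmark {bench_owner}: {score:.2f}")
```

In this toy setup, a model that scores well only on its own team’s benchmark stands out immediately in the resulting score matrix, which is the kind of comparison a mutual evaluation is meant to enable.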
“We are rethinking the evaluation of large models,” continues project manager Dr. Daniel Gille. “The benchmarks we are aiming for should not only measure what is technically possible, but also what is safe, explainable and relevant to the application.”
The focus is on three use cases from the geoinformation sector:
- the creation of comprehensible text summaries on country-specific topics,
- the conversion of remote sensing data into vector data,
- and a map chatbot with intelligent text output based on maps (e.g. “Are there medical facilities on this map? Please share the coordinates if they exist.”).
To implement this, the Cyberagentur relies on the tried and tested procedure of pre-commercial procurement (PCP). It allows for risky but fully funded research outside of regular procurement law. PCP is ideal for disruptive projects where no mature market products yet exist, which is precisely the case for the research subject of HEGEMON. Participants retain their exploitation rights, which further strengthens their innovative capacity.
The call for proposals is the first step in a multi-phase research process that will extend over a total of three years. Following an evaluation phase, three interaction points will be defined at which benchmarks and models will be systematically compared, improved and re-evaluated.
The invitation to tender was published in the Supplement to the Official Journal of the European Union under contract notice number TED 358520-2025 (https://ted.europa.eu/de/notice/-/detail/358520-2025). The deadline for submitting the short concept is July 31, 2025, 10:00 a.m. Participants may apply individually or as part of a consortium.
Further information and registration: