Holistic Evaluation of Generative Foundation Models in a Security Context (HEGEMON)

Status: Project phase

Background

Generative AI applications such as ChatGPT or Midjourney are currently attracting a great deal of attention. Because they generate complex, multimodal outputs (e.g. text, images, audio, video) from free-form inputs (prompts), these models can be used in a wide range of application areas without any prior technical knowledge. Given their great application potential, increasing adoption of generative AI models in the domains of internal and external security is foreseeable. The foundation models powering generative AI applications are mostly trained at great expense by private-sector companies, primarily in the USA and China. Their underlying data sets, training mechanisms and model architectures are usually not (or no longer) published. The high application potential of foundation models is thus offset by a currently high degree of technological dependency and by risks to cyber security and application security.

Evaluations and comparisons in the form of benchmarks are useful for better assessing the properties of externally trained models. However, the high versatility and unstructured outputs of these models make such evaluations a complex problem, one that takes on additional urgency in the security context. In view of the recent strong growth in the capabilities of large AI models, holistic benchmarking in particular remains an open and increasingly relevant research question.

Aim

As part of a competition, domain-specific, holistic benchmark sets (consisting of tasks, metrics and suitable test data sets) as well as AI models adapted to defined use cases are to be developed. The benchmark sets are intended to enable a holistic evaluation of pre-trained generative foundation models (e.g. text-image models) for a given use case. The foundation models are to be adapted to these use cases (via fine-tuning or in-context learning), evaluated against the various benchmarks developed, and implemented in the form of an application demonstrator. In addition, conceptual insights are to be gained into the fundamental problem of evaluating universally applicable AI systems. A minimal, purely illustrative sketch of such a benchmark set is given after the use-case list below.

The focus is on three use cases from the geoinformation sector:

  • the creation of comprehensible text summaries on country-specific topics,
  • the conversion of remote sensing data into vector data,
  • and a map chatbot with intelligent text output based on maps (e.g. “Are there medical facilities on this map? Please share the coordinates if they exist.”).
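For illustration only, the following sketch shows what a minimal, domain-specific benchmark set in the sense described above (tasks, metrics, test data) might look like for the map chatbot use case. All names, the toy metric and the sample coordinates are assumptions made for this sketch and are not part of the HEGEMON specification.

```python
# Illustrative sketch only: a hypothetical benchmark-set structure, not the HEGEMON format.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkItem:
    prompt: str       # free-form input handed to the model, e.g. a map-related question
    reference: str    # expected answer used by the metric


@dataclass
class Benchmark:
    name: str
    items: List[BenchmarkItem]
    metric: Callable[[str, str], float]  # (model_output, reference) -> score in [0, 1]

    def evaluate(self, model: Callable[[str], str]) -> float:
        """Average metric score of a model over all test items."""
        scores = [self.metric(model(item.prompt), item.reference) for item in self.items]
        return sum(scores) / len(scores)


def contains_reference(output: str, reference: str) -> float:
    """Toy metric: 1.0 if the reference string (e.g. a coordinate pair) appears in the output."""
    return 1.0 if reference.lower() in output.lower() else 0.0


# Hypothetical test item for the map chatbot use case; the coordinates are made up.
map_chatbot_benchmark = Benchmark(
    name="map-chatbot-coordinates",
    items=[
        BenchmarkItem(
            prompt="Are there medical facilities on this map? "
                   "Please share the coordinates if they exist.",
            reference="52.5200, 13.4050",
        ),
    ],
    metric=contains_reference,
)

# A model adapted via fine-tuning or in-context learning would be plugged in as a
# prompt-to-text callable, e.g.: map_chatbot_benchmark.evaluate(my_adapted_model)
```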

Disruptive Risk Research

The development of benchmarks, the adaptation of foundation models and their realization as demonstrators take place in a unique competitive constellation in which each participant is compared directly with all other participants in both benchmark and model development. Each model is evaluated and ranked against all benchmarks developed; all benchmarks are in turn evaluated separately with regard to their characteristics. As a high-risk research challenge, the project may find that no sufficiently suitable evaluation mechanisms exist for certain AI systems under certain (holistic) requirements, since every benchmark is by definition specific, finite and contextual.
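The competitive cross-evaluation described above can be pictured as a score matrix over all submitted models and all submitted benchmarks. The following minimal sketch illustrates that idea; the model and benchmark names, the scoring functions and the ranking by mean score are assumptions made for illustration and do not reflect the actual competition rules.

```python
# Illustrative sketch only: every model is scored on every benchmark, then ranked.
from typing import Callable, Dict, List, Tuple

# A model is represented as a callable from prompt to output; a benchmark is
# represented abstractly as a function that scores a model in [0, 1].
Model = Callable[[str], str]
Benchmark = Callable[[Model], float]


def cross_evaluate(models: Dict[str, Model],
                   benchmarks: Dict[str, Benchmark]) -> Dict[str, Dict[str, float]]:
    """Score matrix: every submitted model is evaluated on every submitted benchmark."""
    return {
        model_name: {bench_name: bench(model) for bench_name, bench in benchmarks.items()}
        for model_name, model in models.items()
    }


def rank_models(scores: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Rank models by their mean score across all benchmarks, highest first."""
    means = {name: sum(row.values()) / len(row) for name, row in scores.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Dummy participants, purely for illustration.
    models: Dict[str, Model] = {
        "team_a_model": lambda prompt: "answer from team A",
        "team_b_model": lambda prompt: "answer from team B",
    }
    benchmarks: Dict[str, Benchmark] = {
        "benchmark_1": lambda model: 1.0 if "A" in model("test prompt") else 0.0,
        "benchmark_2": lambda model: 0.5,
    }
    print(rank_models(cross_evaluate(models, benchmarks)))
```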

Questions about the programme? Please write to us:

  • Programme team: Key technologies | Artificial intelligence
  • E-mail: hegemon@cyberagentur.de
