Key technologies

Holistic Evaluation of Generative Foundation Models in a Security Context (HEGEMON)

Background

Status: Planned Program

Generative AI applications such as ChatGPT or Midjourney are currently attracting a great deal of attention. Because they generate complex, multimodal outputs (e.g. text, image, audio, video) from free-form inputs (prompts), these models can be used in a wide range of application areas without prior technical knowledge. Given their great application potential, increasing adoption of generative AI models in the domains of internal and external security is foreseeable. The foundation models powering generative AI applications are mostly trained at great expense by private-sector companies, primarily in the USA and China. Their underlying data sets, training mechanisms and model architectures are usually not (or no longer) published. The high application potential of foundation models is thus offset by a currently high level of technological dependency and by risks to cyber security and application security.

Evaluations and comparisons in the form of benchmarks are useful for improving the assessment of the properties of externally trained models. However, the high versatility and unstructured outputs of these models make such evaluation a complex problem, one that takes on additional urgency in the security context. Holistic benchmarking in particular remains an open and increasingly relevant research question in view of the recent strong growth in the capabilities of large AI models.

Aim

As part of a competition, domain-specific, holistic benchmark sets (consisting of tasks, metrics and matching test data sets) as well as adapted AI models for defined use cases are to be developed, enabling a holistic evaluation of pre-trained generative AI foundation models (e.g. text-to-image models) for a given use case. In addition, foundation models are to be adapted to this use case (fine-tuning or in-context learning), evaluated using the various benchmarks developed, and implemented in the form of an application demonstrator. Beyond that, conceptual insights are to be gained into the fundamental problem of evaluating AI systems, especially those that can be deployed universally.

The focus is on three use cases from the field of geoinformation:

  • the generation of traceable text summaries on country-specific topics,
  • the conversion of remote sensing data into vector data,
  • and a map chatbot with intelligent text output based on maps (e.g. "Are there any medical facilities on this map? Please share the coordinates if they exist.").
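To make the structure of such a benchmark set more tangible, the following minimal sketch shows how tasks, metrics and test data could be combined into an evaluation harness for an adapted foundation model. All names (BenchmarkTask, evaluate_model, the generate callable, the toy exact_match metric) are hypothetical illustrations, not part of the programme specification; the actual benchmark sets will be defined by the competing teams.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A single benchmark task: prompt-style inputs, reference outputs and a
# metric that scores a model output against a reference (0.0 .. 1.0).
@dataclass
class BenchmarkTask:
    name: str
    test_inputs: List[str]               # matching test data set
    references: List[str]                # expected / reference outputs
    metric: Callable[[str, str], float]  # e.g. factuality, georeference accuracy

def exact_match(output: str, reference: str) -> float:
    """Toy metric: 1.0 if the normalized output equals the reference."""
    return float(output.strip().lower() == reference.strip().lower())

def evaluate_model(generate: Callable[[str], str],
                   tasks: List[BenchmarkTask]) -> Dict[str, float]:
    """Run an adapted foundation model (wrapped as `generate`) over all tasks
    of a benchmark set and return one aggregate score per task."""
    scores: Dict[str, float] = {}
    for task in tasks:
        per_item = [task.metric(generate(x), ref)
                    for x, ref in zip(task.test_inputs, task.references)]
        scores[task.name] = sum(per_item) / len(per_item)
    return scores

if __name__ == "__main__":
    # A tiny task in the spirit of the map chatbot use case, plus a stand-in
    # for a fine-tuned or in-context-adapted foundation model.
    map_qa = BenchmarkTask(
        name="map_chatbot_facilities",
        test_inputs=["Are there any medical facilities on this map?"],
        references=["yes"],
        metric=exact_match,
    )
    dummy_model = lambda prompt: "yes"
    print(evaluate_model(dummy_model, [map_qa]))
```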

Disruptive Risk Research

The development of benchmarks, the adaptation of foundation models and their realisation as demonstrators take place in a unique competitive constellation in which each participant is in direct comparison with all other participants with regard to both benchmark and model development. Each model is evaluated and ranked against all benchmarks developed, and all benchmarks are additionally evaluated separately with regard to their characteristics. As a high-risk research challenge, the programme may find that no sufficiently suitable evaluation mechanisms exist for certain AI systems under certain (holistic) requirements, since every benchmark is by definition specific, finite and contextual.
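The competitive constellation can be thought of as a cross-evaluation matrix: every submitted model is scored by every submitted benchmark, and the models are then ranked from the aggregated scores. The sketch below illustrates this idea under purely hypothetical assumptions; the cross_evaluate function, the ranking by mean score and all example names are illustrative and not the programme's actual scoring procedure.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical cross-evaluation: each model is scored by each benchmark,
# then models are ranked by their mean score across all benchmarks.
# A "benchmark" here is any callable that takes a model (a prompt -> text
# function) and returns a score between 0.0 and 1.0.
def cross_evaluate(models: Dict[str, Callable[[str], str]],
                   benchmarks: Dict[str, Callable[[Callable[[str], str]], float]]
                   ) -> List[Tuple[str, float]]:
    results = {
        model_name: {bench_name: bench(model)
                     for bench_name, bench in benchmarks.items()}
        for model_name, model in models.items()
    }
    # Rank models by their average score over all benchmarks (higher is better).
    return sorted(
        ((name, sum(scores.values()) / len(scores))
         for name, scores in results.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

if __name__ == "__main__":
    # Two stand-in models and one toy benchmark in the spirit of the map chatbot use case.
    models = {"team_a_model": lambda p: "yes", "team_b_model": lambda p: "no"}
    benchmarks = {"facility_qa": lambda m: float(m("Any medical facilities?") == "yes")}
    print(cross_evaluate(models, benchmarks))
```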

Questions about the programme? Please write to us:

  • Programme team: Key Technologies | Cybersecurity through AI & for AI
  • Email: hegemon@cyberagentur.de
