Provoking LLMs by LLMs: A Red Team Framework for Proactive Detection of Harmful Content
Dates: 2024
Principal Investigator: Dr. Moataz Ahmed
Description: Identifying all instances where the LLM under test (LLM-UT) falls short before its deployment in real-world scenarios is difficult because the space of inputs that can prompt a model to generate harmful content is vast. Current validation methods depend on human testers to manually identify failure cases; while effective, this is costly and limits the scope and diversity of the issues found. Our objective is to supplement manual testing by autonomously discovering failure cases through 'red teaming,' a proactive approach that identifies vulnerabilities by simulating adversarial attacks. We propose generating test cases with a red-team LLM (RD-LLM) and developing automated tools that extrapolate from a seed dataset to produce a broader set of red-teaming scenarios. As a pilot, the research focuses on detecting one type of harmful content.
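
To make the proposed pipeline concrete, the following is a minimal sketch of one red-teaming round, assuming hypothetical callables rd_llm (the red-team LLM that expands a seed prompt into new attack prompts), llm_ut (the LLM under test), and harm_classifier (a scorer for the single pilot harm type). None of these names come from the project itself; the stubs stand in for whichever model back-ends would actually be used.

"""Sketch of one red-teaming round: extrapolate attack prompts from seeds,
probe the LLM under test, and keep the responses flagged as harmful.
All model components are passed in as plain callables so the control flow
can be exercised with stubs, independent of any particular model API."""

from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class FailureCase:
    prompt: str        # adversarial prompt produced by the red-team LLM
    response: str      # response from the LLM under test
    harm_score: float  # classifier score for the pilot harm type, in [0, 1]


def red_team_round(
    seed_prompts: Iterable[str],
    rd_llm: Callable[[str], List[str]],       # expands one seed into new attack prompts
    llm_ut: Callable[[str], str],             # LLM under test
    harm_classifier: Callable[[str], float],  # harm score for a response
    threshold: float = 0.5,
) -> List[FailureCase]:
    failures: List[FailureCase] = []
    for seed in seed_prompts:
        for prompt in rd_llm(seed):
            response = llm_ut(prompt)
            score = harm_classifier(response)
            if score >= threshold:
                failures.append(FailureCase(prompt, response, score))
    return failures


if __name__ == "__main__":
    # Stub components so the sketch runs without calling any real model.
    seeds = ["seed prompt from the curated dataset"]
    rd_llm = lambda seed: [f"{seed} (rephrased variant {i})" for i in range(3)]
    llm_ut = lambda prompt: f"hypothetical response to: {prompt}"
    harm_classifier = lambda response: 0.9 if "variant 1" in response else 0.0

    for case in red_team_round(seeds, rd_llm, llm_ut, harm_classifier):
        print(f"[{case.harm_score:.2f}] {case.prompt!r} -> {case.response!r}")

Passing the models as callables is only an illustrative design choice: it keeps the loop structure (seed extrapolation, probing, filtering by harm score) separate from whatever RD-LLM, LLM-UT, and classifier the project ultimately adopts.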