Publications

Red teaming large language models: A comprehensive review and critical analysis

  • Authority: Information Processing & Management
  • Category: Journal Publication

Securing large language models (LLMs) remains a critical challenge as their adoption across various sectors rapidly grows. While advancements in LLM development have enhanced their capabilities, inherent vulnerabilities continue to pose significant risks, exposing these models to various forms of attack. This study provides a comprehensive review of LLM red teaming, distinguished by its broad coverage and intuitive organization. It systematically explores a range of red teaming attacks, including prompt-based attacks, data manipulation attacks, model exploitation attacks, information extraction attacks, and model degradation attacks. It also provides a critical review and analysis of evaluation methods and benchmarks, focusing on the datasets, evaluation metrics, and benchmarking techniques used in LLM red teaming and risk assessment. Our review reflects the current state of LLM security and, by integrating recent and impactful research, offers new insights alongside established methods. The structured presentation of our findings provides a comprehensive and actionable resource that facilitates a deeper understanding of the complexities involved. By highlighting proactive assessment of risk and exploitation potential, this review contributes to the development of more secure and responsible LLMs and serves as a valuable guide for researchers, practitioners, and policymakers.