
Instructor: Jitendra Chauhan, Founder of Detoxio AI, Hands On Red Teaming Practitioner and Cybersecurity Professional since 2006. 2x Patents in AI based Red Teaming.
Objective
This course provides hands-on training in AI security, focusing on red teaming for large language models (LLMs). It is designed for offensive cybersecurity researchers, AI practitioners, and managers of cybersecurity teams. The training aims to equip participants with skills to:
Identify and exploit vulnerabilities in AI systems for ethical purposes.
Defend AI systems from attacks.
Implement AI governance and safety measures within organizations.
Why AI Security Matters
Historical Incidents Highlighting AI Vulnerabilities:
Microsoft Tay (2016): Offensive behavior due to unsupervised learning.
Amazon AI Recruiting Tool (2018): Discriminatory hiring practices caused by biased training data.
McDonald's AI Order Management System (2024): Operational failures leading to a rollback.
Rising AI Incidents:
A 300% increase in AI-related security incidents (Databricks data).
High-profile cases involving brands like Air Canada, Zillow, and others.
The Threat Landscape:
Misuse of AI for disinformation, deepfakes, and malicious activities.
Direct attacks on AI systems (e.g., jailbreaking, adversarial inputs, prompt injections).
Consequences of Inadequate AI Security:
Financial losses.
Brand damage.
Regulatory scrutiny.
Learning Goals
Understand generative AI risks and vulnerabilities.
Explore regulatory frameworks like the EU AI Act and emerging AI safety standards.
Gain practical skills in testing and securing LLM systems.
Course Structure
Introduction to AI Red Teaming:
Architecture of LLMs.
Taxonomy of LLM risks.
Overview of red teaming strategies and tools.
Breaking LLMs:
Techniques for jailbreaking LLMs.
Hands-on exercises for vulnerability testing.
Prompt Injections:
Basics of prompt injections and their differences from jailbreaking.
Techniques for conducting and preventing prompt injections.
Practical exercises with RAG (Retrieval-Augmented Generation) and agent architectures.
OWASP Top 10 Risks for LLMs:
Understanding common risks.
Demos to reinforce concepts.
Guided red teaming exercises for testing and mitigating these risks.
Implementation Tools and Resources:
Jupyter notebooks, templates, and tools for red teaming.
Taxonomy of security tools to implement guardrails and monitoring solutions.
Key Outcomes
Enhanced Knowledge: Develop expertise in AI security terminology, frameworks, and tactics.
Practical Skills: Hands-on experience in red teaming LLMs and mitigating risks.
Framework Development: Build AI governance and security maturity models for your organization.
Who Should Attend?
This course is ideal for:
Offensive cybersecurity researchers.
AI practitioners focused on defense and safety.
Managers seeking to build and guide AI security teams.
Good luck and see you in the sessions!
Welcome to the LLM Red Teaming Training. This guide provides step-by-step instructions to set up the necessary environment and tools for practicing red teaming and hands-on sessions with Large Language Models (LLMs).
This setup guide includes:
Hugging Face account registration and access token generation.
Kaggle setup for utilizing GPUs.
Optional Grok Cloud setup for additional model access.
Detox API key setup.
Enterprise cloud options for running large-scale models.
Lab Setup Guide
This guide will get your virtual lab running quickly.
1. System Requirements
First, ensure your computer has these resources available for the VM:
CPU: 4-8 Cores
RAM: 8-16 GB
Disk Space: 50 GB
2. Import the VM
Download the OVA file from the link in the Resources section and import it into your virtualization software (like VirtualBox/VMware).
Start the VM and log in with:
User: dtx
Password: dtx
3. Add API Keys
4. Run Final Setup
Your lab is now ready!
This lecture provided an introduction to Ollama, a framework for running AI models locally, even on CPUs, by using quantized versions of models. Key topics included:
Installation: Steps to install Ollama on Linux and set up the environment.
Model Management: How to browse, pull, and run various models like qwen2:0.5b and llama3.2:1b.
Customization: Creating and deploying customized models using Modelfile with parameters like temperature and system prompts.
API Access: Using APIs to interact with models programmatically.
Service Management: Commands to start, stop, and manage the Ollama service.
Version Control: Organizing and tracking customized models using Git.
This session talks about the foundation of AI through a series of labs that explore the history of GPT, the evolution of AI agents, and the early roots of perceptrons in the 1960s—the atoms of AI. You will dive into discriminative and generative models, understand their significance, and learn how the limitations of traditional NLP approaches paved the way for modern large language models (LLMs). By combining theory with practical insights, this session provides a strong base for grasping how AI has evolved into today’s powerful systems.
This lecture breaks down the core concepts of Large Language Models (LLMs). We'll explore what they are, how they are trained, and the breakthrough Transformer architecture that powers them. You will understand how their simple function of predicting the next word leads to powerful emergent capabilities like text generation, summarization, and code completion. This is your essential primer on the technology driving the AI revolution.
This session talks about adversarial testing and the essential tools used to evaluate AI systems against real-world threats. You will explore evasion attacks that fool trained models, extraction attacks that steal model functionality, and inference attacks that compromise sensitive data. The session also covers poisoning attacks that corrupt the learning process and introduces the Fast Gradient Sign Method (FGSM) as a key adversarial strategy. By understanding these methods and notable incidents, you’ll gain practical knowledge of how adversarial testing strengthens AI security and resilience.
This session talks about the introduction of IBM Adversarial Robustness Toolbox (ART), explaining its architecture and core components. You will explore the various types of attacks it supports, along with practical examples that highlight its use in adversarial testing. The session also provides an overview of key functions, helping you understand how ART can be applied to strengthen AI models against security threats.
This session talks about the TextAttack tool, covering its architecture and its attack models. You will explore how TextAttack performs data augmentation and generates adversarial attacks in NLP. The session also introduces the AdvBench dataset, providing a foundation for understanding and experimenting with adversarial robustness in natural language processing tasks.
This session talks about the setup process for the TextAttack tool and demonstrates how to get it running effectively. You will be guided through installation and configuration, ensuring the environment is ready for use. The session also walks through practical examples, showing the tool in action and helping you understand its functionality for adversarial testing in NLP.
This session talks about blackbox adversarial testing using the TextAttack tool in a real-time lab environment. You will learn how to execute attacks without direct access to the model’s internals, gaining practical insights into blackbox evaluation methods. Through hands-on demonstrations, the session shows how adversarial testing can expose vulnerabilities in NLP systems and prepare you to apply these techniques in real-world scenarios.
This session provides an overview of significant incidents involving AI systems, highlighting vulnerabilities and ethical challenges. Students will learn about key examples of AI failures, such as biased decision-making, adversarial attacks on models, and real-world consequences of AI missteps. The focus will be on understanding lessons learned to improve AI reliability, security, and fairness.
This session introduces the various categories of risks associated with AI systems. Students will learn about technical risks (e.g., adversarial attacks, robustness), ethical risks (e.g., bias, privacy), and societal risks (e.g., automation impact, misinformation). The session aims to provide a foundational understanding of how these risks can manifest and their implications for AI deployment.
This session explores the concept of AI Red Teaming, a proactive approach to identifying vulnerabilities in AI systems. Students will learn how adversarial testing, ethical hacking, and stress-testing techniques are applied to uncover weaknesses in models, datasets, and system workflows. The goal is to understand how Red Teaming enhances AI security, reliability, and resilience against threats.
This session provides an overview of the classification of AI attacks, offering insights into different methods used to exploit AI systems. Students will learn about attack categories such as adversarial attacks, data poisoning, model inversion, and evasion. The focus will be on understanding how these attacks work and their implications for AI security and trustworthiness.
Building on Part 1, this session delves deeper into advanced and emerging AI attack techniques. Students will explore methods like backdoor attacks, membership inference, and model stealing. The session emphasizes real-world scenarios, the evolving landscape of AI vulnerabilities, and strategies to detect and mitigate these sophisticated threats.
This session introduces the concept of adversarial testing for AI systems, focusing on natural language processing (NLP) models. Students will learn how adversarial examples are generated and tested using the TextAttack framework. Through a hands-on demonstration, they will observe how small, crafted perturbations can impact model predictions and understand the importance of robustness in NLP models.
In this session, we will go through the results of a systematic evaluation of a natural language processing (NLP) model's robustness against adversarial attacks using TextAttack, a specialized framework for generating and analyzing adversarial examples. By applying predefined attack recipes and leveraging model-specific configurations, we tested the model's ability to maintain accurate predictions under manipulated inputs. The results revealed key insights into the model's vulnerabilities and its performance under constrained adversarial conditions, offering valuable guidance for refining its defenses and improving overall robustness.
In this session, students will explore the concept of jailbreaking AI systems, focusing on methods to bypass safeguards in language models. Through live demonstrations, students will learn how adversarial prompts can exploit system weaknesses and the importance of designing stronger guardrails to prevent misuse.
This session provided insights into testing a customized Ollama model using the Garak tool. The focus was on evaluating the model's performance, identifying potential vulnerabilities, and validating its adherence to customization parameters
In this session, we explored the results of a Garak analysis, focusing on the performance of a language model when subjected to adversarial probes and mitigation strategies. The primary objective was to interpret the model's vulnerabilities, particularly in handling DAN-style jailbreaks and its ability to bypass or maintain mitigation defenses. By examining the success and failure rates across different detectors, we identified specific areas where the model demonstrated weaknesses, providing actionable insights for improving its robustness and safety mechanisms. The session highlighted the effectiveness of Garak as a tool for diagnosing and benchmarking language model vulnerabilities.
This chapter introduces the concept of prompt injections and their role as a class of attacks on AI applications. Students will learn how prompt injections manipulate LLM behavior to bypass safeguards, similar to SQL injection in traditional web security.
This chapter provides an in-depth look into the internal structure of prompts, covering system prompts, context data, developer instructions, and user queries.
Explore real-world instances where prompt injections have caused security breaches, including OpenAI's financial API, Chevrolet chatbot, and Microsoft Tay chatbot incidents.
Students will learn to construct their first prompt injection attack using techniques like forceful suggestion and role-playing to influence LLM behavior.
Session Overview
The session introduced two primary platforms: PokeBot and Medusa, which are designed to simulate real-world vulnerabilities in AI systems. Participants were tasked with identifying and exploiting vulnerabilities in these systems to gain insights into potential security flaws in AI-powered applications. The session emphasized the importance of implementing robust guardrails to prevent unauthorized access and data leaks.
Key Platforms and Their Purpose
PokeBot
A sample healthcare assistant designed to handle healthcare-related queries.
Demonstrates a limited ability to respond to out-of-context questions due to a lack of data, simulating a basic guardrail.
Participants explored ways to bypass these guardrails and make the bot respond to unrelated or sensitive queries.
Medusa
A platform containing multiple vulnerable AI-driven applications, each presenting unique challenges for participants to solve:
Math Assistant: Exploiting vulnerabilities in a library to evaluate math expressions.
SQL DB Assistant: Extracting unauthorized data from a database.
Chat Leaky Assistant: Accessing sensitive credentials across progressive challenge levels.
Fintech Assistant: Extracting sensitive transactional data.
This chapter dives into foundational techniques such as "Ignore All Instructions" and "Forceful Suggestion," which exploit attention shifts in LLMs.
The session introduced Medusa, a GenAI application designed for exploring vulnerabilities in generative AI systems through interactive challenges. Participants learned how to navigate the platform, select challenges, and engage with AI agents showcasing specific vulnerabilities such as prompt injection, SQL injection, and code execution flaws. The session highlighted how to creatively interact with AI guardrails, exploit vulnerabilities, and retrieve hidden flags, offering a hands-on approach to understanding AI security concepts. This interactive learning experience provided insights into securing AI applications against common threats.
This chapter explores techniques like context switching, payload splitting, and obfuscation to bypass advanced safeguards in LLM systems.
In this lecture, we explore the OWASP Top 10 vulnerabilities for Large Language Models (LLMs) and why they matter in today's AI-driven world. From Prompt Injection to Sensitive Information Disclosure, Data Poisoning, and System Prompt Leakage, we break down the biggest risks that come with implementing AI systems. These vulnerabilities highlight how LLMs can be manipulated, exploited, or misused, posing risks to businesses, developers, and users alike.
This lecture introduces the concept of reasoning models, highlighting their ability to perform logical deductions, handle multi-step tasks, and simulate human-like problem-solving processes. It explains the importance of reasoning capabilities in AI systems and provides foundational knowledge about their applications.
In this lecture, the training processes and methodologies behind reasoning models, such as chain-of-thought reasoning and reinforcement learning, are discussed. It explores their applicability across domains like decision-making, mathematics, and novel problem-solving scenarios, showcasing their superiority in handling complex tasks.
This session provides a hands-on guide to running Deepseek, a cutting-edge reasoning model, using the Olama platform. It covers the setup process, execution steps, and the use of Deepseek for advanced reasoning tasks, emphasizing its efficiency and practicality in real-world problem-solving.
This lecture introduces AI agents, explaining their ability to integrate reasoning with autonomous actions to perform complex workflows. It discusses their architecture, including components like cognition, planning, tool integration, and memory, highlighting their potential to revolutionize AI-driven decision-making.
Explore a hands-on demonstration of an AI agent performing automated vulnerability assessment using tools like Nmap, showcasing goal-driven task planning, execution, and result analysis.
Trace the journey of AI agents from simple LLMs to sophisticated autonomous systems capable of reasoning, planning, and acting in diverse real-world scenarios.
Learn how autonomous AI agents are transforming enterprise operations—from marketing and customer support to cybersecurity and workflow automation.
Understand how Detoxio's agent generates adversarial prompts to test and bypass AI safety filters, highlighting both offensive testing and defensive reinforcement techniques.
In this lecture, learners are guided through the process of building an AI agent specifically for red teaming tasks. The session covers the design, integration of reasoning models, and deployment of agents to identify vulnerabilities in AI systems, showcasing their application in enhancing AI security.
Overview:
Understanding prompt injection and jailbreak techniques in LLMs.
Key security risks such as excessive privilege, data leaks, and context poisoning.
Strategies to mitigate these risks through filtering, access control, and input validation.
What to Expect:
How attackers exploit LLM weaknesses.
Different approaches to filtering malicious prompts.
Practical methods to prevent unauthorized access and information leakage.
Overview:
Demonstration of jailbreak filtering techniques in LLM-based applications.
Testing filtering mechanisms using real-world jailbreak prompts.
Identifying limitations and bypass techniques.
What to Expect:
Live testing of security filters against prompt injections.
Understanding how regex-based and model-based filters work.
Observing weaknesses in existing filtering techniques and discussing improvements.
Overview:
Hands-on demonstration of an AI application with security measures in place.
Testing how effective different filtering layers are against malicious inputs.
Exploring real-time detection and blocking of adversarial prompts.
What to Expect:
Interaction with a secure AI assistant in a controlled environment.
How security layers like input validation, context monitoring, and access restrictions work.
Identifying practical challenges in real-world implementations.
Overview:
Introduction to LLAMA Guard, an advanced model for filtering harmful content.
Detecting and blocking misinformation, toxicity, defamation, and privacy risks.
Customizing LLAMA Guard for different security needs.
What to Expect:
Practical use cases for LLAMA Guard in AI applications.
Live demonstrations of filtering different types of harmful prompts.
Understanding the model’s strengths and weaknesses in handling adversarial attacks.
This session delves into the evolution of Large Language Models (LLMs), tracing their journey from early Natural Language Processing (NLP) techniques like Bag of Words to the transformative impact of attention-based models. We explore foundational challenges in NLP, such as handling context and polysemy, and introduce tokenization as a crucial step in modern LLMs.
Key Takeaways for Participants:
Understand the historical progression of NLP and LLMs.
Identify the challenges early NLP methods faced in handling language complexity.
Learn the importance of tokenization in preparing data for LLMs.
This session provides a foundational understanding of embeddings and self-attention mechanisms, the building blocks of LLMs. Participants will learn how embeddings represent words in vector space and how self-attention allows models to capture contextual relationships within text.
Key Takeaways for Participants:
Grasp the concept of embeddings as numerical representations of language.
Explore the role of self-attention in understanding context and meaning.
Recognize the significance of these concepts in improving language comprehension in AI.
This session focuses on the architecture that revolutionized NLP—Transformers. Participants will learn about the encoder-decoder model, the role of multi-layer perceptrons, and how stacked transformer blocks enable LLMs to process and generate text effectively.
Key Takeaways for Participants:
Understand the structure and components of Transformer models.
Learn how encoder-decoder architecture supports context understanding and text generation.
Recognize the importance of transformer layers in advancing AI capabilities.
This session provides a comparative analysis of popular LLMs, such as GPT, LLaMA, and Gemini. Participants will explore how models differ in quality, speed, cost, and capabilities, as well as their use cases in diverse applications.
Key Takeaways for Participants:
Gain insights into the strengths and limitations of various LLMs.
Learn how to evaluate LLMs based on context length, parameters, and benchmarks.
Understand the considerations for selecting models for specific business or technical needs.
Objective
This course provides hands-on training in AI security, focusing on red teaming for large language models (LLMs). It is designed for offensive cybersecurity researchers, AI practitioners, and managers of cybersecurity teams. The training aims to equip participants with skills to:
Identify and exploit vulnerabilities in AI systems for ethical purposes.
Defend AI systems from attacks.
Implement AI governance and safety measures within organizations.
Learning Goals
Understand generative AI risks and vulnerabilities.
Explore regulatory frameworks like the EU AI Act and emerging AI safety standards.
Gain practical skills in testing and securing LLM systems.
Course Structure
Introduction to AI Red Teaming:
Architecture of LLMs.
Taxonomy of LLM risks.
Overview of red teaming strategies and tools.
Breaking LLMs:
Techniques for jailbreaking LLMs.
Hands-on exercises for vulnerability testing.
Prompt Injections:
Basics of prompt injections and their differences from jailbreaking.
Techniques for conducting and preventing prompt injections.
Practical exercises with RAG (Retrieval-Augmented Generation) and agent architectures.
OWASP Top 10 Risks for LLMs:
Understanding common risks.
Demos to reinforce concepts.
Guided red teaming exercises for testing and mitigating these risks.
Implementation Tools and Resources:
Jupyter notebooks, templates, and tools for red teaming.
Taxonomy of security tools to implement guardrails and monitoring solutions.
Key Outcomes
Enhanced Knowledge: Develop expertise in AI security terminology, frameworks, and tactics.
Practical Skills: Hands-on experience in red teaming LLMs and mitigating risks.
Framework Development: Build AI governance and security maturity models for your organization.
Who Should Attend?
This course is ideal for:
Offensive cybersecurity researchers.
AI practitioners focused on defense and safety.
Managers seeking to build and guide AI security teams.
Good luck and see you in the sessions!