AI Safety Colloquium

Zoom Link: Link

Date | Speaker | Topic | Format | Materials
April 2nd, 2024, 14:00–15:00 | Fazl Barez (Research Fellow, University of Oxford) | Introduction to AI safety: Can we remove undesired behaviour from AI? | Hybrid | Slide
April 16th, 2024, 14:00–15:00 | Pin-Yu Chen (Principal Research Scientist, IBM Research AI) | Exploring Safety Risks in Large Language Models and Generative AI | Online | Slide
April 17th, 2024, 14:00–15:00 | Vivien Cabannes (Postdoctoral Researcher, Meta AI) | Deep Learning theory from an Associative Memory point of view | Hybrid | Slide
April 23rd, 2024, 13:00–14:00 | Andy Zou (PhD Student, CMU) | Representation Engineering: A Top-Down Approach to AI Transparency | Online | Slide
May 27th, 2024, 17:00–18:00 | Javier Rando (PhD Student, ETH Zurich) | Universal Jailbreak Backdoors from Poisoned Human Feedback | Online | RSVP

Session 1: Introduction to AI safety: Can we remove undesired behaviour from AI?

Speaker: Fazl Barez

Speaker Bio:

Fazl Barez is a Research Fellow at the Torr Vision Group (TVG), University of Oxford, where he leads safety research. He is also a Research Affiliate at the Krueger AI Safety Lab (KASL), University of Cambridge, and a Research Advisor at Apart Research. Additionally, Fazl holds affiliations with the Centre for the Study of Existential Risk at the University of Cambridge and the Future of Life Institute. Previously, he worked on AI policy at RAND, on interpretability at Amazon and The DataLab, on safe recommender systems at Huawei, and on a finance tool for budget management and economic scenario forecasting at NatWest Group.

Abstract:

Recently, AI safety has attracted a lot of media and governance attention. In this talk, I will introduce the problem as I understand it and then dive deep into the idea of concept removal from large language models. I aim to provide a brief overview of the field and leave the audience with some questions and food for thought.
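To make the idea of concept removal concrete, below is a minimal sketch of one simple variant of it: projecting a hidden state onto the subspace orthogonal to an estimated concept direction. The dimension, the random direction, and the synthetic activation are illustrative assumptions; this is not the specific method presented in the talk.

```python
# Minimal sketch of one simple notion of "concept removal" (illustrative only,
# not the talk's method): zero out the component of an activation that lies
# along an estimated concept direction.
import numpy as np

d = 768                                     # hidden-state dimension (assumed)
rng = np.random.default_rng(0)
concept_dir = rng.standard_normal(d)
concept_dir /= np.linalg.norm(concept_dir)  # unit-norm concept direction

# Projection matrix onto the subspace orthogonal to concept_dir.
P = np.eye(d) - np.outer(concept_dir, concept_dir)

hidden = rng.standard_normal(d)             # stand-in for a model activation
edited = P @ hidden                         # activation with the concept removed

print(hidden @ concept_dir, edited @ concept_dir)  # second value is ~0
```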

Session 2: Exploring Safety Risks in Large Language Models and Generative AI

Speaker: Pin-Yu Chen

Speaker Bio:

Dr. Pin-Yu Chen is a principal research scientist at the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. He is also the chief scientist of the RPI-IBM AI Research Collaboration and PI of ongoing MIT-IBM Watson AI Lab projects. Dr. Chen received his Ph.D. in electrical engineering and computer science from the University of Michigan, Ann Arbor, USA, in 2016. His recent research focuses on adversarial machine learning of neural networks for robustness and safety, and his long-term research vision is to build trustworthy machine learning systems. He received the IJCAI Computers and Thought Award in 2023 and is a co-author of the book "Adversarial Robustness for Machine Learning". At IBM Research, he has received several research accomplishment awards, including IBM Master Inventor, the IBM Corporate Technical Award, and the IBM Pat Goldberg Memorial Best Paper Award. His research contributes to IBM open-source libraries including the Adversarial Robustness Toolbox (ART 360) and AI Explainability 360 (AIX 360). He has published more than 50 papers related to trustworthy machine learning at major AI and machine learning conferences, given tutorials at NeurIPS'22, AAAI ('22, '23, '24), IJCAI'21, CVPR ('20, '21, '23), ECCV'20, ICASSP ('20, '22, '23, '24), KDD'19, and Big Data'18, and organized several workshops on adversarial machine learning. He is currently on the editorial board of Transactions on Machine Learning Research and serves as an Area Chair or Senior Program Committee member for NeurIPS, ICML, AAAI, IJCAI, and PAKDD. He received the IEEE GLOBECOM 2010 GOLD Best Paper Award and the UAI 2022 Best Paper Runner-Up Award.

Abstract:

Large language models (LLMs) and Generative AI (GenAI) are at the forefront of current AI research and technology. With their rapidly increasing popularity and availability, challenges and concerns about their misuse and safety risks are becoming more prominent than ever. In this talk, I will provide new tools and insights to explore the safety and robustness risks associated with state-of-the-art LLMs and GenAI models. In particular, I will cover (i) safety risks in fine-tuning LLMs, (ii) LLM jailbreak mitigation, (iii) prompt engineering for safety debugging, and (iv) robust detection of AI-generated text from LLMs.
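As a concrete illustration of item (iv), one common baseline (not necessarily the approach presented in this talk) flags text whose perplexity under a reference language model is unusually low, i.e. text the model finds very predictable. A minimal sketch, assuming the Hugging Face transformers library and an illustrative threshold:

```python
# Sketch of a perplexity-based detector for AI-generated text. The reference
# model ("gpt2") and the threshold are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

def looks_ai_generated(text: str, threshold: float = 20.0) -> bool:
    # Low perplexity (highly predictable text) is a weak signal of LM output.
    return perplexity(text) < threshold
```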

Session 3: Deep Learning theory from an Associative Memory point of view

Speaker: Vivien Cabannes

Speaker Bio:

Vivien Cabannes is a postdoctoral researcher at Meta (FAIR) in New York City, supervised by Léon Bottou. His interests revolve around self-supervised learning and deep learning theory. He completed his PhD in France under the supervision of Francis Bach.

Abstract:

Learning arguably involves the discovery and memorization of abstract rules. The aim of this talk is to present associative memory mechanisms in the presence of discrete data. We focus on a model which accumulates outer products of token embeddings into high-dimensional matrices. On the statistical side, we derive precise scaling laws that recover and generalize the classical scalings of Hopfield networks. On the optimization side, we reduce the dynamics behind cross-entropy minimization to a system of interacting particles. In the overparameterized regime, we show how the logarithmic growth of the "margin" enables maximum storage of token associations independently of their frequencies in the training data, although the dynamics may encounter benign loss spikes due to memory competition and a poor data curriculum. In the underparameterized regime, we illustrate the risk of catastrophic forgetting due to limited capacity. These findings showcase how our simple convex model replicates many characteristic behaviors of modern neural networks. This is not too surprising, since many people view transformers as large memory machines. To strengthen this intuition, we explain how the serialization of three memory modules can build one induction head, a sort of "gate" used to form the reasoning circuits that have become popular in the mechanistic interpretability literature.
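For readers unfamiliar with this class of models, the sketch below illustrates the basic mechanism the abstract describes: storing token associations by accumulating outer products of embeddings, and retrieving them with a matrix-vector product. The toy vocabulary size, dimensions, and random embeddings are illustrative assumptions, not the talk's actual setup.

```python
# Toy linear associative memory: W accumulates outer products u_y e_x^T
# for token associations x -> y, and recall is a matrix-vector product.
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 256                                    # vocabulary size, embedding dim
E = rng.standard_normal((vocab, d)) / np.sqrt(d)      # input token embeddings
U = rng.standard_normal((vocab, d)) / np.sqrt(d)      # output (unembedding) vectors

pairs = [(x, (x + 1) % vocab) for x in range(vocab)]  # toy associations x -> x+1

# Storage: sum of outer products over all stored associations.
W = np.zeros((d, d))
for x, y in pairs:
    W += np.outer(U[y], E[x])

def recall(x: int) -> int:
    """Retrieve the output token associated with input token x."""
    scores = U @ (W @ E[x])
    return int(np.argmax(scores))

accuracy = np.mean([recall(x) == y for x, y in pairs])
print(f"recall accuracy: {accuracy:.2f}")
```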

Session 4: Representation Engineering: A Top-Down Approach to AI Transparency

Speaker: Andy Zou

Speaker Bio:

Andy Zou is a PhD student in the Computer Science Department at CMU, advised by Zico Kolter and Matt Fredrikson. He is also a cofounder of the Center for AI Safety.

Abstract:

In this paper, we introduce and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including truthfulness, memorization, power-seeking, and more, demonstrating the promise of representation-centered transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.
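A minimal sketch of the representation-reading idea at the heart of RepE (an illustration, not the paper's implementation): estimate a concept direction as the difference of mean hidden states over two contrastive prompt sets, then monitor new activations by projecting onto it. The activations here are synthetic stand-ins; in practice they would be collected from a model's forward passes.

```python
# Representation reading with a difference-of-means direction (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
d = 768                                    # hidden-state dimension (assumed)
true_dir = rng.standard_normal(d)
true_dir /= np.linalg.norm(true_dir)

# Stand-ins for hidden states from prompts with / without the target concept.
pos_acts = rng.standard_normal((100, d)) + 2.0 * true_dir
neg_acts = rng.standard_normal((100, d)) - 2.0 * true_dir

# Reading vector: difference of class means (one simple choice among several).
direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def concept_score(hidden_state: np.ndarray) -> float:
    """Projection onto the concept direction; larger = concept more present."""
    return float(hidden_state @ direction)

# Monitoring: the scores separate the two sets. Control would add or subtract
# a scaled copy of `direction` to hidden states during the forward pass.
print(concept_score(pos_acts[0]), concept_score(neg_acts[0]))
```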

Session 5: Universal Jailbreak Backdoors from Poisoned Human Feedback

Speaker: Javier Rando

Speaker Bio:

Javier Rando is a PhD student at ETH Zurich under the supervision of Prof. Florian Tramèr and Prof. Mrinmaya Sachan. His research focuses on red-teaming frontier large language models to better understand their limitations for real-world applications. Before starting his doctorate, he was a visiting researcher at NYU working on language model truthfulness, and he obtained an MSc in Computer Science from ETH Zurich. He will soon join Meta as a summer intern on the Safety & Trust team.

Abstract:

Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this work, we consider a new threat where an attacker poisons the RLHF training data to embed a "jailbreak backdoor" into the model. The backdoor embeds a trigger word into the model that acts like a universal "sudo command": adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.
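To make the threat model concrete, here is a minimal sketch (my own illustration, not the paper's released benchmark code) of how one might measure such a backdoor: compare a model's refusal rate on held-out harmful prompts with and without a suspected trigger appended. The `generate` and `is_refusal` callables are hypothetical stand-ins for the model under test and a refusal classifier.

```python
# Measure the drop in refusal rate caused by appending a suspected trigger.
from typing import Callable, List

def backdoor_effect(
    generate: Callable[[str], str],     # model under test: prompt -> response
    is_refusal: Callable[[str], bool],  # refusal classifier or heuristic
    prompts: List[str],                 # held-out harmful prompts
    trigger: str,                       # suspected backdoor trigger token
) -> float:
    """Refusal-rate drop when the trigger is appended (larger = stronger backdoor)."""
    clean = sum(is_refusal(generate(p)) for p in prompts) / len(prompts)
    triggered = sum(is_refusal(generate(f"{p} {trigger}")) for p in prompts) / len(prompts)
    return clean - triggered
```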

Contact:

Minseon Kim [ minseonkim@kaist(dot)ac(dot)kr ]

I am a PhD student at KAIST who loves to talk with others about safety and robustness problems in AI. This is a student-run colloquium.
Anyone who would like to help organize these events is welcome. I am also planning to hold a workshop on AI safety to give researchers an opportunity to meet and exchange diverse ideas. :)
Feel free to contact me!