Safety / Robustness
APGP
Automatic Jailbreaking of the Text-to-Image Generative AI Systems
Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang
ICML Next Generation of AI Safety Workshop 2024, PDF, Project Page, Code


Croze
Generalizable Lightweight Proxy for Robust NAS against Diverse Perturbations
Minseon Kim*, Hyeonjeong Ha*, Sung Ju Hwang
NeurIPS 2023, PDF


Taro
Effective Targeted Attacks for Adversarial Self-Supervised Learning
Minseon Kim, Hyeonjeong Ha, Sooel Son, Sung Ju Hwang
NeurIPS 2023, PDF


Marvl
Few-shot Transferable Robust Representation Learning via Bilevel Attacks
Minseon Kim*, Hyeonjeong Ha*, Dong Bok Lee, Sung Ju Hwang
NeurIPS SafetyML Workshop 2022, PDF


ADLM
Language Detoxification with Attribute-Discriminative Latent Space
Minseon Kim*, Jin Myung Kwak*, Sung Ju Hwang
ACL 2023, PDF


EWAT
Rethinking the Entropy of Instance in Adversarial Training
Minseon Kim, Jihoon Tack, Jinwoo Shin, Sung Ju Hwang
SaTML 2023, PDF, Code


RoCL
Adversarial Self-Supervised Contrastive Learning
Minseon Kim, Jihoon Tack, Sung Ju Hwang
NeurIPS 2020, PDF, Code