Zhifang Zhang

zzfofficial@gmail.com Google Scholar GitHub



Brief Biography

I am currently a first-year PhD student at the University of Queensland (UQ), Australia. Prior to this, I earned my Bachelor's degree from Chongqing University (CQU), China. Throughout my research journey, I have been very fortunate to be advised by Dr. Miao Xu at UQ and to collaborate closely with Dr. Lei Feng at SEU and Dr. Shuo He at NTU.

Research Interests

My research interests lie in AI safety, with a particular focus on understanding and mitigating safety vulnerabilities in multimodal LLMs and LLM-based agents. I am especially interested in exploring three questions:
Recently, I have been fully focused on a paper on red-teaming autonomous agents (e.g., OpenClaw) with backdoor attacks.

🔥 I am open to collaborations and actively seeking industry or research internship opportunities in the area of AI safety. Feel free to reach out via email or WeChat: zzfwechat2000.

Selected Publications and Preprints

(* indicates equal contribution, indicates corresponding author)

TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models

Zhifang Zhang, Qiqi Tao, Jiaqi Lv, Na Zhao, Lei Feng, Joey Tianyi Zhou

ICML 2026   Paper

We propose TokenSwap, a stealthy backdoor attack that targets the compositional understanding of large vision-language models, making them behave like bag-of-words models.

Test-Time Attention Purification for Backdoored Large Vision Language Models

Zhifang Zhang, Bojun Yang, Shuo He, Weitong Chen, Wei Emma Zhang, Olaf Maennel, Lei Feng, Miao Xu

CVPR 2026   Paper   Code

We propose CleanSight, a training-free, plug-and-play defense against backdoor attacks on large vision-language models. It builds on our finding of attention stealing and uses visual token pruning.

RVPT

Defending Multimodal Backdoored Models by Repulsive Visual Prompt Tuning

Zhifang Zhang, Shuo He, Haobo Wang, Bingquan Shen, Lei Feng

NeurIPS 2025   Paper   Code

We defend backdoored CLIP-type multimodal models using Repulsive Visual Prompt Tuning (RVPT), which tunes only visual prompts on a few clean samples and uses a feature-repelling loss to make the model ignore trigger features. The defense is effective, efficient, and generalizable.

Backdoor CLIP

A Closer Look at Backdoor Attacks on CLIP

Shuo He, Zhifang Zhang, Feng Liu, Roy Ka-Wei Lee, Bo An, Lei Feng

ICML 2025   Paper   Code

We study how backdoor attacks change CLIP by decomposing image features into patches, attention heads, and MLPs. We find that different attacks infect different parts of the model, and we use these findings to detect and repair infected parts (or filter suspicious samples) at inference time.

Towards Reverse Engineering of Language Models: A Survey

Xinpeng Ti*, Wentao Ye*, Zhifang Zhang*, Junbo Zhao, Chang Yao, Lei Feng, Haobo Wang

EMNLP 2025 Findings   Paper

We provide a comprehensive survey of reverse engineering techniques for language models, covering attacks that aim to access model internals (e.g., architecture), training data, or user prompts, and reviewing existing protective strategies against these attacks.

Tuning Vision-Language Models with Candidate Labels by Prompt Alignment

Zhifang Zhang, Yuwei Niu, Xin Liu, Beibei Li

DASFAA 2025   Paper

We present the first study on fine-tuning vision-language models when only candidate (ambiguous) labels are available, and propose a framework that disambiguates candidate labels by aligning the predictions of learnable and handcrafted prompts, improving robustness to label ambiguity.

Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models

Zhifang Zhang, Jiahan Zhang, Shengjie Zhou, Qi Wei, Shuo He, Feng Liu, Lei Feng

Preprint   Paper

We propose Proxy Targeted Attack (PTA), enabling adversarial examples to generalize to semantically similar targets while remaining on-manifold to evade anomaly detection, revealing a new vulnerability in large multimodal models.

Defending Against Partial-Label Backdoor Attacks via Feature Regularization and Label Confusion

Hao Wei*, Haoran Xu*, Zhifang Zhang*, Yuena Lin, Gengyu Lyu, Shaofu Yang, Lei Feng

Preprint   Paper

We discover that partial label learning is vulnerable to backdoor attacks and propose a defense via feature-level regularization using PLL-supervised contrastive learning and label-level confusion by injecting auxiliary labels into suspicious candidates to disrupt backdoor mappings.

Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

Jianan Chen, Zhifang Zhang, Shuo He, Linan Yue, Lei Feng, Min-Ling Zhang

Preprint   Paper

We find that CoT degrades reasoning model safety and propose PreSafe, which extracts safety signals from a CoT-disabled model and backpropagates them into the LRM's latent space via an auxiliary head, enabling refusal before CoT.