Zhifang Zhang

zzfofficial@gmail.com Google Scholar GitHub

Brief Biography

I am currently a first-year PhD student at the University of Queensland (UQ), Australia. Prior to this, I earned my Bachelor's degree from Chongqing University (CQU), China. During my research journey, I have been very fortunate to be advised by Dr. Miao Xu at UQ and to collaborate closely with Dr. Lei Feng at SEU and Dr. Shuo He at NTU.

Research Interests

My research interests lie in AI safety, with a particular focus on understanding and mitigating safety vulnerabilities in multimodal LLMs and LLM-based agents. I am especially interested in exploring three questions:

How and why are multimodal models vulnerable to poisoning or evasion attacks?
Can we defend multimodal models more efficiently and practically?
How can we red-team multimodal models with more realistic and practical attacks?

Recently, I have been fully engaged in a paper on red-teaming autonomous agents (e.g., OpenClaw) with backdoor attacks.

🔥 I am open to collaborations and actively seeking industry or research internship opportunities in the area of AI safety. Feel free to reach out via email or WeChat: zzfwechat2000.

Selected Publications and Preprints

(^* indicates equal contribution, ^† indicates corresponding author)

TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models

Zhifang Zhang, Qiqi Tao, Jiaqi Lv, Na Zhao, Lei Feng, Joey Tianyi Zhou

ICML 2026 Paper

We propose TokenSwap, a stealthy backdoor attack that targets the compositional understanding of large vision-language models, making them behave like bag-of-words models.

Test-Time Attention Purification for Backdoored Large Vision Language Models

Zhifang Zhang, Bojun Yang, Shuo He, Weitong Chen, Wei Emma Zhang, Olaf Maennel, Lei Feng, Miao Xu

CVPR 2026 Paper Code

We propose CleanSight, a training-free, plug-and-play defense against backdoor attacks on large vision-language models, built on our finding of attention stealing and on visual token pruning.

Defending Multimodal Backdoored Models by Repulsive Visual Prompt Tuning

Zhifang Zhang, Shuo He, Haobo Wang, Bingquan Shen, Lei Feng

NeurIPS 2025 Paper Code

We defend backdoored CLIP-type multimodal models using Repulsive Visual Prompt Tuning (RVPT), which tunes only visual prompts on a few clean samples and uses a feature-repelling loss to make the model ignore trigger features. The defense is effective, efficient, and generalizable.

A Closer Look at Backdoor Attacks on CLIP

Shuo He, Zhifang Zhang, Feng Liu, Roy Ka-Wei Lee, Bo An, Lei Feng

ICML 2025 Paper Code

We study how backdoor attacks change CLIP by decomposing image features into patches, attention heads, and MLPs. We find that different attacks infect different parts of the model, and we use these findings to detect and repair infected parts (or filter suspicious samples) at inference time.

Towards Reverse Engineering of Language Models: A Survey

Xinpeng Ti^*, Wentao Ye^*, Zhifang Zhang^*, Junbo Zhao, Chang Yao, Lei Feng, Haobo Wang

EMNLP 2025 Findings Paper

We provide a comprehensive survey of reverse engineering techniques for language models, covering attacks that aim to access model internals (e.g., architecture), training data, or user prompts, and reviewing existing protective strategies against these attacks.

Tuning Vision-Language Models with Candidate Labels by Prompt Alignment

Zhifang Zhang, Yuwei Niu, Xin Liu, Beibei Li

DASFAA 2025 Paper

We present the first study on fine-tuning vision-language models when only candidate (ambiguous) labels are available, and propose a framework that disambiguates candidate labels by aligning the predictions of the learnable and handcrafted prompts, improving robustness to label ambiguity.

Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models

Zhifang Zhang, Jiahan Zhang, Shengjie Zhou, Qi Wei, Shuo He, Feng Liu, Lei Feng

Preprint Paper

We propose Proxy Targeted Attack (PTA), enabling adversarial examples to generalize to semantically similar targets while remaining on-manifold to evade anomaly detection, revealing a new vulnerability in large multimodal models.

Defending Against Partial-Label Backdoor Attacks via Feature Regularization and Label Confusion

Hao Wei^*, Haoran Xu^*, Zhifang Zhang^*, Yuena Lin, Gengyu Lyu, Shaofu Yang, Lei Feng

Preprint Paper

We discover that partial-label learning (PLL) is vulnerable to backdoor attacks and propose a two-part defense: feature-level regularization using PLL-supervised contrastive learning, and label-level confusion that injects auxiliary labels into suspicious candidates to disrupt backdoor mappings.

POISE: Position-Aware Undetectable Skill Injection on LLM Agents

Haochang Hao^*, Dehai Min^*, Zhifang Zhang^*, Yunbei Zhang, Miao Xu, Yingqiang Ge, Lu Cheng

Preprint Paper

We propose POISE, a position-aware skill injection attack that strategically places rewritten malicious instructions at optimal prompt positions to evade detection while achieving high attack success rates against LLM agents.

Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

Jianan Chen, Zhifang Zhang, Shuo He, Linan Yue, Lei Feng, Min-Ling Zhang

Preprint Paper

We find that chain-of-thought (CoT) generation degrades the safety of large reasoning models (LRMs) and propose PreSafe, which extracts safety signals from a CoT-disabled model and backpropagates them into the LRM's latent space via an auxiliary head, enabling refusal before CoT.