A Data-Free Backdoor Attack on Malware Image Classification Models
Principal Investigator:
Feng Li
Garvit Agarwal, Yousef Mohammed Y Alomayri, Meghana Nagaraj Cilagani, Agnideven Palanisamy Sundar, Feng Li
Abstract
Malware classification models serve as critical safeguards in enterprise and infrastructure security, yet they remain susceptible to backdoor attacks, in which an adversary surreptitiously implants triggers that induce targeted misclassifications. While state-of-the-art backdoor methods often assume direct or partial access to the original training dataset, such access may be infeasible in highly regulated domains where malware samples are proprietary. In this paper, we present a data-free backdoor attack that eliminates the need for original training data by constructing a carefully curated substitute dataset from publicly available malware repositories. We propose a logit-based dictionary mechanism that identifies high-confidence samples closely resembling the original data distribution. These samples are then poisoned with visually subtle triggers, such as noise patches or checkerboard patterns, and assigned misleading labels to craft a poisoned subset. We subsequently fine-tune the target malware classifier on this poisoned subset using a novel loss function designed to preserve high clean accuracy while delivering a high attack success rate under trigger conditions. Experimental results on the MalImg dataset show that our data-free backdoor attack achieves up to 99% misclassification on triggered malware samples, all without any access to the original training set. Our findings reveal a significant new threat vector for malware detection systems, demonstrating that even black-box models can be compromised by determined adversaries relying solely on surrogate data and inference logs.
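As a rough illustration of the poisoning and fine-tuning steps summarized above (not the paper's exact trigger or loss), the sketch below stamps a small checkerboard trigger onto grayscale malware images, relabels them to an attacker-chosen target class, and fine-tunes with a weighted sum of clean and poisoned cross-entropy. The patch size, trigger placement, target class, weight `lam`, and helper names are illustrative assumptions.

```python
# Illustrative sketch only: the trigger pattern, target class, and the
# lam-weighted loss below are assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F

def add_checkerboard_trigger(images, patch=8, value=1.0):
    """Stamp a small checkerboard patch in the bottom-right corner.

    images: float tensor of shape (N, 1, H, W) with values in [0, 1].
    """
    poisoned = images.clone()
    ys = torch.arange(patch).view(-1, 1)
    xs = torch.arange(patch).view(1, -1)
    checker = ((ys + xs) % 2).float() * value       # (patch, patch) 0/1 pattern
    poisoned[:, :, -patch:, -patch:] = checker       # overwrite corner pixels
    return poisoned

def backdoor_finetune_step(model, optimizer, images, pseudo_labels,
                           target_class=0, lam=0.5):
    """One fine-tuning step on substitute data: the clean term preserves
    accuracy, the poisoned term drives triggered inputs to the target class."""
    model.train()
    optimizer.zero_grad()

    poisoned = add_checkerboard_trigger(images)
    target = torch.full_like(pseudo_labels, target_class)

    clean_loss = F.cross_entropy(model(images), pseudo_labels)
    poison_loss = F.cross_entropy(model(poisoned), target)
    loss = clean_loss + lam * poison_loss            # weighted combination (assumed)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, `pseudo_labels` stand in for the labels inferred from the victim model's logits on the substitute dataset, since the attacker has no access to the original training labels.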