
Detecting malicious software with AI

Jure Brence, January 2022

The digital transformation has become a crucial focus for many sectors in recent years. More and more commerce, business and general human activity is moving to the internet, the development of self-driving cars is accelerating and the internet-of-things is becoming a reality. In light of these advancements, the importance of cybersecurity is larger than ever.

Due to the massive scope and variety of modern cyber crime and cyber warfare, AI and machine learning are becoming essential to information security. In today’s article, I take a look at some of the ways cybersecurity providers apply AI to help detect malware.

Short for malicious software, malware is any program or file that is intentionally harmful to a computer, network or server. Common examples include viruses, worms, Trojans, spyware, adware and ransomware.

Antivirus programs typically protect against malware by scanning new files and applications and deleting or quarantining them if they suspect malign intentions. Good protection software must be fast and resource-efficient, so that the user’s computer does not slow down too much. It must also feature extremely low false-positive rates, since removing the wrong file can have serious consequences. For example, falsely flagging an essential driver might disable the entire machine. Furthermore, the antivirus software must be able to quickly and continuously adapt to new threats, as the developers of malware actively work to evade defensive measures. These requirements for good anti-malware programs extend to any AI employed and present challenges not typical for machine learning applications.

Pre-execution: hash-mapping and ML classification

The approaches used to detect malware are different depending on the phase of interacting with a new program.

In the pre-execution phase, the anti-malware gathers all the information about an object it can, without executing the code. This can include format and code descriptions, binary data statistics, text strings, a list of used API functions, hashes of code fragments or even information extracted via code emulation.
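To make this concrete, here is a minimal sketch of extracting a few simple static features from raw file bytes without executing anything. The specific features (byte entropy, printable strings, a cryptographic hash) are common in static analysis, but the function and thresholds here are illustrative, not any vendor's actual feature set:

```python
import hashlib
import math
import re
from collections import Counter

def static_features(data: bytes) -> dict:
    """Extract simple pre-execution features from raw file bytes."""
    counts = Counter(data)
    total = len(data) or 1
    # Shannon entropy of the byte distribution: packed or encrypted
    # payloads tend to approach 8 bits per byte.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Printable ASCII strings of length >= 5, a classic static feature.
    strings = re.findall(rb"[\x20-\x7e]{5,}", data)
    return {
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "entropy": round(entropy, 3),
        "num_strings": len(strings),
    }

# Toy sample: an executable-like header followed by printable text.
feats = static_features(b"MZ\x90\x00" + b"kernel32.dll GetProcAddress " * 4)
```

Real products compute far richer features (API import tables, code-fragment hashes, emulation traces), but the principle is the same: everything is derived from the bytes on disk, before the program ever runs.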

Traditionally, pre-execution information was used by experts to manually create detection rules and craft malware fingerprints – representative sequences of bytes or other features. Antivirus software would then check the fingerprints of new objects against a database of known malware fingerprints. However, this approach is very sensitive to small changes in files and has been outmaneuvered by malware writers developing so-called polymorphic code – programs that change their appearance through obfuscation and encryption, ensuring that no sample looks the same.
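The fragility of fingerprint matching is easy to demonstrate. In this sketch (with made-up payloads standing in for real signatures), changing a single byte of the sample produces a completely different hash, so the lookup against the signature database fails:

```python
import hashlib

# Toy signature database of known-malware fingerprints (hypothetical).
KNOWN_MALWARE = {hashlib.sha256(b"evil payload v1").hexdigest()}

def is_known_malware(sample: bytes) -> bool:
    """Exact fingerprint lookup, as in classic signature-based AV."""
    return hashlib.sha256(sample).hexdigest() in KNOWN_MALWARE

original = b"evil payload v1"
mutated = b"evil payload v2"  # one byte changed, as polymorphic code would do

assert is_known_malware(original)
assert not is_known_malware(mutated)  # the fingerprint no longer matches
```

This is exactly the avalanche property that makes cryptographic hashes useful elsewhere, and it is why polymorphic malware defeats exact-match signatures so cheaply.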

AI algorithms can combat these issues well - they excel at recognizing malicious programs among normal software and are robust to polymorphic tactics. However, these models often rely on features that are slow and expensive to compute. Employing them for every file would slow down the machine too much. To solve this, the Kaspersky anti-malware module takes an interesting, two-stage approach.

First, the hash of a program’s code is computed and classified into one of many “regions”. The hashing algorithm is trained to place similar programs in the same region and is very lightweight in terms of computational load. Many objects fall into regions that contain only malign or only benign examples, so for them this first stage is enough to determine a verdict. The tricky examples fall into mixed, “hard” regions. Those get examined by an AI algorithm, in this case an ensemble of decision trees, trained to detect malign code. Each hard region has its own machine learning model that is trained only on examples from that region. Developing models that generalize across the huge variety of malware out there would be very difficult, so specialized models represent an advantage. Furthermore, the specialization makes it easier to update the software when new types of malware are found, since only the models of specific regions need to be updated. Overall, the separation of the detection process into two steps improves the computational efficiency of the antimalware module and reduces its risk of false positives.

Post-execution: log embeddings

Despite the advanced techniques employed in the pre-execution phase, not all malware programs are caught before they are run. In those cases, it is important to quickly detect that an attack has happened, so that the antivirus can alert the user, counteract the malware and gather information that will be used to improve the protection in the future. The main source of information in this post-execution phase is behavior logs – sequences of system events and their corresponding arguments that occurred when the suspicious program was run.

Since the logs are verbose and extensive, a deep neural network is employed to compress them into an embedding, as well as to analyze the embedding and classify it as either malign or benign. The neural network is trained to operate with high-level interpretable behavior concepts. The model is also designed so that all of its weights are positive and all activation functions monotonic. As a result, malware cannot trick the antivirus by performing additional clean activity alongside its malign actions, since the model’s suspicion can never decrease when processing new lines from the log. Furthermore, the monotonic design increases interpretability by making it easier to identify which events from the log caused detection. Such a system allows for powerful post-execution detection of even the most complicated cybersecurity threats.
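The monotonicity argument can be illustrated with a deliberately simple stand-in for the neural network: a linear score with non-negative, hypothetical per-event weights. Because no event can carry negative weight and the score only accumulates, appending clean activity to the log can never lower the suspicion:

```python
# Non-negative per-event suspicion weights (hypothetical values; the
# real model is a neural network over behavior concepts).
EVENT_WEIGHTS = {
    "CreateRemoteThread": 2.0,
    "WriteFile:autorun.inf": 1.5,
    "RegSetValue:Run": 1.0,
    "OpenFile:readme.txt": 0.0,  # clean activity contributes nothing
}

def suspicion(log: list[str]) -> float:
    """Monotone score: with only non-negative weights, processing more
    log lines can never decrease the score -- padding the log with
    clean activity cannot dilute the verdict."""
    return sum(EVENT_WEIGHTS.get(event, 0.0) for event in log)

log = ["OpenFile:readme.txt", "CreateRemoteThread"]
s1 = suspicion(log)
s2 = suspicion(log + ["OpenFile:readme.txt"] * 100)  # pad with clean events
assert s2 >= s1  # suspicion never decreases
```

A real implementation achieves the same guarantee at the network level by constraining weights to be positive and using monotonic activations, but the defensive property is the same one shown here.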

Further reading

Cybersecurity for AI

In this article I gave some insight into how AI helps improve cybersecurity. But does the use of AI itself present new vulnerabilities to systems and enterprises? The short answer is yes, it does. AI algorithms can be tricked, bypassed or even turned against their employers through adversarial attacks; their training sets can be poisoned; personal data might be leaked; and there is always the possibility of technical errors when automated systems are involved. However, these issues can be mitigated by following proper security controls. To learn more, I recommend reading the recently published Securing Machine Learning Algorithms by the European Union Agency for Cybersecurity.