DARPA CYBER GENOME PROGRAM
- 48 pages
- March 21, 2010
II.D.1 Technical Rationale
While it is a challenging undertaking, we plan to research and develop a fully automated malware analysis framework that will produce results comparable with the best reverse engineering experts, and complete the analysis in a fast, scalable system without human interaction. In the completed mature system, the only human involvement will be the consumption of reports and visualizations of malware profiles.
Our approach is a major shift from common binary and malware analysis today, which requires manual labor by highly skilled and well-paid engineers. Results are slow, unpredictable and expensive, and the process does not scale. Engineers are required to be proficient with low-level assembly code and operating system internals. Results depend upon their ability to interpret and model complex program logic and ever-changing computer states. The most common tools are disassemblers for static analysis and interactive debuggers for dynamic analysis. The best engineers have an ad-hoc collection of non-standard homegrown or Internet-collected plug-ins. Complex malware protection mechanisms, such as packing, obfuscation, encryption and anti-debugging techniques, present further challenges that slow down and thwart traditional reverse engineering techniques.
We start with the realization that malware is just software in binary form without source code. Like any software, malware must execute to do what it does. To execute, it must reside in physical memory (RAM) and be operated on by the CPU. The CPU imposes two constraints: 1) the operating instructions of the binary must be in clear text, and 2) the CPU does only one thing at a time. A binary that is packed or encrypted must therefore unpack or decrypt itself before it runs; otherwise the CPU will not operate on it.
We will solve the problems with traditional reverse engineering by running the binary in a controlled, instrumented and automated run trace system that will harvest everything the CPU does, one operation at a time in sequential fashion. All instructions and data will be collected and stored in exactly the same sequence as they occur. Replaying the execution will reproduce the binary’s behaviors, along with contextual information about interactions with other digital objects. Physical memory can be imaged and automatically reconstructed, revealing all digital objects in memory at that point in time. The binary, typically unpacked and decrypted by that point, can be extracted from the memory image and analyzed statically, along with the contextual information contained within the memory image. From the automated run tracing and memory reconstruction we will have harvested and collected vast amounts of low-level data about the binary under test.
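The idea of storing every operation in execution order and replaying it later can be illustrated with a minimal sketch. This is not the proposed run trace system: the names TraceEvent and RunTrace, the addresses and the instruction strings are all hypothetical, and a real tracer would capture events from an instrumented CPU or emulator rather than from manual calls.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class TraceEvent:
    """One recorded CPU operation (hypothetical structure)."""
    seq: int          # position in the execution order
    address: int      # illustrative instruction address
    instruction: str  # illustrative disassembly text
    data: bytes       # bytes touched by the instruction, if any


class RunTrace:
    """Append-only log that preserves the order in which events occur."""

    def __init__(self) -> None:
        self._events: List[TraceEvent] = []

    def record(self, address: int, instruction: str, data: bytes = b"") -> TraceEvent:
        # The sequence number is the index in the log, so order is never lost.
        event = TraceEvent(len(self._events), address, instruction, data)
        self._events.append(event)
        return event

    def replay(self) -> Iterator[TraceEvent]:
        # Replaying yields events in exactly the order they were recorded.
        return iter(self._events)


# Illustrative usage with made-up instructions.
trace = RunTrace()
trace.record(0x401000, "xor eax, eax")
trace.record(0x401002, "mov [ebx], al", b"\x00")
trace.record(0x401004, "call 0x401100")

replayed = [event.instruction for event in trace.replay()]
sequence_numbers = [event.seq for event in trace.replay()]
```

An append-only log with an explicit sequence number is chosen here because it makes the ordering guarantee visible in the data itself; any later analysis step can rely on seq without knowing how the events were captured.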
We make the assumption that there is a finite set of possible functions and behaviors that software and malware can have, although it can be a large set as software evolves over time. For example, there are only so many ways to communicate over the network, to survive reboot or to write to a file.
We will create a set of traits and genomes that predefine observable functions and behaviors of software and malware. Using a set of rules to operate on the vast low-level data collected from the binary run trace and memory reconstruction, the system will automatically determine which traits and genomes exist in each binary sample. Over time, this approach will also be able to determine evolutionary changes in the traits and genomes. Although traits and genomes raise the automated analysis from granular technical data to a higher level, this level of information is still insufficient to completely describe the functions, behaviors and intent of the binary sample. The observed traits and genomes will therefore be fed into the Belief Reasoning engine, which uses prior knowledge to make probabilistic decisions about the binary. The user will be presented with visual representations of malware physiology profiles.
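The two steps described above, rules that map low-level data to traits and a reasoning step that combines observed traits with prior knowledge, can be sketched as follows. This is an illustration under stated assumptions, not the proposed design: the event names, the trait names and every probability are invented, and a simple naive Bayes update stands in for the Belief Reasoning engine, whose actual method is not specified here.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

# Hypothetical rules: each trait is a predicate over the set of observed events.
TRAIT_RULES: Dict[str, Callable[[Set[str]], bool]] = {
    "survives_reboot": lambda events: "registry_write:run_key" in events,
    "network_comms": lambda events: "socket_connect" in events,
    "writes_file": lambda events: "file_write" in events,
}

# Invented prior knowledge: trait -> (P(trait | malicious), P(trait | benign)).
LIKELIHOODS: Dict[str, Tuple[float, float]] = {
    "survives_reboot": (0.8, 0.2),
    "network_comms": (0.9, 0.6),
    "writes_file": (0.7, 0.7),
}


def detect_traits(events: Iterable[str]) -> Set[str]:
    """Return the names of all traits whose rule matches the observed events."""
    observed = set(events)
    return {name for name, rule in TRAIT_RULES.items() if rule(observed)}


def posterior_malicious(traits: Set[str], prior: float) -> float:
    """Naive Bayes update, assuming traits are independent given the class."""
    p_malicious = prior
    p_benign = 1.0 - prior
    for trait, (given_malicious, given_benign) in LIKELIHOODS.items():
        if trait in traits:
            p_malicious *= given_malicious
            p_benign *= given_benign
        else:
            # An absent trait is also evidence, so use the complements.
            p_malicious *= 1.0 - given_malicious
            p_benign *= 1.0 - given_benign
    return p_malicious / (p_malicious + p_benign)


# Illustrative low-level events, as a run trace might report them.
sample_events = ["registry_write:run_key", "socket_connect"]
sample_traits = detect_traits(sample_events)
belief = posterior_malicious(sample_traits, prior=0.5)
```

Keeping the rules and the likelihood table as plain data means new traits can be added without touching the matching or reasoning code, which mirrors the idea of trait and genome libraries that grow over time.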
II.D.2 Technical Approach and Constructive Plan
Fig. 1 illustrates our malware analysis framework, which will allow users to quickly comprehend malware functions, behaviors and intent in a fully automated system. The system will automatically recognize traits and genomes to classify and categorize binaries and malware. During the initial phase, traits and genomes will be developed manually. In later phases, the mature system will create traits and genomes automatically, based on prior knowledge of malware, and will rely on manual development only as an exception. The low-level data will be generated using an iterative approach that combines static memory analysis with runtime tracing. The three data sets (the Malware Specimen Repository, the Traits Library and the Genomes Library) will be continually updated with data throughout the analysis process, including the resulting malware physiology profile. The physiology profile will contain mathematical and visual representations of the malware, as well as a human-readable summary of the malware’s overall and more detailed behaviors, functions and purpose.
…