Proposal for R&D Support of DARPA Cyber Genome Program
- 45 pages
- March 29, 2010
- 16.8 MB
II. Summary of Proposal
Current technologies and methods for producing and examining relationships between software products, particularly malware, are lacking at best. Hashing and “fuzzy” hashing and matching techniques are applied at the program level and do not reflect the actual development process of malware. This approach is only effective at finding closely related variants or at matching artifacts found within malware that are only tangential to the development process, such as hard-coded IP addresses, domains, or login information. The matching process is often unaware of internal software structure except in the most rudimentary sense: it deals with entire sections of code at a time and attempts to align matches across arbitrary block boundaries. The method is akin to an illiterate person attempting to compare two books on the same topic. Such a person would have a chance of correlating different editions of the same book, but not much else. The first fundamental flaw in today’s approach is that it ignores our greatest advantage in understanding malware lineage: we can decompose a program into blocks (functions, objects, and loops) that reflect the development process and give software its lineage through code reuse.
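The contrast between program-level matching and block-level matching can be illustrated with a minimal sketch. It is not the method proposed here; it assumes that function bodies have already been extracted by some disassembler, uses exact SHA-256 hashes in place of a fuzzy hash to keep the example short, and all sample names and byte strings are hypothetical.

```python
import hashlib
from typing import List


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


def whole_program_match(prog_a: List[bytes], prog_b: List[bytes]) -> bool:
    """Program-level comparison: hash the whole image and compare digests."""
    return sha256_hex(b"".join(prog_a)) == sha256_hex(b"".join(prog_b))


def function_level_similarity(prog_a: List[bytes], prog_b: List[bytes]) -> float:
    """Block-level comparison: Jaccard similarity over per-function hashes."""
    hashes_a = {sha256_hex(f) for f in prog_a}
    hashes_b = {sha256_hex(f) for f in prog_b}
    union = hashes_a | hashes_b
    if not union:
        return 0.0
    return len(hashes_a & hashes_b) / len(union)


# Hypothetical samples: each program is a list of extracted function bodies.
# Two functions are reused, but in a different order and next to new code.
sample_a = [b"decode_config", b"open_socket", b"keylogger_loop"]
sample_b = [b"open_socket", b"decode_config", b"screen_capture", b"self_delete"]

print(whole_program_match(sample_a, sample_b))
print(function_level_similarity(sample_a, sample_b))
```

In this toy case the program-level comparison reports no match, while the function-level comparison still sees the two reused functions regardless of their order in the image. A realistic system would need normalization of the extracted code and an approximate rather than exact match, which this sketch leaves out.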
Software development has been driven to code reuse through economics. It is simply cheaper and more effective to reuse portions of code that have already been developed for a particular task. Entire computer programming languages have been developed to make code reuse more effective. The development of malware also reflects code reuse, not so much through intentional design of the development process, but because malware is largely developed by individuals or by small, tight-knit groups. Code reuse does occur between groups, but this is largely due to theft of code. Code reuse, in either case, is the basis for software lineage; therefore, any attempt to map and correlate cyber genomes should focus on correlating software reuse.
Reliable correlation of software based on code reuse will establish lineage; however, lineage itself is only of partial value. Code used in legitimate software is not confined to reuse in other legitimate software. Malware authors reuse code from all sources, legitimate or not. Therefore, lineage without context is not strictly confined to malware, nor does a genetic relationship between an unknown software sample and a known malicious sample by itself suggest malignancy. The second fundamental flaw in malicious software lineage and correlation approaches to date is therefore that they ignore context.
Malware classification schemes today are based on behavior and largely ignore how those behaviors are achieved unless the information is valuable in developing detection signatures. Behavior itself, while important, is nearly worthless information in any attempt to develop a workable cyber lineage system. Behavior classification is not a viable taxonomy system at all: if such a system were applied to biology, an anteater and many birds would belong to the same “family” simply because their behavior, namely eating ants, is the same.
II.A.1 De-obfuscation of Code
In an attempt to frustrate analysis, the authors of malware often process their programs using a technique called binary code packing. Code packing transforms a program into a packed program by compressing and/or encrypting the original code and data into packed data and associating it with a restoration routine. The restoration routine is a piece of code that recovers the original code and data and sets the execution context to the original state when the packed program is executed. By concealing the program code responsible for malicious behavior, packing is an obstacle to any analysis technique that depends on examining code, from signature-based anti-virus detection to sophisticated static analysis.
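The packing transformation described above can be sketched in a few lines. This is a toy model under stated assumptions, not a description of any real packer: the "program" is a plain byte string, compression uses zlib, the "encryption" is a repeating-key XOR, and the function names `pack` and `restore`, the key, and the payload string are all hypothetical.

```python
import zlib


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Repeating-key XOR; applying it twice with the same key is the identity."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def pack(original: bytes, key: bytes) -> bytes:
    """Compress, then encrypt, the original code and data into packed data."""
    return xor_bytes(zlib.compress(original), key)


def restore(packed: bytes, key: bytes) -> bytes:
    """Stand-in for the restoration routine: decrypt, then decompress."""
    return zlib.decompress(xor_bytes(packed, key))


# Hypothetical payload containing a string a signature might look for.
original = b"CONNECT c2.example.invalid:4444; " * 8
key = b"\x5a\xc3\x17"  # assumed key, chosen only for illustration

packed = pack(original, key)

print(b"c2.example.invalid" in original)  # the string is visible before packing
print(b"c2.example.invalid" in packed)    # and hidden from a byte-level scan after
print(restore(packed, key) == original)   # the restoration routine recovers it
```

A signature that searches for the hostname string finds it in the original bytes but not in the packed bytes, even though the restoration routine reproduces the original exactly at run time. That gap is what makes packing an obstacle for analysis that depends on examining code.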
There are tools such as the CERT CC Pandora, the Navy Research Lab Packer Cracker, PolyUnpack, RL!dePacker, QuickUnpack, and others that try to remove the obfuscation layer from malware, but they are far from being an automated solution that can de-obfuscate large numbers of differently obfuscated malware binaries. Many times, they cannot provide a de-obfuscated or fully functional version of the malware binary, and they exhibit very poor runtime performance.
A major challenge is posed by obfuscated malware that implements thousands of polymorphic layers, revealing only a portion of the code during any given execution stage. Once a code segment has executed, the packer re-encrypts that segment before proceeding to the next code segment.
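The reveal-then-re-encrypt behavior can be modeled with a small simulation. It is a sketch under strong simplifying assumptions: segments are byte strings rather than executable code, each segment has a single XOR layer with its own key instead of thousands of polymorphic layers, and the class `StagedProgram` and the observer callback are hypothetical names introduced only for this example.

```python
from typing import Callable, Dict, List, Optional


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Repeating-key XOR used as a stand-in for a real cipher."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


class StagedProgram:
    """Toy packed program: only the segment being executed is in the clear."""

    def __init__(self, segments: List[bytes]) -> None:
        # One assumed key per segment, so no two segments share a layer.
        self.keys = [bytes([0x41 + i, 0xA5]) for i in range(len(segments))]
        self.memory = [xor_bytes(s, k) for s, k in zip(segments, self.keys)]

    def run(self, observer: Optional[Callable[[int, List[bytes]], None]] = None) -> None:
        for i, key in enumerate(self.keys):
            self.memory[i] = xor_bytes(self.memory[i], key)  # reveal segment i
            if observer is not None:
                observer(i, list(self.memory))               # "execution" point
            self.memory[i] = xor_bytes(self.memory[i], key)  # re-encrypt it


segments = [b"stage0: check sandbox", b"stage1: fetch payload", b"stage2: install hook"]
program = StagedProgram(segments)

recovered: Dict[int, bytes] = {}
clear_counts: List[int] = []


def snapshot_observer(stage: int, memory: List[bytes]) -> None:
    """Record the segment revealed at this stage and count clear segments."""
    recovered[stage] = memory[stage]
    clear_counts.append(sum(1 for m in memory if m in segments))


program.run(snapshot_observer)
rebuilt = [recovered[i] for i in sorted(recovered)]
print(clear_counts)
print(rebuilt == segments)
```

Each memory snapshot in this model contains exactly one clear segment, so a single dump taken at any one stage never shows the whole program. Only an observer that captures every stage and stitches the pieces together recovers the full code, which is part of why single-snapshot unpacking tools struggle with this class of packer.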
Another challenge is that malware authors will adapt their packing methods to detect the unpacking tool or to circumvent the unpacking process through tracking methods or API hooks. By adapting and evaluating existing techniques from computational biology (CB), we will provide automated ways of systematically undoing the work of obfuscators to restore the binary to an equivalent but un-obfuscated form. This will be done using binary rewriting techniques. Decompilation research and techniques will be explored to recover high-level C and C++ source code from the binary in order to validate the recovery of a valid and fully functional executable. By assessing the quality of the recovered source code, we will assess the quality of the de-obfuscation steps and improve them accordingly. CB techniques will also be used to tackle the problem of comparing obfuscated malware code segments.