Although it is a challenging undertaking, we plan to research and develop a fully automated malware analysis framework that produces results comparable to those of the best reverse engineering experts and completes the analysis in a fast, scalable system without human interaction. In the mature system, the only human involvement will be the consumption of reports and visualizations of malware profiles. Our approach is a major shift from common binary and malware analysis today, which requires manual labor by highly skilled and well-paid engineers. That manual process is slow, unpredictable, and expensive, and it does not scale. Engineers must be proficient with low-level assembly code and operating system internals, and results depend on their ability to interpret and model complex program logic and ever-changing computer state. The most common tools are disassemblers for static analysis and interactive debuggers for dynamic analysis, and even the best engineers rely on an ad hoc collection of non-standard homegrown or Internet-collected plug-ins. Complex malware protection mechanisms, such as packing, obfuscation, encryption, and anti-debugging techniques, present further challenges that slow down and thwart traditional reverse engineering techniques.
Current technologies and methods for producing and examining relationships between software products, particularly malware, are lacking at best. Hashing and “fuzzy” hashing and matching techniques are applied at the program level and do not reflect the actual development process of malware. This approach is effective only at finding closely related variants, or at matching artifacts within malware that are merely tangential to the development process, such as hard-coded IP addresses, domains, or login information. The matching process is often unaware of internal software structure except in the most rudimentary sense: it deals with entire sections of code at a time and attempts to align matches across arbitrary block boundaries. The method is akin to an illiterate person attempting to compare two books on the same topic. Such a person would have a chance of correlating different editions of the same book, but not much else. The first fundamental flaw in today’s approach is that it ignores our greatest advantage in understanding relationships in malware lineage: we can decompose program structure into blocks (functions, objects, and loops) that reflect the development process and give software its lineage through code reuse.
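The contrast between program-level matching and structure-aware matching can be illustrated with a small sketch. It is not the proposed framework. It is a toy example under stated assumptions: each sample is assumed to be already disassembled and split into functions, each function is represented as a list of instruction mnemonics, and the sample contents and function names (sample_a, sample_b, decrypt_config, and so on) are hypothetical. Exact SHA-256 hashing of function bodies stands in for whatever normalization and similarity measure a real system would use.

```python
import hashlib

# Hypothetical samples: function name -> instruction mnemonics.
# Assumption: disassembly and function recovery have already been done.
sample_a = {
    "decrypt_config": ["push", "mov", "xor", "loop", "ret"],
    "beacon": ["push", "call", "test", "jz", "ret"],
    "keylog": ["push", "call", "mov", "call", "ret"],
}
sample_b = {
    "decrypt_config": ["push", "mov", "xor", "loop", "ret"],  # reused code
    "beacon": ["push", "call", "test", "jz", "ret"],          # reused code
    "spread_usb": ["push", "call", "cmp", "jne", "call", "ret"],
}


def whole_program_hash(functions):
    """Program-level matching: hash the sample as one opaque blob."""
    blob = "\n".join(";".join(body) for _, body in sorted(functions.items()))
    return hashlib.sha256(blob.encode()).hexdigest()


def function_hashes(functions):
    """Structure-aware matching: hash each function body separately."""
    return {
        hashlib.sha256(";".join(body).encode()).hexdigest()
        for body in functions.values()
    }


def shared_function_ratio(a, b):
    """Jaccard similarity over the two sets of per-function hashes."""
    ha, hb = function_hashes(a), function_hashes(b)
    if not ha and not hb:
        return 0.0
    return len(ha & hb) / len(ha | hb)


if __name__ == "__main__":
    same_blob = whole_program_hash(sample_a) == whole_program_hash(sample_b)
    print("whole-program hashes match:", same_blob)
    print("shared function ratio:", shared_function_ratio(sample_a, sample_b))
```

In this toy case the whole-program hashes differ, so program-level matching reports no relationship, while the per-function comparison shows that two of the four distinct functions are shared. A realistic implementation would need to normalize operands, addresses, and compiler differences before comparing functions, which this sketch deliberately omits.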