Wednesday, April 10, 2019 - 11:00 am
Meeting room 2267, Bert Storey Innovation Center
THESIS DEFENSE
Department of Computer Science and Engineering
University of South Carolina

Author: Kimberly Redmond
Advisor: Dr. Lisa Luo
Date: April 10th, 2019
Time: 11:00 am
Place: Meeting room 2267, Bert Storey Innovation Center

Abstract

Binary code analysis is important for understanding programs without access to the original source code, which is common with proprietary software. Analyzing binaries can be challenging given their high variability: as the number of hardware manufacturers has grown, source code is now frequently compiled for multiple instruction set architectures (ISAs); however, there is no formal dictionary that translates between their assembly languages. The difficulty of analysis is further compounded by different compiler optimizations and obfuscated malware signatures. Such minutiae mean that some vulnerabilities may only be detectable at a fine-grained level. Recent strides in machine learning, particularly in Natural Language Processing (NLP), may provide a solution: deep learning models can process large texts and encode the semantics of individual words into vectors called word embeddings, which are convenient for processing and analyzing text. By treating assembly as a language and instructions as words, we leverage NLP ideas to generate individual instruction embeddings. Specifically, we choose to improve upon current models that support only a single architecture, or that suffer from performance issues when handling multiple architectures. This research presents a cross-architecture instruction embedding model that jointly encodes instruction semantics from multiple ISAs, where similar instructions within and across architectures embed closely together. Results show that our model is accurate in extracting semantics from binaries alone, and our embeddings capture semantic equivalences across multiple architectures.
When combined, these instruction embeddings can represent the meaning of functions or basic blocks; thus, this model may prove useful for cross-architecture bug, malware, and plagiarism detection.
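
As a rough illustration of how instruction embeddings might be combined, the sketch below averages per-instruction vectors into a basic-block vector and compares blocks by cosine similarity. This is not the thesis's method: the averaging scheme, the function names, and the three-dimensional toy vectors are all assumptions made for illustration, and real vectors would come from a trained cross-architecture model.

```python
import math

# Hypothetical toy embeddings, hand-written for illustration only.
# They are NOT results from the thesis; a trained model would supply real vectors.
TOY_EMBEDDINGS = {
    "x86:mov eax, ebx":   [0.9, 0.1, 0.0],
    "arm:mov r0, r1":     [0.8, 0.2, 0.1],
    "x86:add eax, 1":     [0.1, 0.9, 0.0],
    "arm:add r0, r0, #1": [0.2, 0.8, 0.1],
    "x86:ret":            [0.0, 0.1, 0.9],
}


def block_embedding(instructions, table):
    """Average the instruction vectors of a basic block (one simple way to combine them)."""
    vectors = [table[ins] for ins in instructions]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Two blocks doing the same work on different architectures, plus an unrelated block.
x86_block = block_embedding(["x86:mov eax, ebx", "x86:add eax, 1"], TOY_EMBEDDINGS)
arm_block = block_embedding(["arm:mov r0, r1", "arm:add r0, r0, #1"], TOY_EMBEDDINGS)
ret_block = block_embedding(["x86:ret"], TOY_EMBEDDINGS)

same_work = cosine_similarity(x86_block, arm_block)
different_work = cosine_similarity(x86_block, ret_block)
```

With these toy vectors, the two blocks that perform the same work on x86 and ARM score a higher similarity than the unrelated block, which is the kind of comparison a cross-architecture bug or plagiarism detector could build on.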