Software Ecosystem Genetic Analysis

In cybersecurity, the key to understanding the massive volume of binary executables transmitted over networks, and the processes running on every IoT device, is to understand the machine instructions that are actually executed. These machine instructions ultimately derive from human-written code. Like a biological evolutionary process, software is not created from scratch but evolves over time. Studies have shown that 50% of all files across open source projects have been reused at least twice, and that more than 50% of developers modify components before reusing them. The software ecosystem is highly complex and consists of interconnected information, including source code, binary executables, documentation, comments, descriptions, vulnerabilities, and specifications. Together, these interconnected entities form a heterogeneous information network. The ultimate task is to learn a robust latent vector representation for every defined entity in this network. Each learned vector captures both the semantic and the syntactic relationships between an entity and its neighbors, across different levels of granularity. The learned representations enable us to match similar entities and to observe the hierarchical relationships among entities at different granularities. These capabilities directly contribute to solving the source code and binary code clone search problems.
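
As a rough illustration of how learned vectors could support clone search, the sketch below ranks entities by their similarity to a query vector. It is a minimal example under stated assumptions, not the method used in this project: the entity names, the three-dimensional vectors, and the helper functions `cosine_similarity` and `search_clones` are all hypothetical, and cosine similarity is assumed as the matching measure because the text does not name one. In practice the vectors would come from a trained representation model rather than being written by hand.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction, so treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def search_clones(query_vec, index, top_k=3):
    """Rank indexed entities by similarity to the query, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical hand-made embeddings for entities of different kinds
# (a source function, a binary function, a documentation file).
index = {
    "src:parse_header": [0.9, 0.1, 0.0],
    "bin:sub_401a20": [0.8, 0.2, 0.1],
    "doc:crypto_readme": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of an unknown binary function to look up.
query = [0.9, 0.1, 0.05]
results = search_clones(query, index, top_k=2)

for name, score in results:
    print(f"{name}\t{score:.3f}")
```

Because source and binary entities share one vector space in this sketch, a single query can retrieve candidates of both kinds, which is the property that makes cross-granularity matching possible. A linear scan like this one would not scale to a large ecosystem; an approximate nearest-neighbor index would be the usual replacement.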