A recent study shows that a programmer’s code is stylistically unique to them. This almost goes without saying, as even simple “Hello World” programs can be written in countless ways.
Researchers from Drexel University, the University of Maryland, the University of Goettingen, and Princeton have developed a system that uses natural language processing and machine learning to figure out the authors of source code based solely on their coding style.
The researchers based their code “stylometry” on various features of a programmer’s code, such as layout (whitespace, tabs vs. spaces) and lexical attributes. They also used “abstract syntax trees”, which capture the structure of the code and reveal an individual coder’s style independent of its layout and naming. This means that if programmers change their variable names or spacing in an attempt to mask their style, the structure of the code stays the same and they can still be identified. In essence, the way you code defines you as uniquely as your fingerprint.
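To make the point about syntax trees concrete, here is a minimal sketch. It is not the researchers’ pipeline: the study worked on C++, while this uses Python’s standard `ast` module on Python snippets purely as an analogy, and the function name `syntactic_features` and the sample snippets are made up for illustration. It counts the node types in each tree and shows that renaming variables and changing spacing leaves those counts untouched.

```python
import ast
from collections import Counter


def syntactic_features(source: str) -> Counter:
    """Count AST node types, ignoring identifier names and layout.

    Hypothetical helper for illustration; the study's real feature set
    is much richer than plain node-type counts.
    """
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


original = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

# Same logic with renamed identifiers and different spacing.
disguised = """
def f(  xs  ):
    acc=0
    for  item  in  xs:
        acc+=item
    return acc
"""

# The same task written with a different structure.
other_style = """
def total(values):
    return sum(v for v in values)
"""

# Renaming and respacing do not change the syntactic profile...
same_profile = syntactic_features(original) == syntactic_features(disguised)
# ...but a structurally different solution produces a different one.
different_profile = syntactic_features(original) != syntactic_features(other_style)
```

In this toy case the disguised version is indistinguishable from the original at the syntax-tree level, which is the intuition behind why cosmetic changes do little to hide an author.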
The researchers put their system to the test by gathering publicly available data from Google’s Code Jam competitions from 2008 to 2014. Their approach took C++ source code from the contestants’ solutions to the problems assigned each year. They then looked blindly at solutions the same coders wrote to another problem and tried to identify each author.
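The train-then-identify setup described above can be sketched in a few lines. This is a toy under stated assumptions, not the study’s method: the article does not name the classifier, so a plain nearest-neighbour comparison is swapped in, the four layout features are invented for illustration, and the authors “alice” and “bob” and their C++ snippets are hypothetical.

```python
import math


def layout_features(source: str) -> list[float]:
    """Return a few hypothetical layout features of a C++ snippet."""
    lines = source.splitlines() or [""]
    n = len(lines)
    return [
        sum(line.startswith("\t") for line in lines) / n,   # tab-indented lines
        sum(line.startswith(" ") for line in lines) / n,    # space-indented lines
        sum(line.strip() == "{" for line in lines) / n,     # braces on their own line
        source.count(" = ") / max(1, source.count("=")),    # spaces around '='
    ]


def attribute(unknown: str, known: dict[str, str]) -> str:
    """Pick the known author whose sample is closest in feature space."""
    target = layout_features(unknown)
    return min(known, key=lambda a: math.dist(target, layout_features(known[a])))


# "Training" samples: one solution per hypothetical author.
known_samples = {
    "alice": "\n".join([
        "int main()", "{", "\tint n = 0;", "\tfor (int i = 0; i < 10; i++)",
        "\t{", "\t\tn = n + i;", "\t}", "\treturn n;", "}",
    ]),
    "bob": "\n".join([
        "int main() {", "  int n=0;", "  for (int i=0;i<10;i++) {",
        "    n=n+i;", "  }", "  return n;", "}",
    ]),
}

# An unlabeled solution to a different problem, written in alice's style.
unknown_sample = "\n".join([
    "int square(int x)", "{", "\tint y = x * x;", "\treturn y;", "}",
])

predicted_author = attribute(unknown_sample, known_samples)
```

Layout features alone are easy to disguise, which is why the study combined them with lexical and syntactic features.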
The study found that:
- In a sample of 250 programmers over multiple years, their code stylometry achieved 95% accuracy in identifying the author of anonymous code.
- Accuracy rates weren’t statistically different when the code was run through off-the-shelf C++ code obfuscators.
- Coding style is better defined when solving harder problems. The identification accuracy rate improved when the training dataset was based on more difficult programming problems.
A coder’s personal style is even apparent in compiled code:
Syntactic features are preserved in binaries up to a degree, along with some other features. This gives a promising direction for identifying authors of binaries and further extending this approach to malware classification.
This could help authorities determine who wrote malicious software in the future.