Detecting Python implementations of steganography using machine learning

Degree: Master
Status
Supervisor(s): Martin Beneš, MSc
Project: UNCOVER

Description

Steganography is a technique to hide secret messages inside inconspicuous cover media. There are thousands of open-source implementations of steganography available in source code repositories.

The goal of this thesis is to develop a machine-learning-based method that detects Python implementations of steganography in a large codebase. A starting point is a review of existing methods for source code analysis. The student also needs to develop a basic understanding of how steganography is implemented. One potential feature for this task is the embedding function applied to the cover media. Another is the key-dependent permutation of embedding positions.
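To illustrate how the two candidate features might surface in code, the following minimal sketch counts bitwise operations (a rough proxy for an embedding function) and seeded shuffle calls (a rough proxy for a key-dependent permutation) in the abstract syntax tree of a script. The sample source, the feature names, and the call-name lists are assumptions made for illustration; they are not part of the topic description, and a real feature set would need to be designed and validated in the thesis.

```python
import ast

# Hypothetical example of what a minimal LSB embedder might look like.
SAMPLE_SOURCE = '''
import random

def embed(pixels, bits, key):
    positions = list(range(len(pixels)))
    random.Random(key).shuffle(positions)        # key-dependent permutation
    for pos, bit in zip(positions, bits):
        pixels[pos] = (pixels[pos] & ~1) | bit   # LSB replacement
    return pixels
'''

# Assumed proxies for the two candidate features (illustrative, not exhaustive).
BITWISE_OPS = (ast.BitAnd, ast.BitOr, ast.BitXor, ast.LShift, ast.RShift)
PERMUTATION_CALLS = {"shuffle", "permutation", "sample"}
SEEDING_CALLS = {"seed", "Random", "RandomState", "default_rng"}


def call_name(node):
    """Return the bare name of a called function or method, if available."""
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return None


def extract_features(source):
    """Count simple syntactic cues in Python source code."""
    features = {
        "bitwise_ops": 0,
        "invert_ops": 0,
        "permutation_calls": 0,
        "seeding_calls": 0,
    }
    for node in ast.walk(ast.parse(source)):
        # Binary and augmented bitwise operations, e.g. `a & b` or `a |= b`.
        if isinstance(node, (ast.BinOp, ast.AugAssign)) and isinstance(node.op, BITWISE_OPS):
            features["bitwise_ops"] += 1
        # Bitwise inversion, e.g. the mask `~1`.
        elif isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Invert):
            features["invert_ops"] += 1
        elif isinstance(node, ast.Call):
            name = call_name(node)
            if name in PERMUTATION_CALLS:
                features["permutation_calls"] += 1
            if name in SEEDING_CALLS:
                features["seeding_calls"] += 1
    return features


if __name__ == "__main__":
    print(extract_features(SAMPLE_SOURCE))
```

Counts like these are deliberately crude: many image processing scripts use bitwise operations or shuffling for unrelated reasons, so such features would at best be inputs to a trained model rather than a detector on their own.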

The method involves devising, training, and evaluating machine learning models on data acquired from source code repositories by automated crawling. Python scripts that perform processing operations other than steganography on similar media will serve as the contrast class. Post-hoc analyses of feature importance should eliminate overly obvious features, such as file names or comments referring to steganography. The accuracy metrics obtained after this refinement should be reported. They indicate how well steganography implementations can be automatically distinguished from other kinds of media processing scripts.
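One simple way to keep comments from acting as overly obvious features is to remove them before feature extraction. The sketch below is a preprocessing step that complements the post-hoc analysis of feature importance and does not replace it. It assumes Python 3.9 or later (for `ast.unparse`), and the function name `strip_comments_and_docstrings` is hypothetical. Identifiers and string literals other than docstrings are left untouched, so names such as a function called `stego_embed` would still need separate handling.

```python
import ast


def strip_comments_and_docstrings(source):
    """Return the source with comments and docstrings removed.

    Comments vanish because the AST does not store them; docstrings are
    dropped explicitly from modules, classes, and functions.
    """
    tree = ast.parse(source)
    containers = (ast.Module, ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for node in ast.walk(tree):
        if not isinstance(node, containers):
            continue
        body = node.body
        has_docstring = (
            bool(body)
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        )
        if has_docstring:
            # Keep the body non-empty so the result remains valid Python.
            node.body = body[1:] or [ast.Pass()]
    return ast.unparse(tree)


if __name__ == "__main__":
    example = (
        '"""Steganography tool."""\n'
        "# hide the message in the least significant bit\n"
        "def f(x):\n"
        '    """Embed stego bits."""\n'
        "    return x & 1  # stego\n"
    )
    print(strip_comments_and_docstrings(example))
```

Note that `ast.unparse` also normalizes formatting, so any features that depend on layout or whitespace would have to be extracted from the original text instead.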

References

  • Russell, R., Kim, L., Hamilton, L., et al. Automated Vulnerability Detection in Source Code Using Deep Representation Learning. In International Conference on Machine Learning and Applications (ICMLA). IEEE, 2018, pp. 757–762.
  • Chilowicz, M., Duris, E., and Roussel, G. Syntax Tree Fingerprinting for Source Code Similarity Detection. In International Conference on Program Comprehension (ICPC). IEEE, 2009, pp. 243–247.