Automated analysis of Python source code repositories to distinguish the steganographic algorithm

Degree: Bachelor / Master
Supervisor(s): Martin Beneš, MSc


Steganography is a technique to hide secret messages inside inconspicuous cover media. There are thousands of open-source implementations of steganography available in source code repositories. Given the ease of forking code repositories, it remains unclear how many genuine approaches to steganography are in these codebases.

The goal of this thesis is to quantitatively explore the diversity of Python implementations of steganography in public repositories using measures of code similarity. Existing methods for measuring code similarity are described and compared. Based on this comparison, a metric for the task is suggested and implemented.
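As an illustration of what one candidate measure might look like, the sketch below computes a token-based Jaccard distance between two Python sources. It is only one of many possible approaches, not the metric the thesis is expected to arrive at. The function names (`token_stream`, `shingles`, `jaccard_distance`), the shingle length of 3, and the choice to replace identifiers and literals with placeholders are assumptions made for this example.

```python
import io
import keyword
import tokenize

# Token types that carry no structural information for this comparison.
_SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def token_stream(source: str) -> list:
    """Tokenize Python source and normalize names and literals.

    Identifiers become "ID", numbers "NUM", strings "STR", so that
    renaming variables or editing comments does not change the result.
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")
        elif tok.type == tokenize.STRING:
            tokens.append("STR")
        else:
            tokens.append(tok.string)
    return tokens


def shingles(tokens: list, n: int = 3) -> set:
    """Return the set of n-grams (shingles) of consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def jaccard_distance(source_a: str, source_b: str, n: int = 3) -> float:
    """Jaccard distance in [0, 1] between the shingle sets of two sources."""
    a = shingles(token_stream(source_a), n)
    b = shingles(token_stream(source_b), n)
    return 1.0 - len(a & b) / len(a | b)
```

Because identifiers are normalized, a fork that only renames variables has distance 0 to its origin, which matches the intuition that such a fork is not a genuine new approach. The same normalization makes the measure blind to differences that lie only in names or constants, so it would need to be weighed against the other methods compared in the thesis.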

The metric is validated on a dataset acquired from public source code repositories by automated crawling. The results should be visualized using the pairwise distances and available meta information (e.g., time, fork origin). Python scripts performing processing operations other than steganography on similar media could be included in the study as a reference. Statistical methods could be applied to cluster very similar instances, allowing us to estimate the number of genuine approaches.
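The clustering step can be sketched in a similarly simple way. Assuming a symmetric matrix of pairwise distances is already available, the example below merges all instances closer than a threshold (single-linkage grouping via union-find) and counts the resulting groups as a rough estimate of the number of genuine approaches. The function name `count_clusters`, the toy matrix, and the threshold value are hypothetical. A real study would likely choose the clustering method and the threshold on statistical grounds.

```python
def count_clusters(dist: list, threshold: float) -> int:
    """Count groups of instances linked by distances <= threshold.

    `dist` is a symmetric n x n matrix (list of lists) of pairwise
    distances. Two instances end up in the same group if they are
    connected by a chain of pairs that are each within the threshold.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(n)})


# Toy example: repositories 0 and 1 are near-duplicates, as are 2 and 3.
toy = [
    [0.00, 0.05, 0.90, 0.90],
    [0.05, 0.00, 0.90, 0.90],
    [0.90, 0.90, 0.00, 0.10],
    [0.90, 0.90, 0.10, 0.00],
]
estimated_approaches = count_clusters(toy, threshold=0.2)
```

Single-linkage grouping is used here only because it is short to write down. It tends to chain loosely related instances together, so the estimate is sensitive to the threshold.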


  • Kuhn, A., Ducasse, S., and Gîrba, T. Semantic Clustering: Identifying Topics in Source Code. Information and Software Technology, 49, 3 (2007), 230–243.
  • Ragkhitwetsagul, C., Krinke, J., and Clark, D. A Comparison of Code Similarity Analysers. Empirical Software Engineering, 23, 4 (2018), 2464–2519.