the t5 corpus

release: 04/18/2011

Description: The t5 corpus is a sample of files derived from the GovDocs corpus, designed to help test tools for (bitwise) approximate matching. The data set consists of all files in the first five directories (000-004) that are between 4 KB and 17 MB in size. Most of the roughly 300 small files that were eliminated were web server error messages or corrupted files. A few very large files (mostly logs) were also eliminated as outliers.
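
The selection rule above can be sketched in Python as follows. This is only an illustration, not the script used to build the corpus: the function name select_files, the assumption that the GovDocs directories sit directly under one root folder, and the reading of "4 KB" and "17 MB" as binary units with inclusive bounds are all assumptions made for the example.

```python
from pathlib import Path

# Assumed bounds: the description gives "4 KB" and "17 MB" without saying
# whether these are binary or decimal units; binary units are used here.
MIN_SIZE = 4 * 1024
MAX_SIZE = 17 * 1024 * 1024


def select_files(govdocs_root, dirs=("000", "001", "002", "003", "004")):
    """Return sorted paths of regular files whose size lies within the bounds.

    govdocs_root is a hypothetical local folder containing the numbered
    GovDocs directories; missing directories are skipped.
    """
    root = Path(govdocs_root)
    selected = []
    for name in dirs:
        directory = root / name
        if not directory.is_dir():
            continue
        for path in directory.iterdir():
            if path.is_file() and MIN_SIZE <= path.stat().st_size <= MAX_SIZE:
                selected.append(path)
    return sorted(selected)
```

Calling select_files("govdocs") on a local copy would list the candidate files; whether the result matches the published corpus exactly depends on the assumptions above.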

The sample was taken from a sequence of neighboring directories because we expect this to capture groups of genuinely correlated files. (Generally, files that are closer to each other in numbering are more likely to have been retrieved from the same server.)

End result: 4,457 files, 1.9 GB of data (1.2 GB compressed). File frequencies by type:
    html  pdf   text  doc   ppt   jpg   xls   gif
   +-----+-----+-----+-----+-----+-----+-----+-----+
    1093  1073  711   533   368   362   250   67     
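
The counts in the table sum to 4,457, matching the stated total. A tally like this can be recomputed from an unpacked copy with a short Python sketch. It assumes that file types are identified by file name extension, which the description does not state, so the labels it produces may differ from the table's. The function name type_frequencies is hypothetical.

```python
from collections import Counter
from pathlib import Path


def type_frequencies(corpus_dir):
    """Count regular files under corpus_dir by lowercased extension.

    Assumption: the extension (without the leading dot) stands in for the
    file type. Returns (extension, count) pairs, most frequent first.
    """
    counts = Counter(
        path.suffix.lower().lstrip(".")
        for path in Path(corpus_dir).rglob("*")
        if path.is_file()
    )
    return counts.most_common()
```

For example, sum(count for _, count in type_frequencies("t5")) gives the total number of files found under a hypothetical local folder named t5.
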
Download: t5-corpus.zip
sha1: 183f68b1626feec47653fda7391cfbf21e0e0ded
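
One way to check a downloaded archive against the sha1 value above is sketched below in Python, using the standard hashlib module. The helper names sha1_of_file and verify, and the local file name used in the usage note, are assumptions for illustration.

```python
import hashlib

# Digest published with the corpus (copied from the sha1 line above).
EXPECTED_SHA1 = "183f68b1626feec47653fda7391cfbf21e0e0ded"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA1):
    """Return True if the file's SHA-1 digest equals the expected value."""
    return sha1_of_file(path) == expected.lower()
```

With the archive saved locally as t5-corpus.zip, verify("t5-corpus.zip") should return True for an intact download.
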
License: Public Domain. To the best of our knowledge, all data was downloaded from public US Government servers and is not subject to copyright or other restrictions.

references

Roussev, V., An Evaluation of Forensic Similarity Hashes. In Proceedings of the Eleventh Annual DFRWS Conference, pp. S34-41, Aug 2011, New Orleans, LA.