release: 04/18/2011
Description: The t5 corpus is a sample of files derived from the GovDocs corpus and is designed to help test tools for (bitwise) approximate matching. The data set consists of all files in the first five directories (000-004) that are between 4 KB and 17 MB in size. Most of the roughly 300 small files that were eliminated contained web server error messages or were corrupted. A few very large files (mostly logs) were also eliminated as outliers.
The sample was chosen from a sequence of neighboring directories because we expect this to bring in groups of genuinely correlated files. (Generally, files that are closer to each other in numbering are more likely to have been retrieved from the same server.)
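The selection rule described above (directories 000-004, file sizes between 4 KB and 17 MB) could be approximated as in the following sketch. It is not the script used to build the corpus: the function name `select_files`, the binary interpretation of KB and MB, the inclusive bounds, and the assumption that a local GovDocs copy has one subdirectory per three-digit name are all illustrative choices.

```python
import os

KB = 1024              # assumption: binary units; the description does not say
MB = 1024 * KB
MIN_SIZE = 4 * KB      # lower size limit from the description
MAX_SIZE = 17 * MB     # upper size limit from the description
SUBDIRS = [f"{i:03d}" for i in range(5)]  # "000" through "004"


def select_files(root):
    """Return paths under root/000..004 whose size lies within the limits.

    Bounds are treated as inclusive, which is an assumption.
    """
    selected = []
    for sub in SUBDIRS:
        directory = os.path.join(root, sub)
        if not os.path.isdir(directory):
            continue  # tolerate a partial local copy
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue
            if MIN_SIZE <= os.path.getsize(path) <= MAX_SIZE:
                selected.append(path)
    return selected
```

Calling `select_files("/path/to/govdocs")` on a local copy returns the candidate list; whether it reproduces the published file set exactly depends on the assumptions above.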
End result: 4,457 files, 1.9 GB of data (1.2 GB compressed). File frequencies by type:
  html   pdf  text   doc   ppt   jpg   xls   gif
+-----+-----+-----+-----+-----+-----+-----+-----+
  1093  1073   711   533   368   362   250    67
Download: t5-corpus.zip
sha1: 183f68b1626feec47653fda7391cfbf21e0e0ded
License: Public Domain. To the best of our knowledge, all data has been downloaded from public US Government servers and is not subject to copyright or other restrictions.
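A downloaded copy of the archive can be checked against the SHA-1 digest listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `sha1_of_file` and `verify` are illustrative, and the location of the archive on disk is an assumption.

```python
import hashlib

# Digest published alongside t5-corpus.zip.
EXPECTED_SHA1 = "183f68b1626feec47653fda7391cfbf21e0e0ded"


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read 1 MiB at a time so the 1.2 GB archive is never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA1):
    """Return True if the file's SHA-1 matches the expected digest."""
    return sha1_of_file(path) == expected.lower()
```

For example, `verify("t5-corpus.zip")` returns `True` only if the file in the current directory matches the published digest. A mismatch usually points to an incomplete or corrupted download.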
references
- Roussev, V., An Evaluation of Forensic Similarity Hashes. In Proceedings of the Eleventh Annual DFRWS Conference, pp. S34-41, Aug 2011, New Orleans, LA.