the t5 corpus

vassil@roussev.net    

release: 04/18/2011

  • Description: The t5 corpus is a sample of files derived from the GovDocs corpus, designed to help test tools for (bitwise) approximate matching. The data set consists of all files in the first five directories (000-004) that are between 4 KB and 17 MB in size. Most of the ~300 small files that were eliminated contained web server error messages or were corrupted. A few very large files (mostly logs) were also eliminated as outliers.

    The sample was drawn from a sequence of neighboring directories because we expect this to capture groups of genuinely correlated files. (Generally, files that are closer to each other in numbering are more likely to have been retrieved from the same server.)

    End result: 4,457 files, 1.9 GB of data (1.2 GB compressed). File frequencies by type:

        html  pdf   text  doc   ppt   jpg   xls   gif
       +-----+-----+-----+-----+-----+-----+-----+-----+
        1093  1073  711   533   368   362   250   67     
  • Download: t5-corpus.zip
    sha1: 183f68b1626feec47653fda7391cfbf21e0e0ded
  • License: Public Domain. To the best of our knowledge, all data was downloaded from public US Government servers and is not subject to copyright or other restrictions.
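
The checksum and the selection rule given above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used to build the corpus: the helper names (sha1_of_file, select_by_size), the local path t5-corpus.zip, the directory layout root/000 ... root/004, the use of binary units (1 KB = 1024 bytes), and the inclusive size bounds are all assumptions made for the example. Only the SHA-1 value, the directory range, and the 4 KB / 17 MB limits come from the description.

```python
import hashlib
from pathlib import Path

# SHA-1 published on this page for t5-corpus.zip.
EXPECTED_SHA1 = "183f68b1626feec47653fda7391cfbf21e0e0ded"

# Assumptions (not stated in the source): binary units, inclusive bounds.
MIN_SIZE = 4 * 1024            # 4 KB
MAX_SIZE = 17 * 1024 * 1024    # 17 MB


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def select_by_size(root, dirs=("000", "001", "002", "003", "004")):
    """Yield files in root/<dir> whose size lies in [MIN_SIZE, MAX_SIZE].

    The root/<dir> layout is a hypothetical local copy of the first five
    GovDocs directories; missing directories are skipped.
    """
    for name in dirs:
        directory = Path(root) / name
        if not directory.is_dir():
            continue
        for path in sorted(directory.iterdir()):
            if path.is_file() and MIN_SIZE <= path.stat().st_size <= MAX_SIZE:
                yield path


if __name__ == "__main__":
    archive = Path("t5-corpus.zip")  # assumed download location
    if archive.exists():
        actual = sha1_of_file(archive)
        print("OK" if actual == EXPECTED_SHA1 else f"MISMATCH: {actual}")
    else:
        print(f"{archive} not found")
```

Reading the archive in chunks keeps memory use flat for a file of this size. A size filter alone is not expected to reproduce the exact file list, since it reflects only the size criterion stated in the description.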

references