the msx-13 corpus

release: 02/28/2013

Description: The msx-13 corpus is a random sample of 22,000 MS Office 2007 files (docx, xlsx, pptx) downloaded from the Internet. We built it from the results of a popular search engine, which we used to identify approximately 10,000 candidate documents of each type. Because search engine restrictions make it difficult to obtain more than 1,000 results per query, we ran each query 10 times, each with a different site restriction (up to 10 x 1,000 = 10,000 results per file type). We used queries of the form
ext: <file_type> site: <tld>
where
<file_type> = docx | xlsx | pptx, <tld> = com | net | org | edu | gov | us | uk | ca | au | nz.
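The query scheme above can be enumerated with a few lines of Python. This is only an illustrative sketch: the function name `build_queries` is hypothetical, and the code only builds the query strings. It does not contact any search engine, since the engine and its interface are not named here.

```python
from itertools import product

# File types and site restrictions exactly as listed in the description.
FILE_TYPES = ["docx", "xlsx", "pptx"]
TLDS = ["com", "net", "org", "edu", "gov", "us", "uk", "ca", "au", "nz"]


def build_queries(file_types=FILE_TYPES, tlds=TLDS):
    """Return one query string per (file type, TLD) pair."""
    return [f"ext: {ft} site: {tld}" for ft, tld in product(file_types, tlds)]


queries = build_queries()
print(len(queries))  # 30 (3 file types x 10 site restrictions)
print(queries[0])    # ext: docx site: com
```

With at most 1,000 results per query, the 10 queries for each file type yield at most 10,000 candidates of that type.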
Once downloaded, the sample files were cleaned up by verifying that each file is a valid zip archive with the necessary structure. The final statistics of the set are:
=============== ====== ====== ======
Statistic       docx   xlsx   pptx
=============== ====== ====== ======
File count      7,018  7,452  7,530
Total size (MB) 2,014  1,976  20,037
Avg size (KB)   287    265    2,661
=============== ====== ====== ======
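The cleanup step mentioned above (checking that each download is a valid zip archive with the necessary structure) might look roughly like the following. This is a sketch under stated assumptions, not the check actually used to build the corpus: the function name `looks_like_ooxml` is hypothetical, and the required entries (`[Content_Types].xml` plus the conventional main part for each type) are our guess at what "necessary structure" means.

```python
import zipfile

# Assumption: the conventional main-part path for each file type.
# The corpus description does not say which entries were checked.
MAIN_PART = {
    "docx": "word/document.xml",
    "xlsx": "xl/workbook.xml",
    "pptx": "ppt/presentation.xml",
}


def looks_like_ooxml(path, file_type):
    """Return True if `path` is a readable zip with the expected entries."""
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            if archive.testzip() is not None:
                return False
            names = set(archive.namelist())
    except Exception:
        # Treat any failure while reading the archive as "not valid".
        return False
    return "[Content_Types].xml" in names and MAIN_PART[file_type] in names
```

A download would be kept only if, for example, `looks_like_ooxml("sample.docx", "docx")` returns True; the file name here is a placeholder.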

List of URLs

Download script (bash)

Parallelizing script (splits scripts into pieces)

Note: The URLs in the provided list decay over time as files are moved or taken offline, so some downloads will fail. If you are a researcher, please contact us, and we would be happy to provide the original files for the sake of reproducibility.

references

Roussev, V., Quates, C. File fragment encoding classification--an empirical approach. In Proceedings of the 13th Annual Digital Forensic Conference (DFRWS), Aug 2013, Monterey, CA. (to appear)