
sdhash is a tool that allows two arbitrary blobs of data to be compared for similarity based on common strings of binary data. It is designed to provide quick results during the triage and initial investigation phases. It has been in active development since 2010 with the explicit goal of becoming fast, scalable, and reliable.

Use cases

There are two general classes of problems where sdhash can provide significant benefits: fragment identification and version correlation.

In fragment identification, we search for a smaller piece of data inside a bigger piece of data (“needle-in-a-haystack”). For example:

  • Block vs. file correlation: given a chunk of data (disk block/network packet/RAM page/etc), we can search a reference collection of files to identify whether the chunk came from any of them.
  • File vs. RAM/disk image: given a file and a target image, we can efficiently determine if any pieces of the file can be found on the image (that includes deallocated storage).
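
The block vs. file scenario can be illustrated with a toy sketch. This is not sdhash's algorithm or its API; it is a naive stand-in that scores a fragment by the fraction of its 8-byte substrings that also occur in each reference file. The function names, the window size, and the sample data are all assumptions made for the example.

```python
# Toy illustration of fragment identification. NOT sdhash's method:
# a naive containment score over raw byte n-grams, for intuition only.

def ngrams(data: bytes, n: int = 8) -> set:
    """Return the set of all n-byte substrings of data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def containment(fragment: bytes, target: bytes, n: int = 8) -> float:
    """Fraction of the fragment's n-grams that also occur in target."""
    frag = ngrams(fragment, n)
    if not frag:
        return 0.0  # fragment shorter than n bytes: nothing to compare
    return len(frag & ngrams(target, n)) / len(frag)

# Hypothetical reference collection (file name -> contents).
reference = {
    "report.txt": b"The quick brown fox jumps over the lazy dog. " * 4,
    "notes.txt": b"Completely unrelated content about disk forensics. " * 4,
}

# A "disk block" cut out of the middle of one reference file.
block = reference["report.txt"][10:60]

# Score the block against every reference file and pick the best match.
scores = {name: containment(block, data) for name, data in reference.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

A scheme like this keeps every n-gram in memory, so it does not scale to large collections; it is meant only to show what "searching for a smaller piece of data inside a bigger one" amounts to.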

In version correlation, we are interested in correlating data objects (files) that are comparable in size; similar ones can thus be viewed as versions of one another. There are two basic scenarios in which this is useful: identifying related documents and identifying code versions.


In all cases, the tool is used the same way; only the interpretation of the results may differ based on the circumstances.

Lay of the tutorial

This tutorial provides examples and case studies that illustrate all of the above scenarios along with guidance on how to obtain and interpret the results. We recommend that you:

  1. Skim through the case studies to get ideas on where sdhash may fit into your workflow (15-30min);
  2. Read the installation guide, and install the software (5-15min);
  3. Work through the quick start guide (30-60min);
  4. Quickly skim through the options, just to get an idea of what is possible (10min).

At this point, you should be ready to start trying things on your own. As you come to understand the capabilities of sdhash and start running it on larger data sets, you will probably be motivated to come back and learn about the options and advanced techniques in more detail.

We hope you find this tutorial helpful; send comments to



The sdhash tutorial by Vassil Roussev and Candice Quates is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.