Quick start¶

sdhash is designed to work right out of the box with a minimal number of command line parameters. At the same time, the tool provides considerable flexibility for advanced users, which can be introduced incrementally, as needed.

Invoking the tool from the command line with no arguments shows a brief summary of the available command line options (on Linux, the man sdhash command provides more details):

sdhash 3.3 by Vassil Roussev, Candice Quates [sdhash.org] 07/2013

Usage: sdhash <options> <files>
Configuration:
  -r [ --deep ]                   generate SDBFs from directories and files
  -f [ --target-list ]            generate SDBFs from list(s) of filenames
  -c [ --compare ]                compare SDBFs in file, or two SDBF files
  -g [ --gen-compare ]            compare all pairs in source data
  -t [ --threshold ] arg (=1)     only show results >=threshold
  -b [ --block-size ] arg         hashes input files in nKB blocks
  -p [ --threads ] arg            restrict compute threads to N threads
  -s [ --sample-size ] arg (=0)   sample N filters for comparisons
  -z [ --segment-size ] arg       set file segment size, 128MB default
  -o [ --output ] arg             send output to files
  --separator arg (=pipe)         for comparison results: pipe csv tab
  --hash-name arg                 set name of hash on stdin
  --validate                      parse SDBF file to check if it is valid
  --index                         generate indexes while hashing
  --index-search arg              search directory of reference indexes
  --config-file arg (=sdhash.cfg) use config file
  --verbose                       warnings, debug and progress output
  --version                       show version info
  -h [ --help ]                   produce help message

Note

Whenever we discuss command line options, we will quote both the short and long form as in -c [--compare].

Digest generation¶

Invoked with one or more file names, sdhash produces the similarity digest of each file and prints it to standard output. Each digest is completely self-contained and is exactly one (potentially very, very long) line of printable ASCII characters. It consists of several header fields separated by colons, followed by the base64 encoding of the (binary) digest data. (The details of the format are explained in [?].)

Example: Generate the sdhash of file 014.html:

$ sdhash 014.html
sdbf:03:8:014.html:38053:sha1:256:5:7ff:160:3:75:HZvUjQGMcYEfsk8IQgIggA ...

Lines from different files can be freely combined using standard text processing utilities. For all practical purposes, there are only two “user serviceable” fields–the file/object name (prefaced by its length) followed by the original object/file length.
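As a minimal sketch of how the two user-serviceable fields could be read, the Python function below uses the length prefix to extract the name, so a name that itself contains colons is still handled. The helper name parse_digest_header is hypothetical (it is not part of sdhash), the layout is assumed from the example line above, names are assumed to be ASCII, and the remaining header fields are deliberately left uninterpreted.

```python
def parse_digest_header(line):
    """Extract the object name and original length from one digest line.

    Assumed layout (taken from the example output, not from a format spec):
    magic:version:name_length:name:object_length:<remaining fields...>
    """
    magic, version, name_len, rest = line.split(":", 3)
    n = int(name_len)
    name = rest[:n]                      # the length prefix tells us where the name ends
    size_text, remainder = rest[n + 1:].split(":", 1)
    return {
        "magic": magic,
        "version": version,
        "name": name,
        "size": int(size_text),
        "remainder": remainder,          # uninterpreted header fields and base64 data
    }


header = parse_digest_header(
    "sdbf:03:8:014.html:38053:sha1:256:5:7ff:160:3:75:HZvUjQGMcYEfsk8IQgIggA"
)
print(header["name"], header["size"])
```

Relying on the length prefix rather than on a plain split is the reason the name field carries its own length in the first place.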

Note

The minimum file size is 512 bytes; sdhash will quietly skip smaller files. If you need a warning about these, use the --verbose option.

Standard pathname patterns expansion (globbing) works as expected, courtesy of the shell:

Example: Hash all html files in the current directory:

$ sdhash *.html
sdbf:03:8:014.html:38053:sha1:256:5:7ff:160:3:75:HZvUjQGMcYEfsk8IQgIggA ...
sdbf:03:8:056.html:6664:sha1:256:5:7ff:160:1:99:AAAYTGCQsiAOBEkANREgAIh ...
sdbf:03:8:057.html:15893:sha1:256:5:7ff:160:2:102:KSxoLJfAKAAaIQiRoAMZw ...
sdbf:03:8:058.html:734:sha1:256:5:7ff:160:1:11:AAAAAAAAAAAAAAAAAAAAAAAQ ...
      ...

Note

There are practical limits to globbing pattern expansion in shell environments. In our experience, bash maxes out somewhere between 40,000 and 50,000 files. To get around these limitations, use the -f [--target-list] or -r [--deep] option. See [?].
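One way to prepare input for the -f option without relying on shell globbing is to write the file names to a list file first. The sketch below is only an illustration: write_target_list and the file name html-files.txt are hypothetical, and the one-name-per-line layout of the list file is an assumption that should be checked against the sdhash manual.

```python
from pathlib import Path


def write_target_list(directory, pattern, list_path):
    """Write the files in `directory` matching `pattern` to `list_path`.

    Assumption: sdhash -f expects one file name per line.
    Returns the number of names written.
    """
    names = sorted(str(p) for p in Path(directory).glob(pattern) if p.is_file())
    Path(list_path).write_text("".join(name + "\n" for name in names))
    return len(names)


# Example use (not executed here):
#   write_target_list(".", "*.html", "html-files.txt")
# followed, on the command line, by something like:
#   sdhash -f html-files.txt -o html.sdbf
```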

To save the result to a file you can either use output stream redirection:

$ sdhash *.html > html.sdbf

or the -o [--output] option (which could be anywhere in the command line):

$ sdhash -o html.sdbf *.html
$ sdhash *.html -o html.sdbf

The naming convention is to use the .sdbf extension (short for similarity digest bloom filters) for the similarity digests (sdhash does not depend on it).

Warning

If you are generating digests from Windows PowerShell, always use the -o [--output] option to save the output.

Windows PowerShell automatically (and unconditionally) converts output streams into 16-bit Unicode format. This prevents the tool from correctly reading the sdbf file later (it also doubles the size of the output).

In all other cases, including running from cmd.exe on Windows, it is safe to use output stream redirection (>) to store the output. In fact, this is measurably faster on large data sets. (The -o option was created specifically to deal with PowerShell without introducing platform dependencies in the code.)
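If a digest file has already been written through PowerShell redirection, a check along the following lines may help to spot and undo the conversion. This is a sketch under assumptions: it presumes the file starts with a UTF-16 byte order mark, the helper names are hypothetical, and whether sdhash accepts the converted file (for instance with respect to line endings) has not been verified here. Regenerating the digests with -o remains the safer route.

```python
import codecs


def is_utf16_with_bom(data: bytes) -> bool:
    """Return True if the bytes start with a UTF-16 byte order mark."""
    return data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE)


def to_ascii_bytes(data: bytes) -> bytes:
    """Convert BOM-prefixed UTF-16 text back to plain ASCII bytes.

    Data without a BOM is returned unchanged. Non-ASCII characters raise
    UnicodeEncodeError, which is appropriate because digest lines are
    expected to be printable ASCII.
    """
    if is_utf16_with_bom(data):
        return data.decode("utf-16").encode("ascii")
    return data


# Simulate what a 16-bit Unicode conversion does to one (truncated) line.
original = b"sdbf:03:8:014.html:38053\n"
converted = original.decode("ascii").encode("utf-16")
assert is_utf16_with_bom(converted)
assert to_ascii_bytes(converted) == original
```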

Note

The suggested naming convention is to use the .sdbf extension (short for similarity digest bloom filters) for the signature file. (The implementation does not depend on the file name.)

Digest comparison¶

The primary mode of use for sdhash is to first generate and store the hashes for the targets of interest, and then compare them. The comparison step is initiated by the -c [--compare] option, which can be invoked with one or two hash sets as arguments.

Single set comparison¶

In this case, the single file parameter indicates that sdhash should compare all unique hash pairs in the set (for n hashes, this yields \(n\times(n-1)/2\) comparisons).
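To make the count concrete, the short Python sketch below enumerates the unique pairs for a handful of names and checks the total against the formula. It only illustrates the arithmetic; the order in which sdhash itself visits the pairs is not assumed.

```python
from itertools import combinations


def unique_pairs(names):
    """Return every unordered pair of distinct names exactly once."""
    return list(combinations(names, 2))


names = ["014.html", "056.html", "057.html", "058.html"]
pairs = unique_pairs(names)
n = len(names)

# For n hashes there are n * (n - 1) / 2 unique pairs.
assert len(pairs) == n * (n - 1) // 2
print(len(pairs), "comparisons for", n, "hashes")
```

Because the count grows quadratically, doubling the number of hashes roughly quadruples the number of comparisons.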

Example: The following will compare all html files in the current directory. Using the sample data included with this tutorial, the output would look like this:

$ sdhash -o html.sdbf *.html
$ sdhash -c html.sdbf
195.html|206.html|001
201.html|206.html|007
428.html|608.html|058
428.html|607.html|069
061.html|062.html|026
060.html|059.html|031
199.html|198.html|066
...

The output consists of three columns separated by the pipe (vertical bar) symbol. The first two columns are names of the files being compared, whereas the third one gives the similarity score, which is a number between -1 and 100.

By default, only positive scores are shown.
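For post-processing, the three-column output is easy to read into a script. The following Python sketch (parse_comparisons is a hypothetical helper) parses pipe-separated result lines and ranks them by score, highest first. It assumes the default pipe separator and splits from the right, so a separator character inside the first name does not shift the score column.

```python
def parse_comparisons(text, separator="|"):
    """Parse comparison output into (first_name, second_name, score) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        first, second, score = line.rsplit(separator, 2)
        rows.append((first, second, int(score)))
    return rows


sample = """195.html|206.html|001
428.html|607.html|069
061.html|062.html|026"""

# Rank the pairs so that the strongest matches are examined first.
ranked = sorted(parse_comparisons(sample), key=lambda row: row[2], reverse=True)
for first, second, score in ranked:
    print(f"{score:3d}  {first}  {second}")
```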

Two-set comparison¶

In this case, every hash in the first set is compared against every hash in the second one. For example, using the html.sdbf from above, we can compare the set against itself (twice!) using the two-set format:

$ sdhash -c html.sdbf html.sdbf | sort
014.html|014.html|100
056.html|056.html|100
057.html|057.html|100
059.html|059.html|100
059.html|060.html|031
060.html|059.html|031
060.html|060.html|100
061.html|061.html|100
061.html|062.html|026
062.html|061.html|026
...

The main use case for the two-set comparison (and sdhash in general) is the querying of a reference database with unknown data in an effort to find correlations:

$ sdhash -o unknown.sdbf unknown/*
$ sdhash -c unknown.sdbf reference.sdbf

In our example set, we could compare html versus text files:

$ sdhash *.html > html.sdbf
$ sdhash *.txt  > txt.sdbf
$ sdhash -c txt.sdbf html.sdbf

In this case, we find no matches, which is common when comparing files of different encodings. The test set includes a set of text files that have been derived from the html files in the set using the html2text utility, which strips away the markup and leaves the plain text. Intuitively, we would expect html files that have large chunks of plain text inside to come out in the following experiment:

$ sdhash *.html-txt > html-txt.sdbf
$ sdhash -c html.sdbf html-txt.sdbf
740.html|740.html-txt|002
448.html|448.html-txt|024
418.html|418.html-txt|010
136.html|136.html-txt|016
597.html|597.html-txt|064
434.html|434.html-txt|095

As it turns out, the last file–434.html–is almost all text.

Threshold (`-t`)¶

The threshold parameter instructs sdhash to display only results that are greater than or equal to the parameter; by default, it is set to 1, so all positive results are shown. The proper setting of the threshold is inherently case-specific but in the following sections we give a good starting point that works for most cases.

Setting the threshold to zero (-t 0) will display the results of all comparisons; setting it to negative one (-t -1) will also show comparisons that have not been performed due to insufficient data (see below).
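The effect of the three settings can be mimicked on already-parsed results. The Python sketch below is an illustration of the selection rule only (apply_threshold is a hypothetical helper and the sample rows are made up for the example); it is not how sdhash implements the option internally.

```python
def apply_threshold(rows, threshold=1):
    """Keep only rows whose score is greater than or equal to the threshold.

    Each row is a (first_name, second_name, score) tuple; the default of 1
    mirrors the default of the -t option.
    """
    return [row for row in rows if row[2] >= threshold]


# Illustrative rows: a positive score, a zero score, and a comparison
# that could not be performed (-1).
rows = [
    ("a.html", "b.html", 58),
    ("a.html", "c.html", 0),
    ("a.html", "tiny.html", -1),
]

print(len(apply_threshold(rows)))        # default: positive scores only
print(len(apply_threshold(rows, 0)))     # all completed comparisons
print(len(apply_threshold(rows, -1)))    # also comparisons that were not performed
```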

Result interpretation¶

As already mentioned, the result of a completed comparison is a number between 0 and 100 (a result of -1 indicates that the comparison could not be performed); its interpretation depends on the scenario to which the tool is applied and the data encoding (file type) of the compared objects.

Scenarios¶

There are two basic usage scenarios:

Fragment identification–we search for (traces of) something “small” in something bigger. This can include comparing the content of a disk block/network packet to a file, a file to a RAM capture, or file to a disk image.
Version correlation–we look for similarity between two comparably-sized objects, usually files.

Note that the distinction between the two is entirely in the eyes of the user–the tool works exactly the same way in all cases. The distinction is only important in understanding what the result means.

For example, we could compare an executable to a raw RAM snapshot, in search of clues that the executable had been loaded. We may get a result as high as 100, which obviously does not mean that the two objects are identical; it merely reflects that sdhash is highly confident that large pieces of the file are present.

Another example case would be to search for known embedded images (e.g., a corporate logo) inside a collection of documents.

Version correlation is very helpful in finding files that have a non-trivial amount of commonality. For example, executables tend to change on a function-by-function basis and we have found that we can quite reliably identify new versions of the files from hashes of previous versions. For data, the commonality may come from common boilerplate (html, pdf, etc.), or as a result of normal editing operations.

Significance¶

Despite its range (0-100), the sdhash comparison result should not be interpreted as a percentage of common content. Rather, it should be viewed as a confidence value that indicates how certain the tool is that the two data objects have non-trivial amounts of commonality.

The proper use of the sdhash score is to examine the results in descending order, until the false positives (as defined by the user) exceed the true positives by some margin.

The following is a basic guide to interpreting the results from sdhash comparisons. In [?], we will add some finer points and caveats:

Strong (range: 21-100). These are reliable results with very few false positives. When used to evaluate the resemblance of two objects (files) of comparable size, the number is loosely related to the level of commonality, but this is not a guarantee. When used as part of a containment query (find a small object inside a bigger one), the number can vary widely depending on the particular position of the embedding. In other words, the larger object may contain the small one 100% but the score may be as small as 25.
Marginal (11-20). The significance of resemblance comparisons in this range depends substantially on the underlying data. For many composite file types (PDF, MS Office) there tends to be some embedded commonality, which is a function of commonly used applications leaving their imprint on the file; we’ve observed this on occasion even with JPEG files that contain lots of (Adobe Photoshop) metadata. In that sense, the tool is not wrong but the discovered correlation is usually not of interest. Other embedded artifacts, such as fonts, are also among the discovered commonalities but are rarely significant. For simpler file types, results in this range are much more likely to be significant and should be examined in decreasing order until the false positives start to dominate.
Weak (1-10). These are generally weak results and, typically, most would be false positives. However, when applied to simple file types, such as text, scores as low as 5 could be significant.
Negative (0). The correlation between the targets is statistically comparable to that of two blobs of random data. Special care needs to be taken when comparing large targets to each other, as discovered commonality could be averaged out to zero. For example, if two 100GB targets have 1GB in common, the tool will discover that fact but when averaged with the results from the remaining 99GB, the final score will almost certainly be zero, and definitely no more than one.
Unknown (-1). This is a rare occurrence for files above 4KB unless they contain large regions of low-entropy data. Recall that the absolute minimum file size that sdhash will consider hashing is 512 bytes. (If a case requires the comparison of lots of tiny files, sdhash is likely the wrong tool.)
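The ranges in the guide above can be captured in a small lookup, for example to label parsed results before reviewing them. The Python sketch below (interpret_score is a hypothetical helper) encodes only the score bands; the caveats about file type and containment queries still have to be applied by the analyst.

```python
def interpret_score(score):
    """Map an sdhash comparison score to the band names used in the guide."""
    if score == -1:
        return "unknown"    # comparison not performed (insufficient data)
    if score == 0:
        return "negative"
    if 1 <= score <= 10:
        return "weak"
    if 11 <= score <= 20:
        return "marginal"
    if 21 <= score <= 100:
        return "strong"
    raise ValueError(f"score out of range: {score}")


for value in (95, 16, 5, 0, -1):
    print(value, interpret_score(value))
```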

Warning

A result of 100 does not guarantee that two objects are identical. Use a crypto hash to establish identity. (We plan to incorporate a crypto hash to test for identity, starting with version 4.0.)
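A separate identity check is straightforward to add outside the tool. The following Python sketch compares the SHA-256 digests of two files using the standard hashlib module; the function names are hypothetical, and the choice of SHA-256 is an assumption made for the example, since the text does not prescribe a particular crypto hash.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def are_identical(path_a, path_b):
    """Return True if both files have the same SHA-256 digest."""
    return file_sha256(path_a) == file_sha256(path_b)
```

A pair that scores 100 in sdhash can then be passed to are_identical to confirm or rule out byte-for-byte identity.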
