sdhash is designed to work right out of the box with a minimal number of command-line parameters. At the same time, the tool offers considerable flexibility for advanced users, which can be introduced incrementally as needed.
Invoking the tool from the command line with no arguments shows a brief summary of the available command-line options (on Linux, the man sdhash command provides more details):
sdhash 3.3 by Vassil Roussev, Candice Quates [sdhash.org] 07/2013
Usage: sdhash <options> <files>
Configuration:
-r [ --deep ] generate SDBFs from directories and files
-f [ --target-list ] generate SDBFs from list(s) of filenames
-c [ --compare ] compare SDBFs in file, or two SDBF files
-g [ --gen-compare ] compare all pairs in source data
-t [ --threshold ] arg (=1) only show results >=threshold
-b [ --block-size ] arg hashes input files in nKB blocks
-p [ --threads ] arg restrict compute threads to N threads
-s [ --sample-size ] arg (=0) sample N filters for comparisons
-z [ --segment-size ] arg set file segment size, 128MB default
-o [ --output ] arg send output to files
--separator arg (=pipe) for comparison results: pipe csv tab
--hash-name arg set name of hash on stdin
--validate parse SDBF file to check if it is valid
--index generate indexes while hashing
--index-search arg search directory of reference indexes
--config-file arg (=sdhash.cfg) use config file
--verbose warnings, debug and progress output
--version show version info
-h [ --help ] produce help message
Note
Whenever we discuss command line options, we will quote both the short and long form as in -c [--compare].
Invoked with one or more file names, sdhash produces the similarity digest of each file and prints it to standard output. Each digest is completely self-contained and is exactly one (potentially very, very long) line of printable ASCII characters. It consists of several header fields separated by colons, followed by the base64 encoding of the (binary) digest data. (The details of the format are explained in [?].)
$ sdhash 014.html
sdbf:03:8:014.html:38053:sha1:256:5:7ff:160:3:75:HZvUjQGMcYEfsk8IQgIggA ...
Lines from different files can be freely combined using standard text processing utilities. For all practical purposes, there are only two “user serviceable” fields–the file/object name (prefaced by its length) followed by the original object/file length.
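Because the name is prefixed by its length, a script can recover it even when the name itself contains a colon. The following is a minimal sketch and not part of sdhash: the function name sdbf_names is made up for illustration, the header layout is inferred only from the sample line above, and the length is assumed to count single-byte characters.

```shell
# Print "name<TAB>size" for each digest line read from standard input.
# Assumed header layout: sdbf:<field 2>:<name length>:<name>:<size>:...
sdbf_names() {
    awk -F: '$1 == "sdbf" {
        # Characters before the name: the first three fields plus three colons.
        prefix = length($1) + length($2) + length($3) + 3
        name = substr($0, prefix + 1, $3)
        # Skip the name and the colon after it; the size is the next field.
        rest = substr($0, prefix + $3 + 2)
        split(rest, f, ":")
        print name "\t" f[1]
    }'
}
```

$ sdbf_names < html.sdbf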
Note
The minimum file size is 512 bytes; sdhash quietly skips smaller files. If you need a warning about these, use the [--verbose] option.
Standard pathname patterns expansion (globbing) works as expected, courtesy of the shell:
$ sdhash *.html
sdbf:03:8:014.html:38053:sha1:256:5:7ff:160:3:75:HZvUjQGMcYEfsk8IQgIggA ...
sdbf:03:8:056.html:6664:sha1:256:5:7ff:160:1:99:AAAYTGCQsiAOBEkANREgAIh ...
sdbf:03:8:057.html:15893:sha1:256:5:7ff:160:2:102:KSxoLJfAKAAaIQiRoAMZw ...
sdbf:03:8:058.html:734:sha1:256:5:7ff:160:1:11:AAAAAAAAAAAAAAAAAAAAAAAQ ...
...
Note
There are practical limits to globbing pattern expansion in shell environments. In our experience, bash maxes out somewhere between 40,000 and 50,000 files. To get around these limitations, use the -f [--target-list] or -r [--deep] option. See [?].
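One way to prepare input for the -f option is to let find write the file names to a list, so the shell never has to expand the pattern. This is only a sketch: make_target_list and html.list are hypothetical names, and the one-path-per-line list format is an assumption that should be checked against the sdhash documentation.

```shell
# Write one path per line for every regular file under DIR whose name
# matches PATTERN. The pattern is quoted by the caller, so find (not the
# shell) performs the matching.
make_target_list() {
    dir=$1
    pattern=$2
    list=$3
    find "$dir" -type f -name "$pattern" > "$list"
}

# Hypothetical usage (assumes -f accepts one path per line):
#   make_target_list . '*.html' html.list
#   sdhash -f html.list -o html.sdbf
```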
To save the result to a file you can either use output stream redirection:
$ sdhash *.html > html.sdbf
or the -o [--output] option (which could be anywhere in the command line):
$ sdhash -o html.sdbf *.html
$ sdhash *.html -o html.sdbf
The naming convention is to use the .sdbf extension (short for similarity digest bloom filters) for the similarity digests (sdhash does not depend on it).
Warning
If you are generating digests from Windows PowerShell, always use the -o [--output] option to save the output.
Windows PowerShell automatically (and unconditionally) converts output streams into 16-bit Unicode format. This prevents the tool from correctly reading the sdbf file later (it also doubles the size of the output).
In all other cases, including running from cmd.exe on Windows, it is safe to use output stream redirection (>) to store the output. In fact, this is measurably faster on large data sets. (The -o option was created specifically to deal with PowerShell without introducing platform dependencies in the code.)
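If you suspect that a digest file was saved through PowerShell redirection, a quick check is to look at its first two bytes. The sketch below assumes the converted file begins with the UTF-16 little-endian byte order mark (FF FE); the function name is_utf16le is made up for illustration.

```shell
# Succeed if FILE starts with the bytes FF FE (UTF-16 little-endian BOM).
is_utf16le() {
    [ "$(od -An -tx1 -N2 "$1" | tr -d ' \n')" = "fffe" ]
}

# Hypothetical usage:
#   is_utf16le html.sdbf && echo "html.sdbf looks like UTF-16; regenerate it with -o"
```

Regenerating the file with the -o option is the safer repair; converting it back with another tool may leave different line endings behind.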
Note
The suggested naming convention is to use the .sdbf extension–short for similarity digest bloom filters–for the signature file. (The implementation does not depend on the file name.)
The primary mode of use for sdhash is to first generate and store the hashes for the targets of interest, and then compare them. The comparison step is initiated by the -c [--compare] option, which can be invoked with one or two hash sets as arguments.
In this case, the single file parameter indicates that sdhash should compare all unique hash pairs in the set (for n hashes, this yields \(n\times(n-1)/2\) comparisons).
$ sdhash -o html.sdbf *.html
$ sdhash -c html.sdbf
195.html|206.html|001
201.html|206.html|007
428.html|608.html|058
428.html|607.html|069
061.html|062.html|026
060.html|059.html|031
199.html|198.html|066
...
The output consists of three columns separated by the pipe (vertical bar) symbol. The first two columns are names of the files being compared, whereas the third one gives the similarity score, which is a number between -1 and 100.
By default, only positive scores are shown.
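To estimate the amount of work before running a single-set comparison, you can count the digests in the file and apply the n(n-1)/2 formula. This is a small sketch with a hypothetical function name, pair_count; it relies only on each digest occupying exactly one line that starts with sdbf, as in the samples above.

```shell
# Print the number of unique pairs, n*(n-1)/2, for the n digests in FILE.
pair_count() {
    n=$(grep -c '^sdbf' "$1")
    echo $(( n * (n - 1) / 2 ))
}
```

$ pair_count html.sdbf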
In this case, every hash in the first set is compared against every hash in the second one. For example, using the html.sdbf from above, we can compare the set against itself (twice!) using the two-set format:
$ sdhash -c html.sdbf html.sdbf | sort
014.html|014.html|100
056.html|056.html|100
057.html|057.html|100
059.html|059.html|100
059.html|060.html|031
060.html|059.html|031
060.html|060.html|100
061.html|061.html|100
061.html|062.html|026
062.html|061.html|026
...
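When a set is compared against itself this way, every file matches itself and every pair appears in both orders. If only the distinct pairs are of interest, the output can be post-processed with standard tools. The sketch below uses a hypothetical helper, unique_pairs, and assumes that file names do not contain the pipe character.

```shell
# Keep a result line only when the first name sorts strictly before the
# second one. This drops self-comparisons and one of each mirrored pair.
# Appending "" forces awk to compare the fields as strings.
unique_pairs() {
    awk -F'|' '($1 "") < ($2 "")'
}
```

$ sdhash -c html.sdbf html.sdbf | unique_pairs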
The main use case for the two-set comparison (and sdhash in general) is the querying of a reference database with unknown data in an effort to find correlations:
$ sdhash -o unknown.sdbf unknown/*
$ sdhash -c unknown.sdbf reference.sdbf
In our example set, we could compare html versus text files:
$ sdhash *.html > html.sdbf
$ sdhash *.txt > txt.sdbf
$ sdhash -c txt.sdbf html.sdbf
In this case, we find no matches, which is common when comparing files of different encodings. The test set includes a set of text files that have been derived from the html files in the set using the html2text utility, which strips away the markup and leaves the plain text. Intuitively, we would expect html files that have large chunks of plain text inside to come out in the following experiment:
$ sdhash *.html-txt > html-txt.sdbf
$ sdhash -c html.sdbf html-txt.sdbf
740.html|740.html-txt|002
448.html|448.html-txt|024
418.html|418.html-txt|010
136.html|136.html-txt|016
597.html|597.html-txt|064
434.html|434.html-txt|095
As it turns out, the last file–434.html–is almost all text.
The threshold parameter instructs sdhash to display only results that are greater than or equal to the parameter; by default, it is set to 1, so all positive results are shown. The proper setting of the threshold is inherently case-specific but in the following sections we give a good starting point that works for most cases.
Setting the threshold to zero (-t 0) will display the results of all comparisons; setting it to negative one (-t -1) will also show comparisons that have not been performed due to insufficient data (see below).
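Since comparisons on large sets can take a while, it may be convenient to save the full results once and apply different cutoffs afterwards, instead of rerunning sdhash with a new -t value. The helper below, filter_score, is a hypothetical sketch; the cutoff of 50 in the usage line is an arbitrary illustration, not a recommended setting.

```shell
# Keep only result lines whose score (third pipe-separated column) is
# greater than or equal to MIN. Adding 0 turns "058" into the number 58.
filter_score() {
    awk -F'|' -v min="$1" '$3 + 0 >= min + 0'
}
```

$ sdhash -t 0 -c html.sdbf > all-results.txt
$ filter_score 50 < all-results.txt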
As already mentioned, the result of a completed comparison is a number between 0 and 100 (-1 is reported only when a comparison could not be performed); its interpretation depends on the scenario to which the tool is applied and the data encoding (file type) of the compared objects.
There are two basic usage scenarios:
Note that the distinction between the two is entirely in the eyes of the user–the tool works exactly the same way in all cases. The distinction is only important in understanding what the result means.
For example, we could compare an executable to a raw RAM snapshot, in search for clues that the executable had been loaded. We may get a result as high as 100, which obviously does not mean that the two objects are identical; it merely reflects that sdhash is highly confident that large pieces of the file are present.
Another example case would be to search for known embedded images (e.g., a corporate logo) inside a collection of documents.
Version correlation is very helpful in finding files that have a non-trivial amount of commonality. For example, executables tend to change on a function-by-function basis, and we have found that we can quite reliably identify new versions of the files from hashes of previous versions. For data, the commonality may come from common boilerplate (html, pdf, etc.), or result from normal editing operations.
Despite its range (0-100), the sdhash comparison result should not be interpreted as a percentage of common content. Rather, it should be viewed as a confidence value that indicates how certain the tool is that the two data objects have non-trivial amounts of commonality.
The proper use of the sdhash score is to examine the results in descending order, until the false positives (as defined by the user) exceed the true positives by some margin.
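Reviewing results in descending order of score is straightforward with sort. The sketch below wraps the call in a hypothetical helper, by_score_desc, and assumes the default pipe separator and file names without pipe characters.

```shell
# Sort comparison results numerically on the third pipe-separated
# column (the score), highest first.
by_score_desc() {
    sort -t'|' -k3,3nr
}
```

$ sdhash -c html.sdbf | by_score_desc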
The following is a basic guide to interpreting the results from sdhash comparisons. In [?], we will add some finer points and caveats:
Warning
A result of 100 does not guarantee that two objects are identical. Use a crypto hash to establish identity. (We plan to incorporate a crypto hash to test for identity, starting with version 4.0.)
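Until then, identity can be checked outside of sdhash. The following sketch compares the SHA-256 digests of two files; it assumes the sha256sum utility (GNU coreutils) is installed, and the function name same_sha256 is made up for illustration.

```shell
# Succeed only if both files have the same SHA-256 digest.
# Reading from standard input keeps the file name out of the output,
# so only the hash itself is compared.
same_sha256() {
    [ "$(sha256sum < "$1" | cut -d' ' -f1)" = "$(sha256sum < "$2" | cut -d' ' -f1)" ]
}
```

$ same_sha256 059.html 060.html && echo identical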