Understanding your options

Once you are comfortable with the basics, it is time to work through the complete set of features that sdhash offers. We will still stick to the command-line utility; client/server operation is discussed in the next chapter.

Block-aligned hashes (-b)

In the baseline sdhash algorithm, the digests are generated sequentially, with each component filter representing 9-10KB of data, on average. Such digests are well suited to objects of up to several MB. For larger targets, such as RAM/drive images, sdhash automatically switches to block mode, which allows for better parallel processing and faster comparisons.

In block mode, the target is split into fixed-size blocks (by default, 16KiB), and each one is hashed separately. The size of the block (in KiB) can be specified with the -b [--block-size] option. For example, it is often useful to use a block size of 4KiB for RAM images as it matches the typical page size:

$ sdhash -b 4 ram-capture.dd

The built-in threshold for transitioning to block mode is 16MiB; to disable block mode, specify a block size of zero:

$ sdhash -b 0 ram-capture.dd

Note

Block mode is tuned to work in the 4-16KiB range. Since the granularity of the digests goes down, the results have a somewhat higher false positive rate than the baseline algorithm.
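To get a feel for how the block size changes the number of separately hashed pieces, the following sketch does the arithmetic for a given target size. It is based only on the description above and is not sdhash code; the function name is hypothetical, and counting a trailing partial block as one block is an assumption.

```python
KIB = 1024
MIB = 1024 * KIB

DEFAULT_BLOCK_KIB = 16  # default block size in block mode, per the text above


def estimate_block_count(target_bytes, block_kib=DEFAULT_BLOCK_KIB):
    """Estimate how many fixed-size blocks a target is split into.

    Assumption: a trailing partial block is counted as one block.
    """
    if block_kib <= 0:
        raise ValueError("a block size of zero disables block mode")
    block_bytes = block_kib * KIB
    return -(-target_bytes // block_bytes)  # ceiling division


one_gib = 1024 * MIB
print(estimate_block_count(one_gib))     # 16KiB blocks (default)
print(estimate_block_count(one_gib, 4))  # 4KiB blocks, e.g. for RAM images
```

Going from 16KiB to 4KiB blocks quadruples the number of blocks for the same target.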

Segmentation (-z)

Targets larger than 128MiB are split into 128MiB segments; that is, sdhashing a 256MiB target would result, by default, in 2 hashes.

Example
Assume that 256M.rnd contains 256MiB of random data. Then:
$ sdhash 256M.rnd
sdbf-dd:03:14:256M.rnd.0000M:134217728: ...
sdbf-dd:03:14:256M.rnd.0128M:134217728: ...

$ sdhash 256M.rnd > 256M.rnd.sdbf
$ sdhash -c 256M.rnd.sdbf 256M.rnd.sdbf
256M.rnd.0000M|256M.rnd.0000M|100
256M.rnd.0128M|256M.rnd.0128M|100

Segmentation is controlled by the -z [--segment-size] parameter, where the size is expressed in MiB.

Example
Create & compare the digest with no segmentation:
$ sdhash -z 0 256M.rnd > 256M.rnd.sdbf
$ sdhash -c 256M.rnd.sdbf 256M.rnd.sdbf
256M.rnd|256M.rnd|100

Example
Create & compare the digests with segment size of 64MiB:
$ sdhash -z 64 256M.rnd > 256M.rnd.sdbf
$ sdhash -c 256M.rnd.sdbf 256M.rnd.sdbf
256M.rnd.0000M|256M.rnd.0000M|100
256M.rnd.0064M|256M.rnd.0064M|100
256M.rnd.0128M|256M.rnd.0128M|100
256M.rnd.0192M|256M.rnd.0192M|100
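When post-processing results, it can help to predict which segment names a target will produce. The sketch below reproduces the naming seen in the example output above. The suffix format (starting offset in MiB, zero-padded to four digits, followed by M) is inferred from those examples rather than taken from a specification. The function name is hypothetical, and targets that fit in a single segment are not modeled: the plain name is returned for them.

```python
MIB = 1024 * 1024


def segment_names(name, size_bytes, segment_mib=128):
    """List (segment_name, segment_length) pairs for a target.

    Assumptions: suffix format inferred from example output; a segment
    size of zero, or a target that fits in one segment, yields a single
    entry with the unmodified name.
    """
    if segment_mib == 0 or size_bytes <= segment_mib * MIB:
        return [(name, size_bytes)]
    segment_bytes = segment_mib * MIB
    result = []
    for offset in range(0, size_bytes, segment_bytes):
        length = min(segment_bytes, size_bytes - offset)
        result.append((f"{name}.{offset // MIB:04d}M", length))
    return result


for entry in segment_names("256M.rnd", 256 * MIB, 64):
    print(entry)
```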

Parallel execution (-p)

By default, sdhash will detect the number of hardware-supported concurrent threads and will try to utilize all of them. If this is not the desired behavior, the -p [--threads] option provides a means to specify the exact level of parallelism to be used.

Note

For desktop systems (up to 12-16 cores), using all cores is easily the best choice.

For larger systems (24+ cores) we find that using up to 75% of the available cores always results in improved overall throughput; beyond that, it depends on the hardware and OS.

In our experience, Intel processors provide superior performance to comparable AMD ones; this is particularly true for comparison operations, where the measured throughput is approximately two times that of AMD. (The latter is attributable to the much faster performance of the POPCNT instruction on Intel.)

[?] provides some reference numbers for throughput.
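As a rough illustration of the rule of thumb in the note above, the following sketch picks a value for -p from the core count. The function names are hypothetical, and because the note says nothing about machines with 17 to 23 cores, using all cores in that range is an assumption of this sketch.

```python
import os


def suggested_thread_count(cores=None):
    """Pick a value for -p following the rule of thumb above.

    Assumption: below 24 cores, all cores are used.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    if cores >= 24:
        return max(1, (cores * 3) // 4)  # about 75% of the cores
    return cores


def sdhash_command(targets, cores=None):
    """Build an argument list such as ['sdhash', '-p', '24', 'image.dd']."""
    return ["sdhash", "-p", str(suggested_thread_count(cores)), *targets]


print(sdhash_command(["ram-capture.dd"]))
```

Treat the result as a starting point and measure throughput on your own hardware.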

File list (-f) and recursion (-r)

As mentioned before, globbing (a service provided by the shell) has practical limitations: if the chosen pathname expression results in an argument list that is too long, the execution will fail.

Note

If there is any chance that the set of files you are operating on will exceed 10,000, we strongly recommend that you explicitly generate the list of targets before hashing.

sdhash will happily take a target list in the form of a text file with one file name per line. Once the list is ready, use the -f [--target-list] option to digest it.

Example
Hash all html files in the current subtree (Linux):
$ find . -name '*.html' > file-list
$ sdhash -f file-list
sdbf:03:13:data/014.html:38053:sha1:256:5:7ff:160:3:75:HZvUjQGMcYEfsk8IQgIggA ...
sdbf:03:13:data/056.html:6664:sha1:256:5:7ff:160:1:99:AAAYTGCQsiAOBEkANREgAIh ...
...
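On systems without find, or when the selection logic is more involved, the target list can be produced by a short script. The sketch below writes one matching path per line, which is the format described above. The function name is hypothetical, and file names containing newline characters are not handled.

```python
import fnmatch
import os


def write_target_list(root, pattern, list_path):
    """Write paths of regular files under root whose names match pattern,
    one per line, and return how many were written."""
    count = 0
    with open(list_path, "w", encoding="utf-8", errors="surrogateescape") as out:
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in sorted(filenames):
                path = os.path.join(dirpath, filename)
                # Skip symbolic links and anything that is not a regular file.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                if fnmatch.fnmatch(filename, pattern):
                    out.write(path + "\n")
                    count += 1
    return count


if __name__ == "__main__":
    print(write_target_list(".", "*.html", "file-list"))
```

The resulting file can then be passed to sdhash with the -f option as shown above.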

It is often useful to be able to hash an entire filesystem tree; this can be accomplished with the -r [--deep] option.

Example
Enumerate and hash all files recursively:
$ sdhash -r .
sdbf:03:13:data/014.html:38053:sha1:256:5:7ff:160:3:75:HZvUjQGMcYEfsk8IQgIggA ...
sdbf:03:13:data/056.html:6664:sha1:256:5:7ff:160:1:99:AAAYTGCQsiAOBEkANREgAIh ...
...

Only regular files are hashed; directories, links, and special files are skipped over. This option is designed to hash large directory trees and has been tested on millions of files and partitions of up to 4TB.

Generate & compare (-g)

It is sometimes useful, especially for small trial runs, to hash and compare in one step without leaving any digest file behind. The -g option is a combination of digest generation and single-set comparison. In other words, the command:

$ sdhash -g *.html

is logically equivalent to:

$ sdhash -o temp_file.sdbf *.html
$ sdhash -c temp_file.sdbf
$ rm temp_file.sdbf

Standard input (-) and naming ([--hash-name])

Like many other hashing tools, sdhash can take input from the standard input device using the `-` option.

Example
Concatenate and hash all html files:
$ cat *.html | sdhash -
sdbf-dd:03:11:stdin.0000M:5784540:sha1:256:5:7ff:192:354:16384:7b:HRnEjQCMc ...

By default, the resulting hash is named stdin, which may create problems later on. To specify a custom name, use the [--hash-name] option.

Example
Concatenate and hash all html files; name the result html:
$ cat *.html | sdhash - --hash-name html
sdbf-dd:03:10:html.0000M:5784540:sha1:256:5:7ff:192:354:16384:7b:HRnEjQCMc ...

Note

For hashing from stdin, both block mode and segmentation are enabled and cannot be disabled (they can still be modified within allowable ranges).
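Standard input is also convenient when the data is produced by another program rather than read from a file. The sketch below feeds a byte string to sdhash from a script, using only the `-` and --hash-name options described above. The function names are hypothetical, and hash_bytes requires an sdhash executable on the PATH.

```python
import subprocess


def stdin_command(hash_name=None):
    """Build the argument list for hashing standard input."""
    args = ["sdhash", "-"]
    if hash_name is not None:
        args += ["--hash-name", hash_name]
    return args


def hash_bytes(data, hash_name=None):
    """Feed data to sdhash on standard input and return its output as text.

    Raises subprocess.CalledProcessError if sdhash exits with an error.
    """
    completed = subprocess.run(
        stdin_command(hash_name), input=data, capture_output=True, check=True
    )
    return completed.stdout.decode("utf-8", errors="replace")


print(stdin_command("html"))
```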

Custom configuration ([--config-file])

If you find that the sdhash parameters that you use often differ from the built-in defaults, you should consider saving the combination of preferred parameters in a configuration (config) file. The config file is a plain text file where each line specifies a default parameter value using the long option format. For example:

block-size=16
sample-size=4
thread-count=4
threshold=10

Naming the file sdhash.cfg and placing it in the current directory will cause sdhash to automatically load it and use it, without any additional options. Alternatively, use the [--config-file] option to explicitly point to a config file.

Note

Any option you specify on the command line will always override anything specified in a config file.
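The precedence rule can be pictured as a simple dictionary merge. The sketch below parses name=value lines like the ones shown above and lets command-line values win. It is an illustration only, not how sdhash itself reads the file; the function names are hypothetical, and blank lines are ignored by assumption.

```python
def parse_config(text):
    """Parse 'name=value' lines into a dict of strings.

    Assumption: blank lines are ignored; comment syntax is not modeled.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected name=value, got: {line!r}")
        settings[name.strip()] = value.strip()
    return settings


def effective_settings(config, command_line):
    """Command-line values take precedence over config-file values."""
    merged = dict(config)
    merged.update(command_line)
    return merged


config = parse_config("block-size=16\nthreshold=10\n")
print(effective_settings(config, {"block-size": "4"}))
```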

The simple options

  • --separator sets the separator character used in the comparison results. Valid arguments are pipe (default), csv, and tab.
  • --validate verifies the integrity of an sdbf file: it reads it in and parses the digests as usual but performs no computation on them.
  • --version prints out the version info for the sdhash executable. (Please include the output with any bug reports.)
  • --verbose provides a detailed log of the operations performed by sdhash; all output is sent to standard error.
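The choice of --separator matters mostly when comparison results are consumed by another program. The sketch below splits result lines of the form name|name|score, as shown in the earlier comparison examples. The function name is hypothetical. Mapping csv to a comma and tab to a tab character is an assumption, and names that contain the separator character are not handled.

```python
# Assumption: 'csv' means comma-separated and 'tab' means tab-separated.
DELIMITERS = {"pipe": "|", "csv": ",", "tab": "\t"}


def parse_results(lines, separator="pipe"):
    """Return a list of (first_name, second_name, score) tuples.

    Lines that do not split into exactly three fields are skipped.
    """
    delimiter = DELIMITERS[separator]
    results = []
    for line in lines:
        fields = line.rstrip("\r\n").split(delimiter)
        if len(fields) != 3:
            continue
        results.append((fields[0], fields[1], int(fields[2])))
    return results


print(parse_results(["256M.rnd.0000M|256M.rnd.0000M|100"]))
```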