Once you are comfortable with the basics, it is time to work through the complete set of features that sdhash offers. We will still stick to the command-line utility; client/server operation is discussed in the next chapter.
In the baseline sdhash algorithm, the digests are generated sequentially, with each component filter representing 9-10KB, on average. Such digests are well suited for objects of up to several MB. For larger targets, such as RAM/drive images, sdhash automatically switches to block mode, which allows for better parallel processing and faster comparisons.
In block mode, the target is split into fixed-size blocks (by default, 16KiB), and each one is hashed separately. The size of the block (in KiB) can be specified with the -b [--block-size] option. For example, it is often useful to use a block size of 4KiB for RAM images as it matches the typical page size:
$ sdhash -b 4 ram-capture.dd
The built-in threshold for transitioning to block mode is 16MiB; to disable block mode, specify a block size of zero:
$ sdhash -b 0 ram-capture.dd
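The mode selection described above boils down to a simple size check. The sketch below illustrates the logic; the 16MiB threshold is the built-in value quoted in this chapter, while the 20MiB target size is purely illustrative:

```shell
# Sketch of the mode decision described above; the 16MiB threshold is the
# built-in value from the text, the 20MiB target size is illustrative.
file_size=$((20 * 1024 * 1024))       # size of the target, in bytes
threshold=$((16 * 1024 * 1024))       # built-in block-mode threshold
if [ "$file_size" -gt "$threshold" ]; then
  mode="block"                        # hashed as fixed-size blocks
else
  mode="baseline"                     # hashed sequentially
fi
echo "$mode"                          # → block
```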
Note
Block mode is tuned to work in the 4-16KiB range. Since the granularity of the digests goes down, the results have a somewhat higher false positive rate than the baseline algorithm.
Targets larger than 128MiB are split into 128MiB segments; that is, sdhashing a 256MiB target would result, by default, in 2 hashes.
$ sdhash 256M.rnd
sdbf-dd:03:14:256M.rnd.0000M:134217728: ...
sdbf-dd:03:14:256M.rnd.0128M:134217728: ...
$ sdhash 256M.rnd > 256M.rnd.sdbf
$ sdhash -c 256M.rnd.sdbf 256M.rnd.sdbf
256M.rnd.0000M|256M.rnd.0000M|100
256M.rnd.0128M|256M.rnd.0128M|100
Segmentation is controlled by the -z [--segment-size] parameter, where the size is expressed in MiB.
$ sdhash -z 0 256M.rnd > 256M.rnd.sdbf
$ sdhash -c 256M.rnd.sdbf 256M.rnd.sdbf
256M.rnd|256M.rnd|100
Similarly, specifying a 64MiB segment size yields four segments:
$ sdhash -z 64 256M.rnd > 256M.rnd.sdbf
$ sdhash -c 256M.rnd.sdbf 256M.rnd.sdbf
256M.rnd.0000M|256M.rnd.0000M|100
256M.rnd.0064M|256M.rnd.0064M|100
256M.rnd.0128M|256M.rnd.0128M|100
256M.rnd.0192M|256M.rnd.0192M|100
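The segment names in the listings above encode each segment's starting offset in MiB. The sketch below reproduces the labels from the 64MiB example; the naming pattern is inferred from the sample output above, not from a formal format specification:

```shell
# Reproduce the segment labels from the 64MiB segmentation example: a
# 256MiB target at a 64MiB segment size yields four segments, each named
# after its starting offset in MiB.
size_mib=256
segment_mib=64
offset=0
while [ "$offset" -lt "$size_mib" ]; do
  printf '256M.rnd.%04dM\n' "$offset"
  offset=$(( offset + segment_mib ))
done
```

This prints 256M.rnd.0000M through 256M.rnd.0192M, matching the comparison output shown above.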
By default, sdhash detects the number of hardware-supported concurrent threads and tries to utilize all of them. If this is not the desired behavior, the -p [--threads] option lets you specify the exact level of parallelism.
Note
For desktop systems (up to 12-16 cores), using all cores is easily the best choice.
For larger systems (24+ cores), we find that using up to 75% of the available cores reliably improves overall throughput; beyond that, it depends on the hardware and OS.
In our experience, Intel processors provide superior performance to comparable AMD ones; this is particularly true for comparison operations, where the measured throughput is approximately two times that of AMD. (The latter is attributable to the much faster performance of the POPCNT instruction on Intel.)
[?] provides some reference numbers for throughput.
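On a large machine, the 75% guideline above reduces to a one-line calculation; this sketch assumes nproc (from GNU coreutils) is available, and the /evidence path is a placeholder target:

```shell
# Sketch: derive a thread count from the 75% guideline for large systems;
# nproc (GNU coreutils) reports the number of available cores.
cores=$(nproc)
threads=$(( cores * 3 / 4 ))
if [ "$threads" -lt 1 ]; then
  threads=1                           # never go below one thread
fi
echo "sdhash -p $threads -r /evidence"   # /evidence is a placeholder
```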
As mentioned before, globbing, a service provided by the shell, has practical limitations; if the chosen pathname expression results in an argument list that is too long, the execution will fail.
Note
If there is any chance that the set of files you are operating on will exceed 10,000, we strongly recommend that you explicitly generate the list of targets before hashing.
sdhash will happily take a target list in the form of a text file with one file name per line. Once the list is ready, use the -f [--target-list] option to digest it.
$ find . -name '*.html' > file-list
$ sdhash -f file-list
sdbf:03:13:data/014.html:38053:sha1:256:5:7ff:160:3:75:HZvUjQGMcYEfsk8IQgIggA ...
sdbf:03:13:data/056.html:6664:sha1:256:5:7ff:160:1:99:AAAYTGCQsiAOBEkANREgAIh ...
...
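Note that the pattern passed to find should be quoted so the shell hands it over unexpanded. The sketch below makes the example self-contained; the demo/ tree and its file names are fabricated for illustration:

```shell
# Build a target list without relying on shell globbing; the demo/ tree
# and file names below are made up for illustration.
mkdir -p demo/data
touch demo/data/014.html demo/data/056.html demo/data/notes.txt
find demo -name '*.html' | sort > file-list
wc -l < file-list
```

Here wc reports two entries, and sdhash -f file-list would then digest the two HTML files while skipping the text file.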
It is often useful to be able to hash an entire filesystem tree; this can be accomplished with the -r [--deep] option.
$ sdhash -r .
sdbf:03:13:data/014.html:38053:sha1:256:5:7ff:160:3:75:HZvUjQGMcYEfsk8IQgIggA ...
sdbf:03:13:data/056.html:6664:sha1:256:5:7ff:160:1:99:AAAYTGCQsiAOBEkANREgAIh ...
...
Only regular files are hashed; directories, links, and special files are skipped. This option is designed to hash large directory trees and has been tested on millions of files and on partitions of up to 4TB.
It is sometimes useful, especially for small trial runs, to hash and compare in one step without leaving any digest file behind. The -g option is a combination of digest generation and single-set comparison. In other words, the command:
$ sdhash -g *.html
is logically equivalent to:
$ sdhash -o temp_file.sdbf *.html
$ sdhash -c temp_file.sdbf
$ rm temp_file.sdbf
Like many other hashing tools, sdhash can take input from the standard input device using the `-` option.
$ cat *.html | sdhash -
sdbf-dd:03:11:stdin.0000M:5784540:sha1:256:5:7ff:192:354:16384:7b:HRnEjQCMc ...
By default, the resulting hash is named stdin, which may create problems later on. To specify a custom name, use the [--hash-name] option.
$ cat *.html | sdhash - --hash-name html
sdbf-dd:03:10:html.0000M:5784540:sha1:256:5:7ff:192:354:16384:7b:HRnEjQCMc ...
Note
For hashing from stdin, both block mode and segmentation are enabled and cannot be disabled (they can still be modified within allowable ranges).
If you find that the sdhash parameters that you use often differ from the built-in defaults, you should consider saving the combination of preferred parameters in a config(uration) file. The config file is a plain text file where each line specifies a default parameter value using the long option format. E.g.:
block-size=16
sample-size=4
thread-count=4
threshold=10
Naming the file sdhash.cfg and placing it in the current directory will cause sdhash to automatically load it and use it, without any additional options. Alternatively, use the [--config-file] option to explicitly point to a config file.
Note
Any option you specify on the command line will always override anything specified in a config file.