.. include:: ../sdhash-macros.rst

Understanding your options
==========================

Once you are comfortable with the basics, it is time to work through the
complete set of features that |sdhash| offers. We will still stick to the
command-line utility--client/server operation is discussed in the next chapter.

Block-aligned hashes (``-b``)
-----------------------------

In the baseline |sdhash| algorithm, the digests are generated sequentially,
with each component filter representing 9-10KB, *on average*. Such digests are
very suitable for objects of up to several MB.

For larger targets, such as RAM/drive images, |sdhash| automatically switches to
*block* mode, which allows for better parallel processing and faster
comparisons. In block mode, the target is split into fixed-size blocks (by
default, 16KiB), and each one is hashed separately.

The size of the block (in KiB) can be specified with the ``-b [--block-size]``
option. For example, it is often useful to use a block size of 4KiB for RAM
images as it matches the typical page size::

   $ sdhash -b 4 ram-capture.dd

The built-in threshold for transitioning to block mode is 16MiB; to *disable
block mode*, specify a block size of zero::

   $ sdhash -b 0 ram-capture.dd

.. note::
   Block mode is tuned to work in the 4-16KiB range. Since the granularity of
   the digests goes down, the results have a somewhat higher false positive
   rate than the baseline algorithm.

Segmentation (``-z``)
---------------------

Targets larger than 128MiB are split into 128MiB segments; that is, sdhashing a
256MiB target would result, by default, in 2 hashes.

|ex| Assume that ``256M.rnd`` contains 256MiB of random data. Then:

.. code::

   $ sdhash 256M.rnd
   sdbf-dd:03:14:256M.rnd.0000M:134217728: ...
   sdbf-dd:03:14:256M.rnd.0128M:134217728: ...
   $ sdhash 256M.rnd > 256M.rnd.sdbf
   $ sdhash -c 256M.rnd.sdbf 256M.rnd.sdbf
   256M.rnd.0000M|256M.rnd.0000M|100
   256M.rnd.0128M|256M.rnd.0128M|100

Segmentation is controlled by the ``-z [--segment-size]`` parameter, where the
size is expressed in MiB.

|ex| Create & compare the digest with **no segmentation**:

.. code::

   $ sdhash -z 0 256M.rnd > 256M.rnd.sdbf
   $ sdhash -c 256M.rnd.sdbf 256M.rnd.sdbf
   256M.rnd|256M.rnd|100

|ex| Create & compare the digests with **segment size of 64MiB**:

.. code::

   $ sdhash -z 64 256M.rnd > 256M.rnd.sdbf
   $ sdhash -c 256M.rnd.sdbf 256M.rnd.sdbf
   256M.rnd.0000M|256M.rnd.0000M|100
   256M.rnd.0064M|256M.rnd.0064M|100
   256M.rnd.0128M|256M.rnd.0128M|100
   256M.rnd.0192M|256M.rnd.0192M|100

Parallel execution (``-p``)
---------------------------

By default, |sdhash| will detect the number of hardware-supported concurrent
threads and will try to utilize all of them. If this is not the desired
behavior, the ``-p [--threads]`` option provides a means to specify the exact
level of parallelism to be used.

.. note::
   For desktop systems (up to 12-16 cores), using *all* cores is easily the
   best choice. For larger systems (24+ cores) we find that using up to 75% of
   the available cores always results in improved overall throughput; beyond
   that, it depends on the hardware and OS.

   In our experience, Intel processors provide superior performance to
   comparable AMD ones; this is particularly true for comparison operations,
   where the measured throughput is approximately *two times* that of AMD.
   (The latter is attributable to the much faster performance of the
   ``POPCNT`` instruction on Intel.) [?] provides some reference numbers for
   throughput.

File list (``-f``) and recursion (``-r``)
-----------------------------------------

As mentioned before, globbing--a service provided by the shell--has practical
limitations, and if the chosen pathname expression results in an argument list
that is too long, the execution will fail.

.. note::
   If there is any chance that the set of files you are operating on will
   exceed 10,000, we strongly recommend that you explicitly generate the list
   of targets before hashing.

|sdhash| will happily take a target list in the form of a text file with one
file name per line. Once the list is ready, use the ``-f [--target-list]``
option to digest it.

|ex| Hash all html files in the current subtree (Linux):

.. code::

   $ find . -name '*.html' > file-list
   $ sdhash -f file-list
   sdbf:03:13:data/014.html:38053:sha1:256:5:7ff:160:3:75:HZvUjQGMcYEfsk8IQgIggA ...
   sdbf:03:13:data/056.html:6664:sha1:256:5:7ff:160:1:99:AAAYTGCQsiAOBEkANREgAIh ...
   ...

It is often useful to be able to hash an entire filesystem tree; this can be
accomplished with the ``-r [--deep]`` option.

|ex| Enumerate and hash *all* files recursively:

.. code::

   $ sdhash -r .
   sdbf:03:13:data/014.html:38053:sha1:256:5:7ff:160:3:75:HZvUjQGMcYEfsk8IQgIggA ...
   sdbf:03:13:data/056.html:6664:sha1:256:5:7ff:160:1:99:AAAYTGCQsiAOBEkANREgAIh ...
   ...

Only proper files are hashed--directories, links, and special files are skipped
over. This option *is* designed to hash large directory trees and has been
tested on millions of files and partitions of up to 4TB.

Generate & compare (``-g``)
---------------------------

It is sometimes useful, especially for small trial runs, to hash and compare in
one step without leaving any digest file behind. The ``-g`` option is a
combination of digest generation and single-set comparison. In other words, the
command::

   $ sdhash -g *.html

is logically equivalent to:

.. code::

   $ sdhash -o temp_file.sdbf *.html
   $ sdhash -c temp_file.sdbf
   $ rm temp_file.sdbf

Standard input (``-``) and naming (``[--hash-name]``)
-----------------------------------------------------

Like many other hashing tools, |sdhash| can take input from the standard input
device using the ``-`` option.

|ex| Concatenate and hash all html files:

.. code::

   $ cat *.html | sdhash -
   sdbf-dd:03:11:stdin.0000M:5784540:sha1:256:5:7ff:192:354:16384:7b:HRnEjQCMc ...

By default, the resulting hash is named *stdin*, which may create problems
later on. To specify a custom name, use the ``[--hash-name]`` option.

|ex| Concatenate and hash all html files; name the result ``html``:

.. code::

   $ cat *.html | sdhash - --hash-name html
   sdbf-dd:03:10:html.0000M:5784540:sha1:256:5:7ff:192:354:16384:7b:HRnEjQCMc ...

.. note::
   For hashing from ``stdin``, both block mode and segmentation are enabled and
   *cannot* be disabled (they can still be modified within allowable ranges).

Custom configuration (``[--config-file]``)
------------------------------------------

If you find that the |sdhash| parameters that you use often differ from the
built-in defaults, you should consider saving the combination of preferred
parameters in a config(uration) file. The config file is a plain text file
where each line specifies a default parameter value using the long option
format. E.g.::

   block-size=16
   sample-size=4
   thread-count=4
   threshold=10

Naming the file ``sdhash.cfg`` and placing it in the current directory will
cause |sdhash| to automatically load it and use it, without any additional
options. Alternatively, use the ``[--config-file]`` option to explicitly point
to a config file.

.. note::
   Any option you specify on the command line will always override anything
   specified in a config file.

The simple options
------------------

* ``--separator`` sets the separator character used in the comparison results.
  Valid arguments are ``pipe`` (default), ``csv``, and ``tab``.

* ``--validate`` verifies the integrity of an ``sdbf`` file--it reads it in and
  parses the digests as usual but performs no computation on them.

* ``--version`` prints out the version info for the |sdhash| executable.
  (Please include the output with any bug reports.)

* ``--verbose`` provides a detailed log of the operations performed by
  |sdhash|; all output is sent to standard error.
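
Because comparison results are plain delimited text, they are easy to post-process with standard tools. The following is a minimal sketch, not part of |sdhash| itself: it assumes the default ``pipe`` separator and the ``name|name|score`` layout shown in the comparison examples earlier in this chapter. The sample lines, the ``filter_matches`` helper, and the cutoff of 21 are hypothetical and chosen only for illustration.

```shell
# Hypothetical sample of comparison output in the default pipe-separated
# format (name|name|score); the file names and scores here are made up.
results='256M.rnd.0000M|256M.rnd.0000M|100
data/014.html|data/056.html|42
data/014.html|data/077.html|7'

# Hypothetical helper: keep only the pairs whose score is at or above the
# cutoff passed as the first argument. "$3 + 0" forces a numeric comparison.
filter_matches() {
    awk -F'|' -v min="$1" '$3 + 0 >= min { print $1, $2, $3 + 0 }'
}

# Print the pairs scoring 21 or higher (the cutoff is an arbitrary choice).
printf '%s\n' "$results" | filter_matches 21
```

In practice the ``printf`` stage would be replaced by the output of a comparison run; if a different ``--separator`` is selected, the ``-F`` argument to ``awk`` would need to change to match.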