.. include:: ../sdhash-macros.rst

Understanding your options
==========================
Once you are comfortable with the basics, it is time to work through the complete set of features that |sdhash| offers. We will still stick to the command-line utility--client/server operation is discussed in the next chapter.

Block-aligned hashes (``-b``)
-----------------------------
In the baseline |sdhash| algorithm, the digests are generated sequentially, with each component filter representing 9-10KB, *on average*. Such digests are well suited to objects of up to several MB. For larger targets, such as RAM/drive images, |sdhash| automatically switches to *block* mode, which allows for better parallel processing and faster comparisons.

In block mode, the target is split into fixed-size blocks (by default, 16KiB), and each one is hashed separately. The size of the block (in KiB) can be specified with the ``-b [--block-size]`` option. For example, it is often useful to use a block size of 4KiB for RAM images as it matches the typical page size::

  $ sdhash -b 4 ram-capture.dd
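
If the capture was taken on the machine doing the hashing, the block size can be derived from the local page size instead of being hard-coded. The following is a minimal sketch under that assumption; it only prints the command it would run, and ``ram-capture.dd`` is the placeholder name from the example above.

.. code:: shell

  # Assumption: the image comes from this host, so its page size applies.
  page_bytes=$(getconf PAGESIZE 2>/dev/null || echo 4096)
  block_kib=$(( page_bytes / 1024 ))      # -b expects a size in KiB
  [ "$block_kib" -ge 1 ] || block_kib=4   # fall back to the typical 4KiB
  echo "sdhash -b $block_kib ram-capture.dd"
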

The built-in threshold for transitioning to block mode is 16MiB; to *disable block mode*, specify a block size of zero::

  $ sdhash -b 0 ram-capture.dd

.. note::
  Block mode is tuned to work in the 4-16KiB range. Since the granularity of the digests goes down, the results have a somewhat higher false positive rate than the baseline algorithm.

Segmentation (``-z``)
---------------------
Targets larger than 128MiB are split into 128MiB segments; that is, sdhashing a 256MiB target would result, by default, in 2 hashes. 

|ex| 
  Assume that ``256M.rnd`` contains 256MiB of random data. Then:

.. code::

  $ sdhash 256M.rnd
  sdbf-dd:03:14:256M.rnd.0000M:134217728: ...
  sdbf-dd:03:14:256M.rnd.0128M:134217728: ...

  $ sdhash 256M.rnd > 256M.rnd.sdbf
  $ sdhash -c 256M.rnd.sdbf 256M.rnd.sdbf 
  256M.rnd.0000M|256M.rnd.0000M|100
  256M.rnd.0128M|256M.rnd.0128M|100

Segmentation is controlled by the ``-z [--segment-size]`` parameter, where the size is expressed in MiB.

|ex|
  Create & compare the digest with **no segmentation**:

.. code::

  $ sdhash -z 0 256M.rnd > 256M.rnd.sdbf
  $ sdhash -c 256M.rnd.sdbf 256M.rnd.sdbf 
  256M.rnd|256M.rnd|100

|ex|
  Create & compare the digests with **segment size of 64MiB**:

.. code::

  $ sdhash -z 64 256M.rnd > 256M.rnd.sdbf
  $ sdhash -c 256M.rnd.sdbf 256M.rnd.sdbf
  256M.rnd.0000M|256M.rnd.0000M|100
  256M.rnd.0064M|256M.rnd.0064M|100
  256M.rnd.0128M|256M.rnd.0128M|100
  256M.rnd.0192M|256M.rnd.0192M|100
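
The number of digests to expect for a given segment size is a simple ceiling division. The sketch below works it out for a 256MiB target and 64MiB segments, and prints the segment names using the pattern inferred from the sample output above. That a partial final segment gets its own digest is an assumption, not something stated here.

.. code:: shell

  target_mib=256
  segment_mib=64
  # Ceiling division: assumes a partial last segment is hashed on its own.
  count=$(( (target_mib + segment_mib - 1) / segment_mib ))
  echo "$count segments"
  offset=0
  while [ "$offset" -lt "$target_mib" ]; do
    # Name pattern inferred from the sample output: <file>.<offset in MiB>M
    printf '256M.rnd.%04dM\n' "$offset"
    offset=$(( offset + segment_mib ))
  done
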

Parallel execution (``-p``)
---------------------------
By default, |sdhash| will detect the number of hardware-supported concurrent threads and will try to utilize all of them. If this is not the desired behavior, the ``-p [--threads]`` option provides a means to specify the exact level of parallelism to be used.

.. note:: 
  For desktop systems (up to 12-16 cores), using *all* cores is easily the best choice. 
  
  For larger systems (24+ cores) we find that using up to 75% of the available cores always results in improved overall throughput; beyond that, it depends on the hardware and OS.

In our experience, Intel processors provide superior performance to comparable AMD ones; this is particularly true for comparison operations, where the measured throughput is approximately *two times* that of AMD. (The latter is attributable to the much faster performance of the ``POPCNT`` instruction on Intel.)

[?] provides some reference numbers for throughput.
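
The sizing guidance in the note above could be scripted roughly as follows. This is a sketch, not a recommendation from the tool itself: the 24-core cutoff comes from the note, machines between 16 and 24 cores are assumed to use all cores, and the command is only printed.

.. code:: shell

  # Number of online cores; falls back to 1 if getconf cannot report it.
  cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
  if [ "$cores" -ge 24 ]; then
    threads=$(( cores * 3 / 4 ))   # large system: start at 75% of the cores
  else
    threads=$cores                 # desktop-class system: use every core
  fi
  echo "sdhash -p $threads -r ."
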

File list (``-f``) and recursion (``-r``)
-----------------------------------------
As mentioned before, globbing--a service provided by the shell--has practical limitations, and if the chosen pathname expression results in an argument list that is too long, the execution will fail.

.. note:: If there is any chance that the set of files you are operating on will exceed 10,000, we strongly recommend that you explicitly generate the list of targets before hashing.

|sdhash| will happily take a target list in the form of a text file with one file name per line. Once the list is ready, use the ``-f [--target-list]`` option to digest it.

|ex| 
  Hash all html files in the current subtree (Linux):

.. code::

  $ find . -name '*.html' > file-list
  $ sdhash -f file-list
  sdbf:03:13:data/014.html:38053:sha1:256:5:7ff:160:3:75:HZvUjQGMcYEfsk8IQgIggA ...
  sdbf:03:13:data/056.html:6664:sha1:256:5:7ff:160:1:99:AAAYTGCQsiAOBEkANREgAIh ...
  ...
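
Before handing a list to ``-f``, it can help to confirm that it contains what you expect. The sketch below builds a throwaway tree (the file names are borrowed from the sample output, the tree itself is made up), writes the list, and counts the entries. Restricting the search with ``-type f`` is an added assumption that only regular files are wanted.

.. code:: shell

  workdir=$(mktemp -d)                       # scratch area for the demo
  mkdir -p "$workdir/data"
  : > "$workdir/data/014.html"
  : > "$workdir/data/056.html"
  : > "$workdir/data/notes.txt"              # must not end up in the list
  # Quote the pattern so that find, not the shell, expands it.
  find "$workdir" -type f -name '*.html' > "$workdir/file-list"
  count=$(wc -l < "$workdir/file-list")
  echo "targets listed: $count"
  echo "sdhash -f $workdir/file-list"        # the command one would then run
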

It is often useful to be able to hash an entire filesystem tree; this can be accomplished with the ``-r [--deep]`` option.

|ex|
  Enumerate and hash *all* files recursively:
  
.. code::

  $ sdhash -r .
  sdbf:03:13:data/014.html:38053:sha1:256:5:7ff:160:3:75:HZvUjQGMcYEfsk8IQgIggA ...
  sdbf:03:13:data/056.html:6664:sha1:256:5:7ff:160:1:99:AAAYTGCQsiAOBEkANREgAIh ...
  ...

Only proper files are hashed--directories, links, and special files are skipped over. This option *is* designed to hash large directory trees and has been tested on millions of files and partitions of up to 4TB.

Generate & compare (``-g``)
---------------------------
It is sometimes useful, especially for small trial runs, to hash and compare in one step without leaving any digest file behind. The ``-g`` option is a combination of digest generation and single-set comparison. In other words, the command::

  $ sdhash -g *.html

is logically equivalent to:

.. code::

  $ sdhash -o temp_file.sdbf *.html
  $ sdhash -c temp_file.sdbf
  $ rm temp_file.sdbf
  
  
Standard input (``-``) and naming (``[--hash-name]``)
-----------------------------------------------------
Like many other hashing tools, |sdhash| can take input from the standard input device using the ``-`` option.

|ex|
  Concatenate and hash all html files:
  
.. code::

  $ cat *.html | sdhash -
  sdbf-dd:03:11:stdin.0000M:5784540:sha1:256:5:7ff:192:354:16384:7b:HRnEjQCMc ...
        
By default, the resulting hash is named *stdin*, which may create problems later on. To specify a custom name, use the ``[--hash-name]`` option.

|ex|
  Concatenate and hash all html files; name the result ``html``:

.. code::

  $ cat *.html | sdhash - --hash-name html
  sdbf-dd:03:10:html.0000M:5784540:sha1:256:5:7ff:192:354:16384:7b:HRnEjQCMc ...
        
.. note::
  For hashing from ``stdin``, both block mode and segmentation are enabled and *cannot* be disabled (they can still be modified within allowable ranges).

Custom configuration (``[--config-file]``)
------------------------------------------
If you find that the |sdhash| parameters that you use often differ from the built-in defaults, you should consider saving your preferred combination of parameters in a configuration (config) file. The config file is a plain text file where each line specifies a default parameter value using the long option format. For example::

  block-size=16
  sample-size=4
  threads=4
  threshold=10

Naming the file ``sdhash.cfg`` and placing it in the current directory will cause |sdhash| to automatically load it and use it, without any additional options. Alternatively, use the ``[--config-file]`` option to explicitly point to a config file.

.. note::
  Any option you specify on the command line will always override anything specified in a config file.
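
As a rough illustration of the override rule, the sketch below writes a small config file and prints a command that raises the threshold for a single run. The file location, the ``set.sdbf`` name, and the ``--threshold`` command-line spelling (taken to mirror the config key) are all assumptions made for the example.

.. code:: shell

  cfg=$(mktemp)                    # hypothetical location for the config file
  printf '%s\n' 'block-size=4' 'threshold=10' > "$cfg"
  # The value given on the command line (20) would win over the file's (10).
  echo "sdhash --config-file $cfg --threshold 20 -c set.sdbf"
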

The simple options
------------------
* ``--separator`` sets the separator character used in the comparison results. Valid arguments are ``pipe`` (default), ``csv``, and ``tab``.

* ``--validate`` verifies the integrity of an ``sdbf`` file--it reads it in and parses the digests as usual but performs no computation on them.

* ``--version`` prints out the version info for the |sdhash| executable. (Please include the output with any bug reports.)

* ``--verbose`` provides a detailed log of the operations performed by |sdhash|; all output is sent to standard error.
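
Because comparison results are plain separated text, they are easy to post-process with standard tools. A minimal sketch with the default ``pipe`` separator is shown below; the two result lines are made-up sample data in the format shown earlier, and the cutoff of 50 is arbitrary.

.. code:: shell

  # Made-up comparison results: name|name|score
  results='256M.rnd.0000M|256M.rnd.0000M|100
  data/014.html|data/056.html|12'
  # Strip the indentation, then keep only pairs scoring at least 50.
  strong=$(printf '%s\n' "$results" | sed 's/^ *//' |
    awk -F'|' '$3 >= 50 { print $1, $2, $3 }')
  echo "$strong"
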