Case studies

GPU acceleration

We have ported the sdhash comparison algorithm to run on the GPU. We currently support CUDA devices of compute capability 2.0 and higher, and actively develop on both Tesla and Kepler platforms.

The usage differs slightly from the standard version of sdhash, because the GPU version is best suited to processing large quantities of sdhash files. Using this tool with full comparisons, we have been able to search a reference set of 1.6TB.

sdhash-gpu BETA by Candice Quates, April 2013
Usage: sdhash-gpu -d dev -r ref.sdbf -t targ.sdbf
Each reference set is processed as a whole entity.
Target set processing compares each object to the current reference set.
Configuration:
  -d [ --device ] arg (=0)      CUDA device to use (0,1,2 etc)
  -r [ --reference-set ] arg    File or directory of fixed-size reference set(s)
                                (data size 640MB ideal, 32MB minimum)
  -t [ --target-set ] arg       File or directory of variable-sized search
                                target set(s)
  -c [ --confidence ] arg (=1)  confidence level of results
  --verbose                     debugging and progress output
  --version                     show version info
  -h [ --help ]                 produce help message

Sampling (-s)

In some common scenarios, we can speed up comparison processing 10 to 20 times by sampling the query digests. In other words, instead of using the entire object digest, sdhash can take a sample of it and search, in effect, for only part of the data.

Such an approach is very effective when we are trying to establish whether, for example, a file can be found in a RAM/disk capture. In such cases, we generally expect that the file is either present or not, so sampling is a natural way to speed up processing.

Let us work through an example. In this case, we will construct a loose (but workable) approximation of a capture using the t5 test data set ([?]). The set consists of 4,457 data files (html, txt, doc, xls, etc.) for a total of 1.9GB of data.

We concatenate all the files to simulate the capture, then run sdhash on both the individual files and the capture. Assuming all the files are in the ./t5 directory, the following bash commands illustrate the idea:

$ cat t5/* > t5.dd
$ sdhash t5/* > t5.files.sdbf
$ sdhash -z 0 t5.dd > t5.dd.sdbf

(We turn off segmentation (-z 0) to simplify the comparison output.) Now we can time the execution of the comparison and count the number of files matched with the following command:

$ time sdhash -c t5.files.sdbf t5.dd.sdbf | wc -l

We find that 4,456 out of 4,457 comparisons yield a positive result. On our development server (24 threads, 2.9GHz Intel), this completes in 119.2 seconds. Then we time the sampled executions with:

$ time sdhash -s <n> -c t5.files.sdbf t5.dd.sdbf | wc -l

where <n> is 2, 4, 8, ..., 1024. The results (the <n> = 0 row is the unsampled run timed above):

==== ========== =========
<n>  time (s)   speedup
==== ========== =========
0    119.20     1.00
2    3.50       34.06
4    5.90       20.20
8    8.90       13.39
16   13.80      8.64
32   21.70      5.49
64   32.20      3.70
128  45.60      2.61
256  63.90      1.87
512  90.20      1.32
1024 108.60     1.10
==== ========== =========
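The speedup column is simply the unsampled time divided by the sampled time. As a quick check, the following sketch recomputes it from the timings in the table; the function and variable names are ours and are not part of sdhash.

```python
# Wall-clock times in seconds, copied from the table above.
# Key 0 is the unsampled run; the other keys are the -s <n> values.
times = {
    0: 119.20, 2: 3.50, 4: 5.90, 8: 8.90, 16: 13.80, 32: 21.70,
    64: 32.20, 128: 45.60, 256: 63.90, 512: 90.20, 1024: 108.60,
}


def speedups(measured, baseline_key=0):
    """Return baseline_time / time for every entry, rounded to two decimals."""
    baseline = measured[baseline_key]
    return {n: round(baseline / t, 2) for n, t in measured.items()}


table = speedups(times)
for n, factor in table.items():
    print(f"{n:<4} {times[n]:<10.2f} {factor:.2f}")
```

To reuse this on your own hardware, replace the values in `times` with your measurements.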

In our experience, -s 4 gives the best trade-off between speed and accuracy. In this example, we find that using a sample of four yields the following number of weak scores:

score =  0:  1
score < 10:  4
score < 15:  8
score < 20: 33

Thus, choosing a significance threshold of 15 would yield a true positive rate of 99.8% in exchange for a 20-fold speedup.
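The 99.8% figure follows directly from the weak-score counts. The sketch below works it out under two assumptions: every one of the 4,457 files is truly present in the capture (it was built by concatenating them), and the counts listed are cumulative, so a file scoring below the threshold counts as a miss. The names used are illustrative only.

```python
TOTAL_FILES = 4457  # files in the t5 set, all present in the simulated capture

# Weak-score counts reported for the -s 4 run:
# threshold -> number of files scoring strictly below it (assumed cumulative).
BELOW_THRESHOLD = {10: 4, 15: 8, 20: 33}


def true_positive_rate(threshold):
    """Fraction of files still reported as matches at the given threshold."""
    missed = BELOW_THRESHOLD[threshold]
    return (TOTAL_FILES - missed) / TOTAL_FILES


for threshold in sorted(BELOW_THRESHOLD):
    print(f"threshold {threshold}: {true_positive_rate(threshold):.1%}")
```

With a threshold of 15, 8 of the 4,457 files fall below the cutoff, which gives 4,449 / 4,457, or about 99.8%.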

In general, even if you end up using unsampled runs, sampling gives you the option of performing a quick preliminary scan to get a feel for your targets.

Note

An additional benefit of sampling is that it makes the timing of the execution more predictable and easier to parallelize. If we observe the load that a long comparison (such as the one in the example) places on the hardware, we notice that in the beginning all cores are 100% utilized. After a certain point, due to uneven task distribution, some cores finish their computation while others continue. Usually, there are a few stragglers that can considerably prolong the completion of the overall workload, even though 99%+ of the computation is already done.
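The straggler effect can be illustrated with a toy scheduling model. This is not a measurement of sdhash: the worker count and task durations below are made up, and the greedy "next free worker" policy is only an assumption about how work is handed out.

```python
import heapq


def makespan(task_durations, num_workers):
    """Greedy scheduling: each task goes to the worker that frees up first."""
    free_at = [0.0] * num_workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for duration in task_durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


WORKERS = 24  # hypothetical worker count

# Two hypothetical workloads with the same total amount of work (2400 units).
uniform = [1.0] * 2400               # evenly sized tasks
skewed = [1.0] * 2280 + [30.0] * 4   # a few long tasks arrive last

for name, tasks in (("uniform", uniform), ("skewed", skewed)):
    ideal = sum(tasks) / WORKERS
    actual = makespan(tasks, WORKERS)
    print(f"{name}: ideal {ideal:.1f}, actual {actual:.1f}")
```

In this model the uniform workload finishes at the ideal time of 100 units, while the skewed one takes 125 units: after 95 units, 95% of the work is done, but four workers are still busy with the long tasks and the other twenty sit idle.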

Indexing [--index][--index-dir arg]

Indexing is an experimental feature that uses a large Bloom filter to represent groups of files, (at the moment) 640MB at a time. Hashing a large quantity of files with [--index] will group them into 640MB .sdbf files with corresponding .sdbf-idx index files, with no guidance necessary from the user.

The indexes can be searched by passing a directory argument to [--index-dir] and a list of files to be searched. The index directory should contain .sdbf and corresponding .sdbf-idx files. Only search results, not hashing output, are produced by searching indexes.
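For readers unfamiliar with Bloom filters, the following generic sketch shows the idea of summarizing a group of items in one bit array and asking whether an item might be in the group. It is not the .sdbf-idx format or sdhash's implementation; the class, its parameters, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # SHA-256 yields 32 bytes, enough for at most 8 four-byte slices.
        if not 1 <= num_hashes <= 8:
            raise ValueError("num_hashes must be between 1 and 8")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        """Derive num_hashes bit positions from one SHA-256 digest."""
        digest = hashlib.sha256(item).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical usage: one filter summarizes a whole group of items.
group_filter = BloomFilter()
for feature in (b"feature-a", b"feature-b", b"feature-c"):
    group_filter.add(feature)

print(b"feature-a" in group_filter)  # an inserted item is always reported
```

A lookup that returns False means the item is definitely absent from the group; a True result means it may be present and would warrant a full comparison.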

Note

The best-fitting reference set size for the GPU is also our default size for indexed sets. Hashing a large quantity of files with the [--index] option will generate a directory full of optimally sized .sdbf reference files for the GPU.

Client/server processing

To be continued...