``gold-miner``: analyzes unknown traffic
-----------------------------------------------

The ``gold-miner`` tool is the core of the package that takes a
training profile created using both gold-miner-trainer_ and
gold-miner-trainer-aggregator_ and uses it to try and predict an
unknown traffic source.

.. _gold-miner-trainer: goldminertrainer.html
.. _gold-miner-trainer-aggregator: goldminertraineraggregator.html

Example Invocation
^^^^^^^^^^^^^^^^^^^^

The following example command line processes a PCAP file containing
unknown data (*unknown.pcap*) using a training-profile created from
the gold-miner-trainer_ gold-miner-trainer-aggregator_ tools.  It
specifically looks for the `mail` label.  Note that multiple labels
can be passed to the `-g` flag in order to compare various values to
determine what the *best guess* might be.

::

   gold-miner -r unknown.pcap -p training-profile.fsdb -g mail

Example Output
^^^^^^^^^^^^^^^^

The output includes a bunch of columns in a tab-separated file (called
an FSDB_ file).  Example output may look like:

Interpreting Tab-Separated Value Output
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The output of this utility by default is a FSDB_ formatted dataset
containing (see below for turning on json output instead):


.. _FSDB: https://fsdb.readthedocs.io/

::

   #fsdb -F t timestamp:d identifier token confidence total:l
   1612276702.252115	(50, '10.0.3.2', '10.0.6.2')	email-client	0.019922011881270296	1
   1612276702.252115	(50, '10.0.3.2', '10.0.6.2')	email-server	0.09949107222576414	1
   1612276702.252115	(50, '10.0.3.2', '10.0.6.2')	https-client	0.1108216386959876	1
   1612276702.252115	(50, '10.0.3.2', '10.0.6.2')	https-server	0.0	1
   1612276702.252115	(50, '10.0.3.2', '10.0.6.2')	ftp-client	0.0	1
   1612276702.252115	(50, '10.0.3.2', '10.0.6.2')	ftp-server	0.0	1
   ...
   1612276796.641313	(50, '10.0.3.2', '10.0.6.2')	email-client	0.7168803043442527	1400
   1612276796.641313	(50, '10.0.3.2', '10.0.6.2')	email-server	0.21853768852005073	1400
   1612276796.641313	(50, '10.0.3.2', '10.0.6.2')	https-client	0.09337264365783604	1400
   1612276796.641313	(50, '10.0.3.2', '10.0.6.2')	https-server	0.3331007977800903	1400
   1612276796.641313	(50, '10.0.3.2', '10.0.6.2')	ftp-client	0.06126245747562031	1400
   1612276796.641313	(50, '10.0.3.2', '10.0.6.2')	ftp-server	0.16666655917372097	1400

The columns in question contain:

1. a packet timestamp
2. an identifier (5-tuple, 3-tuple or IPSec specific)
3. a token being searched for (eg: “mail”)
4. a confidence value 0-1
5. the packet counts seen per identifier so far

Example Graph
^^^^^^^^^^^^^^^^

The `multi-key-graph` tool that comes with the `multikeygraph` python
package can be used to graph the results:

::

   multi-key-graph -k token -c confidence -o graph.png gold-mine-output.fsdb

.. image:: ../tande-example/tande-example/b1f1fc5a469f926f506c1a1520b0f613f4ac2df146f4fba7b4e365bbbece6d15.test.0.png

This example graph shows that after a number of packets the
`email-client` label becomes the most likely prediction among the
options being graphed.

Selecting a sub-algorithm to use
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``gold-miner`` supports four different (sub-)algorithms for identifying
traffic:

-  comparison
-  comparison-wide
-  linear
-  lms

The following algorithms are available for use:

algorithm: comparison
^^^^^^^^^^^^^^^^^^^^^^

This is the default, and works best with entirely labeled traffic with
no unknown traffic expected. It works by comparing an unknown flow
against all known profiles to differentiate among the different types in
the training profile. Thus, it will not work when applied to a traffic
sample with an unprofiled traffic flow within it.

algorithm: linear
^^^^^^^^^^^^^^^^^^^^

The ``linear`` algorithm calculates the difference from a given flow vs
the training profile, regardless of what the other training flows use.
This may succeed at times when the ``comparison`` algorithm doesn’t,
especially in cases of unknown traffic being mixed in with the traffic
being prioritized.

algorithm: lms
^^^^^^^^^^^^^^^^^^^^^

The ``lms`` algorithm is similar to the ``linear`` algorithm, but uses
the common square of the difference instead of a linear distance. These
two algorithms usually perform closely together in performance but one
may be better than another.

algorithm: comparison-wide
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is rarely the right algorithm to use, but is left in for the
moment. It may go away in the future.

JSON output
^^^^^^^^^^^

The ``gold-miner`` tool can also output a stream JSON records if that’s
easier to parse. Run ``gold-miner`` with ``-j`` to enable this
feature, or ``-J`` to output a flattened JSON output.

Command Line Arguments
^^^^^^^^^^^^^^^^^^^^^^

.. sphinx_argparse_cli::
   :module: apropos.goldminer.tools.goldminer
   :func: parse_args
   :hook:
   :prog: introduction