Series of articles

Threat detection through ZAT combined with machine learning (1)

Threat detection through ZAT combined with machine learning (3)

Let me briefly describe several machine learning packages:

The pandas library is a data analysis library used in machine learning; we can simply think of it as Excel.

The sklearn library (scikit-learn) is an essential package for machine learning. It integrates a variety of algorithms, including unsupervised learning, supervised learning, etc.

The matplotlib library is a Python plotting library (it is used for drawing charts and figures).

NumPy is a Python array library. Its built-in mathematical functions can perform matrix operations and n-dimensional array operations on data. Readers who want to quickly pick up the underlying mathematics can read the official documentation of this library directly. NumPy arrays can be plotted with matplotlib.

Machine learning algorithm on dns.log data

Import related libraries

Load the Zeek data set and print the data
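As a rough illustration of this step, here is a pandas-only sketch that parses a Zeek-style tab-separated log into a DataFrame. It is a stand-in, not zat's own loader: the `read_zeek_log` helper and the tiny sample log are hypothetical, and a real dns.log has many more fields.

```python
import io

import pandas as pd

# Hypothetical, heavily trimmed dns.log excerpt (a real Zeek dns.log has many more fields).
FIELDS = ["ts", "uid", "id.orig_h", "proto", "query", "qtype_name", "Z"]
ROWS = [
    ["1467845463.0", "CuidA", "192.168.1.7", "udp", "example.com", "A", "0"],
    ["1467845464.5", "CuidB", "192.168.1.7", "tcp", "example.org", "AAAA", "0"],
    ["1467845466.2", "CuidC", "192.168.1.9", "udp", "-", "-", "1"],
]
SAMPLE_LOG = "\n".join(
    ["#separator \\x09", "#fields\t" + "\t".join(FIELDS)]
    + ["\t".join(row) for row in ROWS]
    + ["#close\t2016-07-06-23-00-00"]
)


def read_zeek_log(handle):
    """Parse a tab-separated Zeek log into a DataFrame of strings (hypothetical helper)."""
    fields, records = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]  # column names follow the "#fields" tag
        elif line and not line.startswith("#"):
            records.append(line.split("\t"))  # ordinary data row
    if fields is None:
        raise ValueError("no #fields header found")
    return pd.DataFrame(records, columns=fields)


zeek_df = read_zeek_log(io.StringIO(SAMPLE_LOG))
# Convert the columns we care about from strings to proper types.
zeek_df["ts"] = pd.to_datetime(zeek_df["ts"].astype(float), unit="s")
zeek_df["Z"] = zeek_df["Z"].astype(int)
print(zeek_df.head())
```

To read a file on disk instead, the same helper could be called with an open file handle.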

pandas can efficiently compute additional metrics from the data. Here we use pandas/NumPy vectorized operations to calculate the query length.
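A minimal sketch of the query-length calculation, assuming the DataFrame has a `query` column as in Zeek's dns.log. The three rows below are made-up sample values.

```python
import pandas as pd

# Made-up sample queries; in practice this column comes from dns.log.
zeek_df = pd.DataFrame(
    {"query": ["example.com", "-", "a-very-long-subdomain-label.example.org"]}
)

# Vectorized string length: one call for the whole column, no Python loop.
zeek_df["query_length"] = zeek_df["query"].str.len()
print(zeek_df)
```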

Since DNS data contains both numeric values and text (categorical) values, we need a way to handle them uniformly. zat has a DataFrameToMatrix class that handles numeric and categorical data and takes care of many of the details and mechanics; we will use it below.

Then use the zat scikit-learn transformer class to convert the pandas DataFrame into a NumPy ndarray (matrix)

Use the zat DataFrameToMatrix class to handle the categorical data. Categorical columns should be converted explicitly before they are sent to the transformer.
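To make the idea concrete, here is a pandas/NumPy sketch of one common way to turn mixed numeric and categorical columns into a single matrix: min-max scale the numeric columns and one-hot encode the categorical ones. This is an illustration under my own assumptions, not zat's implementation, and the sample rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows with two categorical and two numeric features.
df = pd.DataFrame(
    {
        "proto": ["udp", "udp", "tcp", "udp"],
        "qtype_name": ["A", "AAAA", "A", "TXT"],
        "query_length": [11, 14, 9, 60],
        "Z": [0, 0, 0, 1],
    }
)

# Explicit conversion: mark the text columns as categorical up front.
for col in ["proto", "qtype_name"]:
    df[col] = df[col].astype("category")

# Numeric columns: min-max scale to [0, 1] (guarding against zero range).
numeric = df.select_dtypes(include="number").astype(float)
span = (numeric.max() - numeric.min()).replace(0, 1.0)
numeric_scaled = (numeric - numeric.min()) / span

# Categorical columns: one-hot encode, one 0/1 column per category value.
categorical = pd.get_dummies(df.select_dtypes(include="category"), dtype=float)

# Stack everything side by side into one ndarray for scikit-learn.
matrix = np.hstack([numeric_scaled.to_numpy(), categorical.to_numpy()])
print(matrix.shape)
```

Scaling matters here because KMeans is distance based: without it, a raw length of 60 would dominate the 0/1 indicator columns.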

Now that we have a NumPy ndarray (matrix), we can move on to sklearn

Now we start with scikit-learn. The following is just a simple example, namely KMeans clustering and a t-SNE projection
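A minimal sketch of this step with scikit-learn. The feature matrix is synthetic (five artificial groups of points) because the article's data is not available here, and the `perplexity`, `n_init`, and `random_state` values are my own choices rather than settings from the original.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Synthetic stand-in for the DNS feature matrix: 5 groups of 12 points, 6 features.
rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 1.0, size=(5, 6))
matrix = np.vstack([c + 0.02 * rng.normal(size=(12, 6)) for c in centers])

# Cluster in the full feature space.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(matrix)

# Project to 2-D for plotting only; perplexity must be below the sample count.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
projection = tsne.fit_transform(matrix)

print(labels[:10])
print(projection.shape)
```

Note that the clustering runs on the original matrix; t-SNE is only used to get x/y coordinates for a picture.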

Plot machine learning results
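A plotting sketch with matplotlib, assuming you already have 2-D coordinates and a cluster label per row. Both are generated synthetically below, and the figure is written to an in-memory buffer so the example does not depend on a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, safe on headless machines
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the 2-D projection and the cluster labels.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(5), 10)
points = rng.normal(size=(50, 2)) + labels[:, None] * 3.0

fig, ax = plt.subplots(figsize=(6, 4))
for cluster in np.unique(labels):
    mask = labels == cluster
    # One scatter call per cluster gives each cluster its own color and legend entry.
    ax.scatter(points[mask, 0], points[mask, 1], s=20, label=f"cluster {cluster}")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("DNS clusters (synthetic data)")
ax.legend()

buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

In a notebook you would normally call `plt.show()` instead of saving to a buffer.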

Let’s investigate the 5 DNS data clusters

We fed a lot of features into the clustering algorithm, both numeric and categorical. Did the clustering “do the right thing”? First, please note the following:

Obviously, we are processing a small amount of Zeek DNS data

This is an example showing how the conversion works (from Zeek to Pandas to Scikit)

The DNS data is real data, but for this example and others we have deliberately introduced some unusual traffic

We know that K in KMeans should be 5 🙂

Okay, with all of these caveats in mind, let us see how the clustering handled the mix of numeric and categorical data.

Cluster 0: (42 observations) looks like a “normal” DNS request

Cluster 1: (11 observations) All queries are “-” (Zeek for NA/not found/etc)

Cluster 2: (6 observations) The protocol is TCP instead of ordinary UDP

Cluster 3: (4 observations) All DNS queries are abnormally long

Cluster 4: (4 observations) The reserved Z bit is set to 1 (required to be 0)
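A breakdown like the one above can be produced with a pandas `groupby` on the cluster label. The sketch below uses a tiny made-up DataFrame (not the article's data or counts) with a hypothetical `cluster` column holding the KMeans labels.

```python
import pandas as pd

# Made-up rows; "cluster" stands in for the labels returned by KMeans.
df = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 2, 3, 4],
        "proto": ["udp", "udp", "udp", "tcp", "udp", "udp"],
        "query": [
            "example.com",
            "example.org",
            "-",
            "example.net",
            "x" * 80 + ".example.com",
            "example.com",
        ],
        "Z": [0, 0, 0, 0, 0, 1],
    }
)
df["query_length"] = df["query"].str.len()
df["is_tcp"] = df["proto"].eq("tcp")

# Print a few rows per cluster for eyeballing.
for cluster, group in df.groupby("cluster"):
    print(f"Cluster {cluster}: ({len(group)} observations)")
    print(group[["proto", "query_length", "Z"]].head())

# One summary row per cluster makes the distinguishing feature easy to spot.
summary = df.groupby("cluster").agg(
    observations=("query", "size"),
    mean_query_length=("query_length", "mean"),
    tcp_share=("is_tcp", "mean"),
    max_Z=("Z", "max"),
)
print(summary)
```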

Numeric + categorical = AOK

Using our sample data, we have successfully gone from Zeek log to pandas to scikit-learn. The clusters appear meaningful. From an investigation and threat hunting point of view, being able to cluster the data and use PCA for dimensionality reduction may be useful, depending on your use case.
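For reference, PCA-based dimensionality reduction in scikit-learn looks roughly like this. The matrix is random stand-in data, and reducing to two components is an assumption made for plotting purposes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for a feature matrix: 60 rows, 8 features.
rng = np.random.default_rng(0)
matrix = rng.normal(size=(60, 8))

# Keep the two directions that capture the most variance.
pca = PCA(n_components=2)
reduced = pca.fit_transform(matrix)

print(reduced.shape)
# Fraction of total variance retained by each kept component.
print(pca.explained_variance_ratio_)
```

Unlike t-SNE, PCA is a linear projection, so the fitted `pca` object can be reused to transform new rows with `pca.transform`.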

To sum up

The next article will briefly introduce how to use Spark to process large batches of Zeek data.
