Series of articles
Threat detection through ZAT combined with machine learning (1)
Here is a brief description of the machine learning packages used:
The pandas library is a data analysis library; you can think of it as a programmable Excel.
The sklearn (scikit-learn) library is the standard machine learning package. It bundles a wide range of algorithms, covering both unsupervised and supervised learning.
The matplotlib library is Python's plotting library.
NumPy is Python's array library. It provides n-dimensional arrays, matrix operations, and a large set of built-in mathematical functions. Readers who want to pick up the underlying math quickly can read the library's official documentation. NumPy data can be visualized with matplotlib.
Machine learning algorithm on dns.log data
Import related libraries
Load the Zeek data set and display the data
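The loading step might look like the following minimal sketch. It assumes the zat package is installed and uses its LogToDataFrame class; the helper name load_zeek_log and the dns.log path are placeholders of ours, not taken from the original article.

```python
def load_zeek_log(path):
    """Read a Zeek log file into a pandas DataFrame via zat."""
    # Imported inside the function so the sketch only needs zat when called.
    from zat.log_to_dataframe import LogToDataFrame

    return LogToDataFrame().create_dataframe(path)


# Usage (the path is a placeholder for your own Zeek DNS log):
#   zeek_df = load_zeek_log("dns.log")
#   print(zeek_df.head())
```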
Pandas makes it efficient to compute additional metrics from the data. Here we use vectorized pandas/NumPy operations to calculate the query length.
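As a small illustration of the vectorized approach, the sketch below computes the length of each query string. The three sample rows are a hypothetical stand-in for real dns.log data, and the column name query_length is our own choice.

```python
import pandas as pd

# Hypothetical stand-in for a few dns.log rows; a real frame would come from zat.
dns_df = pd.DataFrame(
    {"query": ["example.com", "-", "a.very.long.subdomain.example.org"]}
)

# Vectorized string length: no explicit Python loop over the rows.
dns_df["query_length"] = dns_df["query"].str.len()
print(dns_df)
```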
Since DNS data contains both numeric values and text, we need a uniform way to operate on them. zat has a DataFrameToMatrix class that takes care of many of the details of handling numeric and categorical data, and we will use it below.
Then use the zat scikit-learn transformer class to convert the pandas DataFrame to a NumPy ndarray (matrix).
Use the zat DataFrameToMatrix class to process the categorical data, explicitly converting those columns to a categorical type before passing them to the transformer.
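The conversion can be sketched as follows. The function to_matrix_with_zat assumes zat is installed and calls DataFrameToMatrix().fit_transform. The function to_matrix_approx is our own simplified, pandas-only stand-in that mimics the general idea (one-hot encoding for categorical columns, min-max scaling for numeric ones); it is not zat's implementation and its output may differ. The sample rows are hypothetical.

```python
import pandas as pd


def to_matrix_with_zat(df):
    """Convert a mixed-type DataFrame to a NumPy matrix with zat's transformer."""
    # Lazy import: zat is only needed when this function is actually called.
    from zat.dataframe_to_matrix import DataFrameToMatrix

    return DataFrameToMatrix().fit_transform(df)


def to_matrix_approx(df):
    """Pandas-only stand-in: one-hot encode categoricals, min-max scale numerics."""
    parts = []
    for col in df.columns:
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            parts.append(pd.get_dummies(df[col], prefix=col).astype(float))
        else:
            values = df[col].astype(float)
            span = values.max() - values.min()
            # A constant column carries no information, so map it to zeros.
            scaled = (values - values.min()) / span if span else values * 0.0
            parts.append(scaled.to_frame())
    return pd.concat(parts, axis=1).to_numpy()


# Hypothetical sample rows with one categorical and one numeric feature.
features = pd.DataFrame({"proto": ["udp", "udp", "tcp"], "query_length": [11, 1, 33]})
# Explicit conversion, so nothing has to guess which columns are categorical.
features["proto"] = features["proto"].astype("category")

matrix = to_matrix_approx(features)
print(matrix.shape)
```

With zat available, to_matrix_with_zat(features) would be the call to use instead of the stand-in.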
Now that we have a NumPy ndarray matrix, we can move on to sklearn.
Now we start with scikit-learn. The following is just a simple example, namely KMeans clustering and a TSNE projection.
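A minimal sketch of this step is shown below. Because the real feature matrix is not reproduced here, it uses a synthetic matrix of 60 rows and 6 columns as a stand-in; the parameter values (n_init, perplexity, random_state) are our assumptions and not necessarily the ones used in the original article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Synthetic stand-in for the feature matrix: 5 groups of 12 rows, 6 features.
rng = np.random.default_rng(0)
centers = rng.uniform(0, 1, size=(5, 6))
matrix = np.vstack([c + rng.normal(0, 0.02, size=(12, 6)) for c in centers])

# Assign each row to one of 5 clusters.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(matrix)

# Project the rows to 2D for plotting; perplexity must be below the row count.
projection = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(matrix)

print(labels[:10])
print(projection.shape)
```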
Plot machine learning results
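One way to draw such a plot is a scatter of the 2D projection, colored by cluster. The sketch below uses randomly generated points and labels as hypothetical stand-ins for the TSNE output and the KMeans labels.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical 2D projection and cluster labels (10 points per cluster).
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(5), 10)
points = rng.normal(size=(50, 2)) + labels[:, None] * 3.0

fig, ax = plt.subplots(figsize=(8, 6))
for cluster_id in np.unique(labels):
    mask = labels == cluster_id
    ax.scatter(points[mask, 0], points[mask, 1], label=f"cluster {cluster_id}", alpha=0.7)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
# Call fig.savefig("clusters.png") or plt.show() to view the result.
```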
Let’s investigate 5 DNS data clusters
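Inspecting the clusters usually comes down to grouping the DataFrame by its cluster label and printing a few rows per group. The sketch below assumes a column named cluster holding the KMeans labels; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows with a cluster label already attached.
dns_df = pd.DataFrame(
    {
        "query": ["example.com", "-", "example.org", "-", "mail.example.com"],
        "proto": ["udp", "udp", "tcp", "udp", "udp"],
        "cluster": [0, 1, 2, 1, 0],
    }
)

cluster_sizes = dns_df.groupby("cluster").size()

# Print the size and a few sample rows of each cluster.
for cluster_id, group in dns_df.groupby("cluster"):
    print(f"\nCluster {cluster_id}: {len(group)} observations")
    print(group[["query", "proto"]].head())
```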
We fed a lot of features into the clustering algorithm, both numerical and categorical. Did the clustering “do the right thing”? First, please note the following caveats:
Obviously, we are processing a small amount of Zeek DNS data
This is an example showing how the conversion works (from Zeek to Pandas to Scikit)
The DNS data is real, but for this example and others we have deliberately included some unusual records
We know that K in KMeans should be 5 🙂
Okay, with all of those caveats stated, let us see how the clustering combined numerical and categorical data.
Cluster 0: (42 observations) looks like a “normal” DNS request
Cluster 1: (11 observations) All queries are “-” (Zeek for NA/not found/etc)
Cluster 2: (6 observations) The protocol is TCP instead of ordinary UDP
Cluster 3: (4 observations) All DNS queries are abnormally long
Cluster 4: (4 observations) The reserved Z bit is set to 1 (required to be 0)
Numerical + categorical = AOK
With our sample data, we have successfully gone from a Zeek log to pandas to scikit-learn, and several of the clusters are meaningful. From an investigation and threat hunting point of view, clustering combined with PCA for dimensionality reduction may be useful, depending on your use case.
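For readers who want to try PCA, a minimal sketch with scikit-learn follows. The random matrix is a hypothetical stand-in for the feature matrix, and reducing to 2 components is our assumption for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the feature matrix: 60 rows, 6 features.
rng = np.random.default_rng(0)
matrix = rng.normal(size=(60, 6))

# Keep the 2 directions that explain the most variance.
pca = PCA(n_components=2)
reduced = pca.fit_transform(matrix)

print(reduced.shape)
print(pca.explained_variance_ratio_)
```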
To sum up
The next article will briefly introduce how to use Spark to process large batches of Zeek data.