Threat detection through ZAT combined with machine learning (3)


Series of articles
Threat detection through ZAT combined with machine learning (1)

Threat detection through ZAT combined with machine learning (2)

From zeek log to Parquet

First, a brief introduction to Parquet. Parquet is a columnar storage format whose goal is to make compressed, efficient columnar data available to any project in the Hadoop ecosystem.

Software

Zeek Analysis Tools (ZAT)

Parquet

Spark

Data

A Zeek conn.log dataset of approximately 23 million rows

Import the required libraries

Start Spark with 4 parallel executors

Here we build a local Spark instance with 4 parallel executors. I am running this on a Mac; for a dedicated Spark server I would recommend at least 8 cores. The following code starts the 4 executors and loads the conn.log data into Spark.

Spark workers and data partitions

Spark will read in and partition the data across our workers. Our dataframe's underlying RDD will be split into partitions across the worker pool, and each worker operates only on its own portion of the data, as shown below.

spark_df.rdd.getNumPartitions()

Convert my Zeek logs to Parquet files

Apache Parquet is a columnar storage format focused on performance. Converting Zeek logs into Parquet files takes only one line of code, and since the conversion runs on Spark's distributed executors, it is highly scalable.

Parquet files are compressed
Here we can see that Parquet stores data in a compressed columnar format; there are several compression options to choose from.

The original conn.log data is about 2 GB

About 420 MB after conversion to Parquet

Now that we have loaded the Parquet data into Spark, let's demonstrate some simple Spark operations.

First, get some basic information about the Spark dataframe.

Number of Rows: 22694356
Columns: ts,uid,id_orig_h,id_orig_p,id_resp_h,id_resp_p,
proto,service,duration,orig_bytes,resp_bytes,conn_state,
local_orig,missed_bytes,history,orig_pkts,orig_ip_bytes,
resp_pkts,resp_ip_bytes,tunnel_parents

The following query runs across the 4 executors. The data contains more than 22 million Zeek conn log entries, and the query completes in about one second on a Mac.

Let's take a look at each host, grouped by port and service.

To sum up
Spark has a powerful SQL engine as well as a machine learning library. Now that we have loaded the data into a Spark DataFrame, in the next chapter we will use Spark SQL commands to perform some analysis and do clustering with Spark MLlib.
