Series of articles
Threat detection through ZAT combined with machine learning (1)
Threat detection through ZAT combined with machine learning (2)
From Zeek logs to Parquet
First, a brief introduction to Parquet: it is a columnar storage format designed so that any project in the Hadoop ecosystem can work with compressed, efficient columnar data.
Software
Zeek Analysis Tools (ZAT)
Parquet
Spark
Data
A Zeek conn.log dataset of approximately 23 million rows
Import the required libraries
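A minimal sketch of the imports, assuming the zat package and PySpark are installed (these names are assumptions, not the exact snippet from the original post):

# Imports assumed for this walkthrough: the zat package and PySpark
from zat import log_to_sparkdf          # ZAT helper that reads Zeek logs into Spark dataframes
from pyspark.sql import SparkSession    # entry point for the local Spark session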
Start Spark with 4 parallel executors
Here we build a local Spark server with 4 parallel executors. I am running this on a Mac; for a real Spark server I recommend a machine with at least 8 cores. The following code starts the 4 executors and loads the conn.log data into Spark.
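A minimal sketch of that step, assuming ZAT's log_to_sparkdf helper and a placeholder path for the conn.log file:

# Start a local Spark session with 4 parallel executors
spark = SparkSession.builder.appName('zeek_to_parquet').master('local[4]').getOrCreate()

# Load the ~23 million row conn.log into a Spark dataframe with ZAT
# ('/path/to/conn.log' is a placeholder path)
spark_it = log_to_sparkdf.LogToSparkDF(spark)
spark_df = spark_it.create_dataframe('/path/to/conn.log')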
Spark workers and data partitioning
Spark reads in the data and partitions it across our workers. The dataframe's underlying RDD is split into partitions spread over the worker pool, so each worker operates on only part of the data, as shown below.
spark_df.rdd.getNumPartitions()
Convert my Zeek logs to Parquet files
Apache Parquet is a columnar storage format focused on performance. Converting Zeek logs into Parquet files takes only one line of code, and because the conversion runs on the Spark distributed executors, it scales extremely well.
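A sketch of that one line, with a placeholder output path:

# Write the Spark dataframe out as Parquet; the work is distributed across the executors
spark_df.write.parquet('conn.parquet')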
Parquet files are compressed
Here we can see that Parquet stores the data in a compressed columnar format, and there are several compression options to choose from.
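For example, a codec can be requested explicitly when writing (snappy is Spark's default; gzip trades CPU time for a smaller file). The path is a placeholder:

# Write the Parquet file with an explicitly chosen compression codec
spark_df.write.parquet('conn.parquet', compression='gzip')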
The original conn.log data was about 2 GB
About 420 MB after conversion to Parquet
Now that we have loaded the Parquet data back into Spark, we can demonstrate some simple Spark operations.
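A sketch of that load step, reusing the placeholder path from above:

# Read the Parquet file back into a Spark dataframe
spark_df = spark.read.parquet('conn.parquet')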
First, get some basic information about the Spark dataframe
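A sketch of the calls that produce the summary below:

# Print the row count and column names of the Spark dataframe
print('Number of Rows: {:d}'.format(spark_df.count()))
print('Columns: {:s}'.format(','.join(spark_df.columns)))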
Number of Rows: 22694356
Columns: ts,uid,id_orig_h,id_orig_p,id_resp_h,id_resp_p,proto,service,duration,orig_bytes,resp_bytes,conn_state,local_orig,missed_bytes,history,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,tunnel_parents
The following query runs on the 4 executors. The data contains more than 22 million Zeek conn log entries, and the query completes in only about one second on a Mac.
Let’s take a look at each host, grouped by port and service
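A sketch of that query, using the flattened column names shown above (the exact grouping in the original may differ):

# Count connections per originating host, destination port, and service,
# showing the most frequent combinations first
spark_df.groupby('id_orig_h', 'id_resp_p', 'service').count() \
        .sort('count', ascending=False).show(10)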
To sum up
Spark has a powerful SQL engine and machine learning library. Now that we have the data loaded into a Spark dataframe, in the next chapter we will use Spark SQL commands to perform some analysis and do clustering with Spark MLlib.