Series of articles
Threat detection through ZAT combined with machine learning (2)
Machine learning overview
In machine learning, how we process security data is extremely important. Different types of attack data call for different algorithms, but the general process is divided into the following steps:
Feature dimensionality reduction
This article mainly introduces how to analyze Zeek's full-traffic data with ZAT.
Zeek is an open source network intrusion detection system (NIDS), and at present it is mostly used in the risk control business of Internet companies. ZAT (Zeek Analysis Tools) is a Python toolkit for analyzing Zeek output.
The ZAT toolkit offers a variety of ways to process Zeek output, including:
Process log data with dynamic polling (tailing)
Convert Zeek records to a Pandas DataFrame for use with Scikit-Learn
Dynamically monitor files.log and run VirusTotal queries
Dynamically monitor http.log and display "uncommon" user agents
Run Yara signatures on the extracted files
Check x509 certificates
Process Zeek's dhcp.log data
Output each record as a dictionary with a timestamp
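As a dependency-free sketch of what a log reader does with dhcp.log, the following parses Zeek's default tab-separated format into one dictionary per record and converts the `ts` field to a `datetime`. It is a standard-library stand-in, not ZAT's own reader; the function name `read_zeek_log` and the sample record are illustrative assumptions.

```python
import io
from datetime import datetime, timezone


def read_zeek_log(handle):
    """Yield one dict per record from a Zeek log in the default TSV format."""
    fields = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines: only '#fields' is needed to name the columns.
            if line.startswith("#fields"):
                fields = line.split("\t")[1:]
            continue
        row = dict(zip(fields, line.split("\t")))
        # Zeek writes 'ts' as epoch seconds; turn it into a UTC datetime.
        if "ts" in row:
            row["ts"] = datetime.fromtimestamp(float(row["ts"]), tz=timezone.utc)
        yield row


# Illustrative sample (made-up values), trimmed to a few dhcp.log columns.
SAMPLE = (
    "#separator \\x09\n"
    "#path\tdhcp\n"
    "#fields\tts\tuids\tclient_addr\tmac\thost_name\n"
    "#types\ttime\tset[string]\taddr\tstring\tstring\n"
    "1500000000.000000\tCabc123\t192.168.1.10\t00:11:22:33:44:55\tlaptop\n"
)

for record in read_zeek_log(io.StringIO(SAMPLE)):
    print(record)
```

To follow a live log instead of a string, the same generator could be fed an open file handle that is re-read as Zeek appends to it.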
Process Zeek's dns.log data and use Pandas to output the contents of dns.log
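Loading dns.log into a DataFrame can be sketched with plain Pandas as below. This is a stand-in for ZAT's own DataFrame conversion and only assumes Zeek's default TSV layout; the helper name `zeek_log_to_df` and the sample rows are made up for illustration.

```python
import io

import pandas as pd


def zeek_log_to_df(text):
    """Build a DataFrame from the text of a Zeek TSV log (default format assumed)."""
    # Column names come from the '#fields' header line.
    fields = next(
        line.split("\t")[1:]
        for line in text.splitlines()
        if line.startswith("#fields")
    )
    df = pd.read_csv(
        io.StringIO(text),
        sep="\t",
        comment="#",       # skips header lines (and truncates at any '#' inside data)
        names=fields,
        na_values=["-"],   # Zeek's default marker for unset fields
    )
    # 'ts' holds epoch seconds; convert it and use it as the index.
    df["ts"] = pd.to_datetime(df["ts"], unit="s")
    return df.set_index("ts")


# Illustrative sample (made-up values), trimmed to a few dns.log columns.
SAMPLE = (
    "#fields\tts\tuid\tid.orig_h\tquery\tqtype_name\trcode_name\n"
    "#types\ttime\tstring\taddr\tstring\tstring\tstring\n"
    "1500000000.0\tC1\t192.168.1.10\texample.com\tA\tNOERROR\n"
    "1500000005.0\tC2\t192.168.1.11\texample.org\tAAAA\t-\n"
)

df = zeek_log_to_df(SAMPLE)
print(df[["id.orig_h", "query", "qtype_name", "rcode_name"]])
```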
Next, we use sklearn to split the data set into training and test sets; part of the code is shown here
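Splitting a data set with sklearn usually goes through `train_test_split`. The sketch below uses a tiny made-up feature table rather than the article's data; the feature columns, the labels, and the 75/25 split ratio are all assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Made-up feature rows: [query_length, answer_count]; labels: 0 = benign, 1 = suspicious.
X = [[11, 1], [14, 2], [52, 0], [9, 1], [61, 0], [13, 3], [48, 0], [10, 2]]
y = [0, 0, 1, 0, 1, 0, 1, 0]

# Hold out 25% of the rows for testing. stratify keeps the class ratio similar
# in both parts, and random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(len(X_train), "training rows,", len(X_test), "test rows")
```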
Run malicious-file lookups against Zeek's files.log; part of the code is shown here. vt_query is the helper library used to query VirusTotal
The sha256 or sha1 hash of each file is queried against the VirusTotal service
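The hash-selection and lookup logic can be sketched without touching the network. In the code below, `lookup` is a hypothetical callable standing in for the VirusTotal client, `fake_lookup` returns canned data, and the `positives` key and the `min_positives` threshold are assumptions about the report format rather than something taken from the article.

```python
def pick_hash(row):
    """Prefer sha256, fall back to sha1; Zeek writes '-' for unset fields."""
    for key in ("sha256", "sha1"):
        value = row.get(key, "-")
        if value and value != "-":
            return value
    return None


def flag_files(rows, lookup, min_positives=1):
    """Return (hash, report) pairs with at least min_positives detections.

    `lookup` is a hypothetical stand-in for the VirusTotal client: it takes a
    hash string and returns a dict that contains a 'positives' count.
    """
    cache = {}     # query each distinct hash only once
    flagged = []
    for row in rows:
        file_hash = pick_hash(row)
        if file_hash is None or file_hash in cache:
            continue
        cache[file_hash] = lookup(file_hash)
        if cache[file_hash].get("positives", 0) >= min_positives:
            flagged.append((file_hash, cache[file_hash]))
    return flagged


def fake_lookup(file_hash):
    # Canned responses for the demo; a real client would call VirusTotal here.
    known_bad = {"aaa111": {"positives": 12, "total": 60}}
    return known_bad.get(file_hash, {"positives": 0, "total": 60})


# Made-up files.log rows with placeholder hash values.
rows = [
    {"fuid": "F1", "sha1": "aaa111", "sha256": "-"},
    {"fuid": "F2", "sha1": "bbb222", "sha256": "ccc333"},
    {"fuid": "F3", "sha1": "-", "sha256": "-"},
]
print(flag_files(rows, fake_lookup))
```

Caching by hash keeps the number of lookups down, which matters when the remote service enforces a request quota.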
Query Zeek's http.log; the focus here is the data in the User-Agent (UA) header
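One simple way to surface "uncommon" user agents is to count them and report the rare ones. The sketch below does that over already-parsed http.log rows; the rarity threshold `max_count` and the sample rows are assumptions, and ZAT's own example may define "uncommon" differently.

```python
from collections import Counter


def uncommon_user_agents(rows, max_count=1):
    """Return (user_agent, count) pairs seen at most max_count times, rarest first."""
    counts = Counter(
        row["user_agent"]
        for row in rows
        if row.get("user_agent") not in (None, "-")   # skip unset values
    )
    rare = [(ua, n) for ua, n in counts.items() if n <= max_count]
    return sorted(rare, key=lambda item: (item[1], item[0]))


# Made-up http.log rows reduced to the one column used here.
rows = [
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    {"user_agent": "curl/7.58.0"},
    {"user_agent": "python-requests/2.22.0"},
    {"user_agent": "-"},
]

for user_agent, count in uncommon_user_agents(rows):
    print(count, user_agent)
```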
Use Yara to dynamically monitor the extract_files directory. When Zeek drops an extracted file into the directory, the code runs a set of Yara rules against that file
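The monitoring half of this step can be sketched with the standard library: poll the directory, and hand each newly seen file to a scanner. In the code below, `scan_file` is a hypothetical byte-pattern matcher standing in for a real Yara rule set, and the marker string and file names are invented; a real deployment would call a Yara library at that point.

```python
import os
import tempfile


def poll_new_files(directory, seen):
    """Return paths of regular files in `directory` not seen before, and record them."""
    new_paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and path not in seen:
            seen.add(path)
            new_paths.append(path)
    return new_paths


def scan_file(path, patterns):
    """Hypothetical stand-in for a Yara match: names of byte patterns found in the file."""
    with open(path, "rb") as handle:
        data = handle.read()
    return [name for name, needle in patterns.items() if needle in data]


# Invented marker used only for this demo.
PATTERNS = {"demo_marker": b"MALICIOUS-DEMO-MARKER"}

with tempfile.TemporaryDirectory() as extract_dir:
    seen = set()
    with open(os.path.join(extract_dir, "extract-1.bin"), "wb") as handle:
        handle.write(b"header MALICIOUS-DEMO-MARKER trailer")
    for path in poll_new_files(extract_dir, seen):
        print(os.path.basename(path), scan_file(path, PATTERNS))
    # A second poll reports nothing until another file arrives.
    print(poll_new_files(extract_dir, seen))
```

In continuous use, `poll_new_files` would be called in a loop with a short sleep between passes.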
Detect the domain names and run a VirusTotal check on these URLs
When your machine accesses uni10.tk, the output is as follows
Some phishing or malicious website traffic is encrypted, so its content cannot be inspected directly. For this traffic we can use the certificate data in x509.log to judge whether it is suspicious.
After running, the output is as follows
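One possible form of the certificate check is to match the issuer field against a watchlist. The sketch below works on already-parsed x509.log rows; the watchlist contents and the sample certificates are assumptions for illustration, and a hit is only a lead for further checks, since most certificates from any given issuer are legitimate.

```python
def suspicious_certs(rows, issuer_watchlist):
    """Return x509.log rows whose issuer contains any watched substring (case-insensitive)."""
    hits = []
    for row in rows:
        issuer = row.get("certificate.issuer", "").lower()
        if any(term.lower() in issuer for term in issuer_watchlist):
            hits.append(row)
    return hits


# Made-up rows using two of Zeek's x509.log column names.
rows = [
    {
        "certificate.subject": "CN=login-example.tk",
        "certificate.issuer": "CN=R3,O=Let's Encrypt,C=US",
    },
    {
        "certificate.subject": "CN=www.example.org",
        "certificate.issuer": "CN=Example Corporate CA,O=Example Corp,C=US",
    },
]

# Assumed watchlist: issuers we want to review more closely.
WATCHLIST = ["Let's Encrypt"]

for row in suspicious_certs(rows, WATCHLIST):
    print("review:", row["certificate.subject"], "issued by", row["certificate.issuer"])
```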
For anomaly detection, we can use the Isolation Forest algorithm to find anomalies. Once anomalies are found, we can use a clustering algorithm to organize them into groups, so that the analyst can browse the output by group instead of reading it line by line.
Output the anomaly groups
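The two stages can be sketched with sklearn as follows. The data is synthetic (not Zeek output), and the contamination rate, the choice of KMeans as the clustering algorithm, and the number of clusters are all assumptions made for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Synthetic two-column feature matrix: 200 "normal" rows near the origin plus
# two small groups of far-away rows that play the role of anomalies.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
group_a = rng.normal(loc=[20.0, 20.0], scale=0.3, size=(3, 2))
group_b = rng.normal(loc=[-20.0, 15.0], scale=0.3, size=(3, 2))
X = np.vstack([normal, group_a, group_b])

# Step 1: Isolation Forest labels each row 1 (inlier) or -1 (anomaly).
forest = IsolationForest(contamination=0.05, random_state=0)
labels = forest.fit_predict(X)
anomaly_idx = np.where(labels == -1)[0]

# Step 2: cluster only the anomalous rows so they can be reviewed group by group.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X[anomaly_idx])

for cluster_id in range(3):
    members = anomaly_idx[clusters == cluster_id]
    print("group", cluster_id, "rows", members.tolist())
```

With real log data, categorical columns would first need to be encoded as numbers before either model can be fitted.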
Detect Tor traffic and count port numbers. Tor traffic is identified by iterating over Zeek's ssl.log file; part of the code is shown here.
The output is as follows:
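A rough sketch of the Tor check and port count over parsed ssl.log rows is given below. The certificate-name patterns are an assumed heuristic (a `www.<name>.com` issuer paired with a `www.<name>.net` subject), not a rule taken from the article, and the `issuer` and `subject` columns are assumed to be present, which depends on the Zeek version and configuration.

```python
import re
from collections import Counter

# Assumed heuristic: 'www.<name>.com' issuer paired with a 'www.<name>.net' subject.
ISSUER_RE = re.compile(r"CN=www\.\w+\.com")
SUBJECT_RE = re.compile(r"CN=www\.\w+\.net")


def scan_ssl_rows(rows):
    """Count responder ports and collect rows whose certificate names match the heuristic."""
    ports = Counter()
    possible_tor = []
    for row in rows:
        ports[row["id.resp_p"]] += 1
        issuer = row.get("issuer", "")
        subject = row.get("subject", "")
        if ISSUER_RE.match(issuer) and SUBJECT_RE.match(subject):
            possible_tor.append(row)
    return ports, possible_tor


# Made-up ssl.log rows reduced to the columns used here.
rows = [
    {"id.resp_p": 443, "issuer": "CN=www.abcdefgh.com", "subject": "CN=www.qrstuvwx.net"},
    {"id.resp_p": 443, "issuer": "CN=Example CA,O=Example", "subject": "CN=example.org"},
    {"id.resp_p": 9001, "issuer": "-", "subject": "-"},
]

ports, possible_tor = scan_ssl_rows(rows)
for row in possible_tor:
    print("possible Tor connection:", row)
for port, count in ports.most_common():
    print("port", port, "seen", count, "times")
```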
The next article will take Zeek's dns.log as an example to introduce sklearn feature engineering on dns.log data, along with the use of numpy for matrix operations and PCA dimensionality reduction.