Automated data analysis in information collection



Nowadays, many information collection tools and articles remain at the early stage of information collection. Most of them focus on broadening the collection channels, from search engines and passive DNS to richer brute-force dictionaries, but rarely mention whether and how to process the information once it has been obtained. In addition, some penetration testers dump the results into a text file every time they collect information, so almost every penetration test requires a fresh round of information collection. The disadvantages of this approach are obvious. First, it costs a considerable amount of time. Second, reusing the collected information is cumbersome and poorly automated, so a lot of manual intervention and analysis is required at a later stage.

To solve these two problems, information collection naturally needs a later stage, which is the topic of this article: automated data analysis of information collection.

1, WHAT

Everyone is familiar with information collection, but some readers may wonder what data analysis means in this context. Before talking about data analysis, consider a more common question: what are the ways to determine that two domain names are assets of the same company? Readers may want to think about this before reading the answer. Here are my thoughts:

WHOIS contact information

Certificate information

DNS resolution information


Webpage response information

Since many large companies run their own DNS, their domain names will generally point to the self-built DNS NAMESERVER. This feature can serve as one criterion for deciding whether two domains are assets of the same company; for details, see the WHOIS record in the figure below. As for web page response information, you can use the CSP header, page similarity, the domain names of imported static resources, website configuration files, and so on. Determining whether two domains belong to the same company through the methods above is a data analysis process. This confirmation method is also reversible: when we know one of a company's domain names and want to find its other domain names, the same methods still work.
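
The NAMESERVER criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nameserver lists are hypothetical sample data (in practice they would come from WHOIS or DNS lookups), the helper names `ns_zone` and `shared_ns_zones` are made up for this example, and taking the last two labels of a host name is a naive heuristic that does not handle multi-part suffixes.

```python
def ns_zone(nameserver: str) -> str:
    """Return the last two labels of a nameserver host name (naive heuristic)."""
    labels = nameserver.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def shared_ns_zones(ns_a, ns_b, public_providers=frozenset()):
    """Zones that serve both domains, ignoring known public DNS providers.

    A shared self-built nameserver zone is a hint (not proof) that the two
    domains belong to the same company.
    """
    zones_a = {ns_zone(ns) for ns in ns_a}
    zones_b = {ns_zone(ns) for ns in ns_b}
    return (zones_a & zones_b) - set(public_providers)


# Hypothetical sample data: nameservers of two domains under comparison.
first_domain_ns = ["ns1.example-corp.com.", "ns2.example-corp.com."]
second_domain_ns = ["NS3.example-corp.com", "ns1.shared-dns.net"]
third_domain_ns = ["ns1.shared-dns.net", "ns2.shared-dns.net"]

# Zones of public DNS hosting providers must be excluded, otherwise every
# customer of the same provider would look like the same company.
public = {"shared-dns.net"}

print(shared_ns_zones(first_domain_ns, second_domain_ns, public))
print(shared_ns_zones(second_domain_ns, third_domain_ns, public))
```

The exclusion set matters: a nameserver only works as evidence when it is self-built, so any shared hosting provider has to be filtered out first.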


2, WHY
To answer why automated data analysis is worth doing, we first need to know its purpose. As I understand it, data analysis has two main purposes: one is to collect as much target information as possible, and the other is to confirm the target information and remove irrelevant false information.

For external testers, the assets targeted by a penetration test mainly include domain names, IPs, and so on. They can also include clients such as installation packages. When testing online systems, however, the most important assets are domain names and IPs: the more domain names and IPs you obtain, the more assets you have.

Suppose A wants to submit a vulnerability to TSRC. How can A collect more Tencent-related assets? Many articles talk about subdomain brute-forcing, port scanning, and so on. But which domain names should be brute-forced? Is the list of domain names complete? Is the list of IPs used for port scanning complete? Answering these questions is the job of data analysis. Return to the earlier question of whether two domains are the same company's assets: if A has confirmed that a given domain name belongs to Tencent, the relationships described above can be used to find more brother domain names beyond it. Of course, this is only a small part of data analysis. It also covers obtaining more subdomains, determining which domains are inactive, determining which domains are public service domains, and so on. This is the process of discarding the false and keeping the true.

Simply put, data analysis in information collection means finding more associations like the ones above and building them into the collection process through automation, in order to improve the breadth and quality of the information we collect.


3, HOW

So far I have covered what data analysis in information collection is and why we use it, but a critical question has been overlooked: what data is analyzed? Here is a table that lists some common data that can be used for analysis and the correspondence between them:


Note: DOMAIN stands for domain name, SUBDOMAIN for subdomain, BRODOMAIN for brother domain name, ORG for organization, NAMESERVER for the DNS server address, and CNAME for the CNAME field of a DNS resolution; URL FORWARD means that accessing an IP redirects to a domain name.

In my experience, part of this data can be obtained from public third parties. For example, Censys provides global CERTIFICATE data, mainly certificate data for ports 443 and 8443. This data used to be directly downloadable from the public Internet; there are now some restrictions, but it can still be obtained by application. If applying is too much trouble and you have the resources, you can scan for it yourself. For WHOIS and DNS, Python has fairly complete packages that can be called directly. With the growing emphasis on privacy, WHOIS contact information is now mostly hidden, but some ORG or NAMESERVER fields are still retained and remain usable. URL FORWARD and CSP information can be collected by adding a small amount of code to the crawler or the directory brute-forcer. As for ICP data, some interfaces exist, but their data is updated too slowly and some of them charge fees, so you have to find your own solution. As for the CAPTCHAs of some systems, AI recognition of traditional character CAPTCHAs currently has a relatively high success rate, basically 90%+, with recognition times in milliseconds, as shown in the figure below.
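
As an illustration of the "small amount of code" needed for CSP information, the sketch below extracts host names from a Content-Security-Policy header value. It is a simplified parser written for this example, not a full implementation of the CSP grammar; the function name `csp_hosts` and the sample header are hypothetical.

```python
from urllib.parse import urlsplit


def csp_hosts(header_value: str) -> set:
    """Collect host names listed as sources in a CSP header value."""
    hosts = set()
    for directive in header_value.split(";"):
        tokens = directive.split()
        # tokens[0] is the directive name; the rest are source expressions.
        for source in tokens[1:]:
            if source.startswith("'"):
                continue  # keywords such as 'self', 'unsafe-inline', nonces
            if source.endswith(":"):
                continue  # scheme-only sources such as data: or https:
            if "://" not in source:
                source = "//" + source  # let urlsplit treat it as a host
            host = urlsplit(source).hostname
            if not host:
                continue
            if host.startswith("*."):
                host = host[2:]  # drop the wildcard label
            if "." in host:
                hosts.add(host)
    return hosts


# Hypothetical header, as it might be captured by a crawler.
sample_header = (
    "default-src 'self'; "
    "script-src 'self' https://static.example.com *.cdn-example.net:443; "
    "img-src data: *; report-uri /csp"
)
print(sorted(csp_hosts(sample_header)))
```

Each extracted host is only a candidate brother domain or subdomain; third-party services (analytics, public CDNs) appear in CSP headers too and still need to be filtered out.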


That was a lot of theory; now for practice. Let's take CERTIFICATE data as an example to explain how to obtain, store, and associate this data. There are mature projects for scanning and obtaining certificate information. As mentioned earlier, Censys uses zgrab, which belongs to the same family as Zmap. Its advantage is speed, but like Zmap it generates a relatively large amount of scan traffic. Since most readers currently use Python, here is also the key Python code for obtaining certificate information:
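
A minimal sketch of this step, using only Python's standard `ssl` and `socket` modules, might look like the following. It is an illustration rather than the author's original code, and the function names `fetch_cert` and `names_from_cert` are made up for this example. One limitation to keep in mind: `getpeercert()` returns an empty dict when certificate verification is disabled, so this version only works for hosts whose certificate validates; scanning arbitrary IPs would need `getpeercert(binary_form=True)` plus a separate certificate parser.

```python
import socket
import ssl


def names_from_cert(cert: dict) -> dict:
    """Pull organization, common name and alternate names from a getpeercert() dict."""
    org, common_name = None, None
    # "subject" is a tuple of RDNs, each RDN a tuple of (key, value) pairs.
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "organizationName":
                org = value
            elif key == "commonName":
                common_name = value
    alt_names = [
        value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"
    ]
    return {"org": org, "common_name": common_name, "alt_names": alt_names}


def fetch_cert(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Connect to host:port over TLS and return the verified peer certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


# Example use (performs a network connection, so it is left commented out):
# print(names_from_cert(fetch_cert("example.com")))
```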

In addition, a CERTIFICATE contains a lot of information, such as encryption methods. If it is only used for association, we only need to pay attention to the organization, the common name, and the alternate names:

Table 1. Certificate information table

Field        Meaning                                 Notes
domain       Domain name                             "*" and similar characters are removed and the value is stored reversed to make LIKE queries easier; this form is the result of several rounds of optimization
ip           IP
orgid        Organization id
iscommon     Whether the domain is the common name   Distinguishes the common name from alternate domain names
update_time  Update time                             Updated automatically when older than 30 days

Table 2. Organization information table

Field   Meaning
org     Organization name
orgid   Organization id
area    Area
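
The note on the domain field (remove "*", store the value reversed, query with LIKE) can be sketched with Python's built-in `sqlite3` module. The database engine, the column types, and the sample rows are assumptions made for this illustration; the source does not say which database is used. Storing the reversed string turns a subdomain search into a prefix match, which is the kind of LIKE pattern databases can typically serve from an index.

```python
import sqlite3


def normalize(domain: str) -> str:
    """Lowercase, drop a leading wildcard label and trailing dot, then reverse."""
    cleaned = domain.strip().lower().rstrip(".")
    if cleaned.startswith("*."):
        cleaned = cleaned[2:]
    return cleaned[::-1]


def subdomains(conn, root: str) -> list:
    """Return stored names that end in '.<root>', using a prefix LIKE match."""
    # Escape LIKE wildcards so that "_" in a name is matched literally.
    prefix = normalize(root).replace("%", "\\%").replace("_", "\\_")
    cur = conn.execute(
        "SELECT DISTINCT domain FROM cert WHERE domain LIKE ? ESCAPE '\\'",
        (prefix + ".%",),
    )
    return sorted(row[0][::-1] for row in cur)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cert (domain TEXT, ip TEXT, orgid INTEGER,"
    " iscommon INTEGER, update_time TEXT)"
)
# Hypothetical sample rows; real data would come from a certificate scan.
rows = [
    ("*.example.com", "192.0.2.10", 1, 1, "2024-01-01"),
    ("www.example.com", "192.0.2.10", 1, 0, "2024-01-01"),
    ("mail.example.com", "192.0.2.11", 1, 0, "2024-01-01"),
    ("notexample.com", "198.51.100.5", 2, 1, "2024-01-01"),
]
conn.executemany(
    "INSERT INTO cert VALUES (?, ?, ?, ?, ?)",
    [(normalize(d), ip, orgid, common, ts) for d, ip, orgid, common, ts in rows],
)
print(subdomains(conn, "example.com"))
```

Note that `notexample.com` is not returned: the pattern requires a "." right after the reversed root, so only true subdomains match.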

Then you can import the data after parsing it, or scan on your own. Note that if you scan yourself, it is recommended to use Zmap to sweep the ports first and then fetch the certificates. Once you have the data, you can do whatever you want with it. Return to A's needs: A already knows that Tencent owns a given domain name. How can A obtain Tencent's other root domain names? It is actually very simple, just one SQL statement:
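
A query of this kind could look like the sketch below. It is a reconstruction for illustration, not the author's original statement: it assumes the certificate table from Table 1 is named `cert`, uses SQLite through Python's `sqlite3` module, stores domains unreversed for readability, and runs on hypothetical sample rows. Root domains are approximated as names containing exactly one ".".

```python
import sqlite3

BROTHER_ROOTS_SQL = """
SELECT DISTINCT c2.domain
FROM cert AS c1
JOIN cert AS c2 ON c2.orgid = c1.orgid
WHERE c1.domain = ?
  AND c2.domain <> c1.domain
  -- number of '.' characters equals 1: a rough test for "root domain"
  AND LENGTH(c2.domain) - LENGTH(REPLACE(c2.domain, '.', '')) = 1
ORDER BY c2.domain
"""


def brother_roots(conn, known_domain: str) -> list:
    """Root domains whose certificates share an organization with known_domain."""
    return [row[0] for row in conn.execute(BROTHER_ROOTS_SQL, (known_domain,))]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cert (domain TEXT, ip TEXT, orgid INTEGER,"
    " iscommon INTEGER, update_time TEXT)"
)
# Hypothetical sample rows: orgid 1 is the organization A is interested in.
conn.executemany(
    "INSERT INTO cert VALUES (?, ?, ?, ?, ?)",
    [
        ("example.com", "192.0.2.10", 1, 1, "2024-01-01"),
        ("www.example.com", "192.0.2.10", 1, 0, "2024-01-01"),
        ("example.net", "192.0.2.20", 1, 1, "2024-01-01"),
        ("example-pay.com", "192.0.2.30", 1, 1, "2024-01-01"),
        ("other.org", "198.51.100.5", 2, 1, "2024-01-01"),
    ],
)
print(brother_roots(conn, "example.com"))
```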

Treating a domain whose number of "." characters equals 1 as a root domain is not rigorous, and it does not cover every form of domain name; it is used here only to simplify the demonstration. Actual processing is a little more complicated: you need to associate the DNS data and look at the basic information of each domain name for a second confirmation. After execution, results like the following are obtained, almost 100 in total:

Next, get the subdomains of a domain; partial screenshots of the results are as follows:

There are many other practices, and I will not give examples one by one. In short, with the data in hand you can dig out many things. Just as in the certificate example above, the other data in the table offers many points worth mining. Obtain this data, store it, and solidify the relationships in code, and the automated data analysis platform is basically complete. Now let us review the information collection process again. With automated data analysis, information collection evolves into the following process:

Associate CERTIFICATE information to find all brother domain names of the same organization, and recursively find the brother domain names and subdomains of those brother domain names;

Associate WHOIS information to find all brother domain names with the same company, the same organization, or the same email address;

Associate CSP information to find all brother domain names and subdomains;

Associate ICP information to find the brother domain names of the same company, along with some of the filed IPs;

Associate DNS information to find the domain names corresponding to an IP (non-CDN), and find the domain names that share the same DNS NAMESERVER;

Associate FORWARD information to obtain the domain name corresponding to an IP;


In fact, these steps are carried out recursively until all domain name and IP information has been found. Once they are complete, other tasks can follow, such as subdomain brute-forcing.
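
The recursive process described above amounts to computing a closure: keep applying every association to every known asset until nothing new turns up. A minimal sketch follows, assuming each association (certificate, WHOIS, DNS, and so on) is wrapped as a function that takes one asset and returns related assets. The function `expand_assets` and the in-memory lookup tables are hypothetical stand-ins for real database queries.

```python
from collections import deque


def expand_assets(seeds, lookups):
    """Apply every lookup to every asset, breadth first, until no new asset appears."""
    known = set(seeds)
    queue = deque(seeds)
    while queue:
        asset = queue.popleft()
        for lookup in lookups:
            for found in lookup(asset):
                if found not in known:  # the visited set guarantees termination
                    known.add(found)
                    queue.append(found)
    return known


# Hypothetical association data standing in for the real tables.
cert_links = {"example.com": {"example.net"}}
whois_links = {"example.net": {"example.org"}, "example.org": {"example.com"}}
dns_links = {"example.org": {"192.0.2.7"}}

lookups = [
    lambda asset: cert_links.get(asset, ()),
    lambda asset: whois_links.get(asset, ()),
    lambda asset: dns_links.get(asset, ()),
]

print(sorted(expand_assets(["example.com"], lookups)))
```

In a real pipeline each discovered asset should be confirmed before it is expanded further, otherwise a single false association (for example a shared CDN certificate) can pull in unrelated organizations.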

I have just covered the automated analysis used to obtain brother domains and subdomains during information collection. What other applications are there? Here are two from my own practice:

Case 1:

Take the CDNs that often appear behind domain names as an example: how do you determine whether a domain name uses a CDN? There are generally two ways. One is to aggregate by IP attributes, and the other is to judge by the suffix of the CNAME. For the second, you can filter directly by finding CNAME suffixes that are shared by the domain names of different ORGs. Aggregating by IP attributes also works, but it is more troublesome.
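
The CNAME-suffix approach can be sketched as a simple aggregation: group CNAME targets by suffix and flag any suffix used by several different ORGs. The threshold `min_orgs`, the two-label suffix heuristic, the function names, and the sample records are all assumptions chosen for this illustration.

```python
from collections import defaultdict


def cname_suffix(cname: str, labels: int = 2) -> str:
    """Return the last `labels` labels of a CNAME target (naive heuristic)."""
    parts = cname.rstrip(".").lower().split(".")
    return ".".join(parts[-labels:])


def likely_cdn_suffixes(records, min_orgs: int = 3) -> set:
    """Suffixes shared by at least `min_orgs` distinct organizations.

    `records` is an iterable of (domain, org, cname) tuples.
    """
    orgs_by_suffix = defaultdict(set)
    for _domain, org, cname in records:
        orgs_by_suffix[cname_suffix(cname)].add(org)
    return {
        suffix for suffix, orgs in orgs_by_suffix.items() if len(orgs) >= min_orgs
    }


# Hypothetical DNS records joined with organization data.
records = [
    ("www.alpha.example", "Alpha Inc", "alpha.edge.cdn-provider.net."),
    ("www.beta.example", "Beta Ltd", "beta.edge.cdn-provider.net."),
    ("img.gamma.example", "Gamma LLC", "gamma.edge.cdn-provider.net."),
    ("api.alpha.example", "Alpha Inc", "lb1.alpha-internal.com."),
    ("app.alpha.example", "Alpha Inc", "lb2.alpha-internal.com."),
]

cdn_suffixes = likely_cdn_suffixes(records)
uses_cdn = [d for d, _org, c in records if cname_suffix(c) in cdn_suffixes]
print(cdn_suffixes, uses_cdn)
```

A suffix used by only one ORG, however many of its domains point there, is more likely that company's own load balancer than a CDN, which is why the count is over distinct organizations rather than over domains.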

Case 2:

When collecting the target's IP information, you can brute-force the target's C segment and check the request and response information of each IP to confirm whether it belongs to the target. At the same time, junk pages can be found by aggregating on page title and status code.
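
The title and status code aggregation can be sketched as follows. The idea assumed here is that a (title, status) pair repeated across many IPs of a C segment is probably a default or placeholder page rather than a distinct service; the threshold `min_count`, the function name, and the sample responses are hypothetical.

```python
from collections import Counter


def junk_signatures(responses, min_count: int = 3) -> set:
    """Return (title, status) pairs that occur at least `min_count` times."""
    counts = Counter((r["title"], r["status"]) for r in responses)
    return {signature for signature, n in counts.items() if n >= min_count}


# Hypothetical responses collected from one C segment.
responses = [
    {"ip": "192.0.2.1", "title": "Welcome to nginx!", "status": 200},
    {"ip": "192.0.2.2", "title": "Welcome to nginx!", "status": 200},
    {"ip": "192.0.2.3", "title": "Welcome to nginx!", "status": 200},
    {"ip": "192.0.2.4", "title": "Example Corp Login", "status": 200},
    {"ip": "192.0.2.5", "title": "", "status": 404},
]

junk = junk_signatures(responses)
# Hosts whose page is not one of the repeated signatures deserve a closer look.
interesting = [r["ip"] for r in responses if (r["title"], r["status"]) not in junk]
print(junk, interesting)
```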

Of course, there are many other relationships of this kind for readers to explore. I think automated data analysis does more than make information collection convenient; more importantly, it changes how we think and what we can see. In the past, information collection was like going to the market to buy vegetables; now I have my own garden and simply pick what I need when I need it.


