Web Penetration Testing: Information Collection

In the previous series of articles we mainly covered intranet penetration, but in practice a crucial precondition for intranet penetration is that you must first get into the intranet! So starting from this article we begin the study of web penetration (the intranet penetration series will continue to be updated).

In a penetration test, an engineer fully simulates the attack and vulnerability-discovery techniques a real attacker might use, probing the security of the target network, hosts, and applications in depth. This helps companies uncover security flaws and vulnerabilities in their normal business processes, discover security risks before attackers do, and fix them before they are exploited.

The penetration testing process of a web application is mainly divided into three stages: information collection → vulnerability discovery → vulnerability exploitation. In this article, we give a basic explanation of the information collection stage.

Information collection introduction
Before a web penetration test, the most important step is information collection. As the saying goes, “the essence of penetration is information collection.” The depth of the collection directly determines the success or failure of the test: a solid foundation lets testers choose appropriate, accurate penetration methods and shortens the testing time. Generally speaking, the more information collected the better, and it usually covers the following parts:

Domain information collection

Subdomain information collection

Site information collection

Sensitive information collection

Server information collection

Port information collection

Real IP address recognition

Social engineering

Below we will explain these types of information collection respectively.

Domain information collection

A domain name, also called a network domain, is the name of a computer or group of computers on the Internet, composed of a series of labels separated by dots; it identifies the machine during data transmission (and sometimes indicates a geographic location). Because raw IP addresses are hard to remember and reveal nothing about the name or nature of the organization behind an address, domain names were designed, and the Domain Name System (DNS) maps domain names and IP addresses to each other, letting people browse the Internet with readable names instead of memorizing the numeric addresses read by machines.

Top-level domain name/first-level domain name:

The top-level domain (TLD, also called the first-level domain) is the highest level in the Internet’s DNS hierarchy and is stored in the name space of the DNS root zone. It is the last part of a domain name, that is, the part after the final dot. For example, in the domain name http://www.example.com, the top-level domain is .com.

secondary domain:

Besides the top-level domain there are second-level domains, the label immediately to the left of the top-level domain. For example, in the domain name http://www.example.com, example is the second-level domain.


A subdomain is a domain that belongs to a higher domain in the DNS hierarchy. For example, mail.example.com and calendar.example.com are two subdomains of example.com, which is itself a subdomain of the top-level domain .com. Every prefixed name under a top-level domain is a subdomain of it, and subdomains are classified as second-level, third-level, or multi-level subdomains according to their depth.

Generally speaking, before a penetration test the tester knows very little, often only a single domain name. The first step is therefore to gather information about that one domain: its registration details, its DNS servers, its subdomains, and the registrant’s contact information. There are several ways to collect domain name information.

Whois lookup
Whois is a query/response protocol used to look up the ownership and IP information of a domain name. Simply put, whois lets you check whether a domain name is registered and retrieve its registration details (such as the owner and the registrar). Whois data for different domain suffixes lives in different whois databases. Querying whois can yield the registrant’s name and email address, which is often very useful when testing personal sites, because search engines and social networks can then surface much more about the domain owner.

(1) Online query

Today there are online lookup tools with simple web interfaces that can query several databases at once. These web tools still rely on the whois protocol to send queries to the server, and command-line clients remain widely used by system administrators. Whois usually runs over TCP port 43, and the whois records for each domain/IP are maintained by the corresponding registry.
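As a sketch of what such a query looks like on the wire, the following Python snippet speaks the whois protocol directly over TCP port 43. The server map is an illustrative subset (real clients ship a full TLD table), with whois.iana.org assumed as a fallback for unknown suffixes:

```python
import socket

# Illustrative subset of whois servers; real clients ship a full TLD map.
WHOIS_SERVERS = {
    "com": "whois.verisign-grs.com",
    "net": "whois.verisign-grs.com",
    "org": "whois.pir.org",
    "cn": "whois.cnnic.cn",
}

def whois_server_for(domain: str) -> str:
    """Pick the whois server responsible for the domain's suffix."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return WHOIS_SERVERS.get(tld, "whois.iana.org")  # IANA as fallback

def whois_query(domain: str, timeout: float = 10.0) -> str:
    """Send the bare domain name over TCP port 43 and read the raw record."""
    with socket.create_connection((whois_server_for(domain), 43), timeout=timeout) as sock:
        sock.sendall(domain.encode() + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")

# whois_query("example.com")  # returns the raw registration record as text
```

Calling `whois_query("example.com")` would return the raw registration record, from which the registrar and contact fields can be read.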

Common websites include:



Whois Lookup: find information about the owner of the target website

Netcraft Site Report: shows the technologies used on the target website: http://toolbar.netcraft.com/site_report?url=

Robtex DNS: displays comprehensive DNS information about the target website

Global whois lookup: https://www.whois365.com/cn/


(2) Use the whois tool in kali to query

Kali Linux ships with a whois client: simply run the whois command with the domain you want to query, as shown in the figure below.

Subdomain information collection
As defined earlier, a subdomain is a domain that belongs to a higher domain in the DNS hierarchy: mail.example.com and calendar.example.com are subdomains of example.com, which is itself a subdomain of the top-level domain .com.

Why collect subdomains

Subdomain enumeration can discover more domains or subdomains within the scope of the test, which will increase the probability of vulnerability discovery.

Some hidden, ignored applications running on subdomains may help us discover major vulnerabilities.

The same vulnerabilities often exist across different domains or applications of the same organization.

If the target network is relatively large, attacking the main domain head-on is usually irrational: at that scale the main domain is a heavily protected area. Entering through one of the target’s subdomains first, and then working toward the real goal indirectly, is undoubtedly the better choice.

There are several ways to collect subdomains:

Use search engine query
We can use Google search syntax to look for subdomains. Taking Baidu’s domain as an example, use the “site:baidu.com” operator, as shown in the figure below.

Use online tools to query
There are many subdomain query sites on the Internet, through which you can retrieve the subdomain of a given domain name. Such as:

DNSdumpster: https://dnsdumpster.com/
Whois reverse check: http://whois.chinaz.com/
virustotal: www.virustotal.com
Subdomain blasting: https://phpinfo.me/domain/
IP reverse check bind domain name: http://dns.aizhan.com/

We use DNSdumpster to query NASA’s subdomains:

Enumerate subdomains through certificate transparency public log

Certificate Transparency (CT) is an initiative under which certificate authorities publish every SSL/TLS certificate they issue to public logs. An SSL/TLS certificate usually contains the domain name, subdomain names, and email addresses, which is exactly the kind of information attackers want.

The easiest way to find a domain’s certificates is to search the public CT logs, for example via websites such as the following:

Use tools to enumerate subdomains
Tools on Kali

Kali’s information gathering module includes a DNS analysis category with many tools for collecting domain information, as shown in the figure above.

  • Dnsenum: domain information collection

  • Dnsmap: collect and enumerate DNS information

  • Dnsrecon: DNS reconnaissance

  • Fierce: subdomain query

  • whois: registration lookup

We can use the Fierce tool to enumerate subdomains. It first tests whether the domain is vulnerable to a DNS zone transfer: if so, it collects subdomain information directly through the zone transfer; if not, it falls back to brute force.


fierce -dns <domain name>
fierce -dns <domain name> -threads 100 // threads is the number of threads, you can specify it yourself

Tools on Windows

The main subdomain query tools on Windows include:

  • Layer subdomain excavator
  • subDomainsBrute
  • K8
  • Sublist3r
  • Maltego
  • ……

The subDomainsBrute tool can be used to collect second-level domain names. Download address: https://github.com/lijiejie/subDomainsBrute

It runs under Python 3 and requires the aiodns library. The tool is used as follows:

python3 subDomainsBrute.py xxxx.com

After the collection is completed, the collection results will be written into a file corresponding to a domain name:

Besides subDomainsBrute, the Layer subdomain excavator is also very powerful; collecting subdomains with it displays details such as the domain names, resolved IPs, CDN status, web server, and site status:

Please search and download the tool yourself.
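The brute-force approach these tools share can be sketched with Python’s standard library alone: expand a wordlist into candidate names, resolve them concurrently, and keep the ones that answer. The wordlist and thread count below are illustrative; real tools ship dictionaries with tens of thousands of entries:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def candidates(domain, words):
    """Expand a wordlist into fully qualified candidate names."""
    return [f"{w}.{domain}" for w in words]

def resolve(name):
    """Return (name, ip) if the candidate resolves, else None."""
    try:
        return name, socket.gethostbyname(name)
    except socket.gaierror:
        return None

def brute_subdomains(domain, words, threads=50):
    """Resolve all candidates concurrently and keep the live ones."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(resolve, candidates(domain, words))
    return dict(r for r in results if r)

# brute_subdomains("example.com", ["www", "mail", "dev", "test", "admin"])
```

The dictionary quality matters far more than the code: a resolver-based brute force can only find names that appear in the wordlist.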

Site information collection
Next, we collect information on the web site, mainly collecting the following information:

  • CMS fingerprint recognition
  • Historical vulnerabilities
  • Scripting language
  • Sensitive directories/files
  • WAF recognition

CMS fingerprint recognition
A CMS (Content Management System), also called a whole-site system or article system, manages website content. Users only need to download the relevant CMS package and deploy it to have a working site. However, each CMS has its own structure, naming rules, and characteristic file contents, which can be used to determine exactly which CMS software and version a site runs.

In penetration testing, it is necessary to perform fingerprint identification. Only when the corresponding CMS is identified can the related vulnerabilities be found, and then the corresponding penetration operations can be carried out.

Common CMSes include DedeCMS, Discuz, PHPWEB, PHPWind, PHPCMS, ECShop, Dvbbs, SiteWeaver, ASPCMS, Empire, Z-Blog, WordPress, etc.

(1) Online recognition

Several online websites provide CMS fingerprint identification, as shown below:

(2) Use tools

Common CMS fingerprinting tools include WhatWeb, WebRobo, Coco, the Yujian web fingerprint tool, the Dayu CMS identification program, and others, which can quickly identify mainstream CMSes.

Below, we use the WhatWeb tool on Kali to identify the CMS of the target site:

As shown in the figure above, WhatWeb has identified the target site’s server, middleware, CMS, and more.

Once we know a site’s CMS type, we can look up its known vulnerabilities online and test accordingly.

(3) Manual identification

  1. Check the HTTP response headers, focusing on fields such as X-Powered-By and the cookie fields.
  2. Check HTML features, focusing on the content and attributes of tags such as body, title, and meta.
  3. Check for characteristic classes: some CMSes emit div tags with specific class attributes, such as <body class="ke-content">.

Collection of sensitive directories/files
This means running a directory scan of the target website. In web penetration, probing the directory structure and hidden sensitive files is a very important step: it can reveal the site’s admin backend, file upload interfaces, robots.txt, and even backup files from which the website’s source code can be recovered.

The main scanning tools for common website directories are:

  • DirBuster
  • dirsearch
  • dirb
  • wwwscan
  • Spinder.py
  • Sensitivefilescan
  • Weakfilescan
  • ……

(1) dirsearch directory scanning

Download link: https://github.com/maurosoria/dirsearch

The tool is very simple to use:

python3 dirsearch.py -u <URL> -e <EXTENSION>
  • -u: URL (required)
  • -e: extension/script type of the target site; * means all types (required)
  • -w: wordlist (optional)
  • -t: threads (optional)
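Conceptually, these scanners expand a wordlist into candidate URLs and record the HTTP status of each. A minimal stdlib sketch is shown below; the `%EXT%` placeholder mimics dirsearch’s extension substitution, and the sample wordlist is purely illustrative:

```python
import urllib.request
import urllib.error

def build_targets(base, words, extensions):
    """Expand a wordlist into URLs; '%EXT%' mimics dirsearch's extension placeholder."""
    expanded = []
    for w in words:
        if "%EXT%" in w:
            expanded += [w.replace("%EXT%", ext) for ext in extensions]
        else:
            expanded.append(w)
    return [base.rstrip("/") + "/" + path for path in expanded]

def probe(url, timeout=5):
    """Return the HTTP status of a candidate path (404 means it does not exist)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code   # 403, 401, etc. still reveal that the path exists
    except urllib.error.URLError:
        return None     # host unreachable

# for url in build_targets("http://target.example", ["admin", "index.%EXT%"], ["php", "bak"]):
#     print(url, probe(url))
```

Note that a 403 response is often as interesting as a 200: the path exists, it is merely forbidden.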

(2) DirBuster directory scan

DirBuster is a tool developed by OWASP (Open Web Application Security Project) specifically for detecting directories and hidden files on web servers (a Java environment is required).

Use as follows:

1. Enter the URL to scan in the Target URL box and set the request method to “Auto Switch (HEAD and GET)”.
2. Set the thread count yourself (too many threads may overload the system).
3. Select the scan type; to use your own dictionary, choose the “List based brute force” option.
4. Click “Browse” to load the dictionary.
5. Check “URL Fuzz” to enable URL fuzzing (standard mode is used if this option is not selected).
6. Enter “/{dir}” in the “URL to fuzz” field; {dir} is a variable standing for each line of the dictionary and is replaced by each entry at runtime.
7. Click “Start” to begin scanning.

After scanning with DirBuster, check the results; they can be displayed as a tree or as a flat list of all discovered pages:

Waf recognition
A web application protection system (also known as a website application-level intrusion prevention system; English: Web Application Firewall, abbreviated WAF). To borrow a widely accepted definition: a web application firewall is a product that protects web applications specifically by enforcing a series of security policies for HTTP/HTTPS traffic.

wafw00f is a web application firewall (WAF) fingerprint identification tool.

Download link: https://github.com/EnableSecurity/wafw00f

The working principle of wafw00f:

1. It sends a normal HTTP request and analyzes the response; many WAFs can be identified this way.

2. If that fails, it sends a number of (potentially malicious) HTTP requests and uses simple logic to infer which WAF is present.

3. If that also fails, it analyzes the earlier responses and uses other simple algorithms to guess whether a WAF or security solution reacted to our probes.
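The first of those steps, matching response headers against known WAF fingerprints, can be sketched as follows. The signature map is an illustrative subset for demonstration only; wafw00f itself ships a far larger plugin set:

```python
# Illustrative subset of fingerprints; wafw00f ships a far larger plugin set.
WAF_SIGNATURES = {
    "Cloudflare": [("server", "cloudflare"), ("set-cookie", "__cf")],
    "Sucuri": [("server", "sucuri")],
    "F5 BIG-IP": [("set-cookie", "bigip")],
}

def guess_waf(headers):
    """Match lowercased response headers against known WAF fingerprints."""
    low = {k.lower(): v.lower() for k, v in headers.items()}
    for waf, signatures in WAF_SIGNATURES.items():
        for header, token in signatures:
            if header in low and token in low[header]:
                return waf
    return None  # no passive match: a real tool now falls back to probe requests
```

A `None` result corresponds to wafw00f’s fallback path: only then does it start sending (possibly malicious) probe requests.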

The tool is built into kali:

wafw00f supports a lot of WAF recognition. To see which WAFs it can detect, use the -l option:


Simple use is as follows:

wafw00f https://www.xxx.com/

Sensitive information collection
Sometimes, against well-hardened targets, penetration cannot be achieved by purely technical means. In that case we can use search engines to find information the target has exposed on the Internet, for example database files, SQL injection points, service configuration information, source code leaked through exposed .git directories, unauthorized Redis access, and sensitive entries such as robots.txt, all serving the goal of penetration.

Google hacking

The Google search engine has been in use since 1998, and almost any question can be answered on it, including many we arguably should not be able to answer. Google can find sensitive files and network vulnerabilities, help identify operating systems, and even be used to find passwords, databases, and the entire contents of mailboxes.

Google Hacking leverages the power of Google search to find information across the vast Internet beyond what we might imagine. A light search can turn up leftover backdoors, admin entrances that were never meant to be found, SQL injection points, and other vulnerabilities; a medium search reveals user information leaks, source code leaks, unauthorized access, and the like; a heavy search may yield downloadable .mdb files, unlocked CMS install pages, website configuration passwords, PHP remote file inclusion vulnerabilities, and other critical exposures.

With Google Hacking we can collect a great deal of useful intelligence; there is no doubt that Google is a superb information gathering tool.

To make Google return the information we want, we need to combine it with some of the search engine’s syntax:

intext: find pages whose body text contains the keyword
intitle: find pages whose title contains the keyword
allintitle: like intitle, but multiple words can be specified
inurl: find pages whose URL contains the keyword
allinurl: like inurl, but multiple words can be specified
site: restrict results to the specified site
filetype: restrict results to the specified file type
link: find pages linking to the specified page
related: find pages of a similar type
info: return information about the specified site, for example info:www.baidu.com returns some of Baidu’s information
phonebook: look up American street addresses and phone numbers
Index of: find websites that allow directory browsing, presented like an ordinary local directory
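Since dorks are just strings, the operators above can also be combined programmatically. A small helper (hypothetical, for illustration only) that composes the kinds of queries used in the rest of this section:

```python
def dork(keywords=None, site=None, inurl=None, intitle=None, filetype=None):
    """Compose a Google query string from the operators listed above."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if inurl:
        parts.append(f"inurl:{inurl}")
    if intitle:
        parts.append(f"intitle:{intitle}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if keywords:
        parts.append(keywords)
    return " ".join(parts)

# dork(site="xx.com", inurl="login")  -> "site:xx.com inurl:login"
```

Such a helper makes it easy to generate a batch of queries over a list of target domains instead of typing each dork by hand.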

Find website backend

intext:"admin login" — returns only pages whose body text contains “admin login”
intitle:"admin login" — returns only pages whose title contains “admin login”

Find the administrator login page:


Find the background database management page

Find the designated website backend

site:xx.com intext:admin

site:xx.com inurl:login

site:xx.com intitle:admin

Look for possible file upload points on a specified website

site:xx.com inurl:file

site:xx.com inurl:load

Use Index of to find web sites that allow directory browsing, just like a normal local directory

index of /admin
index of /passwd
index of /password
index of /mail
"index of /" +passwd
"index of /" +password.txt
"index of /config"

Use the index of directory list to list the files and directories that exist on a web server.

intitle:index.of — the dot here acts as a single-character wildcard

Just go in and take a look:

Leaked backup files

intitle:index.of index.php.bak
intitle:index.of www.zip

Find SQL injection

inurl:php?id=

GHDB: the Google Hacking Database

Link: https://www.exploit-db.com/google-hacking-database/

A hacker can move in and out of networks armed with nothing but a search box, and there is strong backing behind this: the Google Hacking Database (GHDB).

It is a database voluntarily maintained by hackers around the world, gathering all kinds of refined queries and adding useful, effective Google search strings every day.

Github information leak

As an open source code hosting platform, GitHub offers programmers great convenience, but careless use, such as uploading code containing account passwords, keys, or other configuration files, lets attackers discover and exploit the leaked information. A typical case of GitHub sensitive information disclosure: developers often push source code to GitHub first and then pull it from the remote repository into the server’s web directory; if they forget to delete the .git directory, the vulnerability results. The .git directory can be used to restore the website’s source code, which may contain database credentials.

Many websites and systems use POP3 and SMTP to send email, and many developers, through insufficient security awareness, also put the related configuration files on GitHub. With Google search syntax, this sensitive information can be dug up:

site:Github.com smtp
site:Github.com smtp @qq.com
site:Github.com smtp @126.com
site:Github.com smtp @163.com
site:Github.com smtp @sina.com.cn

Database information leakage:

site:Github.com sa password
site:Github.com root password


Server information collection
We also need to collect the information of the target server, which mainly includes the following parts:

  • Web server fingerprint recognition
  • Real IP address recognition
  • Programming language
  • Web middleware
  • Port information collection
  • Back-end storage technology identification
  • ……

Web server fingerprint recognition
Web server fingerprinting means determining the type and version of the running web server. Several web server vendors and many software versions exist on the market; knowing which one is under test lets the tester check known vulnerabilities and their usual exploitation methods, which greatly helps the penetration test and may even change its course.

Web server fingerprint recognition mainly identifies the following information:

1. Web server name and version

2. Whether an application server sits behind the web server

3. Whether the database (DBMS) is deployed on the same host, and its type

4. The programming language used by the web application

5. The web application framework


(1) Manual inspection

1. HTTP header analysis

That is, examine the Server, X-Powered-By, Cookie, and other fields of the HTTP response headers; this is also the most basic method.

As shown above, from the Server field we can tell that the server is probably Apache 2.4.6 running on CentOS Linux.

The X-Powered-By field helps identify the web framework, and different web frameworks also set their own distinctive cookies, which we can likewise use to identify the web application framework.
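This header check is easy to script. The sketch below parses a raw response head and keeps only the fingerprint-relevant fields; the field list is an illustrative subset:

```python
def fingerprint(raw_head: str) -> dict:
    """Keep only the fields that matter for fingerprinting from a raw response head."""
    interesting = ("server", "x-powered-by", "set-cookie")
    found = {}
    for line in raw_head.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() in interesting:
            found.setdefault(name.strip().lower(), value.strip())
    return found

# Usage against a live host (network access assumed):
# import urllib.request
# with urllib.request.urlopen("http://target.example") as resp:
#     print(fingerprint(str(resp.headers)))
```

Fed the example headers from above, it would pull out the Apache version and the PHP version string in one pass.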

2. Protocol behavior
That is, analyze the order of the HTTP header fields: each server has its own internal ordering of response headers, so observing how they are organized can identify it.

3. Browse and observe the website
Inspecting the HTML source (characteristic class names) and its comments in various places can reveal valuable information, and the page suffix hints at the programming language and framework the web application uses.

4. Deliberately trigger errors
Error pages can reveal a great deal about the server. Try to obtain a 404 page by constructing and visiting a URL that contains a random string.

(2) Use tools to identify

WhatWeb is an automated fingerprint analysis tool for web applications.

Regular scan:

whatweb <domain or IP>

Batch scan (specify a file listing the domains to scan):

whatweb -i <path to file>

Verbose scan:

whatweb -v <domain>

WhatWeb is an open source website fingerprinting tool written in Ruby. As shown above, true to its name, WhatWeb can identify many details about a website, including CMS type, blog platform, middleware, web framework modules, web server, scripting language, JavaScript libraries, IP, cookies, and more.

In addition, Nmap OS fingerprinting gives an initial judgment of the operating system. As for identifying the back-end DBMS, if the host exposes it externally it can often be judged from port characteristics, especially default ports such as 3306 (MySQL), 1433 (SQL Server), and 27017 (MongoDB).

Real IP address recognition
In a penetration test you are usually given only a domain name, so you must determine the target server’s real IP from it. An IP lookup site such as www.ip138.com can directly return some of the target’s IP and domain information, but only if the target server is not behind a CDN.

What is CDN?

CDN stands for Content Delivery Network. A CDN is an intelligent virtual network built on top of the existing Internet: relying on edge servers deployed in many locations, and on the central platform’s load balancing, content distribution, and scheduling modules, it lets users fetch the content they need from a nearby node, reducing congestion and improving response speed and hit rate. The key CDN technologies are content storage and distribution.

A CDN caches the static resources users access most often directly on its node servers; when a user requests them again, the request is served by the node nearest the user, and only real data interaction is answered by the remote web server. This greatly improves the site’s response speed and user experience. The birth of CDN networks so improved Internet service quality that traditional large network operators began building their own.

Therefore, if the target server uses a CDN service, the IP we query directly is not the real target server’s IP but that of the CDN node nearest to us, which prevents us from obtaining the target server’s real IP directly.

How to determine whether the target server uses CDN?

We can ping the website’s domain name; for example, here we ping Baidu:

As shown above, we can see that Baidu uses CDN.

We can also use proxies, or an online ping website, to run ping tests from different regions and compare the returned IPs. If they are all the same, there is most likely no CDN. Given how a CDN works, if the site uses one, the IP seen from each part of the country is that of the local CDN node; so if the pinged IPs mostly differ, or show a strong regional pattern, look up who owns those IPs to decide whether a CDN is in play. The following websites offer such ping tests:
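The comparison logic just described can be sketched as a simple heuristic: collect the IPs returned from different vantage points, count how many are distinct, and group them by /24 (“C segment”) to spot a shared origin range. The threshold of 3 distinct IPs is an assumption for illustration, not a hard rule:

```python
from collections import Counter

def likely_cdn(region_ips, threshold=3):
    """region_ips: IPs seen when pinging the same domain from different regions.
    Many distinct addresses suggest CDN edge nodes; one suggests a single origin.
    The threshold of 3 is an illustrative assumption, not a hard rule."""
    return len(set(region_ips)) >= threshold

def c_segments(ips):
    """Group IPs by /24 ('C segment') to spot a shared origin range."""
    return Counter(ip.rsplit(".", 1)[0] for ip in ips)
```

In practice the IP ownership lookup matters as much as the count: a handful of distinct IPs all registered to one CDN vendor is decisive, while distinct IPs in the same C segment may simply be a load-balanced origin.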


Take https://www.wepcc.com/ as an example: as shown in the figure, a ping test of https://www.baidu.com returns different IP addresses and locations, so https://www.baidu.com uses a CDN.

How to find the target real IP by bypassing the CDN?

1. Use subdomains. Many webmasters put a CDN only in front of the main site, or of high-traffic subsites, while low-traffic subsites may have none. Although those subsites use different IPs from the main site, they often sit in the same C segment, so pinging second-level domain names can reveal subsite IPs and hence the target’s real IP range.

2. Query the apex domain. When CDNs first came into use, some webmasters put only the www hostname behind the CDN and left the bare (apex) domain uncovered, to make site maintenance easier without waiting for the CDN cache. So try stripping the www from the target domain and pinging it to see whether the IP changes. As the figure below shows, this method can be quite effective:

3. Scan the website’s sensitive files, such as phpinfo.php, to find the target’s real IP.

4. Access from abroad. For various reasons, many domestic CDN vendors serve only domestic routes, with almost no coverage abroad, so directly accessing the site from a foreign host may return the real IP. We can visit through foreign online proxy websites and may obtain the real IP address. Online proxy websites:


As shown in the figure above, the target website resolves to the same IP from the foreign proxies.

5. Via the mail server. Mail systems are usually internal and not resolved through the CDN. Trigger an email via the target site’s registration or RSS subscription feature, then look in the message headers for the mail server’s domain and IP, and ping that domain; it may sit in the same segment as the target web server. We can then scan that segment host by host and check whether any returns HTML matching the site, thereby finding the target’s real IP (this only works with the target’s own internal mail server; third-party or public mail servers are useless).

6. Check the domain’s historical resolution records. The target may not always have used a CDN, so records from before its adoption may survive. Sites such as https://www.netcraft.com and https://viewdns.info/ show a domain’s IP history.

7. Nslookup queries. The domain’s NS, MX, and TXT records may well point to the real IP or to servers in the same C segment.

8. Use cyberspace search engines. The idea is to use the site’s content to find the original IP: if the origin server also serves the site’s content, these engines will have indexed plenty of matching data. The most common ones are:

  • Shodan:https://www.shodan.io/
  • zoomeye: https://www.zoomeye.org/
  • FOFA:https://fofa.so/

9. Make the target connect to us.

(1) Have the target send us an email. For example, subscribing or registering triggers a confirmation link to our mailbox; inspect the full email source or headers to find the mail server’s domain and IP.

(2) Exploit website vulnerabilities. Code execution, SSRF, or stored XSS can make the server actively visit a web server we control, and the target website server’s real IP then appears in our logs.


Verify the real IP address obtained

The methods above (method 4, for example) may yield many IP addresses, and we then need to determine which is real. For a web server, the simplest verification is to browse to each IP directly and check whether the response matches the page returned when visiting the domain name:

Port information collection
Collecting port information is a very important part of the penetration testing process. Scanning the target server’s open ports tells us which services it runs, and since different ports call for different attack methods, port information lets us prescribe the right remedy when penetrating the target server. It can be collected in the following ways:

1. Use nmap tool to collect
nmap -A -v -T4 -O -sV  target address

2. Use masscan to detect port opening information
Masscan claims to be the fastest Internet port scanner, able to sweep the entire Internet in six minutes. Its scan results resemble those of nmap (the well-known port scanner); internally it is closer to scanrand, unicornscan, and ZMap, using asynchronous transmission. The main difference from those scanners is that it is faster still, and masscan is also more flexible, allowing arbitrary address ranges and port ranges.
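Underneath, both tools boil down to probing ports and recording which accept connections. A minimal TCP connect-scan sketch with a masscan-style port specification parser (far slower than the real tools, which send raw asynchronous packets):

```python
import socket

def parse_ports(spec: str):
    """Parse a masscan-style port spec such as '80,443,8000-8002'."""
    ports = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ports.extend(range(int(lo), int(hi) + 1))
        else:
            ports.append(int(part))
    return ports

def connect_scan(host: str, spec: str, timeout: float = 1.0):
    """Sequential TCP connect scan; returns the ports that accepted a connection."""
    open_ports = []
    for port in parse_ports(spec):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports

# connect_scan("127.0.0.1", "22,80,443,8000-8010")
```

A connect scan completes the full TCP handshake and so is noisy; nmap’s default SYN scan and masscan’s raw-packet approach are stealthier and much faster.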

Since using these tools usually leaves traces on the target website, here are some online services that can perform the detection for us:

Online website: http://tool.chinaz.com/port/
ThreatScan online website (basic website information collection): https://scan.top15.cn/
Shodan: https://www.shodan.io/

For common ports and their attack directions, refer to: Web Penetration Testing: Summary of Commonly Used Port Numbers and Attack Directions


Social engineering


Friends who have seen "Who Am I: No System Is Safe" are probably deeply impressed by its social engineering. The principles of social engineering permeate the entire film, exploiting people's timidity for gain; in the end the protagonist pushes social engineering to the extreme and successfully wins himself a new identity.

Social engineering is a non-technical penetration method that obtains information through interpersonal interaction. Modern attacks are not limited to remote network intrusion; through social engineering, attackers also exploit human weaknesses in offline scenarios. Unfortunately for defenders, this method is very effective and has a high success rate; social engineering is, in fact, one of the biggest threats to corporate security. The clearest difference between social engineering in the narrow sense and the broad sense is whether the attacker interacts directly with the victim; in the broad sense, it is a targeted attack against one or more specific targets. Social engineering plays a large role in penetration testing: with it, an attacker can extract information from an employee that should have remained secret.

Kevin Mitnick noted in "The Art of Deception" that the human factor is security's weakest link. Many companies invest heavily in information security, yet the cause of a data leak is often a person. You might never imagine that, starting from just a username, a string of digits, or a snippet of text, a social engineer can filter and correlate these few clues until they hold a clear picture of your personal information: family situation, hobbies, marital status, and every trace you have left on the Internet. Although this may seem like the least remarkable technique, it is also one of the most troublesome: a hacking method that relies on no hacking software at all, focusing instead on the study of human weakness.

A social engineering attack consists of four stages:

Research: information collection (web, media, trash cans, physical access); identify and study the target
Hook: establish the first conversation with the target and set the hook
Engage: build trust with the target and obtain information
Exit: leave the scene without arousing the target's suspicion

Information commonly collected through social engineering includes: name, gender, date of birth, ID card number, ID card home address, the issuing public security bureau, delivery addresses, usual activity areas, QQ number, mobile phone number, email address, bank card number (and issuing bank), PayPal, Yahoo, Google, Yandex, Bing, Twitter, LinkedIn, Amazon, Walmart, net-disk accounts, WeChat, commonly used handles, education history (primary/middle/high school/university/resume), a detailed profile of the target's personality, commonly used passwords, and photo EXIF information.

Commonly exploited information systems include: airline systems, bus systems, the major telecom operators' websites, the national basic population information database, the national motor vehicle/driver information database, major express delivery systems (unauthorized), the national entry-exit personnel database, the national fugitive information database, enterprise-related systems, the national security key-unit information database, and so on.

For example:

Suppose we are penetration testing a company and collecting its real IP. We can use the previously collected email address of one of the company's salespeople: first we send the salesperson an email pretending to be interested in a certain product, and the salesperson will naturally reply. By analyzing the reply's email headers, we can collect information about the company's real IP address and internal mail server. Taking the social engineering further, suppose we have now gathered the target person's mailbox, phone number, name, and domain registrar, and have obtained the mailbox password by brute force or credential stuffing; we can then impersonate the target and ask the registrar's customer service to assist in resetting the domain management password. Support staff will often do so, letting the attacker take over the domain management console and perform domain hijacking.



