Task: domain scraping

2018/8/17 12:51:19

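Task description

The site "全国公安机关互联网站安全服务平台" (the public security organs' internet site security service platform, http://www.beian.gov.cn/portal/index.do ) has a navigation entry called "备案展厅" (record showcase), shown in the figure below:

[Figure: 未命名图片.png]

Clicking this link jumps to "http://www.beian.gov.cn/portal/recordShow?token=24b4857f-003c-4c02-8c4c-ed4aa905ced9".

That page lists some of the registered domain records (see the figure below). Because the list is refreshed periodically, the task is to fetch these records on a schedule and save them to a database.

[Figure: 未命名图片2.png]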

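Creating the project

Since the content lives under "http://www.beian.gov.cn/portal/recordShow?token=24b4857f-003c-4c02-8c4c-ed4aa905ced9", it should be enough to request that address from a program and read the returned page.

Programming language: Java

First, create a project. The IDE I use is Eclipse (Java EE Oxygen.3a Release (4.7.3a)).

File → New → Maven Project

In the New Maven project window,

check "Create a simple project", that is, skip archetype selection.

Click Next,

and fill in:

Group Id: com.sqber

Artifact Id: domaininfo

Leave everything else at its default.

Click Finish.

The basic project is now created.

Next, create an entry class: right-click src/main/java and choose New → Class.

In the class creation window,

enter com.sqber.domaininfo as the Package

and Program as the class name,

and check the option that generates a main method.

Click OK.

At this point the project contains pom.xml and src/main/java/com/sqber/domaininfo/Program.java.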

Fetching the page content

To send the request and read the page we use HttpClient. Add the libraries to pom.xml:

<dependencies>
	<!-- HTTPClient -->
	<dependency>
		<groupId>org.apache.httpcomponents</groupId>
		<artifactId>httpclient</artifactId>
		<version>4.5.5</version>
	</dependency>
	<dependency>
		<groupId>org.apache.httpcomponents</groupId>
		<artifactId>httpmime</artifactId>
		<version>4.5.5</version>
	</dependency>
</dependencies>

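A freshly created pom.xml has no dependencies node yet, so the node itself has to be added together with its contents.

Let's first see how httpclient is used.

Create a class HttpClientDemo.java in the package com.sqber.domaininfo.

Add a method simpleDemo that makes a basic request with httpclient. The code is as follows: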

package com.sqber.domaininfo;

import java.nio.charset.StandardCharsets;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientDemo {

	public static void main(String[] args) {
		simpleDemo();
	}

	// Request the Baidu home page and print what comes back
	private static void simpleDemo() {

		// Create the request object; this is a GET request
		HttpGet httpget = new HttpGet("http://www.baidu.com");

		// try-with-resources closes the client and the response, even if the request fails
		try (CloseableHttpClient client = HttpClients.createDefault();
				CloseableHttpResponse response = client.execute(httpget)) {

			// Get the entity (the body) from the response
			HttpEntity entity = response.getEntity();

			// Convert the entity to a string; UTF-8 is used if the response names no charset
			String content = EntityUtils.toString(entity, StandardCharsets.UTF_8);

			System.out.println(content);

		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}

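Right-click and run HttpClientDemo; the fetched HTML appears in the Console view.

Now change the URL to http://www.beian.gov.cn/portal/recordShow?token=24b4857f-003c-4c02-8c4c-ed4aa905ced9

This time nothing comes back, and opening the same URL in a different browser does not work either.

The URL carries a token parameter, which is clearly involved. Look at the "备案展厅" navigation entry on the home page, open the browser console, and search for the string that follows token= to see where it comes from.

However, the URL requested earlier already contained a token string, so perhaps the request also has to carry specific cookies?

Open Fiddler and watch the traffic.

Sure enough, the request data also contains two cookies: BIGipServerPOOL-WebAGPT and JSESSIONID.

These two cookies were presumably returned by the home page. Use Fiddler to watch the request for the home page (http://www.beian.gov.cn/portal/index.do).

The response does contain the values of both cookies.

So, to request the record page (portal/recordShow), we first have to request the home page and take the token value and the two cookie values from it.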

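Fetching the page content - preparing the data

All we need from the first page is the token and the cookie string for the second request. Both are extracted with regular expressions.

Extracting the token string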

// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public static String getToken(String content) {

	/*
	 * Reference: https://blog.csdn.net/yb642518034/article/details/61198976
	 * (?=pattern) is a positive lookahead, (?<=pattern) is a positive lookbehind.
	 */
	// [^']* stops at the first closing quote; a greedy .* would run on to the last quote on the line
	String pattern = "(?<=var taken_for_user = ')[^']*(?=')";
	Pattern r = Pattern.compile(pattern);
	Matcher m = r.matcher(content);

	if (m.find()) {
		// Lookarounds consume no characters, so the whole match is just the token
		return m.group();
	}

	return "";
}
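A quick way to check the lookaround pattern is to run it against a small string. The sketch below does that with a made-up fragment (the real home page markup may differ, and the class name TokenDemo is only for illustration). It uses [^']* between the lookarounds, so the match stops at the first closing quote even when more quoted strings follow on the same line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenDemo {

	// Lookbehind: the match must start right after "var taken_for_user = '".
	// Lookahead: the match must be followed by a closing quote.
	private static final Pattern TOKEN =
			Pattern.compile("(?<=var taken_for_user = ')[^']*(?=')");

	public static String getToken(String content) {
		Matcher m = TOKEN.matcher(content);
		return m.find() ? m.group() : "";
	}

	public static void main(String[] args) {
		// Hypothetical fragment; note the second quoted string on the same line
		String html = "<script>var taken_for_user = 'abc-123'; var other = 'x';</script>";
		System.out.println(getToken(html)); // prints abc-123
	}
}
```

With a greedy (.*) in place of [^']*, the same input would yield abc-123'; var other = 'x, because the match extends to the last quote on the line.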

Extracting the cookie string

public static String getCookie(String headerVal) {
	/* Example headerVal: JSESSIONID=8C22DB5AE5D032BBFD685B13152614BA; Path=/; HttpOnly */
	/*
	 * The * and + quantifiers are greedy: they match as much text as possible.
	 * Appending ? to them gives a non-greedy (minimal) match.
	 */
	String pattern = "(.*?)(?=;)";
	Pattern r = Pattern.compile(pattern);
	Matcher m = r.matcher(headerVal);

	if (m.find()) {
		// Group 1 is everything before the first ';', i.e. the name=value pair
		return m.group(1);
	}
	return "";
}
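To show how the extracted pairs end up in the second request, here is a sketch that reduces each Set-Cookie value to its name=value part and joins the parts into one Cookie header value. It uses indexOf instead of the regular expression, so a value without any semicolon is also handled. The class and method names are made up for illustration, and the BIGipServerPOOL-WebAGPT value is invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CookieHeaderDemo {

	// Keep only the leading "name=value" part of one Set-Cookie value
	public static String nameValue(String setCookie) {
		int semicolon = setCookie.indexOf(';');
		String pair = semicolon >= 0 ? setCookie.substring(0, semicolon) : setCookie;
		return pair.trim();
	}

	// Join the pairs with "; ", the separator used in a Cookie request header
	public static String cookieHeader(List<String> setCookieValues) {
		List<String> pairs = new ArrayList<>();
		for (String value : setCookieValues) {
			String pair = nameValue(value);
			if (!pair.isEmpty()) {
				pairs.add(pair);
			}
		}
		return String.join("; ", pairs);
	}

	public static void main(String[] args) {
		List<String> fromHomePage = Arrays.asList(
				"JSESSIONID=8C22DB5AE5D032BBFD685B13152614BA; Path=/; HttpOnly",
				"BIGipServerPOOL-WebAGPT=12345.20480.0000; path=/"); // invented value
		System.out.println(cookieHeader(fromHomePage));
	}
}
```

With HttpClient, the resulting string would be set on the second request as the value of the Cookie header, for example via httpget.setHeader("Cookie", value).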

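What remains is to fetch the second page with the token and cookies, parse it, and save the results to the database, plus the scheduled job. I won't go through those parts here; see the code directly.

Repository: https://github.com/shenqiangbin/domainSpider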
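The scheduled part is not shown in this post. As a rough idea of one way to do it with the JDK alone, the sketch below runs a task repeatedly with ScheduledExecutorService. This is an assumption about how such a job could be wired up, not necessarily what the repository does; the class name, the start method, and the placeholder task are all hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ScheduledFetchDemo {

	// Runs the task once right away, then again `delay` after each run finishes
	public static ScheduledExecutorService start(Runnable task, long delay, TimeUnit unit) {
		ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
		scheduler.scheduleWithFixedDelay(() -> {
			try {
				task.run();
			} catch (RuntimeException e) {
				// An exception escaping the task would cancel every later run,
				// so report it and wait for the next cycle
				e.printStackTrace();
			}
		}, 0, delay, unit);
		return scheduler;
	}

	public static void main(String[] args) throws InterruptedException {
		// Placeholder task; a real one would fetch, parse and store the records
		ScheduledExecutorService scheduler =
				start(() -> System.out.println("fetching record list..."), 1, TimeUnit.SECONDS);
		Thread.sleep(3500);      // let a few cycles run for the demo
		scheduler.shutdownNow(); // a real job would keep running instead
	}
}
```

scheduleWithFixedDelay measures the pause from the end of one run to the start of the next, so a slow fetch never overlaps with the following one.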