Writing a Java Crawler with Apache HttpClient


Apache HttpClient is a widely used Java HTTP client library for sending HTTP requests and handling responses. This article shows how to use it to build a basic web crawler that fetches page content.

Below is a simple Java crawler written with the Apache HttpClient library, with each step explained in comments:


import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.ssl.SSLContexts;
import org.apache.http.util.EntityUtils;

import javax.net.ssl.SSLContext;
import java.io.IOException;

public class HttpClientCrawler {

    public static void main(String[] args) {
        // Create a custom HttpClient (with HTTPS support)
        try (CloseableHttpClient httpClient = createHttpClient()) {

            // Target URL (placeholder; replace it with the page you want to crawl)
            String url = "https://example.com";

            // Create the HttpGet request
            HttpGet httpGet = new HttpGet(url);

            // Request configuration (timeouts)
            RequestConfig config = RequestConfig.custom()
                    .setConnectTimeout(5000)    // connect timeout: 5 seconds
                    .setSocketTimeout(5000)     // read timeout: 5 seconds
                    .build();
            httpGet.setConfig(config);

            // Set a request header (mimic a browser)
            httpGet.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36");

            // Execute the request
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {

                // Read the response status code
                int statusCode = response.getStatusLine().getStatusCode();
                System.out.println("HTTP status code: " + statusCode);

                // Handle a successful response
                if (statusCode == 200) {
                    HttpEntity entity = response.getEntity();
                    if (entity != null) {
                        // Convert the response body to a string
                        // ("UTF-8" is the fallback when the response declares no charset)
                        String html = EntityUtils.toString(entity, "UTF-8");
                        System.out.println("Page content length: " + html.length());

                        // HTML parsing logic can go here (for example with Jsoup)
                        // Document doc = Jsoup.parse(html);
                        // Extract the data you need...
                    }
                } else {
                    System.err.println("Request failed, status code: " + statusCode);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

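    // Create an HTTPS-capable HttpClient that skips certificate validation
    // (hostnames are still checked by the default hostname verifier)
    private static CloseableHttpClient createHttpClient() throws Exception {
        // Create an SSL context that trusts every certificate (test environments only!)
        SSLContext sslContext = SSLContexts.custom()
                .loadTrustMaterial(null, (chain, authType) -> true) // trust all certificates
                .build();

        // Create the SSL connection socket factory
        SSLConnectionSocketFactory sslSocketFactory = new SSLConnectionSocketFactory(
                sslContext,
                new String[]{"TLSv1.2", "TLSv1.3"}, // supported protocols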
                null,
                SSLConnectionSocketFactory.getDefaultHostnameVerifier());

        return HttpClients.custom()
                .setSSLSocketFactory(sslSocketFactory)
                .build();
    }
}

Usage:


Add the Maven dependencies:
<dependencies>
    <!-- HttpClient core library -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.13</version>
    </dependency>

    <!-- Add Jsoup if you need HTML parsing -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.15.3</version>
    </dependency>
</dependencies>

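Main features:
- Accepts any HTTPS certificate by skipping certificate validation (for test environments only)
- Sets 5-second connect and read timeouts
- Sends a browser-like User-Agent header
- Checks the HTTP status code
- Converts the response body to a string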


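Suggested extensions:
// Proxy support
RequestConfig config = RequestConfig.custom()
    .setProxy(new HttpHost("proxy.example", 8080))
    .build();

// Cookie management
CookieStore cookieStore = new BasicCookieStore();
CloseableHttpClient client = HttpClients.custom()
    .setDefaultCookieStore(cookieStore)
    .build();

// Connection pooling (recommended)
PoolingHttpClientConnectionManager connManager =
    new PoolingHttpClientConnectionManager();
connManager.setMaxTotal(200);           // maximum total connections
connManager.setDefaultMaxPerRoute(20);  // maximum connections per route
// The pool only takes effect once it is attached to a client
CloseableHttpClient pooledClient = HttpClients.custom()
    .setConnectionManager(connManager)
    .build();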

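Notes:
- Do not skip SSL certificate validation in production
- Respect the target site's robots.txt
- Leave a reasonable interval between requests (3 to 5 seconds is suggested)
- Handle 429 (Too Many Requests) and 503 (Service Unavailable) responses
- Add logging and a retry mechanism for errors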
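
The notes on crawl intervals, 429/503 responses, and retries could be sketched in plain Java as follows. This is a minimal sketch, not a definitive implementation: the class `RetryPolicy`, its constants, and its method names are all hypothetical, and the HTTP call is abstracted as an `IntSupplier` that returns a status code so the logic does not depend on HttpClient.

```java
import java.util.Random;
import java.util.function.IntSupplier;
import java.util.function.LongConsumer;

// Hypothetical helper for illustration; not part of Apache HttpClient.
final class RetryPolicy {
    static final int MAX_ATTEMPTS = 3;       // assumed limit: first try plus two retries
    static final long BASE_DELAY_MS = 3000;  // assumed base delay (3 seconds)

    /** Treats 429 and 503 as temporary; no other status code is retried. */
    static boolean isRetryable(int statusCode) {
        return statusCode == 429 || statusCode == 503;
    }

    /** Exponential backoff: 3 s before the first retry, 6 s before the second. */
    static long backoffMillis(int attempt) {
        return BASE_DELAY_MS << (attempt - 1);
    }

    /** Random pause between 3000 and 5000 ms for ordinary consecutive requests. */
    static long crawlDelayMillis(Random random) {
        return 3000 + random.nextInt(2001);
    }

    /**
     * Runs the request, retrying while the status is retryable and attempts remain.
     * The sleeper is injected so callers decide how to wait (for example Thread.sleep).
     */
    static int executeWithRetry(IntSupplier request, LongConsumer sleeper) {
        int status = request.getAsInt();
        for (int attempt = 1; attempt < MAX_ATTEMPTS && isRetryable(status); attempt++) {
            sleeper.accept(backoffMillis(attempt));
            status = request.getAsInt();
        }
        return status;
    }
}
```

In the crawler above, `request` could wrap `httpClient.execute(httpGet)` and return the status code, while `sleeper` could call `Thread.sleep` and handle `InterruptedException`. Honoring a server's `Retry-After` header is left out of this sketch.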

If you need to parse the HTML content, the Jsoup library works well alongside HttpClient:


// Pass the page URL as the base URI so that abs:href can resolve relative links
Document doc = Jsoup.parse(html, url);
Elements links = doc.select("a[href]");
for (Element link : links) {
    System.out.println("Found link: " + link.attr("abs:href"));
}
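
Once links are extracted, a crawler usually has to avoid visiting the same page twice and to stay on the target site. The following is one possible sketch using only the JDK; the class `CrawlFrontier` and its methods are hypothetical names introduced for illustration, and restricting the crawl to a single host is an assumption, not something the original code does.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical helper for illustration: a queue of URLs to visit plus a set of URLs already seen.
final class CrawlFrontier {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    private final String allowedHost;

    CrawlFrontier(String allowedHost) {
        this.allowedHost = allowedHost;
    }

    /** Drops the #fragment; returns null for malformed, host-less, or non-http(s) URLs. */
    static String normalize(String url) {
        if (url == null) {
            return null;
        }
        int hash = url.indexOf('#');
        String withoutFragment = hash >= 0 ? url.substring(0, hash) : url;
        try {
            URI uri = new URI(withoutFragment);
            String scheme = uri.getScheme();
            if (scheme == null || uri.getHost() == null) {
                return null;
            }
            if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
                return null;
            }
            return uri.toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Queues the URL if it is valid, on the allowed host, and not seen before. */
    boolean offer(String url) {
        String normalized = normalize(url);
        if (normalized == null) {
            return false;
        }
        String host = URI.create(normalized).getHost();
        if (!allowedHost.equalsIgnoreCase(host)) {
            return false;
        }
        if (!seen.add(normalized)) {
            return false;
        }
        pending.add(normalized);
        return true;
    }

    /** Returns the next URL to fetch, or null when nothing is pending. */
    String poll() {
        return pending.poll();
    }
}
```

Inside the link loop, each `link.attr("abs:href")` value could be passed to `offer`, and the main loop could keep calling `poll` until it returns null.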

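Adjust the code to your actual needs, and make sure you comply with the target site's terms of use and with applicable laws and regulations.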


