
Scrapy Learning, Lesson 4

Lesson 4 of learning the Python crawler framework Scrapy

  • Task: crawl all first-level, second-level, and article-level news data under the ifeng.com navigation page
  • Execution: the spider code
  • Result: the crawled data

Task: crawl all first-level, second-level, and article-level news data under the ifeng.com navigation page

ifeng.com navigation page (screenshots of the first-level headings, second-level headings, news links, and article titles are omitted here).

Execution: the spider code

1. items.py: declare the data fields to scrape

```python
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# .html
import scrapy


class IfengprojectItem(scrapy.Item):
    # first-level title and link
    parentTitle = scrapy.Field()
    parentUrls = scrapy.Field()
    # second-level title and link
    secondTitle = scrapy.Field()
    secondUrls = scrapy.Field()
    # news link and the directory the article is saved to
    newsUrls = scrapy.Field()
    newsFileName = scrapy.Field()
    # article title, body text, and publish time
    newsHead = scrapy.Field()
    newsContent = scrapy.Field()
    newsPublicTime = scrapy.Field()
```

2. ifeng.py: the spider that performs the crawl

```python
# -*- coding: utf-8 -*-
import scrapy
import os

from iFengProject.items import IfengprojectItem


class IfengSpider(scrapy.Spider):
    name = 'ifeng'
    allowed_domains = ['ifeng.com']
    # assumed: the start URL was truncated in the source; the ifeng front page fits the task
    start_urls = ['http://www.ifeng.com/']

    def parse(self, response):
        items = []
        # first-level titles and links
        parentUrls = response.xpath('//div[@class = "col3"]/h2/a/@href').extract()
        parentTitle = response.xpath('//div[@class = "col3"]/h2/a/text()').extract()
        # second-level titles and links
        secondUrls = response.xpath('//ul[@class = "clearfix"]/li/a/@href').extract()
        secondTitle = response.xpath('//ul[@class = "clearfix"]/li/a/text()').extract()

        # iterate over every first-level title/link pair
        for i in range(0, len(parentTitle)):
            # directory for this first-level category ("数据" = data)
            parentFileName = "./数据/" + parentTitle[i]
            # create the directory if it does not exist
            if not os.path.exists(parentFileName):
                os.makedirs(parentFileName)

            # iterate over all second-level entries
            for j in range(0, len(secondTitle)):
                item = IfengprojectItem()
                # store the first-level title and url
                item['parentTitle'] = parentTitle[i]
                item['parentUrls'] = parentUrls[i]
                # keep only second-level urls that start with the first-level url;
                # some second-level links do not, so part of the data is dropped
                if_belong = secondUrls[j].startswith(item['parentUrls'])
                if if_belong:
                    secondFileName = parentFileName + '/' + secondTitle[j]
                    # create the second-level directory if it does not exist
                    if not os.path.exists(secondFileName):
                        os.makedirs(secondFileName)
                    item['secondUrls'] = secondUrls[j]
                    item['secondTitle'] = secondTitle[j]
                    item['newsFileName'] = secondFileName
                    # debug log of the directories being created
                    filename = 'secondFileName.html'
                    with open(filename, 'a+', encoding='utf-8') as f:
                        f.write(item['newsFileName'] + "  **   ")
                    items.append(item)

        for item in items:
            yield scrapy.Request(url=item['secondUrls'],
                                 meta={'meta_1': item},
                                 callback=self.second_parse)

    # recurse into the urls found on each second-level page
    def second_parse(self, response):
        # retrieve the item carried in the request meta
        meta_1 = response.meta['meta_1']
        # extract the news links inside the second-level page
        xpathStr = "//div[@class='juti_list']/h3/a/@href"
        xpathStr += " | " + "//div[@class='box_list clearfix']/h2/a/@href"
        newsUrls = response.xpath(xpathStr).extract()
        items = []
        for i in range(0, len(newsUrls)):
            # keep only links that start with the first-level url and end in .shtml
            if_belong = (newsUrls[i].endswith('.shtml')
                         and newsUrls[i].startswith(meta_1['parentUrls']))
            if if_belong:
                item = IfengprojectItem()
                item['parentUrls'] = meta_1['parentUrls']
                item['parentTitle'] = meta_1['parentTitle']
                item['secondUrls'] = meta_1['secondUrls']
                item['secondTitle'] = meta_1['secondTitle']
                item['newsFileName'] = meta_1['newsFileName']
                item['newsUrls'] = newsUrls[i]
                items.append(item)

        for item in items:
            yield scrapy.Request(url=item['newsUrls'],
                                 meta={'meta_2': item},
                                 callback=self.news_parse)

    def news_parse(self, response):
        item = response.meta['meta_2']
        content = ""
        head = response.xpath("//title/text()")[0].extract()
        content_list = response.xpath(
            '//div[@id="main_content"]/p/text() | //div[@id="yc_con_txt"]/p/text()'
        ).extract()
        timeXpath = ("//span[@class='ss01']/text() | "
                     "//div[@class='yc_tit']/p/span/text()")
        if response.xpath(timeXpath):
            newsPublicTime = response.xpath(timeXpath)[0].extract()
        else:
            newsPublicTime = "publish time unavailable"
        for each in content_list:
            content += each
        item['newsHead'] = head
        item['newsContent'] = content
        item['newsPublicTime'] = newsPublicTime
        yield item
```
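The filter inside second_parse is the heart of the category matching: a news link is kept only when it shares the first-level url as a prefix and ends in `.shtml`. A standalone sketch of that predicate (the urls below are made-up examples, not real ifeng pages):

```python
def belongs(news_url, parent_url):
    """Mirror the check used in second_parse."""
    return news_url.endswith('.shtml') and news_url.startswith(parent_url)

parent = 'http://news.ifeng.com/'
print(belongs('http://news.ifeng.com/a/123.shtml', parent))  # True
print(belongs('http://ent.ifeng.com/a/123.shtml', parent))   # False: different category
print(belongs('http://news.ifeng.com/live', parent))         # False: not .shtml
```

This is also why the note at the end mentions empty folders: any article whose url does not start with its category's url is silently dropped.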

3. pipelines.py: store the scraped data

```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: .html
import re


class IfengprojectPipeline(object):
    def process_item(self, item, spider):
        head = item['newsHead']
        newsPublicTime = item['newsPublicTime']
        # build the file name from the publish time plus the title,
        # replacing characters that are illegal in file names
        filename = newsPublicTime + head.rstrip()
        pattern = r'[\\/:*?"<>|\r\n]+'
        filename = re.sub(pattern, '-', filename) + ".txt"
        with open(item['newsFileName'] + '/' + filename, "w",
                  encoding='utf-8') as fp:
            fp.write(item['newsContent'])
        return item
```
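The regex in the pipeline folds every run of forbidden file-name characters (`\ / : * ? " < > |` and line breaks) into a single `-`. A standalone sketch with a made-up publish time and title:

```python
import re

def safe_filename(public_time, head):
    """Build a .txt file name the way the pipeline does."""
    name = public_time + head.rstrip()
    return re.sub(r'[\\/:*?"<>|\r\n]+', '-', name) + ".txt"

print(safe_filename('2019-04-01 10:00', '凤凰网|新闻?标题 '))
# 2019-04-01 10-00凤凰网-新闻-标题.txt
```

Without this step, a title containing `:` or `?` would raise an error when the file is created on Windows.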

4. settings.py: configuration; switch on the pipeline that handles the scraped data

```python
# enable the item pipeline
ITEM_PIPELINES = {
    'iFengProject.pipelines.IfengprojectPipeline': 300,
}
```
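The integer value is the pipeline's priority: Scrapy runs enabled pipelines in ascending order of that number, so lower values process each item first. A sketch of that ordering, with a hypothetical second pipeline (`CleanupPipeline` is an assumed name, not part of this project):

```python
# lower number = runs earlier in the item-processing chain
ITEM_PIPELINES = {
    'iFengProject.pipelines.IfengprojectPipeline': 300,
    'iFengProject.pipelines.CleanupPipeline': 100,  # hypothetical
}
ordered = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(ordered[0])  # iFengProject.pipelines.CleanupPipeline runs first
```

With a single pipeline, as here, the exact number only matters if more pipelines are added later.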

Result: the crawled data


Note: because the page markup differs from one news link to another, the limited extraction rules above cannot cover every link, so some of the created folders/files end up empty.
