
Sina Weibo crawler with simulated login (mobile version)


I've been meaning to crawl Sina Weibo for a while. I started with a crawler for the PC site, but extracting the content from it was genuinely painful, so here I crawl the mobile version instead. Without further ado, show you my code. This script extracts the posts of a given user; the next step is to load a large amount of content into MySQL for some clustering analysis. For now the captcha has to be typed in by hand; I'll post a follow-up once the automatic captcha recognition part is done. The Weibo content crawler is below.
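The plan is to eventually load the scraped posts into MySQL for clustering. As a rough sketch of that storage step, here is a minimal example using the standard-library sqlite3 as a stand-in for MySQL (the table name, columns, and sample rows are all made up; with a MySQL driver such as pymysql the cursor/execute calls would look much the same):

```python
# -*- coding: utf-8 -*-
import sqlite3

# Hypothetical schema for storing scraped posts; sqlite3 stands in for MySQL here.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE posts (user TEXT, page INTEGER, content TEXT)')

# Pretend these rows came out of the crawl loop.
scraped = [('someone', 1, u'First post text'),
           ('someone', 1, u'Second post text')]
cur.executemany('INSERT INTO posts VALUES (?, ?, ?)', scraped)
conn.commit()

cur.execute('SELECT COUNT(*) FROM posts')
print(cur.fetchone()[0])
```

With a real MySQL table, the clustering step could then pull `content` back out in bulk with a single SELECT.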

```python
# -*- coding: utf8 -*-
import requests
import urllib
from lxml import etree

s = requests.Session()  # the session keeps cookies across requests
url = ''  # fill in the mobile Weibo address here
html_you = s.get(url).content
selector_you = etree.HTML(html_you)
url_login = selector_you.xpath('//a[@id="top"]/@href')[0]

# fetch the login page and scrape the hidden form fields
html = s.get(url_login).content
selector = etree.HTML(html)
password = selector.xpath('//input[@type="password"]/@name')[0]  # dynamic name of the password field
vk = selector.xpath('//input[@name="vk"]/@value')[0]
action = selector.xpath('//form[@method="post"]/@action')[0]
capId = selector.xpath('//input[@name="capId"]/@value')[0]

# download the captcha image so it can be read and typed in by hand
url_1 = '.php?cpt=' + capId
path = "d://downloads//1.GIF"
data = urllib.urlretrieve(url_1, path)
print 'Pic Saved!'
print capId
print action
print password
print vk
code = raw_input('please input the captcha: ')

new_url = url_login + action
print new_url
data = {
    'backTitle': u'手机新浪网',
    'backURL': '',  # fill in the Weibo address here
    'capId': capId,
    'code': code,
    'mobile': 'your login name',
    password: 'your password',  # the key is the dynamic field name scraped above
    'remember': 'on',
    'tryCount': '',
    'vk': vk,
    'submit': u'登录',
}
newhtml = s.post(new_url, data=data).content

new_selector = etree.HTML(newhtml, parser=etree.HTMLParser(encoding='UTF-8'))
page = new_selector.xpath('//input[@type="hidden"]/@value')[0]  # total number of pages
print page
for i in range(1, int(page) + 1):
    url_page = '=%s' % i
    url_page_1 = s.get(url_page).content
    new_selector_1 = etree.HTML(url_page_1, parser=etree.HTMLParser(encoding='UTF-8'))
    content = new_selector_1.xpath('//span[@class="ctt"]')
    for each in content:
        text = each.xpath('string(.)')
        print text
```
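The extraction at the end relies on lxml's `string(.)` XPath call, which flattens an element's entire text content, including text inside nested tags such as links. A minimal offline illustration against a made-up fragment mimicking the mobile page markup:

```python
# -*- coding: utf-8 -*-
from lxml import etree

# Made-up fragment standing in for the mobile Weibo page.
html = u'''
<div>
  <span class="ctt">Hello <a href="#">world</a>!</span>
  <span class="ctt">Second post</span>
</div>
'''
selector = etree.HTML(html)
for each in selector.xpath('//span[@class="ctt"]'):
    # string(.) concatenates all descendant text, link text included
    print(each.xpath('string(.)'))
```

Plain `each.text` would stop at the first nested tag and return only "Hello ", which is why the code uses `string(.)` instead.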
