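As the title says.

Three ways to scrape web pages when writing a web crawler in Python

Helper function

This function downloads the HTML source of a page: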

import urllib2

def download(url, user_agent='wswp', num_retries=2):
    """Download a page and return its HTML, or None on failure (Python 2)."""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        # Retry only on 5xx server errors, which are often temporary.
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, user_agent, num_retries - 1)
    return html
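
The helper above targets Python 2 (urllib2 and the print statement). For readers on Python 3, a rough equivalent might look like the sketch below. The opener parameter is a hypothetical hook that is not part of the original function; it is only there so the retry logic can be tried without touching the network. Decoding the body as UTF-8 is also an assumption.

```python
import urllib.error
import urllib.request


def download(url, user_agent='wswp', num_retries=2,
             opener=urllib.request.urlopen):
    """Return the page body as text, or None if the download fails.

    `opener` is a hypothetical hook (not in the original helper) that
    makes it possible to exercise the retry logic offline.
    """
    print('Downloading:', url)
    request = urllib.request.Request(url, headers={'User-agent': user_agent})
    try:
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        return opener(request).read().decode('utf-8', errors='replace')
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        # Only HTTPError carries a status code; retry on 5xx responses only.
        if num_retries > 0 and 500 <= getattr(e, 'code', 0) < 600:
            return download(url, user_agent, num_retries - 1, opener)
        return None
```

As in the original, a 4xx response or a connection failure returns None immediately, while a 5xx response is retried up to num_retries times.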

The three methods

Method 1: Regular expressions

import re
url='http://example.webscraping.com/places/default/view/Afghanistan-1'
html=download(url)
print re.findall('<tr id="places_area__row"><td class=".*?"><label class="readonly" for="places_area" id="places_area__label">Area: </label></td><td class="w2p_fw">(.*?)</td><td class="w2p_fc"></td></tr>',html)
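
The pattern above spells out the whole table row, so a small change in the page layout (an extra attribute, a line break) would stop it from matching. One possible way to loosen it is to anchor only on the row id and the class of the value cell, as in this sketch. It is written for Python 3 and runs against an inline sample; the sample row, the w2p_fl class, and the area value are placeholders modelled on the pattern above, not a capture of the live page, and extract_area is a hypothetical helper name.

```python
import re

# Placeholder markup modelled on the row the pattern above expects.
SAMPLE_HTML = (
    '<table>'
    '<tr id="places_area__row"><td class="w2p_fl">'
    '<label class="readonly" for="places_area" '
    'id="places_area__label">Area: </label></td>'
    '<td class="w2p_fw">647,500 square kilometres</td>'
    '<td class="w2p_fc"></td></tr>'
    '</table>'
)

# Anchor on the row id and the value cell's class; skip everything between.
AREA_PATTERN = re.compile(
    r'<tr id="places_area__row">.*?<td\s+class=["\']w2p_fw["\']>(.*?)</td>',
    re.DOTALL,
)


def extract_area(html):
    """Return the text of the area cell, or None if the row is missing."""
    match = AREA_PATTERN.search(html)
    return match.group(1) if match else None


print(extract_area(SAMPLE_HTML))
```

Even the looser pattern still depends on attribute order and exact spelling, which is the main reason to prefer a real HTML parser for anything beyond a quick one-off.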

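Method 2: The BeautifulSoup module

The BeautifulSoup module parses web pages and provides a convenient interface for locating content. It also copes with malformed markup, such as missing attribute quotes and unclosed tags. Its drawback is that, being written in pure Python, it is relatively slow.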

from bs4 import BeautifulSoup
url='http://example.webscraping.com/places/default/view/Afghanistan-1'
html=download(url)
# Name the parser explicitly; otherwise bs4 picks one and emits a warning.
soup=BeautifulSoup(html,'html.parser')
tr=soup.find(attrs={'id':'places_area__row'})
td=tr.find(attrs={'class':'w2p_fw'})
area=td.text
print area
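
To see the tolerance for malformed markup in action, here is a small sketch (Python 3) that feeds BeautifulSoup a fragment with an unquoted attribute value and unclosed li tags. The fragment is made up for illustration, and html.parser is just one parser choice; how exactly the missing tags get closed depends on the parser, so other parsers such as lxml or html5lib may nest the items differently.

```python
from bs4 import BeautifulSoup

# Made-up fragment: unquoted attribute value and two unclosed <li> tags.
broken_html = '<ul class=country><li>Area<li>Population</ul>'

soup = BeautifulSoup(broken_html, 'html.parser')
ul = soup.find('ul')

# The unquoted class value is still read as a normal attribute.
print(ul['class'])

# Both list items are found even though neither was closed in the source.
print(len(soup.find_all('li')))

# prettify() shows how this particular parser closed the tags.
print(soup.prettify())
```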

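Method 3: lxml

lxml is a Python wrapper around the libxml2 parsing library. Because libxml2 is written in C, lxml is faster than BeautifulSoup. Below is sample code that uses lxml's CSS selectors to extract the area.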

import lxml.html
html=download('http://example.webscraping.com/places/default/view/Afghanistan-1')
tree=lxml.html.fromstring(html)
# Note: in recent lxml releases cssselect() needs the separate cssselect package.
td=tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
area=td.text_content()
print area
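
If CSS selectors are not available, the same cell can also be located with XPath, which lxml supports directly. This is a swapped-in technique rather than what the example above uses. The sketch below (Python 3) runs against an inline placeholder row instead of the live page; the markup and the area value are assumptions modelled on the selector above.

```python
import lxml.html

# Placeholder markup, not a capture of the live page.
SAMPLE_HTML = (
    '<table>'
    '<tr id="places_area__row">'
    '<td class="w2p_fl"><label>Area: </label></td>'
    '<td class="w2p_fw">647,500 square kilometres</td>'
    '<td class="w2p_fc"></td>'
    '</tr>'
    '</table>'
)

tree = lxml.html.fromstring(SAMPLE_HTML)

# XPath counterpart of the CSS selector 'tr#places_area__row > td.w2p_fw'.
cells = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')

# Guard against a missing row instead of indexing [0] blindly.
area = cells[0].text_content() if cells else None
print(area)
```

Note that @class="w2p_fw" is an exact string comparison, so unlike the CSS form it would not match a cell that carries additional classes.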

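CSS selector examples:

Select all tags: *
Select all a tags: a
Select all elements with class="link": .link
Select the a tag with id="home": a#home
Select all span tags that are direct children of an a tag: a > span
Select all span tags anywhere inside an a tag: a span
Select all a tags whose title attribute is "Home": a[title=Home]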
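
To try these selectors out, the sketch below (Python 3) counts how many elements each one matches in a small made-up document. It uses BeautifulSoup's select() method, which accepts the same selector syntax; lxml's cssselect() should give the same results for these simple cases, though that is not verified here. The sample markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up document containing 7 elements in total.
SAMPLE_HTML = '''
<div>
  <a id="home" class="link" title="Home" href="/"><span>Home</span></a>
  <a class="link" href="/about"><b><span>About</span></b></a>
  <p class="link">Plain paragraph</p>
</div>
'''

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

selectors = ['*', 'a', '.link', 'a#home', 'a > span', 'a span', 'a[title=Home]']

# Map each selector to the number of elements it matches.
counts = {sel: len(soup.select(sel)) for sel in selectors}

for sel in selectors:
    print('%-15s matches %d element(s)' % (sel, counts[sel]))
```

The difference between the child and descendant combinators shows up here: a > span matches only the first span, because the second one is wrapped in a b tag, while a span matches both.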