Python crawler notes, installation part (installing Python crawler modules)

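Contents:

1. "A beginner-friendly Python crawler tutorial": using urllib and parsing pages
2. How do I install a web crawler in Python?
3. Python crawlers
4. How to install Python and configure extension packages for crawling
5. How do I write a Python crawler?
6. Which libraries does a Python crawler need?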

"A beginner-friendly Python crawler tutorial": using urllib and parsing pages

Use urllib to fetch the source of the Baidu homepage.
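A minimal Python 3 sketch of this step, using only urllib.request from the standard library. The URL constant and the function name fetch_source are assumptions made for illustration, not taken from the original tutorial.

```python
import urllib.request

# Assumed URL for the Baidu homepage; the source does not spell it out.
BAIDU_URL = "http://www.baidu.com"


def fetch_source(url, timeout=10):
    """Return the page source at `url`, decoded as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # read() returns bytes, so decode them to get a str
        return response.read().decode("utf-8")
```

Calling fetch_source(BAIDU_URL) performs the actual network request and returns the HTML as a string.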

If a GET request parameter contains Chinese characters, they must be percent-encoded first; sending them unencoded raises an error.
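As a rough illustration, urllib.parse.quote percent-encodes a single value. The search keyword and the URL pattern below are hypothetical.

```python
import urllib.parse

keyword = "周杰倫"  # hypothetical search term containing Chinese characters

# quote() turns the non-ASCII characters into %XX escapes (UTF-8 by default)
encoded_keyword = urllib.parse.quote(keyword)

# Assumed search URL pattern, for illustration only
search_url = "https://www.baidu.com/s?wd=" + encoded_keyword
```

urllib.parse.unquote reverses the encoding, which is handy when checking what was sent.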

When to use urlencode: when the request carries several parameters.
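A small sketch of the multi-parameter case, assuming made-up parameter names and values: urllib.parse.urlencode encodes every key and value and joins the pairs with "&".

```python
import urllib.parse

# Hypothetical query parameters
params = {"wd": "周杰倫", "sex": "男", "location": "台灣"}

# urlencode() percent-encodes each key/value and joins the pairs with '&'
query_string = urllib.parse.urlencode(params)

# Assumed search URL pattern, for illustration only
search_url = "https://www.baidu.com/s?" + query_string
```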

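Why learn about handlers?

Why use a proxy? Some sites block crawlers, and crawling from your real IP address makes it likely that the address gets banned.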
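One way to route requests through a proxy with urllib is a ProxyHandler combined with build_opener. The sketch below assumes a hypothetical proxy address; the helper name make_proxy_opener is also made up for this example.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that sends HTTP requests through `proxy_url`."""
    # The handler maps a URL scheme to the proxy that should serve it
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical proxy address; replace it with a proxy you are allowed to use
opener = make_proxy_opener("http://127.0.0.1:8080")
```

Requests are then sent with opener.open(url) instead of urllib.request.urlopen(url).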

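2. Parsing techniques

1. Install the lxml library

2. Import lxml.etree

3. etree.parse(): parse a local file

4. etree.HTML(): parse a server response

5. Query DOM elements

1. Path queries

2. Predicate queries

3. Attribute queries

4. Fuzzy queries

5. Content queries

6. Logical operators

Example: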
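A minimal sketch of the six query types, assuming lxml is installed and using a small hypothetical HTML snippet (the element ids and city names are invented for illustration):

```python
from lxml import etree

# Hypothetical HTML snippet, used only for illustration
HTML = """
<html><body>
  <ul>
    <li id="l1" class="c1">Beijing</li>
    <li id="l2">Shanghai</li>
    <li id="c3">Shenzhen</li>
    <li id="c4">Wuhan</li>
    <li>Chengdu</li>
  </ul>
</body></html>
"""

# etree.HTML() parses a string, as you would with a server response
tree = etree.HTML(HTML)

# 1. Path query: every <li> directly under a <ul>
# 5. Content query: text() returns the text inside the matched elements
all_items = tree.xpath("//ul/li/text()")

# 2. Predicate query: only <li> elements that have an id attribute
with_id = tree.xpath("//ul/li[@id]/text()")

# 3. Attribute query: the class attribute of the <li> whose id is "l1"
class_of_l1 = tree.xpath('//ul/li[@id="l1"]/@class')

# 4. Fuzzy queries: id contains "l", or id starts with "c"
contains_l = tree.xpath('//ul/li[contains(@id, "l")]/text()')
starts_with_c = tree.xpath('//ul/li[starts-with(@id, "c")]/text()')

# 6. Logical operators: "and" inside a predicate, "|" to union two paths
both = tree.xpath('//ul/li[@id="l1" and @class="c1"]/text()')
either = tree.xpath('//ul/li[@id="l1"]/text() | //ul/li[@id="l2"]/text()')
```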

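In this tutorial JsonPath is applied to JSON loaded from a local file. It works on an already-parsed Python object, so a server response has to be decoded with json.loads() first.

Install with pip:

Using jsonpath:

Example:

Parse the JSON data above.

Drawback of BeautifulSoup: it is slower than lxml.

Advantage of BeautifulSoup: the interface is user-friendly and convenient.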

pip install bs4 -i

from bs4 import BeautifulSoup

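1. Finding a node by tag name

soup.a.attrs

2. Functions

find('a'): returns only the first a tag

find('a', title='name')

find('a', class_='name')

find_all('a'): returns every a tag

find_all(['a', 'span']): returns every a and span tag

find_all('a', limit=2): returns only the first two a tags

obj.string

obj.get_text() [recommended]

tag.name: gets the tag name

tag.attrs: returns the attributes as a dictionary

obj.attrs.get('title') [commonly used]

obj.get('title')

obj['title']

Example:

Use BeautifulSoup to parse an HTML document.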
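A short sketch that exercises the calls listed above, assuming bs4 is installed and using a hypothetical HTML snippet. It relies on Python's built-in "html.parser" backend so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
HTML = """
<div>
  <a href="/first" title="a1" class="link">First link</a>
  <a href="/second" title="a2">Second link</a>
  <span>Some <b>nested</b> text</span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Lookup by tag name: soup.a is the first <a>; attrs is a dict of attributes
first_attrs = soup.a.attrs

# find() returns the first match, optionally filtered by attribute
first = soup.find("a")
by_title = soup.find("a", title="a2")
by_class = soup.find("a", class_="link")

# find_all() returns every match; a list of names matches any of them
all_links = soup.find_all("a")
links_and_spans = soup.find_all(["a", "span"])
first_only = soup.find_all("a", limit=1)

# .string is None when a tag has more than one child node,
# while get_text() concatenates all the nested text
span = soup.find("span")
span_string = span.string
span_text = span.get_text()

# Three equivalent ways to read an attribute
title_a = first.attrs.get("title")
title_b = first.get("title")
title_c = first["title"]
```

The span case shows why get_text() is the recommended way to read text: .string only works when the tag holds a single text node.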

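How do I install a web crawler in Python?

The module is not installed.

On Windows, install the third-party module with pip:

pip install <module name>

Then run your code again and it should work.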
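To check whether a module is available before running the crawler, a small sketch using importlib from the standard library may help. The helper name is_installed is made up for this example, and it is only meant for top-level module names.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module called `module_name` can be imported."""
    # find_spec() returns None when no such top-level module can be found
    return importlib.util.find_spec(module_name) is not None


# Example: report which of the assumed crawler dependencies are missing
missing = [name for name in ("requests", "lxml", "bs4") if not is_installed(name)]
```

Each name left in missing can then be installed with pip install followed by the package name.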

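Python crawlers

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. As the full name suggests, a CAPTCHA tests whether the user is a real human or a bot.

1. Getting the CAPTCHA image

Each time the registration page loads it shows a different CAPTCHA image. To find out which parameters the form requires, we can reuse the parse_form() function written in the previous chapter.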

    >>> import cookielib, urllib2, pprint
    >>> import form
    >>> REGISTER_URL = ''  # set to the registration page URL
    >>> cj = cookielib.CookieJar()
    >>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    >>> html = opener.open(REGISTER_URL).read()
    >>> form = form.parse_form(html)
    >>> pprint.pprint(form)
    {'_formkey': 'a67cbc84-f291-4ecd-9c2c-93937faca2e2',
     '_formname': 'register',
     '_next': '/places/default/index',
     'email': '',
     'first_name': '',
     'last_name': '',
     'password': '',
     'password_two': '',
     'recaptcha_response_field': None}

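The recaptcha_response_field entry above holds the CAPTCHA answer, which has to be read from the CAPTCHA image; we load that image with Pillow. Install it with pip install Pillow (see the Pillow installation documentation for other methods). Pillow provides a convenient Image class with many high-level methods that are useful for processing CAPTCHA images. The code below takes the HTML of the registration page as input and produces an Image object containing the CAPTCHA image.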

    >>> import lxml.html
    >>> from io import BytesIO
    >>> from PIL import Image
    >>> tree = lxml.html.fromstring(html)
    >>> print tree
    <Element html at 0x7f8b006ba890>
    >>> img_data_all = tree.cssselect('div#recaptcha img')[0].get('src')
    >>> print img_data_all
    data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAABgCAIAAAB9kzvfAACAtklEQVR4nO29Z5gcZ5ku3F2dc865
    …
    rkJggg==
    >>> img_data = img_data_all.partition(',')[2]
    >>> print img_data
    iVBORw0KGgoAAAANSUhEUgAAAQAAAABgCAIAAAB9kzvfAACAtklEQVR4nO29Z5gcZ5ku3F2dc865
    …
    rkJggg==
    >>> binary_img_data = img_data.decode('base64')
    >>> file_like = BytesIO(binary_img_data)
    >>> print file_like
    <_io.BytesIO object at 0x7f8aff6736b0>
    >>> img = Image.open(file_like)
    >>> print img
    <PIL.PngImagePlugin.PngImageFile image mode=RGB size=256x96 at 0x7F8AFF5FAC90>

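In this example the image is a Base64-encoded PNG embedded in the page, a format that represents binary data with ASCII characters. We strip the data:image/png;base64, prefix by splitting at the first comma, then Base64-decode the image data to get back the original binary form. To load the image, PIL needs a file-like interface, so we wrap the binary data in BytesIO before passing it to the Image class.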
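The session above is Python 2, where str.decode('base64') exists. In Python 3 the same two steps (strip the prefix, decode the payload) could be sketched with the standard base64 module. The function name data_uri_to_file is hypothetical.

```python
import base64
from io import BytesIO


def data_uri_to_file(src):
    """Turn a 'data:image/png;base64,...' string into a file-like object."""
    # Everything before the first comma is the header; the rest is the payload
    _header, _sep, payload = src.partition(",")
    binary_data = base64.b64decode(payload)
    # Wrap the bytes so that an image library can read them like a file
    return BytesIO(binary_data)
```

The returned object can be handed to Image.open() from Pillow in the same way as file_like above.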

Complete code:

    # -*- coding: utf-8 -*-
    # form.py
    import urllib
    import urllib2
    import cookielib
    from io import BytesIO
    import lxml.html
    from PIL import Image

    REGISTER_URL = ''  # set to the registration page URL
    #REGISTER_URL = ''

    def extract_image(html):
        tree = lxml.html.fromstring(html)
        img_data = tree.cssselect('div#recaptcha img')[0].get('src')
        # remove data:image/png;base64, header
        img_data = img_data.partition(',')[-1]
        #open('test_.png', 'wb').write(data.decode('base64'))
        binary_img_data = img_data.decode('base64')
        file_like = BytesIO(binary_img_data)
        img = Image.open(file_like)
        #img.save('test.png')
        return img

    def parse_form(html):
        """extract all input properties from the form
        """
        tree = lxml.html.fromstring(html)
        data = {}
        for e in tree.cssselect('form input'):
            if e.get('name'):
                data[e.get('name')] = e.get('value')
        return data

    def register(first_name, last_name, email, password, captcha_fn):
        cj = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        html = opener.open(REGISTER_URL).read()
        form = parse_form(html)
        form['first_name'] = first_name
        form['last_name'] = last_name
        form['email'] = email
        form['password'] = form['password_two'] = password
        img = extract_image(html)
        captcha = captcha_fn(img)
        form['recaptcha_response_field'] = captcha
        encoded_data = urllib.urlencode(form)
        request = urllib2.Request(REGISTER_URL, encoded_data)
        response = opener.open(request)
        success = '/user/register' not in response.geturl()
        #success = '/places/default/user/register' not in response.geturl()
        return success
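The listing above targets Python 2 (urllib2, cookielib). As a rough Python 3 sketch of parse_form() alone, the version below swaps lxml's cssselect for the standard library's html.parser, so it is not the original implementation. The class name _FormInputParser is made up for this example.

```python
from html.parser import HTMLParser


class _FormInputParser(HTMLParser):
    """Collect the name/value pairs of <input> tags found inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.data = {}

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            attributes = dict(attrs)
            # Skip inputs without a name, as the original parse_form() does
            if attributes.get("name"):
                self.data[attributes["name"]] = attributes.get("value")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def parse_form(html):
    """Extract all named input fields from the forms in `html`."""
    parser = _FormInputParser()
    parser.feed(html)
    return parser.data
```

Inputs that have a name but no value attribute map to None, which matches the recaptcha_response_field entry in the output shown earlier.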

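2. Solving CAPTCHAs with optical character recognition

Optical character recognition (OCR) extracts text from images. In this section we use the open-source Tesseract OCR engine, which was originally developed at HP and is now led by Google. Install Tesseract by following its installation instructions, then install its Python wrapper, pytesseract, with pip: pip install pytesseract.

Now we apply OCR to the CAPTCHA image: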

    >>> import pytesseract
    >>> import form
    >>> img = form.extract_image(html)
    >>> pytesseract.image_to_string(img)
    ''

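If the raw CAPTCHA image is passed straight to pytesseract, the text usually cannot be read, and here the result is an empty string. Tesseract is designed for more typical text, such as book pages with a uniform background. So we remove the background noise and keep only the text. CAPTCHA text is usually black on a brighter background, so we can isolate the text by checking whether each pixel is black. This process is known as thresholding.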

    >>> img.save('2captcha_1original.png')
    >>> gray = img.convert('L')
    >>> gray.save('2captcha_2gray.png')
    >>> bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
    >>> bw.save('2captcha_3thresholded.png')

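Only pixels with a value below 1 (pure black) are kept; everything else becomes white. This produces three images: the original CAPTCHA, the grayscale version, and the thresholded black-and-white version. Finally we run Tesseract on the thresholded image, and the text in the CAPTCHA is extracted successfully.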

    >>> pytesseract.image_to_string(bw)
    'language'

    >>> from PIL import Image
    >>> import pytesseract
    >>> img = Image.open('2captcha_3thresholded.png')
    >>> pytesseract.image_to_string(img)
    'language'

Tested against the sample set, 90 out of 100 CAPTCHAs are recognized correctly.

    >>> import ocr
    >>> ocr.test_samples()
    Accuracy: 90/100

Here is the complete code for registering an account:

    # -*- coding: utf-8 -*-
    import csv
    import string
    from PIL import Image
    import pytesseract
    from form import register

    def main():
        print register('Wu1', 'Being1', 'Wu_Being001@qq.com', 'example', ocr)

    def ocr(img):
        # threshold the image to ignore background and keep text
        gray = img.convert('L')
        #gray.save('captcha_greyscale.png')
        bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
        #bw.save('captcha_threshold.png')
        word = pytesseract.image_to_string(bw)
        # keep only ASCII letters and lowercase them
        ascii_word = ''.join(c for c in word if c in string.letters).lower()
        return ascii_word

    if __name__ == '__main__':
        main()

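OCR performance can be improved further:

– Experiment with different threshold values

– Erode the thresholded text to emphasize the character shapes

– Resize the image

– Train the OCR tool on the CAPTCHA font

– Restrict the results to dictionary words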
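The last suggestion could be sketched in Python 3 with difflib from the standard library: clean the raw OCR output, then snap it to the closest word in a vocabulary. The function names, the sample vocabulary and the 0.6 cutoff are assumptions for illustration, not part of the original code.

```python
import difflib
import string


def clean_ocr_text(text):
    """Keep only ASCII letters and lowercase them (Python 3 form of ocr() above)."""
    return "".join(c for c in text if c in string.ascii_letters).lower()


def closest_word(text, vocabulary, cutoff=0.6):
    """Return the vocabulary word most similar to the OCR output, or None."""
    candidate = clean_ocr_text(text)
    # get_close_matches() ranks words by similarity and drops those below cutoff
    matches = difflib.get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical vocabulary; a real run would load a full word list
VOCABULARY = ["language", "strange", "example"]
guess = closest_word(" Languagc\n", VOCABULARY)
```

A higher cutoff rejects more doubtful matches, at the cost of returning None more often.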

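How to install Python and configure extension packages for crawling

I. Installing Python and basic concepts

1. Installing Python

Before programming in Python you need to install it. Linux distributions usually ship with a Python interpreter; on Windows you download it from the Downloads page of the official website. The steps are:

Step 1: Open a web browser and go to the official website.

Step 2: On the home page, click the Download link to open the download page and choose a Python version. The author chose Python 2.7.8; click the "Download" link.

Python download address:

Step 3: Choose where to save the file and download it.

Step 4: Double-click the downloaded "python-2.7.8.msi" file to start the installation.

Step 5: In the Python setup wizard, keep the default settings and click "Next". Choose the installation path, here the default "C:\Python27", and click "Next", as shown in the figure.

Note 1: Installing Python on the C: drive is recommended, typically at C:\Python27. The path must not contain Chinese characters.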

After a successful installation, you will see the result shown in the figure below:

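How do I write a Python crawler?

Steps

Overall approach and workflow

A simple code demonstration

Preparation

Download and install the required Python libraries, including:

Request the target pages and parse the returned data

For a simple crawler this step is straightforward: send the request with the requests library, parse the returned data, then locate and select elements in the parsed document to extract the data you need.

You can define different crawlers to scrape different pages and drive them from a program to build an automated crawler.

Below is an example of a crawler.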
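A minimal sketch of the request-then-parse flow described above. To stay self-contained it swaps the requests library for urllib and html.parser from the standard library; the names LinkParser, fetch and extract_links, and the User-Agent value, are made up for this example.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))


def fetch(url, timeout=10):
    """Request `url` and return the response body as text."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html, base_url):
    """Parse `html` and return the links it contains."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links
```

A crawl of one page is then extract_links(fetch(url), url); repeating that for each returned link, with a set of already-visited URLs, gives a basic automated crawler.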

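Which libraries does a Python crawler need?

I. Request libraries

1. requests

requests is a third-party library that is more convenient to use than Python's built-in urllib.

2. selenium

Use it to drive a browser and simulate user actions.

3. chromedriver

Install chromedriver so that selenium can drive Chrome.

4. aiohttp

aiohttp is an asynchronous request library that can speed up data fetching.

II. Parsing libraries

1. lxml

lxml is a Python parsing library that supports HTML and XML, supports XPath queries, and parses very efficiently.

2. beautifulsoup4

Beautiful Soup makes it easier to extract data from HTML documents.

3. pyquery

pyquery is a web page parsing library that uses a jQuery-like syntax to parse HTML documents.

III. Storage libraries

1. mysql

2. mongodb

3. redis

IV. The Scrapy crawler framework

Scrapy is an asynchronous crawling framework implemented in pure Python, used to scrape web page content and images.

Its basic dependencies, such as lxml, pyOpenSSL and Twisted, need to be installed first.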

Original article by EKTK. If you repost it, please credit the source: https://www.506064.com/zh-tw/n/148283.html