How do I scrape dynamic page content with Python?
I wrote a function long ago that takes a URL and returns its HTML.
Search for:
getUrlRespHtml
and you will find the corresponding Python functions:
#------------------------------------------------------------------------------
# Python 2 code (urllib2). gConst and gVal are assumed to be module-level dicts
# defined elsewhere in crifanLib.py; they are not part of this excerpt.
import logging
import urllib
import urllib2

def getUrlResponse(url, postDict={}, headerDict={}, timeout=0, useGzip=False, postDataDelimiter=""):
    """Get response from url, support optional postDict, headerDict, timeout, useGzip

    Note:
    1. if postDict is not empty, the request automatically becomes POST instead of the default GET
    2. if you want cookies handled automatically, call initAutoHandleCookies() before using this function;
       the following urllib2.Request will then handle cookies automatically
    """
    # make sure url is a str, not unicode, otherwise urllib2.urlopen will raise an error
    url = str(url)

    if postDict:
        if postDataDelimiter == "":
            postData = urllib.urlencode(postDict)
        else:
            postData = ""
            for eachKey in postDict.keys():
                postData += str(eachKey) + "=" + str(postDict[eachKey]) + postDataDelimiter
        postData = postData.strip()
        logging.info("postData=%s", postData)
        req = urllib2.Request(url, postData)
        logging.info("req=%s", req)
        req.add_header('Content-Type', "application/x-www-form-urlencoded")
    else:
        req = urllib2.Request(url)

    defHeaderDict = {
        'User-Agent': gConst['UserAgent'],
        'Cache-Control': 'no-cache',
        'Accept': '*/*',
        'Connection': 'Keep-Alive',
    }

    # add the default headers first
    for eachDefHd in defHeaderDict.keys():
        #print "add default header: %s=%s"%(eachDefHd,defHeaderDict[eachDefHd])
        req.add_header(eachDefHd, defHeaderDict[eachDefHd])

    if useGzip:
        #print "use gzip for",url
        req.add_header('Accept-Encoding', 'gzip, deflate')

    # add customized headers afterwards, so they can overwrite the defaults
    if headerDict:
        #print "added header:",headerDict
        for key in headerDict.keys():
            req.add_header(key, headerDict[key])

    if timeout > 0:
        # set the timeout value if necessary
        resp = urllib2.urlopen(req, timeout=timeout)
    else:
        resp = urllib2.urlopen(req)

    # save updated cookies to the local file
    if gVal['cookieUseFile']:
        gVal['cj'].save()
        logging.info("gVal['cj']=%s", gVal['cj'])

    return resp
#------------------------------------------------------------------------------
# get the response html (the body) from url
import zlib

def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True, postDataDelimiter=""):
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip, postDataDelimiter)
    respHtml = resp.read()

    # Even if "Accept-Encoding: gzip, deflate" was not sent, the server may still
    # respond with gzip or deflate data, so decide from the response headers
    # instead of from useGzip.
    respInfo = resp.info()
    # Example response headers:
    # Server: nginx/1.0.8
    # Date: Sun, 08 Apr 2012 12:30:35 GMT
    # Content-Type: text/html
    # Transfer-Encoding: chunked
    # Connection: close
    # Vary: Accept-Encoding
    # ...
    # Content-Encoding: gzip

    # Sometimes the request asks for gzip/deflate but the server returns
    # uncompressed html; the response headers then lack "Content-Encoding: gzip".
    # So only decompress when the data really is compressed.
    if "Content-Encoding" in respInfo:
        if "gzip" == respInfo['Content-Encoding']:
            # 16 + MAX_WBITS: expect a gzip header and trailer
            respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS)
        elif "deflate" == respInfo['Content-Encoding']:
            # negative wbits: raw deflate stream without a zlib header
            respHtml = zlib.decompress(respHtml, -zlib.MAX_WBITS)

    return respHtml
Sample usage:
url = ""  # fill in the target URL
respHtml = getUrlRespHtml(url)
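For the complete library, search for:
crifanLib.py
For details on scraping dynamic pages, see:
Python專題教程:抓取網站,模擬登陸,抓取動態網頁 (Python tutorial series: scraping websites, simulating login, scraping dynamic pages)
(search for the title to find it)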
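The helper above targets Python 2 (urllib2). For readers on Python 3, the same idea can be sketched with the standard library as follows. This is a minimal sketch rather than a port of crifanLib: the names decode_body and fetch_html are hypothetical, cookie handling and POST support are left out, and "deflate" is treated as a raw deflate stream only because the original code does so.

```python
import urllib.request
import zlib


def decode_body(raw, content_encoding):
    """Decompress a response body according to its Content-Encoding header."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        return zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    if encoding == "deflate":
        # negative wbits means a raw deflate stream (same assumption as the original)
        return zlib.decompress(raw, -zlib.MAX_WBITS)
    # no (or unknown) Content-Encoding: return the bytes untouched
    return raw


def fetch_html(url, timeout=10):
    """Fetch url and return the decompressed body as bytes."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip, deflate"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

As in the original, the decision to decompress is driven by the response header, not by what the request asked for.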
How do I get the links on a dynamic page with Python?
Four methods:
'''
Get all links on the current page
'''
import re

import requests
from bs4 import BeautifulSoup
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By

url = ''  # fill in the target URL
r = requests.get(url)
r.encoding = 'gb2312'

# Using re (lookbehind for href=" or href=', lookahead for the closing quote)
matches = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", r.text)
for link in matches:
    print(link)
print()

# Using BeautifulSoup4 (DOM tree)
soup = BeautifulSoup(r.text, 'lxml')
for a in soup.find_all('a', href=True):  # skip <a> tags that have no href
    link = a['href']
    print(link)
print()

# Using lxml.etree (XPath)
tree = etree.HTML(r.text)
for link in tree.xpath("//@href"):
    print(link)
print()

# Using selenium (this opens a browser!)
driver = webdriver.Firefox()
driver.get(url)
for link in driver.find_elements(By.TAG_NAME, "a"):
    print(link.get_attribute("href"))
driver.quit()
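The href values printed by the first three methods are whatever the HTML contains, so they may be relative paths or repeated. A minimal sketch of collecting them as absolute, de-duplicated URLs using only the standard library is shown below. The class and function names, the sample HTML and the example.com base URL are all made up for illustration. Like the first three methods, it only sees links present in the HTML it is given; for JavaScript-generated links it would need the rendered source, for example the page source obtained through selenium.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags as absolute URLs, without duplicates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # resolve relative paths against the page URL
                absolute = urljoin(self.base_url, value)
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up sample input
sample = ('<a href="/a.html">A</a> <a href="b.html">B</a> '
          '<a>no href</a> <a href="/a.html">duplicate</a>')
links = extract_links(sample, "https://example.com/dir/index.html")
for link in links:
    print(link)
```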
How do I scrape dynamically loaded page data with Python?
There are a few typical approaches to scraping dynamic pages.
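Method 1: inspect how the dynamic page loads its data. If it uses AJAX, find the AJAX request and have Python send it directly. If the URL is generated by JavaScript, read the JS, work out the rule, and have Python generate the URL. This is the most commonly used approach.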
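Method 1 might look like the sketch below once the request has been identified in the browser's network panel. Everything specific here is an assumption for illustration: the endpoint https://example.com/api/list, the page and size parameters, and the "items" key in the JSON response are hypothetical and must be replaced with whatever the real page actually sends and receives.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, found by watching the page's network traffic
API_BASE = "https://example.com/api/list"


def build_ajax_request(page, page_size=20):
    """Build the request the page's JavaScript would send for one page of data."""
    # the URL rule (hypothetical): ?page=<n>&size=<m>
    query = urllib.parse.urlencode({"page": page, "size": page_size})
    return urllib.request.Request(
        API_BASE + "?" + query,
        headers={
            # many sites mark AJAX calls with this header
            "X-Requested-With": "XMLHttpRequest",
            "Accept": "application/json",
        },
    )


def parse_items(body):
    """Extract the item list from a JSON body (the response shape is assumed)."""
    return json.loads(body).get("items", [])


def fetch_page(page, timeout=10):
    """Send the request and return the parsed items."""
    with urllib.request.urlopen(build_ajax_request(page), timeout=timeout) as resp:
        return parse_items(resp.read().decode("utf-8"))
```

Keeping URL construction and response parsing in separate functions makes it possible to check the reverse-engineered rule without touching the network.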
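Method 2: have Python drive a browser built on the WebKit, IE, or Firefox engine, then save the rendered result. Browser testing frameworks are usually a good fit, since they have this capability built in.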
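Method 3: route traffic through an HTTP proxy, capture the content, and assemble it. You can even inject your own JS script as a hook. This approach is typically used by tools for reverse-engineering a system.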
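One possible way to try Method 3 is a small script for mitmproxy, which is a tool choice made here and not one named above. The sketch below uses mitmproxy's response hook to record JSON responses and to inject a script into HTML pages. The file name capture_hook.py, the captured list and the injected HOOK_JS snippet are hypothetical, and the script assumes it is loaded with something like mitmdump -s capture_hook.py while the browser is configured to use the proxy.

```python
# capture_hook.py (hypothetical name); load with: mitmdump -s capture_hook.py

captured = []  # (url, body) pairs for the JSON responses seen by the proxy

# Hypothetical script to inject; it only marks the page as hooked
HOOK_JS = "<script>window.__hooked = true;</script>"


def response(flow):
    """mitmproxy event hook, called for every completed HTTP response."""
    content_type = flow.response.headers.get("content-type", "")
    body = flow.response.text or ""
    if "json" in content_type:
        # keep the data the page loaded dynamically, for later assembly
        captured.append((flow.request.pretty_url, body))
    elif "html" in content_type and "</body>" in body:
        # inject our own script just before the closing body tag
        flow.response.text = body.replace("</body>", HOOK_JS + "</body>", 1)
```

The hook only relies on the flow's request URL, response headers and response text, so it can be exercised with stand-in objects before running it behind a real proxy.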
Original article by 小藍 (Xiao Lan). If you repost it, please credit the source: https://www.506064.com/zh-hant/n/244467.html