Saving dynamic web pages with Python; saving a web page as an image with Python

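Contents:

1. How do you scrape dynamic page content with Python?
2. How do you get the links on a dynamic web page with Python?
3. How to scrape dynamically loaded web page data with Python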

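How do you scrape dynamic page content with Python?

Given a URL, return its HTML: I wrote a function for that long ago.

Search for:

getUrlRespHtml

and you will find the corresponding Python functions: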

#------------------------------------------------------------------------------
# Python 2 code: in Python 3, urllib2 became urllib.request and
# urllib.urlencode moved to urllib.parse.
# gConst and gVal are module-level dicts defined elsewhere in crifanLib.py.
import logging
import urllib
import urllib2

def getUrlResponse(url, postDict={}, headerDict={}, timeout=0, useGzip=False, postDataDelimiter=""):
    """Get response from url, support optional postDict, headerDict, timeout, useGzip

    Note:
    1. if postDict is not empty, the request automatically becomes POST instead of the default GET
    2. if you want cookies handled automatically, call initAutoHandleCookies() before using this function;
       the following urllib2.Request will then handle cookies automatically
    """
    # make sure url is str, not unicode, otherwise urllib2.urlopen will raise an error
    url = str(url)

    if postDict:
        if postDataDelimiter == "":
            postData = urllib.urlencode(postDict)
        else:
            postData = ""
            for eachKey in postDict.keys():
                postData += str(eachKey) + "=" + str(postDict[eachKey]) + postDataDelimiter
        postData = postData.strip()
        logging.info("postData=%s", postData)
        req = urllib2.Request(url, postData)
        logging.info("req=%s", req)
        req.add_header('Content-Type', "application/x-www-form-urlencoded")
    else:
        req = urllib2.Request(url)

    defHeaderDict = {
        'User-Agent'    : gConst['UserAgent'],
        'Cache-Control' : 'no-cache',
        'Accept'        : '*/*',
        'Connection'    : 'Keep-Alive',
    }

    # add default headers first
    for eachDefHd in defHeaderDict.keys():
        #print "add default header: %s=%s"%(eachDefHd,defHeaderDict[eachDefHd])
        req.add_header(eachDefHd, defHeaderDict[eachDefHd])

    if useGzip:
        #print "use gzip for",url
        req.add_header('Accept-Encoding', 'gzip, deflate')

    # add customized headers later, so they can overwrite the defaults
    if headerDict:
        #print "added header:",headerDict
        for key in headerDict.keys():
            req.add_header(key, headerDict[key])

    if timeout > 0:
        # set timeout value if necessary
        resp = urllib2.urlopen(req, timeout=timeout)
    else:
        resp = urllib2.urlopen(req)

    # update cookies into local file
    if gVal['cookieUseFile']:
        gVal['cj'].save()
        logging.info("gVal['cj']=%s", gVal['cj'])

    return resp

#------------------------------------------------------------------------------
# get response html (== body) from url
import zlib

#def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=False) :
def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True, postDataDelimiter=""):
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip, postDataDelimiter)
    respHtml = resp.read()

    # even if we did not send "Accept-Encoding: gzip, deflate", the server may
    # still respond with gzip or deflate, so decide from the response headers
    #if(useGzip) :
    #print "---before unzip, len(respHtml)=",len(respHtml)
    respInfo = resp.info()

    # Server: nginx/1.0.8
    # Date: Sun, 08 Apr 2012 12:30:35 GMT
    # Content-Type: text/html
    # Transfer-Encoding: chunked
    # Connection: close
    # Vary: Accept-Encoding
    # ...
    # Content-Encoding: gzip

    # sometimes the request asks for gzip,deflate, but the returned html is not gzipped
    # -> the response info does not include the above "Content-Encoding: gzip"
    # eg:
    # -> so only decompress when the data is indeed gzipped

    # Content-Encoding: deflate
    if "Content-Encoding" in respInfo:
        if "gzip" == respInfo['Content-Encoding']:
            # 16 + MAX_WBITS tells zlib to expect a gzip header
            respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS)
        elif "deflate" == respInfo['Content-Encoding']:
            # negative wbits means a raw deflate stream with no header
            respHtml = zlib.decompress(respHtml, -zlib.MAX_WBITS)

    return respHtml

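And sample code:

url = ""
respHtml = getUrlRespHtml(url)

For the complete library, search for:

crifanLib.py

For more on scraping dynamic pages, see:

Python專題教程：抓取網站，模擬登陸，抓取動態網頁 (Python tutorial: scraping websites, simulating login, scraping dynamic web pages)

(search for the title to find it)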
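The two functions above target Python 2. As a rough sketch of the same idea on Python 3, using only the standard library: the names `build_request`, `decode_body`, and `fetch_html`, the default User-Agent string, and the 10-second timeout are all illustrative assumptions, not part of crifanLib. The fallback for zlib-wrapped deflate is also an addition, since some servers send that form.

```python
import urllib.parse
import urllib.request
import zlib

# Hypothetical default; crifanLib reads its own value from gConst['UserAgent'].
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-fetcher)"


def build_request(url, post_dict=None, header_dict=None, use_gzip=True):
    """Build a urllib Request; a non-empty post_dict turns it into a POST."""
    data = None
    if post_dict:
        data = urllib.parse.urlencode(post_dict).encode("ascii")
    req = urllib.request.Request(url, data=data)
    req.add_header("User-Agent", DEFAULT_USER_AGENT)
    if use_gzip:
        req.add_header("Accept-Encoding", "gzip, deflate")
    # Caller-supplied headers go last so they can override the defaults.
    for key, value in (header_dict or {}).items():
        req.add_header(key, value)
    return req


def decode_body(body, content_encoding):
    """Decompress body according to the Content-Encoding response header."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == "deflate":
        try:
            return zlib.decompress(body, -zlib.MAX_WBITS)  # raw deflate stream
        except zlib.error:
            return zlib.decompress(body)  # zlib-wrapped deflate
    return body


def fetch_html(url, post_dict=None, header_dict=None, timeout=10, use_gzip=True):
    """Fetch url and return the decompressed response body as bytes."""
    req = build_request(url, post_dict, header_dict, use_gzip)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

`fetch_html` returns bytes, so call `.decode()` with the page's encoding before parsing. Like the original, this only returns what the server sends; content filled in later by JavaScript is not included.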

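How do you get the links on a dynamic web page with Python?

Four methods:

'''
Get all links on the current page
'''
import re

import requests
from bs4 import BeautifulSoup
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By

url = ''
r = requests.get(url)
r.encoding = 'gb2312'  # encoding of the target page; adjust to match your site

# Using re: the lookbehind skips href=" or href=' so only the attribute value is matched
matches = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", r.text)
for link in matches:
    print(link)
print()

# Using BeautifulSoup4 (DOM tree); href=True skips <a> tags that have no href
soup = BeautifulSoup(r.text, 'lxml')
for a in soup.find_all('a', href=True):
    link = a['href']
    print(link)
print()

# Using lxml.etree (XPath)
tree = etree.HTML(r.text)
for link in tree.xpath("//@href"):
    print(link)
print()

# Using selenium (this opens a real browser, so links added by JavaScript are included)
driver = webdriver.Firefox()
driver.get(url)
# find_elements_by_tag_name was removed in Selenium 4; use find_elements(By.TAG_NAME, ...)
for link in driver.find_elements(By.TAG_NAME, "a"):
    print(link.get_attribute("href"))
driver.quit()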
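The first three methods return href values exactly as they appear in the page, so relative paths, in-page anchors, and duplicates come through as-is. A minimal post-processing sketch, assuming the standard-library `urllib.parse`; `normalize_links` is an illustrative name, not something from the answer above:

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, dropping non-HTTP schemes and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # an attribute lookup can return None or an empty string
        # urljoin resolves relative paths; urldefrag strips any "#fragment".
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if not absolute.startswith(("http://", "https://")):
            continue  # skips javascript:, mailto:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

Pass it the page URL and the list of links collected by any of the four methods.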

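How to scrape dynamically loaded web page data with Python

Scraping dynamic pages comes down to a few typical approaches.

Approach 1: work out the page's loading rules directly. If the data arrives via AJAX, find the AJAX request and have Python send it. If the URL is generated by JavaScript, read the JS, figure out the rules, and have Python build the same URL. This is the most common approach.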
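As a sketch of what "have Python send the AJAX request" can look like: the endpoint URL, the `page` and `size` query parameters, and the `items`/`title` JSON layout below are all made up for illustration. In practice you would copy the real ones from the browser's network panel.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, as it might appear in the browser's Network panel.
API_URL = "https://example.com/api/articles"


def build_page_url(page, page_size=20):
    """Rebuild the URL the page's JavaScript would request for one page of data."""
    query = urllib.parse.urlencode({"page": page, "size": page_size})
    return API_URL + "?" + query


def parse_titles(json_text):
    """Pull the fields of interest out of the JSON payload (assumed layout)."""
    payload = json.loads(json_text)
    return [item["title"] for item in payload.get("items", [])]


def fetch_titles(page):
    """Request one page from the endpoint directly; no browser is involved."""
    req = urllib.request.Request(
        build_page_url(page),
        # Some sites check this header to tell AJAX calls from page loads.
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_titles(resp.read().decode("utf-8"))
```

Keeping URL construction and JSON parsing in separate functions makes it easy to check each against a sample captured from the browser before sending any requests.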

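Approach 2: drive a real browser from Python, whether it uses the WebKit, IE, or Firefox engine, and save the rendered result. Browser testing frameworks are the usual choice, since they have this built in.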
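One possible shape for this, assuming Selenium 4 as the testing framework and a local Firefox: `save_rendered_page` and `demo` are illustrative names, and the URL and file names are placeholders. The helper takes the driver as an argument so any browser Selenium supports can be used.

```python
def save_rendered_page(driver, url, html_path, png_path):
    """Load url in the browser, then save the page source and a screenshot."""
    driver.get(url)
    # page_source is the page as the browser currently holds it,
    # which includes changes made by scripts that have already run.
    with open(html_path, "w", encoding="utf-8") as f:
        f.write(driver.page_source)
    # save_screenshot writes a PNG of the current window and returns True on success.
    return driver.save_screenshot(png_path)


def demo():
    # Assumes Selenium 4 and a local Firefox; URL and file names are placeholders.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        save_rendered_page(driver, "https://example.com/", "page.html", "page.png")
    finally:
        driver.quit()
```

If the page loads its data some time after the initial load, you will likely need to wait for the relevant element before saving; Selenium's `WebDriverWait` is the usual tool for that.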

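Approach 3: route traffic through an HTTP proxy, capture the content there, and assemble it yourself. You can even inject your own JS into the page to hook its calls. This approach is mostly used in tools for reverse engineering a system.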
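A minimal sketch of the proxy idea, assuming mitmproxy as the HTTP proxy (the text does not name one). mitmproxy calls a module-level `response(flow)` function for each completed response when the script is loaded with `mitmdump -s`. `HOOK_JS` and `inject_hook` are illustrative names, and the injected script is only an example of a hook.

```python
import re

# Hypothetical script to inject; it logs every fetch() call the page makes.
HOOK_JS = (
    "<script>(function(){var f=window.fetch;"
    "window.fetch=function(){console.log('fetch',arguments[0]);"
    "return f.apply(this,arguments);};})();</script>"
)

# Matches <head> or <head ...attributes>, but not <header>.
HEAD_TAG = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def inject_hook(html, snippet=HOOK_JS):
    """Insert snippet right after the <head> tag so it runs before page scripts."""
    match = HEAD_TAG.search(html)
    if match is None:
        return snippet + html
    return html[: match.end()] + snippet + html[match.end():]


def response(flow):
    """mitmproxy event hook, called once per completed HTTP response."""
    content_type = flow.response.headers.get("content-type", "")
    if "text/html" in content_type:
        flow.response.text = inject_hook(flow.response.text)
    elif "application/json" in content_type:
        # Log the data the page loads dynamically; store it instead if needed.
        print(flow.request.pretty_url, len(flow.response.text))
```

The browser has to be configured to use the proxy and, for HTTPS sites, to trust its certificate.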

Original article by 小藍. If you repost it, please credit the source: https://www.506064.com/zh-hant/n/244467.html
