Simulating Login with Python

1. Basic Principles of Login

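Simulated login means using a program to imitate the steps a user takes to log in. The underlying principle is the same everywhere, but each site may use a different verification mechanism, so the concrete implementation differs from site to site.

The general principle is this: the client sends a login request to the server, and on success the server returns a Cookie. Subsequent requests must carry that Cookie in order to receive the pages that are only available after login.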


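import requests

login_url = "http://www.example.com/login"
# Form fields expected by the login endpoint (placeholder values).
data = {
    "username": "myusername",
    "password": "mypassword"
}
# A Session stores the cookies returned by the server and sends them
# again on later requests made through the same session object.
session = requests.Session()
response = session.post(login_url, data=data)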
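
To see why the session matters, the sketch below inspects what a follow-up request would carry, without touching the network. The cookie name `sessionid`, its value, and the `/profile` path are made up for illustration; a real server would set its own cookie in the login response.

```python
import requests

session = requests.Session()

# Stand-in for the cookie a real login response would set (hypothetical values).
session.cookies.set("sessionid", "abc123", domain="www.example.com", path="/")

# Build, but do not send, a follow-up request through the same session.
request = requests.Request("GET", "http://www.example.com/profile")
prepared = session.prepare_request(request)

# The session attaches the stored cookie to the outgoing request.
cookie_header = prepared.headers.get("Cookie")
print(cookie_header)
```

Any request made with `session.get(...)` or `session.post(...)` goes through the same preparation step, which is how the login state is carried forward.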

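2. Differences Between GET and POST Requests

GET and POST are both HTTP request methods, and both can be used to send data. They differ in the following ways:

Data location: GET sends its data in the URL (the query string); POST sends its data in the body of the HTTP request.
Length limits: the HTTP specification sets no limit for either method. In practice the length of a GET URL is capped by browsers and servers, while the size of a POST body is capped by server configuration.
Semantics: GET is defined as a safe, idempotent method that should only retrieve data, so repeating the same GET is expected to have no side effects on the server (the response can still change if the resource itself changes). POST is neither safe nor idempotent: it submits data for the server to process, which is why login forms use it and why repeating a POST may repeat the action.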
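
The difference in data location can be checked with the standard library alone. This is a minimal sketch that only builds the two requests and never sends them; the `/search` URL and the parameter names are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "python", "page": 2}
encoded = urlencode(params)  # form-encoded string built from the dict

# GET: the encoded data becomes the query string of the URL, and there is no body.
get_request = Request("http://www.example.com/search?" + encoded)

# POST: the URL stays clean and the same data travels in the request body.
post_request = Request("http://www.example.com/search", data=encoded.encode("ascii"))

print(get_request.get_method(), get_request.full_url, get_request.data)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

With `requests`, the same distinction is expressed by passing `params=` for the query string and `data=` for the body.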

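3. Example: A Crawler That Simulates Login

The example below uses the Xici free proxy IP site (西刺免費代理IP) to show how to simulate a login in Python, filter for high-anonymity proxy IPs, and store them in MongoDB.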

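Login flow of the Xici site

Logging in to the Xici site requires a username, a password, and a CAPTCHA. Because the CAPTCHA may be hard to recognize automatically, a simpler approach is to log in manually with an existing account, take the Cookie from the logged-in page, and send requests with that Cookie directly.

Code example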
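
One way to reuse a Cookie copied from the browser's developer tools is to split the raw `Cookie` header into name/value pairs first. The helper below is a hypothetical sketch, not part of the original article, and the cookie names in the usage line are invented.

```python
def cookie_header_to_dict(raw_header):
    """Split a raw 'Cookie:' header value into a {name: value} dict."""
    cookies = {}
    for part in raw_header.split(";"):
        part = part.strip()
        if not part:
            continue  # skip empty pieces such as a trailing ';'
        # Split on the first '=' only, because values may contain '=' themselves.
        name, _, value = part.partition("=")
        cookies[name] = value
    return cookies


# Example value as it might be copied from the browser (made-up data).
copied = "sessionid=abc123; token=x=y; "
parsed = cookie_header_to_dict(copied)
print(parsed)
```

The resulting dict can then be loaded into a session with `session.cookies.update(parsed)`, or passed per request through the `cookies=` argument of `requests.get`.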


import random
import re
import time
import requests
import pymongo

class XiciSpider(object):
    def __init__(self, proxy_type='高匿', page_num=10):
        # proxy_type is compared with the anonymity label shown on the page;
        # '高匿' is the site's label for high-anonymity proxies.
        self.proxy_type = proxy_type
        self.page_num = page_num
        self.session = requests.Session()
        # update() keeps the session's other default headers in place.
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"})
        self.base_url = "https://www.xicidaili.com/nt/{}"

        # Local MongoDB instance, database "proxies", collection "xici".
        self.conn = pymongo.MongoClient("mongodb://localhost:27017/")
        self.db = self.conn["proxies"]
        self.collection = self.db["xici"]

        self.cookies = self.get_cookies()

    def get_cookies(self):
        # Visit the listing page once and return the cookies the server sets.
        url = "https://www.xicidaili.com/nt/"
        response = self.session.get(url)
        return response.cookies

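    def get_proxy_list(self):
        proxy_list = []
        # NOTE: the HTML tags inside the original pattern were lost when the
        # article was extracted. The <td>...</td> cells below are an assumed
        # reconstruction; check the column order against the real page.
        pattern = re.compile(
            r'<tr class.*?<td>(.*?)</td>.*?<td>(.*?)</td>'
            r'.*?<td>(.*?)</td>.*?<td>(.*?)</td>', re.S)
        for page in range(1, self.page_num + 1):
            url = self.base_url.format(page)
            response = self.session.get(url, cookies=self.cookies)
            items = pattern.findall(response.text)
            for item in items:
                # Keep only the rows whose anonymity column matches.
                if item[3] == self.proxy_type:
                    proxy = {
                        "ip": item[0],
                        "port": item[1],
                        "protocol": item[2],
                        "proxy_type": item[3],
                    }
                    proxy_list.append(proxy)
            # Pause 1-2 seconds between pages to avoid hammering the site.
            time.sleep(random.uniform(1, 2))
        return proxy_list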

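    def check_proxy(self, proxy):
        # Fetch a test page through the proxy. Any request error, or a
        # status code of 400 or above, counts as a failed check.
        url = "http://icanhazip.com/"
        # Plain HTTP proxies are addressed with an http:// URL even when the
        # target is HTTPS, so the same address is used for both schemes.
        address = "http://{}:{}".format(proxy['ip'], proxy['port'])
        proxies = {
            "http": address,
            "https": address,
        }
        try:
            response = requests.get(url, proxies=proxies, timeout=5)
            return response.ok
        except requests.exceptions.RequestException:
            return False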

    def save_to_mongo(self, proxy_list):
        for proxy in proxy_list:
            if self.check_proxy(proxy):
                print("Check passed: {}:{} {}".format(proxy['ip'], proxy['port'], proxy['protocol']))
                # upsert=True inserts the proxy if its IP is new and updates it otherwise.
                self.collection.update_one({"ip": proxy['ip']}, {"$set": proxy}, upsert=True)
            else:
                print("Check failed: {}:{}".format(proxy['ip'], proxy['port']))

    def run(self):
        proxy_list = self.get_proxy_list()
        self.save_to_mongo(proxy_list)

if __name__ == '__main__':
    spider = XiciSpider()
    spider.run()
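
The row-extraction and filtering step can be tried offline before pointing the spider at a live site. The sketch below runs against a small hand-written table; the HTML layout, the column order, and the addresses are assumptions made for illustration and may not match the real page.

```python
import re

# Hand-written sample, not real data from the site.
SAMPLE_HTML = """
<table>
  <tr class="odd">
    <td>10.0.0.1</td><td>8080</td><td>HTTP</td><td>高匿</td>
  </tr>
  <tr class="">
    <td>10.0.0.2</td><td>3128</td><td>HTTPS</td><td>透明</td>
  </tr>
</table>
"""

# Capture the first four <td> cells of every <tr class=...> row.
ROW_PATTERN = re.compile(
    r"<tr class.*?<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>",
    re.S,
)


def parse_proxies(html, proxy_type):
    """Return one dict per table row whose anonymity label equals proxy_type."""
    proxies = []
    for ip, port, protocol, label in ROW_PATTERN.findall(html):
        if label == proxy_type:
            proxies.append(
                {"ip": ip, "port": port, "protocol": protocol, "proxy_type": label}
            )
    return proxies


high_anonymity = parse_proxies(SAMPLE_HTML, "高匿")
print(high_anonymity)
```

Regular expressions are fragile against markup changes; for anything beyond a quick script, an HTML parser is usually the safer choice.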

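4. Summary

Simulated login is one of the basics of web crawling: for sites with a login mechanism, it gives access to more data. Python offers many web frameworks and networking libraries, and with the underlying theory and some practice these tools can be used to crawl data more efficiently.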

Original article by RZFV. If you repost it, please credit the source: https://www.506064.com/zh-tw/n/150119.html