Scraping news from news sites with Python: crawling Sina News with Python

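Contents:

1. How can Python simply crawl the text of the first five pages of Tencent News?
2. How to crawl Tencent News content with a Python web crawler
3. How to crawl a news site with python3
4. How to fetch web page content with a Python crawler?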

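How can Python simply crawl the text of the first five pages of Tencent News?

You can use Beautiful Soup (beautifulsoup), a Python library for parsing HTML that makes it easy to pull data out of a page. A crawler first needs the URL of the page, then fetches the page source, and then extracts the desired content with regular expressions or some other method. In practice you work against the page source itself: inspect it to see where the data you need lives, then use beautifulsoup to extract the content of the specific HTML tags. There is plenty of related material online that is worth reading.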
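The idea of "extracting the content of specific HTML tags" can be sketched as follows. This is only an illustration, and it deliberately uses the standard library's html.parser instead of beautifulsoup so that it runs without installing anything. The class TagTextCollector, the function extract_tag_text and the sample HTML are hypothetical names and data made up for this sketch. With beautifulsoup, the same result would normally come from soup.find_all('p') followed by get_text() on each element.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # greater than 0 while inside the target tag
        self.texts = []   # one entry per matched element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.texts.append('')

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is appended as well
        if self.depth > 0:
            self.texts[-1] += data


def extract_tag_text(html, tag):
    """Return the stripped text of every <tag> element in html."""
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


# Hypothetical page fragment, standing in for downloaded page source
SAMPLE_HTML = """
<html><body>
  <h1>Sample headline</h1>
  <p>First paragraph.</p>
  <p>Second <b>paragraph</b>.</p>
</body></html>
"""

titles = extract_tag_text(SAMPLE_HTML, 'h1')
paragraphs = extract_tag_text(SAMPLE_HTML, 'p')
print(titles)       # ['Sample headline']
print(paragraphs)   # ['First paragraph.', 'Second paragraph.']
```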

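How to crawl Tencent News content with a Python web crawler

Web scraping means reading the network resource identified by a URL out of the network stream and saving it locally. It is like a program imitating what the IE browser does: the URL is sent to the server as the content of an HTTP request, and the resource in the server's response is then read back. In Python, we use the urllib2 component to fetch web pages (in Python 3 the same functionality lives in urllib.request). u…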
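A minimal sketch of "read the resource behind a URL and save it locally", assuming Python 3, where urllib.request takes the place of urllib2. The helper names fetch_page and save_page, the demo URL and the output file name are assumptions made for this example. The demo uses a data: URL so that it runs without a network connection. For a real crawl you would pass an http(s) address instead.

```python
import os
import tempfile
import urllib.request
from urllib.parse import quote


def fetch_page(url, encoding='utf-8'):
    """Send a request for url and return the response body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)


def save_page(text, path, encoding='utf-8'):
    """Write the downloaded text to a local file."""
    with open(path, 'w', encoding=encoding) as f:
        f.write(text)


# A data: URL carries its own content, which keeps the demo offline
DEMO_URL = 'data:text/html;charset=utf-8,' + quote('<h1>Hello news</h1>')

html = fetch_page(DEMO_URL)
out_path = os.path.join(tempfile.gettempdir(), 'page_demo.html')
save_page(html, out_path)
print(html)
```

Decoding with a fixed encoding is a simplification. Real pages may declare a different charset in their headers or markup.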

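How to crawl a news site with python3

Requirement:

Crawl news from a portal site and save each article's title, author, time and content to a local txt file.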

Python modules used:

import re              # regular expressions
import bs4             # Beautiful Soup 4 parsing module
import urllib.request  # network access (urllib2 in Python 2)
import News            # our own article structure
import codecs          # key to the encoding problem: open the file with codecs.open

bs4 is not part of the standard library and has to be installed first, for example with pip install beautifulsoup4. For details, see "Installing Python whl packages with pip from the Windows command line".

Program:

# coding=utf-8
import re              # regular expressions
import codecs          # codecs.open writes the output file in a fixed encoding
import urllib.request  # network access (urllib2 in Python 2)

import bs4             # Beautiful Soup 4 parsing module
import News            # our own article structure, defined in News.py

# Python 3 strings are Unicode, so the Python 2 workaround
# reload(sys) / sys.setdefaultencoding('utf-8') is not needed.

# Rule for matching article links
LINK_PATTERN = re.compile(r'http://\w+\.baijia\.baidu\.com/article/\w+')

url_set = set()    # URLs still to be crawled
url_old = set()    # URLs already crawled
NewsCount = 0      # number of articles saved so far
MaxNewsCount = 3   # stop after this many articles


# Collect all article links from the home page
def GetAllUrl(home):
    html = urllib.request.urlopen(home).read().decode('utf8')
    soup = bs4.BeautifulSoup(html, 'html.parser')
    for link in soup.find_all('a', href=LINK_PATTERN):
        url_set.add(link['href'])


# Crawl queued URLs until one article has been saved or the queue is empty
def GetNews():
    global NewsCount  # global count of saved articles
    while url_set:
        # Take the next link and mark it as crawled
        url = url_set.pop()
        url_old.add(url)
        try:
            # Fetch and parse the page
            html = urllib.request.urlopen(url).read().decode('utf8')
            soup = bs4.BeautifulSoup(html, 'html.parser')
            # Queue every article link that has not been crawled yet
            for link in soup.find_all('a', href=LINK_PATTERN):
                if link['href'] not in url_old:
                    url_set.add(link['href'])
            # Extract the article fields
            article = News.News()
            article.url = url                                               # URL
            page = soup.find('div', {'id': 'page'})
            article.title = page.find('h1').get_text()                      # title
            info = page.find('div', {'class': 'article-info'})
            article.author = info.find('a', {'class': 'name'}).get_text()   # author
            article.date = info.find('span', {'class': 'time'}).get_text()  # date
            article.about = page.find('blockquote').get_text()
            pnode = page.find('div', {'class': 'article-detail'}).find_all('p')
            article.content = ''
            for node in pnode:                             # article paragraphs
                article.content += node.get_text() + '\n'  # append each paragraph
        except Exception as e:
            # Skip pages that fail to download or do not have the expected layout
            print(e)
            continue
        SaveNews(article)
        NewsCount += 1
        print(NewsCount, article.title)
        break


def SaveNews(Object):
    file.write('【' + Object.title + '】' + '\t')
    file.write(Object.author + '\t' + Object.date + '\n')
    file.write(Object.content + '\n' + '\n')


home = ''  # starting page (left empty in the source)
GetAllUrl(home)

file = codecs.open('D:\\test.txt', 'a+', encoding='utf-8')  # output file

# Keep crawling until enough articles are collected or no links remain
while url_set and NewsCount < MaxNewsCount:
    GetNews()

file.close()

Article structure (News.py)

# coding: utf-8

# Article class definition
class News(object):
    def __init__(self):
        self.url = None
        self.title = None
        self.author = None
        self.date = None
        self.about = None
        self.content = None

The program keeps a count of the number of articles it has crawled.
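The crawler above relies on two sets, url_set for links still to visit and url_old for links already visited, so that no page is fetched twice and the crawl stops after a fixed number of pages. The sketch below isolates that bookkeeping. To keep it runnable offline it walks a small in-memory link graph instead of real pages. The function crawl, its get_links parameter and the LINKS table are hypothetical names and data introduced only for this illustration.

```python
def crawl(start, get_links, max_pages):
    """Visit pages reachable from start, never visiting one twice."""
    pending = {start}   # plays the role of url_set
    visited = set()     # plays the role of url_old
    order = []          # pages in the order they were visited
    while pending and len(order) < max_pages:
        url = pending.pop()
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            # Only queue links that have not been visited yet
            if link not in visited:
                pending.add(link)
    return order


# Hypothetical link graph: page -> pages it links to
LINKS = {
    'a': ['b', 'c'],
    'b': ['a', 'c'],
    'c': ['a', 'd'],
    'd': [],
}

pages = crawl('a', lambda url: LINKS.get(url, []), max_pages=10)
limited = crawl('a', lambda url: LINKS.get(url, []), max_pages=2)
print(sorted(pages))
print(len(limited))
```

Because set.pop() returns an arbitrary element, the visiting order after the first page is not fixed. Only the guarantee "each page at most once, at most max_pages pages" holds.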

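How to fetch web page content with a Python crawler?

Crawler workflow

Viewed abstractly, a web crawler comes down to the following steps:

Simulate a page request. Act like a browser and open the target site.

Get the data. Once the site is open, the data we need can be collected automatically.

Save the data. Once we have the data, persist it to a local file, a database, or some other storage.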
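For the last step, persisting to a database, here is a minimal sketch using the standard library's sqlite3 module. The table name news, its columns, the helper save_articles and the sample records are all assumptions made for this example. In a real crawler the records would come from the pages fetched and parsed in the first two steps.

```python
import sqlite3


def save_articles(conn, articles):
    """Insert (title, author, content) rows, creating the table if needed."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS news ('
        'title TEXT, author TEXT, content TEXT)'
    )
    conn.executemany(
        'INSERT INTO news (title, author, content) VALUES (?, ?, ?)',
        articles,
    )
    conn.commit()


# Hypothetical records standing in for data extracted by a crawler
records = [
    ('Headline one', 'Reporter A', 'Body of the first article.'),
    ('Headline two', 'Reporter B', 'Body of the second article.'),
]

conn = sqlite3.connect(':memory:')  # pass a file path to keep the data on disk
save_articles(conn, records)

saved_titles = [
    row[0] for row in conn.execute('SELECT title FROM news ORDER BY rowid')
]
print(saved_titles)  # ['Headline one', 'Headline two']
```

The ? placeholders let sqlite3 handle quoting, which matters because scraped text can contain any characters.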

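So how do we write our own crawler in Python? Here I want to highlight one Python library: Requests.

Using Requests

Requests is a Python library for making HTTP requests, and it is simple and convenient to use.

Simulating an HTTP request

Sending a GET request

When we open the Douban home page in a browser, the very first request sent is a GET request.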

import requests

res = requests.get('')  # the URL of the Douban home page is missing in the source
print(res)
print(type(res))

Output:

<Response [200]>
<class 'requests.models.Response'>

Original article by 小藍. If you repost it, please credit the source: https://www.506064.com/zh-tw/n/295666.html
