A Detailed Guide to Reading HTML File Content with Python

I. Basic Introduction

<html>
  <head>
    <title> This is a test title </title>
  </head>
  <body>
    <p> This is a test paragraph </p>
  </body>
</html>

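Python is a versatile programming language, and it can also be used to read the content of HTML files. Both built-in modules and third-party libraries are available for this. An HTML file is made up of tags, attributes, and text, so extracting the information we need from one requires understanding these basic building blocks. The small document above is the kind of file the following sections work with.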
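To make the three building blocks concrete, here is a minimal sketch that uses the standard library's html.parser.HTMLParser to record each tag, its attributes, and each piece of text as the parser meets them. The class name ElementLister and the sample string are illustrative and not part of any library.

```python
from html.parser import HTMLParser


class ElementLister(HTMLParser):
    """Hypothetical helper: records tags, attributes, and text while parsing."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("tag", tag, dict(attrs)))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.events.append(("text", text))


sample = '<p class="first"> This is a test paragraph </p>'
lister = ElementLister()
lister.feed(sample)
lister.close()

for event in lister.events:
    print(event)
```

Here the tag is p, the attribute is class with the value first, and the text is the sentence between the opening and closing tags.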

II. Reading an HTML File

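Python ships with built-in modules for reading files, and these can read HTML as well. We can use the built-in urllib package to fetch a website's HTML. The response body arrives as bytes, which can then be decoded into a string. Below is a simple example of reading HTML with Python: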

import urllib.request

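with urllib.request.urlopen('http://www.example.com/') as response:
    # read() returns bytes; decode them with the charset the server declares
    charset = response.headers.get_content_charset() or 'utf-8'
    html = response.read().decode(charset)

print(html)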

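In this example we use Python's urllib.request module, which can open URLs. After opening the URL, we call response.read() to read the HTML content, store it in a variable, and finally print it. Note that response.read() returns bytes rather than str, so the content has to be decoded before it can be handled as text.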
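The example above fetches HTML over the network. When the HTML is already stored on disk, the built-in open() function is enough. The following is a minimal sketch; the helper name read_html_file, the file name test.html, and the UTF-8 encoding are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def read_html_file(path, encoding="utf-8"):
    """Hypothetical helper: return the content of an HTML file as a str."""
    with open(path, encoding=encoding) as f:
        return f.read()


# Create a small sample file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "test.html"
    sample_path.write_text(
        "<html><body><p> This is a test paragraph </p></body></html>",
        encoding="utf-8",
    )
    html_text = read_html_file(sample_path)

print(html_text)
```

Passing the encoding explicitly avoids depending on the platform default, which differs between systems.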

III. Parsing an HTML File

Python has many third-party libraries for parsing HTML. The most commonly used include:

BeautifulSoup
lxml
html5lib

1. Parsing HTML with BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML and XML files. It works on top of different parsers, including html.parser from the Python standard library.

from bs4 import BeautifulSoup

html_doc = """
  <html>
    <head>
      <title> This is a test title </title>
    </head>
    <body>
      <p class="first"> This is a test paragraph </p>
    </body>
  </html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

In this example we pass in a block of HTML and parse it with BeautifulSoup. Calling soup.prettify() returns the document as an indented string with one tag or text node per line:

<html>
 <head>
  <title>
   This is a test title
  </title>
 </head>
 <body>
  <p class="first">
   This is a test paragraph
  </p>
 </body>
</html>
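Pretty-printing is mostly useful for inspection. To pull individual values out of the parsed tree, BeautifulSoup offers attribute-style access and find(). The sketch below assumes the same sample document as above and that the bs4 package is installed; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
  <head>
    <title> This is a test title </title>
  </head>
  <body>
    <p class="first"> This is a test paragraph </p>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# soup.title is the first <title> tag; strip=True trims surrounding whitespace
title_text = soup.title.get_text(strip=True)

# Find the first <p> whose class attribute contains "first"
first_paragraph = soup.find("p", class_="first")
paragraph_text = first_paragraph.get_text(strip=True)

# class is a multi-valued attribute, so BeautifulSoup returns it as a list
paragraph_classes = first_paragraph["class"]

print(title_text)
print(paragraph_text)
print(paragraph_classes)
```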

2. Parsing HTML with lxml

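If we want to query an HTML file with XPath, the lxml library is a good fit. XPath is a language for navigating XML documents. It lets us select specific elements, and lxml applies it to parsed HTML as well.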

import requests
from lxml import html

page = requests.get('http://www.example.com/')
tree = html.fromstring(page.content)

print(tree)

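In this example we use the requests library to send a request to http://www.example.com and parse the page content with the fromstring() method. print(tree) only shows the root element object. To extract information we can call tree.xpath() with an XPath expression, or tree.cssselect() with a CSS selector (the latter requires the separate cssselect package).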
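As a minimal sketch of the XPath route, the following parses an inline HTML string instead of a downloaded page so that it runs without network access. It assumes lxml is installed; the sample document and variable names are illustrative.

```python
from lxml import html

html_doc = """
<html>
  <head>
    <title> This is a test title </title>
  </head>
  <body>
    <p class="first"> This is a test paragraph </p>
    <a href="https://www.example.com"> This is a link </a>
  </body>
</html>
"""

tree = html.fromstring(html_doc)

# text() selects the text nodes directly inside the matched elements
titles = [t.strip() for t in tree.xpath("//title/text()")]

# [@class="first"] keeps only <p> elements whose class attribute is exactly "first"
paragraphs = [t.strip() for t in tree.xpath('//p[@class="first"]/text()')]

# @href selects the attribute value instead of the element
hrefs = tree.xpath("//a/@href")

print(titles)
print(paragraphs)
print(hrefs)
```

Each xpath() call returns a list, because an expression can match any number of nodes.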

3. Parsing HTML with html5lib

The html5lib library parses HTML files according to the HTML5 specification.

import requests
import html5lib

page = requests.get('http://www.example.com/')
html = page.content
# Valid treebuilder names are 'etree' (the default), 'dom', and 'lxml'
document = html5lib.parse(html, treebuilder='etree')

print(document)

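In this example we use the requests library to send a request to http://www.example.com and parse the page content with the html5lib parser. With the etree tree builder, the result is an xml.etree.ElementTree element representing the root of the document.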

IV. Extracting Elements from an HTML File

1. Getting all the links in an HTML file

We can use Python's built-in re module to get all the links in an HTML file.

import re

html = """
  <html>
    <head>
      <title> This is a test title </title>
    </head>
    <body>
      <p class="first"> This is a test paragraph </p>
      <a href="https://www.example.com"> This is a link </a>
    </body>
  </html>
"""

# Match the href value of each <a ...> tag; the quotes may be single or double
link_regex = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)

links = link_regex.findall(html)

print(links)

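In this example we use the re module to build a regular expression for links and call findall() to find every link in the HTML. Regular expressions work for simple, predictable markup like this, but they are fragile on real-world HTML.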
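A pattern that scans raw text cannot tell a real link from one that sits inside an HTML comment, and it is easy to trip over unusual quoting or letter case. As a hedged alternative that still uses only the standard library, the sketch below collects href values with html.parser.HTMLParser. The class name LinkCollector and the sample markup are illustrative.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


sample_html = """
<p class="first"> This is a test paragraph </p>
<a href="https://www.example.com"> This is a link </a>
<A CLASS="nav" HREF='/about'>About</A>
<!-- <a href="https://commented-out.example"> hidden </a> -->
"""

collector = LinkCollector()
collector.feed(sample_html)
collector.close()

print(collector.links)
```

The commented-out link is skipped because the parser hands comments to a separate handler rather than treating their content as tags.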

2. Extracting links from an HTML file with BeautifulSoup

The BeautifulSoup library simplifies this process.

from bs4 import BeautifulSoup
import requests

url = 'http://www.example.com/'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

links = []

for link in soup.find_all('a'):
    links.append(link.get('href'))

print(links)

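In this example we use the requests library to send a request to http://www.example.com and parse the page content with BeautifulSoup. We use find_all() to find every link tag and get() to read each tag's href attribute. For an a tag that has no href, get() returns None.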
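The collected href values are often relative paths, and some entries may be None. A possible follow-up step, sketched here with the standard library's urllib.parse.urljoin, is to drop the missing values and resolve the rest against the page URL. The helper name absolute_links and the sample list are assumptions for illustration, not output from a real page.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Hypothetical helper: drop missing hrefs and resolve the rest against base_url."""
    return [urljoin(base_url, href) for href in hrefs if href]


# Illustrative input: an absolute URL, two relative ones, and a missing href
raw_links = ["https://www.iana.org/domains/example", "/about", "contact.html", None]

resolved = absolute_links("http://www.example.com/", raw_links)
print(resolved)
```

Absolute URLs pass through urljoin() unchanged, so the same helper can be applied to a mixed list.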