jsoup (org.jsoup.Jsoup): A Detailed Guide to the Java HTML Parser

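I. Introduction

jsoup is a Java library for working with HTML documents; its main entry point is the org.jsoup.Jsoup class. The library offers a rich API for parsing, querying, and modifying HTML from a Java program, so we can extract information from a document and automate HTML processing in code.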

II. Usage

jsoup is simple to use: add the jsoup jar to the project, and its methods can then be called from any Java program.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) throws Exception {
        // Fetch an HTML document from a URL
        Document doc = Jsoup.connect("http://example.com").get();

        // Parse an HTML document from a string
        String html = "<html><head></head><body><p>Hello World!</p></body></html>";
        Document docFromString = Jsoup.parse(html);

        // Select elements with a CSS selector
        Elements elements = doc.select("a[href]");

        // Iterate over the selected elements and print their details
        for (Element element : elements) {
            String href = element.attr("href");
            String text = element.text();
            System.out.printf("href=\"%s\", text=\"%s\"\n", href, text);
        }
    }
}
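
The fetch in the example above relies on the connection defaults. As a minimal sketch, assuming we want an explicit timeout and user agent, the connection can be configured before the request is sent. The class name ConnectionSetup, the method configure, and the user-agent string are hypothetical names chosen for illustration.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConnectionSetup {
    // Builds a configured connection. Nothing is sent over the network
    // until get() or execute() is called on the returned object.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; ExampleBot/1.0)") // hypothetical value
                .timeout(5000)          // timeout in milliseconds
                .followRedirects(true);
    }

    public static void main(String[] args) {
        try {
            Document doc = configure("http://example.com").get();
            System.out.println(doc.title());
        } catch (IOException e) {
            // get() throws IOException on network errors and HTTP error statuses
            System.err.println("Fetch failed: " + e.getMessage());
        }
    }
}
```

Catching IOException here, rather than declaring throws Exception on main, keeps a failed request from terminating the program with a stack trace.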

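III. Core Features

1. Obtaining HTML documents

jsoup can obtain an HTML document in several ways: by fetching it from a URL, by reading it from a file, or by parsing a string.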

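// Fetch an HTML document from a URL
Document doc = Jsoup.connect("http://example.com").get();

// Parse an HTML document from a file (requires import java.io.File);
// the third argument is the base URI used to resolve relative links
Document docFromFile = Jsoup.parse(new File("example.html"), "UTF-8", "http://example.com/");

// Parse an HTML document from a string
String html = "<html><head></head><body><p>Hello World!</p></body></html>";
Document docFromString = Jsoup.parse(html);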
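
The file example passes "http://example.com/" as a base URI. The following sketch shows what that argument is for: when a document is parsed with a base URI, relative links can be resolved to absolute URLs. The class name BaseUriExample and the method firstAbsoluteLink are hypothetical names for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriExample {
    // Parses the HTML against the given base URI and returns the absolute URL
    // of the first link, or an empty string when the document has no link.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/about.html\">About</a></p>";
        // prints http://example.com/about.html
        System.out.println(firstAbsoluteLink(html, "http://example.com/"));
    }
}
```

Without a base URI, absUrl returns an empty string for relative links, which is why the file and string variants accept one.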

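2. Element selectors

Selectors use CSS-style syntax to pick the elements of an HTML document that match a condition, so they can then be read or modified.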

// Select elements with a CSS selector
Elements elements = doc.select("a[href]");

// Select the first element with a given attribute value (null if nothing matches)
Element firstMatch = doc.selectFirst("a[href=\"http://example.com\"]");

// Iterate over the selected elements and print their details
for (Element element : elements) {
    String href = element.attr("href");
    String text = element.text();
    System.out.printf("href=\"%s\", text=\"%s\"\n", href, text);
}
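
Beyond attribute selectors, the same select call accepts class, id, child, and prefix-match selectors. The sketch below runs a few of them against an inline document so that it needs no network access. The class name SelectorExample, its method names, and the sample HTML are hypothetical and serve only as illustration.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorExample {
    // Sample markup, made up for this sketch
    static final String HTML =
            "<div id=\"nav\"><a href=\"http://example.org\">Org</a>"
          + "<a href=\"/local\">Local</a></div>"
          + "<div class=\"content\"><p>First</p><p>Second</p></div>";

    // Text of every <p> that is a direct child of <div class="content">
    static List<String> paragraphTexts(Document doc) {
        return doc.select("div.content > p").eachText();
    }

    // href of every link whose href attribute starts with "http"
    static List<String> externalLinks(Document doc) {
        return doc.select("a[href^=http]").eachAttr("href");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(paragraphTexts(doc));
        System.out.println(externalLinks(doc));
        System.out.println(doc.select("#nav a").size()); // links inside the element with id "nav"
    }
}
```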

3. DOM manipulation

jsoup supports DOM manipulation of a parsed HTML document, including changing element attributes and adding or removing elements.

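// Modify an element's attribute (selectFirst returns null when nothing matches)
Element element = doc.selectFirst("a[href=\"http://example.com\"]");
if (element != null) {
    element.attr("href", "http://example.org");
}

// Add an element
Element newElem = doc.createElement("p");
newElem.text("This is a new paragraph.");
doc.body().appendChild(newElem);

// Remove an element (no match is found if its href was already rewritten above)
Element oldElem = doc.selectFirst("a[href=\"http://example.com\"]");
if (oldElem != null) {
    oldElem.remove();
}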
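
The same operations can also be applied to every matched element at once through Elements. The following is a minimal sketch, assuming a cleanup task that strips script tags, marks links as nofollow, and appends a note. The class name DomCleanup, the method clean, and the note text are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DomCleanup {
    // Parses the HTML, applies three bulk edits, and returns the modified document.
    static Document clean(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("script").remove();                 // remove every <script> element
        doc.select("a[href]").attr("rel", "nofollow"); // set rel on every link
        doc.body().appendElement("p")                  // append a new <p> to <body>
                .addClass("note")
                .text("Cleaned by jsoup.");
        return doc;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/a\">A</a></p><script>alert(1)</script>";
        System.out.println(clean(html).body().html());
    }
}
```

Because Elements.remove() and Elements.attr() operate on an empty set without error, this style avoids the null checks that selectFirst requires.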

IV. Additional Features

jsoup also provides features beyond HTML parsing, such as parsing XML documents and unescaping HTML entities in strings.

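// Parse an XML document; xmlString holds the XML text (requires import org.jsoup.parser.Parser)
Document xmlDoc = Jsoup.parse(xmlString, "", Parser.xmlParser());

// Unescape HTML entities in a string; the result is "<div>Hello</div>"
String decodedString = org.jsoup.parser.Parser.unescapeEntities("&lt;div&gt;Hello&lt;/div&gt;", true);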
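
To make the two calls above concrete, here is a small self-contained sketch that parses an inline XML string with the XML parser and queries it with the same selector syntax used for HTML. The class name XmlExample, its method names, and the sample XML are hypothetical.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class XmlExample {
    // Returns the text of every <name> that is a direct child of an <item>.
    static List<String> itemNames(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        return doc.select("item > name").eachText();
    }

    // Converts entities such as &lt; and &gt; back to the characters they stand for.
    static String decode(String escaped) {
        return Parser.unescapeEntities(escaped, false);
    }

    public static void main(String[] args) {
        String xml = "<catalog><item><name>Pen</name></item>"
                   + "<item><name>Ink</name></item></catalog>";
        System.out.println(itemNames(xml));
        System.out.println(decode("&lt;div&gt;Hello&lt;/div&gt;"));
    }
}
```

Passing Parser.xmlParser() matters here: the default HTML parser would wrap the content in html, head, and body elements and apply HTML-specific rules to the tags.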

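V. Summary

This article covered the basic usage and core features of jsoup and showed how its API can parse, query, and modify HTML documents. In day-to-day development, jsoup makes it straightforward to automate HTML processing tasks, which improves productivity.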

Original article by BSMQ. If you repost it, please credit the source: https://www.506064.com/zh-tw/n/147437.html