Python Regular Expressions: Matching and Replacing Text with Character Patterns

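I. Regular Expression Basics

Before looking at regular expressions in Python, it helps to cover some basics. A regular expression describes the structure of a string, which lets us find, match and replace specific characters or substrings in text. A regular expression is built from ordinary characters and metacharacters, which together form a pattern. Ordinary characters are plain letters and digits, and each one matches itself. Metacharacters are special characters with a special meaning, used to express things such as wildcards, boundaries and repetition. A pattern is the matching rule formed by combining the two.

Some commonly used metacharacters are listed below: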

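.       matches any character except a newline
^       matches the start of the string
$       matches the end of the string
*       matches the preceding character or subexpression zero or more times
+       matches the preceding character or subexpression one or more times
?       matches the preceding character or subexpression zero or one time
{n}     matches the preceding character or subexpression exactly n times
{n,}    matches the preceding character or subexpression at least n times
{n,m}   matches the preceding character or subexpression at least n and at most m times
[]      matches any one of the characters inside the brackets
|       matches either the expression on its left or the one on its right
()      groups a subexpression and captures the text it matches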
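To make the list above concrete, here is a minimal sketch that runs a few of these metacharacters through re.findall and re.search. The sample strings and the names examples and results are made up for illustration and are not part of the original article.

```python
import re

# (pattern, sample text) pairs; all sample texts are illustrative assumptions.
examples = [
    (r"h.t", "hat hot h\nt"),      # . : any character except a newline
    (r"^hello", "hello world"),    # ^ : start of the string
    (r"world$", "hello world"),    # $ : end of the string
    (r"ab*", "a ab abbb"),         # * : zero or more "b"
    (r"ab+", "a ab abbb"),         # + : one or more "b"
    (r"colou?r", "color colour"),  # ? : optional "u"
    (r"a{2,3}", "a aa aaaa"),      # {n,m} : between 2 and 3 "a"
    (r"[aeiou]", "regex"),         # [] : any one listed character
    (r"cat|dog", "cat dog bird"),  # | : either alternative
]

results = {pattern: re.findall(pattern, text) for pattern, text in examples}
for pattern, found in results.items():
    print(f"{pattern!r:<12} {found}")

# () groups a subexpression so a quantifier applies to the whole group.
grouped = re.search(r"(ab)+", "ababab")
print(grouped.group())   # the whole match
print(grouped.group(1))  # the text captured by the last repetition of the group
```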

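II. The re Module: Python's Basic Regular Expression Library

The re module in the Python standard library provides the tools and functions for working with regular expressions.

1. re.match(pattern, string, flags=0): tries to match the pattern at the beginning of the string. It returns a match object on success and None otherwise.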

import re

pattern = 'hello'
string = 'hello, world!'
result = re.match(pattern, string)
print(result)

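The output is: <re.Match object; span=(0, 5), match='hello'>. Here, re.Match object is the match result, span gives the start and end positions of the match (the end position is exclusive), and match shows the matched text.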
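As a small sketch of how the match object is typically used, the example below reads the matched text and its position, and shows that re.match returns None when the pattern only occurs later in the string. The variable names text and m are illustrative.

```python
import re

text = "hello, world!"

m = re.match(r"hello", text)
if m is not None:
    print(m.group())           # the matched text
    print(m.span())            # (start, end) offsets, end exclusive
    print(m.start(), m.end())  # the same offsets, one at a time

# re.match only tries position 0, so "world" is not found here.
later = re.match(r"world", text)
print(later)
```

Because a failed match returns None, checking the result before calling group() avoids an AttributeError.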

2. re.search(pattern, string, flags=0): scans the whole string and returns a match object for the first match, or None if nothing matches.

import re

pattern = 'world'
string1 = 'hello, world!'
string2 = 'hello, python!'
result1 = re.search(pattern, string1)
result2 = re.search(pattern, string2)
print(result1, result2)

The output is: <re.Match object; span=(7, 12), match='world'> None.
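The flags parameter in these signatures changes how the pattern is interpreted. As one hedged illustration, the sketch below uses re.IGNORECASE with re.search; the sample text and variable names are assumptions made for the example.

```python
import re

text = "Hello, World!"

# Without flags the search is case-sensitive, so lowercase "world" is not found.
case_sensitive = re.search(r"world", text)

# re.IGNORECASE makes letters match regardless of case.
case_insensitive = re.search(r"world", text, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive.group(), case_insensitive.span())
```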

3. re.findall(pattern, string, flags=0): finds all non-overlapping matches in the string and returns them as a list.

import re

pattern = 'l'
string1 = 'hello, world!'
string2 = 'hello, python!'
result1 = re.findall(pattern, string1)
result2 = re.findall(pattern, string2)
print(result1, result2)

The output is: ['l', 'l', 'l'] ['l', 'l']. The first string contains three occurrences of l and the second contains two.
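One detail worth knowing is that re.findall changes its return value when the pattern contains groups. The sketch below shows this, and also uses re.finditer, which is not covered in the article but is part of the same module, to get positions. The sample text is made up.

```python
import re

text = "x=1, y=22, z=333"

# No groups: findall returns the full text of every match.
whole = re.findall(r"\w=\d+", text)

# With groups: findall returns the group contents (tuples if more than one group).
pairs = re.findall(r"(\w)=(\d+)", text)

# finditer yields match objects, which is useful when positions are needed.
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]

print(whole)
print(pairs)
print(positions)
```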

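III. Matching and Replacing Text with the re Module

The re module can find and replace specific characters or substrings in text. Here are a few examples.

1. Matching a whole word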

import re

pattern = r'\bhello\b'
string = 'hello, world! hello python!'
result = re.findall(pattern, string)
print(result)

Here, the r prefix marks a raw string and \b denotes a word boundary. The result is ['hello', 'hello'].
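To show what the raw string and \b each contribute, the following sketch compares the pattern with and without word boundaries. The sample sentence is invented for the comparison.

```python
import re

text = "hello, othello! hello_world says hello"

# Without \b, "hello" is also found inside "othello" and "hello_world".
without_boundary = re.findall(r"hello", text)

# With \b, only standalone words match; "_" counts as a word character.
with_boundary = re.findall(r"\bhello\b", text)

print(len(without_boundary), len(with_boundary))

# In a normal string "\b" is a single backspace character;
# the r prefix keeps the backslash so the regex engine sees \b.
print(len("\b"), len(r"\b"))
```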

2. Matching an email address

import re

pattern = r'\b\w+@\w+\.\w+(?:\.\w+)?\b'
string = 'My email address is abc123@qq.com.'
result = re.search(pattern, string)
print(result)

Here, \w matches a letter, digit or underscore, and (?:\.\w+)? matches an optional extra domain level. The result is <re.Match object; span=(20, 33), match='abc123@qq.com'>.
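The pattern above is deliberately simple. The sketch below applies it to several addresses with re.findall and compares it with a slightly looser pattern. The addresses and the looser pattern are assumptions for illustration only; neither pattern is a full validator for real-world email addresses.

```python
import re

text = "Contact alice@example.com or bob.smith@mail.example.org."

# The article's pattern: \w+ cannot cross the dot in "bob.smith".
simple = r"\b\w+@\w+\.\w+(?:\.\w+)?\b"
simple_found = re.findall(simple, text)

# A hypothetical looser pattern that also allows ".", "+" and "-" in the local part.
looser = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
looser_found = re.findall(looser, text)

print(simple_found)
print(looser_found)
```

Because the groups are non-capturing, findall returns the whole address rather than the group contents.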

3. Replacing specific characters in a string

import re

pattern = r'[aeiou]'
string = 'hello, world!'
result = re.sub(pattern, '*', string)
print(result)

Here, [aeiou] matches any one vowel, so re.sub replaces every vowel in the string with an asterisk. The output is "h*ll*, w*rld!".
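re.sub has a few more options than the example uses. The following sketch shows the count argument, a function as the replacement, and re.subn, which also reports how many replacements were made. The variable names are illustrative.

```python
import re

text = "hello, world!"

# count limits how many matches are replaced.
first_only = re.sub(r"[aeiou]", "*", text, count=1)

# The replacement can be a function that receives each match object.
upper_vowels = re.sub(r"[aeiou]", lambda m: m.group().upper(), text)

# subn returns the new string together with the number of replacements.
replaced, n = re.subn(r"[aeiou]", "*", text)

print(first_only)
print(upper_vowels)
print(replaced, n)
```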

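IV. Advanced Regular Expression Usage

Regular expressions have many advanced features that make matching and replacing text more precise and efficient. This section briefly introduces some of the common ones.

1. Grouping and capturing

Grouping and capturing let us work with parts of a match rather than the whole match. Wrapping a subexpression in parentheses forms a group, and the text it matches can be retrieved by its group number.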

import re

pattern = r'<a href="(.*?)">(.*?)</a>'
string = '<a href="http://www.baidu.com">百度</a>'
result = re.search(pattern, string)
print(result.group(1), result.group(2))

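Here, (.*?) matches any character zero or more times, as few times as possible. The first group captures the link address and the second group captures the link text. The output is "http://www.baidu.com 百度".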
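Groups can also be given names, which reads better than numeric indexes once a pattern has several groups. The sketch below is one way to write it. The group names url and text are chosen for the example, and using a regex on HTML is only an illustration, not a robust way to parse HTML.

```python
import re

html = '<a href="http://www.baidu.com">百度</a>'

# (?P<name>...) creates a named capturing group.
pattern = r'<a href="(?P<url>.*?)">(?P<text>.*?)</a>'
m = re.search(pattern, html)

print(m.group(0))      # group 0 is always the whole match
print(m.groups())      # all captured groups as a tuple
print(m.group("url"))  # access a group by name
print(m.groupdict())   # all named groups as a dict
```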

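2. Non-greedy matching

By default, quantifiers are greedy and match as much text as they can. Non-greedy (lazy) matching makes a quantifier match as little as possible instead. Adding ? after the quantifiers *, + or ? (or after {n,m}) turns on non-greedy matching.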

import re

pattern = r'<a.*?>(.*?)</a>'
string = '<a href="http://www.baidu.com">百度</a>'
result = re.search(pattern, string)
print(result.group(1))

Here, .*? matches any character zero or more times, as few times as possible. The group captures the link text, so the output is "百度".
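The difference between greedy and non-greedy matching is easiest to see side by side. This minimal sketch uses a made-up string with two tag pairs.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last </b>, so both tag pairs end up in one match.
greedy = re.findall(r"<b>(.*)</b>", text)

# Non-greedy: .*? stops at the first </b> it can, giving one match per pair.
lazy = re.findall(r"<b>(.*?)</b>", text)

print(greedy)
print(lazy)
```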

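3. Backreferences

Backreferences are a powerful technique for matching text that repeats. (?:pattern) creates a non-capturing group, while \1, \2 and so on refer back to the text already matched by an earlier capturing group.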

import re

pattern = r'(\b\w+)\s+\1'
string = 'hello hello, world world!'
result = re.findall(pattern, string)
print(result)

Here, (\b\w+) matches a word, \s+ matches one or more whitespace characters, and \1 refers to the text captured by the first group. The result is ['hello', 'world'].
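Backreferences also work in the replacement string of re.sub. The sketch below collapses doubled words, and adds a trailing \b, a small variation on the pattern above, so that the repeated text must be a whole word. The sample strings are invented for illustration.

```python
import re

text = "hello hello, world world!"

# \1 in the replacement inserts the text captured by group 1.
deduped = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)
print(deduped)

# Without the trailing \b, \1 can match the start of a longer word.
loose = re.findall(r"(\b\w+)\s+\1", "the theory")
strict = re.findall(r"\b(\w+)\s+\1\b", "the theory")
print(loose, strict)
```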

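V. Summary

Python's re module provides a rich set of regular expression tools for finding, matching and replacing specific characters and substrings in text. Becoming fluent with regular expressions takes sustained practice, and hopefully this article gives readers a useful starting point.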

Original article by RGCX. If you repost it, please credit the source: https://www.506064.com/zh-hant/n/135304.html