Text Matching and Extraction with Python Regular Expressions

1. Overview of Regular Expressions

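A regular expression (regex or RegExp for short) is a sequence of characters that describes a pattern in strings. Python's built-in re module supports matching, searching, replacing, and other regular expression operations on strings.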

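A pattern is built from two kinds of characters: ordinary characters and metacharacters. Ordinary characters, such as letters, digits, and most symbols, match themselves. Metacharacters carry a special meaning: for example, "." matches any character except a newline, and "{n}" repeats the preceding item n times.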
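To make the difference between ordinary characters and metacharacters concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration only.

```python
import re

# "." is a metacharacter: it matches any single character except a newline.
dot_matches = re.findall(r'h.t', 'hat hot h t hit')

# "{3}" is a metacharacter: it repeats the preceding item exactly three times.
triple_digits = re.findall(r'\d{3}', 'a12b345c6789')

# A backslash turns a metacharacter into an ordinary, literal character.
literal_dot = re.findall(r'\.', 'v1.2.3')

# re.escape builds a pattern that matches the given text literally.
escaped = re.escape('1+1=2')

print(dot_matches)    # ['hat', 'hot', 'h t', 'hit']
print(triple_digits)  # ['345', '678']
print(literal_dot)    # ['.', '.']
print(re.fullmatch(escaped, '1+1=2') is not None)  # True
```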

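Patterns in Python are usually written as raw strings. A string literal prefixed with r is a raw string, in which Python does not process backslash escape sequences, so every backslash reaches the regex engine unchanged.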

>>> import re
>>> pattern = r'\d+'  # match one or more digits
>>> string = '123hello456world789'
>>> re.findall(pattern, string)
['123', '456', '789']
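The effect of the r prefix is easiest to see with an escape that Python itself understands. The sketch below compares the same pattern written as a normal string and as a raw string; the sample text is an assumption chosen for illustration.

```python
import re

# In a normal string literal, "\b" is the single backspace character.
normal = '\bcat\b'
# In a raw string literal, "\b" stays as a backslash followed by "b",
# which the regex engine reads as a word boundary.
raw = r'\bcat\b'

text = 'cat concat cat.'

print(len(normal), len(raw))     # 5 7
print(re.findall(normal, text))  # []
print(re.findall(raw, text))     # ['cat', 'cat']
```

Without the r prefix the pattern looks for literal backspace characters and finds nothing, which is why raw strings are the usual choice for patterns.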

2. Matching and Searching

The re module offers several functions for matching and searching. The most commonly used are findall, search, and match.

findall returns a list of all non-overlapping substrings that match the pattern. If nothing matches, it returns an empty list.

>>> pattern = r'\d+'  # match one or more digits
>>> string = '123hello456world789'
>>> re.findall(pattern, string)
['123', '456', '789']
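One detail worth knowing is that the shape of findall's result depends on whether the pattern contains groups. The following sketch, with an invented sample string, also shows finditer, which yields Match objects instead of strings.

```python
import re

text = 'width=80, height=24'

# No groups: each item is the whole matched substring.
whole = re.findall(r'\w+=\d+', text)

# Two groups: each item is a tuple holding the captured groups.
pairs = re.findall(r'(\w+)=(\d+)', text)

# finditer yields Match objects, so positions are available too.
spans = [m.span() for m in re.finditer(r'\d+', text)]

print(whole)  # ['width=80', 'height=24']
print(pairs)  # [('width', '80'), ('height', '24')]
print(spans)  # [(6, 8), (17, 19)]
```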

search scans the string and returns a Match object for the first substring that matches the pattern, or None if there is no match.

>>> pattern = r'hello'
>>> string = '123hello456world789'
>>> re.search(pattern, string)
<re.Match object; span=(3, 8), match='hello'>
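Because search can return None, code that uses the result usually checks for that case first. Here is a minimal sketch; first_number is a hypothetical helper written for this illustration, not part of the re module.

```python
import re


def first_number(text):
    """Return the first run of digits in text as an int, or None."""
    m = re.search(r'\d+', text)
    if m is None:
        # No digits anywhere in the string.
        return None
    return int(m.group())


# A Match object exposes the matched text and its position.
m = re.search(r'hello', '123hello456world789')
print(m.group(), m.start(), m.end(), m.span())  # hello 3 8 (3, 8)

print(first_number('room 42, floor 7'))  # 42
print(first_number('no digits here'))    # None
```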

match only tries to match at the beginning of the string. If the string does not start with a match for the pattern, it returns None.

>>> pattern = r'^\d+'  # match leading digits (re.match already anchors at the start, so ^ is optional)
>>> string = '123hello456world789'
>>> re.match(pattern, string)
<re.Match object; span=(0, 3), match='123'>
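The difference between match, search, and the related fullmatch function is only where the match is allowed to sit. A short side-by-side sketch on one sample string:

```python
import re

s = '123hello456'

# match: the pattern must match starting at position 0.
at_start = re.match(r'\d+', s)
not_at_start = re.match(r'hello', s)

# search: the pattern may match anywhere in the string.
anywhere = re.search(r'hello', s)

# fullmatch: the pattern must cover the entire string.
whole_fail = re.fullmatch(r'\d+', s)
whole_ok = re.fullmatch(r'\d+', '123')

print(at_start.group())   # 123
print(not_at_start)       # None
print(anywhere.span())    # (3, 8)
print(whole_fail)         # None
print(whole_ok.group())   # 123
```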

3. Grouping and Capturing

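Parentheses group part of a pattern, and | separates alternatives. The groups method returns all captured groups as a tuple, and group(index) returns a single group, where index is the group's number (starting from 1) or its name.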

>>> pattern = r'(hello|world), (\d+)'  # matches "hello, 123" or "world, 456"
>>> string = 'hello, 123; world, 456'
>>> match = re.search(pattern, string)
>>> match.groups()
('hello', '123')
>>> match.group(1)
'hello'
>>> match.group(2)
'123'
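Since search stops at the first match, the second pair in the string above is never reported. The sketch below, reusing the same pattern and string, shows group(0), finditer for collecting every match, and a non-capturing group written as (?:...).

```python
import re

pattern = r'(hello|world), (\d+)'
string = 'hello, 123; world, 456'

first = re.search(pattern, string)
# group(0) is the entire match; group(1, 2) returns several groups at once.
print(first.group(0))     # hello, 123
print(first.group(1, 2))  # ('hello', '123')

# finditer visits every match, not only the first one.
all_groups = [m.groups() for m in re.finditer(pattern, string)]
print(all_groups)  # [('hello', '123'), ('world', '456')]

# (?:...) groups the alternatives without capturing them.
non_capturing = re.search(r'(?:hello|world), (\d+)', string)
print(non_capturing.groups())  # ('123',)
```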

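Parentheses already capture what they match. To give a group a name, start it with ?P<name> inside the parentheses. The groupdict method then returns the named groups as a dictionary.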

>>> pattern = r'(?P<fruit>\w+), (?P<count>\d+)'  # matches "apple, 3" and similar
>>> string = 'apple, 3; banana, 2'
>>> match = re.search(pattern, string)
>>> match.groupdict()
{'fruit': 'apple', 'count': '3'}
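Named groups pair naturally with finditer when every record in the string is needed. This sketch reuses the pattern above; the inventory dictionary is an illustrative name, and the count is converted to int as one possible design choice.

```python
import re

pattern = re.compile(r'(?P<fruit>\w+), (?P<count>\d+)')
string = 'apple, 3; banana, 2'

# Build a dictionary from every match, reading groups by name.
inventory = {m['fruit']: int(m['count']) for m in pattern.finditer(string)}
print(inventory)  # {'apple': 3, 'banana': 2}

# A named group can be read with group('name') or with m['name'].
first = pattern.search(string)
print(first.group('fruit'), first['count'])  # apple 3
```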

4. Replacing and Modifying

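re.sub replaces the parts of a string that match a pattern. The pattern is the first argument, the replacement is the second, and the string to process is the third. To keep part of the matched text, reference a captured group in the replacement by number, as in \1, or by name, as in \g<name>. The example below reorders the three captured groups.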

>>> pattern = r'(\d+)/(\d+)/(\d+)'  # match a date in the form "yyyy/mm/dd"
>>> string = 'today is 2022/01/01'
>>> re.sub(pattern, r'\3-\1-\2', string)
'today is 01-2022-01'
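The \g<name> form mentioned above needs named groups in the pattern. A minimal sketch follows, with a different sample date so the three fields are distinguishable; double is a hypothetical helper showing that the replacement can also be a function.

```python
import re

string = 'today is 2022/01/31'
named = r'(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})'

# \g<name> inserts the text captured by the named group.
iso = re.sub(named, r'\g<year>-\g<month>-\g<day>', string)
print(iso)  # today is 2022-01-31


def double(match):
    """Replacement function: receives a Match, returns the new text."""
    return str(int(match.group()) * 2)


doubled = re.sub(r'\d+', double, '3 apples and 10 pears')
print(doubled)  # 6 apples and 20 pears

# count limits how many matches are replaced.
limited = re.sub(r'\d+', '#', '1 2 3', count=1)
print(limited)  # # 2 3
```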

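Besides replacing, re.split splits a string at every position where the pattern matches, which helps when the delimiters vary. Adjacent delimiters produce empty strings in the result, as the output below shows.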

>>> pattern = r'[\s,;]'  # match one whitespace character, comma, or semicolon
>>> string = 'hello, world; python is easy'
>>> re.split(pattern, string)
['hello', '', 'world', '', 'python', 'is', 'easy']
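The empty strings in that result come from a comma or semicolon followed directly by a space. One common way to avoid them, sketched here on the same string, is to let the pattern consume a run of delimiters with +.

```python
import re

string = 'hello, world; python is easy'

# "+" treats a run of delimiters as a single split point.
tokens = re.split(r'[\s,;]+', string)
print(tokens)  # ['hello', 'world', 'python', 'is', 'easy']

# maxsplit stops after the given number of splits.
head_tail = re.split(r'[\s,;]+', string, maxsplit=1)
print(head_tail)  # ['hello', 'world; python is easy']

# A capturing group in the pattern keeps the delimiters in the result.
with_delims = re.split(r'([,;])\s*', 'a, b; c')
print(with_delims)  # ['a', ',', 'b', ';', 'c']
```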

5. Advanced Usage

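Beyond the basics, regular expressions in Python support more advanced features. For example, lookahead and lookbehind assertions check the text around a match without including it in the result. The third-party regex package, which is installed separately and is not part of re, offers further features.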

>>> pattern = r'(?<=hello )\w+'  # match a word that is preceded by "hello "
>>> string = 'hello world, hello python'
>>> re.findall(pattern, string)
['world', 'python']
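The lookbehind above is one of four assertion forms: (?<=...) and (?<!...) look backward, while (?=...) and (?!...) look forward. The sketch below exercises three of them on an invented sample string.

```python
import re

text = 'price: $30, cost: $45, 60 items'

# Positive lookbehind: digits that come right after a dollar sign.
dollars = re.findall(r'(?<=\$)\d+', text)

# Negative lookbehind: digits not preceded by "$" or by another digit.
plain = re.findall(r'(?<![\$\d])\d+', text)

# Positive lookahead: a word that is followed by a colon.
labels = re.findall(r'\w+(?=:)', text)

print(dollars)  # ['30', '45']
print(plain)    # ['60']
print(labels)   # ['price', 'cost']
```

In each case the asserted text ("$" or ":") is checked but left out of the returned match.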

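A pattern can also accept several alternatives at the same position. For example, a character class that contains both path separators splits a path the same way whether it uses the Windows backslash or the POSIX forward slash.

>>> pattern = r'[\\/]+'  # match one or more backslashes or forward slashes
>>> string = 'hello\\world/world\\python'
>>> re.split(pattern, string)
['hello', 'world', 'world', 'python']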
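For reference, the re module's conditional construct is written (?(id/name)yes|no): it tries the yes branch if the referenced group took part in the match and the no branch otherwise. The sketch below uses an address-like pattern that may or may not be wrapped in angle brackets; the candidate strings are made up for illustration.

```python
import re

# Group 1 optionally matches "<". The conditional (?(1)>|$) then requires
# ">" if group 1 matched, and the end of the string if it did not.
pattern = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = [
    '<user@host.com>',  # both brackets: accepted
    'user@host.com',    # no brackets: accepted
    '<user@host.com',   # opening bracket only: rejected
    'user@host.com>',   # closing bracket only: rejected
]

results = {}
for candidate in candidates:
    m = pattern.match(candidate)
    # Store the address part (group 2) or None when there is no match.
    results[candidate] = m.group(2) if m else None
    print(candidate, '->', results[candidate])
```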

6. Summary

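Regular expressions can considerably simplify text matching and extraction in Python. For complex string processing, they often give a more compact solution than hand-written string code. Knowing the syntax and the functions of the re module makes working with text data much more efficient.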

Original article by 小藍. If you repost it, please credit the source: https://www.506064.com/zh-tw/n/286152.html