Tokenizers in Python

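As we all know, there is a vast amount of text data on the internet. Most of us, however, may not know how to start working with it. We also know that handling the letters of our language is a tricky part of machine learning, because machines work with numbers, not letters.

So how do we manipulate and clean text data in order to build a model? To answer this question, let's explore some of the concepts that fall under Natural Language Processing (NLP).

Solving an NLP problem is a multi-stage process. First, we have to clean the unstructured text data before moving on to the modelling stage. Data cleaning involves a few key steps, as follows: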

Word tokenization
Part-of-speech prediction for each token
Text lemmatization
Stop-word identification and removal, and so on.

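In this tutorial, we will learn more about the very first of these steps, tokenization. We will see what tokenization is and why NLP needs it. We will also look at several ways to perform tokenization in Python.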

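Understanding Tokenization

Tokenization is the process of splitting a large body of text into smaller pieces called tokens. These tokens are very useful for finding patterns, and tokenization is regarded as the foundational step for stemming and lemmatization. Tokenization also helps replace sensitive data elements with non-sensitive ones.

NLP is used to build applications such as text classification, sentiment analysis, intelligent chatbots, and language translation. Understanding the patterns in text is therefore important for achieving these goals.

For now, consider stemming and lemmatization as the main steps for cleaning text data with NLP. Tasks such as text classification or spam filtering use NLP together with deep learning libraries such as [Tensorflow](https://www.javatpoint.com/tensorflow).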

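The Significance of Tokenization in NLP

To understand the significance of tokenization, let's take English as an example. Pick any sentence and keep it in mind while reading the next section.

Before processing natural language, we have to identify the words that make up a string of characters. Tokenization is therefore the most basic step in NLP.

This step is necessary because the actual meaning of a text can be interpreted by analysing the words present in it.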

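Now, let's take the following string as an example:

My name is Jamie Clark

After tokenizing the above string, we get the output shown below:

['My', 'name', 'is', 'Jamie', 'Clark']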

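This operation has several uses. We can use the tokenized form to:

Count the total number of words in the text.
Count word frequency, that is, the number of times a particular word occurs, and more.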
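The two uses listed above can be sketched in a few lines. This is a minimal illustration rather than a prescribed method: the sample sentence and the variable names are made up for the example, and plain whitespace splitting stands in for whichever tokenizer is actually used.

```python
from collections import Counter

# Illustrative sentence; not taken from the article.
sample_text = "the cat sat on the mat and the dog sat too"

# Whitespace tokenization, used here as a stand-in for any tokenizer.
tokens = sample_text.split()

# Use 1: total number of words.
total_words = len(tokens)

# Use 2: how often each word occurs.
word_frequency = Counter(tokens)

print(total_words)
print(word_frequency.most_common(2))
```

Counter is part of the standard library, so the sketch needs nothing beyond a list of tokens.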

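Now, let's look at several ways to perform tokenization in Python.

Some Methods to Perform Tokenization in Python

There are various ways to tokenize text data. Some of them are described below:

Tokenization Using the split() Function in Python

The split() function is one of the basic methods for splitting strings. It splits the given string at a specified separator and returns a list of strings. By default, split() breaks a string at whitespace, but we can specify a separator as needed.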
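As a small sketch of the difference between the default behaviour and an explicit separator, consider the snippet below. The sample string and variable names are illustrative assumptions, chosen so that it contains a double space and a newline.

```python
# Illustrative string with a double space and a newline.
sample = "one  two\nthree,four"

# No argument: split on any run of whitespace and drop empty strings.
by_whitespace = sample.split()

# Explicit single space: every space is a boundary, so the double
# space yields an empty string and the newline is left untouched.
by_space = sample.split(" ")

# Explicit comma separator.
by_comma = sample.split(",")

print(by_whitespace)
print(by_space)
print(by_comma)
```

For word tokenization, the argument-free form is usually the safer choice because it copes with tabs, newlines, and repeated spaces.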

Let's consider the following examples:

Example 1.1: Word tokenization using the split() function


my_text = """Let's play a game, Would You Rather! It's simple, you have to pick one or the other. Let's get started. Would you rather try Vanilla Ice Cream or Chocolate one? Would you rather be a bird or a bat? Would you rather explore space or the ocean? Would you rather live on Mars or on the Moon? Would you rather have many good friends or one very best friend? Isn't it easy though? When we have less choices, it's easier to decide. But what if the options would be complicated? I guess, you pretty much not understand my point, neither did I, at first place and that led me to a Bad Decision."""

print(my_text.split())

Output:

["Let's", 'play', 'a', 'game,', 'Would', 'You', 'Rather!', "It's", 'simple,', 'you', 'have', 'to', 'pick', 'one', 'or', 'the', 'other.', "Let's", 'get', 'started.', 'Would', 'you', 'rather', 'try', 'Vanilla', 'Ice', 'Cream', 'or', 'Chocolate', 'one?', 'Would', 'you', 'rather', 'be', 'a', 'bird', 'or', 'a', 'bat?', 'Would', 'you', 'rather', 'explore', 'space', 'or', 'the', 'ocean?', 'Would', 'you', 'rather', 'live', 'on', 'Mars', 'or', 'on', 'the', 'Moon?', 'Would', 'you', 'rather', 'have', 'many', 'good', 'friends', 'or', 'one', 'very', 'best', 'friend?', "Isn't", 'it', 'easy', 'though?', 'When', 'we', 'have', 'less', 'choices,', "it's", 'easier', 'to', 'decide.', 'But', 'what', 'if', 'the', 'options', 'would', 'be', 'complicated?', 'I', 'guess,', 'you', 'pretty', 'much', 'not', 'understand', 'my', 'point,', 'neither', 'did', 'I,', 'at', 'first', 'place', 'and', 'that', 'led', 'me', 'to', 'a', 'Bad', 'Decision.']

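Explanation:

In the example above, we used the split() method to break the paragraph into smaller pieces, that is, words. Similarly, we can break the paragraph into sentences by passing a separator as the argument to split(). Since a sentence usually ends with a period, we can use "." as the separator for splitting the string.

Let's see this in the following example:

Example 1.2: Sentence tokenization using the split() function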


my_text = """Dreams. Desires. Reality. There is a fine line between dream to become a desire and a desire to become a reality but expectations are way far then the reality. Nevertheless, we live in a world of mirrors, where we always want to reflect the best of us. We all see a dream, a dream of no wonder what; a dream that we want to be accomplished no matter how much efforts it needed but we try."""

print(my_text.split('. '))

Output:

['Dreams', 'Desires', 'Reality', 'There is a fine line between dream to become a desire and a desire to become a reality but expectations are way far then the reality', 'Nevertheless, we live in a world of mirrors, where we always want to reflect the best of us', 'We all see a dream, a dream of no wonder what; a dream that we want to be accomplished no matter how much efforts it needed but we try.']

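Explanation:

In the example above, we passed a period followed by a space ('. ') as the argument, so the paragraph is broken at each period. A major drawback of split() is that it accepts only one separator at a time, so we can split a string on a single delimiter only. In addition, split() does not treat punctuation marks as separate tokens.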
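Both limitations are easy to see on a short string. The snippet below is only an illustration; the sample text and variable names are assumptions made for this sketch.

```python
# Illustrative text mixing three sentence terminators.
sample = "Is it easy? It is! Let's start. Ready"

# split() accepts one separator, so only the '. ' boundary is found;
# the '?' and '!' boundaries are missed.
sentences = sample.split(". ")
print(sentences)

# A word-level split leaves punctuation attached to the neighbouring word.
words = sample.split()
print(words)
```

The first list has two items instead of four, and tokens such as 'easy?' and 'start.' still carry their punctuation.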

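Tokenization Using Regular Expressions in Python

Before moving on to the next method, let's briefly look at regular expressions. A regular expression, also known as RegEx, is a special sequence of characters that serves as a pattern for finding or matching other strings or sets of strings.

To work with regular expressions, Python provides a library named re. The re library is part of Python's standard library, so it needs no separate installation.

Let's consider the following examples of word tokenization and sentence tokenization using the RegEx method in Python.

Example 2.1: Word tokenization using the RegEx method in Python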


import re

my_text = """Joseph Arthur was a young businessman. He was one of the shareholders at Ryan Cloud's Start-Up with James Foster and George Wilson. The Start-Up took its flight in the mid-90s and became one of the biggest firms in the United States of America. The business was expanded in all major sectors of livelihood, starting from Personal Care to Transportation by the end of 2000. Joseph was used to be a good friend of Ryan."""

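# Find every run of word characters (letters, digits, underscore)
my_tokens = re.findall(r"[\w]+", my_text)
print(my_tokens)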

Output:

['Joseph', 'Arthur', 'was', 'a', 'young', 'businessman', 'He', 'was', 'one', 'of', 'the', 'shareholders', 'at', 'Ryan', 'Cloud', 's', 'Start', 'Up', 'with', 'James', 'Foster', 'and', 'George', 'Wilson', 'The', 'Start', 'Up', 'took', 'its', 'flight', 'in', 'the', 'mid', '90s', 'and', 'became', 'one', 'of', 'the', 'biggest', 'firms', 'in', 'the', 'United', 'States', 'of', 'America', 'The', 'business', 'was', 'expanded', 'in', 'all', 'major', 'sectors', 'of', 'livelihood', 'starting', 'from', 'Personal', 'Care', 'to', 'Transportation', 'by', 'the', 'end', 'of', '2000', 'Joseph', 'was', 'used', 'to', 'be', 'a', 'good', 'friend', 'of', 'Ryan']

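Explanation:

In the example above, we imported the re library in order to use its functions, and then called its findall() function. This function finds all substrings that match the pattern given as its argument and returns them in a list.

Here, "\w" denotes any word character, meaning letters, digits, and the underscore (_), and "+" means one or more occurrences. We therefore used the pattern [\w]+, so the program collects each run of word characters and stops at any other character. This is why the apostrophe and the hyphen break Cloud's into 'Cloud' and 's', and Start-Up into 'Start' and 'Up', in the output above.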
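To see how the character class shapes the tokens, the following sketch compares \w+ with two wider classes. The sample string and variable names are illustrative assumptions; which class is appropriate depends on whether contractions and hyphenated words should stay whole.

```python
import re

# Illustrative string; not the article's full paragraph.
sample = "Ryan Cloud's Start-Up grew in the mid-90s"

# Word characters only: apostrophes and hyphens act as boundaries.
words_only = re.findall(r"\w+", sample)

# Also allow apostrophes, so possessives stay in one token.
with_apostrophe = re.findall(r"[\w']+", sample)

# Also allow hyphens (placed last in the class so it is literal).
with_hyphen = re.findall(r"[\w'-]+", sample)

print(words_only)
print(with_apostrophe)
print(with_hyphen)
```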

Now, let's look at sentence tokenization using the RegEx method.

Example 2.2: Sentence tokenization using the RegEx method in Python


import re

my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""

my_sentences = re.compile('[.!?] ').split(my_text)
print(my_sentences)

Output:

['The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America', 'The product became so successful among the people that the production was increased', 'Two new plant sites were finalized, and the construction was started', "Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care", 'Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories', 'Many popular magazines were started publishing Critiques about him.']

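Explanation:

In the example above, we used the compile() function of the re library with the pattern '[.!?] ' (a period, exclamation mark, or question mark followed by a space), and then called the split() method to break the string at those separators. As a result, the program splits the text whenever it meets one of these characters followed by a space. The separators themselves are removed from the results; only the last sentence keeps its period, because no space follows it.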
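If the sentence terminators should be kept, one possible variation is to split on the whitespace that follows them by using a lookbehind assertion. This is a sketch of an alternative technique, not the method used in the example above; the sample text and variable names are assumptions.

```python
import re

# Illustrative text; not taken from the article.
sample = "Hello there. How are you? Fine!"

# Split at whitespace that follows '.', '!' or '?'. The lookbehind
# checks for the punctuation without consuming it, so each sentence
# keeps its terminator.
sentences = re.split(r"(?<=[.!?])\s+", sample)

print(sentences)
```

This remains a naive rule: abbreviations such as "Dr. Smith" would still be split in the wrong place.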

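Tokenization Using the Natural Language Toolkit in Python

The Natural Language Toolkit, also known as NLTK, is a library written in Python. NLTK is generally used for symbolic and statistical natural language processing and works well with text data.

NLTK is a third-party library that can be installed with the following command in a command shell or terminal: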


$ pip install --user -U nltk

To verify the installation, import the nltk library in a program and run it, as shown below:


import nltk

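If the program raises no error, the library has been installed successfully. Otherwise, repeat the installation steps above and consult the official documentation for more details.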
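As an alternative to a bare import, the check can be done without raising an error. The helper below is a hypothetical convenience written for this sketch (is_installed is not part of NLTK); it relies only on the standard library.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located, without importing it."""
    return importlib.util.find_spec(package_name) is not None


# Hypothetical helper in use: reports whether nltk is available.
print("nltk installed:", is_installed("nltk"))
```

Note that word_tokenize() and sent_tokenize() also depend on NLTK's Punkt tokenizer data, which is downloaded separately with nltk.download(). Depending on the NLTK version, the resource is named 'punkt' or 'punkt_tab'; if it is missing, the LookupError message names the resource to fetch.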

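NLTK has a module named tokenize. It offers two kinds of tokenization: word tokenization and sentence tokenization.

Word Tokenize: the word_tokenize() method splits a string into tokens, that is, words.
Sentence Tokenize: the sent_tokenize() method splits a string or paragraph into sentences.

Let's consider some examples based on these two methods:

Example 3.1: Word tokenization using the NLTK library in Python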


from nltk.tokenize import word_tokenize

my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""

print(word_tokenize(my_text))

Output:

['The', 'Advertisement', 'was', 'telecasted', 'nationwide', ',', 'and', 'the', 'product', 'was', 'sold', 'in', 'around', '30', 'states', 'of', 'America', '.', 'The', 'product', 'became', 'so', 'successful', 'among', 'the', 'people', 'that', 'the', 'production', 'was', 'increased', '.', 'Two', 'new', 'plant', 'sites', 'were', 'finalized', ',', 'and', 'the', 'construction', 'was', 'started', '.', 'Now', ',', 'The', 'Cloud', 'Enterprise', 'became', 'one', 'of', 'America', "'s", 'biggest', 'firms', 'and', 'the', 'mass', 'producer', 'in', 'all', 'major', 'sectors', ',', 'from', 'transportation', 'to', 'personal', 'care', '.', 'Director', 'of', 'The', 'Cloud', 'Enterprise', ',', 'Ryan', 'Cloud', ',', 'was', 'now', 'started', 'getting', 'interviewed', 'over', 'his', 'success', 'stories', '.', 'Many', 'popular', 'magazines', 'were', 'started', 'publishing', 'Critiques', 'about', 'him', '.']

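Explanation:

In the program above, we imported the word_tokenize() method from the tokenize module of the NLTK library. The method breaks the string into tokens and returns them in a list, which we then print. Unlike split(), it also returns periods and other punctuation marks as separate tokens.

Example 3.2: Sentence tokenization using the NLTK library in Python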


from nltk.tokenize import sent_tokenize

my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""

print(sent_tokenize(my_text))

Output:

['The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America.', 'The product became so successful among the people that the production was increased.', 'Two new plant sites were finalized, and the construction was started.', "Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care.", 'Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories.', 'Many popular magazines were started publishing Critiques about him.']

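Explanation:

In the program above, we imported the sent_tokenize() method from the tokenize module of the NLTK library. The method breaks the paragraph into sentences and returns them in a list, which we then print.

Conclusion

In this tutorial, we covered the concept of tokenization and its role in the overall Natural Language Processing (NLP) pipeline. We also discussed several ways to tokenize a given text or string in Python, covering both word tokenization and sentence tokenization.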

Original article by GKGOH. If you repost it, please credit the source: https://www.506064.com/zh-hk/n/130250.html