I. Text Cleaning and Preprocessing
In natural language processing, text cleaning and preprocessing are essential first steps. Raw text contains all kinds of noise and special symbols that would otherwise introduce interference and errors into downstream NLP tasks. Here are some common text cleaning and preprocessing techniques:
1. Removing non-text content, such as HTML tags
```python
import re

def remove_html_tags(text):
    """Remove HTML tags from a string."""
    clean = re.compile(r'<.*?>')
    return re.sub(clean, '', text)

text = '<h1>This is a headline.</h1><p>This is a paragraph.</p>'
print(remove_html_tags(text))
```
2. Removing special characters, such as punctuation and digits
```python
import string

def remove_punctuation(text):
    """Strip punctuation characters."""
    return text.translate(str.maketrans('', '', string.punctuation))

text = "Let's try to remove punctuation from this text!"
print(remove_punctuation(text))
```
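The heading above also mentions digits, which punctuation stripping alone leaves in place. A minimal sketch using a regular expression (the helper name `remove_digits` is my own, not from the original):

```python
import re

def remove_digits(text):
    """Remove all digit characters from text."""
    return re.sub(r'\d+', '', text)

print(remove_digits('There are 3 apples and 12 oranges.'))
```

Note that removing digits this way leaves behind the surrounding spaces, so a follow-up whitespace normalization pass is often useful.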
3. Word tokenization
```python
import nltk

# Requires the Punkt tokenizer data: run nltk.download('punkt') once beforehand
text = "This is a sentence for word tokenization."
tokens = nltk.word_tokenize(text)
print(tokens)
```
II. Text Feature Extraction
Text feature extraction is a key concept in natural language processing. To build an NLP model, we first need to convert text into a meaningful feature representation. Here are some common text feature extraction techniques:
1. Bag-of-words model
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
print(X.toarray())
```
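One point worth noting about the bag-of-words model: the vocabulary is fixed when `fit` runs, so `transform()` on new text silently drops any out-of-vocabulary words. A short sketch on the same toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# transform() reuses the fitted vocabulary; words like "entirely" or
# "topic" that never appeared during fit are simply ignored.
new_vec = vectorizer.transform(['An entirely new document about the first topic.'])
print(new_vec.toarray())
```

Only "document", "first", and "the" from the new sentence appear in the fitted vocabulary, so only those three columns are non-zero.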
2. TF-IDF model
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
print(X.toarray())
```
III. Text Classification
Text classification is an important application of natural language processing. It requires building a classifier that automatically assigns text to predefined categories. Here are some common text classification techniques:
1. Naive Bayes classifier
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Train on two classes: the first two documents are class 1, the rest class 2
clf = MultinomialNB()
clf.fit(X, [1, 1, 2, 2])

test_text = "Is this the third document?"
test_vec = vectorizer.transform([test_text])
print(clf.predict(test_vec))
```
2. Support vector machine classifier
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Train on two classes: the first two documents are class 1, the rest class 2
clf = LinearSVC()
clf.fit(X, [1, 1, 2, 2])

test_text = "Is this the third document?"
test_vec = vectorizer.transform([test_text])
print(clf.predict(test_vec))
```
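Both classifiers above repeat the same two steps, vectorize then fit. scikit-learn's `make_pipeline` can bundle them into a single estimator, which guarantees the identical transform is applied at prediction time. A sketch reusing the toy corpus and labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
labels = [1, 1, 2, 2]

# The pipeline fits the vectorizer and classifier in one call and
# re-applies the fitted vectorizer automatically inside predict().
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(corpus, labels)
print(model.predict(['Is this the third document?']))
```

Keeping the vectorizer inside the pipeline also avoids the common mistake of re-fitting it on test data, which would silently change the feature space.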
Original article by 小藍. If reposting, please credit the source: https://www.506064.com/zh-tw/n/243684.html