I. Text Cleaning and Preprocessing
In natural language processing (NLP), text cleaning and preprocessing are essential first steps. Raw text contains noise such as markup and special symbols, which introduces interference and errors into downstream processing. Below are some common cleaning and preprocessing techniques:
1. Remove non-text content, such as HTML tags

import re

def remove_html_tags(text):
    """Strip HTML tags with a non-greedy regex."""
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

text = '<h1>This is a headline.</h1><p>This is a paragraph.</p>'
print(remove_html_tags(text))  # This is a headline.This is a paragraph.
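The non-greedy regex above is a quick solution, but it cannot decode HTML entities and can misfire on attribute values that contain `>`. A more robust sketch using only the standard library's `html.parser` (no third-party dependencies assumed):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only the text content, ignoring tags and comments."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    return ''.join(parser.parts)

print(strip_tags('<h1>This is a headline.</h1><p>This is a paragraph.</p>'))
# This is a headline.This is a paragraph.
```

Because `convert_charrefs` is enabled by default, entities such as `&amp;` are decoded automatically.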
2. Remove special characters, such as punctuation

import string

def remove_punctuation(text):
    """Strip ASCII punctuation characters."""
    return text.translate(str.maketrans('', '', string.punctuation))

text = "Let's try to remove punctuation from this text!"
print(remove_punctuation(text))  # Lets try to remove punctuation from this text
3. Word tokenization

import nltk

# The Punkt tokenizer data must be available; run nltk.download('punkt') once.
text = "This is a sentence for word tokenization."
tokens = nltk.word_tokenize(text)
print(tokens)  # ['This', 'is', 'a', 'sentence', 'for', 'word', 'tokenization', '.']
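`word_tokenize` depends on NLTK's Punkt model being downloaded. As a rough, dependency-free fallback (a sketch, not a replacement for Punkt's handling of contractions and abbreviations), a regex tokenizer can be written with the standard library:

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("This is a sentence for word tokenization."))
# ['This', 'is', 'a', 'sentence', 'for', 'word', 'tokenization', '.']
```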
II. Text Feature Extraction

Feature extraction is a core concept in NLP: before building a model, text must be converted into a meaningful numerical representation. Below are some common feature-extraction techniques:
1. Bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(X.toarray())
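To make the output above concrete, the same bag-of-words matrix can be reproduced by hand with `collections.Counter`; this sketch mimics `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters):

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

def tokenize(doc):
    # Mimic CountVectorizer's default token pattern: lowercase, 2+ word chars.
    return re.findall(r'\b\w\w+\b', doc.lower())

# Sorted vocabulary over the whole corpus, then per-document term counts.
vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})
matrix = [[Counter(tokenize(doc))[term] for term in vocab] for doc in corpus]

print(vocab)
print(matrix)
```

Each row lists, per document, the count of every vocabulary term — exactly what `X.toarray()` returns.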
2. TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(X.toarray())
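For intuition about the numbers printed above: with scikit-learn's defaults (`smooth_idf=True`), the idf is ln((1 + n) / (1 + df)) + 1, and each document row is then L2-normalized. A hand computation for the term 'second', which occurs twice in one of the four documents:

```python
import math

n = 4    # documents in the corpus above
df = 1   # 'second' occurs in one document
tf = 2   # ...and twice within that document

idf = math.log((1 + n) / (1 + df)) + 1   # smoothed idf, scikit-learn's default
raw_weight = tf * idf                    # before L2 normalization of the row
print(round(idf, 4), round(raw_weight, 4))  # 1.9163 3.8326
```

The value `X.toarray()` shows for 'second' is this raw weight divided by the L2 norm of that document's full tf-idf row.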
III. Text Classification

Text classification is an important NLP application: a classifier is trained to automatically assign text to predefined categories. Below are two common approaches:
1. Naive Bayes classifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
clf = MultinomialNB()
clf.fit(X, [1, 1, 2, 2])
test_text = "Is this the third document?"
test_vec = vectorizer.transform([test_text])
print(clf.predict(test_vec))
2. Support vector machine (SVM) classifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
clf = LinearSVC()
clf.fit(X, [1, 1, 2, 2])
test_text = "Is this the third document?"
test_vec = vectorizer.transform([test_text])
print(clf.predict(test_vec))
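In practice, the vectorizer and the classifier are usually chained in a scikit-learn `Pipeline`, so the same fitted vocabulary is applied at training and prediction time without manual `transform` calls. A minimal sketch with the same toy corpus (the labels remain illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
labels = [1, 1, 2, 2]

# The pipeline fits the vectorizer and classifier together...
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svm', LinearSVC()),
])
model.fit(corpus, labels)

# ...and raw strings can be passed straight to predict().
print(model.predict(['Is this the third document?']))
```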
Original article by 小藍. If you reproduce it, please credit the source: https://www.506064.com/zh-hant/n/243684.html