Python Modules: Sentiment Analysis for Natural Language Processing (NLP)

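Python is a general-purpose programming language and one of the most widely used languages in natural language processing (NLP). Sentiment analysis is an important NLP task: it analyzes, classifies, and evaluates text to determine whether the sentiment it expresses is positive, negative, or neutral. It is widely applied in social media monitoring, marketing, public opinion analysis, and similar fields.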

I. Installing the NLTK Python Module

NLTK (Natural Language Toolkit) is one of the most popular NLP libraries for Python. To use it for sentiment analysis, install it first with pip:

pip install nltk

Once it is installed, import the package in Python:

import nltk
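
Installing the package does not install its corpora and models, so the later examples also need a one-time data download. The sketch below is one way to do that; the helper name download_resources and the REQUIRED_RESOURCES list are illustrative assumptions, and recent NLTK releases may ask for "punkt_tab" in place of "punkt".

```python
# Resource identifiers understood by NLTK's downloader.
# Assumption: these three cover the corpus, stop-word list, and tokenizer
# used in this article; newer NLTK versions may need "punkt_tab" as well.
REQUIRED_RESOURCES = ["movie_reviews", "stopwords", "punkt"]


def download_resources(resources=REQUIRED_RESOURCES):
    """Download each listed NLTK resource (hypothetical helper)."""
    import nltk  # imported here so this module loads even before NLTK is installed

    for name in resources:
        nltk.download(name, quiet=True)


if __name__ == "__main__":
    download_resources()
```

Run it once per environment; NLTK caches the data locally, so later calls return quickly.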

II. Loading the Sentiment Analysis Dataset

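Sentiment analysis requires a labeled dataset for training and testing. NLTK provides the movie_reviews corpus, which contains 2,000 movie reviews, each labeled as either positive ("pos") or negative ("neg").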

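The following code loads the movie review data from the NLTK corpora (the corpus has to be downloaded once beforehand, for example with nltk.download("movie_reviews")):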

from nltk.corpus import movie_reviews
movie_reviews.categories()

The output should be ['neg', 'pos'], meaning the dataset has two categories: negative reviews (neg) and positive reviews (pos).

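III. Data Preparation and Cleaning

Before running sentiment analysis, the text needs to go through a series of processing and cleaning steps:

1. Remove punctuation, digits, and other special characters.

2. Convert all characters to lowercase.

3. Split the text into words.

4. Filter out stop words.

The following code performs this preprocessing: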

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def clean_text(text):
    # Remove punctuation and digits
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # Convert all characters to lowercase
    text = text.lower()
    # Tokenize into words
    words = word_tokenize(text)
    # Filter out stop words
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]
    # Return the cleaned list of words
    return words
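
To see the four cleaning steps without any downloaded NLTK data, here is a minimal standard-library sketch. It is not the article's function: clean_text_simple and DEMO_STOP_WORDS are hypothetical names, whitespace splitting stands in for word_tokenize, and the stop-word list is deliberately tiny.

```python
import string

# Hypothetical, deliberately small stop-word list for illustration only;
# NLTK's English list is much longer.
DEMO_STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "it", "this"}


def clean_text_simple(text, stop_words=DEMO_STOP_WORDS):
    # 1. Remove punctuation and digits
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # 2. Convert to lowercase
    text = text.lower()
    # 3. Split into words (whitespace split instead of a real tokenizer)
    words = text.split()
    # 4. Filter out stop words
    return [word for word in words if word not in stop_words]


print(clean_text_simple("This movie was GREAT, 10/10!"))  # ['movie', 'great']
```

Note that stripping punctuation before tokenizing turns contractions such as "don't" into "dont", which a stop-word list will usually not match.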

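IV. Feature Extraction

To run sentiment analysis, the text must be represented as vectors or numbers. A common approach is to use a feature extractor that converts each text into a numeric vector. Here we use the bag-of-words model to build the feature vectors.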
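
As a rough illustration of the bag-of-words idea, the sketch below shows two common representations: the word-presence dictionary that NLTK classifiers accept as a feature set, and a plain count vector over a fixed vocabulary. The function names and the sample vocabulary are assumptions made for this example.

```python
from collections import Counter


def presence_features(words):
    """NLTK-style feature set: every word that occurs maps to True."""
    return {word: True for word in words}


def count_vector(words, vocabulary):
    """Classic bag-of-words vector: one count per vocabulary entry."""
    counts = Counter(words)
    return [counts[term] for term in vocabulary]


words = ["great", "plot", "great", "cast"]
vocabulary = ["cast", "great", "plot", "boring"]  # hypothetical vocabulary

print(presence_features(words))         # {'great': True, 'plot': True, 'cast': True}
print(count_vector(words, vocabulary))  # [1, 2, 1, 0]
```

Both forms discard word order; the presence dictionary also discards how often a word occurs.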

The following code creates a bag-of-words feature extractor:

import random

from nltk import FreqDist
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
from sklearn.metrics import precision_recall_fscore_support as score

class BagOfWords:
    def __init__(self, all_words):
        self.all_words = all_words

    # Feature extractor: map every word in a review to True
    def bag_of_words(self, cleaned_words):
        return {word: True for word in cleaned_words}

    # Flatten a list of tokenized reviews into a single word list
    def all_words_cleaned(self, reviews):
        cleaned_words = []
        for review in reviews:
            for word in review:
                cleaned_words.append(word)
        return cleaned_words

    # Word frequency distribution
    def frequencies(self, cleaned_words):
        freq_dist = FreqDist(cleaned_words)
        print(freq_dist)

    # Train and evaluate the classifier
    def train_test(self, cleaned_data):
        # Feature sets: cleaned_data[0] holds positive reviews, cleaned_data[1] negative ones
        positive_features = [(self.bag_of_words(review), "Positive") for review in cleaned_data[0]]
        negative_features = [(self.bag_of_words(review), "Negative") for review in cleaned_data[1]]
        features = positive_features + negative_features

        # Shuffle so both classes appear in each split, then hold out 20% for testing
        random.seed(42)
        random.shuffle(features)
        split = int(len(features) * 0.8)
        train_set = features[:split]
        test_set = features[split:]

        # Build the Naive Bayes classifier
        classifier = NaiveBayesClassifier.train(train_set)

        # Accuracy on the test set
        print("Test accuracy:", nltk_accuracy(classifier, test_set))

        # Predict on the test set and compute precision, recall, and F-score
        y_true = [category for _, category in test_set]
        y_pred = [classifier.classify(feats) for feats, _ in test_set]
        precision, recall, fscore, support = score(y_true, y_pred, average="weighted")
        print("Precision: ", precision)
        print("Recall: ", recall)
        print("F-score: ", fscore)
        
# Load the movie review dataset
positive_reviews = movie_reviews.fileids("pos")
negative_reviews = movie_reviews.fileids("neg")
print(f"num of pos reviews: {len(positive_reviews)}")
print(f"num of neg reviews: {len(negative_reviews)}")

# Load and preprocess the dataset: reviews[0] is positive, reviews[1] is negative
reviews = [
    [clean_text(movie_reviews.raw(fileids=[fileid])) for fileid in positive_reviews],
    [clean_text(movie_reviews.raw(fileids=[fileid])) for fileid in negative_reviews],
]

# Create the feature extractor first, then collect the words of all reviews
# (the object must exist before its methods can be called)
bow = BagOfWords(all_words=[])
bow.all_words = bow.all_words_cleaned(reviews[0] + reviews[1])
bow.frequencies(bow.all_words)
bow.train_test(reviews)
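
Because the feature list is built with all positive reviews first and all negative reviews after them, slicing it in order can leave one class out of the test set entirely. A shuffled split avoids that. The sketch below uses only the standard library; shuffled_split, the 80/20 ratio, and the seed are assumptions chosen for illustration.

```python
import random


def shuffled_split(samples, train_fraction=0.8, seed=42):
    """Return (train, test) after shuffling a copy of the samples."""
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical labeled samples: five positive followed by five negative
samples = [(f"doc{i}", "Positive") for i in range(5)]
samples += [(f"doc{i}", "Negative") for i in range(5, 10)]

train, test = shuffled_split(samples)
print(len(train), len(test))  # 8 2
```

For small or imbalanced datasets, a stratified split that keeps the class ratio equal in both parts is a common refinement.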

V. Results and Conclusion

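Running the code above prints the classification accuracy on the test set, together with the precision, recall, and F-score. The classification accuracy reported for this example is 80.73% (the exact figure depends on how the data is split into training and test sets), which indicates that a Naive Bayes classifier is reasonably effective for sentiment analysis.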
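
To make the three metrics concrete, the following sketch computes precision, recall, and F1 by hand for one class. It is a simplified illustration rather than what scikit-learn does with average="weighted", which averages the per-class scores weighted by class size; the function name and sample labels are made up for the example.

```python
def precision_recall_f1(y_true, y_pred, positive_label="Positive"):
    """Compute precision, recall, and F1 for a single target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels: one true positive, one false positive, one false negative
y_true = ["Positive", "Positive", "Negative", "Negative"]
y_pred = ["Positive", "Negative", "Negative", "Positive"]

print(precision_recall_f1(y_true, y_pred))  # (0.5, 0.5, 0.5)
```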

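In this article we discussed how to perform sentiment analysis with Python's NLTK module. We walked through loading the dataset, cleaning and preprocessing the data, extracting features, and building a classifier. Sentiment analysis has applications in many fields and is an important part of NLP that helps us better understand and analyze natural language.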

Original article by KEZY. If you repost it, please credit the source: https://www.506064.com/zh-hant/n/137525.html