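This article introduces the concept of topic modeling and the principles of the LDA algorithm, shows how to build a basic LDA topic model in Python, and visualizes the topics with pyLDAvis.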

Image credit: Kamil Polak

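Introduction

Topic modeling extracts features from the terms in documents and uses mathematical structures and frameworks, such as matrix factorization and singular value decomposition (SVD), to generate clusters or groups of terms that are distinguishable from one another. These word clusters in turn form topics or concepts.

Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data.

These concepts can be used to interpret the themes of a corpus, and to establish semantic links between words that frequently occur together across documents.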

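Topic modeling can be applied to:

Discovering hidden topics in a dataset;

Classifying documents into the discovered topics;

Using that classification to organize, summarize, or search documents.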

Several frameworks and algorithms can be used to build topic models:

Latent semantic indexing (LSI)

Latent Dirichlet Allocation (LDA)

Non-negative matrix factorization (NMF)

In this article, we focus on LDA topic modeling in Python. Specifically, we cover:

What Latent Dirichlet Allocation (LDA) is;

How the LDA algorithm works;

How to build an LDA topic model in Python.

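What is Latent Dirichlet Allocation (LDA)?

Latent Dirichlet Allocation (LDA) is a generative probabilistic model. It assumes that each document is a mixture of topics, much as the probabilistic latent semantic indexing (pLSI) model does.

In short, the idea behind LDA is that each document can be described by a distribution over topics, and each topic can be described by a distribution over words.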
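To make the "documents are mixtures of topics, topics are mixtures of words" idea concrete, here is a minimal sketch of the generative story that LDA assumes. The vocabulary, the two hand-written topics, and the helper names (`sample_dirichlet`, `generate_document`) are illustrative assumptions, not part of Gensim or of the original tutorial.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_alpha, length, rng):
    """Generate one toy document following the generative story assumed by LDA."""
    # 1. Draw this document's own mixture over topics.
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position from the document's mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

# Hypothetical toy vocabulary and two hand-written topics (sports, politics).
vocab = ["goal", "team", "match", "vote", "party", "election"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # topic 0: sports words only
    [0.0, 0.0, 0.0, 0.3, 0.3, 0.4],  # topic 1: politics words only
]

rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, [0.5, 0.5], 8, rng)
print("topic mixture:", theta)
print("document:", words)
```

Training an LDA model runs this story in reverse: given only the words, it tries to recover plausible topic mixtures and topic-word distributions.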

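How does the LDA algorithm work?

LDA consists of two parts:

The words that belong to a document, which we already know;

The words that belong to a topic, or the probability that a word belongs to a topic, which we need to compute.

Note: LDA does not care about the order of words in a document. Documents are usually represented with bag-of-words features.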
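As a small illustration of why word order does not matter, the sketch below builds a bag-of-words representation with the standard library. The helper name `bag_of_words` and the two sample sentences are made up for this example.

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each distinct token to the number of times it occurs."""
    return Counter(tokens)

doc_a = "the cat chased the dog".split()
doc_b = "the dog chased the cat".split()

# Same words, different order: the two representations are identical.
print(bag_of_words(doc_a))
print(bag_of_words(doc_a) == bag_of_words(doc_b))
```

Two documents with the same word counts are indistinguishable to LDA, regardless of how the words are arranged.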

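The following steps give a highly simplified explanation of how the LDA algorithm works:

1. For each document, randomly assign each word to one of K topics (K is chosen in advance);

2. For each document D, go through each word W and compute:

P(T | D): the proportion of words in document D that are currently assigned to topic T;

P(W | T): the proportion of assignments to topic T, across all documents, that come from word W.

3. Taking all other words and their topic assignments into account, reassign word W to a topic T with probability P(T | D) × P(W | T).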
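The three steps above can be sketched as a toy collapsed Gibbs sampler. This is only an illustration under stated assumptions: the function name `gibbs_lda`, the smoothing constants `alpha` and `beta`, and the tiny corpus are hypothetical, and Gensim's `LdaModel` uses a different inference method (online variational Bayes) rather than this sampler.

```python
import random
from collections import defaultdict

def gibbs_lda(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # Step 1: randomly assign every word occurrence to one of the K topics.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables derived from the assignments.
    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d on topic t
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # word w assigned to topic t
    topic_total = [0] * num_topics                              # all words on topic t
    for d, doc in enumerate(docs):
        for word, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][word] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Set the current word aside so only the *other* words count.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1

                # Step 2: smoothed P(T | D) * P(W | T) for every topic T.
                # (The normalizer of P(T | D) is the same for all T, so it is omitted.)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]

                # Step 3: reassign the word to a topic drawn with those weights.
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    return assignments, doc_topic, topic_word

# Hypothetical toy corpus with two obvious themes.
docs = [
    ["goal", "team", "match", "team"],
    ["vote", "party", "election", "vote"],
    ["team", "goal", "match"],
    ["election", "party", "vote"],
]
assignments, doc_topic, topic_word = gibbs_lda(docs, num_topics=2)
print(doc_topic)
```

The smoothing terms keep every topic reachable even when its current count is zero; without them, a topic that loses all its words in a document could never be chosen again.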

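The LDA topic model is illustrated below.

Image credit: Wiki

The figure below shows how each parameter connects back to the text documents and terms. Suppose we have M documents, each containing N words, and we want to generate K topics in total.

The black box in the figure represents the core algorithm, which uses the parameters above to extract K topics from the documents.

Image credit: Christine Doig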

How to build an LDA topic model in Python

We will use the Latent Dirichlet Allocation (LDA) implementation from the Gensim package.

First, we need to import the packages. The core packages are re, gensim, spacy, and pyLDAvis. We also use matplotlib, numpy, and pandas for data handling and visualization.

import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spaCy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this; newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import matplotlib.pyplot as plt
%matplotlib inline
# (the line above is a Jupyter/IPython magic; remove it when running as a plain script)

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

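Words such as am/is/are/of/a/the/but carry no information about the "topic". As a preprocessing step, we can therefore remove them from the documents.

To do this, we import the stop words from NLTK. We can also extend the original stop word list with a few extra words.

# NLTK stop words
# (run nltk.download('stopwords') once if the corpus is not installed yet)
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])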

In this tutorial, we use the 20 Newsgroups dataset, which contains roughly 11k newsgroup posts on 20 different topics. It is available as newsgroups.json.

# Import dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())
df.head()

Remove email addresses and newline characters

Before we start topic modeling, the dataset needs cleaning. First, we remove email addresses, extra whitespace, and newline characters.

# Convert to list
data = df.content.values.tolist()

# Remove email addresses (any run of non-space characters around an "@")
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Collapse newlines and repeated whitespace into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub("'", "", sent) for sent in data]

pprint(data[:1])
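The patterns in this step depend on the `\S` (non-whitespace) and `\s` (whitespace) character classes, so the backslashes matter; without them the patterns would match the literal letters "S" and "s". The sketch below applies the same three substitutions to a made-up string so the effect of each one is visible. The helper name `clean_text` and the sample message are assumptions for illustration only.

```python
import re

def clean_text(text):
    """Apply the three cleaning substitutions to a single string."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop email addresses
    text = re.sub(r'\s+', ' ', text)        # collapse newlines and repeated spaces
    text = re.sub("'", "", text)            # drop single quotes
    return text

sample = "From: alice@example.com\nSubject: It's   a test\n\nHello   world"
cleaned = clean_text(sample)
print(cleaned)  # From: Subject: Its a test Hello world
```

Raw string literals (the `r'...'` prefix) keep Python from interpreting the backslashes before the regex engine sees them.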

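Tokenize words and clean up the text

Let's tokenize each sentence into a list of words, dropping punctuation and unnecessary characters.

def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes, which drops punctuation;
        # deacc=True additionally strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

print(data_words[:1])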

Create the bigram and trigram models

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)  # higher threshold, fewer phrases
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

Remove stop words, make bigrams, and lemmatize

In this step, we define functions to remove stop words, make bigrams, and lemmatize, and then call them in sequence.

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))  # relies on the global `nlp` pipeline loaded below
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out


# Remove stop words
data_words_nostops = remove_stopwords(data_words)

# Form bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize the spaCy English model, disabling the parser and NER for efficiency
# python3 -m spacy download en_core_web_sm
# (older spaCy 2.x releases also accepted the shortcut name 'en')
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

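Create the dictionary and corpus needed for topic modeling

The LDA model takes two main inputs: a dictionary (id2word) and a corpus. Gensim assigns a unique id to each word in the documents, and the corpus then stores each document as a list of (word id, word frequency) pairs.

# Create dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create corpus
texts = data_lemmatized

# Term document frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])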
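To show what the (word id, word frequency) pairs look like, here is a dependency-free sketch that imitates the idea behind `corpora.Dictionary` and `doc2bow`. The helpers `build_dictionary` and `doc_to_bow` are hypothetical stand-ins, not Gensim functions, and the ids Gensim assigns to the same words may differ.

```python
from collections import Counter

def build_dictionary(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc_to_bow(text, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())

texts = [["car", "engine", "car"], ["engine", "oil"]]
token2id = build_dictionary(texts)
toy_corpus = [doc_to_bow(text, token2id) for text in texts]

print(token2id)    # {'car': 0, 'engine': 1, 'oil': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

In the first toy document, `(0, 2)` means that the word with id 0 ("car") occurs twice.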

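Build the topic model

We are now ready for the core step: topic modeling with LDA. We will build an LDA model with 20 topics, where each topic is a combination of keywords and each keyword carries a certain weight in the topic.

Some of the parameters are explained below:

num_topics: the number of topics, which must be defined in advance;

chunksize: the number of documents to use in each training chunk;

alpha: a hyperparameter (the prior on the per-document topic distributions) that affects how sparse the topic mixtures are;

passes: the total number of training passes over the corpus.

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)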

View the topics in the LDA model

We can inspect the keywords of each topic and the weight (importance) of each keyword.

# Print the keywords of the topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

Compute model perplexity and coherence score

Model perplexity measures how well a probability distribution or probability model predicts a sample. Topic coherence scores a single topic by measuring the degree of semantic similarity between the high-scoring words in that topic.

In short, they provide a convenient way to judge how good a given topic model is.

# Compute perplexity
# log_perplexity returns a per-word likelihood bound; a lower perplexity means a better model
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
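The value printed for perplexity is usually negative, which can be confusing. To the best of my understanding, `log_perplexity` returns a per-word likelihood bound in base 2, and Gensim's own log output reports the perplexity estimate as 2 raised to the negative of that bound. The sketch below shows only that arithmetic; the helper name `bound_to_perplexity` and the sample bound are hypothetical, and the conversion should be checked against the Gensim version in use.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a base-2 per-word likelihood bound into a perplexity value."""
    return 2 ** (-per_word_bound)

# Sanity check: a model that spreads probability evenly over 8 words assigns
# each word log2(1/8) = -3, so its perplexity is 8.
print(bound_to_perplexity(-3.0))  # 8.0
```

Read this way, a bound closer to zero corresponds to a lower perplexity, that is, to a model that is less "surprised" by the corpus.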