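A Detailed Walkthrough of a word2vec Implementation

I. A word2vec code generator

Before walking through a word2vec implementation, let us look at a generator that can produce word2vec code, which makes it easier to generate, modify, and debug such code quickly. The basic idea is to generate code from the formulas and parameters in the original word2vec paper while exposing a few adjustable parameters and options.

Here is a simple example: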

<html>
 <head>
   <title>word2vec代碼生成器</title>
 </head>
 <body>
   <form>
     <label>Embedding size:</label>
     <input type="text" name="size" value="300"><br>
     <label>Window size:</label>
     <input type="text" name="window" value="5"><br>
     <label>Negative samples:</label>
     <input type="text" name="neg" value="5"><br>
     <label>Epochs:</label>
     <input type="text" name="epochs" value="5"><br>
     <input type="submit" value="Generate">
   </form>
   <pre><code>
     # Here goes your generated code
   </code></pre>
 </body>
</html>

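The example above is the HTML front end of a simple generator; the Python side that would produce the code is not shown. The intent is that you change the parameters in the input fields and click "Generate", and the generated word2vec code is then displayed on the page inside the <pre><code> element. As written, the form has no action or script attached, so only the placeholder comment appears.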
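To make the idea concrete, here is a minimal sketch of what the Python side of such a generator might look like. It is an assumption, not the original generator: the names `CODE_TEMPLATE` and `generate_code` are hypothetical, and the rendered snippet simply calls a `train_model` function with the four values from the form.

```python
from string import Template

# Hypothetical template: the generated snippet calls a train_model function
# and fills in the four values collected by the HTML form.
CODE_TEMPLATE = Template(
    "embeddings, word_dict = train_model(\n"
    "    word_dict, matrix,\n"
    "    embedding_size=$size,\n"
    "    window_size=$window,\n"
    "    num_neg_samples=$neg,\n"
    "    num_epochs=$epochs,\n"
    ")\n"
)

def generate_code(size=300, window=5, neg=5, epochs=5):
    """Validate the form values and render them into the code template."""
    raw = {"size": size, "window": window, "neg": neg, "epochs": epochs}
    clean = {}
    for name, value in raw.items():
        number = int(value)  # form fields arrive as strings
        if number <= 0:
            raise ValueError(f"{name} must be a positive integer, got {number}")
        clean[name] = number
    return CODE_TEMPLATE.substitute(clean)

print(generate_code(size="100", window="3"))
```

A web framework would call a function like this from the form handler and place the returned string inside the page's code element.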

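II. Implementing word2vec

The word2vec implementation consists of the following main parts:

1. Data preprocessing

In the preprocessing stage, we tokenize the raw text, build a vocabulary, and convert the text into a numeric matrix that the neural network can be trained on.

Below is a simple preprocessing example: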

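import numpy as np
import jieba

def preprocess_text(text_path, stopwords_path):
    # Load the raw text and the stop-word list (one stop word per line)
    with open(text_path, 'r', encoding='utf-8') as f:
        text = f.read()
    with open(stopwords_path, 'r', encoding='utf-8') as f:
        # A set gives constant-time membership tests during filtering
        stopwords = {line.strip() for line in f}

    # Segment the text with jieba; drop stop words and whitespace-only tokens
    words = [word for word in jieba.cut(text)
             if word.strip() and word not in stopwords]

    # Build the vocabulary; sorting makes the word -> index mapping reproducible
    vocab = sorted(set(words))
    word_dict = {word: idx for idx, word in enumerate(vocab)}

    # Convert the text to a one-hot matrix: row i encodes the i-th token.
    # This needs len(words) * len(vocab) cells, so it only suits small corpora.
    matrix = np.zeros((len(words), len(vocab)), dtype=np.float32)
    for i, word in enumerate(words):
        matrix[i, word_dict[word]] = 1

    # Return the word dictionary and the one-hot matrix
    return word_dict, matrix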

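The code above uses the jieba library for Chinese word segmentation, builds a vocabulary, and converts the text into a one-hot matrix with one row per token.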
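A one-hot matrix grows with the number of tokens times the vocabulary size, so for larger corpora it is common to keep only an array of integer ids. The following is a minimal sketch of that alternative, not part of the original code: `build_vocab`, `encode`, and the `min_count` threshold are hypothetical names, and the token list stands in for the output of a tokenizer.

```python
from collections import Counter
import numpy as np

def build_vocab(words, min_count=1):
    """Map each word seen at least min_count times to an integer id (most frequent first)."""
    counts = Counter(words)
    # Sort by descending frequency, then alphabetically, so ids are reproducible
    kept = sorted((w for w, c in counts.items() if c >= min_count),
                  key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(kept)}

def encode(words, word_dict):
    """Turn a token list into an id array, skipping out-of-vocabulary tokens."""
    return np.array([word_dict[w] for w in words if w in word_dict], dtype=np.int64)

# Toy, already-tokenized input
tokens = ["i", "like", "language", "models", "i", "like", "embeddings"]
word_dict = build_vocab(tokens)
ids = encode(tokens, word_dict)
print(word_dict)
print(ids)
```

If a one-hot matrix is still needed for a small corpus, `np.eye(len(word_dict))[ids]` rebuilds it from the id array.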

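2. The Skip-gram model

Skip-gram is one of the most commonly used word2vec models (the other is CBOW). Its basic idea is to predict the surrounding words from a center word. During training, a neural network is used to maximize the probability of predicting the correct context words.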
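The "center word predicts surrounding words" idea boils down to a list of (center, context) training pairs. A minimal sketch of how those pairs can be enumerated is shown here; the function name `skipgram_pairs` is hypothetical and the sentence is a toy example.

```python
def skipgram_pairs(token_ids, window_size=2):
    """Return every (center, context) pair within window_size positions of each token."""
    pairs = []
    for i, center in enumerate(token_ids):
        lo = max(0, i - window_size)
        hi = min(len(token_ids), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, token_ids[j]))
    return pairs

sentence = ["the", "cat", "sat", "on", "mat"]
for center, context in skipgram_pairs(sentence, window_size=1):
    print(center, "->", context)
```

With `window_size=1` each interior word yields two pairs and each boundary word yields one, so this five-word sentence produces eight pairs.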

Below is a simple Skip-gram model example:

# Uses the TensorFlow 1.x graph API. On TensorFlow 2.x the same calls are
# available through compat.v1 once v2 behavior is disabled.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

class SkipGramModel:
    def __init__(self, vocab_size, embedding_size, num_sampled=5):
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.num_sampled = num_sampled

        # Center-word ids, shape [batch]; context-word ids, shape [batch, 1]
        self.input_words = tf.placeholder(tf.int32, shape=[None])
        self.output_words = tf.placeholder(tf.int32, shape=[None, 1])

        with tf.variable_scope('embedding'):
            # tf.contrib was removed in TensorFlow 2.x; the Glorot uniform
            # initializer is the same scheme as the Xavier initializer.
            embedding_matrix = tf.get_variable('embedding_matrix',
                                               shape=[self.vocab_size, self.embedding_size],
                                               initializer=tf.glorot_uniform_initializer())
            # Look up the embedding row of each center word
            embedding = tf.nn.embedding_lookup(embedding_matrix, self.input_words)

        with tf.variable_scope('nce_loss'):
            nce_weights = tf.get_variable('nce_weights',
                                          shape=[self.vocab_size, self.embedding_size],
                                          initializer=tf.glorot_uniform_initializer())
            nce_biases = tf.get_variable('nce_biases',
                                         shape=[self.vocab_size],
                                         initializer=tf.zeros_initializer())
            # nce_loss draws num_sampled negative classes per batch internally
            self.loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                                      biases=nce_biases,
                                                      labels=self.output_words,
                                                      inputs=embedding,
                                                      num_sampled=self.num_sampled,
                                                      num_classes=self.vocab_size))
        self.optimizer = tf.train.AdamOptimizer().minimize(self.loss)

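In the code above, the SkipGramModel class takes the vocabulary size, the embedding dimension, and the number of negative samples as parameters and defines two placeholders (input_words and output_words). It creates the embedding matrix with tf.get_variable, looks up the vectors of the input words with tf.nn.embedding_lookup, computes the loss with tf.nn.nce_loss, and updates the parameters with the Adam optimizer. Note that tf.placeholder and tf.train.AdamOptimizer belong to the TensorFlow 1.x graph API.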
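Because tf.nn.nce_loss hides the contrastive objective, it can help to see a closely related loss written out by hand. The sketch below computes the negative-sampling loss for a single (center, context) pair in NumPy: minus the log-sigmoid of the positive score, minus the sum of log-sigmoids of the negated negative scores. This is an illustration under stated assumptions, not what tf.nn.nce_loss computes exactly (NCE also corrects for the sampling probabilities), and the function names are hypothetical.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Loss for one positive pair and k negative words.

    center_vec:    shape [dim], input embedding of the center word
    context_vec:   shape [dim], output vector of the true context word
    negative_vecs: shape [k, dim], output vectors of the sampled negatives
    """
    positive_term = log_sigmoid(context_vec @ center_vec)
    negative_term = log_sigmoid(-(negative_vecs @ center_vec)).sum()
    return -(positive_term + negative_term)

rng = np.random.default_rng(0)
center = rng.normal(size=8)
negatives = rng.normal(size=(5, 8))
print(negative_sampling_loss(center, center, negatives))   # context aligned with center
print(negative_sampling_loss(center, -center, negatives))  # context opposed to center
```

The first printed loss is lower than the second, because a context vector that points the same way as the center vector gets a high positive score.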

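3. Training the model

To train, we feed the preprocessed text data to the Skip-gram model, and at the end we obtain an embedding vector for every word.

Below is a simple training example: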

def train_model(word_dict, matrix, embedding_size=300, window_size=5,
                num_epochs=5, num_neg_samples=5, batch_size=128, learning_rate=0.01):
    # Note: learning_rate is kept in the signature but is not used, because
    # SkipGramModel builds its AdamOptimizer with default settings.
    # Initialize model
    model = SkipGramModel(len(word_dict), embedding_size, num_neg_samples)
    session = tf.Session()
    session.run(tf.global_variables_initializer())

    # Recover the token-id sequence once instead of searching each one-hot row repeatedly
    word_ids = np.argmax(matrix, axis=1)

    # Train model
    for epoch in range(num_epochs):
        epoch_losses = []
        for start in range(0, len(word_ids), batch_size):
            end = min(start + batch_size, len(word_ids))
            pairs = []
            for j in range(start, end):
                # Context = up to window_size tokens on each side, taken from the
                # whole corpus so windows are not cut off at batch boundaries
                for k in range(max(0, j - window_size), min(j + window_size + 1, len(word_ids))):
                    if k != j:
                        pairs.append((word_ids[j], word_ids[k]))
            if not pairs:
                continue

            # Only true (center, context) pairs are fed as labels: nce_loss samples
            # the negatives itself, and feeding random words as labels would
            # train them as positives.
            pairs = np.array(pairs, dtype=np.int32)
            np.random.shuffle(pairs)  # shuffles rows in place
            feed_dict = {model.input_words: pairs[:, 0], model.output_words: pairs[:, 1:2]}
            loss, _ = session.run([model.loss, model.optimizer], feed_dict=feed_dict)
            epoch_losses.append(loss)

        # Report the mean batch loss of the epoch rather than only the last batch
        mean_loss = np.mean(epoch_losses) if epoch_losses else float('nan')
        print('Epoch {}/{}: Loss = {:.5f}'.format(epoch+1, num_epochs, mean_loss))

    # Get embeddings
    embeddings = session.run(tf.get_default_graph().get_tensor_by_name('embedding/embedding_matrix:0'))
    session.close()

    # Return word embeddings and dictionary
    return embeddings, word_dict

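In the code above, the train_model function trains in mini-batches: it walks through the corpus in order, batch_size tokens at a time, pairs each center word with the words inside its window to form training samples, and uses the SkipGramModel class to compute the NCE loss, which contrasts those pairs with sampled negative words, and to update the parameters. After training, the embedding vectors are obtained by reading the embedding_matrix variable from the graph with get_tensor_by_name; row i is the vector of the word whose index in word_dict is i.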
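Once an embedding matrix and a word dictionary are available, a typical next step is to look up the nearest neighbours of a word by cosine similarity. The following is a minimal sketch of that lookup; `most_similar` is a hypothetical helper, and the tiny hand-written embeddings below are made-up values standing in for a trained matrix.

```python
import numpy as np

def most_similar(word, embeddings, word_dict, top_k=3):
    """Return the top_k words closest to `word` by cosine similarity."""
    index_to_word = {idx: w for w, idx in word_dict.items()}
    # Normalize rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    query = word_dict[word]
    sims = unit @ unit[query]
    sims[query] = -np.inf  # exclude the query word itself
    best = np.argsort(-sims)[:top_k]
    return [(index_to_word[int(i)], float(sims[i])) for i in best]

# Made-up 2-dimensional embeddings for illustration only
word_dict = {"king": 0, "queen": 1, "apple": 2}
embeddings = np.array([[1.0, 0.1],
                       [0.9, 0.2],
                       [-0.2, 1.0]])
print(most_similar("king", embeddings, word_dict, top_k=2))
```

In this toy data "queen" ranks above "apple" for the query "king" because its vector points in nearly the same direction.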

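III. word2vec parameters

Besides the core code above, word2vec involves several commonly used parameters, which are introduced below.

1. Embedding dimension

The embedding dimension is the dimension of each word's embedding vector, usually between 100 and 300. A dimension that is too small may not fully express a word's semantic information, while one that is too large increases the parameter count and training cost and makes the model prone to overfitting.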
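The cost of a larger dimension is easy to quantify for the model in this article, which holds an embedding matrix, an NCE weight matrix of the same shape, and one bias per word. The sketch below does that arithmetic; the vocabulary size of 50,000 is an assumed figure for illustration, and `skipgram_param_count` is a hypothetical helper.

```python
def skipgram_param_count(vocab_size, embedding_size):
    """Count trainable parameters: embedding matrix + NCE weights + NCE biases."""
    embedding_matrix = vocab_size * embedding_size
    nce_weights = vocab_size * embedding_size
    nce_biases = vocab_size
    return embedding_matrix + nce_weights + nce_biases

# Assumed vocabulary of 50,000 words; float32 takes 4 bytes per parameter
for dim in (100, 200, 300):
    n = skipgram_param_count(50_000, dim)
    print(f"dim={dim}: {n:,} parameters, about {n * 4 / 1e6:.1f} MB as float32")
```

The count grows linearly with the dimension, so tripling the dimension roughly triples both memory use and the amount of data needed to fit the vectors well.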

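2. Window size

The window size is the number of words taken on each side of the center word as context in the Skip-gram model. It is usually set between 5 and 10. If it is too small, the model cannot capture enough context information; if it is too large, it tends to introduce noise from weakly related words.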
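One way to soften the noise from distant words is to shrink the window randomly for each center word, so that nearby words are sampled more often than distant ones. To my understanding the reference word2vec implementation does something along these lines, but treat the sketch below as an assumption rather than a reproduction of it; `dynamic_window_pairs` is a hypothetical name and the tokens are simply their own positions.

```python
import random

def dynamic_window_pairs(token_ids, window_size=5, seed=0):
    """Build (center, context) pairs with a window drawn uniformly from 1..window_size."""
    rng = random.Random(seed)
    pairs = []
    for i, center in enumerate(token_ids):
        reduced = rng.randint(1, window_size)  # effective window for this center word
        lo = max(0, i - reduced)
        hi = min(len(token_ids), i + reduced + 1)
        pairs.extend((center, token_ids[j]) for j in range(lo, hi) if j != i)
    return pairs

tokens = list(range(20))  # token id == position, so |center - context| is the distance
pairs = dynamic_window_pairs(tokens, window_size=5)
near = sum(1 for c, o in pairs if abs(c - o) == 1)
far = sum(1 for c, o in pairs if abs(c - o) == 5)
print("distance-1 pairs:", near, "distance-5 pairs:", far)
```

Adjacent words always appear as pairs, while words at the maximum distance appear only when the full window happens to be drawn.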

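3. Number of negative samples

The number of negative samples is how many negative samples are drawn for each positive sample in the Skip-gram model. It is usually set between 5 and 20. The original negative-sampling paper reports that values of 5 to 20 are useful for small training sets, while 2 to 5 can be enough for large ones.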
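Besides how many negatives to draw, it matters which distribution they are drawn from. The word2vec paper on negative sampling draws them from the unigram distribution raised to the 3/4 power, which gives rare words a somewhat higher chance than their raw frequency. The following is a minimal sketch of that idea; the function names and the toy id sequence are assumptions for illustration.

```python
import numpy as np

def noise_distribution(token_ids, vocab_size, power=0.75):
    """Unigram counts raised to `power`, normalized to a probability vector."""
    counts = np.bincount(token_ids, minlength=vocab_size).astype(np.float64)
    weights = counts ** power
    return weights / weights.sum()

def sample_negatives(probs, positive_id, k, rng):
    """Draw k word ids from probs, rejecting the positive (true context) word."""
    negatives = []
    while len(negatives) < k:
        candidate = int(rng.choice(len(probs), p=probs))
        if candidate != positive_id:
            negatives.append(candidate)
    return negatives

# Toy corpus: word 0 appears 8 times, words 1 and 2 once each
token_ids = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])
probs = noise_distribution(token_ids, vocab_size=3)
rng = np.random.default_rng(0)
print(probs)
print(sample_negatives(probs, positive_id=1, k=5, rng=rng))
```

In the toy corpus word 0 makes up 80% of the tokens, but its sampling probability after the 3/4 power is lower than that, and the two rare words each get more than their raw 10%.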

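IV. Conclusion

This article walked through the basic idea of word2vec and a code implementation of it, and introduced some commonly used parameters and options. When using word2vec, we can build on existing tools and code to generate and tune an implementation quickly and obtain better results.

Original article by JDCD. If you repost it, please credit the source: https://www.506064.com/zh-tw/n/145998.html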