深入解析onehot

一、onehot向量

1、onehot向量，也叫做one-of-k編碼，將一個離散的類別變數映射為一個高維向量，具體來說就是將一個類別變數映射為一個只有一個元素為1，其他元素都為0的向量。

2、舉個例子，假如現在有一個屬性為{red, green, blue}，那麼它可以被映射為如下三個向量：[1, 0, 0], [0, 1, 0], [0, 0, 1]。

3、onehot向量常用於表示分類變數，如性別和學歷等，方便模型計算。

二、one shot

1、與onehot向量類似的概念是one shot，它也是用來將類別變數映射成向量的方法。不同的是，onehot向量只有一個元素為1，其他為0，而one shot中，每個元素都可以是1或0。

2、繼續以上面的顏色屬性為例，one shot可以將紅色映射為[1,0,0]，淺紅色映射為[1,1,0]，黃色映射為[0,1,1]。

三、onehotencoder 用法

1、onehotencoder是sklearn里一個用來進行onehot編碼的類。其用法如下：

from sklearn.preprocessing import OneHotEncoder

# 假設我們的數據長這樣
x = [['青島', 22], ['濟南', 21], ['青島', 23], ['煙台', 20]]

# 我們先對城市進行標籤編碼
from sklearn.preprocessing import LabelEncoder
city_encoder = LabelEncoder()
city_labels = city_encoder.fit_transform([row[0] for row in x])

# 再對城市進行onehot編碼，要注意reshape一下
ohe = OneHotEncoder(categories='auto')
x_city = ohe.fit_transform(city_labels.reshape(-1, 1))

# 將年齡和城市合併為一個矩陣
x_age = np.array([row[1] for row in x]).reshape(-1, 1)
x1 = np.hstack((x_city.toarray(), x_age))

2、上面的代碼中，我們先用LabelEncoder對青島、濟南、煙台進行了標籤編碼，將它們分別編碼為0、1、2。接著我們用OneHotEncoder對城市進行了onehot編碼，得到如下編碼結果：

| 青島 | 濟南 | 煙台 |
| :—–: | :—–: | :—–: |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 0 | 1 |

3、最後，我們把城市和年齡合併為一個矩陣，得到完整的樣本：

| 青島 | 濟南 | 煙台 | 年齡 |
| :—: | :—: | :—: | :—: |
| 1 | 0 | 0 | 22 |
| 0 | 1 | 0 | 21 |
| 1 | 0 | 0 | 23 |
| 0 | 0 | 1 | 20 |

四、onehot code

1、實現一個簡單的onehot編碼的代碼如下：

def one_hot_encoding(x):
    """
    :param x: 輸入列表
    :return: onehot編碼後的二維列表
    """
    unique_x = list(set(x))
    encoding_map = {unique_x[i]: [0] * i + [1] + [0] * (len(unique_x)-i-1) for i in range(len(unique_x))}
    return [encoding_map[e] for e in x]

2、此代碼可以輸入任意一個列表，返回它的onehot編碼結果。大致思路是先找出列表裡的所有唯一值，然後用一個字典生成式來生成每個唯一值的對應編碼，最後把整個列表每個元素都進行編碼。

五、onehot狀態

1、在實際應用中，onehot編碼可能會因為離散類別數量過多而導致維度過高，進而影響模型效果。

2、還有一種情況是，如果本應該被當做一個整體來考慮的類別變數被拆開編碼之後，可能會失去一些本應有的信息。

3、因此，onehot編碼需要根據具體情況來判斷是否適用，以及在使用時需要注意類別數量和編碼方式的選擇，既要保證有效表示信息，也要盡量避免維度災難的影響。

六、onehot可以轉換string

1、在sklearn中，我們可以用LabelEncoder將離散類別映射成數字，然後使用OneHotEncoder將數字轉換成onehot向量；而在tensorflow中，我們可以使用tf.feature_column.categorical_column_with_vocabulary_list將離散類別映射成onehot向量。

2、例：

import tensorflow as tf

# 假設我們有一個字元串向量
colors = tf.constant(['red', 'blue', 'green', 'green', 'yellow'])

# 我們使用categorical_column_with_vocabulary_list生成一個特徵列
colors_column = tf.feature_column.categorical_column_with_vocabulary_list(key='colors', vocabulary_list=['red', 'blue', 'green', 'yellow'])

# 使用onehot_column將特徵列轉換成onehot向量
colors_onehot = tf.feature_column.indicator_column(colors_column)

# 將向量輸入到神經網路中
net = tf.layers.dense(inputs=tf.feature_column.input_layer({'colors': colors}), units=10)

3、上面的代碼中，我們使用了categorical_column_with_vocabulary_list將字元串向量映射成了onehot向量，得到的結果與前面LabelEncoder和OneHotEncoder所得到的結果相同。

總結

本文對onehot從多個方面進行了深入闡述，包括onehot向量、one shot、onehotencoder 用法、onehot code、onehot狀態和onehot可以轉換string等方面。其中，onehot向量和onehotencoder是常用的應用方式，而onehot狀態需要根據具體情況進行選擇和優化。

原創文章，作者：小藍，如若轉載，請註明出處：https://www.506064.com/zh-tw/n/249539.html