Contents of this article:
- 1. Counting word frequencies with Python
- 2. Counting word frequencies in one column of a CSV file with Python
- 3. How to tokenize with Python and jieba and count word frequencies?
- 4. Python question: why do I keep getting UnicodeDecodeError: 'utf-8' when doing Chinese word-frequency analysis in Python?
Counting word frequencies with Python
def statistics(astr):
    # split one line on tab characters and keep each distinct field only once
    # astr.replace("\n", "")
    slist = list(astr.split("\t"))
    alist = []
    [alist.append(i) for i in slist if i not in alist]
    # strip the trailing newline from the last field
    alist[-1] = alist[-1].replace("\n", "")
    return alist

if __name__ == "__main__":
    code_doc = {}
    with open("test_data.txt", "r", encoding="utf-8") as fs:
        for ln in fs.readlines():
            l = statistics(ln)
            for t in l:
                # count each distinct field once per line
                if t not in code_doc:
                    code_doc.setdefault(t, 1)
                else:
                    code_doc[t] += 1
    for keys in code_doc.keys():
        print(keys + ' ' + str(code_doc[keys]))
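The same counting can also be written more compactly with collections.Counter from the standard library. The sketch below is an alternative to the code above, not part of the original answer, and assumes the same test_data.txt layout (one record per line, fields separated by tabs, each distinct field counted once per line):

from collections import Counter

counts = Counter()
with open("test_data.txt", "r", encoding="utf-8") as fs:
    for ln in fs:
        # strip the newline, split on tabs, and deduplicate within the line
        counts.update(set(ln.rstrip("\n").split("\t")))

for word, freq in counts.items():
    print(word, freq)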
Counting word frequencies in one column of a CSV file with Python
import re
import collections
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# to avoid problems, use the full path as the file name
data = pd.read_csv('XXX.csv')
trainheadlines = []
for row in range(0, len(data.index)):
    trainheadlines.append(' '.join(str(x) for x in data.iloc[row, m:n]))
# m:n above selects which column, or which columns, to take
advancedvectorizer = TfidfVectorizer(
    min_df=0, max_df=1, max_features=20000, ngram_range=(1, 1))
advancedtrain = advancedvectorizer.fit_transform(trainheadlines)
print(advancedtrain.shape)
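Note that TfidfVectorizer produces TF-IDF weights rather than raw counts. If plain word frequencies for the chosen column are what is wanted, the CountVectorizer imported above can be used and its sparse matrix summed over the rows. A minimal sketch under the same assumptions (the placeholder XXX.csv and m:n column slice; get_feature_names_out requires scikit-learn 1.0 or newer):

# raw term counts for the same column
countvectorizer = CountVectorizer(max_features=20000, ngram_range=(1, 1))
counts = countvectorizer.fit_transform(trainheadlines)
totals = counts.sum(axis=0).A1          # total count of each word across all rows
freq = collections.Counter(dict(zip(countvectorizer.get_feature_names_out(), totals)))
for word, n in freq.most_common(20):
    print(word, n)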
How to tokenize with Python and jieba and count word frequencies?
#! python3
# -*- coding: utf-8 -*-
import os, codecs
import jieba
from collections import Counter

def get_words(txt):
    seg_list = jieba.cut(txt)          # tokenize the text with jieba
    c = Counter()
    for x in seg_list:
        # count only tokens longer than one character, skipping line breaks
        if len(x) > 1 and x != '\r\n':
            c[x] += 1
    print('Frequency of common words:')
    for (k, v) in c.most_common(100):
        print('%s%s %s %d' % (' ' * (5 - len(k)), k, '*' * int(v / 3), v))

if __name__ == '__main__':
    with codecs.open('19d.txt', 'r', 'utf8') as f:
        txt = f.read()
    get_words(txt)
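If the bar-chart style output is not needed, the same result can be obtained in a few lines with jieba.lcut and Counter. This is an equivalent shorter sketch, not the original code, assuming the same UTF-8 encoded 19d.txt:

import jieba
from collections import Counter

with open('19d.txt', 'r', encoding='utf8') as f:
    words = jieba.lcut(f.read())               # lcut returns the tokens as a list
top = Counter(w for w in words if len(w) > 1 and w != '\r\n').most_common(100)
for word, freq in top:
    print(word, freq)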
Python question: why do I keep getting UnicodeDecodeError: 'utf-8' when doing Chinese word-frequency analysis in Python?
Cause: the file is not UTF-8 encoded, while the system decodes it as UTF-8 by default.
The solution is to decode with the encoding the file actually uses.
Fix:
In the editor, choose "File -> Save As"; you will see the file's default encoding is ANSI. Change the encoding to UTF-8 and save.
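The same fix can also be made in code by passing the file's actual encoding to open(); Chinese text saved as ANSI on Windows is usually GBK. A minimal sketch, reusing the 19d.txt file name from the previous section as an example and assuming the file really is GBK-encoded:

# assuming the file was saved as ANSI (GBK) on Windows
with open('19d.txt', 'r', encoding='gbk') as f:
    txt = f.read()

# if the real encoding is unknown, errors='replace' at least avoids the crash,
# at the cost of substituting undecodable bytes
with open('19d.txt', 'r', encoding='utf-8', errors='replace') as f:
    txt = f.read()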