Elasticsearch Analyzers Explained

1. Introduction

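With the spread of cloud computing and big data, search engines have become an important part of today's internet technology. Elasticsearch is an open-source full-text search engine, and its analyzers are a core component: they decide how text is broken into the terms that are indexed and searched. An Elasticsearch analyzer is made up of three parts, applied in this order: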

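Character filters: process the raw input at the character level before it is tokenized, for example stripping HTML tags or replacing characters
Tokenizer: splits the filtered text into individual tokens (terms)
Token filters: modify, remove, or add tokens, for example stemming, lowercasing, and stopword removal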
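
The three stages can be sketched in plain Python. This is a toy model of the pipeline order only, not Elasticsearch's implementation; the function names and the small English stopword list are assumptions made for illustration.

```python
import re

def strip_html(text):
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenize(text):
    """Tokenizer: split the filtered text on runs of whitespace."""
    return text.split()

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "is"})):
    """Token filter: drop tokens that appear in the (illustrative) stopword list."""
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    """Run the stages in the same order an analyzer does:
    character filters, then the tokenizer, then token filters."""
    filtered = strip_html(text)
    tokens = whitespace_tokenize(filtered)
    tokens = lowercase(tokens)
    return remove_stopwords(tokens)

print(analyze("<p>The Quick <b>Brown</b> Fox</p>"))
# ['quick', 'brown', 'fox']
```

Note that the stopword filter runs after lowercasing, which is why "The" is removed even though the list only contains "the". The order of token filters matters in a real analyzer as well.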

2. Types of Analyzers

Elasticsearch ships with a number of built-in analyzers that differ in how they split text. Five commonly used kinds are:

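Standard Analyzer: the default; splits text on word boundaries as defined by the Unicode Text Segmentation algorithm, drops most punctuation, and lowercases the terms
Simple Analyzer: splits text at every non-letter character and lowercases the terms
Whitespace Analyzer: splits text on whitespace only and does not lowercase
Keyword Analyzer: treats the entire input as a single term; commonly used for filtering or exact-match queries
Language-specific Analyzers: tuned to the characteristics of a particular language, such as English, German, or French; Chinese word segmentation is usually handled by plugins (see section 3)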
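
To make the differences concrete, here is a rough Python approximation of three of these analyzers. It is a sketch of the documented behaviour, not the Lucene code; the function names are made up for this example, and the standard analyzer is left out because its Unicode word-boundary rules do not reduce to a one-line regular expression.

```python
import re

def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()

def simple_analyzer(text):
    """Split at every non-letter character, then lowercase.
    [^\\W\\d_]+ matches runs of letters (word characters minus digits and underscore)."""
    return [m.lower() for m in re.findall(r"[^\W\d_]+", text)]

def keyword_analyzer(text):
    """Emit the whole input as one single term."""
    return [text]

sample = "Quick-Brown FOX, 2 dogs"
print(whitespace_analyzer(sample))  # ['Quick-Brown', 'FOX,', '2', 'dogs']
print(simple_analyzer(sample))      # ['quick', 'brown', 'fox', 'dogs']
print(keyword_analyzer(sample))     # ['Quick-Brown FOX, 2 dogs']
```

The digit "2" disappears under the simple analyzer because it is not a letter, while the whitespace analyzer keeps it along with the trailing comma on "FOX,".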

3. Using Chinese Analyzers

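Chinese word segmentation is a hard problem because written Chinese puts no spaces between words. Elasticsearch does not include a dictionary-based Chinese analyzer in its core distribution; common choices are plugins such as the third-party IK analyzer and the official smartcn plugin (analysis-smartcn). The following example uses IK, which must be installed first, to segment Chinese text: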

$ curl -X PUT "localhost:9200/test" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_max_word": { 
          "type": "custom",
          "tokenizer": "ik_max_word"
        },
        "ik_smart": {
          "type": "custom",
          "tokenizer": "ik_smart"
        }
      }
    }
  }
}
'

$ curl -X GET "localhost:9200/test/_analyze?pretty=true" -H 'Content-Type: application/json' -d'
{
  "analyzer": "ik_max_word",
  "text": "我是一名全能編程開發工程師"
}
'

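// Response (the exact tokens depend on the IK version and its dictionary; ik_max_word may also emit overlapping sub-words):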
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "一名",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "全能",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "編程",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "開發",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "工程師",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}
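
The request above configures two IK tokenizers: ik_smart aims for the coarsest segmentation, while ik_max_word aims for the finest one and emits every word it can find. The toy Python sketch below illustrates that difference with a tiny hand-written dictionary. It is not IK's actual algorithm or dictionary; the word list and both function names are assumptions made for this example.

```python
# Hypothetical mini-dictionary; a real plugin loads a far larger one from disk.
DICTIONARY = {"一名", "全能", "編程", "開發", "工程", "工程師"}
MAX_LEN = max(len(word) for word in DICTIONARY)

def segment_smart(text):
    """Coarse-grained: greedy longest match from left to right, no overlaps.
    A character that starts no dictionary word becomes a single-character token."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in DICTIONARY:
                tokens.append(candidate)
                i += size
                break
    return tokens

def segment_max_word(text):
    """Fine-grained: emit every dictionary word found at any position,
    plus single characters that no dictionary word covers."""
    found, covered = [], set()
    for i in range(len(text)):
        for size in range(min(MAX_LEN, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in DICTIONARY:
                found.append((i, candidate))
                covered.update(range(i, i + size))
    for i, ch in enumerate(text):
        if i not in covered:
            found.append((i, ch))
    # Order by start position, longer words first at the same position.
    found.sort(key=lambda item: (item[0], -len(item[1])))
    return [word for _, word in found]

text = "我是一名全能編程開發工程師"
print(segment_smart(text))
# ['我', '是', '一名', '全能', '編程', '開發', '工程師']
print(segment_max_word(text))
# ['我', '是', '一名', '全能', '編程', '開發', '工程師', '工程']
```

In this toy model the fine-grained mode additionally emits "工程" because it overlaps with the longer "工程師". Indexing such overlapping tokens tends to improve recall, while the coarse-grained mode produces fewer, more precise terms.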

4. Using Token Filters

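Token filters modify, remove, or add tokens, for example through stemming, lowercasing, and stopword removal. The following example uses a stop token filter to remove stopwords from the sentence 「我是一個全能編程開發工程師」: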

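// Requires the IK plugin. The ik_smart tokenizer is used because the standard
// tokenizer emits every Chinese character as a separate token, so a
// multi-character stopword such as "一個" would never match.
PUT /stopwords_test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["我", "是", "一個"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}

GET /stopwords_test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "我是一個全能編程開發工程師"
}

// Response, assuming ik_smart segments the text as 我 / 是 / 一個 / 全能 / 編程 / 開發 / 工程師
// (the actual tokens depend on the IK version and its dictionary):
{
  "tokens": [
    {
      "token": "全能",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "編程",
      "start_offset": 6,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "開發",
      "start_offset": 8,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "工程師",
      "start_offset": 10,
      "end_offset": 13,
      "type": "CN_WORD",
      "position": 6
    }
  ]
}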
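
A stop filter removes tokens but leaves the offsets and positions of the surviving tokens untouched, which is why the remaining positions do not start at 0. The Python sketch below works through that arithmetic. It assumes the sentence has already been segmented into the listed words; the helper names are hypothetical, and the offsets are simple character counts.

```python
def tokens_with_offsets(words):
    """Attach start_offset, end_offset and position to already-segmented,
    contiguous words (no whitespace between them)."""
    tokens, offset = [], 0
    for position, word in enumerate(words):
        tokens.append({
            "token": word,
            "start_offset": offset,
            "end_offset": offset + len(word),
            "position": position,
        })
        offset += len(word)
    return tokens

def stop_filter(tokens, stopwords):
    """Drop stopword tokens; the rest keep their original offsets and positions."""
    return [t for t in tokens if t["token"] not in stopwords]

# Assumed segmentation of 我是一個全能編程開發工程師
words = ["我", "是", "一個", "全能", "編程", "開發", "工程師"]
for token in stop_filter(tokens_with_offsets(words), {"我", "是", "一個"}):
    print(token)
# {'token': '全能', 'start_offset': 4, 'end_offset': 6, 'position': 3}
# {'token': '編程', 'start_offset': 6, 'end_offset': 8, 'position': 4}
# {'token': '開發', 'start_offset': 8, 'end_offset': 10, 'position': 5}
# {'token': '工程師', 'start_offset': 10, 'end_offset': 13, 'position': 6}
```

The three removed stopwords occupy characters 0 to 3 and positions 0 to 2, so the first surviving token starts at offset 4 with position 3. Keeping the gaps lets phrase queries still account for the words that were removed.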

5. Conclusion

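This article covered the basics of Elasticsearch analyzers and how to use them. Different analyzers and token filters suit different scenarios, so choose among them according to your data and query requirements. Hopefully it is useful to developers working with Elasticsearch.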

Original article by KDQKF. If you republish it, please credit the source: https://www.506064.com/zh-tw/n/371741.html