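I. Basic principles of search engine result display
A search engine's main job is to take the keywords a user enters, pick the most relevant content from a huge number of web pages, and show it on the results page. The process breaks down into three basic steps:
1. Crawling: the search engine uses crawler programs (also called spiders) to fetch web pages from the internet.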
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

public class Spider {
    private static final int MAX_PAGES_TO_SEARCH = 10;
    private final Set<String> pagesVisited = new HashSet<>();
    private final List<String> pagesToVisit = new LinkedList<>();

    /**
     * Our main launching point for the Spider's functionality. Internally it creates spider legs
     * that make an HTTP request and parse the response (the web page).
     *
     * @param url
     *            - The starting point of the spider
     * @param searchWord
     *            - The word or string that you are searching for
     */
    public void search(String url, String searchWord) {
        // Seed the queue with the starting URL
        this.pagesToVisit.add(url);
        while (this.pagesVisited.size() < MAX_PAGES_TO_SEARCH) {
            String currentUrl = this.nextUrl();
            if (currentUrl == null) {
                break; // no unvisited pages left, so stop instead of looping forever
            }
            // SpiderLeg is not shown in this article; it is expected to fetch the page
            // and to provide crawl(), searchForWord() and getLinks()
            SpiderLeg leg = new SpiderLeg();
            leg.crawl(currentUrl);
            boolean success = leg.searchForWord(searchWord);
            if (success) {
                System.out.println(String.format("**Success** Word %s found at %s", searchWord, currentUrl));
                break;
            }
            this.pagesToVisit.addAll(leg.getLinks());
        }
        System.out.println(String.format("**Done** Visited %s web page(s)", this.pagesVisited.size()));
    }

    /**
     * Returns the next URL to visit (in the order that they were found). We also do a check to
     * make sure this method doesn't return a URL that has already been visited.
     *
     * @return the next unvisited URL, or null if none remain
     */
    private String nextUrl() {
        while (!this.pagesToVisit.isEmpty()) {
            String nextUrl = this.pagesToVisit.remove(0);
            // add() returns false when the URL was already visited
            if (this.pagesVisited.add(nextUrl)) {
                return nextUrl;
            }
        }
        return null;
    }
}
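To see the visiting order without touching the network, the same queue-and-visited-set logic can be run against a small in-memory link graph. This is only a sketch: `FrontierDemo`, `crawlOrder`, and the page names "a" to "d" are hypothetical and are not part of the Spider class above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: breadth-first crawl order over an in-memory link graph.
class FrontierDemo {

    /** Returns pages in the order a breadth-first crawler would visit them, up to maxPages. */
    static List<String> crawlOrder(Map<String, List<String>> links, String start, int maxPages) {
        Set<String> visited = new HashSet<>();
        Deque<String> toVisit = new ArrayDeque<>();
        List<String> order = new ArrayList<>();
        toVisit.add(start);
        while (!toVisit.isEmpty() && order.size() < maxPages) {
            String current = toVisit.poll();
            if (!visited.add(current)) {
                continue; // already crawled, skip the duplicate
            }
            order.add(current);
            // Queue the outgoing links; pages without an entry have no links
            toVisit.addAll(links.getOrDefault(current, List.of()));
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a", "d"),
                "c", List.of("d"));
        System.out.println(crawlOrder(links, "a", 10)); // prints [a, b, c, d]
    }
}
```

The visited set is what keeps the link from "b" back to "a" from causing a second visit, which is the same role `pagesVisited` plays in the Spider.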
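2. Indexing: the search engine stores the crawled pages in its index and indexes each page's main content. The goal is to find pages containing a keyword quickly and accurately, which is mainly done with algorithms such as TF-IDF.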
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.nodes.Document; // assumption: the original does not say which Document class is used

public class Indexer {
    // WebCrawler is not shown in this article; it is expected to provide
    // getDocument(url) and getWordsFromDocument(text)
    private final WebCrawler spider;
    // word -> URL on which the word occurs most often (field name kept from the original)
    private final Map<String, String> frequencyToUrlMap;
    // word -> (URL -> number of occurrences of the word on that page)
    private final Map<String, Map<String, Integer>> wordUrlsMap;

    public Indexer(WebCrawler spider) {
        this.spider = spider;
        frequencyToUrlMap = new HashMap<>();
        wordUrlsMap = new HashMap<>();
    }

    /**
     * Index a page by its URL
     *
     * @param url
     *            - The URL of the page to be indexed
     */
    public void indexPage(String url) {
        System.out.println("Indexing " + url);
        Document document = spider.getDocument(url);
        String text = document.text();
        List<String> words = spider.getWordsFromDocument(text);
        // Count how often each word occurs on this page
        for (String word : words) {
            Map<String, Integer> urlToCountMap = wordUrlsMap.computeIfAbsent(word, k -> new HashMap<>());
            urlToCountMap.merge(url, 1, Integer::sum);
        }
        // For each word, record the URL on which it occurs most often
        for (Map.Entry<String, Map<String, Integer>> entry : wordUrlsMap.entrySet()) {
            String word = entry.getKey();
            String bestUrl = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> urlEntry : entry.getValue().entrySet()) {
                if (urlEntry.getValue() > bestCount) {
                    bestCount = urlEntry.getValue();
                    bestUrl = urlEntry.getKey();
                }
            }
            // TODO: implement full sorting by frequency
            frequencyToUrlMap.put(word, bestUrl);
            System.out.println("indexing " + word + ", " + bestUrl);
        }
    }
}
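The Indexer above only counts raw word frequencies, while the text names TF-IDF as the kind of algorithm used for indexing. A minimal sketch of one common TF-IDF variant follows, assuming the documents have already been split into words. The class `TfIdfDemo`, its method names, and the sample documents are hypothetical, and real search engines use more elaborate weighting.

```java
import java.util.List;

// Hypothetical sketch of one common TF-IDF variant; not the article's Indexer.
class TfIdfDemo {

    /** Term frequency: the share of the document's words that equal the term. */
    static double tf(String term, List<String> doc) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    /** Inverse document frequency: ln(N / df), or 0 if no document contains the term. */
    static double idf(String term, List<List<String>> docs) {
        long df = docs.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) docs.size() / df);
    }

    /** TF-IDF score of a term for one document within a collection. */
    static double tfIdf(String term, List<String> doc, List<List<String>> docs) {
        return tf(term, doc) * idf(term, docs);
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("search", "engine", "index"),
                List.of("search", "results", "page"),
                List.of("web", "crawler"));
        // "search" occurs in 2 of 3 documents and "index" in only 1,
        // so "index" gets the higher score for the first document
        System.out.println(tfIdf("search", docs.get(0), docs));
        System.out.println(tfIdf("index", docs.get(0), docs));
    }
}
```

The point of the IDF factor is that a word found on almost every page says little about any single page, so its weight shrinks toward zero.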
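3. Displaying results: once the user enters a keyword, the search engine shows the pages that contain it on the results page and ranks them so that the most relevant pages come first.
II. Main forms of search result display
Search results usually take the following main forms:
1. Blue link + title + description: this is the most common form. After searching, the user sees a list of links, each accompanied by the page's title and description, which allows a first pass at filtering the results.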
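The ranking step can be sketched as a sort over per-page relevance scores. The example below assumes the scores have already been computed by some relevance measure; `RankingDemo`, `rank`, and the example.com URLs are hypothetical and only illustrate the ordering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: order result URLs by a precomputed relevance score.
class RankingDemo {

    /** Returns URLs from highest to lowest score; ties are broken alphabetically by URL. */
    static List<String> rank(Map<String, Double> scores) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue()); // higher score first
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        List<String> urls = new ArrayList<>();
        for (Map.Entry<String, Double> entry : entries) {
            urls.add(entry.getKey());
        }
        return urls;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of(
                "https://example.com/a", 0.12,
                "https://example.com/b", 0.37,
                "https://example.com/c", 0.05);
        // prints [https://example.com/b, https://example.com/a, https://example.com/c]
        System.out.println(rank(scores));
    }
}
```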
For example, a result with the title "Example Domain" and the description "This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission."
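How a link, title, and description might be assembled into one plain-text result entry can be sketched as follows. `SnippetDemo`, `format`, the 60-character limit, and the URL `https://example.com/` are assumptions made for illustration. Real result pages are rendered as HTML and choose the description text in more sophisticated ways.

```java
// Hypothetical sketch: build a plain-text result entry from title, URL and description.
class SnippetDemo {

    /** Formats one result as title, URL, and a description cut to at most maxLen characters. */
    static String format(String title, String url, String description, int maxLen) {
        String shown = description.length() <= maxLen
                ? description
                : description.substring(0, maxLen).trim() + "...";
        return title + "\n" + url + "\n" + shown;
    }

    public static void main(String[] args) {
        String description = "This domain is established to be used for illustrative "
                + "examples in documents.";
        // The description is longer than 60 characters, so it is shortened and ends with "..."
        System.out.println(format("Example Domain", "https://example.com/", description, 60));
    }
}
```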
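2. Image and rich-media results: for certain searches, the search engine also shows results that include images, video, news, and similar content. This form is more visual and easier for users to take in.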
For example, such a result might carry the description "This is an example description of the linked page."
Original article by EBTB. If you repost it, please credit the source: https://www.506064.com/zh-tw/n/143100.html