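I. Basic Principles of Search Engine Result Display
The main job of a search engine is to take the keywords a user enters, select the most relevant content from a very large number of web pages, and show that content on the search results page. The process breaks down into three basic steps:
1. Crawling: the search engine uses a crawler program (also called a spider) to fetch web pages from the internet.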
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

public class Spider {
    private static final int MAX_PAGES_TO_SEARCH = 10;
    private final Set<String> pagesVisited = new HashSet<>();
    private final List<String> pagesToVisit = new LinkedList<>();

    /**
     * Main entry point for the spider. Internally it creates spider legs
     * that make an HTTP request and parse the response (the web page).
     * SpiderLeg is not shown in this article; it is assumed to provide
     * crawl(String), searchForWord(String) and getLinks().
     *
     * @param url
     *            - The starting point of the spider
     * @param searchWord
     *            - The word or string that you are searching for
     */
    public void search(String url, String searchWord) {
        while (this.pagesVisited.size() < MAX_PAGES_TO_SEARCH) {
            String currentUrl;
            SpiderLeg leg = new SpiderLeg();
            if (this.pagesVisited.isEmpty()) {
                // First iteration: start from the seed URL.
                currentUrl = url;
                this.pagesVisited.add(url);
            } else {
                currentUrl = this.nextUrl();
                if (currentUrl == null) {
                    break; // No unvisited pages are left in the queue.
                }
            }
            // Fetches the page and collects its links; see the crawl method in SpiderLeg.
            leg.crawl(currentUrl);
            boolean success = leg.searchForWord(searchWord);
            if (success) {
                System.out.println(String.format("**Success** Word %s found at %s", searchWord, currentUrl));
                break;
            }
            this.pagesToVisit.addAll(leg.getLinks());
        }
        System.out.println(String.format("**Done** Visited %s web page(s)", this.pagesVisited.size()));
    }

    /**
     * Returns the next URL to visit (in the order that they were found),
     * skipping any URL that has already been visited.
     *
     * @return the next unvisited URL, or null if the queue is exhausted
     */
    private String nextUrl() {
        while (!this.pagesToVisit.isEmpty()) {
            String nextUrl = this.pagesToVisit.remove(0);
            // Set.add returns false when the URL was already visited.
            if (this.pagesVisited.add(nextUrl)) {
                return nextUrl;
            }
        }
        return null;
    }
}
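The Spider class above relies on a SpiderLeg class that the article does not show. As a rough idea of what such a helper has to do, here is a minimal sketch that pulls absolute links out of an HTML string and checks whether a word occurs in it. The class name LinkExtractorSketch and both method names are hypothetical, and the regular expression is a simplification; a real crawler would more likely use a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the parsing part of SpiderLeg (not the article's code).
public class LinkExtractorSketch {
    // Matches href="http://..." or href="https://..." with double quotes only.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute http(s) links found in the HTML, in document order. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    /** Case-insensitive substring check over the raw HTML (tags included). */
    public static boolean containsWord(String html, String word) {
        return html.toLowerCase().contains(word.toLowerCase());
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://example.com/a\">Search</a> "
                + "<a href=\"/relative\">ignored</a>";
        System.out.println(extractLinks(html));
        System.out.println(containsWord(html, "search"));
    }
}
```

Relative links are skipped here on purpose; resolving them against the page URL is left out to keep the sketch short.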
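2. Indexing: the search engine stores the pages fetched by the crawler in its index and indexes the main content of each page. The purpose of indexing is to find the pages that contain a keyword quickly and accurately, which is done mainly by computing term weights with algorithms such as TF-IDF.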
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assumption: Document is jsoup's org.jsoup.nodes.Document, since text() is called on it.
import org.jsoup.nodes.Document;

public class Indexer {
    // WebCrawler is not shown in this article; it is assumed to provide
    // getDocument(String) and getWordsFromDocument(String).
    private final WebCrawler spider;
    // word -> most recently indexed URL that contains the word
    private final Map<String, String> frequencyToUrlMap;
    // word -> (URL -> number of occurrences of the word on that page)
    private final Map<String, Map<String, Integer>> wordUrlsMap;

    public Indexer(WebCrawler spider) {
        this.spider = spider;
        frequencyToUrlMap = new HashMap<>();
        wordUrlsMap = new HashMap<>();
    }

    /**
     * Index a page by its URL
     *
     * @param url
     *            - The URL of the page to be indexed
     */
    public void indexPage(String url) {
        System.out.println("Indexing " + url);
        Document document = spider.getDocument(url);
        String text = document.text();
        List<String> words = spider.getWordsFromDocument(text);
        // Count how often each word occurs on this page
        for (String word : words) {
            wordUrlsMap.computeIfAbsent(word, k -> new HashMap<>())
                    .merge(url, 1, Integer::sum);
        }
        // For each word on this page, total its frequency across all indexed pages
        for (Map.Entry<String, Map<String, Integer>> entry : wordUrlsMap.entrySet()) {
            String word = entry.getKey();
            Map<String, Integer> urlToCountMap = entry.getValue();
            if (!urlToCountMap.containsKey(url)) {
                continue; // The word does not occur on the page being indexed
            }
            int frequency = 0;
            for (int count : urlToCountMap.values()) {
                frequency += count;
            }
            // TODO: implement sorting by frequency
            frequencyToUrlMap.put(word, url);
            System.out.println("indexing " + word + ", " + url + " (total frequency " + frequency + ")");
        }
    }
}
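The Indexer above only counts raw word frequencies, while the text mentions TF-IDF. The following is a minimal sketch of one common TF-IDF variant, assuming tf = (occurrences of the term in the document) / (document length) and idf = ln(N / df), where N is the number of documents and df the number of documents containing the term. The class name TfIdfSketch and the tiny corpus are made up for illustration; production search engines typically use smoothed or otherwise tuned variants.

```java
import java.util.List;

// Illustrative TF-IDF computation over tokenized documents (not the article's code).
public class TfIdfSketch {

    /** Term frequency: occurrences of the term divided by the document length. */
    public static double tf(String term, List<String> doc) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    /** Inverse document frequency: ln(N / df), or 0 if no document contains the term. */
    public static double idf(String term, List<List<String>> corpus) {
        long df = corpus.stream().filter(doc -> doc.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) corpus.size() / df);
    }

    /** TF-IDF weight of the term for one document of the corpus. */
    public static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("search", "engine", "index"),
                List.of("search", "result", "page"),
                List.of("web", "crawler", "spider"));
        for (List<String> doc : corpus) {
            System.out.printf("%s -> tf-idf(search) = %.4f%n", doc, tfIdf("search", doc, corpus));
        }
    }
}
```

A term that appears in every document gets an idf of 0 under this definition, which is why very common words contribute nothing to the weight.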
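3. Displaying results: after the user enters a keyword, the search engine shows the pages that contain it on the search results page. It also ranks those pages so that the most relevant ones appear first.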
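The ranking step can be sketched on top of an index shaped like the wordUrlsMap used earlier (word -> URL -> occurrence count). In this sketch a page's score is simply the sum of the counts of all query words on that page, with ties broken alphabetically by URL. The class name RankingSketch and the scoring rule are assumptions chosen for illustration; real search engines combine many more signals.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative ranking by summed word counts (not the article's code).
public class RankingSketch {

    /** Returns the URLs matching any query word, highest score first. */
    public static List<String> rank(Map<String, Map<String, Integer>> index,
                                    List<String> queryWords) {
        Map<String, Integer> scores = new HashMap<>();
        for (String word : queryWords) {
            Map<String, Integer> urlToCount = index.getOrDefault(word, Map.of());
            for (Map.Entry<String, Integer> entry : urlToCount.entrySet()) {
                // Add this word's count to the page's running score.
                scores.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        // Sort by score descending, then by URL ascending for stable output.
        ranked.sort(Comparator.comparing((String url) -> scores.get(url)).reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> index = new HashMap<>();
        index.put("search", Map.of("https://example.com/a", 3, "https://example.com/b", 1));
        index.put("engine", Map.of("https://example.com/b", 5));
        System.out.println(rank(index, List.of("search", "engine")));
    }
}
```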
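II. Main Forms of Search Engine Result Display
Search results are usually presented in the following main forms:
1. Blue link + title + description: this is the most common form. After searching, the user sees a list of links, each accompanied by the page's title and a short description, which lets the user do a first pass of filtering over the results.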
Example Domain
This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission.
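An entry like the one above can be assembled from a page's title, URL, and description. The sketch below formats those three fields as plain text and shortens long descriptions. The class name ResultEntrySketch, the method names, and the length limit passed in by the caller are all hypothetical; the article does not say how any particular search engine builds its snippets.

```java
// Illustrative formatting of a single "link + title + description" entry.
public class ResultEntrySketch {

    /** Shortens text to at most maxLength characters (maxLength >= 0) and appends "..." if cut. */
    public static String truncate(String text, int maxLength) {
        if (text.length() <= maxLength) {
            return text;
        }
        return text.substring(0, maxLength).trim() + "...";
    }

    /** Builds a three-line entry: title, URL, then the (possibly shortened) description. */
    public static String format(String title, String url, String description,
                                int maxDescriptionLength) {
        return title + "\n" + url + "\n" + truncate(description, maxDescriptionLength);
    }

    public static void main(String[] args) {
        System.out.println(format(
                "Example Domain",
                "https://example.com/",
                "This domain is established to be used for illustrative examples in documents.",
                40));
    }
}
```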
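2. Image-and-text results: for certain searches, the search engine also shows rich results that include images, videos, news, and similar content. This form is more visual and easier for users to take in.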
This is an example description of the linked page.
Original article by EBTB. If you repost it, please credit the source: https://www.506064.com/n/143100.html