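I. How search engines display results: the basic principles
The main job of a search engine is to take the keywords a user enters, pick out the most relevant pages from a very large number of web pages, and show them on the results page. The process can be divided into three basic steps:
1. Crawling: the search engine uses crawler programs (also called spiders) to fetch web pages from the internet.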
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

// Note: SpiderLeg is not shown in this article. It is expected to provide
// crawl(String), searchForWord(String) and getLinks().
public class Spider {
    private static final int MAX_PAGES_TO_SEARCH = 10;
    private final Set<String> pagesVisited = new HashSet<>();
    private final List<String> pagesToVisit = new LinkedList<>();

    /**
     * Our main launching point for the Spider's functionality. Internally it creates spider legs
     * that make an HTTP request and parse the response (the web page).
     *
     * @param url
     *            - The starting point of the spider
     * @param searchWord
     *            - The word or string that you are searching for
     */
    public void search(String url, String searchWord) {
        this.pagesToVisit.add(url);
        while (this.pagesVisited.size() < MAX_PAGES_TO_SEARCH) {
            String currentUrl = this.nextUrl();
            if (currentUrl == null) {
                // No unvisited URLs are left, so stop instead of looping forever
                break;
            }
            SpiderLeg leg = new SpiderLeg();
            // Lots of stuff happening here. Look at the crawl method in SpiderLeg
            leg.crawl(currentUrl);
            boolean success = leg.searchForWord(searchWord);
            if (success) {
                System.out.println(String.format("**Success** Word %s found at %s", searchWord, currentUrl));
                break;
            }
            this.pagesToVisit.addAll(leg.getLinks());
        }
        System.out.println(String.format("**Done** Visited %s web page(s)", this.pagesVisited.size()));
    }

    /**
     * Returns the next URL to visit (in the order that they were found) and marks it as visited.
     * URLs that have already been visited are skipped.
     *
     * @return the next unvisited URL, or null if none is left
     */
    private String nextUrl() {
        while (!this.pagesToVisit.isEmpty()) {
            String nextUrl = this.pagesToVisit.remove(0);
            // Set.add returns false when the URL was already visited
            if (this.pagesVisited.add(nextUrl)) {
                return nextUrl;
            }
        }
        return null;
    }
}
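2. Indexing: the search engine stores the pages fetched by the crawler in its index and processes the main content of each page. The purpose of indexing is to find the pages that contain a keyword quickly and accurately. This usually relies on an inverted index (each word maps to the pages that contain it) together with term weights such as TF-IDF.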
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.nodes.Document; // assumption: Document is the jsoup type

// Note: WebCrawler is not shown in this article. It is expected to provide
// getDocument(String) and getWordsFromDocument(String).
public class Indexer {
    private final WebCrawler spider;
    // word -> URL on which the word occurs most often (despite the field name)
    private final Map<String, String> frequencyToUrlMap;
    // word -> (URL -> number of occurrences of the word on that page)
    private final Map<String, Map<String, Integer>> wordUrlsMap;

    public Indexer(WebCrawler spider) {
        this.spider = spider;
        frequencyToUrlMap = new HashMap<>();
        wordUrlsMap = new HashMap<>();
    }

    /**
     * Index a page by its URL
     *
     * @param url
     *            - The URL of the page to be indexed
     */
    public void indexPage(String url) {
        System.out.println("Indexing " + url);
        Document document = spider.getDocument(url);
        String text = document.text();
        List<String> words = spider.getWordsFromDocument(text);

        // Count how often each word occurs on this page
        for (String word : words) {
            wordUrlsMap.computeIfAbsent(word, k -> new HashMap<>()).merge(url, 1, Integer::sum);
        }

        // For every indexed word, record the URL with the highest count
        for (Map.Entry<String, Map<String, Integer>> entry : wordUrlsMap.entrySet()) {
            String word = entry.getKey();
            String bestUrl = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> urlEntry : entry.getValue().entrySet()) {
                if (urlEntry.getValue() > bestCount) {
                    bestCount = urlEntry.getValue();
                    bestUrl = urlEntry.getKey();
                }
            }
            // TODO: keep all URLs sorted by frequency instead of only the top one
            frequencyToUrlMap.put(word, bestUrl);
            System.out.println("indexing " + word + ", " + bestUrl);
        }
    }
}
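The indexer above only counts raw word occurrences. As a minimal sketch of the TF-IDF weighting mentioned in step 2, the class below computes one common variant: term frequency divided by page length, multiplied by the natural logarithm of the total page count over the number of pages containing the term. The class name `TfIdfSketch`, its method names and the sample pages are hypothetical and exist only for illustration. Real search engines use more elaborate weighting.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {

    /** Term frequency: occurrences of the term on the page divided by the page's word count. */
    public static double tf(String term, List<String> pageWords) {
        if (pageWords.isEmpty()) {
            return 0.0;
        }
        long count = pageWords.stream().filter(term::equals).count();
        return (double) count / pageWords.size();
    }

    /** Inverse document frequency: ln(total pages / pages containing the term), 0 if no page has it. */
    public static double idf(String term, Map<String, List<String>> pages) {
        long containing = pages.values().stream()
                .filter(words -> words.contains(term))
                .count();
        if (containing == 0) {
            return 0.0;
        }
        return Math.log((double) pages.size() / containing);
    }

    /** TF-IDF weight of a term for the page stored under the given URL. */
    public static double tfIdf(String term, String url, Map<String, List<String>> pages) {
        return tf(term, pages.get(url)) * idf(term, pages);
    }

    public static void main(String[] args) {
        // Hypothetical, already tokenized pages (URL -> words)
        Map<String, List<String>> pages = new HashMap<>();
        pages.put("https://example.com/a", Arrays.asList("search", "engine", "index", "search"));
        pages.put("https://example.com/b", Arrays.asList("search", "ranking"));
        pages.put("https://example.com/c", Arrays.asList("cooking", "recipes"));

        for (String url : pages.keySet()) {
            System.out.printf("%s -> %.4f%n", url, tfIdf("index", url, pages));
        }
    }
}
```

A word that appears on every page gets an IDF of 0 in this variant, so very common words contribute nothing to a page's weight.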
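3. Displaying results: after the user enters a keyword, the search engine shows the pages that contain it on the results page. It also ranks these pages so that the most relevant ones appear first.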
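The ranking in step 3 can be pictured as sorting pages by a relevance score. The sketch below assumes each page already has a numeric score (for example a TF-IDF weight) and returns the URLs from highest to lowest score, with ties ordered alphabetically by URL. `RankingSketch`, `rank` and the sample scores are hypothetical names and values used only for illustration. Production ranking combines many more signals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RankingSketch {

    /** Returns up to `limit` URLs ordered from highest to lowest score; ties are ordered by URL. */
    public static List<String> rank(Map<String, Double> scores, int limit) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue()); // descending score
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });

        List<String> ranked = new ArrayList<>();
        for (Map.Entry<String, Double> entry : entries) {
            if (ranked.size() >= limit) {
                break;
            }
            ranked.add(entry.getKey());
        }
        return ranked;
    }

    public static void main(String[] args) {
        // Hypothetical relevance scores per URL
        Map<String, Double> scores = new HashMap<>();
        scores.put("https://example.com/a", 0.27);
        scores.put("https://example.com/b", 0.0);
        scores.put("https://example.com/c", 0.9);

        // Prints the two best-scoring URLs, best first
        System.out.println(rank(scores, 2));
    }
}
```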
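II. The main forms of search engine results
Search engine results usually take the following main forms:
1. Blue link + title + description: this is the most common form. After searching, the user sees a list of links, each accompanied by the page's title and a short description, which lets the user do a first round of filtering.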
Example Domain
This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission.
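2. Results with images and text: for certain searches, the search engine also shows results that include images, videos, news and similar content. This form is more visual and easier for users to understand.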
This is an example description of the linked page.
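As a rough illustration of the link + title + description form, the sketch below assembles one result entry as plain text and shortens an overlong description. The class name `ResultEntrySketch`, its methods and the 160-character limit are assumptions made for this example. They do not describe how any particular search engine builds its snippets.

```java
public class ResultEntrySketch {
    // Assumed display limit for the description; chosen only for illustration
    static final int MAX_DESCRIPTION_LENGTH = 160;

    /** Cuts text to at most maxLength characters (maxLength >= 3), ending with "..." when shortened. */
    public static String shorten(String text, int maxLength) {
        if (text.length() <= maxLength) {
            return text;
        }
        return text.substring(0, maxLength - 3) + "...";
    }

    /** Builds a three-line entry: title, URL, then the (possibly shortened) description. */
    public static String format(String title, String url, String description) {
        return title + "\n" + url + "\n" + shorten(description, MAX_DESCRIPTION_LENGTH);
    }

    public static void main(String[] args) {
        System.out.println(format(
                "Example Domain",
                "https://example.com/",
                "This domain is established to be used for illustrative examples in documents."));
    }
}
```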
Original article by EBTB. If you repost it, please credit the source: https://www.506064.com/n/143100.html