Crawl the Web

Today everyone uses search engines like Google, Yahoo, and Bing. But how does a search engine know where the data is on the web? This is where the web crawler comes in.

In this post I will not go into how a search engine works as a whole, but will keep this talk to web crawlers: what they are, how search engines use them, and a simple web crawler written in Java.

Web Crawler

A web crawler is a computer program which browses the web in an orderly, defined manner to gather information, which is then stored in a database. It goes by many names: web spider, bot, web robot, web scutter. Google's crawler is called Googlebot; more insight on it can be found at http://support.google.com/webmasters/bin/answer.py?hl=en&answer=182072.

How does it work?

The crawler starts with a list of URLs we give it to browse. On each page it tries to find links, and any links it finds are added to a queue of pages to visit. It also gathers the data from each web page, downloads it, and stores it in the database if required. Using the information gathered by the crawler, the search engine determines what each site is about, then indexes that information and stores it in the search engine's database.
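To make the queue-and-visit cycle concrete, here is a minimal sketch of that loop as a breadth-first traversal. The CrawlLoop class, the maxPages cutoff, and the fetchPage helper are my own illustrative additions; the link extraction reuses the extractLinks method from the WebCrawler class shown later in this post.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class CrawlLoop {

    /*
     * Breadth-first crawl: take a URL off the queue, download the page,
     * extract its links, and queue every link we have not seen before.
     * Stops after maxPages pages so it does not run forever.
     */
    public static void crawl(String startUrl, int maxPages) {
        Queue<String> queue = new ArrayDeque<String>();
        Set<String> visited = new HashSet<String>();
        queue.add(startUrl);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already crawled this page
            }
            try {
                String page = fetchPage(url);
                // A real crawler would store the page in a database here.
                for (String link : WebCrawler.extractLinks(page.toLowerCase().replaceAll("\\s", " "), url)) {
                    if (!visited.contains(link)) {
                        queue.add(link);
                    }
                }
            } catch (Exception e) {
                // Skip pages that fail to download and keep crawling the rest.
            }
        }
        System.out.println("Crawled " + visited.size() + " pages.");
    }

    /*
     * Download the whole page into a String, character by character.
     */
    private static String fetchPage(String url) throws Exception {
        BufferedReader br = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
        StringBuilder sb = new StringBuilder();
        for (int ch = br.read(); ch != -1; ch = br.read()) {
            sb.append((char) ch);
        }
        br.close();
        return sb.toString();
    }

    public static void main(String[] args) {
        crawl("https://shaileshgobburker.wordpress.com/", 10);
    }
}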

So we can say a search engine is made of a web crawler, a search algorithm, a database, and a system which binds them all together. The crawler together with the search algorithm forms the most important part of a search engine, as it defines which links to search, to what depth to follow links and gather data, when to stop searching and when to start searching again, and, if any data has changed, how to update it in the database.
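These decisions often come down to a handful of tunable parameters. As a hypothetical illustration (the names and defaults below are made up for this example, not taken from any real crawler):

/*
 * Hypothetical crawl-policy knobs; names and defaults are illustrative only.
 */
public class CrawlPolicy {
    int maxDepth = 3;                     // how far to follow links from the seed URLs
    int maxPagesPerSite = 1000;           // when to stop searching a single site
    long revisitIntervalMs = 86400000L;   // when to start searching again (re-crawl after a day)
}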

Now let's get into writing a simple web crawler; more information about web crawlers and search techniques can be found on Wikipedia.

The simple web crawler below, written in Java, does two things:

1) It visits the URLs, finds the links on each page, and adds them to the queue. It also keeps a count of the number of links found.

2) It visits the URLs in the queue and searches for a given word on those pages, keeping a count of the number of times the word was found.

/**
 * @author Shailesh Gobburker
 * @year 2012
 */
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.StringTokenizer;

/*
 * Simple web crawler doing 2 things:
 * 1) Find the links on the page, extract them, and add them to the queue.
 * 2) Find the word in all web pages it has extracted.
 */
public class WebCrawler {

    private static int linksCount = 0;
    private static int countOfWord = 0;

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            String mainUrl = "https://shaileshgobburker.wordpress.com/";
            URL url = new URL(mainUrl);
            BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
            StringWriter pageData = new StringWriter();
            /*
             * Extracting the page data of the mainUrl, one character at a time.
             */
            for (int charIndex = br.read(); charIndex != -1; charIndex = br.read()) {
                pageData.write(charIndex);
            }
            br.close();
            /*
             * Extracting the links from the page. All whitespace is collapsed
             * to single spaces so the tag parsing below stays simple.
             */
            ArrayList<String> urlList = extractLinks(pageData.toString().toLowerCase().replaceAll("\\s", " "), mainUrl);
            System.out.println("Total Links Count is " + linksCount);
            /*
             * Searching for the word "Shailesh" in the pages of the urlList.
             */
            searchForWord(urlList, "SHAILESH");
            System.out.println("Number of times the word was found: " + countOfWord);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /*
     * @param page - The page from which we need to extract the links.
     * @param mainUrl - The main URL whose page is being scanned for links.
     * @return ArrayList<String> (links) - Contains the list of extracted URLs.
     *
     * This method finds the links in the web page, i.e. where <a href tags are
     * present, extracts them, and adds them to the list.
     * Sometimes an extracted link is a relative link rather than a complete one;
     * in that case the main URL is prepended to it.
     * Example: suppose the main page URL is http://www.google.com and we find a
     * link on the page such as <a href="/index/home.html">; then we prepend the
     * main part and it becomes http://www.google.com/index/home.html.
     */
    public static ArrayList<String> extractLinks(String page, String mainUrl) {
        int index = 0;
        ArrayList<String> links = new ArrayList<String>();

        while ((index = page.indexOf("<a ", index)) != -1) {
            if ((index = page.indexOf("href", index)) == -1) break;
            if ((index = page.indexOf("=", index)) == -1) break;

            String remainingPage = page.substring(++index);
            // Space is included in the delimiters so href = "url" (with spaces) still parses.
            StringTokenizer st = new StringTokenizer(remainingPage, " \t\n\r\"'>#");
            if (st.hasMoreTokens()) {
                String link = st.nextToken();
                if (link.charAt(0) == '/' || link.contains("http")) {
                    /*
                     * If the extracted link is a relative link (starts with '/'),
                     * prepend the main URL, avoiding a doubled slash when the
                     * main URL already ends with '/'.
                     */
                    if (link.charAt(0) == '/') {
                        link = mainUrl.endsWith("/") ? mainUrl + link.substring(1) : mainUrl + link;
                    }
                    if (!links.contains(link)) {
                        linksCount++;
                        System.out.println("Links are --------" + link);
                        links.add(link);
                    }
                }
            }
        }
        return links;
    }

    /*
     * @param urlList - List containing the URLs where the word needs to be searched.
     * @param word - The word which needs to be searched for.
     *
     * searchForWord goes to each page, reads its data line by line, and searches
     * for the word. If found, it prints the line where it was found and
     * increments the count.
     */
    public static void searchForWord(ArrayList<String> urlList, String word) throws Exception {
        try {
            if (urlList != null) {
                Iterator<String> it = urlList.iterator();
                while (it.hasNext()) {
                    String mainUrl = it.next();
                    URL url = new URL(mainUrl);
                    BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
                    String strTemp;
                    while (null != (strTemp = br.readLine())) {
                        if (strTemp.toUpperCase().contains(word)) {
                            System.out.println(strTemp + "\n");
                            countOfWord++;
                        }
                    }
                    br.close();
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
            throw new Exception(e);
        }
    }

}
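One closing note on the link-resolution step in extractLinks: prepending mainUrl is enough for this example, but it breaks down for relative links without a leading slash (such as href="home.html") or when the page sits below a path. java.net.URL can resolve a relative link against the URL of the page it was found on, so a more robust version of that step might look like the following minimal sketch (the LinkResolver class is my own illustration, not part of the crawler above).

import java.net.URL;

public class LinkResolver {

    /*
     * Resolve a possibly-relative link against the URL of the page it was
     * found on, using java.net.URL's two-argument constructor.
     */
    public static String resolve(String pageUrl, String link) throws Exception {
        URL base = new URL(pageUrl);
        return new URL(base, link).toString();
    }

    public static void main(String[] args) throws Exception {
        // A root-relative link resolves against the site root:
        // prints http://www.google.com/index/home.html
        System.out.println(resolve("http://www.google.com/", "/index/home.html"));
        // A plain relative link resolves against the current directory:
        // also prints http://www.google.com/index/home.html
        System.out.println(resolve("http://www.google.com/index/about.html", "home.html"));
    }
}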