Design a Web Crawler


JADM_Volume 2_Issue 1_Pages 25   – Page rank algo.

-31Parallel_crawler_architecture_and_web_page_change_- How to detect change in web page change in content.

The internet is basically a directed graph – nodes are website and link are edges.

To take note of 
– We need to respect the robot.txt on the website.
– We have to also limit the frequency of crawling especially for small website.
– We can’t crawl all the pages. We can get this information from robot.txt.
–  Use machine learning to predict which websites are most likely to have frequent update, put those into the fetchers’ priority queue.
–  DNS resolving: This minor operation, when crawling on a large scale adds up to a big bottleneck. If not tackled, your system might spend more time on waiting to resolve domain names than on fetching and parsing. It is advisable to maintain local caches to avoid repeated requests.

We will have two main entity – crawler and parser.
– Having a few websites in your crawler’s most frequent list(PQ) to crawl, such as some authoritative news website, etc. We start the crawler with some seed data where each url has the same credit score to begin with. All these url’s are maintained in a PQ.
– It deques a url from priority queue(PQ)(called frontier).
( We need to check for duplicate that the same url should not be added to the frontier and should not parsed again very frequently. Consider the each top seeder adds a link to the page in the seeder. In this case we don’t want to parse a page that was parsed recently. Although AOPIC algo ensures that the url that was visited is not dequeed from the queue again as it’s priority will be less. Google created a cityhash for this hashing purpose).
– It extracts all the url from the current page(not indexing any other data now).To rank the url(which is basically called passing the juice on to other pages in SEO) we use the AOPIC algorithm. We also maintain a count of backlink to each url(or better to the website – this helps us determine it’s page rank in may be a separate hashmap).We add the urls extracted back to the frontier.
So we extracted all the url. Then using AOPIC algo, we give credit to those and deduct the credit from the main url from which all the urls were extracted.  We need to add these urls back to the PQ. It’s possible that some of these urls will already be present in the PQ with some other score. We will have to update there score in PQ. We can have a HashMap which keeps track of the element in the PQ with current score and when those are updated we again update those in PQ and Map. We maintain a copy in Map as we don’t want to do a lookup in the PQ.

– The url that was crawled is added to a different DS, may be min heap. This is so that in future we need to again parse the same url for update. So this DS will be sorted according to the time(we need to set a time when we need to visit a particular url again – either it can be based on some ML algo or predictive algo or we can have our own policy – the popular the website the more sooner it is visited). This additional DS also allow us to mark the website for which there webmaster have submitted there sitemap. The url from this DS is added to the frontier when the time for queuing them is detected by some monitoring system.
– The frontier dequeue item based on the credit of the url.The crawler add downloaded page to the disk( so that a parser can parse it for keywords and updates offline). These keyword can be index in elastic or solr search for searching. (We already maintain the count of backlink to a website which helps determine it’s page rank along with keyword relevance – the APOIC algo takes this as the credit score passes from other page to this page).

Why BFS –
This would probably be a lot more efficient if it were rewritten as a modified BFS that put unexplored pages into a worklist and had a bunch of threads that grabbed unexplored pages and processed them. It’s also not a good idea to use recursion when exploring web links. You’ll easily blow out the call stack if you try to do this, since any sufficiently large website will have links that go all over the place. I figured this out from experience when trying to do a DFS on Wikipedia.

Given the current size of the Web, even large search engines cover only a portion of the publicly available part. A 2009 study showed even large-scale search engines index no more than 40-70% of the indexable Web. So don’t think  of crawling everything. Start small.

If a url that was parsed and it now ready to be crawled again we will have to update it’s index. We need to check if the data was updated. It’s inefficient to compare the text of old and new page(saving it is also an overhead)Updating Logic –
1. We keep a hashed snapshot of the old page and not the complete content. Now to check if something has changed we compare the old and new snapshot.
2. Maintain a tree structure of the page. – for each structure maintain a value – calculated using the ascii of the content.Check the pdf for details.
SO  If one uses the AOPIC algorithm for selecting pages, then it’s fairly easy to avoid bot-traps of the infinite loop kind. Here is a summary of how AOPIC works:
1. Get a set of N seed pages.
2. Allocate X amount of credit to each page, such that each page has X/N credit (i.e. equal amount of credit) before crawling has started.
3. Select a page P, where the P has the highest amount of credit (or if all pages have the same amount of credit, then crawl a random page).
4. Crawl page P (let’s say that P had 100 credits when it was crawled).
5. Extract all the links from page P (let’s say there are 10 of them).
6. Set the credits of P to 0.
7. Take a 10% “tax” and allocate it to a Lambda page.
8. Allocate an equal amount of credits each link found on page P from P’s original credit – the tax: so  (100 (P credits) – 10 (10% tax))/10 (links) = 9 credits per each link.
9. Repeat from step 3.
Since the Lambda page continuously collects tax, eventually it will be the page with the largest amount of credit and we’ll have to “crawl” it. I say “crawl” in quotes, because we don’t actually make an HTTP request for the Lambda page, we just take its credits and distribute them equally to all of the pages in our database.
It maintains the URLs in the frontier and regurgitates them in some order whenever a crawler thread seeks a URL.
Two important considerations govern the order in which URLs are returned by the frontier. First, high-quality pages that change frequently should be prioritized for frequent crawling. Thus, the priority of a page should be a function of both its change rate and its quality (using some reasonable quality estimate). The combination is necessary because a large number of spam pages change completely on every fetch.
Since bot traps only give internal links credits and they rarely get credit from the outside, they will continually leak credits (from taxation) to the Lambda page. The Lambda page will distribute that credits out to all of the pages in the database evenly and upon each cycle the bot trap page will lose more and more credits, until it has so little credits that it almost never gets crawled again. This will not happen with good pages, because they often get credits from back-links found on other pages. This also results in a dynamic page rank and what you will notice is that any time you take a snapshot of your database, order the pages by the amount of credits they have, then they will most likely be ordered roughly according to their true page rank.
The second consideration is politeness: we must avoid repeated fetch requests to a host within a short time span. The likelihood of this is exacerbated because of a form of locality of reference: many URLs link to other URLs at the same host. As a result, a URL frontier implemented as a simple priority queue might result in a burst of fetch requests to a host. This might occur even if we were to constrain the crawler so that at most one thread could fetch from any single host at any time. A common heuristic is to insert a gap between successive fetch requests to a host that is an order of magnitude larger than the time taken for the most recent fetch from that host.

Priority Queue Dequeue Logic –

1. Count the back link – this will decide the page rank also. Combine this with  AOPIC algorithm.
For example – We may carry out a scrape and decide that all PDF file links are more important than DOCx files and thus get a higher priority. We may decide that top level domain links are more important than second level domain links .. e.g.:
Reference –
How to update the score for a page for which the link has been removed. This will hamper our page rank(although the crawler remains unaffected).


Leave a Reply

Please log in using one of these methods to post your comment: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s