How to write a simple web crawler in PHP

These days more and more websites use information from other sites to populate their content.
One of the best ways to do this is with a web crawler. A web crawler is a script that browses thousands of pages automatically, parses out the information you need, and puts it into your DB.

Here is an easy way to write a simple web crawler in PHP.

Step 1. You will need cURL. I do not recommend using functions such as file_get_contents (or helpers like file_get_html from the Simple HTML DOM library). Your crawler will probably have to query thousands of pages, and the connection is the bottleneck here. I've run several tests, and cURL works significantly faster.
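
If you want to check the difference on your own setup, here is a rough micro-benchmark sketch. The URL is just a placeholder, and a single fetch is noisy – real numbers depend on the site and your connection:

$url = 'http://example.com/';

$start = microtime(true);
file_get_contents($url);
printf("file_get_contents: %.3fs\n", microtime(true) - $start);

$start = microtime(true);
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_exec($curl);
curl_close($curl);
printf("cURL: %.3fs\n", microtime(true) - $start);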

Step 2. You will need a list of pages to query. Very often, if you need to scrape information from one website, you will need to write two crawlers: one that gathers all the links you need, and another that goes through those links to fetch and parse the information. One of the best ways to get the list of links is to look at the sitemap. Sitemaps have two huge advantages:

  1. They are usually located at http://yourwebsite/sitemap.xml – so they are easy to find.
  2. They are in XML format – and XML is very easy to parse with, for example, PHP's built-in SimpleXML extension.

Let's use an example to make things simpler. Say I want to get all the authors who have ever posted something on TechCrunch. My first destination is http://techcrunch.com/sitemap.xml. As mentioned above, XML is really easy to parse, so now we have the list of all the pages on TechCrunch.
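
For instance, here is a minimal sketch of how you might pull every URL out of that sitemap with SimpleXML. It assumes the sitemap is a flat <urlset> of <url>/<loc> entries – big sites sometimes serve a sitemap index that points to sub-sitemaps instead:

$xml = simplexml_load_file('http://techcrunch.com/sitemap.xml'); // requires allow_url_fopen
$links = array();
foreach ($xml->url as $page) {
	$links[] = (string) $page->loc; // the page URL
}
echo count($links)." links collected\n";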

Step 3. You need to write a function that returns the cURL output. Luckily, I have done that for you:


function getUrl($url) {
	$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml, text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"; 
	$header[] = "Cache-Control: max-age=0"; 
	$header[] = "Connection: keep-alive"; 
	$header[] = "Keep-Alive: 300"; 
	$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"; 
	$header[] = "Accept-Language: en-us,en;q=0.5"; 

	$curl = curl_init(); // initialize the cURL handle
	curl_setopt($curl, CURLOPT_URL, $url); 
	curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Ubuntu/10.04 Chromium/6.0.472.53 Chrome/6.0.472.53 Safari/534.3'); 
	curl_setopt($curl, CURLOPT_HTTPHEADER, $header); 
	curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate'); 
	curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // very important: without this the content is printed instead of being returned as a string
	$html = curl_exec($curl); // execute the request
	curl_close($curl); // free the handle
	return $html;
}
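
A quick sanity check before wiring it into the crawler – fetch one page and make sure you get HTML back:

$html = getUrl('http://techcrunch.com/');
echo strlen($html)." bytes fetched\n";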

NOTE: more often than not, people want to stay invisible while crawling. That's understandable – nobody wants their content scraped without permission. If you want to be in 'stealth mode', you need to send headers that make your script look like a real browser.
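
For example, inside getUrl() you could rotate between a few real browser User-Agent strings and send a plausible Referer before calling curl_exec(). This is just a sketch, and the values are illustrative:

$userAgents = array(
	'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.186 Safari/535.1',
	'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
);
curl_setopt($curl, CURLOPT_USERAGENT, $userAgents[array_rand($userAgents)]); // pick one at random
curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com/'); // look like a visitor coming from search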

Step 4. Now you need to go through all the pages and get the authors.
Let's say all the links from Step 2 are saved in a $links array. Now we do this:


foreach($links as $url) {
	$html = getUrl($url); // the function from Step 3
	$author = getAuthor($html); // getAuthor is the function that parses the HTML and returns the name of the author.
	addAuthorToDB($author); // put it to your DataBase
	sleep(1); // one second break
	echo $author."\n"; // it's good to see output while the script is running
}

Many developers make the mistake of running the script from a browser. There are several reasons not to do that, the first being that your browser will almost certainly time out. To avoid that, run your PHP script from the command line (e.g. php crawler.php). If you use my example above, you can enjoy the process by watching another author's name appear on a new line every second.

That's it! All that's left is to write the two functions: getAuthor($html), which parses the HTML and returns the author's name (I will show you how to do that in one of my next posts), and addAuthorToDB($author), which is a simple DB insert. You can put whatever you want there instead. The basic rule is that you don't want to process the data coming from the crawler immediately – save it to your DB first.
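
Until that post is up, here are minimal sketches of what the two helpers might look like. Both are illustrative only: getAuthor() assumes the author name sits in a rel="author" link (which any given site may or may not use), and addAuthorToDB() assumes a hypothetical local MySQL table authors(name) reachable over PDO.

function getAuthor($html) {
	$doc = new DOMDocument();
	@$doc->loadHTML($html); // suppress warnings from messy real-world markup
	$xpath = new DOMXPath($doc);
	$nodes = $xpath->query('//a[@rel="author"]'); // assumed selector - adjust to the real markup
	return $nodes->length ? trim($nodes->item(0)->textContent) : '';
}

function addAuthorToDB($author) {
	static $pdo = null; // reuse one connection across calls
	if ($pdo === null) {
		$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'password'); // hypothetical credentials
	}
	$stmt = $pdo->prepare('INSERT INTO authors (name) VALUES (?)');
	$stmt->execute(array($author));
}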

Comments? Questions? Please post them in the comments section below.
