
Web scraping in PHP using cURL


What is web scraping? How do you do it in PHP? Why do we need it? Which method is suitable for web scraping?

Are all these questions on your mind? I'm here to explain. Web scraping is a technique for grabbing elements of a web page on the basis of the DOM or XPath. Many search engines, such as Google, use it for indexing purposes.
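
To make the difference between the DOM and XPath approaches concrete, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes on a small inline HTML string. The sample markup, the class name product, and the function names grabByDom and grabByXpath are made up for illustration.

```php
<?php
// Hypothetical sample markup; a real scraper would fetch this from a web page.
$page = '<html><body>'
      . '<div class="product">Blue shirt</div>'
      . '<div class="product">Red hat</div>'
      . '<div class="footer">Contact us</div>'
      . '</body></html>';

// DOM approach: walk the elements by tag name and filter by attribute.
function grabByDom($source, $class) {
    $dom = new DOMDocument();
    $dom->loadHTML($source);
    $found = array();
    foreach ($dom->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('class') === $class) {
            $found[] = $div->textContent;
        }
    }
    return $found;
}

// XPath approach: describe the wanted elements with one path expression.
function grabByXpath($source, $class) {
    $dom = new DOMDocument();
    $dom->loadHTML($source);
    $xpath = new DOMXPath($dom);
    $found = array();
    foreach ($xpath->query("//div[@class='" . $class . "']") as $div) {
        $found[] = $div->textContent;
    }
    return $found;
}

print_r(grabByDom($page, 'product'));
print_r(grabByXpath($page, 'product'));
```

Both functions should return the same two product names; the XPath version simply moves the filtering into the expression.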

We need it in some circumstances, for example to index the contents of a page or website, and in other cases to grab products or content from other websites for use on our own website or for marketing purposes.

Strictly speaking, scraping a website without permission can violate its terms of use or copyright, so you should obtain permission from the owner of the website, even though scraping has become a common practice these days. Google, by contrast, indexes our sites after we submit them to its indexing service.

In PHP, web scraping is done using cURL or the Simple HTML DOM script.
Below is an example script that shows how to do it in PHP using simple_html_dom.php.
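<?php
    include('simple_html_dom.php');

    // Print every <div class="classname"> on the page, followed by its child elements.
    function webelements($webpage) {
        $html = new simple_html_dom();
        $html->load_file($webpage);

        $elements = $html->find('div[class=classname]');
        foreach ($elements as $grab) {
            echo $grab;                              // outer HTML of the matched element
            foreach ($grab->children() as $child) {
                echo $child;                         // outer HTML of each direct child
            }
        }

        // Free the memory held by the parser.
        $html->clear();
        unset($html);
    }

    webelements('http://www.example.com');
?>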

Note that you must include the simple_html_dom.php script in your code, so download it first and place it next to your own script.

cURL is suitable for web scraping because it supports many protocols and makes it easy to fetch pages and files, especially images and videos, for extraction.
Below is a short example using cURL.
<?php
function example($url)
{
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

    // make the cURL request to $url
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);     // treat HTTP status codes >= 400 as errors
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);     // set the Referer header when redirected
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the page instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 20);           // give up after 20 seconds
    $html = curl_exec($ch);
    if (!$html) {
        echo "<br />cURL error number:" . curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
        exit;
    }

    // parse the page; @ suppresses warnings about malformed markup
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // grab all the matching <td> elements on the page
    $xpath = new DOMXPath($dom);
    $elements = $xpath->evaluate("/html/body//td[@id='idname']");
    foreach ($elements as $grab) {
        echo $dom->saveHTML($grab);   // outer HTML of the element
        echo $grab->nodeValue;        // text content of the element
    }
}
?>
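
Since images are a common scraping target, the following is a minimal sketch of how the HTML returned by curl_exec() could be searched for image addresses. It works on an HTML string, so a hard-coded sample stands in for the downloaded page; the function name imageUrls and the sample markup are assumptions for illustration only.

```php
<?php
// Return the src attribute of every <img> element found in $html.
function imageUrls($html) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // @ hides warnings caused by sloppy real-world markup

    $xpath = new DOMXPath($dom);
    $urls = array();
    foreach ($xpath->query('//img[@src]') as $img) {
        $urls[] = $img->getAttribute('src');
    }
    return $urls;
}

// Hypothetical sample standing in for the string returned by curl_exec().
$sample = '<html><body>'
        . '<img src="http://www.example.com/a.png" alt="A">'
        . '<p>No image here</p>'
        . '<img src="/images/b.jpg" alt="B">'
        . '</body></html>';

foreach (imageUrls($sample) as $url) {
    echo $url . "\n";
}
```

Relative addresses such as /images/b.jpg still have to be combined with the site's base URL before the files can be downloaded.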
That's it.

If you have any questions or queries, please ask them in the comment section and I will answer within 1-2 hours.

Good day.

2 comments:

  1. Hi I am new to PHP and scripting in general. I've worked with JS, PHP, JSP and .Net engineers but primarily as a HTML/CSS coder. I am tasked to research CURL on my own and am wondering how do I apply your code samples. I can see the obvious parts in the code that I should change on my own but you post 2 sections and I am not sure if these are to be 2 different files (or 3) or is all the code combined into the one simple_html_dom.php file? I hope my question makes sense.

    Thanks in advance for any guidance you can provide me in learning cURL scripting with PHP.

    H.M.

  2. Sorry, are you there?
