What is web scraping? How do you do it in PHP? Why do we need it? Which method is suitable for web scraping?
Do you have all these questions in mind? I'm here to explain. Web scraping is a technique for grabbing elements of a web page on the basis of the DOM or XPath. It is used by many search engines, such as Google, for indexing purposes.
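To make "grabbing elements on the basis of XPath" concrete, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. The HTML string, the class name "product" and the variable names are made up for illustration; a real script would first download the page.

```php
<?php
// Hypothetical HTML snippet standing in for a downloaded page.
$page = '<html><body>'
      . '<div class="product"><span>Keyboard</span></div>'
      . '<div class="product"><span>Mouse</span></div>'
      . '<div class="footer">Contact</div>'
      . '</body></html>';

// Build a DOM tree from the markup.
$dom = new DOMDocument();
$dom->loadHTML($page);

// Select every <div> whose class attribute is exactly "product".
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@class="product"]');

// Collect the text inside each matching element.
$names = [];
foreach ($nodes as $node) {
    $names[] = trim($node->textContent);
}

echo implode(', ', $names), "\n";
```

The XPath expression decides which elements are grabbed, so changing it to something like //span would select a different set of nodes from the same page.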
We need it in some circumstances, for example to index the contents of a page or website, and in other cases to grab products or content from other websites for use on our own website or for marketing purposes.
Web scraping can be illegal or against a site's terms of use, so you should obtain permission from the owner of the website. These days, however, it has become a routine operation. Google, for example, will index our sites once we submit them to its indexing service.
In PHP, web scraping is done using cURL or the simple_html_dom script.
Below is an example script that shows how to do it in PHP using simple_html_dom.php.
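<?php
// simple_html_dom.php must be in the same directory as this script.
include('simple_html_dom.php');

// Print every <div class="classname"> found on the given page.
function webelements($webpage) {
    $html = new simple_html_dom();
    $html->load_file($webpage);
    $elements = $html->find('div[class=classname]');
    foreach ($elements as $grab) {
        // Echoing an element prints its outer HTML.
        echo $grab;
        // Print the plain text of each direct child element.
        foreach ($grab->children() as $child) {
            echo $child->plaintext;
        }
    }
    // Free the memory used by the parser.
    $html->clear();
    unset($html);
}

// The URL needs a scheme (http:// or https://) to be fetched.
webelements('http://www.example.com/');
?>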
Note that you must include the simple_html_dom script in your code: download simple_html_dom.php and place it in the same directory as your script.
cURL is suitable for web scraping because it supports many protocols and makes it very easy to fetch page elements, especially images and videos.
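If you want to check which protocols your own cURL installation supports, a small sketch like the following can help. It assumes the PHP cURL extension is enabled; the variable names are just for illustration.

```php
<?php
// curl_version() returns an array describing the installed libcurl build.
$info = curl_version();

// The 'protocols' entry is a list of protocol names such as http or ftp.
$protocols = $info['protocols'];
sort($protocols);

echo 'libcurl ', $info['version'], ' supports: ', implode(', ', $protocols), "\n";
```

The exact list depends on how libcurl was built on your server, so the output will differ from one machine to another.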
Below is a short example using cURL.
<?php
function example($url)
{
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    // make the cURL request to $url
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 20);
    $html = curl_exec($ch);
    // stop if the request failed or returned an empty body
    if ($html === false || $html === '') {
        echo "<br />cURL error number: " . curl_errno($ch);
        echo "<br />cURL error: " . curl_error($ch);
        exit;
    }
    // parse the HTML; @ suppresses warnings caused by malformed markup
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    // grab the matching <td> elements on the page
    $xpath = new DOMXPath($dom);
    $elements = $xpath->evaluate("/html/body//td[@id='idname']");
    foreach ($elements as $grab) {
        // a DOM node cannot be echoed directly, so serialize it first
        echo $dom->saveHTML($grab);
        // nodeValue is a property holding the element's text content
        echo $grab->nodeValue;
    }
}
?>
That's it.
If you have any questions, please ask them in the comment section and I will answer within 1-2 hours.
Good day.
Hi, I am new to PHP and scripting in general. I've worked with JS, PHP, JSP and .Net engineers, but primarily as an HTML/CSS coder. I am tasked with researching cURL on my own and am wondering how to apply your code samples. I can see the obvious parts of the code that I should change myself, but you posted 2 sections and I am not sure whether these are meant to be 2 different files (or 3), or whether all the code is combined into the one simple_html_dom.php file. I hope my question makes sense.
Thanks in advance for any guidance you can provide me in learning cURL scripting with PHP.
H.M.
Sorry, are you there?