Tuesday, 11 June 2013

Writing a PHP/cURL Web Bot Class from scratch

Hey guys, for today's tutorial I'm going to discuss some aspects of building web bots. What we're going to do is create a class that we can instantiate as a web bot capable of making POST and GET requests and returning the output. It will also include several parsing subroutines taken from Michael Schrenk's book 'Webbots, Spiders, and Screen Scrapers'. The bot will let you specify a custom user agent and has proxy support; with this class, writing much more complex scrapers becomes extremely easy. I've posted something similar on Pastebin before, but in this tutorial we'll go through creating it step by step so that it makes a little more sense. I always write bots as CLI scripts, so reading my previous tutorial on the basics of PHP CLI can be a benefit but isn't required.

Creating our class

If you're not familiar with Object Oriented Programming some of these concepts may seem a little weird at first, but just stick with it and it'll make sense. Think of a class as being a custom data type that you define. This isn't exactly true but it helps to visualize it, as you will be creating a variable of the type of your class, which then has access to all of the properties (variables) and methods (functions) that the class holds. This isn't meant to be a tutorial on OOP so I won't go into a lot of detail, but things should be fairly straightforward with some examples.
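To make the "custom data type" idea a bit more concrete, here is a minimal sketch of a class with one property and one method. The Greeter class and everything in it are made up purely for illustration and are not part of the bot.

```php
<?php
// Hypothetical example class; not part of the webBot code.
class Greeter
{
    // A property: a variable that belongs to each instance
    private $name;

    // The constructor runs automatically when "new Greeter(...)" is called
    public function __construct($name = 'world')
    {
        $this->name = $name;
    }

    // A method: a function that can read the instance's properties
    public function greet()
    {
        return "Hello, {$this->name}!";
    }
}

// $g is a variable whose "type" is our class
$g = new Greeter('bot');
echo $g->greet(), "\n";
```

The webBot class follows the same shape: properties declared at the top, a constructor that fills them in, and methods that use them through $this.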

The first thing we'll need to do after declaring our class is to declare our basic properties:



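class webBot
{
    private $agent;       // User-Agent string sent with each request
    private $timeout;     // Timeout for cURL requests, in seconds
    private $cook;        // Path to the cookie file
    private $curl;        // The cURL handle shared by all requests
    private $proxy;       // 'ip:port' string, or null for no proxy
    private $credentials; // 'username:password' for the proxy, or null

    // Specifies if parse includes the delineator
    private $EXCL;
    private $INCL;
    // Specifies if parse returns the text before or after the delineator
    private $BEFORE;
    private $AFTER;
}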


The variables we're declaring here will define the behaviour of our web bot. $agent is the user agent it will present; in this example we will only use one generic user agent, but it could easily be extended to include a list of user agents to switch between. $timeout tells the bot how long, in seconds, to wait on a request before giving up. $cook is the file where our cookies are stored. $curl holds the actual cURL handle. $proxy will either contain an ip:port pair or be null, and it has an additional optional variable, $credentials, in case the proxy requires a username and password. The last four variables act as constants for the parsing subroutines.
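As a rough sketch of the user-agent list idea, a helper could pick one entry at random each time a bot is created. The function name pickUserAgent() and the second agent string are assumptions added for illustration; neither appears in the class as written.

```php
<?php
// Hypothetical helper: returns one user agent chosen at random from a list.
function pickUserAgent(array $agents)
{
    if (count($agents) === 0) {
        throw new InvalidArgumentException('Agent list must not be empty');
    }
    return $agents[array_rand($agents)];
}

// Example list; swap in whichever agents you want to present
$agents = array(
    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
);

$agent = pickUserAgent($agents);
echo "Using agent: $agent\n";
```

Inside the class, the constructor could assign the result to $this->agent instead of the hard-coded string.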

Now that we've defined all of these variables we need to initialize them with values. We do this inside the __construct() method, which is executed automatically any time an instance of the webBot class is created.


public function __construct($proxy=null, $credentials=null)
{
    $this->proxy = $proxy;//set proxy
    $this->credentials = $credentials;//set credentials for proxy (must happen before setupCURL() reads them)
    $this->timeout = 30;//set Timeout for curl requests
    $this->agent = 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16';//set default agent
    $this->cook = 'cookies.txt';//set cookie file
    $this->curl = $this->setupCURL($this->proxy);//initiate curl
    $this->EXCL = true;
    $this->INCL = false;
    $this->BEFORE = true;
    $this->AFTER = false;
}

This is all fairly straightforward. When the bot is instantiated it will be in the format: "$bot = new webBot();". However, webBot() has two optional parameters, so if we were using a proxy without credentials (for example Tor running on the local system) we would write "$bot = new webBot('127.0.0.1:9050');", or for a proxy with credentials: "$bot = new webBot('232.222.111.141:8080', 'username:password');". From then on, any time we use the bot it will route through our proxy. (Tor's port 9050 is a SOCKS port, so for that case setupCURL() would also need CURLOPT_PROXYTYPE set to CURLPROXY_SOCKS5.) We set the timeout to 30 seconds. The user agent in this instance is for an iPhone, but you can put literally anything you want in there; I'd recommend spoofing an actual user agent though. I'm going to save the cookies to cookies.txt, which will be created in the folder our script is run from. Next we see "$this->curl = $this->setupCURL($this->proxy);". setupCURL() is the next method we'll take a look at; it creates a cURL handle for us to use and assigns it to a class property so that it's accessible to the rest of the class. The constructor also assigns the parsing constants and our credentials. Both credentials and proxy default to null if they are not specified. Next let's take a look at the setupCURL() method:

// Sets up the cURL handle with the defaults shared by every request
private function setupCURL($py)
{
    $ck = $this->cook;
    $creds = $this->credentials;
    $ch = curl_init();
    if($py)
    {
        curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
        curl_setopt($ch, CURLOPT_PROXY, $py);
        if($creds)
            curl_setopt($ch, CURLOPT_PROXYUSERPWD, $creds);
        print("Using Proxy: $py\n");
    }
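    curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout); // apply the timeout set in the constructor
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);   // skips certificate checks; convenient but insecure
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);       // return the response instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);       // follow redirects
    curl_setopt($ch, CURLOPT_HEADER, 1);               // include response headers in the output
    curl_setopt($ch, CURLOPT_COOKIEJAR, $ck);          // file cookies are written to
    curl_setopt($ch, CURLOPT_COOKIEFILE, $ck);         // file cookies are read from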

    return $ch;
}
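The proxy branch above assumes an HTTP proxy. As a sketch of how the same options might be assembled for either an HTTP proxy or a SOCKS5 proxy (which is what a local Tor client exposes), the helper below builds an options array that can be handed to curl_setopt_array(). The function name buildProxyOptions() and its $socks5 flag are assumptions for illustration; the CURLOPT_* and CURLPROXY_* constants are the standard ones from PHP's cURL extension.

```php
<?php
// Hypothetical helper: builds the proxy-related cURL options.
// $proxy is 'ip:port' or null, $creds is 'user:pass' or null.
function buildProxyOptions($proxy = null, $creds = null, $socks5 = false)
{
    $opts = array();
    if (!$proxy) {
        return $opts; // no proxy requested
    }
    $opts[CURLOPT_PROXY] = $proxy;
    if ($socks5) {
        // SOCKS5 proxies (e.g. a local Tor client) need the proxy type set
        $opts[CURLOPT_PROXYTYPE] = CURLPROXY_SOCKS5;
    } else {
        // Tunnel through an HTTP proxy, as setupCURL() does
        $opts[CURLOPT_HTTPPROXYTUNNEL] = 1;
    }
    if ($creds) {
        $opts[CURLOPT_PROXYUSERPWD] = $creds;
    }
    return $opts;
}

// Apply the options to a fresh handle (no request is made here)
$ch = curl_init();
curl_setopt_array($ch, buildProxyOptions('127.0.0.1:9050', null, true));
```

Keeping the options in an array also makes it easy to print them while debugging which proxy settings a bot is actually using.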


What we're doing in this method is assigning the default values that will be used in both POST and GET requests. Since these values are set on $this->curl, the methods we create to handle POST and GET requests will both use $this->curl as a base. You can modify these values to fit your specific needs; the defaults I provided should work, but always remember to be creative and hack things up. Inside the method we check if($py): if $py is null this evaluates to false and we skip setting a proxy; otherwise we set the basic proxy settings, then check whether credentials are also set, and if so set those as well. After that we define how we want the bot to behave. Now that we have a base to work with, let's take a look at how we'll set up our GET method. POST will be very similar to GET with a few minor changes, so once we have GET, making POST will be a lot easier.

// Gets the contents of $url and returns the result
public function get_contents($url, $ref='')
{
    if($ref == '')
        $ref = $url;

    $ch = $this->curl;
    $hd = array("Connection: Keep-alive", "Keep-alive: 300", "Expect:", "Referer: $ref", "User-Agent: {$this->agent}");
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, 0);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $hd);
    $x = curl_exec($ch);

    return $x;
}
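Because setupCURL() turns on CURLOPT_HEADER, the string returned by curl_exec() contains the response headers followed by the body. A minimal sketch of separating the two is shown below; splitResponse() is a hypothetical helper, and in a real bot the header size would come from curl_getinfo($ch, CURLINFO_HEADER_SIZE) rather than from strlen() on a hand-written string.

```php
<?php
// Hypothetical helper: splits a raw response into header and body parts.
// $headerSize would normally come from curl_getinfo($ch, CURLINFO_HEADER_SIZE).
function splitResponse($raw, $headerSize)
{
    return array(
        'headers' => substr($raw, 0, $headerSize),
        'body'    => (string) substr($raw, $headerSize),
    );
}

// Stand-in for what curl_exec() returns when CURLOPT_HEADER is on
$headers = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
$raw     = $headers . "<html><body>Hi</body></html>";

$parts = splitResponse($raw, strlen($headers));
echo $parts['body'], "\n";
```

Since $curl is private, the class would need a small public method exposing the header size (or returning the split result itself) for this to be used from outside.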

We see here that the get_contents() method has two parameters, one of which is optional. For obvious reasons we require a URL, but we can also submit a referer to spoof. Some examples using the $bot instance from before would look like: "$bot->get_contents("http://www.google.com");" or "$bot->get_contents("http://www.reddit.com", "http://www.cia.gov");". We define an array of headers to send; this can also be modified to meet your specific needs. The default I've presented includes the referer (if no referer is specified it defaults to the URL submitted, simulating a refresh), the user agent we specified earlier, and a keep-alive. We then tell curl the URL we want to request, set POST to 0 and submit the headers; after that curl_exec() is run with our settings and the output is returned. As you can see, by setting POST to 0 we can run a GET request, so if we set POST to 1 guess what we can do now?

// Posts $pdata to $purl and returns the result
public function post_contents($purl, $pdata, $ref='')
{
    if($ref == '')
        $ref = $purl;

    $ch = $this->curl;
    $hd = array("Connection: Keep-alive", "Keep-alive: 300", "Expect:", "Referer: $ref", "User-Agent: {$this->agent}");
    curl_setopt($ch, CURLOPT_URL, $purl);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $pdata);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $hd);
    $x = curl_exec($ch);
    curl_setopt($ch, CURLOPT_POST, 0);

    return $x;
}
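$pdata is passed straight to CURLOPT_POSTFIELDS, so it should normally be a URL-encoded string (passing an array instead makes cURL send multipart/form-data). One way to build that string from an array of form fields is PHP's http_build_query(); the field names and values below are invented for illustration.

```php
<?php
// Hypothetical login form fields; the real names depend on the target site
$fields = array(
    'username' => 'my user',
    'password' => 'p&ss=word',
);

// Produces a URL-encoded string suitable for CURLOPT_POSTFIELDS.
// The separator is passed explicitly so the result does not depend on php.ini.
$pdata = http_build_query($fields, '', '&');
echo $pdata, "\n";
```

The resulting string could then be handed to the bot as the second argument of post_contents().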

Notice that except for a few settings this is almost identical to the get_contents() method. Here we simply set CURLOPT_POST to 1, pass the POST parameters to CURLOPT_POSTFIELDS as $pdata, and after we execute we reset CURLOPT_POST to 0 just in case our next request is a get_contents() call (the get_contents() method also resets it to 0, so this is slightly redundant). And that's it: right now we have a fully functional web bot capable of making POST and GET requests. Using this technique you can create bots that scrape all manner of websites; you can create accounts on websites, log in to websites, send messages and post content, all by scripting it into a bot. However, scraping the data is seldom enough; usually it's a means to an end, which is processing the data you've scraped. What good is half-automating something? Let's dig a little deeper and include some parsing routines. I stress heavily that I did not write these methods. I did adapt them into an OOP framework so they could be implemented in this class, but the code was written by Michael Schrenk in his book 'Webbots, Spiders, and Screen Scrapers' (No Starch Press) http://webbotsspidersscreenscrapers.com/ which is an awesome resource for people writing bots in PHP.


public function split_string($string, $delineator, $desired, $type)
{
    // Case insensitive parse, convert string and delineator to lower case
    $lc_str = strtolower($string);
    $marker = strtolower($delineator);

    // Return text before the delineator
    if($desired == $this->BEFORE)
    {
        if($type == $this->EXCL) // Return text exclusive of the delineator
            $split_here = strpos($lc_str, $marker);
        else // Return text inclusive of the delineator
            $split_here = strpos($lc_str, $marker)+strlen($marker);

        $parsed_string = substr($string, 0, $split_here);
    }
    // Return text after the delineator
    else
    {
        if($type==$this->EXCL) // Return text exclusive of the delineator
            $split_here = strpos($lc_str, $marker) + strlen($marker);
        else // Return text inclusive of the delineator
            $split_here = strpos($lc_str, $marker) ;

        $parsed_string = substr($string, $split_here, strlen($string));
    }
    return $parsed_string;
}

public function return_between($string, $start, $stop, $type)
{
    $temp = $this->split_string($string, $start, $this->AFTER, $type);
    return $this->split_string($temp, $stop, $this->BEFORE, $type);
}

public function parse_array($string, $beg_tag, $close_tag)
{
    preg_match_all("($beg_tag(.*)$close_tag)siU", $string, $matching_data);
    return $matching_data[0];
}

public function get_attribute($tag, $attribute)
{
    // Use Tidy library to 'clean' input
    $cleaned_html = $this->tidy_html($tag);
    // Remove all line feeds from the string
    $cleaned_html = str_replace("\r", "", $cleaned_html);
    $cleaned_html = str_replace("\n", "", $cleaned_html);
    // Use return_between() to find the properly quoted value for the attribute
    return $this->return_between($cleaned_html, strtoupper($attribute)."=\"", "\"", $this->EXCL);
}

public function remove($string, $open_tag, $close_tag)
{
    // Get array of things that should be removed from the input string
    $remove_array = $this->parse_array($string, $open_tag, $close_tag);

    // Remove every occurrence of each array element from the string;
    // str_replace() accepts the whole array, so no loop is needed
    $string = str_replace($remove_array, "", $string);

    return $string;
}

public function tidy_html($input_string)
{
    // Detect if Tidy is configured
    if( function_exists('tidy_get_release') )
    {
        // Tidy for PHP version 4
        if(substr(phpversion(), 0, 1) == 4)
        {
            tidy_setopt('uppercase-attributes', TRUE);
            tidy_setopt('wrap', 800);
            tidy_parse_string($input_string);
            $cleaned_html = tidy_get_output();
        }
        // Tidy for PHP version 5 and later
        else
        {
            $config = array(
                    'uppercase-attributes' => true,
                    'wrap' => 800);
            $tidy = new tidy;
            $tidy->parseString($input_string, $config, 'utf8');
            $tidy->cleanRepair();
            $cleaned_html = tidy_get_output($tidy);
        }
    }
    else
    {
        // Tidy not configured for this computer
        $cleaned_html = $input_string;
    }
    return $cleaned_html;
}

public function validateURL($url)
{
    $pattern = '/^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-F\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)?$/';
    return preg_match($pattern, $url);
}

I'll do a brief overview of these and highlight some of the more useful ones that you can use:

split_string()
This method takes a string and a delineator and returns either all of the text before the marker or all of the text after it. The third argument selects before (true) or after (false), and the fourth selects whether the marker itself is excluded (true, EXCL) or included (false, INCL).

return_between()
This method accepts a string, two anchors, and a boolean value as parameters. The string is searched for the first occurrence of the opening anchor followed by the closing anchor, and the text between them is returned. The boolean value at the end determines whether the anchors are included in the returned string: a value of true excludes the anchors and a value of false includes them.

parse_array()
Similar to return_between(), parse_array() accepts a string and two anchors as parameters. It returns an array containing every occurrence of text enclosed by those anchors, and it always includes the anchors. Using parse_array() to scrape a page and then looping through the array with return_between() to clean and extract the data can be very effective. I use these two methods more often than any of the other parsing methods.
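The parse_array() plus return_between() pattern described here can be sketched with two stand-alone functions. These are simplified stand-ins written for illustration, not the class methods themselves: the names allBetween() and firstBetween() are made up, and unlike parse_array() the first one escapes its anchors with preg_quote().

```php
<?php
// Stand-alone stand-ins for the class methods, simplified for illustration.

// Returns every occurrence of $open ... $close, anchors included
function allBetween($string, $open, $close)
{
    $pattern = '/' . preg_quote($open, '/') . '.*?' . preg_quote($close, '/') . '/si';
    preg_match_all($pattern, $string, $matches);
    return $matches[0];
}

// Returns the text between the first $open and the next $close, anchors excluded
function firstBetween($string, $open, $close)
{
    $start = stripos($string, $open);
    if ($start === false) {
        return '';
    }
    $start += strlen($open);
    $end = stripos($string, $close, $start);
    if ($end === false) {
        return '';
    }
    return substr($string, $start, $end - $start);
}

// Sample page fragment; note the mixed-case tags
$html = '<ul><li>Apple</li><LI>Banana</LI><li>Cherry</li></ul>';

// Grab each list item, then strip the anchors from it
$items = array();
foreach (allBetween($html, '<li>', '</li>') as $chunk) {
    $items[] = firstBetween($chunk, '<li>', '</li>');
}

print_r($items);
```

With the class itself, the equivalent loop would call $bot->parse_array() on the page and $bot->return_between() with true as the last argument on each element.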

get_attribute()
Accepts a tag in HTML format and the name of an attribute, and attempts to return the value of that attribute.

remove()
remove() accepts three parameters: a string and two anchors. It is similar to parse_array(), except that it removes all occurrences of the anchors and the text between them from the string.
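A common use for this is dropping script blocks before parsing a page. The sketch below shows the idea with a stand-alone function; stripBetween() is a hypothetical name, and it uses preg_replace() directly rather than the parse_array()/str_replace() combination the class uses.

```php
<?php
// Stand-alone sketch of the remove() idea, not the class method itself
function stripBetween($string, $open, $close)
{
    // Non-greedy, case-insensitive match that may span multiple lines
    $pattern = '/' . preg_quote($open, '/') . '.*?' . preg_quote($close, '/') . '/si';
    return preg_replace($pattern, '', $string);
}

$html = '<p>Keep</p><script>alert(1);</script><p>this</p>';
echo stripBetween($html, '<script>', '</script>'), "\n";
```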

tidy_html()
Accepts a single string. This method relies on PHP's Tidy extension being installed; if it is not, it simply returns the string it was given rather than raising an error.

validateURL()
This is the one parsing method I wrote myself. It just uses a regex I googled to validate a URL; pretty self-explanatory.
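If maintaining a long regex is not appealing, PHP's built-in filter_var() with FILTER_VALIDATE_URL is one alternative. The wrapper below is a hypothetical sketch, not a drop-in replacement: unlike the regex above, the built-in filter requires a scheme such as http://, so it would reject a bare "www.example.com".

```php
<?php
// Hypothetical wrapper around PHP's built-in URL filter
function looksLikeUrl($url)
{
    // filter_var() returns the URL on success and false on failure
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

var_dump(looksLikeUrl('http://www.example.com/path?x=1'));
var_dump(looksLikeUrl('not a url'));
```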

With this you should be able to start writing some fairly non-trivial web bots. I'm going to follow up with a tutorial on using the bot to do basic GET requests and a second on how to use it for POST requests (logging into a website). After that, if it's received well, I intend to write a complementary set of tutorials covering the same content in Python. Have fun guys, if you keep readin' em I'll keep writing em!

Adios

Link to complete code: http://privatepaste.com/3e4fac4540

5 comments:

  1. Nice post... something that i look for long.. hope there will be an update to create powerful bots

    1. I have an updated version on github now: https://github.com/Durendal/webBot and plan on making some minor updates to it soon(so that requests return a response object that makes accessing status code, headers, content, etc... easier) Hope its of help :)

    2. Private code link is unavailable ... could you reupload?

    4. The latest version of the code is available at https://github.com/Durendal/webBot I will be including some updates to it in the near future that should make things significantly simpler to use.
