Hey guys, for today's tutorial I'm going to discuss some aspects of building web bots. What we're going to do is create a class that we can instantiate as a web bot capable of making POST and GET requests and returning the output. It will also include several subroutines for parsing the data, taken from Michael Schrenk's book 'Webbots, Spiders, and Screen Scrapers'. The bot will let you specify a custom user agent and has proxy support, and with this class writing much more complex scrapers becomes extremely easy. I've posted something similar in the pastebin before, but in this tutorial we'll go through creating it step by step so that it makes a little bit more sense. I always write bots as CLI scripts, so reading my previous tutorial on the basics of PHP CLI can be a benefit but isn't required.
Creating our class
If you're not familiar with Object Oriented Programming some of these concepts may seem a little weird at first, but just stick with it, it'll make sense. Think of a class as being a custom data type that you define. This isn't exactly true but it helps to visualize it, as you will be creating a variable of the type of your class which then has access to all of the properties (variables) and methods (functions) that the class holds. This isn't meant to be a tutorial on OOP so I won't go into a lot of detail, but things should be fairly straightforward with some examples.
The first thing we'll need to do after declaring our class is to declare our basic properties:
class webBot
{
private $agent;
private $timeout;
private $cook;
private $curl;
private $proxy;
private $credentials;
// Specifies if parse includes the delineator
private $EXCL;
private $INCL;
// Specifies if parse returns the text before or after the delineator
private $BEFORE;
private $AFTER;
}
The variables we're declaring here define the behaviour of our webbot. $agent is the user agent it will present; in this example we only use one generic user agent, but it could easily be extended to a list of user agents to switch between. $timeout tells the bot how long to wait on a request before giving up. $cook is the file where our cookies are stored. $curl holds the actual cURL handle. $proxy will either contain an ip:port pair or be null, and it has an optional companion, $credentials, in case the proxy requires creds. The last four variables act as constants for the parsing subroutines.
Now that we've defined all of these variables we need to initialize them with values. We do this inside the __construct() method, which is automatically executed any time an instance of webBot is created.
public function __construct($proxy=null, $credentials=null)
{
$this->proxy = $proxy;//set proxy
$this->credentials = $credentials;//set credentials for proxy (must happen before setupCURL() reads them)
$this->timeout = 30;//set Timeout for curl requests
$this->agent = 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16';//set default agent
$this->cook = 'cookies.txt';//set cookie file
$this->curl = $this->setupCURL($this->proxy);//initiate curl
$this->EXCL = true;
$this->INCL = false;
$this->BEFORE = true;
$this->AFTER = false;
}
This is all fairly straightforward. When the bot is instantiated it will be in the format: "$bot = new webBot();". However the constructor has two optional parameters, so if we were using a proxy without credentials (an example using Tor through a local system) we'd write "$bot = new webBot('127.0.0.1:9050');", or for a proxy with credentials: "$bot = new webBot('232.222.111.141:8080', 'username:password');". From then on, any time we use the bot it will route through our proxy. (Note that Tor's port 9050 speaks SOCKS rather than HTTP, so for that particular case you would also need to set CURLOPT_PROXYTYPE to CURLPROXY_SOCKS5 in setupCURL().) We set the timeout to 30 seconds. The user agent in this instance is for an iPhone, but you can literally put anything you want in there; I'd recommend spoofing an actual user agent though. I'm going to be saving the cookies to cookies.txt, and this file will be created in the same folder that our script is running in. Next we see "$this->curl = $this->setupCURL($this->proxy);". setupCURL() will be the next method we take a look at; it creates a cURL handle for us to use and we assign it to a class property so that it's accessible to the rest of the class. We also see the assignments for the parsing routines and our credentials. Both credentials and proxy will default to null if they are not specified. Next let's take a look at the setupCURL() method:
// Sets up cURL, enables keep-alive, spoofs referer
private function setupCURL($py)
{
$ck = $this->cook;
$creds = $this->credentials;
$ch = curl_init();
if($py)
{
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_PROXY, $py);
if($creds)
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $creds);
print("Using Proxy: $py\n");
}
curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout); // apply the timeout set in the constructor
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ck);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ck);
return $ch;
}
What we're doing in this method is assigning the default values that will be used in both POST and GET requests. Since these values are set on $this->curl, the methods we create to handle POST and GET requests will both use $this->curl as a base. As a result you can modify these values to fit your specific needs; the default values I provided should work, but always remember to be creative and hack shit up. Inside of it we check if($py). If $py is null this evaluates to false and we skip setting a proxy; otherwise we set the basic proxy settings and then check whether credentials are set as well, and if so we set those too. After that we define how we want the bot to behave. Now that we have a base to work with, let's take a look at how we'll set up our GET method. POST will be very similar to GET with a few minor changes, so once we have GET, making POST will be a lot easier.
// Gets the contents of $url and returns the result
public function get_contents($url, $ref='')
{
if($ref == '')
$ref = $url;
$ch = $this->curl;
$hd = array("Connection: Keep-alive", "Keep-alive: 300", "Expect:", "Referer: $ref", "User-Agent: {$this->agent}");
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 0);
curl_setopt($ch, CURLOPT_HTTPHEADER, $hd);
$x = curl_exec($ch);
return $x;
}
We see here that the get_contents() method has two parameters, one of which is optional. For obvious reasons we require a URL, but we can also submit a referer to spoof. Some examples using the $bot instance from before would look like: "$bot->get_contents("http://www.google.com");" or "$bot->get_contents("http://www.reddit.com", "http://www.cia.gov");". We define an array of headers to send, which can also be modified to meet your specific needs. The default I've presented includes the referer (if no referer is specified it defaults to the URL submitted, simulating a refresh), the user agent we specified earlier, and a keep-alive. We then tell curl the URL we want to request, set POST to 0 and submit the headers. After that curl_exec() is run with our settings and the output is returned. As you can see, by setting POST to 0 we can run a GET request, so if we set POST to 1 guess what we can do now?
// Posts $pdata to $purl and returns the result
public function post_contents($purl, $pdata, $ref='')
{
if($ref == '')
$ref = $purl;
$ch = $this->curl;
$hd = array("Connection: Keep-alive", "Keep-alive: 300", "Expect:", "Referer: $ref", "User-Agent: {$this->agent}");
curl_setopt($ch, CURLOPT_URL, $purl);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $pdata);
curl_setopt($ch, CURLOPT_HTTPHEADER, $hd);
$x = curl_exec($ch);
curl_setopt($ch, CURLOPT_POST, 0);
return $x;
}
Notice that except for a few settings this is almost identical to the get_contents() method. Here we simply set CURLOPT_POST to 1, submit the POST parameters through CURLOPT_POSTFIELDS as $pdata, and after we execute we reset CURLOPT_POST to 0 just in case our next request is get_contents() (the get_contents() method also resets it to 0, so this is slightly redundant). And that's it, right now we have a fully functional web bot capable of making POST and GET requests. Using this technique you can create bots that scrape all manner of websites: you can create accounts on websites, log in to websites, send messages and post content, all by scripting it into a bot. However scraping the data is seldom enough; usually it's a means to an end, which is processing the data you scraped. What good is half-automating something? Let's dig a little deeper and include some parsing routines. I stress heavily that I did not write these methods. I did adapt them into an OOP framework so they could be implemented in this class, but the code was written by Michael Schrenk in his book 'Webbots, Spiders, and Screen Scrapers' (No Starch Press) http://webbotsspidersscreenscrapers.com/ which is an awesome resource for people writing bots in PHP.
public function split_string($string, $delineator, $desired, $type)
{
// Case insensitive parse, convert string and delineator to lower case
$lc_str = strtolower($string);
$marker = strtolower($delineator);
// Return text BEFORE the delineator
if($desired == $this->BEFORE)
{
if($type == $this->EXCL) // Return text EXCL of the delineator
$split_here = strpos($lc_str, $marker);
else // Return text INCL of the delineator
$split_here = strpos($lc_str, $marker)+strlen($marker);
$parsed_string = substr($string, 0, $split_here);
}
// Return text AFTER the delineator
else
{
if($type==$this->EXCL) // Return text EXCL of the delineator
$split_here = strpos($lc_str, $marker) + strlen($marker);
else // Return text INCL of the delineator
$split_here = strpos($lc_str, $marker) ;
$parsed_string = substr($string, $split_here, strlen($string));
}
return $parsed_string;
}
public function return_between($string, $start, $stop, $type)
{
$temp = $this->split_string($string, $start, $this->AFTER, $type);
return $this->split_string($temp, $stop, $this->BEFORE, $type);
}
public function parse_array($string, $beg_tag, $close_tag)
{
preg_match_all("($beg_tag(.*)$close_tag)siU", $string, $matching_data);
return $matching_data[0];
}
public function get_attribute($tag, $attribute)
{
// Use Tidy library to 'clean' input (a class method, so it needs $this->)
$cleaned_html = $this->tidy_html($tag);
// Remove all line feeds from the string
$cleaned_html = str_replace("\r", "", $cleaned_html);
$cleaned_html = str_replace("\n", "", $cleaned_html);
// Use return_between() to find the properly quoted value for the attribute
return $this->return_between($cleaned_html, strtoupper($attribute)."=\"", "\"", $this->EXCL);
}
public function remove($string, $open_tag, $close_tag)
{
// Get array of things that should be removed from the input string
$remove_array = $this->parse_array($string, $open_tag, $close_tag);
// str_replace() accepts an array of needles, so a single call removes every match
$string = str_replace($remove_array, "", $string);
return $string;
}
public function tidy_html($input_string)
{
// Detect if Tidy is configured
if( function_exists('tidy_get_release') )
{
// Tidy for PHP version 4
if(substr(phpversion(), 0, 1) == 4)
{
tidy_setopt('uppercase-attributes', TRUE);
tidy_setopt('wrap', 800);
tidy_parse_string($input_string);
$cleaned_html = tidy_get_output();
}
// Tidy for PHP version 5 and later
else
{
$config = array(
'uppercase-attributes' => true,
'wrap' => 800);
$tidy = new tidy;
$tidy->parseString($input_string, $config, 'utf8');
$tidy->cleanRepair();
$cleaned_html = tidy_get_output($tidy);
}
}
else
{
// Tidy not configured for this computer
$cleaned_html = $input_string;
}
return $cleaned_html;
}
public function validateURL($url)
{
$pattern = '/^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-F\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)?$/';
return preg_match($pattern, $url);
}
I'll do a brief overview of these and highlight some of the more useful ones that you can use:
split_string()
This method takes a string and a delineator and returns either all of the text before the marker or all of the text after the marker. It can be set to include or exclude the marker itself through the INCL/EXCL constants.
return_between()
This method accepts a string, two anchors, and a boolean value as parameters. The string is searched for the first occurrence of the opening anchor and the closing anchor that follows it, and the text between them is returned. The boolean value at the end determines whether the anchors are included in the returned string: true will exclude the anchors and false will include them.
parse_array()
Similar to return_between(), parse_array() accepts a string and two anchors as parameters. It returns an array of every occurrence of text wrapped in those anchors, and it always includes the anchors. Using parse_array() to scrape a page and then looping through the array with return_between() to clean and extract the data can be very effective. I use these two methods more often than any of the other parsing methods.
get_attribute()
Accepts a tag in HTML format and the name of an attribute, and attempts to extract that attribute's value.
remove()
remove() accepts 3 parameters: a string and two anchors. It is similar to parse_array() with the exception that it removes all occurrences of the anchors and the text between them from the string.
tidy_html()
Accepts a single string. This method requires PHP's Tidy extension to be installed; if it is not, it will simply return the same string it was given and won't raise an error.
validateURL()
This is the one method I wrote for parsing. It just uses a regex I googled to validate a URL, pretty self-explanatory.
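For anyone who thinks better in Python, here's a rough sketch of how split_string(), return_between(), and parse_array() fit together. This is a loose port for illustration only, not Schrenk's code: the boolean parameters stand in for the BEFORE/AFTER and EXCL/INCL constants, the sample HTML is made up, and unlike the PHP version it escapes the anchors before building the regex and returns an empty string when a marker is missing.

```python
import re

def split_string(string, delineator, before, exclusive):
    # Case-insensitive search for the first occurrence of the marker
    pos = string.lower().find(delineator.lower())
    if pos == -1:
        return ""  # assumption: a missing marker yields an empty string
    end = pos + len(delineator)
    if before:
        return string[:pos] if exclusive else string[:end]
    return string[end:] if exclusive else string[pos:]

def return_between(string, start, stop, exclusive=True):
    # Keep what follows the opening anchor, then cut at the closing anchor
    temp = split_string(string, start, False, exclusive)
    return split_string(temp, stop, True, exclusive)

def parse_array(string, beg_tag, close_tag):
    # Non-greedy match so each open/close pair is returned separately
    pattern = re.escape(beg_tag) + "(.*?)" + re.escape(close_tag)
    return [m.group(0) for m in re.finditer(pattern, string, re.S | re.I)]

# Made-up page fragment: grab every <li>, then clean each one up
html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
items = parse_array(html, "<li>", "</li>")
titles = [return_between(item, '">', "</a>") for item in items]
links = [return_between(item, 'href="', '"') for item in items]
print(titles)
print(links)
```

The last few lines are the parse_array() then return_between() pattern described above: one call to chop the page into repeated chunks, then one call per field to pull the value out of each chunk.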
With this you should be able to start writing some fairly non-trivial webbots. I'm going to follow up with a tutorial on using the bot to do basic GET requests and a second on how to use it for POST requests (logging into a website). After that, if it's received well, I intend to write a complementary set of tutorials covering the same content in Python. Have fun guys, if you keep readin' 'em I'll keep writing 'em!
Adios
Link to complete code: http://privatepaste.com/3e4fac4540
Tuesday, 11 June 2013
Introduction to Python Programming
Hey guys, this is the first tutorial I've written in some time so bear with me. Today I'll be doing a basic introduction to programming with Python. We'll briefly go over some of the syntax differences between Python and other languages that you may be more familiar with. After that I'll go over modules, which are libraries (many of them shipped with Python) that you can utilise. Once we have the basics down we'll get into some more tangible stuff like variables, conditionals, and loops. Finally we'll take a brief look at functions and the Python Interactive Interpreter. Grab a cup of coffee 'cause this might be a long tutorial. Also note this tutorial is based on the Python 2.x branch, particularly 2.7. If you code in 3.x and have any questions feel free to ask and I'll get an answer back to you as soon as I can. This isn't an overly thorough tutorial. Python is an awesome language and as a result it's very extensive. I've tried to cover most of the basics here so that by the time you're finished you can sit down and start writing some code. If you have any questions feel free to contact me, and Stack Overflow is a godsend.
Python Syntax:
The syntax in Python is different from what you might be familiar with in languages like PHP or C++. Lines do not need to be terminated by a semi-colon in Python (you can still use them and the code will execute fine, but the standard convention is to omit them). There are also no curly braces in Python to denote a block of code. As such, code NEEDS to be indented properly in Python or it will not run. Indentation needs to be consistent as well: you can for example use 4 spaces as an indent or one tab, but you cannot use both interchangeably throughout the same program (as a rule of thumb when I copy Python code I immediately do a find and replace of 4 spaces to \t to make life easier; if you port code from Windows to Linux you may find you need to use dos2unix, or unix2dos for the reverse). If you are familiar with how indentation works then this shouldn't be a problem. If you aren't sure what I'm referring to, we'll get to that shortly.
Python is also distinct from other languages in the amount of English used in its syntax. Where other languages use || and && for the logical 'or' and 'and' operators respectively, in Python you simply write the words 'or' and 'and' in conditional statements. This English-based syntax mixed with strongly enforced indentation makes Python code quite easy to read at first glance, which is one of the many benefits of Python. (There is also a keyword 'not' in place of !. A bare ! isn't valid Python, but != still works as the not-equal operator, as in 1 != 2.)
Modules:
If you're wondering if you can do something in Python the answer is almost invariably yes, and there is almost certainly already a module in existence that will make your life easier. Python comes pre-packaged with a ton of useful modules of all types, and there are just as many or more third party modules available that usually require only a simple install script to make them accessible. Importing a module is in essence the same as including a header file in C/C++, doing an include/require in PHP, or whatever the equivalent is in any other language, same deal. The syntax to do so is: import modulename
There are also a few spins on this. You can import multiple modules in one line, like: import modulename1, modulename2
Or you can choose to import only a specific class or function from a module: from modulename import classname
If you use import modulename you will find that you have to write modulename.classname() when you instantiate an instance. That is because a plain import only binds the module's name in your namespace, and everything inside it is reached through that name. To avoid this you can write: from modulename import *
which will import everything the module exports into the current namespace. Be warned though, this can silently overwrite names you have already defined and makes it hard to tell where a name came from. It's not against standard procedure to do it, but don't use it in a sloppy manner, be sure there's a reason for it.
You can also import other scripts you have written. If they are in the same directory as the script you are executing you can simply type: import myscript
If your script is named 'myscript.py' it will now be imported. Be sure to omit the .py in the import call.
You can also import local files in a subdirectory, provided you use: from folder import myscript
NOTE: to do this on Python 2.x you will need a file called __init__.py in the directory 'folder' (or whatever yours is called). It can be completely empty, or just contain: pass
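A minimal sketch of the import forms above, using standard-library modules (math, os, sys) as stand-ins for modulename. The print calls take a single argument so the script behaves the same on 2.7 and 3.x.

```python
# Plain import: everything in the module is reached through its name
import math
print(math.sqrt(16))

# Several modules on one line
import os, sys

# Import one specific name; no module prefix is needed afterwards
from math import sqrt
root = sqrt(16)
print(root)

# The star form pulls every public name into the current namespace:
#   from math import *
# It is left commented out here because of the name clashes it can cause.

# Importing your own script works the same way (myscript.py is hypothetical):
#   import myscript
#   from folder import myscript
```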
Variables:
Python is dynamically typed and leans on the duck typing idea: if something walks like a duck and talks like a duck, it's probably a duck. What this means in English is that when you create a variable you do not need to declare its data type (an int or a string for example) as you would in C++ or some other languages. At its core everything in the Python language is an object, and a variable is really just a name that refers to an object; the object's value determines its data type. Because variables are references, you can re-assign a variable and it will simply reference a new object, and Python's garbage collection will deal with freeing the memory of the old object behind the scenes. I feel like I'm not explaining this very well so perhaps an example is in order, consider the following code:
This is a basic example from the Python interactive interpreter, which is something we'll discuss in further detail later on. What we are doing here is declaring and initializing 3 variables: a, b, and c. a is of type int because its value is 4, which is a whole number. b on the other hand is of type float, as it's a floating point number (notice we do not have to tell the script their types, this is all determined for us). c is equal to a * b. In some stricter languages you would have to cast a to a float or b to an int before you could perform an operation with both values. Python automatically creates c as a float and stores the result of the operation. If you wanted to do an int operation you could rewrite the c equation as: c = a * int(b)
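A minimal sketch of that interpreter session as a plain script. The value 2.5 for b is an assumption (any float would do), and the integer version is stored under a separate name, c_int, so both results stay visible.

```python
a = 4               # whole number, so Python makes it an int
b = 2.5             # assumed value; the decimal point makes it a float
c = a * b           # int * float is promoted to float

print(type(a).__name__)
print(type(b).__name__)
print(c)

c_int = a * int(b)  # int() truncates 2.5 to 2, so this is int * int
print(c_int)
```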
The main data types you should be familiar with in python are as follows:
* Mutable means it can be changed after it has been created, immutable means it cannot be changed after it has been created. So for instance you cannot make a tuple tup = (1,2,3,4,5) and then assign tup[1] = 3, although you can re-assign tup to an entirely new tuple; note there is a distinct difference between the two. With a list or dictionary, on the other hand, you can add new items as well as modify item values in place.
There are some others, but for the scope of this tutorial this list should be sufficient to get you started writing programs.
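To make the mutable/immutable distinction concrete, here is a small sketch using a list, a dictionary, and a tuple. The variable names and values are made up for illustration.

```python
nums = [1, 2, 3]          # list: mutable
nums[1] = 20              # change an item in place
nums.append(4)            # add a new item

ages = {"alice": 30}      # dictionary: mutable
ages["bob"] = 25          # add a new key/value pair

tup = (1, 2, 3, 4, 5)     # tuple: immutable
try:
    tup[1] = 3            # item assignment is not allowed on a tuple
    changed = True
except TypeError:
    changed = False

tup = (9, 9)              # re-assigning the name to a new tuple is fine
print(nums)
print(changed)
```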
Now that we have our basic building blocks, let's look at how we can write a simple application that gets the user to enter their name, then writes a greeting back out to them:
To break down what's happening here, raw_input() is a function used to get input from the user. It takes as a parameter a string to use as a prompt and it returns the value the user enters, so we create a variable user, which Python will automatically make a string, and set it to the user's name. Next we have two print statements. This is to illustrate two ways print can be used in Python (note there are changes to print in 3.x, some are actually pretty cool, but I digress, that's beyond the scope of this). The first print statement is an example of string concatenation, which in English means "taking two strings and sticking them together into one". But sometimes you want to print other variables as output, ints for example. To do this you could use string concatenation and print something like: print "some text " + str(num)
Here we've cast the integer num to a string so this works. Alternatively we can send print several values (as many as you want) by separating them with a , as is illustrated in the second print statement. Note if you do this it will automatically print a space between the items. Printing \n is also unnecessary in Python unless you're using sys.stdout.write(), as print automatically puts a \n at the end. You can override this by ending a print statement with a , which suppresses the newline so that the next print continues on the same line.
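A minimal sketch of the greeting script. The name is hard-coded so it runs without a prompt (on 2.7 it would come from raw_input(), shown in the comment), and the strings are built first so the same code works on 2.7 and 3.x. The wording of the messages and the value of num are assumptions.

```python
# On Python 2.7 the name would come from the keyboard:
#   user = raw_input("Enter your name: ")
user = "Alice"   # stand-in value

# String concatenation: stick two strings together into one
greeting = "Hello, " + user + "!"
print(greeting)

# An int has to be cast to a string before it can be concatenated
num = 7
message = "Your lucky number is " + str(num)
print(message)
```

Leaving out the str() call raises a TypeError, because Python will not silently glue a string and an int together.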
Conditionals:
Conditional statements in Python are pretty typical. The major changes are just in the syntax, while the logical expressions are more or less the same. Examples help to make things clear so let's look at a basic example:
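A minimal sketch of such an if/else check. The day is hard-coded so the script runs without a prompt, and the two messages are made up.

```python
# On Python 2.7 this would be: day = raw_input("What day is it? ")
day = "Sunday"          # stand-in value
day = day.lower()       # normalise the case before comparing

if day == "saturday" or day == "sunday":
    message = "Enjoy the weekend!"
else:
    message = "Back to work."
print(message)
```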
This script is similar to our previous one at the start. It prompts the user to enter the current day of the week, then takes the value they entered and uses the lower() method on it (remember everything in Python is an object, and as such different data types have different methods that can perform operations on them; strings for example have .lower() and .upper() to modify the case of the letters). We check the value the user entered and see if it's either 'saturday' or 'sunday'. If it is then it prints a warm message, otherwise it's not so bright. In Python we use a : in most places where you would see the { opening bracket in other languages, for example at the opening of a class, function, conditional, etc. The : denotes the opening of a new code block. Notice the use of the word 'or' here. In a language like PHP the equivalent condition would be written as: if($day == 'saturday' || $day == 'sunday')
For an example like this there's not a whole lot of difference, and note that you still use the == sign. Python does have an 'is' keyword but its use might be slightly confusing at first: instead of checking for value equality, 'is' checks whether both sides are actually references to the same object. I won't elaborate on this since it's beyond the scope of this tutorial, but for those who are interested in looking it up, there it 'is'. (bad pun)
We've seen how if/else clauses work here: we check a condition and if it's true one set of instructions is executed, and if it is not true the else set of instructions is executed. However we often have problems that are more complex and require more potential streams of execution than just two. In most languages you're probably familiar with, this is where the else if statement comes in, which isn't technically a separate statement in its own right but rather a convention used in formatting nested if/else clauses. Python uses something a little different, but in practice it is identical: we simply use the 'elif' keyword, e.g.:
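A minimal sketch of an if/elif/else chain. The hour value and the greetings are made up for illustration.

```python
hour = 14   # assumed value on a 24-hour clock

if hour < 12:
    greeting = "Good morning"
elif hour < 18:
    greeting = "Good afternoon"
else:
    greeting = "Good evening"
print(greeting)
```

Only the first branch whose condition is true runs, so the order of the checks matters.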
Unfortunately Python does not have an equivalent to switch statements (much to my chagrin). You can do something similar with a dictionary, but that's only really useful in specific scenarios. It's another thing I won't go into detail on, but you can look up the technique at: http://stackoverflow.com/questions/60208/replacements-for-switch-statement-in-python
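For a rough idea of the dictionary approach, here is one simple form of it: a lookup table with a default. The status codes and labels are only example data.

```python
# Each key plays the role of a "case"; the value is what that case produces
codes = {200: "OK", 404: "Not Found", 500: "Server Error"}

status = 404
label = codes.get(status, "Unknown")   # second argument acts as "default:"
print(label)
```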
Loops:
Every programming language needs a way of repeating the execution of a set of instructions. For example, if you have a list of employees you need to run through to process pay, the same steps need to be performed for each employee. There are two primary types of loops: Event Controlled Loops and Count Controlled Loops. The former will continue execution until some event occurs, say reading a file until it reaches the EOF marker, or iterating through a list until it hits the last element (this is technically referred to as a collection controlled loop, but it is a sub-type of an event controlled loop). The latter, count controlled loops, will execute a preset number of times. The vast majority of the time in Python you will use the for loop. It is spectacularly useful, has been optimized significantly, and can be adapted to most situations. However there will be times where you need something different, which is when you can utilise the while loop.
The for loop in Python is incredibly full-featured, I'm having trouble thinking of where to begin. As with most things, let's go to an example to illustrate how for loops work.
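A minimal sketch of a for loop over a list. The names in usernames are made up.

```python
usernames = ["alice", "bob", "carol"]   # made-up data

greetings = []
for user in usernames:
    # 'user' takes the value of each list item in turn
    greetings.append("Hello, " + user)
    print("Hello, " + user)
```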
What it does here is take the list usernames and iterate through each of its values, each time substituting the current value for 'user' in the code block. This can be very convenient in many scenarios, but one problem you will encounter is that with this method of iteration you can't reference other elements relative to the current one. With an array in PHP, say, you could write $var[$i] + $var[$i-3]; to continually modify a set as you go through it. Not to worry, this can be accomplished in Python just the same by creating a range of numbers and iterating through that instead. An example to clarify:
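A minimal sketch of index-based iteration with range() and len(). The running-total task is just an example of needing a neighbouring element.

```python
values = [1, 2, 3, 4, 5]

# Start at index 1 so values[i - 1] always exists
for i in range(1, len(values)):
    values[i] = values[i] + values[i - 1]   # add the previous element

print(values)
```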
Here we are using the len() function, which returns the number of elements in a container like a list, tuple, or dictionary (it also works on strings to get their length). This gives us the total number of items, and the range() function creates a sequence of numbers that we iterate through using i as the numeric index. We have effectively gone from an event controlled for loop, which checked for the end of the iteration sequence, to a count controlled loop. If this is confusing, try to think of it as a way of encompassing a for and a foreach loop in one thing. You can also take, say, a dictionary and iterate through its key => value sets.
Note that on 2.7 you use the .iteritems() method to do this (.items() also works, and it is the only one of the two that exists on 3.x).
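A minimal sketch of looping over a dictionary's key/value pairs. It uses .items() so it runs on both 2.7 and 3.x, and sorted() only to make the order predictable; the port numbers are example data.

```python
ports = {"http": 80, "ssh": 22}   # example data

lines = []
for name, port in sorted(ports.items()):
    # each pair is unpacked into a key and its value
    lines.append(name + " -> " + str(port))

print(lines)
```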
The else clause can be used in conjunction with for loops as well. If the for loop runs all the way to the end of its sequence without hitting a break, the else clause executes; if you break out of the loop, it is skipped. The idea is that you search for a condition which would break out of the loop entirely, and if that condition is never met you run the else clause at the end of the sequence. Again, an example for clarity:
In our example, if it finds a match it immediately ends the for loop and jumps past the else block before resuming execution. However if no match is found, it alerts you as such.
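A minimal sketch of for/else used as a search. The list and the target name are made up, and the target is deliberately missing so the else branch runs.

```python
usernames = ["alice", "bob", "carol"]   # made-up data
target = "dave"

for user in usernames:
    if user == target:
        result = "found " + user
        break                 # leaving early skips the else block
else:
    # only reached when the loop finished without a break
    result = target + " not found"

print(result)
```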
In this example you might be asking what 'break' is, since we haven't discussed it yet. In many programming languages there are keywords for either breaking out of a loop entirely or skipping to the next iteration. These are 'break' and 'continue' respectively. They operate in the same manner as they do in other languages and generally aren't very difficult concepts to grasp, so I won't spend any further time discussing them.
The final thing I'll discuss in the way of for loops is something called list comprehensions. List comprehensions make it easy to generate a list in one statement by iterating through a set. As always, that probably makes sense if you know what it is already and is a bunch of jargon and gibberish otherwise, so let's take a look at an example:
*example taken from: http://www.pythonforbeginners.com/lists/list-comprehensions-in-python/
List comprehensions can take a while to wrap your head around so start simple and work your way up, they can be extremely useful at times.
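A minimal sketch of two list comprehensions. These are generic illustrations and not the example from the page linked above.

```python
# Build a list by transforming each item of a sequence
squares = [x * x for x in range(10)]

# An optional 'if' at the end filters which items are kept
evens = [x for x in squares if x % 2 == 0]

print(squares)
print(evens)
```

Each one is shorthand for creating an empty list and appending to it inside a for loop.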
Alright, so next we have while loops. While loops are significantly simpler than for loops so we'll spend a lot less time here. A while loop checks a condition and keeps executing a block of code as long as that condition is true. If the condition is initially false, the while loop won't run at all. Example:
This is an example of an event controlled loop. It will continue to execute until the user enters either 'quit' or 'exit'. While loops can be used as count controlled loops as well:
In this example we initialize the counter variable with a value of 0. From there we check if counter is less than 10. Since 0 < 10 is a true statement the code executes, printing "Step: 0" and then adding 1 to the value of counter. It then checks 1 < 10, which is also true, and so on until counter becomes equal to 10, at which point the statement 10 < 10 is false and the loop exits.
Basically as long as the condition's true the loop will run, pretty straightforward and not any different from most other programming languages.
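A minimal sketch of both kinds of while loop. For the event controlled one, a made-up list of commands stands in for what the user would type at a raw_input() prompt, so the script runs on its own.

```python
# Event controlled: keep going until 'quit' or 'exit' shows up
commands = ["help", "status", "quit", "ignored"]   # stand-in for user input
handled = []
index = 0
command = commands[index]
while command != "quit" and command != "exit":
    handled.append(command)
    index = index + 1
    command = commands[index]

# Count controlled: run exactly ten times
counter = 0
steps = []
while counter < 10:
    steps.append("Step: " + str(counter))
    print("Step: " + str(counter))
    counter = counter + 1
```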
Functions:
In Python functions serve the same purpose they do in other programming languages: they allow you to make your code modular and more abstract. Due to Python's dynamic typing you won't need to declare a function's data type like you would in a traditional programming language like C (where you would either declare the data type of a value-returning function, or declare void). Although you don't have to explicitly declare their types, functions can be used in much the same way, returning any data type or nothing at all. Let's look at a brief example:
The keyword def tells the interpreter that we are defining a function named 'sayhi' which takes a single parameter 'name'. It then proceeds to print out a greeting to the name it was supplied. sayhi() does not explicitly return any value. Let's take a look at a function that might be a little more useful and build our own square function:
When you see a line that includes a function call, such as that print statement, read it in the sense that by the time that line itself is executed all of its function calls have already executed. So sqr(3) will run before the print and the value it returns is substituted in its place, in this case the value 9. As far as the interpreter sees it, the statement print sqr(3) is equivalent to print 9. Being able to use functions like this allows us to write some very flexible code.
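The two functions described above can be sketched roughly like this. The wording of the greeting and the name passed in are assumptions, and print is called with a single argument so the script runs on 2.7 and 3.x.

```python
def sayhi(name):
    # Prints a greeting; there is no return statement
    print("Hi " + name + "!")

def sqr(x):
    # Hands the squared value back to the caller
    return x * x

result = sayhi("Bob")   # a function without a return gives back None
print(sqr(3))           # sqr(3) is evaluated first, then its value is printed
```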
I'll post a followup tutorial that goes more in depth on functions in terms of things like passing by value and passing by reference, but for the time being this should be sufficient.
Interactive Interpreter:
The last thing I want to touch on in this is the Python Interactive Interpreter. I cant stress enough how freaking awesome the interactive interpreter is, its a programmers wetdream and so much more. If you open a command prompt on linux windows or mac just type: 'python' and it will load you into the interactive interpreter, from there you can begin writing code and it will instantly execute, this allows you to test out new theories, and check sections of code independantly of the program they're being written for. In the interpreter you can import any modules and gain access to their features as well. All of the examples used in this tutorial can be quickly tested in the interpreter.(remember to use tabs for indentation xD) spend time just fucking around with the interactive interpreter I have a few things I want you to play with that ill list here, look at the results they output and consider how these could be utilized in applications
Common Phrases:
- cast - changing a value from one data type to another
- class - a blueprint that defines the data and methods its objects will have
- module - another python script you can import to add extended functionality
- object - an instance of a class; the class is the blueprint, the object is the actual thing built from it
Python Syntax:
The syntax in Python is different from what you might be familiar with in languages like PHP or C++. Lines do not need to be terminated by a semicolon (you can still use them and the code will run fine, but the standard convention is to omit them). There are also no curly braces to denote a block of code; indentation does that job instead, so code NEEDS to be indented properly or it will not run. Indentation needs to be consistent as well: you can use 4 spaces as an indent or one tab, but you should not mix the two in the same program. (As a rule of thumb, when I copy Python code I immediately do a find and replace of 4 spaces to \t to make life easier. If you port code from Windows to Linux you may find you need dos2unix to fix the line endings, or unix2dos going the other way.) If you are familiar with how indentation works this shouldn't be a problem; if you aren't sure what I'm referring to, we'll get to that shortly.
Python is also distinct from other languages in the amount of English used in its syntax. Where other languages use || and && for the logical 'or' and 'and' operators, in Python you simply write the words 'or' and 'and' in conditional statements. (There is also the keyword 'not' in place of !. A bare ! is not a valid operator in Python, but the != 'not equal' comparison still exists, as in 1 != 2.) This English based syntax mixed with strongly enforced indentation makes Python code quite easy to read at first glance, which is one of the many benefits of Python.
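Here's a quick sketch of the word operators in action. The variable names and values are made up purely for illustration, and I've used the parenthesised form of print with a single argument so the snippet runs the same on Python 2 and 3:

```python
# Hypothetical values, just for illustration
age = 25
has_ticket = True
is_banned = False

# 'and', 'or' and 'not' replace &&, || and ! from C-style languages
can_enter = age >= 18 and has_ticket and not is_banned
needs_help = age < 18 or not has_ticket

print("can_enter: " + str(can_enter))
print("needs_help: " + str(needs_help))
```

Try flipping is_banned to True and see how the first result changes.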
Modules:
If you're wondering whether you can do something in Python the answer is almost invariably yes; there is almost certainly already a module in existence that will make your life easier. Python comes pre-packaged with a ton of useful modules of all types, and there are just as many or more third party modules available that usually require only a simple install script be run to make them accessible. Importing a module is in essence the same as including a header file in C/C++ or doing an include/require in PHP, or whatever the equivalent is in any other language. The syntax to do so is:
import modulename
There are also a few spins on this. You can import multiple modules in one line like:
import module1, module2, module3
or you can choose to import only a specific class or function from a module:
from module1 import class3
If you use import modulename you will find that you have to write modulename.classname() when you instantiate an instance. That's because a plain import only adds the name of the module itself to your namespace, and everything inside it is reached through that name. To avoid the prefix you can write:
from module1 import *
which will import everything the module exposes into your current namespace. Be warned though: this clutters the namespace, can silently overwrite names you have already defined, and makes it harder to tell where a name came from. It's not against standard procedure to do it, but don't use it in a sloppy manner; be sure there's a reason for it.
You can also import other scripts you have written. If they are in the same directory as the script you are executing you can simply type:
import myscript
If your script is named 'myscript.py' it will now be imported; be sure to omit the .py in the import call.
You can also import local files in a subdirectory, provided you use:
import folder.myscript
NOTE: to do this you will likely need a file called __init__.py inside the directory 'folder' (or whatever yours is called). The file can be left completely empty, or you can simply write 'pass' in it.
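To see the different import styles side by side, here's a small sketch using the standard library's math module (my choice of module, any other would do). It shows how a plain import needs the module prefix while from ... import does not:

```python
import math                 # names are reached through the module: math.sqrt
from math import pi, floor  # these two names can be used directly

root = math.sqrt(16)        # prefix required
area = pi * 2 ** 2          # no prefix needed
whole = floor(area)

print("root: " + str(root))
```

If you tried to call sqrt(16) on its own here it would fail with a NameError, since only pi and floor were pulled in directly.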
Variables
Python uses the Duck Typing paradigm, in which if something walks like a duck and talks like a duck it's probably a duck. What this means in English is that when you create a variable you do not need to declare its data type (an int or a string for example) as you would in C++ or some other languages. At its core everything in the Python language is an object. When you assign a variable you're really attaching a name to an object, and that object's type is determined by the value you gave it. Variables are references, which means you can re-assign a variable and it will simply point at a new object, and Python's garbage collection will deal with freeing the memory of the old object behind the scenes. An example makes this clearer, so consider the following code:
>>> a = 4
>>> b = 32.342
>>> c = a * b
>>> c
129.368
>>>
This is a basic example from the Python interactive interpreter, which is something we'll discuss in further detail later on. What we are doing here is creating 3 variables: a, b, and c. a is of type int because its value is 4, which is a whole number. b on the other hand is of type float, as it's a floating point number. (Notice we do not have to tell the script their types, this is all determined for us.) c is equal to a * b. In some stricter languages you would be forced to cast a to a float or b to an int before you could perform an operation with both values. Python automatically creates c as a float and stores the result of the operation. If you wanted to do an int operation you could rewrite the c equation as follows (int(b) chops 32.342 down to 32, so c becomes 128):
>>> c = a * int(b)
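If you want to check what type Python picked for you, the built-in type() function will tell you. A small sketch, reusing the values from above (the variable d is just a name I've added for the int version), written to run on both Python 2 and 3:

```python
a = 4
b = 32.342
c = a * b          # int * float gives a float
d = a * int(b)     # int(b) truncates 32.342 to 32, so this stays an int

print(type(c).__name__ + ": " + str(c))
print(type(d).__name__ + ": " + str(d))

a = "now a string"  # same name, now pointing at a new object of another type
print(type(a).__name__)
```

The last two lines show the re-assignment idea: the name a is simply pointed at a different object.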
The main data types you should be familiar with in python are as follows:
- Integer - (Think whole numbers: 1, 5, 420)
- Float - (Floating point numbers, numbers with a decimal: 3.14, 4.20, 1.21 gigawatts)
- Char - (Python has no separate char type; a single character such as 'a', 'r', '4', '*' is simply a string of length 1)
- String - (A sequence of characters; unlike in C there is no nullbyte terminator to worry about. It is an iterable object (think arrays) and it is immutable)
- Boolean - (A logical True or False value, often associated with the binary values 1 and 0)
- Tuples - (Tuples are for storing an immutable set of data. They are like arrays but their values cannot be changed after creation. Items of tuples can be any data type, including other tuples, lists or dictionaries, which we will discuss next)
- Lists - (A list is a mutable set of data, also like arrays but with several different properties. Lists can be sliced to return a segment of their data, and lists can contain any data type as well as other lists, tuples or dictionaries)
- Lists and Tuples both use an integer index, e.g.: list1[5] tuple3[2], to access individual items. If you wanted to create the equivalent of an associative array you would use a dictionary. Dictionaries are inherently unordered, so sorting through them can require some more advanced techniques that are beyond the scope of this tutorial
- Dictionaries - (A Dictionary is a mutable set of data. Similar to an array, but it uses a key => value type of index. Values can be any data type, and keys can be any immutable type such as strings, numbers or tuples)
* Mutable means it can be changed after it has been created, immutable means it cannot be changed after it has been created. So for instance you cannot make a tuple tup = (1,2,3,4,5) and then assign tup[1] = 3, although you can re-assign tup to an entirely new tuple. Note there is a distinct difference between the two. With a list or dictionary, on the other hand, you can add new items as well as modify existing values in place.
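The mutable/immutable difference is easier to see than to describe, so here's a rough sketch. The names and values are invented for the demo, and it's written to run on both Python 2 and 3:

```python
tup = (1, 2, 3, 4, 5)
try:
    tup[1] = 3              # tuples are immutable, so this raises TypeError
except TypeError as err:
    print("tuple: " + str(err))

tup = (9, 8, 7)             # allowed: the name now points at a brand new tuple

nums = [1, 2, 3]
nums[1] = 20                # lists can be changed in place
nums.append(4)

jobs = {'Logic': 'coder'}
jobs['Logic'] = 'admin'     # so can dictionary values
jobs['MrGreen'] = 'coder'   # and new keys can be added

print(nums)
print(sorted(jobs.items()))  # sorted, since dictionaries have no guaranteed order here
```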
There are some others, but for the scope of this tutorial this list should be sufficient to get you started writing programs.
Now that we have our basic building blocks, let's look at how we can write a simple application to get the user to enter their name, then write back out a greeting to them:
>>> user = raw_input("Please Enter Your Name: ")
>>> print "Hello " + user
>>> print "Hello", user
To break down what's happening here: raw_input() is a function used to get input from the user. It takes a string to use as a prompt and returns whatever the user enters as a string, so user ends up holding the user's name. Next we have two print statements, to illustrate two ways print can be used in Python. (Note that print changed in 3.x, where it became a function; some of the changes are actually pretty cool but I digress, that's beyond the scope of this.) The first print statement is an example of string concatenation, which in English means "taking two strings and sticking them together into one". Sometimes you want to print other values as output, ints for example. To do this you could use string concatenation and print something like:
print "Number" + str(num)
Here we've cast the integer num to a string so this works. Alternatively we can give print as many values as we want by separating them with a , as illustrated in the second print statement. Note that if you do this it will automatically print a space between the items. Printing \n is also unnecessary in Python unless you're using sys.stdout.write(), as print automatically puts a \n at the end. You can override this by ending a print statement with a , which suppresses the newline so that the next print continues on the same line.
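To show why the str() cast matters, here's a small sketch of what happens without it. The % formatting line is my own addition as another option and isn't something this tutorial covers; the snippet uses the parenthesised single-argument print so it runs on Python 2 and 3 alike:

```python
num = 42

try:
    message = "Number " + num        # str + int is not allowed
except TypeError:
    message = "Number " + str(num)   # cast the int to a string first

formatted = "Number %d of %d" % (num, 100)   # % formatting is another option

print(message)
print(formatted)
```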
Conditionals:
Conditional statements in Python are pretty typical; the major changes are just in the syntax, but in the logical expression they are more or less the same. Examples help to make things clear so let's look at a basic example:
>>> day = raw_input("What Day Is It?: ")
>>> if day.lower() == 'saturday' or day.lower() == 'sunday':
...     print "Wewt its the weekend!"
... else:
...     print "Boo its a weekday!"
This script is similar to our previous one: it prompts the user to enter the current day of the week, then takes the value they entered and calls the lower() method on it. (Remember everything in Python is an object, and different data types have different methods that can perform operations on them; strings for example have .lower() and .upper() to change the case of the letters.) We check the value the user entered to see if it's either 'saturday' or 'sunday'. If it is, it prints a warm message; otherwise the message is not so bright. In Python we use a : in most places where you would see the { opening bracket in other languages, for example at the opening of a class, function, conditional, etc. The : denotes the opening of a new code block. Notice the use of the word 'or' here; in a language like PHP the equivalent would be written as:
<?php
$day = trim(fgets(STDIN));
if(strtolower($day) == 'saturday' || strtolower($day) == 'sunday')
    print "Wewt its the weekend!\n";
else
    print "Boo its a weekday!\n";
?>
For an example like this there's not a whole lot of difference, and note that you still use the == sign. Python does have an 'is' keyword but its use might be slightly confusing at first: instead of checking for value equality, 'is' checks whether both sides are actually references to the same object. I won't elaborate on this since it's beyond the scope of this tutorial, but for those who are interested in looking it up, there it 'is'. (bad pun)
We've seen how if/else clauses work: we check a condition and if it's true one set of instructions is executed, and if it is not true the else set of instructions is executed. However, often we have problems that are more complex and require more potential streams of execution than just two. In most languages you're probably familiar with, this is where the else if statement comes in, which in languages like C isn't technically a separate statement in its own right but rather a convention for formatting nested if/else clauses. Python uses something a little different, but in practice it is identical: we simply use the 'elif' keyword, e.g.:
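Just so the difference isn't a complete mystery, here's a tiny sketch of == versus 'is' using two lists. The names a, b and c are arbitrary, and the snippet runs on both Python 2 and 3:

```python
a = [1, 2, 3]
b = [1, 2, 3]   # equal contents, but a separate list object
c = a           # a second name for the very same object

print("a == b: " + str(a == b))   # same values
print("a is b: " + str(a is b))   # different objects
print("a is c: " + str(a is c))   # same object
```

For everyday comparisons of values, stick with ==.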
>>> day = raw_input("What Day Is It?: ")
>>> if day.lower() == 'saturday' or day.lower() == 'sunday':
...     print "Wewt its the weekend!"
... elif day.lower() == 'monday' or day.lower() == 'tuesday':
...     print "Stay Strong Brother!"
... elif day.lower() == 'thursday' or day.lower() == 'friday':
...     print "Almost There!"
... else:
...     print "Boo its a weekday!"
Unfortunately Python does not have an equivalent to switch statements (much to my chagrin). You can do something similar using a dictionary, but that's only really useful in specific scenarios. It's another thing I won't go into detail on, but you can look up the technique at: http://stackoverflow.com/questions/60208/replacements-for-switch-statement-in-python
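For the curious, here's a rough sketch of what the dictionary approach can look like. This is just one way to do it, and the function names (weekend, weekday, react) and the responses dictionary are all made up for the example:

```python
def weekend():
    return "Wewt its the weekend!"

def weekday():
    return "Boo its a weekday!"

# Each key plays the role of a 'case' label, each value is the function to run
responses = {
    'saturday': weekend,
    'sunday': weekend,
}

def react(day):
    # .get() falls back to weekday, acting like the 'default' branch
    handler = responses.get(day.lower(), weekday)
    return handler()

print(react("Sunday"))
print(react("Tuesday"))
```

This works nicely when every branch can be wrapped in a function, which is why it only fits specific scenarios.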
Loops:
Every programming language needs a way of repeating the execution of a set of instructions. For example, if you have a list of employees you need to run through to process pay, the same steps need to be performed for each employee. There are two primary types of loops: Event Controlled Loops and Count Controlled Loops. The former will continue execution until some event occurs, say reading a file until it reaches the EOF marker, or iterating through a list until it hits the last element (this is technically referred to as a collection controlled loop, but it is a sub-type of an event controlled loop). The latter, count controlled loops, will execute a preset number of times. The vast majority of the time in Python you will use the for loop; it is spectacularly useful, has been optimized significantly, and can be adapted to most situations. However there will be times where you need something different, which is when you can utilise the while loop.
The for loop in Python is incredibly full-featured, to the point that I'm having trouble thinking of where to begin. As with most things, let's go to an example to illustrate how for loops work.
>>> usernames = ["Phaedrus", "Logic", "MrGreen", "Zentrix"] # Create a list of usernames, note lists use [] tuples use () and dictionaries use {} as delimiters
>>> for user in usernames:
...     print user,"is fucking dope"
What it does here is take the list usernames and iterate through each of its values, each time substituting the current value for 'user' in the code block. This can be very convenient in many scenarios, but one problem you will encounter is that with this method of iteration you can't reference other elements relative to the current one. With an array in PHP, for instance, you could say $var[$i] + $var[$i-3]; to continually modify a set as you go through it. Not to worry, this can be accomplished in Python just the same by creating a range of numbers and iterating through it instead. An example to clarify:
>>> usernames = ["Phaedrus", "Logic", "MrGreen", "Zentrix"]
>>> for i in range(len(usernames)):
...     print usernames[i],"is fuckin dope"
Here we are using the len() function, which returns the number of elements in a container like a list, tuple, or dictionary (it also works on strings to get the string length). This gives us the total number of items, and the range() function creates a sequence of indexes from 0 up to that total minus one, which we iterate through using i as the numeric index. We have effectively gone from an event controlled for loop, which checked for the end of the iteration sequence, to a count controlled loop. If this is confusing, try to think of it as a way of encompassing a for and a foreach loop in one thing. You can also take, say, a dictionary and iterate through its key => value pairs:
>>> users = { 'Durendal' : 'coder', 'Logic' : 'coder', 'MrGreen' : 'coder', 'zentrix' : 'bawsman' }
>>> for name, job in users.iteritems():
...     print name,"is a",job
Note you need to use the .iteritems() method to do this (in Python 3 it is called .items()).
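To make the earlier point about index-based loops concrete, here's a sketch where each element is compared against its neighbour, which you can't do with the plain 'for item in list' style. The readings list is just made-up numbers, and the code runs on Python 2 and 3:

```python
readings = [3, 7, 12, 20]   # made-up numbers
changes = []

# start at 1 so that readings[i - 1] is always a valid neighbour
for i in range(1, len(readings)):
    changes.append(readings[i] - readings[i - 1])

print(changes)
```

Note that range() here takes a start value as well as an end value, which is how we skip index 0.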
The else clause can be used in conjunction with for loops as well. If the for loop runs all the way to the end of its execution (i.e. it reaches its event or count control without hitting a break), the else clause will execute. The idea is that you search for a condition which would break out of the loop entirely, and if that condition is never met you run the else clause at the end of the sequence. Again, an example for clarity:
>>> usernames = ["Phaedrus", "Logic", "MrGreen", "Zentrix"]
>>> user = raw_input("Enter your name: ")
>>> for ppl in usernames:
...     if user == ppl:
...         print "Found you in the list!",ppl
...         break
... else:
...     print "Failed to find you in the list"
In our example, if it finds a match it immediately ends the for loop and skips past the else clause before resuming execution; however if no match is found the else clause runs and alerts you as such.
In this example you might be asking what 'break' is since we haven't discussed it yet. In many programming languages there are keywords for either breaking out of a loop entirely or skipping to the next iteration; these are 'break' and 'continue' respectively. They operate in the same manner as they do in other languages and generally aren't very difficult concepts to grasp, so I won't spend any further time discussing them.
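If you haven't met them before, this little sketch shows both keywords in one loop. The blank entry and the "STOP" marker are things I've invented for the demo:

```python
usernames = ["Phaedrus", "", "Logic", "STOP", "MrGreen"]
seen = []

for user in usernames:
    if user == "":
        continue        # skip blank entries and move on to the next one
    if user == "STOP":
        break           # leave the loop entirely; MrGreen is never reached
    seen.append(user)

print(seen)
```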
The final thing I'll discuss in the way of for loops is something called list comprehensions. List comprehensions make it easy to generate a list in one statement by iterating through a set. As always, that probably makes sense if you already know what it is and is a bunch of jargon and gibberish otherwise, so let's take a look at an example:
The list comprehension always returns a result list.
If you used to do it like this:
new_list = []
for i in old_list:
    if filter(i):
        new_list.append(expression(i))
You can obtain the same thing using list comprehension:
new_list = [expression(i) for i in old_list if filter(i)]
*example taken from: http://www.pythonforbeginners.com/lists/list-comprehensions-in-python/
List comprehensions can take a while to wrap your head around so start simple and work your way up, they can be extremely useful at times.
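Here's a concrete version of that pattern to start simple with: squaring only the even numbers in a list, first the long way and then as a comprehension. The list and variable names are my own, and the code runs on Python 2 and 3:

```python
numbers = [1, 2, 3, 4, 5, 6]

# the long way: loop, filter, append
even_squares = []
for n in numbers:
    if n % 2 == 0:
        even_squares.append(n * n)

# the same thing as a list comprehension
even_squares_lc = [n * n for n in numbers if n % 2 == 0]

print(even_squares_lc)
```

Here n * n plays the part of expression(i) and n % 2 == 0 plays the part of filter(i).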
Alright, so next we have while loops. While loops are significantly simpler than for loops so we'll spend a lot less time here. A while loop will check a condition and continue executing a block of code as long as that condition is true. If the condition is false to begin with, the while loop won't run at all. Example:
>>> input = raw_input("Enter exit to quit") # Priming Read
>>> while input.lower() != 'exit' and input.lower() != 'quit': # While loop conditional
...     print input # Instructions
...     input = raw_input("Enter exit to quit")
This is an example of an event controlled loop: it will continue to execute until the user enters either 'quit' or 'exit'. While loops can be used as count controlled loops as well:
>>> counter = 0
>>> while counter < 10:
...     print "Step:",counter
...     counter += 1
In this example we initialize the counter variable with a value of 0. From there we check if counter is less than 10; since 0 < 10 is a true statement the code executes, printing "Step: 0" then adding 1 to the value of counter. It then checks 1 < 10 which is also true, and so on, until counter becomes equal to 10, at which point the statement 10 < 10 becomes false and the loop exits.
Basically, as long as the condition is true the loop will run. Pretty straightforward and not any different from most other programming languages.
Functions:
In Python functions serve the same purpose they do in other programming languages: they allow you to make your code modular and more abstract. Due to Python's dynamic typing you won't need to declare a return type for a function like you would in a traditional programming language like C (where you would either state the data type of a value returning function, or declare void). Although you don't have to explicitly declare their types, functions can be used in much the same way, returning any data type or nothing at all. Let's look at a brief example:
>>> def sayhi(name):
...     print "Hey",name
...
>>> name = raw_input("Please enter your name: ")
>>> sayhi(name)
Hey Phaedrus
>>>
The keyword def tells the interpreter that we are defining a function named 'sayhi' which takes a single parameter 'name'. It then proceeds to print out a greeting to the name it was supplied. sayhi() has no return statement, so it doesn't hand back anything useful (technically it returns None). Let's take a look at a function that might be a little more useful and build our own square function:
>>> def sqr(val):
...     return val * val
...
>>> print sqr(3)
9
>>> print sqr(5)
25
When you see a line that includes a function call, such as that print statement, read it in the sense that by the time the line itself is executed all of its function calls have already run. So sqr(3) will run before the print, and the value it returns is substituted in its place, in this case the value 9. As far as the interpreter sees it, the statement print sqr(3) is equivalent to print 9. Being able to use functions like this allows us to write some very flexible code.
I'll post a followup tutorial that goes more in depth on functions in terms of things like passing by value and passing by reference, but for the time being this should be sufficient.
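To tie the two examples together, here's a sketch showing what a function without a return statement hands back, and how one function can return different types. The describe() function is something I've made up for the demo, and the code uses the parenthesised single-argument print so it runs on Python 2 and 3:

```python
def sayhi(name):
    print("Hey " + name)    # prints, but has no return statement

def sqr(val):
    return val * val

def describe(val):
    # the same function can hand back different types
    if val < 0:
        return None
    return (val, sqr(val))

result = sayhi("Phaedrus")
print(result is None)       # a function without return gives back None
print(describe(3))
```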
Interactive Interpreter:
The last thing I want to touch on in this is the Python Interactive Interpreter. I can't stress enough how freaking awesome the interactive interpreter is, it's a programmer's wetdream and so much more. If you open a command prompt on Linux, Windows or Mac and just type 'python' it will load you into the interactive interpreter. From there you can begin writing code and it will instantly execute, which allows you to test out new theories and check sections of code independently of the program they're being written for. In the interpreter you can import any modules and gain access to their features as well. All of the examples used in this tutorial can be quickly tested in the interpreter. (Remember to use tabs for indentation xD) Spend time just fucking around with the interactive interpreter. I have a few things I want you to play with that I'll list here; look at the results they output and consider how these could be utilized in applications:
- str1 = "This is a test string"
- print str1*5
- print 'a'*50
- for letter in str1:
      print letter
- strlen = len(str1)
- for i in range(strlen):
      print str1[i],
- 5 * 3
- a = 4
- a / 2
- str1