Hi! I have had some problems trying to make a crawler in PHP. The code I have found on websites through Google doesn't seem to work for some reason. So I would just like an example of the easiest possible crawler in PHP, where I can just copy and paste the code directly into a document and make it work, and then try to make it more advanced myself. I just want it to go to a website, crawl to another website, and print something in my browser when I run the code.
I guess the simplest crawler is something like this: $page = file_get_contents("http://example.com"); Anyway, this is a good start.
Thanks. So what exactly gets stored in the $page variable when I do this? I mean, if I want to print something from the example website to see in my browser when I run the script, how do I do that in the easiest possible way?
Why don't you test it and see for yourself? What is saved in the variable is the entire contents (the entire HTML) of the destination. This can also be read at http://php.net/file_get_contents. If you want to show only a portion of the HTML, you will have to extract it with a regex (preg_match and so on).
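As a rough sketch of the regex approach mentioned above: the HTML below is a hard-coded stand-in for what file_get_contents would return (an assumption, so the snippet runs without a network connection), and the pattern only copes with a plain <title> tag.

```php
<?php
// Stand-in for: $page = file_get_contents("http://example.com");
// Hard-coded here (an assumption) so the example runs offline.
$page = '<html><head><title>Example Domain</title></head>'
      . '<body><p>Hello from the example page.</p></body></html>';

// Capture whatever sits between <title> and </title>.
// "i" = case-insensitive, "s" = let "." match newlines too.
$title = null;
if (preg_match('/<title>(.*?)<\/title>/is', $page, $matches)) {
    $title = trim($matches[1]);
}

// Escape before printing so stray HTML in the value is shown as text.
echo $title !== null ? htmlspecialchars($title) : 'No title found';
echo "\n";
```

Swapping the hard-coded string for the real file_get_contents call should be the only change needed to try it against a live page.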
Now I made it work, thanks. And to crawl an entire site, I guess I need to loop this function and make it follow internal links?
Yes, you are right. In short, you need to get all the links from that page and decide whether you should download them. Take a look at the Snoopy PHP class; it has some great features for your task.
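One possible way to "decide whether you should download them" is to keep only links on the starting host and skip anything already queued. This is a minimal sketch: isInternalLink is a hypothetical helper (not part of Snoopy), and the link list is sample data.

```php
<?php
// Hypothetical helper (not part of Snoopy): decide whether a link
// stays on the site we started from.
function isInternalLink(string $url, string $siteHost): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    // Relative links such as "/contact" have no host: treat as internal.
    if ($host === null || $host === false) {
        return true;
    }
    return strcasecmp($host, $siteHost) === 0;
}

// Sample links, standing in for what a fetched page might contain.
$links = [
    'http://example.com/about',
    '/contact',
    'http://other.example.org/page',
    'http://example.com/about', // duplicate
];

$queue = []; // URLs still to be downloaded
$seen  = []; // URLs already queued, so nothing is fetched twice

foreach ($links as $link) {
    if (isInternalLink($link, 'example.com') && !isset($seen[$link])) {
        $seen[$link] = true;
        $queue[] = $link;
    }
}

print_r($queue);
```

A real crawler would also need to resolve relative links against the page URL and filter out mailto: or javascript: links, which this sketch does not attempt.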
Yes, you'll again have to use a regex to gather all the links from the $page variable, which contains the HTML contents. You can use http://regexlib.com or a similar site to find such a regex.
It is not necessary to use a regex; you can use DOMDocument and XPath queries to fetch all the links from the HTML. Besides, AFAIK Snoopy has a method to fetch all links from a downloaded page.
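A minimal sketch of the DOMDocument and XPath route, assuming the dom extension is enabled. The HTML is hard-coded sample data in place of a downloaded page.

```php
<?php
// Hard-coded HTML (an assumption) in place of a downloaded page.
$html = '<html><body>'
      . '<a href="http://example.com/one">One</a>'
      . '<a href="/two">Two</a>'
      . '<a name="anchor-without-href">Three</a>'
      . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// "//a[@href]" selects every <a> element that has an href attribute.
foreach ($xpath->query('//a[@href]') as $node) {
    $links[] = $node->getAttribute('href');
}

print_r($links);
```

The third anchor has no href, so the XPath expression skips it. Unlike a regex, this does not depend on how the attributes are quoted or ordered.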
I managed to get this far: <?php include "Snoopy.class.php"; $snoopy = new Snoopy; $snoopy->fetchlinks("http://www.msn.com/"); print $snoopy->results; var_dump($snoopy->results); ?> Now I get the links in my browser, but they all look like this: [1]=> string(42) "http://www.example.com/example" I guess the [] is the index number in the array and the () is the number of characters or something like that. I tried to put the array $snoopy->results, which should include all these links, in a database, but that didn't seem to work. I thought that I must put them into a database to be able to follow the links and crawl the rest of the site.
OK, so will the [1]=> string(42) part disappear if I manage to put the array in a database? I mean, that is just information about the array, not the array itself, isn't it? And do I have to create a new variable or something for $snoopy->results? When I put it in a database, it seemed to be empty.
That returns the word "array" and a link, http://example.com, like this: arrayhttp://example.com I just wonder why I get the word "array" when it should return what's stored in the array at index 0.
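A small sketch of what is probably going on here, using two sample URLs in place of $snoopy->results (an assumption). var_dump() adds the index, type and length annotations for debugging only, and printing a whole array yields the literal word "Array", so individual elements have to be addressed by index or in a loop.

```php
<?php
// Sample data standing in for $snoopy->results (an assumption).
$links = ['http://www.example.com/example', 'http://www.example.com/other'];

// var_dump() is a debugging aid: it prints each index, type and length
// alongside the value. None of that is stored in the array itself.
var_dump($links);

// "print $links;" would output the word "Array" (plus a warning),
// not the contents, because an array cannot be printed as a string.

// The elements themselves are plain strings, reachable by index:
$first = $links[0];

// ...or one by one in a loop.
$lines = [];
foreach ($links as $index => $url) {
    $lines[] = $index . ': ' . $url;
}
echo implode("\n", $lines), "\n";
```

If the script still contains a print of the whole array before the print of index 0, that would explain the two pieces of output running together.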
Yes, you're right. Now I only get the URL. Below is the code as it is now. The problem is that it doesn't write anything to the MySQL database. If I try with just a word instead of a variable it works fine, so I guess something is wrong with the syntax, or I need to do something more to get it to work. <?php include "Snoopy.class.php"; $snoopy = new Snoopy; $snoopy->fetchlinks("http://www.msn.com/"); $links = $snoopy->results; $lankstring = implode(',' , $links); echo $linkstring.'<br />'; $con = mysql_connect("localhost","root",""); if (!$con) { die('Could not connect: ' . mysql_error()); } mysql_select_db("one", $con); mysql_query("CREATE TABLE testtable( text VARCHAR(100), INDEX (text) )") or die(mysql_error()); mysql_query ("INSERT INTO testtable(text) VALUES ('$linkstring')"); mysql_close($con); ?>
How would one go about using the DOMDocument and XPath functions to get all the links from a page? I'd never heard about that before.
organicCyborg, here is a good tutorial on how to get links from a web page with XPath and DOMDocument.
theblackjacker, take a look at that tutorial too. It uses another technique, but it is almost what you want, I suppose.
About your code:
1. Personally, I would not create the table each time I run the script. Just create it once and then only run the insert queries.
2. You have set the field length to 100 characters with VARCHAR(100). Are you sure that is enough (100 characters, not 100 lines)?
3. I'm not a big specialist in SQL, but I usually write insert statements like this: $q = "INSERT INTO testtable SET `text` = '" . $linkstring . "'";
Also, 'text' may be a reserved keyword in MySQL, so I advise changing it to something different, like linktext or links.
Yes, I got an error message, and it seems the problem is one of the links; maybe it's too long or uses some weird symbols. I'm also going to change the tables later, but first I want to fix this problem with the database. Perhaps I should use TEXT instead of VARCHAR, as you said. I tried this: mysql_query("CREATE TABLE testtable( thetext text(600), INDEX (thetext) )"); but then I get this error message: BLOB/TEXT column 'thetext' used in key specification without a key length
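That error comes up because MySQL can only index a TEXT or BLOB column when the index names a prefix length. A sketch of the adjusted statement follows. It only builds and prints the SQL rather than connecting to a server, and the prefix length of 100 is an arbitrary choice for illustration.

```php
<?php
// Sketch only: builds the statement as a string and prints it; it does
// not connect to MySQL. The prefix length of 100 is an arbitrary choice.
$prefixLength = 100;

// "thetext(100)" inside INDEX tells MySQL to index only the first
// 100 characters of the TEXT column, which is the "key length" the
// error message asks for.
$sql = "CREATE TABLE testtable (
    thetext TEXT,
    INDEX (thetext($prefixLength))
)";

echo $sql, "\n";
```

Dropping the index altogether is another option if the table is only written to and never searched by link.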
As I've said, I'm not a great specialist in MySQL, so I have never made indexes on text columns. I usually have an id column which is the primary key and has an index on it. I usually create tables in phpMyAdmin, so I don't know the exact syntax for creating tables. If you get an error message, please show it here so we can help. Maybe you should use the mysql_real_escape_string function when inserting into the database.
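The suggestions above (an id primary key, creating the table once, escaping the values) can be sketched roughly as follows. Two things here are swapped in and are not what the thread used: PDO with a prepared statement replaces the mysql_* functions, which were removed in PHP 7, and an in-memory SQLite database stands in for MySQL so the snippet runs without a server (this assumes the pdo_sqlite extension). The table name links and the sample URLs are made up for illustration.

```php
<?php
// Assumption: the pdo_sqlite extension is available. An in-memory SQLite
// database stands in for the MySQL database "one" from the thread.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Create the table only if it is missing; an integer id is the primary key.
$db->exec('CREATE TABLE IF NOT EXISTS links (
    id  INTEGER PRIMARY KEY,
    url TEXT NOT NULL
)');

// Sample links in place of $snoopy->results, one with an awkward quote.
$links = [
    'http://www.example.com/example',
    "http://www.example.com/search?q=it's&page=2",
];

// One row per link. The placeholder keeps quotes in a URL from breaking
// the SQL, which is the job mysql_real_escape_string did in the old API.
$insert = $db->prepare('INSERT INTO links (url) VALUES (:url)');
foreach ($links as $link) {
    $insert->execute([':url' => $link]);
}

// Read the rows back to confirm they were stored unchanged.
$stored = $db->query('SELECT url FROM links ORDER BY id')
             ->fetchAll(PDO::FETCH_COLUMN);
print_r($stored);
```

For MySQL the connection line would become something like new PDO('mysql:host=localhost;dbname=one', 'root', '') and the id column would typically be INT AUTO_INCREMENT PRIMARY KEY. Storing one link per row also sidesteps the length problem of a single comma-joined string. Separately, the script posted earlier assigns to $lankstring but inserts $linkstring, which on its own would leave the inserted value empty.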