Super simple PHP Crawler

Discussion in 'PHP' started by theblackjacker, Oct 17, 2009.

  1. #1
    Hi!

    I have had some problems trying to make a crawler in PHP. The code I have found on websites through Google doesn't seem to work for some reason.

    So I would just like an example of the simplest possible crawler in PHP, where I can copy and paste the code directly into a document and make it work, and then try to make it more advanced myself. I just want it to go to a website, crawl to another website, and print something in my browser when I run the code.
     
    theblackjacker, Oct 17, 2009 IP
  2. AsHinE

    AsHinE Well-Known Member

    Messages:
    240
    Likes Received:
    8
    Best Answers:
    1
    Trophy Points:
    138
    #2
    I guess the simplest crawler is something like this:

    PHP:
    $page = file_get_contents("http://example.com");

    Anyway, this is a good start.
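
    One thing to watch with this one-liner is that file_get_contents() returns false when the download fails. A minimal sketch of a wrapper that makes the failure explicit follows; the name fetchPage is made up for illustration and is not part of PHP or of this thread.

```php
<?php
// Hypothetical helper (not from the thread): fetch a URL and return its body,
// or null if the request fails. http:// URLs need allow_url_fopen enabled.
function fetchPage(string $url): ?string
{
    // file_get_contents() returns false on failure and emits a warning;
    // the @ operator suppresses the warning so the caller can handle the failure.
    $body = @file_get_contents($url);
    return $body === false ? null : $body;
}

// Usage (performs a real HTTP request, so it is left commented out here):
// $page = fetchPage('http://example.com');
// echo $page === null ? 'Fetch failed' : htmlspecialchars($page);
```

    Returning null instead of false is only a design choice for this sketch; it lets the caller write a plain null check.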
     
    AsHinE, Oct 17, 2009 IP
  3. theblackjacker

    theblackjacker Peon

    Messages:
    52
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #3
    Thanks.

    So what exactly gets stored in the $page variable when I do this?

    I mean, if I want to print something from the example website so I can see it in my browser when I run the script, what is the easiest possible way to do that?
     
    theblackjacker, Oct 17, 2009 IP
  4. premiumscripts

    premiumscripts Peon

    Messages:
    1,062
    Likes Received:
    48
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Why don't you test it and see for yourself? What is saved in the variable is the entire contents (the entire HTML) of the destination page. You can also read about this at http://php.net/file_get_contents. If you simply want to show a portion of the HTML, you will have to extract it with a regex (preg_match and so on).
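
    As a rough illustration of the preg_match approach, the sketch below pulls the page title out of the HTML. The HTML is a hard-coded stand-in for what file_get_contents() would return, and the pattern is only one possible choice, not a general HTML parser.

```php
<?php
// Stand-in for $page = file_get_contents("http://example.com");
// a literal string is used so the sketch runs without network access.
$page = '<html><head><title>Example Domain</title></head>'
      . '<body><h1>Hello</h1></body></html>';

// Capture whatever sits between <title> and </title>.
// i = case-insensitive, s = let . match newlines; .*? is non-greedy.
if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $page, $matches)) {
    $title = trim($matches[1]);
} else {
    $title = '(no title found)';
}

// Escape before printing so any markup in the title shows up as text.
echo htmlspecialchars($title), "\n";
```

    The first captured group ends up in $matches[1]; $matches[0] holds the whole matched tag.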
     
    premiumscripts, Oct 17, 2009 IP
  5. theblackjacker

    #5
    Now I made it work, thanks.

    And to crawl an entire site, I guess I need to loop this function and make it follow internal links?
     
    Last edited: Oct 17, 2009
    theblackjacker, Oct 17, 2009 IP
  6. AsHinE

    #6
    Yes, you are right. In short, you need to get all the links from that page and decide whether to download each of them.
    Take a look at the Snoopy PHP class; it has some great features for your task.
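
    The loop described here could look roughly like the sketch below: keep a queue of URLs to fetch, remember which ones were already queued, and only follow links on the same host. The function crawl and both callbacks are hypothetical names for illustration; the "site" is an in-memory array so the sketch runs without real HTTP requests.

```php
<?php
// Hypothetical crawl loop: $fetch maps a URL to its HTML (or null on failure),
// $extractLinks maps HTML to a list of absolute URLs. Both are assumptions.
function crawl(string $startUrl, callable $fetch, callable $extractLinks, int $maxPages = 50): array
{
    $host    = parse_url($startUrl, PHP_URL_HOST);
    $queue   = [$startUrl];          // URLs waiting to be fetched
    $seen    = [$startUrl => true];  // URLs already queued, to avoid loops
    $visited = [];                   // URLs fetched successfully, in order

    while ($queue && count($visited) < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        if ($html === null) {
            continue; // skip pages that could not be downloaded
        }
        $visited[] = $url;

        foreach ($extractLinks($html) as $link) {
            // Follow internal links only, and queue each URL only once.
            if (parse_url($link, PHP_URL_HOST) === $host && !isset($seen[$link])) {
                $seen[$link] = true;
                $queue[]     = $link;
            }
        }
    }
    return $visited;
}

// Usage with an in-memory "site" standing in for real HTTP requests.
$site = [
    'http://example.com/'  => '<a href="http://example.com/a">a</a> <a href="http://other.test/">x</a>',
    'http://example.com/a' => '<a href="http://example.com/">home</a>',
];
$fetch = function (string $url) use ($site) {
    return $site[$url] ?? null;
};
$extractLinks = function (string $html): array {
    preg_match_all('/href="([^"]+)"/', $html, $m);
    return $m[1];
};

$pages = crawl('http://example.com/', $fetch, $extractLinks);
print_r($pages);
```

    The $maxPages limit is there so a sketch like this cannot run forever on a large site; a real crawler would probably also pause between requests.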
     
    AsHinE, Oct 17, 2009 IP
  7. premiumscripts

    #7
    Yes, you'll again have to use a regex to gather all the links from the $page variable, which contains the HTML contents. You can use http://regexlib.com or a similar site to find such a regex.
     
    premiumscripts, Oct 17, 2009 IP
  8. AsHinE

    #8
    It is not necessary to use a regex; you can try DOMDocument and XPath queries to fetch all the links from the HTML.
    Besides, AFAIK Snoopy has a method to fetch all links from a downloaded page.
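
    A minimal sketch of the DOMDocument and XPath route, assuming the page HTML is already in a string (hard-coded here as a stand-in for the downloaded page) and that PHP's dom extension is available:

```php
<?php
// Stand-in for the downloaded page; in the thread this would be $page.
$html = '<html><body>'
      . '<a href="http://example.com/one">One</a>'
      . '<a href="/two">Two</a>'
      . '<a name="anchor-only">No href</a>'
      . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// //a[@href] selects every <a> element that has an href attribute.
foreach ($xpath->query('//a[@href]') as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($links);
```

    Note that relative links such as /two come back exactly as written in the page, so they would still need to be resolved against the page URL before they can be fetched.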
     
    AsHinE, Oct 18, 2009 IP
  9. premiumscripts

    #9
    Didn't know that about Snoopy, and yeah, using DOMDocument is probably a better way.
     
    premiumscripts, Oct 18, 2009 IP
  10. theblackjacker

    #10
    I managed to go this far:

    <?php
    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->fetchlinks("http://www.msn.com/");
    print $snoopy->results;

    var_dump($snoopy->results);
    ?>

    Now I get the links in my browser, but they all look like this: [1]=> string(42) "http://www.example.com/example"

    I guess the [] is the index number in the array and the () is the number of characters or something like that.

    I tried to put the array $snoopy->results, which should include all these links, into a database, but that didn't seem to work.

    I thought that I must put them into a database to be able to follow the links and crawl the rest of the site.
     
    theblackjacker, Oct 18, 2009 IP
  11. AsHinE

    #11
    $snoopy->results is an array; that's why a plain print $snoopy->results didn't work.
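
    To illustrate: in var_dump output, [1]=> is the array index and string(42) is the length of the string in bytes, so that notation is debug information rather than part of the data. A small sketch of printing the elements themselves follows; the $links array and its URLs are made up to stand in for $snoopy->results.

```php
<?php
// Stand-in for $snoopy->results after fetchlinks(); the URLs are made up.
$links = ['http://example.com/', 'http://example.com/about'];

// print/echo turn an array into the literal string "Array" (with a notice
// or warning, depending on the PHP version), so loop over the elements instead.
foreach ($links as $index => $link) {
    echo $index, ': ', $link, "\n";
}

// Or join the elements into one string first.
$joined = implode("\n", $links);
echo $joined, "\n";
```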
     
    AsHinE, Oct 18, 2009 IP
  12. theblackjacker

    #12
    OK, so will the [1]=> string(42) part disappear if I manage to put the array in a database? I mean, that is just information about the array, not the array itself, isn't it?

    And do I have to create a new variable or something for $snoopy->results?

    I mean, when I put it in a database, it seemed to be empty.
     
    theblackjacker, Oct 18, 2009 IP
  13. AsHinE

    #13
    Try this:

    PHP:
    $links = $snoopy->results;
    echo $links[0];

    and see what you get.
     
    AsHinE, Oct 18, 2009 IP
  14. theblackjacker

    #14
    That returns the word "array" and a link, http://example.com

    Like this: arrayhttp://example.com

    I just wonder why I get the word "array" when it should return what's stored at index 0 of the array.
     
    theblackjacker, Oct 18, 2009 IP
  15. AsHinE

    #15
    You are probably getting "array" because of this line:
    print $snoopy->results;
    Comment it out or delete it.
     
    AsHinE, Oct 18, 2009 IP
  16. theblackjacker

    #16
    Yes, you're right. Now I only get the URL :)

    OK, below is the code as it is now. The problem is that it doesn't write anything to the MySQL database. If I try with just a word instead of a variable it works fine, so I guess either something is wrong with the syntax or I need to do something more to get it to work.

    <?php
    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    // Fetch every link found on the page.
    $snoopy->fetchlinks("http://www.msn.com/");
    $links = $snoopy->results;

    // Join all links into one comma-separated string.
    $linkstring = implode(',', $links);
    echo $linkstring . '<br />';

    // Connect to the local MySQL server.
    $con = mysql_connect("localhost", "root", "");
    if (!$con) {
        die('Could not connect: ' . mysql_error());
    }

    mysql_select_db("one", $con);

    // Create the table (this runs every time the script is executed).
    mysql_query("CREATE TABLE testtable(
        text VARCHAR(100), INDEX (text)
    )")
    or die(mysql_error());

    // Store the comma-separated links.
    mysql_query("INSERT INTO testtable(text) VALUES ('$linkstring')");

    mysql_close($con);
    ?>
     
    theblackjacker, Oct 18, 2009 IP
  17. organicCyborg

    organicCyborg Peon

    Messages:
    330
    Likes Received:
    8
    Best Answers:
    0
    Trophy Points:
    0
    #17
    How would one go about using DOMDocument and XPath functions to get all the links from a page?

    I'd never heard about that before.
     
    organicCyborg, Oct 18, 2009 IP
  18. AsHinE

    #18
    organicCyborg, here is a good tutorial on how to get links from a web page with XPath and DOMDocument.

    theblackjacker, take a look at that tutorial too. It uses another technique, but it is almost what you want, I suppose.
    About your code:
    1. Personally, I would not create the table each time the script runs. Create it once and then just run insert queries.
    2. You have set the field length to 100 symbols with this: text VARCHAR(100)
    Are you sure that is enough (100 symbols, not 100 lines)?
    3. I'm not a big specialist in SQL, but I usually write insert statements like this:

    PHP:
    $q = "INSERT INTO testtable SET `text` = '" . $linkstring . "'";

    Also, 'text' is a keyword in MySQL (it is a column type), so I advise changing the column name to something different, like linktext or links.
     
    AsHinE, Oct 18, 2009 IP
  19. theblackjacker

    #19
    Yes, I got an error message, and it seems the problem is one of the links; maybe it's too long or uses some weird symbols. I'm also going to change the tables later, but first I want to fix this problem with the database.

    Perhaps I should use text instead of varchar, as you said. I tried this:

    mysql_query("CREATE TABLE testtable(
        thetext text(600), INDEX (thetext)
    )");

    but then I get this error message: BLOB/TEXT column 'thetext' used in key specification without a key length
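
    That error appears because MySQL can only index a TEXT column when the index is given a prefix length, written as column(N) inside the INDEX clause. A hedged sketch of such a statement is below; it is only built as a string and not executed, and the id column, the index name thetext_prefix and the length 255 are assumptions chosen for illustration.

```php
<?php
// Not executed here: this only builds the MySQL statement as a string.
// Assumption: indexing the first 255 characters is enough to tell links apart.
$createTable = <<<SQL
CREATE TABLE IF NOT EXISTS testtable (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    thetext TEXT,
    INDEX thetext_prefix (thetext(255))
)
SQL;

echo $createTable, "\n";
```

    The largest prefix MySQL accepts depends on the storage engine and the character set, so 255 may have to be lowered on some setups.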
     
    theblackjacker, Oct 19, 2009 IP
  20. AsHinE

    #20
    As I've said, I'm not a great specialist in MySQL, so I have never made indexes on text columns.
    I usually have an id column which is the primary key and has an index on it.

    I usually create tables in phpMyAdmin, so I don't know the exact syntax for creating tables.

    If you get an error message, please post it here so we can help. Maybe you should use the mysql_real_escape_string function when inserting into the database.
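
    For readers on current PHP: the mysql_* functions used in this thread were removed in PHP 7.0, and prepared statements cover the same escaping concern. The sketch below swaps in PDO with an in-memory SQLite database purely so it can run on its own (this assumes the pdo_sqlite driver is installed); the helper name saveLinks is hypothetical, and it stores one link per row rather than one comma-separated string.

```php
<?php
// Hypothetical helper: store each link in its own row via a prepared statement.
// Table and column names follow the thread (testtable, thetext).
function saveLinks(PDO $pdo, array $links): int
{
    $stmt  = $pdo->prepare('INSERT INTO testtable (thetext) VALUES (:link)');
    $saved = 0;
    foreach ($links as $link) {
        // The driver handles quoting, so quotes or odd symbols in a URL
        // cannot break the statement.
        $stmt->execute([':link' => $link]);
        $saved++;
    }
    return $saved;
}

// Assumption: an in-memory SQLite database keeps the sketch self-contained.
// For MySQL the DSN would look like 'mysql:host=localhost;dbname=one'.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE testtable (id INTEGER PRIMARY KEY, thetext TEXT)');

$count = saveLinks($pdo, ['http://example.com/', "http://example.com/?q=it's"]);
$rows  = $pdo->query('SELECT thetext FROM testtable ORDER BY id')->fetchAll(PDO::FETCH_COLUMN);
print_r($rows);
```

    With ERRMODE_EXCEPTION set, a failed query throws instead of failing silently, which makes a problem like an empty table easier to track down.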
     
    AsHinE, Oct 19, 2009 IP