[HELP] Web Crawler

Discussion in 'PHP' started by miniramen, May 20, 2010

  1. #1
    First of all, thanks for helping. I've been trying to code a web crawler and I found this tutorial:

    http://syntax.cwarn23.net/PHP/Making_a_search_engine

    It shows how to find the links of an entire website, not only those on a single webpage.

    I wanted to play around with it to get the data I want, so I tried, for example, pulling all the postal codes out of a website, but that didn't work at all.

    I don't want to copy and paste the entire code, but you can take a look at the link. All I did was modify the end a bit to see how I could play around with it.

    -----------------
    function generate($url) {
        global $f_data; // Data of file contents

        // Do something with webpage $f_data.
        $Regex = "/^[a-zA-Z]{1}[0-9]{1}[a-zA-Z]{1}(\-| |){1}[0-9]{1}[a-zA-Z]{1}[0-9]{1}$/";
        preg_match_all($Regex, $f_data, $pcode, PREG_PATTERN_ORDER);
        echo $pcode[0][0] . ", " . $pcode[0][1] . "\n";

        echo $url.'<br>';

        unset($f_data);
    }
    -----------------------------
    //I was trying to output the matched postal codes, but nothing was shown.

    Any help would be appreciated, I've been stuck on this for so many hours already =(!!!
    Again, thanks for any help :)
     
    miniramen, May 20, 2010
  2. mike.judd

    #2
    If you look closely, the regex for matching the zip code is wrong. The right regex should be ^(\d{5}-\d{4}|\d{5})$

    That's the regex for matching a US ZIP code. If you are in a different country, check whether the regex is right for your format first, before checking the other code. Hope that helps.
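
    A quick way to check the regex on its own, as suggested above, is to run it against a small sample string before wiring it into the crawler. The sketch below assumes Canadian-style postal codes (letter-digit-letter, optional space or hyphen, digit-letter-digit), which is what the pattern in the first post appears to target. The function name extract_postal_codes and the sample $page are hypothetical, not part of the tutorial. One thing to watch for: ^ and $ anchors make a pattern match only when the whole subject string is a single code, so for scanning a full page the sketch drops them and uses \b word boundaries instead.

```php
<?php
// Hypothetical helper (not from the tutorial): collect Canadian-style
// postal codes such as "K1A 0B1" found anywhere in a block of text.
function extract_postal_codes(string $text): array
{
    // No ^ or $ here: those would require the ENTIRE string to be one code.
    // \b keeps a match from starting or ending inside a longer word.
    $pattern = '/\b[A-Za-z][0-9][A-Za-z][ -]?[0-9][A-Za-z][0-9]\b/';
    preg_match_all($pattern, $text, $matches, PREG_PATTERN_ORDER);
    return $matches[0]; // full-pattern matches, possibly an empty array
}

// Made-up sample standing in for the downloaded page contents.
$page = '<p>Offices: K1A 0B1 (Ottawa) and m5v-3l9 (Toronto).</p>';

// The same pattern with anchors finds nothing in a whole page.
$anchored = '/^[A-Za-z][0-9][A-Za-z][ -]?[0-9][A-Za-z][0-9]$/';
echo preg_match_all($anchored, $page), "\n";             // prints: 0

echo implode(', ', extract_postal_codes($page)), "\n";   // prints: K1A 0B1, m5v-3l9
```

    Returning the whole match array and joining it with implode() also avoids reading $pcode[0][0] and $pcode[0][1] directly, which raises notices when fewer than two codes are found.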
     
    mike.judd, May 20, 2010
  3. roopajyothi

    #3
    Yep! That's the exact solution.
    Try what mike said! :)
     
    roopajyothi, May 22, 2010