Php crawler

Discussion in 'Programming' started by ssimon171078, Sep 27, 2014.

  1. #1
    I started to write a PHP crawler and I need to filter out some links. How can I do it? With preg_match_all?
    For example, I need to block https://www.website.com/user..

    My code:
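    <?php
    $page = file_get_contents('https://www.website.com/');

    // Split the page on every literal '<a href=' so each chunk starts at a link
    $newstr = preg_split("/<a href=/", $page);

    //print_r($newstr);
    $fh = fopen("file111.txt", "a");
    foreach ($newstr as $links)
    {
        // trim() returns the cleaned string, so its result has to be kept
        $links = trim(strip_tags($links));
        print ("$links <br/>");
        // One entry per line in the output file
        fwrite($fh, $links . PHP_EOL);
    }
    fclose($fh);
    ?>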

    Last edited by a moderator: Oct 6, 2014
    ssimon171078, Sep 27, 2014
  2. NetStar

    NetStar Notable Member

    Messages: 2,471 | Likes Received: 541 | Best Answers: 21 | Trophy Points: 245
    #2
    No. That is not how you do it. Google "PHP Simple HTML DOM Parser".
     
    NetStar, Sep 27, 2014
  3. Anveto

    Anveto Well-Known Member

    Messages: 697 | Likes Received: 40 | Best Answers: 19 | Trophy Points: 195
    #3
    So disregard NetStar :)

    And perhaps do this:


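    <?php
    $page = file_get_contents('https://www.website.com/');

    // Split the page on every literal '<a href=' so each chunk starts at a link
    $newstr = preg_split("/<a href=/", $page);

    //print_r($newstr);
    $fh = fopen("file111.txt", "a");
    foreach ($newstr as $links)
    {
        // Skip any chunk that contains the blocked URL prefix.
        // The trailing ".." in the question reads as an ellipsis, so only the prefix is matched.
        if (strpos($links, 'https://www.website.com/user') === false) {
            // trim() returns the cleaned string, so its result has to be kept
            $links = trim(strip_tags($links));
            print ("$links <br/>");
            // One entry per line in the output file
            fwrite($fh, $links . PHP_EOL);
        }
    }
    fclose($fh);
    ?>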

    Last edited by a moderator: Oct 6, 2014
    Anveto, Sep 28, 2014
  4. NetStar

    #4
    You can disagree with me. However, you can't disregard the use of a library for parsing HTML. A regular expression is not the right tool for extracting links. First, the code assumes that ALL links begin with "a href", contain no additional spaces, and sit on a single line, which is not always true. Sometimes other attributes (TITLE, NAME, etc.) come between the "A" and the HREF. Second, the posted code is NOT parsing all links.

    So again... Look into PHP Simple HTML DOM Parser or phpQuery.
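
    To make the point about missed links concrete, here is a minimal sketch. The sample markup is made up for illustration, and it uses PHP's bundled DOMDocument class rather than either library named above, so treat it as one possible comparison and not as the recommended setup.

```php
<?php
// Hypothetical sample markup: every <a> below is a valid link,
// but only the first one literally starts with '<a href='.
$html = '<p>
  <a href="/plain">plain</a>
  <a title="About us" href="/with-title">title first</a>
  <A HREF="/upper">upper case</A>
  <a
     href="/multi-line">split over two lines</a>
</p>';

// The approach from the first post: split on the literal string.
$chunks = preg_split('/<a href=/', $html);
$foundBySplit = count($chunks) - 1; // the first chunk is the text before any match

// The same markup read through PHP's bundled DOM extension.
$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
$doc->loadHTML($html);
libxml_clear_errors();

$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}

echo "preg_split found $foundBySplit link(s)\n";
echo 'DOMDocument found ' . count($hrefs) . " link(s)\n";
```

    The split only sees the first link, because the other three put an attribute, upper-case letters, or a line break between "a" and "href".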
     
    Last edited by a moderator: Oct 6, 2014
    NetStar, Sep 28, 2014
  5. Anveto

    #5
    He didn't ask whether he was parsing it correctly; he asked what the next step was, which I have answered.

    If he wants to do a better job of parsing the page, you could just recommend that he use PHP's DOMXPath class, which works very well. There is no need to mess with any of the solutions you listed.
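
    A minimal sketch of the DOMXPath route, assuming the goal is still to drop every link under https://www.website.com/user. The function name extractLinks, its parameters, and the sample markup are hypothetical and only serve the illustration.

```php
<?php
/**
 * Return the href of every <a> in $html, skipping those that start with $blockedPrefix.
 * The function name and parameters are illustrative, not from the thread.
 */
function extractLinks(string $html, string $blockedPrefix): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real pages are rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // '//a[@href]' selects every anchor that actually carries an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = trim($a->getAttribute('href'));
        if (strpos($href, $blockedPrefix) === 0) {
            continue; // blocked: the URL begins with the unwanted prefix
        }
        $links[] = $href;
    }
    return $links;
}

// Hypothetical page; in the crawler this would come from file_get_contents().
$html = '<a href="https://www.website.com/about">About</a>'
      . '<a class="u" href="https://www.website.com/user/42">Profile</a>'
      . '<a name="top">no href here</a>';

$result = extractLinks($html, 'https://www.website.com/user');
print_r($result);
```

    Only the "about" link survives here: the profile link is filtered by its prefix, and the named anchor has no href at all.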
     
    Anveto, Sep 28, 2014
  6. kutchbhi

    kutchbhi Active Member

    Messages: 130 | Likes Received: 4 | Best Answers: 2 | Trophy Points: 70
    #6
    Disregard both of them.
    Use QueryPath, which is a wrapper around PHP's DOMDocument and is superior in every way to Simple HTML DOM or regex.
    Simple HTML DOM uses only regex internally, so it's not really better than regex, plus it has nasty memory leak issues (sort of).
     
    kutchbhi, Sep 29, 2014
  7. NetStar

    #7
    So you took the time to provide an answer knowing that ultimately it wouldn't have served him any justice? Welp...that's terrible advice because these posts are archived for others to see.
     
    NetStar, Sep 29, 2014
  8. nitsanbn

    nitsanbn Active Member

    Messages: 382 | Likes Received: 4 | Best Answers: 0 | Trophy Points: 58
    #8
    I personally prefer regex for simple HTML search/replace.
    You can use this regex: <a[^>]*>[^\r\n]+</a> with the "ungreedy" modifier (look up PCRE pattern modifiers).
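
    A small sketch of what the ungreedy modifier changes for that expression. The sample line and the variable names are made up; the pattern itself is the one quoted above, wrapped in ~ delimiters.

```php
<?php
// Hypothetical one-line snippet with two links on the same line.
$html = '<p><a href="/a">One</a> and <a href="/b">Two</a></p>';

$pattern = '<a[^>]*>[^\r\n]+</a>'; // the expression from the post

// Default (greedy) matching: [^\r\n]+ runs on to the last </a> on the line.
preg_match_all('~' . $pattern . '~i', $html, $greedy);

// With the U modifier the quantifiers become lazy and stop at the first </a>.
preg_match_all('~' . $pattern . '~iU', $html, $ungreedy);

print_r($greedy[0]);
print_r($ungreedy[0]);
```

    Without U, both links come back as one long match; with U, each link is matched separately. Note that <a[^>]*> also matches other tags whose names begin with "a", such as <abbr>, so this stays a quick search tool rather than a parser.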

    A strict parser that is not based on regex can fail on an improper HTML document because of HTML errors (forgotten closing tags, typos, closing tags in the wrong order, etc.). That's why regex is better for a simple search/replace.
    If you are looking for anything beyond simple search/replace, you should consider a proper HTML parser.

    Good luck!
     
    nitsanbn, Oct 6, 2014
  9. seductiveapps.com

    seductiveapps.com Active Member

    Messages: 200 | Likes Received: 6 | Best Answers: 0 | Trophy Points: 60
    #9
    It depends on what you're crawling for and which site you're crawling, but preg_match_all() (search for it on php.net) is an excellent way of getting the info you need at minimal CPU cost.
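
    For the original question, a rough sketch of how preg_match_all() could collect the href values and then drop the blocked ones. The sample page, the pattern, and the variable names are assumptions for illustration; the pattern only handles quoted href values and is not a full HTML parser.

```php
<?php
// Hypothetical page content; the crawler would fetch this with file_get_contents().
$page = '<a href="https://www.website.com/">Home</a>' . "\n"
      . '<a title="Me" href="https://www.website.com/user/simon">Profile</a>' . "\n"
      . "<a href='https://www.website.com/news'>News</a>";

// Capture the quoted href value of each <a ...> tag.
// i = case-insensitive, s = let . cross line breaks inside a tag.
preg_match_all('~<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1~is', $page, $matches);

$blocked = 'https://www.website.com/user'; // prefix to drop, taken from the question

$kept = array_values(array_filter(
    $matches[2],
    function ($url) use ($blocked) {
        return strpos($url, $blocked) !== 0; // keep URLs that do not start with it
    }
));

print_r($kept);
```

    $matches[2] holds the captured URLs, and the filter keeps the home and news links while dropping the profile link.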
     
    seductiveapps.com, Nov 16, 2014