crawler

Discussion in 'PHP' started by ssimon171078, May 24, 2015.

  1. #1
    I wrote a PHP crawler script. How can I find all the links on a web page that I need to parse?
    My code is:
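    <?php
    // Parser for website (eBay) domain names
    // file_get_contents() needs a scheme (http:// or https://) for remote URLs
    $website = "http://www.example.com";

    $filename = "services4.txt";
    $fd = fopen($filename, "a+");

    $content = file_get_contents($website);
    $dom = new DOMDocument;
    @$dom->loadHTML($content); // @ silences warnings from malformed real-world HTML
    $links = $dom->getElementsByTagName("a");
    foreach ($links as $link)
    {
        $href = rtrim($link->getAttribute("href"));
        // strpos() returns 0 for a match at position 0, so compare against false
        if (strpos($href, "listings") !== false) {
            // Assumption: the undefined $link_web was meant to be the site prefix
            fwrite($fd, $website . $href . "\n");
        }
    }

    fclose($fd);
    ?>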
    PHP:

     
    ssimon171078, May 24, 2015
  2. davidokedion

    #2
    Use regex to grab the links.
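    As a rough illustration of this suggestion, here is a minimal sketch that pulls href values out of anchor tags with preg_match_all. The sample markup in $html is made up for the example, and the pattern is deliberately simple: it only handles quoted attribute values and will miss or mis-match edge cases that a real HTML parser handles.

```php
<?php
// Hypothetical sample markup; a real crawler would fetch this with file_get_contents().
$html = '<a href="/listings/1">One</a> <a class="x" href=\'/about\'>About</a> <A HREF="/listings/2">Two</A>';

// Capture the href value of each <a> tag.
// Group 1 is the opening quote (single or double), group 2 is the URL,
// and \1 requires the same quote character to close the value.
preg_match_all('/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/i', $html, $matches);

$hrefs = $matches[2];
print_r($hrefs);
```

    The captured URLs end up in $matches[2], one entry per matched anchor.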
     
    davidokedion, Jul 29, 2015
  3. EricBruggema

    #3
    Don't use regex, as regex is very complex and slow. Parsing the HTML is way more secure and faster. But to fetch all links, you should consider both the href="" and src="" attributes.
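    A minimal sketch of that approach, assuming the page has already been fetched into a string: DOMDocument parses the markup and a DOMXPath query selects every href and src attribute, whatever element it sits on. The sample markup in $html is invented for the example.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><head><link rel="stylesheet" href="/style.css"></head>'
      . '<body><a href="/listings/1">One</a><img src="/logo.png" alt="">'
      . '<script src="/app.js"></script></body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect errors quietly
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$urls = [];
// Select every href or src attribute node in the document.
foreach ($xpath->query('//@href | //@src') as $attr) {
    $urls[] = $attr->nodeValue;
}
print_r($urls);
```

    Narrowing the query to '//a/@href' would restrict the result to anchor links only.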
     
    EricBruggema, Aug 7, 2015
  4. deathshadow

    #4
    You already seem to be pulling all the links, so what are you even asking?!?

    Or are you wanting to filter out just the links that point to the same domain? If so, use parse_url:
    http://php.net/manual/en/function.parse-url.php

    If you filter by PHP_URL_HOST and the result is either empty or matches the domain you are parsing, it's likely a document on the same site. You may also want to check whether a <base> tag is present and use its value when PHP_URL_HOST is missing.
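    The host check could be sketched roughly as follows. The helper name isSameSite and the sample URLs are made up for illustration. It reads the host from the full parse_url() result rather than passing PHP_URL_HOST, so it can also reject schemes such as mailto: that have no host but are not pages. <base> handling is left out of this sketch.

```php
<?php
/**
 * Return true when $href looks like it points to the same site as $siteHost.
 * A missing host means the URL is relative, so it is treated as internal.
 * (isSameSite is a hypothetical helper, not a built-in PHP function.)
 */
function isSameSite(string $href, string $siteHost): bool
{
    $parts = parse_url($href);
    if ($parts === false) {
        return false; // seriously malformed URL
    }
    $scheme = strtolower($parts['scheme'] ?? '');
    if ($scheme !== '' && $scheme !== 'http' && $scheme !== 'https') {
        return false; // mailto:, javascript:, tel: and similar are not pages
    }
    $host = $parts['host'] ?? '';
    return $host === '' || strcasecmp($host, $siteHost) === 0;
}

$hrefs = ['/listings/1', 'http://www.example.com/about', 'http://other.example.org/', 'page.html'];
$internal = array_values(array_filter($hrefs, function ($h) {
    return isSameSite($h, 'www.example.com');
}));
print_r($internal);
```

    Note that this compares hosts exactly (ignoring case), so example.com and www.example.com count as different sites unless you normalise them first.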

    I would also stick to just parsing href on anchors, since things like the HREF on LINK tags or the SRC on IMG tags should NOT point to content documents.

    Of course, if the site being parsed relies on JavaScript to generate its links, you're pretty much out of luck trying to deal with that.
     
    deathshadow, Aug 7, 2015