How to do this with Regular Expressions?

Discussion in 'PHP' started by Geraldm, Apr 2, 2007.

  1. #1
    Hi,

    Let's say I have an HTML file containing A tags like this:

    <a *anything here* href="http://www.url.com" *anything here*>*anything here*</a>
    <a *anything here* href="folder/somepage.html" *anything here*>*anything here*</a>
    <a *anything here* href="../somepage.html" *anything here*>*anything here*</a>
    <a *anything here* href="somepage.php?test=43" *anything here*>*anything here*</a>
    HTML:
    My question is: how can I use regular expressions to extract the URL from each of the tags above? I would like to be able to produce a list of absolute URLs like this (including the domain etc.):

    http://www.url.com
    http://www.url.com/folder/somepage.html
    http://www.url.com/somepage.html
    http://www.url.com/somepage.php?test=43
    Code (markup):
    Is this easy to do in PHP using Regular Expressions?
    Also, is there a way to grab the anchor text for each link as well?

    Cheers ...
    Gerald. :eek:
     
    Geraldm, Apr 2, 2007
  2. bibel

    #2
    $reg='/<a(.*)href="(.*)"(.*)>(.*)<\/a>/sU';

    preg_match_all($reg,$file_contents,$out);

    Just look at what you have in $out after this. It should be all you need.
     
    bibel, Apr 2, 2007
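
To make "look at what you have in $out" concrete, here is a minimal sketch of how preg_match_all fills the result array with its default flag, PREG_PATTERN_ORDER. The sample markup and the variable names $urls and $anchors are assumptions made up for illustration; the pattern is the one posted above.

```php
<?php
// Sample input modelled on the question; the attribute text is invented.
$file_contents = '
<a class="x" href="http://www.url.com" id="a1">Home</a>
<a class="x" href="../somepage.html" id="a2">Up one level</a>';

$reg = '/<a(.*)href="(.*)"(.*)>(.*)<\/a>/sU';
preg_match_all($reg, $file_contents, $out);

// Default flag is PREG_PATTERN_ORDER: one sub-array per capture group.
// $out[0] holds the full matches, $out[2] the href values,
// and $out[4] the anchor text.
$urls    = $out[2];
$anchors = $out[4];

foreach ($urls as $i => $url) {
    echo $url . ' => ' . $anchors[$i] . "\n";
}
```

So the same index in $out[2] and $out[4] refers to the same link.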
  3. Geraldm

    #3
    Hi,

    I did the following as a test:

    <?php
    $file_contents = '
    <a *anything here* href="http://www.url.com" *any1*>*here1*</a>
    <a *anything here* href="folder/somepage.html" *any2*>*here2*</a>
    <a *anything here* href="../somepage.html" *any3*>*here3*</a>
    <a *anything here* href="somepage.php?test=43" *any4*>*here4*</a>';
    
    $reg= '/<a(.*)href="(.*)"(.*)>(.*)<\/a>/sU';
    preg_match_all($reg,$file_contents,$out);
    
    foreach ($out as $val) {
        echo "part 1: " . $val[0] . " <br>\n";
        echo "part 2: " . $val[1] . "<br>\n";
        echo "part 3: " . $val[3] . "<br>\n";
        echo "part 4: " . $val[4] . "<br><br>\n\n";
    }
    PHP:
    But the output I get is this:
    part 1: <a *anything here* href="http://www.url.com" *any1*>*here1*</a> <br>
    part 2: <a *anything here* href="folder/somepage.html" *any2*>*here2*</a><br>
    part 3: <a *anything here* href="somepage.php?test=43" *any4*>*here4*</a><br>
    part 4: <br><br>
    
    part 1:  *anything here*  <br>
    part 2:  *anything here* <br>
    part 3:  *anything here* <br>
    part 4: <br><br>
    
    part 1: http://www.url.com <br>
    part 2: folder/somepage.html<br>
    part 3: somepage.php?test=43<br>
    part 4: <br><br>
    
    part 1:  *any1* <br>
    part 2:  *any2*<br>
    part 3:  *any4*<br>
    part 4: <br><br>
    
    part 1: *here1* <br>
    part 2: *here2*<br>
    part 3: *here4*<br>
    part 4: <br><br>
    Code (markup):
    For some reason it's not picking up the third one:
    <a *anything here* href="../somepage.html" *any3*>*here3*</a>
    
    Code (markup):
    Any ideas?

    Regards,
    Gerald.
     
    Geraldm, Apr 2, 2007
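
The third link seems to vanish in the test above because, with the default PREG_PATTERN_ORDER, `foreach ($out as $val)` walks over the capture groups rather than over the links. The loop then prints indexes 0, 1, 3 and 4 of each group, so index 2 (the third link) is never echoed and index 4 does not exist. A minimal sketch of one way around this, assuming the same pattern and test data, is to pass PREG_SET_ORDER so that each loop iteration is one link. The names $matches and $links are illustrative.

```php
<?php
$file_contents = '
<a *anything here* href="http://www.url.com" *any1*>*here1*</a>
<a *anything here* href="folder/somepage.html" *any2*>*here2*</a>
<a *anything here* href="../somepage.html" *any3*>*here3*</a>
<a *anything here* href="somepage.php?test=43" *any4*>*here4*</a>';

$reg = '/<a(.*)href="(.*)"(.*)>(.*)<\/a>/sU';

// PREG_SET_ORDER gives one entry per link instead of one per capture group.
preg_match_all($reg, $file_contents, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    // $m[2] is the href value, $m[4] is the anchor text.
    $links[] = ['url' => $m[2], 'anchor' => $m[4]];
    echo $m[2] . ' | ' . $m[4] . "\n";
}
```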
  4. nico_swd

    #4
    // [^>]* keeps the match inside one opening tag; .*? stops at the first </a>.
    preg_match_all('/<a\s[^>]*href="([^"]+)"[^>]*>.*?<\/a>/si', $text, $urls);

    echo '<pre>' . print_r($urls[1], true) .'</pre>';
    PHP:
    This works for me.
     
    nico_swd, Apr 2, 2007
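
One thing to watch with the `s` modifier: a greedy `.+` placed before `href` can run across line breaks and swallow several links as a single match, so only the last URL is captured. The sketch below compares a greedy pattern with a tag-bounded one on made-up sample markup. The variable names are illustrative only.

```php
<?php
$text = '
<a href="http://www.url.com">one</a>
<a href="folder/somepage.html">two</a>';

// Greedy: with the s modifier, ".+" also matches newlines, so everything
// from the first "<a" to the last href is consumed as one match.
preg_match_all('/<a.+href="([^"]+)"[^>]*>.+<\/a>/si', $text, $greedy);

// Tag-bounded: "[^>]*" cannot leave the opening tag,
// and the lazy ".*?" stops at the first closing </a>.
preg_match_all('/<a\s[^>]*href="([^"]+)"[^>]*>.*?<\/a>/si', $text, $bounded);

print_r($greedy[1]);   // a single entry: the last href
print_r($bounded[1]);  // one entry per link
```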
  5. Geraldm

    #5
    Thanks people for all your help!!! :)
    I've had a play and this does what I want:

    <?php
    $file_contents = '
    <a *anything here* href="http://www.url.com" *any1*>*here1*</a>
    <a *anything here* href="folder/somepage.html" *any2*>*here2*</a>
    <a *anything here* href="../somepage.html" *any3*>*here3*</a>
    <a *anything here* href="somepage.php?test=43" *any4*>*here4*</a>';
    
    $reg= '/<a(.*)href="(.*)"(.*)>(.*)<\/a>/sU';
    preg_match_all($reg,$file_contents,$out);
    $result = count($out[0]);
    echo 'Count: ' . $result . '<br><br>';
    echo '<strong>URLs:</strong><br>';
    foreach ($out[2] as $val) {
        echo '<br>' . $val;
    }
    echo '<br><br><strong>Anchors:</strong><br>';
    foreach ($out[4] as $val) {
        echo '<br>' . $val;
    }
    ?>
    PHP:
    Output:
    Count: 4
    
    URLs:
    
    http://www.url.com
    folder/somepage.html
    ../somepage.html
    somepage.php?test=43
    
    Anchors:
    
    *here1*
    *here2*
    *here3*
    *here4*
    Code (markup):
    Thanks for all your help !!!!!! Green Rep for all of you! :D
     
    Geraldm, Apr 2, 2007
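
The output above still lists the href values exactly as written, while the original question asked for absolute URLs including the domain. The following is a rough sketch of one way to resolve them. `resolve_url` is a hypothetical helper, not a PHP built-in, and the base URL `http://www.url.com/` is an assumption taken from the expected output in the first post. It is deliberately simplified: ports, fragments, protocol-relative URLs and trailing slashes are not handled.

```php
<?php
// Hypothetical helper (not part of PHP): resolve $rel against $base.
function resolve_url(string $base, string $rel): string
{
    // Already absolute (starts with a scheme such as "http:")? Return as is.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $rel)) {
        return $rel;
    }

    $parts    = parse_url($base);
    $origin   = $parts['scheme'] . '://' . $parts['host'];
    $basePath = $parts['path'] ?? '/';

    // Split off the query string so it is not treated as a path segment.
    $query = '';
    $qpos  = strpos($rel, '?');
    if ($qpos !== false) {
        $query = substr($rel, $qpos);
        $rel   = substr($rel, 0, $qpos);
    }

    if ($rel !== '' && $rel[0] === '/') {
        $path = $rel;                      // root-relative link
    } else {
        // Relative to the directory part of the base path.
        $dir  = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path = $dir . $rel;
    }

    // Collapse "." and ".." segments; ".." above the root is ignored.
    $segments = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($segments);
        } elseif ($seg !== '.' && $seg !== '') {
            $segments[] = $seg;
        }
    }
    return $origin . '/' . implode('/', $segments) . $query;
}

// Assumed location of the page the links were found on.
$base  = 'http://www.url.com/';
$hrefs = ['http://www.url.com', 'folder/somepage.html',
          '../somepage.html', 'somepage.php?test=43'];

$absolute = [];
foreach ($hrefs as $href) {
    $absolute[] = resolve_url($base, $href);
}
print_r($absolute);
```

In the script above, the same call could be applied to each value of $out[2] before echoing it.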
  6. Felu

    #6
    You can download the Windows Script 5.6 Documentation from Microsoft to learn regular expressions.
    http://www.microsoft.com/downloads/details.aspx?familyid=01592C48-207D-4BE1-8A76-1C4099D7BBB9&displaylang=en
    Code (markup):
     
    Felu, Apr 3, 2007