How To Parse a String For All URLs?

Discussion in 'PHP' started by fireflyproject, Jun 8, 2008.

  1. #1
    Hey guys,

    Let's say I have a string with a butt load of HTML in it. What would be the most efficient way to parse through it and find all the URLs?

    I currently have a function that pulls the text between two delimiters, let's say <a> and </a> (the delimiters can literally be anything), but it stops after the first match. Obviously I need a loop of some sort... but I'm kinda brain dead today and not seeing a good way to pull this off.

    Thanks.
     
    fireflyproject, Jun 8, 2008
  2. GLD

    #2
    I needed something similar to this for one of my projects. I just created an infinite loop with a condition that breaks out of it once there is nothing left to find.

    
    while (true) {
        // Code which pulls the text between <a> and </a>

        if ($offset === false) {
            break; // strpos() found no further match
        }
    }
    PHP:
    Remember to store the offset and move it past each match, so that the loop doesn't go over the same <a></a> pair over and over.
    e.g.
    $offset = strpos($text, "</a>", $offset + 1); // + 1 so the same </a> isn't found again
    PHP:
    Hope this makes sense, I'm quite tired... :)
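
For anyone who wants the whole loop in one place, here is a minimal sketch of the idea above. The function name extract_between and the sample string are made up for illustration, and it assumes the opening and closing markers appear literally in the text (so a plain '<a>' will not match '<a href="...">').

```php
<?php
// Hypothetical helper: collect every substring found between $open and $close.
function extract_between(string $text, string $open, string $close): array
{
    $results = [];
    $offset = 0;

    while (true) {
        // Find the next opening marker at or after the stored offset.
        $start = strpos($text, $open, $offset);
        if ($start === false) {
            break; // no more opening markers
        }
        $start += strlen($open);

        // Find the closing marker that follows it.
        $end = strpos($text, $close, $start);
        if ($end === false) {
            break; // unterminated pair, stop here
        }

        $results[] = substr($text, $start, $end - $start);

        // Move past this closing marker so the same pair is not matched again.
        $offset = $end + strlen($close);
    }

    return $results;
}

$html = '<a>one</a> some text <a>two</a>';
print_r(extract_between($html, '<a>', '</a>')); // one, two
```

The important part is the last line of the loop body: without advancing $offset past the closing marker, strpos() would keep returning the same position forever.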
     
    GLD, Jun 8, 2008
  3. ToddMicheau

    #3
    Hey, check out the PHP function preg_match_all.

    You can use it to easily find all matches (with a clever regular expression), like:

    
    $strHTML = "<a href=\"asdf\">etc etc</a>"; // This would be your html code you need parsed.
    preg_match_all('/<a href="(.+?)">(.+?)<\/a>/', $strHTML, $matches, PREG_SET_ORDER);
    print_r($matches);
    
    PHP:
    ( Untested of course but I hope it helps =] )
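
A slightly more forgiving variant of the same idea, as a rough sketch. It assumes the links may carry other attributes and may use single or double quotes around the href value. The sample HTML and the $links array are only for illustration, and a regular expression like this will still miss unquoted or otherwise unusual markup.

```php
<?php
// Sample input (made up): one plain link, one with an extra attribute and single quotes.
$strHTML = '<a href="one.html">First</a> and <a class="x" href=\'two.html\'>Second</a>';

// Group 1 = the quote character, group 2 = the URL, group 3 = the link text.
// \1 makes sure the closing quote is the same kind as the opening one.
$pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)<\/a>/is';

$links = [];
if (preg_match_all($pattern, $strHTML, $matches, PREG_SET_ORDER)) {
    // With PREG_SET_ORDER, each element of $matches is one complete match.
    foreach ($matches as $m) {
        $links[] = ['url' => $m[2], 'text' => $m[3]];
    }
}

print_r($links);
```

PREG_SET_ORDER groups the results per link, which is handy when you want the URL and its text together. The default (PREG_PATTERN_ORDER) groups them per capture group instead.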
     
    ToddMicheau, Jun 8, 2008
  4. fireflyproject

    #4
    Thanks guys.

    I looked pretty heavily into preg_match_all but regular expressions were killing me. I went through the whole chapter about it in one of my books... it helped a little but there are a lot of nuances I guess I will have to learn over time.

    I'll try the preg_match_all method first, and then the loop approach.

    Thanks!
     
    fireflyproject, Jun 8, 2008
  5. myhart

    #5
    Ok, I stripped this from a piece of code I wrote for an affiliate script. It prints the number of URLs and then prints each one.

    <?php
    $file = file_get_contents('http://www.yourfile');
    // [^"]* stops at the closing quote, so several links on one line are matched separately
    preg_match_all('/<a href="([^"]*)"/', $file, $a);
    $count = count($a[1]);
    echo "<b>Number of URLs</b> = " . $count . "<p>";
    for ($row = 0; $row < $count; $row++) {
        echo $a[1][$row] . "<br>";
    }
    ?>
    PHP:
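
One regex nuance that tends to bite with patterns like these is greedy matching. The sketch below uses a made-up one-line sample to compare a greedy (.*) capture with a negated character class. The variable names are only for illustration.

```php
<?php
// Two links on one line: the case where greedy and non-greedy patterns differ.
$html = '<a href="a.html">A</a> <a href="b.html">B</a>';

// Greedy: (.*) runs on to the LAST "> on the line, so both links collapse into one match.
preg_match_all('/<a href="(.*)">/', $html, $greedy);

// Negated class: [^"]* cannot cross a quote, so each link is matched on its own.
preg_match_all('/<a href="([^"]*)"/', $html, $strict);

$greedyUrls = $greedy[1];
$strictUrls = $strict[1];

echo count($greedyUrls) . " greedy match(es)\n";
echo count($strictUrls) . " strict match(es)\n";
```

On this sample the greedy pattern reports a single "URL" that swallows everything between the first href and the last closing quote, while the negated class returns a.html and b.html separately. A lazy quantifier, (.*?), would also work here.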
     
    myhart, Jun 8, 2008
  6. fireflyproject

    #6
    This is almost exactly what I needed! Thanks! With just a couple teensy modifications, I should have this all ready to go in no time.

    Thanks!!! +rep!
     
    fireflyproject, Jun 8, 2008