Extracting anchor text

Discussion in 'PHP' started by ferret77, Aug 14, 2005.

  1. #1
    Is there a simple way to extract the anchor text of links using regular expressions in PHP?

    Basically, I am looking for this pattern

    "target='_new'>Anti-War Protest In Texas</a>" from a bunch of text chunks.

    I want to extract the "Anti-War Protest In Texas" part.

    I could split the string a few times to get it, but is there a quicker way with regex?

    I was looking at php.net; there is a bunch of regex stuff that returns true or false, but that's not what I want.
     
    ferret77, Aug 14, 2005 IP
  2. palespyder

    palespyder Psycho Ninja

    Messages:
    1,254
    Likes Received:
    98
    Best Answers:
    0
    Trophy Points:
    168
    #2
    Found this; not sure if it is exactly what you are looking for:

    preg_match("/<a.>(.)<\/a>/", $matchText, $temp);
     
    palespyder, Aug 14, 2005 IP
  3. ferret77

    ferret77 Heretic

    Messages:
    5,276
    Likes Received:
    230
    Best Answers:
    0
    Trophy Points:
    0
    #3
    Does (.) represent any number of characters? I was trying (*).
     
    ferret77, Aug 14, 2005 IP
  4. palespyder

    #4
    Yeah, you could use (.*) to represent any number of characters.
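
To make the difference between (.) and (.*) concrete, here is a minimal sketch. The sample string is an assumption modelled on the snippet in post #1, and the patterns are cut down to the part being discussed.

```php
<?php
// Hypothetical sample input, modelled on the snippet from post #1.
$text = "<a href='page.html' target='_new'>Anti-War Protest In Texas</a>";

// (.) captures exactly one character, so it only fits one-character link text.
$single = preg_match('/>(.)<\/a>/', $text, $m1);

// (.*) captures zero or more characters (any character except a newline by default).
$many = preg_match('/>(.*)<\/a>/', $text, $m2);

var_dump($single);   // int(0)
echo $m2[1], "\n";   // Anti-War Protest In Texas
```

preg_match returns 1 or 0 for match or no match; the captured text lands in the array passed as the third argument.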
     
    palespyder, Aug 14, 2005 IP
  5. ferret77

    #5
    Got it:

    ">(.*)<\/a>"

    Actually, I spoke too soon: it gives me the URL too.
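
One plausible reason the URL shows up, assuming the chunk has another tag before the link: (.*) is greedy and starts capturing at the first ">" it finds. The sketch below uses a hypothetical chunk and a narrower character class for comparison; it is an illustration, not necessarily what happened with the original data.

```php
<?php
// Hypothetical chunk: an earlier tag precedes the link on the same line.
$chunk = "<b>News:</b> <a href='http://example.com/' target='_new'>Anti-War Protest In Texas</a>";

// Greedy (.*) starts at the first ">" in the chunk, so the capture
// includes the rest of the markup, URL and all.
preg_match('/>(.*)<\/a>/', $chunk, $greedy);

// [^<>]* cannot cross a tag boundary, so only the link text is captured.
preg_match('/>([^<>]*)<\/a>/', $chunk, $narrow);

echo $greedy[1], "\n";
echo $narrow[1], "\n";   // Anti-War Protest In Texas
```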
     
    ferret77, Aug 14, 2005 IP
  6. Gmorkster

    Gmorkster Peon

    Messages:
    202
    Likes Received:
    7
    Best Answers:
    0
    Trophy Points:
    0
    #6
    try |(<a[\s]+[^>]+>)([^</a>])(</a>)|i

    The \s is important; otherwise it will match <abbr> and any other XHTML tag starting with "a". I didn't test it, but it should work.
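
The point about \s can be checked with a small sketch. Both test strings and the $loose and $strict variable names are assumptions made up for illustration; only the opening-tag part of the expression is tested.

```php
<?php
// Hypothetical inputs: a real link and an <abbr> element.
$link = '<a href="test">foo</a>';
$abbr = '<abbr title="HyperText Markup Language">HTML</abbr>';

// Without the whitespace requirement, "<a" followed by anything up to ">"
// also matches the start of <abbr ...>.
$loose  = '|<a[^>]+>|i';
// Requiring at least one whitespace character after "<a" rules that out.
$strict = '|<a\s+[^>]+>|i';

var_dump(preg_match($loose, $abbr));   // int(1)
var_dump(preg_match($strict, $abbr));  // int(0)
var_dump(preg_match($strict, $link));  // int(1)
```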
     
    Gmorkster, Aug 14, 2005 IP
  7. ferret77

    #7
    is there a way to just get the link text?

    or do I have to do some sort of replace?
     
    ferret77, Aug 14, 2005 IP
  8. Gmorkster

    #8
    preg_match("|(<a[\s]+[^>]+>)([^</a>])(</a>)|i", $link, $matches);

    then $matches[1] will contain your anchor
     
    Gmorkster, Aug 14, 2005 IP
  9. J.D.

    J.D. Peon

    Messages:
    1,198
    Likes Received:
    65
    Best Answers:
    0
    Trophy Points:
    0
    #9
    It's not even going to match

    <a href="test">abc</a>

    This expression will work on all one-line anchors

    <a(?:[ \t]+[^>]*)?>([^<]+)<\/a>
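
A minimal sketch of how this expression might be used from PHP. The sample text, the "/" delimiters and the "i" flag are assumptions added for the example; preg_match_all is used so that every link in the chunk is collected.

```php
<?php
// Hypothetical sample text with two one-line anchors.
$html = "<p><a href=\"test\">abc</a> and <a href='x.html' target='_new'>Anti-War Protest In Texas</a></p>";

// The expression from this post, wrapped in "/" delimiters. The "i" flag is
// an assumption, added so that <A HREF=...> would also match.
$pattern = '/<a(?:[ \t]+[^>]*)?>([^<]+)<\/a>/i';

// preg_match_all collects every match; $matches[1] holds the first capture group.
preg_match_all($pattern, $html, $matches);

print_r($matches[1]);
```

Because (?:...) is non-capturing, the link text is the only group, so it is always at index 1.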

    J.D.
     
    J.D., Aug 14, 2005 IP
  10. Gmorkster

    #10
    One char missing, sorry... :D

    preg_match("|(<a[\s]+[^>]+>)([^</a>]+)(</a>)|i", "<a href=\"test\">foo</a>", $m);
    print_r($m);

    And the anchor is $m[2], not $m[1]
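
For reference, a short sketch of how the numbered groups line up for this call. The pattern is the one above, written single-quoted so the inner double quotes need no escaping.

```php
<?php
// The call from this post, written with a single-quoted pattern.
preg_match('|(<a[\s]+[^>]+>)([^</a>]+)(</a>)|i', '<a href="test">foo</a>', $m);

// $m[0] is the whole match; $m[1], $m[2], $m[3] are the groups, left to right.
echo $m[0], "\n";  // <a href="test">foo</a>
echo $m[1], "\n";  // <a href="test">
echo $m[2], "\n";  // foo
echo $m[3], "\n";  // </a>
```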
     
    Gmorkster, Aug 14, 2005 IP
  11. J.D.

    #11
    There's more than a char missing in this. Go ahead and give the anchor I quoted a try (the one with abc). You clearly don't understand what square brackets or parentheses are for.

    J.D.
     
    J.D., Aug 14, 2005 IP
    palespyder likes this.
  12. Gmorkster

    #12
    bah-- I did, just replaced "abc" with "foo"!#!@#$

    |(<a[\s]+[^>]+>)([^</a>]+)(</a>)|i is: separator (match1) (match2) (match3) separator case_insensitive

    - first parenthesis: match <a followed by one or more blanks (\s matches blanks and tabs), followed by any characters but >, then >
    - second parenthesis: match anything but </a> (the anchor)
    - third parenthesis: match </a>

    Second parenthesis matches the anchor, which is $matches[2].

    I believe I do understand how regex works... :)
     
    Gmorkster, Aug 14, 2005 IP
  13. J.D.

    #13
    No. Square brackets mean "any of the listed characters" or "none of the listed characters" if used with ^. So, this [^</a>]+ says "one or more of any character except <, /, a or >".
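
A quick sketch of that behaviour, using the pattern from post #10 and the two test anchors mentioned in the thread. The last pattern is only an illustrative variant with [^<]+ in the middle group.

```php
<?php
// The pattern from post #10, unchanged.
$pattern = '|(<a[\s]+[^>]+>)([^</a>]+)(</a>)|i';

// "foo" contains none of <, /, a, > so the second group can match it.
var_dump(preg_match($pattern, '<a href="test">foo</a>'));  // int(1)

// "abc" starts with "a", which the class excludes, so the match fails.
var_dump(preg_match($pattern, '<a href="test">abc</a>'));  // int(0)

// [^<]+ accepts any run of characters up to the next "<".
var_dump(preg_match('|<a\s+[^>]+>([^<]+)</a>|i', '<a href="test">abc</a>', $m));  // int(1)
echo $m[1], "\n";  // abc
```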

    On top of that, why would you put parentheses around everything? What's the point of capturing </a>?

    J.D.
     
    J.D., Aug 14, 2005 IP
  14. Gmorkster

    #14
    sheesh, got it now *blush*. Working for 15 straight hours must've gotten to me. Sorry!
     
    Gmorkster, Aug 14, 2005 IP
    J.D. likes this.