Extracting anchor text

Discussion in 'PHP' started by ferret77, Aug 14, 2005.

  1. #1
    Is there a simple way to extract the anchor text of links using regular expressions in PHP?

    like I am looking for basically this pattern

    "target='_new'>Anti-War Protest In Texas</a>" from a bunch of text chunks

    I want to extract the "Anti-War Protest In Texas" part

    I could split the string a few times to get it, but is there a quicker way with regex?

    I was looking at php.net; there is a bunch of regex stuff that returns true or false, but that's not what I want.
     
    ferret77, Aug 14, 2005 IP
  2. palespyder

    palespyder Psycho Ninja

    Messages:
    1,254
    Likes Received:
    98
    Best Answers:
    0
    Trophy Points:
    168
    #2
    found this, not sure if this is exactly what you are looking for

    preg_match("/<a.>(.)<\/a>/", $matchText, $temp);
     
    palespyder, Aug 14, 2005 IP
  3. ferret77

    ferret77 Heretic

    Messages:
    5,276
    Likes Received:
    230
    Best Answers:
    0
    Trophy Points:
    0
    #3
    Does (.) represent any number of characters? I was trying (*).
     
    ferret77, Aug 14, 2005 IP
  4. palespyder

    #4
    yeah you could do (.*) to represent any number of characters
     
    palespyder, Aug 14, 2005 IP
  5. ferret77

    #5
    got it

    ">(.*)<\/a>"
    Actually, I spoke too soon: it gives me the URL too.
     
    ferret77, Aug 14, 2005 IP
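A likely reason the pattern in post #5 also returns the URL is that .* is greedy: it runs from the first ">" it finds to the last "</a>" in the text, so anything between two links gets swept in. The sketch below illustrates this, assuming PHP's PCRE functions; the $text sample and the second pattern are made up for the demonstration and do not come from the thread.

```php
<?php
// Hypothetical sample: two links on the same line.
$text = "<a href='/a' target='_new'>First</a> and <a href='/b' target='_new'>Second</a>";

// Greedy: (.*) spans from the first ">" to the LAST "</a>",
// so the second link's tag (and its URL) ends up in the capture.
preg_match('/>(.*)<\/a>/', $text, $greedy);

// Lazy quantifier (.*?) anchored on the opening <a ...> tag:
// each match stops at the nearest "</a>".
preg_match_all('/<a\b[^>]*>(.*?)<\/a>/i', $text, $lazy);

echo $greedy[1], "\n";
print_r($lazy[1]);
```

Here $greedy[1] holds everything from "First" through "Second", including the second opening tag, while $lazy[1] holds just the two link texts.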
  6. Gmorkster

    Gmorkster Peon

    Messages:
    202
    Likes Received:
    7
    Best Answers:
    0
    Trophy Points:
    0
    #6
    try |(<a[\s]+[^>]+>)([^</a>])(</a>)|i

    The \s is important; otherwise it will match <abbr> and any other XHTML tag starting with "a". I didn't test it, but it should work.
     
    Gmorkster, Aug 14, 2005 IP
  7. ferret77

    #7
    is there a way to just get the link text?

    or do I have to do some sort of replace?
     
    ferret77, Aug 14, 2005 IP
  8. Gmorkster

    #8
    preg_match("|(<a[\s]+[^>]+>)([^</a>])(</a>)|i", $link, $matches);

    then $matches[1] will contain your anchor
     
    Gmorkster, Aug 14, 2005 IP
  9. J.D.

    J.D. Peon

    Messages:
    1,198
    Likes Received:
    64
    Best Answers:
    0
    Trophy Points:
    0
    #9
    It's not even going to match

    <a href=\"test\">abc</a>

    This expression will work on all one-line anchors that contain plain text (no nested tags):

    <a(?:[ \t]+[^>]*)?>([^<]+)<\/a>

    J.D.
     
    J.D., Aug 14, 2005 IP
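As a rough sketch of how the expression from post #9 could be used to pull every anchor text out of a chunk of text, the example below wraps it in "/" delimiters and passes it to preg_match_all. The $html sample is hypothetical (only its first link is taken from the original question), and the variable names are made up for illustration.

```php
<?php
// The expression from post #9, wrapped in "/" delimiters.
$pattern = '/<a(?:[ \t]+[^>]*)?>([^<]+)<\/a>/';

// Hypothetical input mixing real anchors with an <abbr> tag.
$html = "<a href='x' target='_new'>Anti-War Protest In Texas</a> "
      . "<abbr>etc</abbr> <a>bare</a> <a href=\"test\">abc</a>";

// preg_match_all collects every match; index 1 is the single capture group.
preg_match_all($pattern, $html, $matches);
$anchors = $matches[1];

print_r($anchors);
```

The <abbr> element is skipped because "<a" must be followed either by whitespace or directly by ">". The non-capturing group (?: ... ) keeps the link text in $matches[1].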
  10. Gmorkster

    #10
    One char missing, sorry... :D

    preg_match("|(<a[\s]+[^>]+>)([^</a>]+)(</a>)|i", "<a href=\"test\">foo</a>", $m);
    print_r($m);

    And the anchor is $m[2], not $m[1]
     
    Gmorkster, Aug 14, 2005 IP
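To make the index correction in post #10 concrete, here is a small sketch of how preg_match numbers the capture groups for that pattern and input. Index 0 is always the whole match, and each opening parenthesis gets the next number from left to right. The $pattern variable name is only for illustration.

```php
<?php
// Pattern and input as given in post #10.
$pattern = '|(<a[\s]+[^>]+>)([^</a>]+)(</a>)|i';
preg_match($pattern, '<a href="test">foo</a>', $m);

// $m[0]: the entire match
// $m[1]: first group, the opening tag
// $m[2]: second group, the anchor text
// $m[3]: third group, the closing tag
print_r($m);
```

With this input, $m[2] is "foo", which is why the anchor text sits at index 2 rather than 1.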
  11. J.D.

    #11
    There's more than a char missing in this. Go ahead and give the anchor I quoted a try (the one with abc). You clearly don't understand what square brackets or parentheses are for.

    J.D.
     
    J.D., Aug 14, 2005 IP
  12. Gmorkster

    #12
    bah-- I did, just replaced "abc" with "foo"!#!@#$

    |(<a[\s]+[^>]+>)([^</a>]+)(</a>)|i breaks down as: delimiter (match1) (match2) (match3) delimiter case_insensitive

    - first parenthesis: match <a followed by one or more whitespace characters (\s matches spaces, tabs, and newlines), followed by one or more characters other than >
    - second parenthesis: match anything but </a> (the anchor)
    - third parenthesis: match </a>

    The second parenthesis matches the anchor, which is $matches[2].

    I believe I do understand how regex works... :)
     
    Gmorkster, Aug 14, 2005 IP
  13. J.D.

    #13
    No. Square brackets mean "any of the listed characters" or "none of the listed characters" if used with ^. So, this [^</a>]+ says "one or more of any character except <, /, a or >".

    On top of that, why would you put parentheses around everything? What's the point of capturing </a>?

    J.D.
     
    J.D., Aug 14, 2005 IP
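The point in post #13 about square brackets can be checked directly. In the sketch below, the pattern from post #10 matches the "foo" link but not the "abc" one, because [^</a>] excludes the single character "a" rather than the string "</a>". The $fixed pattern is only one possible rewrite with a single capture group; it and all the variable names are assumptions for illustration, not something posted in the thread.

```php
<?php
// Pattern from post #10: [^</a>]+ means "one or more characters
// that are none of <, /, a, >", not "anything but the string </a>".
$buggy = '|(<a[\s]+[^>]+>)([^</a>]+)(</a>)|i';

$fooResult = preg_match($buggy, '<a href="test">foo</a>', $m1); // "foo" has no "a"
$abcResult = preg_match($buggy, '<a href="test">abc</a>', $m2); // "abc" starts with "a"

// The character class on its own shows the same behaviour.
$classFoo = preg_match('/^[^<\/a>]+$/', 'foo');
$classBar = preg_match('/^[^<\/a>]+$/', 'bar');

// Hypothetical rewrite: exclude only "<" and capture just the text.
$fixed = '|<a\s[^>]*>([^<]+)</a>|i';
preg_match($fixed, '<a href="test">abc</a>', $m3);

var_dump($fooResult, $abcResult, $classFoo, $classBar, $m3[1]);
```

Capturing only the link text also puts the result at index 1, so there is no need to count past groups that hold the tags.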
  14. Gmorkster

    #14
    sheesh, got it now *blush*. Working for 15 straight hours must've gotten to me. Sorry!
     
    Gmorkster, Aug 14, 2005 IP