Hello. I need to inject links into existing HTML pages. The links are dynamic, so I don't want to do a static, one-time update of the HTML, and some of the pages are generated dynamically too. So I added an output buffer callback:

    $r = ob_start("bufferFilter"); // 4096

    function bufferFilter($buffer) {
        global $words;
        return addWords($words, $buffer);
    }

To parse the HTML I was searching for < and the matching >, but that is a very basic and not very robust approach. I also see that I can get all the tags using:

    function get_tags($tag, $xml) {
        $tag = preg_quote($tag); // only needed by the commented-out variant below
        $matches = array();
        $regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";
        preg_match_all($regex, $xml, $matches);
        /* preg_match_all('{<'.$tag.'[^>]*>(.*?)</'.$tag.'>}',
               $xml,
               $matches,
               PREG_PATTERN_ORDER);
        */
        return $matches;
    }

    $tags = get_tags("", $html);
    var_dump($tags);

Then I could search the HTML again for the actual content. But if there is a good, robust, free parser that can do the same, I would rather use that. Do you have any recommendations? If there is nothing like that, I will use this method.

Right now I have listed the following tags to ignore when parsing (leave unaltered):
* head
* script
* embed
* object

Is there any other tag whose text should be left alone? (I cannot use Perl.)
You can do:

    preg_match_all('/<[^>]+href=(["\'])(.*?)\1/i', $xml, $matches);

$matches[2] will then hold all your links (this only catches quoted href values).
I do not need the links. I need the text: all the text within div, p, span, font, and ALL other tags. Then I need to add some content back. So it's not just "get the text"; it's get the text, then insert some text back while keeping all existing formatting and markup untouched. So if the HTML was:

    <html>
    <!-- standard header removed for brevity -->
    <body>
    <p class=a1> word1 word2</p>
    <div class=a1> text3 </div>

the new text could be something like:

    <html>
    <!-- standard header removed for brevity -->
    <body>
    <p class=a1> <a href="lookup.php?w=word1">word1</a> word2 </p>
    <div class=a1> text3 </div>

Basically, I need all the text, and then I need to insert some text or markup in the same place, with the old HTML left as it was.
You might just want to use a DOM parser or XML parsing if you want to traverse every tag and make replacements... With XML parsing, that would mean your HTML has to validate as XML.
No, I tried XML; it's too strict. This is human-written HTML. Is there any other, non-strict parser around?
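One lenient option that ships with PHP is `DOMDocument::loadHTML()`, which uses libxml's HTML parser rather than the strict XML one. What follows is only a minimal sketch, assuming the goal from the posts above: wrap listed words in a `lookup.php?w=...` link while touching nothing but text nodes. The function name `linkWords`, the `$baseUrl` parameter, and the extra entries in the skip list (`style`, `a`, `textarea`) are assumptions for illustration, not something from the thread.

```php
<?php
// Hypothetical helper: link each listed word, editing text nodes only.
function linkWords(string $html, array $words, string $baseUrl = 'lookup.php?w='): string
{
    if (!$words) {
        return $html; // nothing to link
    }

    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Elements whose text is left alone (assumed list, extend as needed).
    $skip = ['head', 'script', 'style', 'embed', 'object', 'a', 'textarea'];

    $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $words);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/';

    $xpath = new DOMXPath($doc);
    // Copy the node list first, because the tree is modified while looping.
    $textNodes = iterator_to_array($xpath->query('//text()'));

    foreach ($textNodes as $node) {
        // Skip text that sits anywhere inside an ignored element.
        for ($p = $node->parentNode; $p !== null; $p = $p->parentNode) {
            if (in_array(strtolower($p->nodeName), $skip, true)) {
                continue 2;
            }
        }

        // With DELIM_CAPTURE, odd-indexed parts are the matched words.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no listed word in this text node
        }

        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $baseUrl . rawurlencode($part));
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    return $doc->saveHTML();
}
```

Two caveats with this sketch: `saveHTML()` re-serializes the page, so attribute quoting and a few other details get normalized (for example `class=a1` comes back as `class="a1"`), and `loadHTML()` assumes ISO-8859-1 unless the page declares a charset. It could be called from the existing buffer callback, e.g. `return linkWords($buffer, $words);` inside `bufferFilter()`.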
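That's even simpler. The simplest way to do this would be:

    $prefix = '<a href="lookup.php?w=word1">';
    $suffix = '</a>';
    preg_match_all('/>([^<]+)</', $body, $matches);
    $map = array();
    foreach ($matches[1] as $value) {
        // only rewrite text found between a closing > and the next <
        $map['>' . $value . '<'] = '>' . str_replace('word1', $prefix . 'word1' . $suffix, $value) . '<';
    }
    $body = strtr($body, $map); // one pass, so inserted links are not rescanned

You can make it faster and more exact by using a preg_replace callback.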
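The callback variant mentioned above might look roughly like this. It is a sketch under assumptions: the function name `linkWordRegex` is made up for illustration, and the `lookup.php?w=` URL is taken from the example earlier in the thread. It only rewrites text that sits between a `>` and the next `<`, so attribute values are left alone.

```php
<?php
// Hypothetical helper: link one word wherever it appears in text between tags.
function linkWordRegex(string $body, string $word): string
{
    $wordPattern = '/\b' . preg_quote($word, '/') . '\b/';
    $link = '<a href="lookup.php?w=' . rawurlencode($word) . '">'
          . htmlspecialchars($word) . '</a>';

    return preg_replace_callback(
        '/>([^<]+)</',                       // text between two tags
        function ($m) use ($wordPattern, $link) {
            // Inner callback avoids $ or \ in $link being read as backreferences.
            $text = preg_replace_callback(
                $wordPattern,
                function () use ($link) { return $link; },
                $m[1]
            );
            return '>' . $text . '<';
        },
        $body
    );
}

$html = '<p class=a1> word1 word2</p> <div class=a1 title="word1"> text3 </div>';
echo linkWordRegex($html, 'word1');
```

This does not know about `script`, `head`, or existing `a` elements, so text inside those would be linked as well, and a stray `>` inside an attribute value or a script can confuse it. That is the main trade-off against a real parser.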
OK, thank you. So you think that regex is better than a parser? I was just testing http://www.phpclasses.org/browse/package/1420.html and it works pretty well too. I found this regex somewhere; it is supposed to handle a lot of quirks too:

    $regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

I don't know much about regex. Which do you think I should use? (I have also attached the regex, in case it does not display correctly here.)
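To see what that tag regex actually picks up before deciding, a quick check like the following may help. It is only a sketch: the helper name `findTags` and the sample markup are made up for illustration, and the pattern is the one quoted above, written as a single-quoted PHP string.

```php
<?php
// Hypothetical helper: return every substring the quoted regex treats as a tag.
function findTags(string $html): array
{
    $regex = '/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:".*?"|\'.*?\'|[^\'">\s]+))?)+\s*|\s*)\/?>/i';
    preg_match_all($regex, $html, $matches);
    return $matches[0]; // full matches only, ignoring the inner groups
}

$sample = '<p class=a1> word1 </p><a href="x.php?a=1">t</a>';
print_r(findTags($sample));
```

For this sample it returns the four tags (`<p class=a1>`, `</p>`, `<a href="x.php?a=1">`, `</a>`). Note that the pattern requires a word character right after `<`, so comments and the doctype are not matched as tags and would need separate handling.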