HTML parser: full HTML parse and edit in place / add tags

tgkprog Peon

#1

Hello,

I need to inject links into existing HTML pages.

Because the links are dynamic, I don't want to do a one-time static update of the HTML.

Some of the pages are dynamic too.

So I added an output buffer callback:
Code (text):
$r = ob_start("bufferFilter"); // 4096

function bufferFilter($buffer)
{
    global $words;
    $d = addWords($words, $buffer);
    return $d;
}
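A minimal, self-contained sketch of how such a buffer callback behaves. The body of addWords() is not shown in this thread, so the version below is a placeholder assumption: it links each listed word with a single strtr() pass, uses an assumed lookup.php?w= link format, and does not yet tell text from markup. The outer ob_start() is only there to capture the filtered output for inspection.

```php
<?php
// Placeholder for the addWords() helper (its real body is not shown in the thread).
// Assumption: it links each listed word; it does not yet tell text from markup.
function addWords(array $words, $html)
{
    $pairs = array();
    foreach ($words as $word) {
        $pairs[$word] = '<a href="lookup.php?w=' . urlencode($word) . '">' . $word . '</a>';
    }
    // strtr() replaces in one pass, so inserted links are not scanned again.
    return strtr($html, $pairs);
}

// Output callback: PHP hands it the buffered page and sends on whatever it returns.
function bufferFilter($buffer)
{
    global $words;
    return addWords($words, $buffer);
}

$words = array('word1');

ob_start();                // outer buffer, only used to capture the result
ob_start('bufferFilter');  // the filtering buffer
echo '<p class=a1> word1 word2</p>';
ob_end_flush();            // runs bufferFilter() and passes its result outward
$page = ob_get_clean();

echo $page, "\n";
```

In a real page only the ob_start('bufferFilter') call is needed; PHP flushes the buffer through the callback when the script ends.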
To parse the HTML I was searching for < and the matching >, but that is a very basic and not very robust approach.

I also see that I can get all the tags using:
Code (text):
function get_tags($tag, $xml) {
    $tag = preg_quote($tag);
    $matches = array();
    $regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";
    preg_match_all($regex, $xml, $matches);
    /*preg_match_all('{<'.$tag.'[^>]*>(.*?)</'.$tag.'>}',
                     $xml,
                     $matches,
                     PREG_PATTERN_ORDER);
                     */

    return $matches;
}

$tags = get_tags("", $html);

var_dump($tags);
Then I could search the HTML again for the actual content, but if there is a good, robust, free parser that can do the same, I would rather use that.

So, do you have any recommendations?

If there is nothing like that, I will use this method. Right now I have listed the following tags to ignore when parsing (leave unaltered):

* head
* script
* embed
* object

Is there any other tag whose text should be left alone?
(I cannot use Perl.)
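A rough sketch of the ignore-list idea described above. The helper name linkOutsideTags() and the lookup.php?w= link format are assumptions made for this example. style is added to the list as an assumption, and embed is left out of this sketch because it is usually written without a closing tag, which would leave the skip state open. Regex splitting like this is fragile (a stray < or > in text or attribute values will confuse it), so treat it as an illustration and not a hardened parser.

```php
<?php
// Hypothetical helper: links $word in text, but not inside the elements named in $skip.
function linkOutsideTags($html, $word, array $skip = array('head', 'script', 'style', 'object'))
{
    $link = '<a href="lookup.php?w=' . urlencode($word) . '">' . $word . '</a>';
    // Odd indexes hold the captured tags, even indexes the text between them.
    $parts = preg_split('/(<[^<>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $inside = null; // name of the skipped element currently open, or null
    $depth = 0;     // nesting depth of that element (objects can nest)

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            if ($inside === null) {
                $parts[$i] = str_replace($word, $link, $part);
            }
            continue;
        }
        if (!preg_match('~^<(/?)([a-z][a-z0-9]*)~i', $part, $m)) {
            continue; // comment, doctype, or something that is not a tag
        }
        $name = strtolower($m[2]);
        $closing = ($m[1] === '/');
        if ($inside === null && !$closing && in_array($name, $skip, true)) {
            $inside = $name;
            $depth = 1;
        } elseif ($inside === $name) {
            $depth += $closing ? -1 : 1;
            if ($depth === 0) {
                $inside = null;
            }
        }
    }
    return implode('', $parts);
}

$html = '<head><title>word1</title></head><body><script>var word1;</script>'
      . '<p class=a1> word1 word2</p></body>';
$result = linkOutsideTags($html, 'word1');
echo $result, "\n";
```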

tgkprog, Jul 7, 2008

ipro Active Member

#2

You can do:
PHP:
preg_match_all('/<[^>]+href=(["\']?)(.*?)\1/', $xml, $matches);
$matches[2] will hold all your links.

ipro, Jul 7, 2008

tgkprog Peon

#3

I do not need the links. I need the text: all the text within the div, p, span, font and all other tags. Then I need to add some content back. In other words, it is not just getting the text; it is getting the text and inserting some text back while keeping all existing formatting and markup untouched.

So if the HTML was:
PHP:
<html>

<body> <p class=a1> word1 word2</p>
<div class=a1>
text3
</div>

the new text could be something like:
PHP:
<html>

<body> <p class=a1> <a href="lookup.php?w=word1">word1</a> word2  </p>
<div class=a1>
text3
</div>

Basically, I need all the text, and then I need to insert some text or markup at the same place, with the old HTML left as it was.

tgkprog, Jul 7, 2008

bucabay Peon

#4

You might just want to use the DOM parser or XML parsing if you want to traverse every tag and make replacements.
With the XML route, that would mean your HTML has to be well-formed XML.
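For illustration, a minimal sketch of the DOM route using PHP's bundled DOM extension. It relies on DOMDocument::loadHTML(), which goes through libxml2's HTML parser and generally tolerates markup that is not well-formed XML. The function name linkWord(), the skip list and the reuse of the lookup.php?w= link format are assumptions for this example, and only the first occurrence of the word in each text node is linked.

```php
<?php
// Assumption: linkWord() and its $skip list are illustrative names, not from the thread.
function linkWord($html, $word, array $skip = array('head', 'script', 'style', 'embed', 'object', 'a'))
{
    $doc = new DOMDocument();
    // The @ silences the warnings libxml emits for sloppy, human-written markup.
    @$doc->loadHTML($html);

    // Copy the text nodes into an array first, because the tree is edited below.
    $xpath = new DOMXPath($doc);
    $textNodes = array();
    foreach ($xpath->query('//text()') as $node) {
        $textNodes[] = $node;
    }

    foreach ($textNodes as $node) {
        // Leave text alone if any ancestor element is on the skip list.
        for ($p = $node->parentNode; $p !== null; $p = $p->parentNode) {
            if (in_array(strtolower($p->nodeName), $skip, true)) {
                continue 2;
            }
        }
        $text = $node->nodeValue;
        $pos = strpos($text, $word);
        if ($pos === false) {
            continue;
        }
        // Rebuild the node as: text before, <a>word</a>, text after.
        $a = $doc->createElement('a');
        $a->setAttribute('href', 'lookup.php?w=' . urlencode($word));
        $a->appendChild($doc->createTextNode($word));
        $frag = $doc->createDocumentFragment();
        $frag->appendChild($doc->createTextNode(substr($text, 0, $pos)));
        $frag->appendChild($a);
        $frag->appendChild($doc->createTextNode(substr($text, $pos + strlen($word))));
        $node->parentNode->replaceChild($frag, $node);
    }
    return $doc->saveHTML();
}

$out = linkWord('<html><body><script>var word1 = 1;</script><p class=a1> word1 word2</p><div class=a1>text3</div></body></html>', 'word1');
echo $out;
```

Note that saveHTML() re-serializes the whole document, so details such as attribute quoting or a missing doctype may come out normalized rather than byte-for-byte identical to the input.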

bucabay, Jul 7, 2008

tgkprog Peon

#5

No, I tried XML. It is too strict; this is human-written HTML. Is there any other non-strict parser around?

tgkprog, Jul 7, 2008

ipro Active Member

#6

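That's even simpler.

The simplest way to do this would be:
PHP:
$prefix = '<a href="lookup.php?w=word1">';
$suffix = "</a>";
// Collect every text run that sits between a ">" and the next "<".
preg_match_all("/>([^<]+)</", $body, $matches);

$pairs = array();
foreach ($matches[1] as $value)
{
    // Link the word inside this text run only, so tags and attributes stay untouched.
    $pairs[">".$value."<"] = ">".str_replace("word1", $prefix."word1".$suffix, $value)."<";
}
// strtr() replaces in a single pass, so text that is already linked is not scanned again.
$body = strtr($body, $pairs);

You can make it faster and more exact by using a preg_replace_callback().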
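A minimal sketch of the preg_replace_callback() variant mentioned above. The function name linkWordInText() is hypothetical, and the link format follows the lookup.php?w=word1 example from post #3. It rewrites only the text runs between > and <, but it still treats the contents of script or style elements as text and matches the word as a plain substring, so it is an illustration and not a full parser.

```php
<?php
// Hypothetical helper name; the thread does not give one for this callback variant.
function linkWordInText($body, $word)
{
    $link = '<a href="lookup.php?w=' . urlencode($word) . '">' . $word . '</a>';
    // Match each text run between ">" and "<" and rewrite only that run.
    return preg_replace_callback(
        '/>([^<]+)</',
        function ($m) use ($word, $link) {
            return '>' . str_replace($word, $link, $m[1]) . '<';
        },
        $body
    );
}

$linked = linkWordInText('<body> <p class=a1> word1 word2</p><div class=a1>text3</div>', 'word1');
echo $linked, "\n";
```

Because each text run is passed to the callback exactly once, a run that has already been linked is never processed a second time.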

ipro, Jul 7, 2008

tgkprog Peon

#7

OK, thank you. So you think that a regex is better than a parser? I was just testing http://www.phpclasses.org/browse/package/1420.html and it works pretty well too.

I found this regex somewhere; it is supposed to take care of a lot of quirks too:
PHP:
$regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";
I don't know much about regex. Which do you think I should use?

I attached the regex because it does not display correctly here.

Attached Files: html_regex_2.php

tgkprog, Jul 7, 2008