HTML parser: full HTML parse and edit in place / add tags

Discussion in 'PHP' started by tgkprog, Jul 7, 2008.

  1. #1
    Hello,

    I need to inject links into existing HTML pages.

    Since the links are dynamic, I don't want to do a static one-time update of the HTML.

    Also, some of the pages are dynamic too.

    So I added an output buffer callback:
     
     
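    // Buffer all output and pass it through bufferFilter() before it is sent.
    $r = ob_start("bufferFilter"); // optional second argument: chunk size, e.g. 4096

    function bufferFilter($buffer)
    {
        global $words;
        // addWords() (not shown) returns $buffer with the links inserted.
        $d = addWords($words, $buffer);
        return $d;
    }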
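A minimal sketch of how the buffer callback above could be wired up end to end. `addWordsNaive()` is a hypothetical stand-in for `addWords()`, whose code is not shown in the post, and the `lookup.php?w=` URL is borrowed from the example later in the thread. The sketch passes the word list through a closure instead of `global`, which is a style assumption and not the poster's code.

```php
<?php
// Hypothetical stand-in for addWords(): wraps each listed word in a lookup link.
// Deliberately naive: it would also touch words that appear inside tags.
function addWordsNaive(array $words, $buffer)
{
    if (!$words) {
        return $buffer;
    }
    $alternatives = implode('|', array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words));

    // One pass over the buffer, so inserted links are never processed again.
    return preg_replace_callback('/\b(?:' . $alternatives . ')\b/', function ($m) {
        return '<a href="lookup.php?w=' . urlencode($m[0]) . '">' . $m[0] . '</a>';
    }, $buffer);
}

$words = array('word1');

// Everything echoed after ob_start() goes through the callback when the buffer is flushed.
ob_start(function ($buffer) use ($words) {
    return addWordsNaive($words, $buffer);
});

echo '<p class=a1> word1 word2</p>';

ob_end_flush();
```

Because the callback receives the whole page as one string, it works the same for static and dynamic pages, as long as they are served through PHP.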
     
    To parse the HTML I was searching for < and the matching >, but that's a very basic and not very robust approach.

    I also see that I can get all the tags using:
    function get_tags( $tag, $xml ) {
       // $tag is only needed by the commented-out alternative below.
       $tag = preg_quote($tag);
       $matches = array();
       // Matches any opening, closing or self-closing tag together with its attributes.
       $regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";
       preg_match_all($regex, $xml, $matches);
       /* Alternative: capture only the contents of <$tag> elements.
       preg_match_all('{<'.$tag.'[^>]*>(.*?)</'.$tag.'>}',
                        $xml,
                        $matches,
                        PREG_PATTERN_ORDER);
                        */

       return $matches;
     }

    $tags = get_tags("", $html);

    var_dump($tags);
     
    Then I could search the HTML again for the actual content, but if there is a good, robust, free parser that can do the same, I would rather use that.

    So, do you have any recommendations?

    If there is nothing like that, I will use this method. Right now I have listed the following tags to ignore when parsing (their contents are left unaltered):


      * head
      * script
      * embed
      * object

    Is there any other tag whose text should be left alone?
    (I cannot use Perl.)
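The approach described above (split on < and >, then leave some elements alone) could be sketched roughly as follows. The function name `linkWordOutsideTags()` and the `lookup.php?w=` URL are illustrative assumptions. The skip list extends the poster's list with `style`, `textarea` and `a` as a suggestion, and leaves out `embed` because it is a void element with no closing tag and no text of its own. This is a sketch under those assumptions: it does not handle attribute values that contain ">" or nested elements of the same skipped type.

```php
<?php
// Link $word in text tokens only, ignoring text inside the elements named in $skip.
function linkWordOutsideTags($html, $word, array $skip = array('head', 'script', 'style', 'object', 'textarea', 'a'))
{
    $link = '<a href="lookup.php?w=' . urlencode($word) . '">' . $word . '</a>';
    $wordPattern = '/\b' . preg_quote($word, '/') . '\b/';

    // With PREG_SPLIT_DELIM_CAPTURE, odd-numbered tokens are tags or comments
    // and even-numbered tokens are the text between them.
    $tokens = preg_split('/(<!--.*?-->|<[^>]*>)/s', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $inside = null; // name of the skipped element we are currently inside, if any
    foreach ($tokens as $i => $token) {
        if ($i % 2 === 1) {
            if (preg_match('/^<(\/?)([a-z][a-z0-9]*)/i', $token, $m)) {
                $name = strtolower($m[2]);
                $isClosing = ($m[1] === '/');
                $isSelfClosing = (substr($token, -2) === '/>');
                if ($inside === null && !$isClosing && !$isSelfClosing && in_array($name, $skip, true)) {
                    $inside = $name;   // entering a skipped element
                } elseif ($inside === $name && $isClosing) {
                    $inside = null;    // leaving it again
                }
            }
        } elseif ($inside === null) {
            // Plain text outside any skipped element: safe to add links.
            $tokens[$i] = preg_replace_callback($wordPattern, function () use ($link) {
                return $link;
            }, $token);
        }
    }

    // Tags and skipped text are joined back exactly as they were.
    return implode('', $tokens);
}
```

Since the tag tokens are never modified, the original markup comes back byte for byte apart from the inserted links.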
     
    tgkprog, Jul 7, 2008 IP
  2. ipro

    ipro Active Member

    #2
    You can do:
    preg_match_all('/<[^>]+href=(["\'])(.*?)\1/', $xml, $matches);
    $matches[2] will then hold all your links (as long as the href values are quoted).
     
    ipro, Jul 7, 2008 IP
  3. tgkprog

    tgkprog Peon

    #3
    I do not need the links. I need the text: all the text within div, p, span, font and ALL other tags. Then I need to add some content back. So it's not just getting the text; it's getting the text and inserting some text back while keeping all existing formatting and markup untouched.

    So if the HTML was:

    <html> <!-- standard header removed for brevity --> 
    
    <body> <p class=a1> word1 word2</p>
    <div class=a1>
    text3
    </div> 
     
    The new text could be something like:


    <html> <!-- standard header removed for brevity --> 
    
    <body> <p class=a1> <a href="lookup.php?w=word1">word1</a> word2  </p>
    <div class=a1>
    text3 
    </div> 
     
    Basically, I need all the text and then need to insert some text/markup in the same place, with the old HTML left as it was.
     
    tgkprog, Jul 7, 2008 IP
  4. bucabay

    bucabay Peon

    #4
    You might just want to use the DOM parser or XML parsing if you want to traverse every tag and make replacements.
    That would mean your HTML would have to validate as XML.
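For what it's worth, PHP's DOM extension also offers `DOMDocument::loadHTML()`, which accepts markup that is not well-formed XML (unquoted attributes, unclosed tags). A rough sketch of the DOM route using it is below. The function name `linkWordsWithDom()`, the `lookup.php?w=` URL and the extra skip entries (`style`, `textarea`, `a`) are assumptions for illustration. Note that `saveHTML()` re-serializes the whole document, so attributes get quoted and a doctype may be added; the markup is preserved structurally, not byte for byte.

```php
<?php
// Link the given words in text nodes only, skipping text inside the elements in $skip.
function linkWordsWithDom($html, array $words, array $skip = array('head', 'script', 'style', 'embed', 'object', 'textarea', 'a'))
{
    if (!$words || trim($html) === '') {
        return $html;
    }

    $doc = new DOMDocument();
    // Collect libxml's complaints about sloppy markup instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html); // caveat: assumes ISO-8859-1 unless the page declares a charset
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $pattern = '/\b(' . implode('|', array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words)) . ')\b/u';

    $xpath = new DOMXPath($doc);
    // Copy the node list into an array first, because the loop edits the tree.
    $textNodes = iterator_to_array($xpath->query('//text()'), false);

    foreach ($textNodes as $node) {
        // Leave the node alone if any ancestor is in the skip list.
        for ($p = $node->parentNode; $p !== null; $p = $p->parentNode) {
            if (in_array(strtolower($p->nodeName), $skip, true)) {
                continue 2;
            }
        }

        // Odd-numbered parts are the matched words, even-numbered parts the text around them.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue;
        }

        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', 'lookup.php?w=' . urlencode($part));
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    return $doc->saveHTML();
}
```

Calling `linkWordsWithDom($html, array('word1'))` on the sample page from post #3 should link `word1` inside the paragraph while leaving anything in `head` or `script` alone.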
     
    bucabay, Jul 7, 2008 IP
  5. tgkprog

    tgkprog Peon

    #5
    No, I tried XML. It's too strict. This is human-written HTML. Is there any other non-strict parser around?
     
    tgkprog, Jul 7, 2008 IP
  6. ipro

    ipro Active Member

    #6
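    That's even simpler.

    The simplest way to do this would be:

    $prefix = '<a href="lookup.php?w=word1">';
    $suffix = "</a>";
    // Collect the text that sits between a ">" and the next "<".
    preg_match_all("/>([^<]+)</", $body, $matches);

    // Build one replacement per text segment so tags and attributes stay untouched.
    $pairs = array();
    foreach($matches[1] as $value)
    {
        $pairs[">".$value."<"] = ">".str_replace("word1", $prefix."word1".$suffix, $value)."<";
    }
    // strtr() works in a single pass, so inserted links are not processed again.
    $body = strtr($body, $pairs);

    You can make it faster and more exact by using a preg_replace callback.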
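The callback variant mentioned above might look roughly like this. The function name `linkWordBetweenTags()` is made up for illustration and the `lookup.php?w=` URL comes from the example in post #3. Like the loop version, it still touches text inside `script` or `style` blocks, so treat it as a sketch and not a complete solution.

```php
<?php
// Link $word wherever it appears in text that sits between a ">" and the next "<".
function linkWordBetweenTags($body, $word)
{
    $link = '<a href="lookup.php?w=' . urlencode($word) . '">' . $word . '</a>';
    $wordPattern = '/\b' . preg_quote($word, '/') . '\b/';

    return preg_replace_callback('/>([^<]+)</', function ($m) use ($wordPattern, $link) {
        // $m[1] is the text between two tags; only whole-word matches are linked.
        $text = preg_replace_callback($wordPattern, function () use ($link) {
            return $link;
        }, $m[1]);
        return '>' . $text . '<';
    }, $body);
}

echo linkWordBetweenTags('<body> <p class=a1> word1 word2</p>', 'word1');
// prints: <body> <p class=a1> <a href="lookup.php?w=word1">word1</a> word2</p>
```

The whole page is scanned once and each text segment is rewritten in place, which avoids both the repeated `str_replace` calls and any chance of linking a word twice.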
     
    ipro, Jul 7, 2008 IP
  7. tgkprog

    tgkprog Peon

    #7
    OK, thank you. So you think that a regex is better than a parser? I was just testing http://www.phpclasses.org/browse/package/1420.html and it works pretty well too.

    I found this regex somewhere; it is supposed to handle a lot of quirks too:

       $regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

    I don't know much about regex. Which do you think I should use?

    I have attached the regex, as it is not showing correctly here.
     

    tgkprog, Jul 7, 2008 IP