regular expression - how reliable is this??

Discussion in 'PHP' started by vurentjie, Sep 19, 2008.

  1. #1
    Hi,

    I have been trying to come up with an expression that will grab the content of an HTML page, but only from the opening body tag to its closing counterpart. I would really appreciate feedback on this, as it is about my fifth attempt. I have been testing them all evening on different pages to check their reliability. This one works on every page I have tested, but because some of my earlier attempts worked for some pages and not others, I am a bit unsure. Here it is:
    preg_match_all("/<\s*(body)((\s*(.)*[^a-zA-z0-9]\s*)*)(\/)(body)\s*>/i", $file, $matches);    
    Code (markup):
    I am using this in a site editor module I am building. I would like to gather the current site layout with PHP before sending it through to the JavaScript, as I have other database actions that could be handled better this way. I do have a JavaScript workaround, but it is not as pretty, so I would fall back on it only if this becomes too risky.

    Advice please!
     
    vurentjie, Sep 19, 2008 IP
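For comparison with the pattern in the question, here is a minimal sketch of a shorter, lazy version of the same idea. The helper name `extract_body_regex` and the sample page are made up for illustration; this is one possible pattern, not a vetted replacement.

```php
<?php
// Hypothetical helper: returns the markup between <body ...> and </body>,
// or null when no such pair is found.
function extract_body_regex(string $html): ?string
{
    // \b keeps "<bodyguard>" from matching, [^>]* skips the tag's attributes,
    // (.*?) is lazy so it stops at the first </body>,
    // "i" ignores case and "s" lets "." match newlines.
    if (preg_match('~<body\b[^>]*>(.*?)</body\s*>~is', $html, $m) === 1) {
        return $m[1];
    }
    return null;
}

// Made-up sample page with an upper-case tag and an attribute.
$page = "<html><head><title>t</title></head>\n<BODY class=\"home\">\n<p>Hello</p>\n</BODY></html>";
echo extract_body_regex($page);
```

Even this version can be fooled: an attribute value containing `>` or a literal `</body>` inside a script string or comment would cut the match short, which is the kind of unreliability the question is worried about.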
  2. classic

    #2
    You should avoid regex as much as possible; you don't want your server killed, especially on a high-traffic site.
    There are much simpler and faster ways to achieve what you need.
    Try loading the file into SimpleXML and getting the body:
    
    
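    libxml_use_internal_errors( true ); // collect parse errors instead of printing warnings
    // One-off step: save a local copy of the page to parse.
    //file_put_contents( 'data.x', file_get_contents("http://www.php.net/mysql_connect") );
    $x = simplexml_load_file("data.x");
    if ( $x === false ) {
        die( "data.x is not well-formed XML" ); // SimpleXML only accepts well-formed markup
    }
    echo $x->body->asXML();
    //or, if the page uses upper-case tags (element names are case-sensitive)
    echo $x->BODY->asXML();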
    
    PHP:
     
    classic, Sep 19, 2008 IP
  3. vurentjie

    #3
    Thanks, I am going to look into this. I hadn't actually heard of this construct(?).

    Sounds pretty awesome! The site editor that I am working on (about the third draft, trying to make it really lean) is not for public access; it is a back-end module for clients (who don't code) to set up their site. Well, at least that is the initial idea. My first two working drafts work off a predefined template and are a bit chunky, but now I am experimenting with anonymous templates to see if it can actually become a plugin, drawing from what I did originally.
    Thanks!
     
    vurentjie, Sep 19, 2008 IP
  4. Panzer

    #4
    Yeah, regexes are pretty hard on the server. There are quite a few DOM libraries for PHP; I advise using one of them.
     
    Panzer, Sep 19, 2008 IP
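As one illustration of the DOM route, PHP's bundled DOM extension can pull out the body without a regex. The sketch below assumes the `dom` extension is enabled; the helper name `extract_body_dom` and the sample page are made up for illustration.

```php
<?php
// Hypothetical helper: returns the inner HTML of <body>, or null if it cannot be found.
function extract_body_dom(string $html): ?string
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    // Serialize each child of <body> so the <body> tag itself is left out.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Made-up sample page with an unclosed <br>, which the HTML parser accepts.
$page = '<html><head><title>t</title></head><body><p>Hello<br>world</p></body></html>';
echo extract_body_dom($page);
```

The returned markup is re-serialized by the parser, so it may not match the source byte for byte (for example, missing closing tags get added).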
  5. vurentjie

    #5
    I had a quick fiddle with SimpleXML.

    It is very cool, but my problem at the moment is loading HTML pages that have strange tags. First I had to google how HTML is loaded, but then I tried to load a page that had a 'nobr' tag and it just didn't want to parse. It seems like a very neat tool, but until I can find a way to filter 'untidy' or unwanted HTML, I don't really want to use it, as I am trying to set up the editor for any anonymous page, including ones I might not have coded myself.


    Edit: I did find an interesting package called HTML Tidy....
     
    vurentjie, Sep 19, 2008 IP
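One possible way around the parsing problem described above, sketched here with PHP's bundled `dom` and `simplexml` extensions rather than HTML Tidy: let the lenient HTML parser repair the page first, then hand the repaired tree to SimpleXML. The sample markup is made up, and the guess that the original page failed because it was not well-formed XML is an assumption.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings out of the output

// Made-up untidy markup: an unclosed <br>, a non-standard <nobr>, a missing </p>.
$html = '<html><body><p>one<br>two<nobr>no wrap</nobr></body></html>';

// A strict XML parse rejects this markup, so SimpleXML returns false.
$strict = simplexml_load_string($html);

// The HTML parser repairs the tree; simplexml_import_dom() then exposes it to SimpleXML.
$doc = new DOMDocument();
$doc->loadHTML($html);
$lenient = simplexml_import_dom($doc);
libxml_clear_errors();

var_dump($strict === false);
echo $lenient->body->asXML(), "\n";
```

After this step the usual SimpleXML access (`$lenient->body`, `xpath()`) works on pages that `simplexml_load_file` would refuse.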
  6. vurentjie

    #6
    Thanks a lot guys, you definitely pointed me in the right direction.

    I found PHP Simple HTML DOM Parser on the web and have been testing it out. It is pretty impressive and just what I need.
     
    vurentjie, Sep 19, 2008 IP