regular expression - how reliable is this??

vurentjie Peon

#1

Hi,

I have been trying to come up with an expression that will grab the content of an HTML page, but only from the opening body tag to its closing counterpart. I would really appreciate feedback on this, as it is about my fifth attempt. I have been testing them all evening on different pages to check the reliability. This one works on all the pages I have tested, but because some of my earlier attempts worked for some pages and not others, I am a bit unsure. So here it is:
preg_match_all("/<\s*(body)((\s*(.)*[^a-zA-z0-9]\s*)*)(\/)(body)\s*>/i", $file, $matches);    
Code (markup):
I am using this in a site editor module I am building. I would like to gather the current site layout with PHP before sending it through to the JavaScript, as I have other database actions that could be handled better this way. I do have a JavaScript workaround that is not as pretty, in case this becomes too risky.

Advice please!
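
For comparison, here is a rough sketch of a simpler pattern for the same job. The nested quantifiers in the expression above can backtrack heavily on large pages; a lazy match with the `s` modifier (so `.` also matches newlines) avoids that. The function name `matchBody` and the sample markup are made up for illustration, and the pattern will still be fooled by a literal `</body>` inside a comment or script, so treat it as a stopgap rather than a reliable parser.

```php
<?php
// Hypothetical helper: returns the text between <body ...> and </body>,
// or null when no body element is found.
function matchBody(string $html): ?string
{
    // i = case-insensitive, s = dot matches newlines; (.*?) is lazy,
    // so it stops at the first closing body tag.
    if (preg_match('/<body\b[^>]*>(.*?)<\/body\s*>/is', $html, $m) === 1) {
        return $m[1];
    }
    return null;
}

// Made-up sample page for illustration.
$sample = "<html><head><title>t</title></head>\n<body class=\"x\">\n<p>Hello</p>\n</body></html>";
echo matchBody($sample);
```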

vurentjie, Sep 19, 2008 IP

classic Peon

#2

You should avoid regex as much as possible; you don't need your server killed, especially if you have a high-traffic one.
There are much simpler and faster ways to achieve what you need.
Try loading the file into SimpleXML and getting the body:
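libxml_use_internal_errors(true); // collect libxml parse errors instead of printing warnings
// Uncomment once to save a sample page locally:
//file_put_contents('data.x', file_get_contents("http://www.php.net/mysql_connect"));
$x = simplexml_load_file("data.x"); // returns false if the markup is not well-formed XML
if ($x !== false) {
    echo $x->body->asXML();
    //or, if the page uses uppercase tags:
    //echo $x->BODY->asXML();
}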
PHP:

classic, Sep 19, 2008 IP

vurentjie Peon

#3

Thanks, I am going to look into this. I hadn't actually heard of this construct(?).

Sounds pretty awesome! The site editor that I am working on (about the third draft, trying to make it really lean) is not for public access; it is a back-end module for clients (who don't code) to set up their site. Well, at least that is the initial idea. My first two working drafts work off a predefined template and are a bit chunky, but now I am experimenting with anonymous templates to see if it can actually become a plugin, drawing on what I did originally.
Thanks!

vurentjie, Sep 19, 2008 IP

Panzer Active Member

#4

Yeah, regexes can be pretty hard on the server. There are quite a few DOM libraries for PHP; I advise using one of them.
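
One such option is PHP's built-in `DOMDocument`, whose `loadHTML()` method is far more forgiving of untidy markup than an XML loader. Below is a minimal sketch, assuming the DOM extension is enabled; the helper name `extractBodyHtml` and the sample page are hypothetical and only there for illustration.

```php
<?php
// Hypothetical helper: returns the inner HTML of the <body> element,
// or null if the markup cannot be loaded or has no body.
function extractBodyHtml(string $html): ?string
{
    // Keep libxml from printing warnings about unknown or untidy tags.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    // Serialize each child of <body> so the result excludes the body tag itself.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Made-up sample page containing a non-standard tag.
$sample = '<html><head><title>Demo</title></head><body><nobr>Hello</nobr><p>World</p></body></html>';
echo extractBodyHtml($sample);
```

Restoring the previous `libxml_use_internal_errors()` setting keeps the helper from changing error handling for the rest of the script.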

Panzer, Sep 19, 2008 IP

vurentjie Peon

#5

I had a quick fiddle with SimpleXML.

It is very cool, but my problem at the moment is loading HTML pages that have strange tags. First I had to google how HTML is loaded, but then I tried to load a page that had a 'nobr' tag and it just didn't want to parse. It seems like a very neat tool, but until I can find a way to filter 'untidy' or unwanted HTML, I don't really want to use it, as I am trying to set up the editor for any anonymous page that I might not have coded myself.

Edit: I did find an interesting package called HTML Tidy...

vurentjie, Sep 19, 2008 IP

vurentjie Peon

#6

Thanks a lot guys, you definitely pointed me in the right direction.

I found PHP Simple HTML DOM Parser on the web and have been testing it out. It is pretty impressive and just what I need.

vurentjie, Sep 19, 2008 IP
