regular expression - how reliable is this??


vurentjie
Sep 19th 2008, 3:59 pm
Hi,

I have been trying to come up with an expression that will grab the content of an HTML page, but only from the opening body tag to its closing counterpart. I would really appreciate feedback on this, as it is about my fifth attempt. I have been testing them all evening on different pages to check the reliability. This one works on all the pages I have tested, but because some of my earlier attempts worked for some pages and not others I am a bit unsure, so here it is:
preg_match_all("/<\s*(body)((\s*(.)*[^a-zA-z0-9]\s*)*)(\/)(body)\s*>/i", $file, $matches);

I am using this with a site editor module I am building. I would like to be able to gather the current site layout with PHP before sending it through to the JavaScript, as I have other database actions that could be handled better this way, but I do have a JavaScript workaround that is not as pretty if this becomes too risky.

Advice please!

classic
Sep 19th 2008, 5:03 pm
You should avoid regex for this as much as possible; you don't want your server killed, especially if it is a high-traffic one.
There are much simpler and faster ways to achieve what you need.
Try loading the file into SimpleXML and getting the body:


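libxml_use_internal_errors( true ); // collect parse errors instead of printing warnings
// one-off: save the page locally first
//file_put_contents( 'data.x', file_get_contents("http://www.php.net/mysql_connect") );
$x = simplexml_load_file("data.x"); // returns false if the file is not well-formed XML
if ( $x === false ) {
    die( "could not parse data.x as XML" );
}
// element names are case-sensitive, so use whichever matches the page
echo $x->body->asXML();
//or
echo $x->BODY->asXML();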

vurentjie
Sep 19th 2008, 5:30 pm
thanks, i am going to look into this, hadn't actually heard of this construct(?),

sounds pretty awesome! the site editor that i am working on (about third draft - trying to make it real lean) is not for public access, it is a back-end module for clients (who don't code) to set up their site, well at least that is the initial idea. my first two working drafts work off of a predefined template and are a bit chunky, but now I am experimenting with anonymous templates to see if it can actually become a plugin, drawing on what I did originally.
thanks!

Panzer
Sep 19th 2008, 5:58 pm
Yeah, regexps like that can be pretty hard on the server. There are quite a few DOM libraries for PHP, I'd advise using one of them.
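
For example, here is a minimal sketch using DOMDocument from PHP's bundled DOM extension. Its HTML parser is a lot more forgiving of untidy markup than an XML parser. The helper name extract_body and the sample markup are made up for illustration, so treat this as a starting point rather than a finished solution:

```php
<?php
// Hypothetical helper (not from this thread): returns the markup of the
// <body> element, or null if the page could not be parsed or has no body.
function extract_body(string $html): ?string
{
    // Collect libxml warnings internally; untidy HTML produces plenty of them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html); // note: an empty string is rejected

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    // The HTML parser lowercases element names, so <BODY> is found too.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    // Serialize only the body node (passing a node needs PHP 5.3.6 or later).
    return $doc->saveHTML($body);
}

// Made-up sample with an uppercase tag, a nobr tag and an unclosed paragraph.
$sample = '<html><head><title>t</title></head>'
        . '<BODY class="x"><nobr>hello<p>unclosed</BODY></html>';
echo extract_body($sample), "\n";
```

The parser repairs the markup as it goes, so what comes back may not be byte-for-byte what was in the original file.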

vurentjie
Sep 19th 2008, 6:10 pm
i had a quick fiddle with the simplexml,

it is very cool, but my problem at the moment is when i try to load html pages that have strange tags. first i had to google to see the way html is loaded, but then i tried to load a page that had a 'nobr' tag and it just didn't want to parse. it seems like a very neat tool, but until i can find a way to filter 'untidy' or unwanted html, I don't really want to use it, as I am trying to set up the editor for any anonymous page that i might not have coded myself


edit: i did find an interesting package called html tidy....
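
A rough sketch of how that could be wired in through PHP's tidy extension, assuming the extension is installed (it is optional, so the sketch falls back to returning the input untouched). The helper name tidy_up, the two config options chosen and the sample markup are assumptions for illustration only:

```php
<?php
// Hypothetical helper (name made up for this sketch): repair untidy markup
// with the tidy extension when it is loaded, otherwise return the input as-is.
function tidy_up(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // ext-tidy not installed: nothing we can do here
    }

    // 'output-xhtml' asks Tidy for well-formed XHTML; 'numeric-entities'
    // turns named entities such as &nbsp; into numeric ones.
    $config = ['output-xhtml' => true, 'numeric-entities' => true];

    $clean = tidy_repair_string($html, $config, 'utf8');

    // Fall back to the original markup if Tidy could not repair it.
    return $clean === false ? $html : $clean;
}

// Made-up sample with a nobr tag and an unclosed paragraph.
$untidy = '<html><body><nobr>hello<p>unclosed</body></html>';
echo tidy_up($untidy), "\n";
```

The cleaned string could then be handed to whichever parser is being used, though whether a given page survives that round trip would still need testing.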

vurentjie
Sep 19th 2008, 7:40 pm
thanks a lot guys, you definitely pointed me in the right direction,

i found PHP Simple HTML DOM Parser on the web and have been testing it out, it is pretty impressive and just what I need.