I had someone write a test PHP script for me a long time ago for a simple little job. It essentially locates a particular area of one specific website I use and scrapes a table from it. My question is about regex, so before people tell me not to solve my HTML problem this way (use jQuery, use HTMLPurifier, and so on): the issue is that I'm not even close to being a coder (I'm a scientist) and can't do much myself. The original code is below.

PHP:
    function wpreg_get_data(){
        $html = wpreg_get_url_data("website url");
        preg_match_all('/<table><thead><tr><td colspan=\'4\' class=\'h3\'>(.*?)<\/table>/s',$html,$posts,PREG_SET_ORDER);
        $text = $posts[0][0];
        $full_text = strip_tags($text,'<table><tbody><tr><td><b><th>');
        return $full_text;
    }

BUT now the website has added class attributes to some of its tags, and I want to remove them. Here they are:

    <th class='alignLeft'>
    <td class='alignLeft'>
    <th class='noBr'>
    <td class='alignLeft noBr'>
    <tr class='bg2'>
    <td class='h3' colspan='4'>

So I thought I'd be smart and try to modify the code myself, as shown below. But it doesn't work. Any help? AM I EVEN CLOSE?

PHP:
    function wpreg_get_data(){
        $html = wpreg_get_url_data("website url");
        preg_match_all('/<table><thead><tr><td colspan=\'4\' class=\'h3\'>(.*?)<\/table>/s',$html,$posts,PREG_SET_ORDER);
        $text = $posts[0][0];
        $full_text = strip_tags($text,'<table><tbody><tr><td><b><th>');
        $content = preg_replace('<([^\s]+)(\s[^>]*?)?(?<!/)>', '', $full_text);
        return $content;
    }
I didn't expect to get any help with the regex option, given the discussions I regularly see about parsing HTML with regex. Does anyone have a DOMDocument solution?
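For what it's worth, one possible DOMDocument approach is sketched below. It assumes the scraped fragment has a single root element (the table) and that PHP's DOM extension is enabled. The function name strip_attributes_dom is made up for this example and is not part of the original script.

```php
<?php
// Hypothetical helper: removes every attribute from every element in an HTML fragment.
function strip_attributes_dom(string $html): string
{
    // Collect parser warnings internally instead of printing them for imperfect HTML.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // These flags ask libxml not to add <html>/<body> wrappers or a doctype.
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    // '*' selects every element; strip attributes one by one until none remain.
    foreach ($doc->getElementsByTagName('*') as $element) {
        while ($element->attributes->length > 0) {
            $element->removeAttributeNode($element->attributes->item(0));
        }
    }

    return trim($doc->saveHTML());
}

// Sample markup built from the tags listed in the question.
$sample = "<table><tr class='bg2'><td class='h3' colspan='4'>Results</td></tr></table>";
echo strip_attributes_dom($sample), PHP_EOL;
```

In the original function this would be called on $full_text just before the return. Note that loadHTML may guess the wrong character encoding for non-ASCII text, so that part would need checking against the real page.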
Whenever someone asks a question about regex, I always say not to overthink it. Regex is very easy if you approach it correctly. From what I understand, you want to remove all attributes from the HTML tags. You can do this:

Code (markup):
    $content = preg_replace('/<([a-zA-Z]+) .*>/U', '<$1>', $full_text);

We are telling it to replace anything of the form <TAG .*> with <TAG> (the tag name is captured in $1), and not to be greedy about it (the U modifier).
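To see what that pattern does, here is a small self-contained sketch that runs it over a few of the tags from the question. The wrapper name strip_tag_attributes and the sample string are made up for illustration.

```php
<?php
// Hypothetical wrapper around the suggested preg_replace call.
function strip_tag_attributes(string $html): string
{
    // Capture the tag name, then match a space and everything up to the first '>'.
    // The U modifier makes the quantifiers lazy, so each tag is matched on its own.
    return preg_replace('/<([a-zA-Z]+) .*>/U', '<$1>', $html);
}

// Sample row using class attributes listed in the question.
$sample = "<tr class='bg2'><td class='alignLeft noBr'>pH</td><th class='noBr'>7.4</th></tr>";
echo strip_tag_attributes($sample), PHP_EOL;
// prints: <tr><td>pH</td><th>7.4</th></tr>
```

Closing tags such as </td> are left alone because the pattern requires a letter right after the '<'. Two limits to keep in mind: a tag whose attributes span several lines will not match (the dot does not match newlines without the s modifier), and an attribute value that itself contains '>' would be cut short.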