preg_replace to fix tags

Tanner Peon

#1

Hi all, first time post here.

So I'm banging my head on the table trying to understand regular expressions with preg_replace.

I'm trying to take HTML code and throw it into an XML. For the most part it works but where it fails is when the parser is expecting something and it's not there.

For example, if I have:

<img src="http://farm1.static.flickr.com/xx/xxxxxxxx/xxxx_m.jpg" width="240" height="179" alt="Some text">

The above throws off the XML parser. Everything is fine once the missing / is added before the >.

I basically got to this point:

$text = preg_replace("#<img src=('|\")((http|ftp|https|ftps)://)([^ \?&=\#\"\n\r\t<]*?(\.(jpg|jpeg|gif|png)))('|\")>#sie", "'<img src=\\1\\2\\4\\7/>'", $text);

The problem is if there are the other things after the src="" such as width, height, alt, and so on, it doesn't work, but works fine if I have

<img src="http://farm1.static.flickr.com/xx/xxxxxxxx/xxxx_m.jpg">

Anybody suggest what I need to add to make this work with almost every possible thing that somebody can use with <img src="">?

TIA,
Mike

Tanner, Mar 7, 2008 IP

zerxer Peon

#2

Well, right before your >, you would have to test to see if there is any other data. You could throw ([^>]*) right in front of your >

You should also maybe put that between your <img and src in case they juggle their attributes around such as <img width="100" src="blahblah">, you know?

EDIT: Also, try checking out this thread where I helped someone fix another problem that you might run into (if they don't use any quotes around their src).

zerxer, Mar 7, 2008 IP

blueparukia Well-Known Member

#3

Why not just use this?

<(.*)img(.*)>

It does the same basic thing, and allows all image extensions (.tif, anyone?) and allows for missing attributes and whitespace.

blueparukia, Mar 7, 2008 IP

zerxer Peon

#4

blueparukia said: ↑
Why not just use this?

<(.*)img(.*)>

It does the same basic thing, and allows all image extensions (.tif, anyone?) and allows for missing attributes and whitespace.

Why would you need the (.*) between the opening < and the tag name (img)? If they had any white space, like < img, it wouldn't parse.

I wouldn't suggest using <img(.*)> either because in some cases, it could keep using every char as the . and use the last > on the page as the closing. If you wanna go this way, use ([^>]*) instead so that it does every char except > so that it stops properly when you have the > by itself at the end. I would also suggest having at least one space between the img and the any-char regex because if there isn't a space immediately after the tag name, it wouldn't parse as an image either.

<img ([^>]*)>

We don't really know what he's trying to do though so he might not want to grab it like this.

zerxer, Mar 7, 2008 IP

blueparukia Well-Known Member

#5

Ahh right, XML. Needs to be all valid, in which case you should probably also check if an alt attribute exists, and if it doesn't, create an empty one.

blueparukia, Mar 7, 2008 IP
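
A minimal sketch of the alt check suggested in #5, assuming PHP 7 or later. The helper name add_missing_alt is hypothetical, and the check is deliberately simple: it only looks for whitespace followed by alt=, so it can be fooled by that text appearing inside another attribute's value.

```php
<?php
// Hypothetical helper: add alt="" to any <img> tag that has no alt attribute.
function add_missing_alt(string $html): string
{
    return preg_replace_callback(
        '#<img\b[^>]*>#i',
        function (array $m): string {
            $tag = $m[0];
            if (preg_match('#\salt\s*=#i', $tag)) {
                return $tag; // already has an alt attribute, leave it alone
            }
            // Insert alt="" just before the closing > or />.
            return preg_replace('#\s*/?>$#', ' alt=""$0', $tag);
        },
        $html
    );
}

echo add_missing_alt('<img src="a.png">'), "\n";           // <img src="a.png" alt="">
echo add_missing_alt('<img src="b.png" alt="B" />'), "\n"; // unchanged
```

Using preg_replace_callback keeps the "does it already have alt?" test separate from the rewrite, which is easier to read than cramming both into one pattern.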

lephron Active Member

#6

Regular expressions are greedy by default, so the regex "<img(.*)>" would match up to the last > character on the line, not just the > that closes the img tag.

To improve the performance of the regex, make the group non-capturing so it doesn't store the part in the brackets:

<img (?:[^>]*)>

lephron, Mar 8, 2008 IP
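
The greedy behaviour described in #4 and #6 can be seen with a small sketch. The input string is a made-up example, not taken from the thread.

```php
<?php
// Hypothetical one-line input with more than one tag on the same line.
$html = '<img src="a.png"> and <b>bold</b>';

// Greedy: (.*) runs on to the LAST > on the line.
preg_match('#<img(.*)>#', $html, $greedy);

// Negated class: ([^>]*) cannot cross a >, so it stops at the first one.
preg_match('#<img ([^>]*)>#', $html, $bounded);

echo $greedy[0], "\n";   // <img src="a.png"> and <b>bold</b>
echo $bounded[0], "\n";  // <img src="a.png">
```

The negated class is usually the safer choice here because it cannot run past the end of the tag, whatever follows on the line.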

zerxer Peon

#7

lephron said: ↑

Regular expressions are greedy by default, so the regex "<img(.*)>" would match up to the last > character on the line, not just the > that closes the img tag.

To improve the performance of the regex, make the group non-capturing so it doesn't store the part in the brackets:

<img (?:[^>]*)>

Yeah. I was gonna suggest that last part as well but so far it seems he wants that data.

zerxer, Mar 8, 2008 IP

Tanner Peon

#8

Thanks, I'll try that. Yes, I want to keep all of the attributes, especially the ones where a user copies and pastes the HTML from their flickr image. It seems that phpBB handles all of the other tags properly... so far, from what I can tell.

Tanner, Mar 8, 2008 IP

Tanner Peon

#9

So with the <img (?:[^>]*)> it does find all of the img tags, but I'm still struggling with preg_replace... this also finds the ones that already have a / before the >, which need to be excluded, and I'm still trying to figure out how to replace the ones that don't have the closing slash with the same img and all of its attributes.

I was thinking that

preg_replace("#<img (?:[^>]*)>#","'<img $1/>", $text);

would sort of work but it doesn't.

Anybody suggest what I'm doing wrong? I really need to get a book on regular expressions.... tried understanding this stuff and obviously failing at it

Tanner, Mar 9, 2008 IP

zerxer Peon

#10

preg_replace("#<img ([^/>]+)>#","'<img $1/>", $text);

lephron confused you a bit. Only use ?: at the beginning of your subpatterns (the parentheses) when you don't want to store the sub-matched string in a backreference (e.g. $1). If you put ?: in your only subpattern, the $1 that you're trying to use in the replacement string won't do anything. Also, I added / to the square brackets because you only want to match ones that don't have a / before the >, correct?

Also, not sure if you notice or not or whether you wanted it that way, but you have a single quote at the beginning of your replace string. '<img $1/>
Might be a typo you made while putting the double quote at the beginning, just letting you know.

zerxer, Mar 10, 2008 IP
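
A small sketch of the difference between a capturing and a non-capturing group in the replacement, using a made-up tag as input. With (?:...) there is no group 1, and PHP expands the unmatched $1 to an empty string, so the attributes are lost.

```php
<?php
$tag = '<img src="a.png">';

// Non-capturing group: no group 1 exists, so $1 expands to an empty string.
$lost = preg_replace('#<img (?:[^>]*)>#', '<img $1/>', $tag);

// Capturing group: $1 holds everything between "<img " and ">".
$kept = preg_replace('#<img ([^>]*)>#', '<img $1/>', $tag);

echo $lost, "\n"; // <img />
echo $kept, "\n"; // <img src="a.png"/>
```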

Tanner Peon

#11

Thanks zerxer. I tried it out and it didn't work.... it didn't find the img tag for example, in the below html fragment:

<a href="http://www.flickr.com/photos/metrix_feet/1267290550/" title="Photo Sharing"><img src="http://farm2.static.flickr.com/1255/1267290550_9d17cc03a3.jpg" width="401" height="500" alt="Siberian Lynx"></a>

I started to cheat and look for the alt="xxxxxx"> now ...

$message = preg_replace('/(alt=\".+?\")>/','$1/>',$message);

... since the img has either just the src attribute and nothing else, or the one similar to the above from flickr. Seems to work.

Tanner, Mar 10, 2008 IP
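
One likely reason the pattern from #10 missed the flickr tag is that [^/>]+ rejects every slash, including the ones inside the http:// URL, so it can never reach the closing >. A possible alternative, sketched below under the assumption of PHP 7 or later, is to capture the attributes lazily and treat a trailing slash as optional. The helper name self_close_img is hypothetical, and the pattern will still misbehave if an attribute value contains a literal > character.

```php
<?php
// Hypothetical helper: self-close <img> tags while keeping their attributes.
function self_close_img(string $html): string
{
    // ([^>]*?) lazily captures the attributes; \s*/? swallows an existing
    // trailing slash so already-closed tags do not get a second one.
    return preg_replace('#<img\b([^>]*?)\s*/?>#i', '<img$1 />', $html);
}

// The fragment from #11.
$message = '<a href="http://www.flickr.com/photos/metrix_feet/1267290550/" title="Photo Sharing">'
         . '<img src="http://farm2.static.flickr.com/1255/1267290550_9d17cc03a3.jpg" width="401" height="500" alt="Siberian Lynx"></a>';

echo self_close_img($message), "\n";
```

Because it does not depend on alt being the last attribute, this also covers tags that have only a src, and it leaves tags that already end in /> effectively unchanged apart from normalising the spacing before the slash.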
