I'm working on a PHP project that uses preg_match_all to search the HTML of a web page for the URLs of certain images. As a first step, I went to Google's image search and looked for "whales":

http://www.google.com/images?hl=en&source=imghp&q=whales&gbv=2&aq=f&aqi=&aql=&oq=&gs_rfai=

Then I wrote a PHP script to go through the HTML and list the URLs of all the whale images that come up:

<?php
$url = "http://www.google.com/images?hl=en&source=imghp&q=whales&gbv=2&aq=f&aqi=g10&aql=&oq=&gs_rfai=";
$str = file_get_contents($url);
preg_match_all("/http\:\/\/t[0-9]\.gstatic.com\/images?(.+?)\.jpg/",$str,$matches);
foreach( $matches[0] as $match ){
    echo $match.'<br />';
}
?>

And here are the results:

http://t1.gstatic.com/images?q=tbn:AUYt3zv38OjmhM:http://www.travel-vancouver-island.com/data/media/3/resident-killer-whales_154.jpg
http://t1.gstatic.com/images?q=tbn:Y0mRP70-9VC7GM:http://www.smh.com.au/ffximage/2006/03/28/whaling_narrowweb__300x377,0.jpg
http://t3.gstatic.com/images?q=tbn:48GvnPRq3JdloM:http://yvonnelindsay.files.wordpress.com/2009/07/orca.jpg
http://t3.gstatic.com/images?q=tbn:lDdfIshtFMMWuM:http://www.kspc.org/blog/pix/2009/whales.jpg
http://t1.gstatic.com/images?q=tbn:Ou053LdnZ3HSJM:http://www.zoobangoo.com/content/wp-content/uploads/2009/07/humpback-whales-singing.jpg
http://t1.gstatic.com/images?q=tbn:3fydu8qTB2eMsM:http://rdouglasfields.files.wordpress.com/2010/02/killer-whale.jpg
http://t1.gstatic.com/images?q=tbn:aeYZfdO7srkQmM:http://www.saveyourheritage.com/images/humpback-whales.jpg
http://t3.gstatic.com/images?q=tbn:-rqzZxnYS00aSM:http://www.lifeinthefastlane.ca/wp-content/uploads/2007/09/humpback_whale_sfw.jpg
http://t3.gstatic.com/images?q=tbn:W9dXY9zaafu4lM:http://thegoldenspiral.org/wp-content/uploads/2008/10/humpback_whale_02.jpg
http://t0.gstatic.com/images?q=tbn:yif4QybyjkGb1M:http://www.hickerphoto.com/data/media/42/orca_whales_T8067.jpg
http://t0.gstatic.com/images?q=tbn:1Yx0k3o9Jjv8GM:http://www.topnews.in/files/Humpback-Whales.jpg http://t2.gstatic.com/images?q=tbn:JeT606tj3YWo0M:http://sarahpriyanka13.files.wordpress.com/2008/06/two-killer-whales_5872.jpg http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.smh.com.au/ffximage/2006/03/28/whaling_narrowweb__300x377,0.jpg http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://yvonnelindsay.files.wordpress.com/2009/07/orca.jpg http://t3.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.kspc.org/blog/pix/2009/whales.jpg http://t3.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.zoobangoo.com/content/wp-content/uploads/2009/07/humpback-whales-singing.jpg http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://rdouglasfields.files.wordpress.com/2010/02/killer-whale.jpg http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.saveyourheritage.com/images/humpback-whales.jpg http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.lifeinthefastlane.ca/wp-content/uploads/2007/09/humpback_whale_sfw.jpg http://t3.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://thegoldenspiral.org/wp-content/uploads/2008/10/humpback_whale_02.jpg http://t3.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.hickerphoto.com/data/media/42/orca_whales_T8067.jpg http://t0.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.topnews.in/files/Humpback-Whales.jpg http://t0.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://sarahpriyanka13.files.wordpress.com/2008/06/two-killer-whales_5872.jpg http://t2.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.bigbluetech.net/big-blue-tech-news/wp-content/uploads/2009/03/whales_1358200c.jpg 
http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://brianlean.files.wordpress.com/2007/12/w.jpg http://t2.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://fathertheo.files.wordpress.com/2010/02/three_beached_whales_1577.jpg http://t3.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.exzooberance.com/virtual%2520zoo/they%2520swim/humpback%2520whale/Humpback%2520Whale%2520485076.jpg http://t0.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.greenpeace.org/raw/image_full/international/photosvideos/photos/an-endangered-fin-whale-harpo-5.jpg http://t3.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.sad61.k12.me.us/~kknox/Reading%2520Round%2520Up_files/two-humpback-whales-breaching.jfif\x26imgrefurl\x3dhttp://www.sad61.k12.me.us/~kknox/Index.html\x26usg\x3d__K4G3Luikdj_eJtSe106El1jLp4c\x3d\x26h\x3d280\x26w\x3d430\x26sz\x3d45\x26hl\x3den\x26start\x3d18\x26itbs\x3d1","","XAgMrMNCf8DtJM:","http://www.sad61.k12.me.us/~kknox/Reading%2520Round%2520Up_files/two-humpback-whales-breaching.jfif","126","82","Baby \x3cb\x3eWhales\x3c/b\x3e","","","430 × 280 - 45k","jfif","sad61.k12.me.us","","","http://t0.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.agorafinancial.com/afrude/wp-content/whale.gif\x26imgrefurl\x3dhttp://rudeawakening.agorafinancial.com/2007/10/02/beached-whales-and-economic-omens/\x26usg\x3d__Pzafe8gkcZixGHBmTR9KzeBfKQs\x3d\x26h\x3d354\x26w\x3d450\x26sz\x3d72\x26hl\x3den\x26start\x3d19\x26itbs\x3d1","","7kAFSPZVmo2G0M:","http://www.agorafinancial.com/afrude/wp-content/whale.gif","127","100","\x3cb\x3eWhales\x3c/b\x3e are large, beastly","","","450 × 354 - 72k","gif","rudeawakening.agorafinancial.com","","","http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://blog.redfin.com/orangecounty/files/2008/03/whales.jpg 
http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.hickerphoto.com/data/media/42/pictures-of-killer-whales_4250.jpg

So, it sort of works. It correctly prints the first few pictures, then starts garbling parts of the output, and then fails completely. I have no idea why it's doing this. Please help!
It looks like it's starting to pick up some of Google's JavaScript code as well. The problem is that the question mark is a special character in regular expressions, so the regex reads the "?" after "images" as making the "s" optional rather than as a literal "?". Try escaping it:

preg_match_all("/http\:\/\/t[0-9]\.gstatic.com\/images\?(.+?)\.jpg/",$str,$matches);
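To see the difference in isolation, here is a minimal sketch. The sample string is made up to stand in for the page source, and the helper name find_thumbnail_urls is hypothetical, so treat it as an illustration rather than the exact behaviour on Google's page.

```php
<?php
// Hypothetical sample standing in for the page source: one real-looking
// thumbnail URL, followed by JavaScript that merely mentions ".../images"
// and later a ".jpg". Single quotes keep "\x3d" as literal characters.
$sample = 'src=http://t1.gstatic.com/images?q=tbn:abc:http://example.com/a.jpg '
        . 'x=["http://t1.gstatic.com/images","1",[],"/imgres?imgurl\x3dhttp://example.com/b.jpg"]';

// Hypothetical helper: return every full match of $pattern in $text.
function find_thumbnail_urls(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

// Unescaped "?": the "s" in "images" becomes optional, so the pattern also
// accepts "images" followed by arbitrary text up to the next ".jpg".
$loose = find_thumbnail_urls('/http:\/\/t[0-9]\.gstatic\.com\/images?(.+?)\.jpg/', $sample);

// Escaped "\?": a literal question mark must follow "images".
$strict = find_thumbnail_urls('/http:\/\/t[0-9]\.gstatic\.com\/images\?(.+?)\.jpg/', $sample);

echo count($loose), " loose matches, ", count($strict), " strict matches\n";
```

On this sample the unescaped pattern returns two matches, the second of which is JavaScript, while the escaped pattern returns only the thumbnail URL.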
Thanks for the reply! I edited my code, but now I'm only getting the first 12 of the 21 images on the page. Any idea why this might be happening? Here's my new script:

<?php
$url = "http://www.google.com/images?hl=en&source=imghp&q=whales&gbv=2&aq=f&aqi=g10&aql=&oq=&gs_rfai=";
$str = file_get_contents($url);
preg_match_all("/http:\/\/t[0-9]\.gstatic\.com\/images\?q=tbn:(.+?)\.jpg/",$str,$matches);
foreach( $matches[0] as $match ){
    echo $match.'<br />';
}
echo $str;
?>

And here are my results:

http://t1.gstatic.com/images?q=tbn:AUYt3zv38OjmhM:http://www.travel-vancouver-island.com/data/media/3/resident-killer-whales_154.jpg
http://t1.gstatic.com/images?q=tbn:Y0mRP70-9VC7GM:http://www.smh.com.au/ffximage/2006/03/28/whaling_narrowweb__300x377,0.jpg
http://t3.gstatic.com/images?q=tbn:48GvnPRq3JdloM:http://yvonnelindsay.files.wordpress.com/2009/07/orca.jpg
http://t3.gstatic.com/images?q=tbn:lDdfIshtFMMWuM:http://www.kspc.org/blog/pix/2009/whales.jpg
http://t1.gstatic.com/images?q=tbn:Ou053LdnZ3HSJM:http://www.zoobangoo.com/content/wp-content/uploads/2009/07/humpback-whales-singing.jpg
http://t1.gstatic.com/images?q=tbn:3fydu8qTB2eMsM:http://rdouglasfields.files.wordpress.com/2010/02/killer-whale.jpg
http://t1.gstatic.com/images?q=tbn:aeYZfdO7srkQmM:http://www.saveyourheritage.com/images/humpback-whales.jpg
http://t3.gstatic.com/images?q=tbn:-rqzZxnYS00aSM:http://www.lifeinthefastlane.ca/wp-content/uploads/2007/09/humpback_whale_sfw.jpg
http://t3.gstatic.com/images?q=tbn:W9dXY9zaafu4lM:http://thegoldenspiral.org/wp-content/uploads/2008/10/humpback_whale_02.jpg
http://t0.gstatic.com/images?q=tbn:yif4QybyjkGb1M:http://www.hickerphoto.com/data/media/42/orca_whales_T8067.jpg
http://t0.gstatic.com/images?q=tbn:1Yx0k3o9Jjv8GM:http://www.topnews.in/files/Humpback-Whales.jpg
http://t2.gstatic.com/images?q=tbn:JeT606tj3YWo0M:http://sarahpriyanka13.files.wordpress.com/2008/06/two-killer-whales_5872.jpg

Thanks again!
Looking at the page source, it seems that Google only has image tags for 12 images for some reason, while the full set of results is output via JavaScript. I'm assuming it's for pre-fetching or something similar. Anyway, you'll need to adapt your regex to look in the JavaScript instead. Try this:

preg_match_all('/\/imgres\?imgurl\\\\x3d(.+?)\\\\x26/',$str,$matches);
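One detail worth checking when matching the escaped JavaScript: judging by the output earlier in the thread, the page source contains the literal characters "\x3d" and "\x26", so the pattern has to match a literal backslash. Inside a regex, a bare \x3d means the character "=" instead. A minimal sketch, assuming a shortened entry copied from that output and a hypothetical helper name extract_original_urls:

```php
<?php
// Shortened entry copied from the output earlier in the thread. A nowdoc
// keeps every backslash exactly as it appears in the page source.
$js = <<<'JS'
["/imgres?imgurl\x3dhttp://www.agorafinancial.com/afrude/wp-content/whale.gif\x26imgrefurl\x3dhttp://rudeawakening.agorafinancial.com/2007/10/02/beached-whales-and-economic-omens/\x26usg\x3d__Pzafe8gkcZixGHBmTR9KzeBfKQs\x3d\x26h\x3d354"
JS;

// Hypothetical helper: collect the original image URLs from the escaped JavaScript.
function extract_original_urls(string $source): array
{
    // In a single-quoted PHP string, four backslashes become two in the
    // regex, and two backslashes in the regex match one literal backslash.
    $pattern = '/\/imgres\?imgurl\\\\x3d(.+?)\\\\x26/';
    preg_match_all($pattern, $source, $matches);
    return $matches[1]; // capture group 1: the text between \x3d and \x26
}

print_r(extract_original_urls($js));
```

For this entry the helper returns a single URL, the whale.gif address on agorafinancial.com.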
Thanks for the reply! I looked over the page's source again, but the JavaScript doesn't have the actual URL of Google's image, only the original one. So I guess I'm going to have to stick to 12 pictures, unless there's another way to maybe find_all_pictures();?
You can piece the image URLs together using the JSON array. This isn't the most efficient regex out there, but it gets the job done:

preg_match_all("/\["\/imgres.+?"","([a-zA-Z0-9-_]+\:)","(.+?)",".+?(t[0-9]\.gstatic\.com).+?""\]/",$str,$matches);

Now $matches[1] will contain the tbn values, $matches[2] will contain the original URLs, and $matches[3] will contain the gstatic server to use. You can then piece them together in a for loop like this:

for ($i = 0; $i < count($matches[1]); $i++) {
    echo "http://" . $matches[3][$i] . "/images?q=tbn:" . $matches[1][$i] . $matches[2][$i] . '<br />';
}
Ah, I see. Thanks for the code! It seems to have a regex syntax issue or something, though:

Warning: Unexpected character in input: '\' (ASCII=92) state=1
Parse error: syntax error, unexpected '?'

But that should be pretty easy to fix. Again, thanks a lot!
Oops, that should've been single quotes rather than double quotes:

preg_match_all('/\["\/imgres.+?"","([a-zA-Z0-9-_]+\:)","(.+?)",".+?(t[0-9]\.gstatic\.com).+?""\]/',$str,$matches);
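Putting the single-quoted pattern and the loop together might look like the following sketch. It runs against one entry copied from the output earlier in the thread rather than a live page. The helper name build_thumbnail_urls is hypothetical, and the "/images?q=tbn:" URL shape is an assumption inferred from the thumbnail URLs listed at the top of the thread.

```php
<?php
// One entry copied from the output earlier in the thread. The nowdoc keeps
// the quotes and backslashes exactly as they appear in the page source.
$str = <<<'JS'
["/imgres?imgurl\x3dhttp://www.agorafinancial.com/afrude/wp-content/whale.gif\x26imgrefurl\x3dhttp://rudeawakening.agorafinancial.com/2007/10/02/beached-whales-and-economic-omens/\x26usg\x3d__Pzafe8gkcZixGHBmTR9KzeBfKQs\x3d\x26h\x3d354\x26w\x3d450\x26sz\x3d72\x26hl\x3den\x26start\x3d19\x26itbs\x3d1","","7kAFSPZVmo2G0M:","http://www.agorafinancial.com/afrude/wp-content/whale.gif","127","100","\x3cb\x3eWhales\x3c/b\x3e are large, beastly","","","450 × 354 - 72k","gif","rudeawakening.agorafinancial.com","","","http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""]
JS;

// Hypothetical helper: rebuild the gstatic thumbnail URLs from the entries.
function build_thumbnail_urls(string $source): array
{
    // Single-quoted, so the double quotes inside the pattern need no escaping.
    $pattern = '/\["\/imgres.+?"","([a-zA-Z0-9-_]+\:)","(.+?)",".+?(t[0-9]\.gstatic\.com).+?""\]/';
    preg_match_all($pattern, $source, $matches);

    $urls = [];
    foreach ($matches[1] as $i => $tbn) {
        // $tbn already ends with ":", so the original URL follows it directly.
        // Assumed URL shape: http://<server>/images?q=tbn:<id>:<original URL>
        $urls[] = 'http://' . $matches[3][$i] . '/images?q=tbn:' . $tbn . $matches[2][$i];
    }
    return $urls;
}

foreach (build_thumbnail_urls($str) as $url) {
    echo $url, "\n";
}
```

With double quotes around the pattern, the first unescaped " inside it would end the PHP string early, which is the kind of thing that produces the parse error quoted above. Single quotes or a nowdoc avoid that.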