Problem With preg_match_all

Discussion in 'PHP' started by jebego, May 29, 2010.

  1. #1
    I'm currently making a PHP project where I use preg_match_all to sort through the html of a web page to find URLs to certain images.

    As a first step, I went to Google's image search, and looked for "whales"
    http://www.google.com/images?hl=en&source=imghp&q=whales&gbv=2&aq=f&aqi=&aql=&oq=&gs_rfai=

    Then, I created some PHP script to look through the html, and list all of the URLs of the whale images that come up:
    
    <?php
        $url = "http://www.google.com/images?hl=en&source=imghp&q=whales&gbv=2&aq=f&aqi=g10&aql=&oq=&gs_rfai=";

        $str = file_get_contents($url);

        preg_match_all("/http\:\/\/t[0-9]\.gstatic.com\/images?(.+?)\.jpg/",$str,$matches);

        foreach( $matches[0] as $match ){
            echo $match.'<br />';
        }
    ?>
    
    And here are the results:
    http://t1.gstatic.com/images?q=tbn:AUYt3zv38OjmhM:http://www.travel-vancouver-island.com/data/media/3/resident-killer-whales_154.jpg
    http://t1.gstatic.com/images?q=tbn:Y0mRP70-9VC7GM:http://www.smh.com.au/ffximage/2006/03/28/whaling_narrowweb__300x377,0.jpg
    http://t3.gstatic.com/images?q=tbn:48GvnPRq3JdloM:http://yvonnelindsay.files.wordpress.com/2009/07/orca.jpg
    http://t3.gstatic.com/images?q=tbn:lDdfIshtFMMWuM:http://www.kspc.org/blog/pix/2009/whales.jpg
    http://t1.gstatic.com/images?q=tbn:Ou053LdnZ3HSJM:http://www.zoobangoo.com/content/wp-content/uploads/2009/07/humpback-whales-singing.jpg
    http://t1.gstatic.com/images?q=tbn:3fydu8qTB2eMsM:http://rdouglasfields.files.wordpress.com/2010/02/killer-whale.jpg
    http://t1.gstatic.com/images?q=tbn:aeYZfdO7srkQmM:http://www.saveyourheritage.com/images/humpback-whales.jpg
    http://t3.gstatic.com/images?q=tbn:-rqzZxnYS00aSM:http://www.lifeinthefastlane.ca/wp-content/uploads/2007/09/humpback_whale_sfw.jpg
    http://t3.gstatic.com/images?q=tbn:W9dXY9zaafu4lM:http://thegoldenspiral.org/wp-content/uploads/2008/10/humpback_whale_02.jpg
    http://t0.gstatic.com/images?q=tbn:yif4QybyjkGb1M:http://www.hickerphoto.com/data/media/42/orca_whales_T8067.jpg
    http://t0.gstatic.com/images?q=tbn:1Yx0k3o9Jjv8GM:http://www.topnews.in/files/Humpback-Whales.jpg
    http://t2.gstatic.com/images?q=tbn:JeT606tj3YWo0M:http://sarahpriyanka13.files.wordpress.com/2008/06/two-killer-whales_5872.jpg
    http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.smh.com.au/ffximage/2006/03/28/whaling_narrowweb__300x377,0.jpg
    http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://yvonnelindsay.files.wordpress.com/2009/07/orca.jpg
    http://t3.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.kspc.org/blog/pix/2009/whales.jpg
    http://t3.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.zoobangoo.com/content/wp-content/uploads/2009/07/humpback-whales-singing.jpg
    http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://rdouglasfields.files.wordpress.com/2010/02/killer-whale.jpg
    http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.saveyourheritage.com/images/humpback-whales.jpg
    http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.lifeinthefastlane.ca/wp-content/uploads/2007/09/humpback_whale_sfw.jpg
    http://t3.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://thegoldenspiral.org/wp-content/uploads/2008/10/humpback_whale_02.jpg
    http://t3.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.hickerphoto.com/data/media/42/orca_whales_T8067.jpg
    http://t0.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.topnews.in/files/Humpback-Whales.jpg
    http://t0.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://sarahpriyanka13.files.wordpress.com/2008/06/two-killer-whales_5872.jpg
    http://t2.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.bigbluetech.net/big-blue-tech-news/wp-content/uploads/2009/03/whales_1358200c.jpg
    http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://brianlean.files.wordpress.com/2007/12/w.jpg
    http://t2.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://fathertheo.files.wordpress.com/2010/02/three_beached_whales_1577.jpg
    http://t3.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.exzooberance.com/virtual%2520zoo/they%2520swim/humpback%2520whale/Humpback%2520Whale%2520485076.jpg
    http://t0.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.greenpeace.org/raw/image_full/international/photosvideos/photos/an-endangered-fin-whale-harpo-5.jpg
    http://t3.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.sad61.k12.me.us/~kknox/Reading%2520Round%2520Up_files/two-humpback-whales-breaching.jfif\x26imgrefurl\x3dhttp://www.sad61.k12.me.us/~kknox/Index.html\x26usg\x3d__K4G3Luikdj_eJtSe106El1jLp4c\x3d\x26h\x3d280\x26w\x3d430\x26sz\x3d45\x26hl\x3den\x26start\x3d18\x26itbs\x3d1","","XAgMrMNCf8DtJM:","http://www.sad61.k12.me.us/~kknox/Reading%2520Round%2520Up_files/two-humpback-whales-breaching.jfif","126","82","Baby \x3cb\x3eWhales\x3c/b\x3e","","","430 × 280 - 45k","jfif","sad61.k12.me.us","","","http://t0.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.agorafinancial.com/afrude/wp-content/whale.gif\x26imgrefurl\x3dhttp://rudeawakening.agorafinancial.com/2007/10/02/beached-whales-and-economic-omens/\x26usg\x3d__Pzafe8gkcZixGHBmTR9KzeBfKQs\x3d\x26h\x3d354\x26w\x3d450\x26sz\x3d72\x26hl\x3den\x26start\x3d19\x26itbs\x3d1","","7kAFSPZVmo2G0M:","http://www.agorafinancial.com/afrude/wp-content/whale.gif","127","100","\x3cb\x3eWhales\x3c/b\x3e are large, beastly","","","450 × 354 - 72k","gif","rudeawakening.agorafinancial.com","","","http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://blog.redfin.com/orangecounty/files/2008/03/whales.jpg
    http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.hickerphoto.com/data/media/42/pictures-of-killer-whales_4250.jpg

    _________________________________________________________________________________

    So, it sort of works. It prints the first few image URLs correctly, then starts including junk in some of the matches, then falls apart completely.

    I have no idea why it's doing this. Please help!
     
    jebego, May 29, 2010
  2. Qc4
    #2
    It looks like it's starting to pick up some of Google's JavaScript code as well. The problem is that the regex thinks you're saying that the "s" in "images" should be optional rather than matching the "?", since the question mark is a special character in regular expressions. Try escaping it:

    preg_match_all("/http\:\/\/t[0-9]\.gstatic.com\/images\?(.+?)\.jpg/",$str,$matches);
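A minimal sketch of the difference, assuming a two-line sample pieced together from the output quoted above (the variable names are only for illustration). With the unescaped `?`, the pattern reads "image, optionally followed by s", so it also latches onto the gstatic URL inside the JavaScript data; with `\?` it only accepts a literal question mark after "images".

```php
<?php
// Two fragments taken from the results above: a real thumbnail URL,
// and a piece of the JavaScript data that the first regex also matched.
// The nowdoc keeps the backslash in \x3d as a literal character.
$sample = <<<'TXT'
http://t0.gstatic.com/images?q=tbn:1Yx0k3o9Jjv8GM:http://www.topnews.in/files/Humpback-Whales.jpg
http://t0.gstatic.com/images","1",[],"",1,"",[],"","",""],["/imgres?imgurl\x3dhttp://www.topnews.in/files/Humpback-Whales.jpg
TXT;

// "s?" makes the "s" optional; nothing here requires a "?" in the input.
$unescaped = '/http:\/\/t[0-9]\.gstatic\.com\/images?(.+?)\.jpg/';
// "\?" matches a literal question mark.
$escaped   = '/http:\/\/t[0-9]\.gstatic\.com\/images\?(.+?)\.jpg/';

$loose  = preg_match_all($unescaped, $sample, $m1);
$strict = preg_match_all($escaped, $sample, $m2);

printf("unescaped: %d matches, escaped: %d match\n", $loose, $strict);
```

On this sample the unescaped pattern picks up both lines, while the escaped one keeps only the thumbnail URL.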
     
    Qc4, May 29, 2010
  3. jebego
    #3
    Thanks for the reply! I ended up editing my code, but now I'm only getting the first 12 out of the total 21 on the page. Any idea why this might be happening?

    Here's my new script:
    
    <?php
        $url = "http://www.google.com/images?hl=en&source=imghp&q=whales&gbv=2&aq=f&aqi=g10&aql=&oq=&gs_rfai=";

        $str = file_get_contents($url);

        preg_match_all("/http:\/\/t[0-9]\.gstatic\.com\/images\?q=tbn:(.+?)\.jpg/",$str,$matches);

        foreach( $matches[0] as $match ){
            echo $match.'<br />';
        }
        echo $str
    ?>
    
    And here are my results:
    http://t1.gstatic.com/images?q=tbn:AUYt3zv38OjmhM:http://www.travel-vancouver-island.com/data/media/3/resident-killer-whales_154.jpg
    http://t1.gstatic.com/images?q=tbn:Y0mRP70-9VC7GM:http://www.smh.com.au/ffximage/2006/03/28/whaling_narrowweb__300x377,0.jpg
    http://t3.gstatic.com/images?q=tbn:48GvnPRq3JdloM:http://yvonnelindsay.files.wordpress.com/2009/07/orca.jpg
    http://t3.gstatic.com/images?q=tbn:lDdfIshtFMMWuM:http://www.kspc.org/blog/pix/2009/whales.jpg
    http://t1.gstatic.com/images?q=tbn:Ou053LdnZ3HSJM:http://www.zoobangoo.com/content/wp-content/uploads/2009/07/humpback-whales-singing.jpg
    http://t1.gstatic.com/images?q=tbn:3fydu8qTB2eMsM:http://rdouglasfields.files.wordpress.com/2010/02/killer-whale.jpg
    http://t1.gstatic.com/images?q=tbn:aeYZfdO7srkQmM:http://www.saveyourheritage.com/images/humpback-whales.jpg
    http://t3.gstatic.com/images?q=tbn:-rqzZxnYS00aSM:http://www.lifeinthefastlane.ca/wp-content/uploads/2007/09/humpback_whale_sfw.jpg
    http://t3.gstatic.com/images?q=tbn:W9dXY9zaafu4lM:http://thegoldenspiral.org/wp-content/uploads/2008/10/humpback_whale_02.jpg
    http://t0.gstatic.com/images?q=tbn:yif4QybyjkGb1M:http://www.hickerphoto.com/data/media/42/orca_whales_T8067.jpg
    http://t0.gstatic.com/images?q=tbn:1Yx0k3o9Jjv8GM:http://www.topnews.in/files/Humpback-Whales.jpg
    http://t2.gstatic.com/images?q=tbn:JeT606tj3YWo0M:http://sarahpriyanka13.files.wordpress.com/2008/06/two-killer-whales_5872.jpg

    Thanks again!
     
    jebego, May 29, 2010
  4. jebego
    #4
    Oh, and pay no attention to the echo $str line at the end. I was just testing something.
     
    jebego, May 29, 2010
  5. Qc4
    #5
    Looking at the page source, it seems that Google only has image tags for 12 images for some reason, while the full set of results is output via JavaScript. I'm assuming it's for pre-fetching or something similar. Anyway, you'll need to adapt your regex to look in the JavaScript instead. Try this:

    preg_match_all("/\/imgres\?imgurl\\\\x3d(.+?)\\\\x26/",$str,$matches);
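One thing to watch with a pattern like this: in the page source, \x3d and \x26 are literal text (a backslash followed by x3d or x26), not "=" and "&". PCRE itself treats \x3d as a hex escape for "=", so the backslash has to be escaped for the regex, and then again for the PHP string. A rough sketch of the two readings, assuming a shortened copy of one entry from the output above:

```php
<?php
// Shortened copy of one JavaScript entry from the results above.
// Nowdoc syntax keeps every backslash exactly as written.
$js = <<<'JS'
["/imgres?imgurl\x3dhttp://www.agorafinancial.com/afrude/wp-content/whale.gif\x26imgrefurl\x3dhttp://rudeawakening.agorafinancial.com/2007/10/02/beached-whales-and-economic-omens/\x26h\x3d354"
JS;

// Single-quoted PHP string: \\\\ becomes \\ in the pattern,
// which PCRE reads as one literal backslash.
$literal = '/\/imgres\?imgurl\\\\x3d(.+?)\\\\x26/';

// Here PCRE receives \x3d and \x26, i.e. the characters "=" and "&",
// which do not occur in the escaped JavaScript text.
$hex = '/\/imgres\?imgurl\x3d(.+?)\x26/';

preg_match_all($literal, $js, $a);
preg_match_all($hex, $js, $b);

print_r($a[1]); // the original image URL
print_r($b[1]); // empty for this sample
```

The second pattern is not useless: it would match an ordinary link of the form /imgres?imgurl=...&, but that is a different part of the page than the JavaScript data.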
     
    Qc4, May 29, 2010
  6. jebego
    #6
    Thanks for the reply! I looked over the page's source again, but the JavaScript doesn't have the actual URL to Google's image, only the original one.

    So, I guess I'm going to have to stick to 12 pictures, unless there's another way to maybe find_all_pictures();?
     
    jebego, May 29, 2010
  7. Qc4
    #7
    You can piece the image URLs together using the JSON array. This isn't the most efficient regex out there, but it gets the job done:

    preg_match_all("/\["\/imgres.+?"","([a-zA-Z0-9-_]+\:)","(.+?)",".+?(t[0-9]\.gstatic\.com).+?""\]/",$str,$matches);
    Now, $matches[1] will contain the tbn values, $matches[2] will contain the original URLs, and $matches[3] will contain the gstatic server to use. You can then piece them together in a for loop like this:

    for ($i = 0; $i < sizeof($matches[1]); $i++) {
      // $matches[1][$i] already ends with ":", so the original URL follows it directly
      echo "http://" . $matches[3][$i] . "/images?q=tbn:" . $matches[1][$i] . $matches[2][$i] . "<br />";
    }
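To make the shape of $matches a bit more concrete, here is a small sketch with the arrays filled in by hand instead of by preg_match_all. The values are copied from two entries in the output earlier in the thread; the $thumbs name is made up for the example, and the URL layout follows the /images?q=tbn: form seen in the first 12 results. By default preg_match_all groups results per capture group, so index $i in each sub-array belongs to the same image.

```php
<?php
// Hand-written stand-in for the default (PREG_PATTERN_ORDER) result:
// one array per capture group, aligned by index.
$matches = [
    1 => ['7kAFSPZVmo2G0M:', 'XAgMrMNCf8DtJM:'], // tbn values, trailing colon included
    2 => [
        'http://www.agorafinancial.com/afrude/wp-content/whale.gif',
        'http://www.sad61.k12.me.us/~kknox/Reading%2520Round%2520Up_files/two-humpback-whales-breaching.jfif',
    ],                                           // original image URLs
    3 => ['t1.gstatic.com', 't0.gstatic.com'],   // gstatic server for each image
];

$thumbs = [];
foreach ($matches[1] as $i => $tbn) {
    // server + fixed path + tbn value (with its colon) + original URL
    $thumbs[] = 'http://' . $matches[3][$i] . '/images?q=tbn:' . $tbn . $matches[2][$i];
}

echo implode("<br />\n", $thumbs), "\n";
```

Collecting the URLs in an array first, rather than echoing inside the loop, just makes them easier to reuse later; whether each assembled URL still resolves is up to Google.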
     
    Qc4, May 29, 2010
  8. edpatton
    #8
    Yes, I could not have explained it better myself.
     
    edpatton, May 29, 2010
  9. jebego
    #9
    Ah, I see.

    Thanks for the code! It seems to have a regex syntax issue or something, though:
    Warning: Unexpected character in input: '\' (ASCII=92) state=1
    Parse error: syntax error, unexpected '?'

    But that should be pretty easy to fix.

    Again, thanks a lot!
     
    jebego, May 29, 2010
  10. Qc4
    #10
    Oops, that should've been single quotes rather than double quotes:

    preg_match_all('/\["\/imgres.+?"","([a-zA-Z0-9-_]+\:)","(.+?)",".+?(t[0-9]\.gstatic\.com).+?""\]/',$str,$matches);
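The quoting matters because the pattern itself is full of double quotes: inside a "..." PHP string the first one ends the string early, which is where the parse error came from. In a single-quoted string they are ordinary characters. Below is a rough end-to-end sketch, assuming a shortened copy of one entry from the output earlier in the thread. It uses the same pattern except that the hyphen is moved to the end of the character class and the colon is left unescaped, and it asks for PREG_SET_ORDER so each match arrives as one row; both are just choices for this example.

```php
<?php
// Shortened copy of one entry from the JavaScript data quoted earlier.
// Nowdoc syntax keeps the double quotes and backslashes untouched.
$row = <<<'JS'
["/imgres?imgurl\x3dhttp://www.agorafinancial.com/afrude/wp-content/whale.gif\x26h\x3d354\x26w\x3d450","","7kAFSPZVmo2G0M:","http://www.agorafinancial.com/afrude/wp-content/whale.gif","127","100","","","","450 x 354 - 72k","gif","rudeawakening.agorafinancial.com","","","http://t1.gstatic.com/images","1",[],"",1,"",[],"","",""]
JS;

// Single quotes: the " characters inside the pattern need no escaping.
$pattern = '/\["\/imgres.+?"","([A-Za-z0-9_-]+:)","(.+?)",".+?(t[0-9]\.gstatic\.com).+?""\]/';

$urls = [];
if (preg_match_all($pattern, $row, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[1] = tbn value (with colon), $m[2] = original URL, $m[3] = gstatic host
        $urls[] = 'http://' . $m[3] . '/images?q=tbn:' . $m[1] . $m[2];
    }
}

print_r($urls);
```

Because "." does not match a newline by default, this relies on each entry sitting on a single line, as it does in the output quoted above.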
     
    Qc4, May 29, 2010
  11. jebego
    #11
    Works perfectly! Can't express how grateful I am :)

    Thanks a ton!
     
    jebego, May 29, 2010