curl website html grabber

Discussion in 'PHP' started by baddot, May 22, 2007.

  1. #1
    Hi, does anyone know how I can write an HTML grabber for a third-party website?

    I know I need a few of these:
    curl
    fopen
    php

    etc.

    After that, my script would read a given URL as its source, for example Yahoo, and look for something like:

    <input type="text" name="test">

    so that my script can then do some additional processing on it. I need help, please guide me.
     
    baddot, May 22, 2007 IP
  2. Free Directory

    Free Directory Peon

    #2
    Free Directory, May 22, 2007 IP
  3. baddot

    baddot Active Member

    #3
    Erm... I don't get you.
     
    baddot, May 22, 2007 IP
  4. Free Directory

    Free Directory Peon

    #4
    I only gave you some advice on which articles to read to find example files; I wasn't providing a final script :)
    You want a website scraper, that's all. And I don't provide such things :)
     
    Free Directory, May 22, 2007 IP
  5. baddot

    baddot Active Member

    #5
    I mean, I tried some code, but it said it needs some sort of curl scripting to view a third party's source. Is that true?
     
    baddot, May 25, 2007 IP
  6. projectshifter

    projectshifter Peon

    #6
    I don't use curl very often; generally fopen() is acceptable. Just fopen($url, 'r') will set it up, and you run a loop to grab the data. If you're on Linux, it's also possible to exec() wget and have the page stored as a file (which is also a way around a host that blocks fopen on URLs).
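
    For reference, a minimal sketch of the fopen()-and-loop approach described above. The function name read_with_fopen() is made up for illustration, and opening an http:// URL this way assumes allow_url_fopen is enabled on the host.

```php
<?php
// Hypothetical helper: read everything from a path or URL by looping over fgets().
// Opening http:// URLs with fopen() only works when allow_url_fopen is enabled.
function read_with_fopen(string $source): string
{
    $handle = @fopen($source, 'r');
    if ($handle === false) {
        throw new RuntimeException("fopen failed for $source");
    }

    $contents = '';
    // fgets() returns false at end of file, which ends the loop.
    while (($line = fgets($handle, 4096)) !== false) {
        $contents .= $line;
    }
    fclose($handle);

    return $contents;
}
```

    Something like read_with_fopen('http://www.example.com/') would then return the page source as one string, ready for searching.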
     
    projectshifter, May 25, 2007 IP
  7. baddot

    baddot Active Member

    #7
    Erm, for the fopen approach I managed to write my own code, but it doesn't detect what I'm looking for. Can anyone help out?

      $filename = "http://www.baddot.com" ;
      $dataFile = fopen( $filename, "r" ) ;

      if ( $dataFile )
      {
        while (!feof($dataFile))
        {
          $buffer = fgets($dataFile, 4096);
          //$myfile = html_entity_decode($buffer); full link without picture
          $myfile = htmlentities($buffer);
          if($myfile == "%.jpg"){
            echo $myfile . "<br>";
            echo "I got a picture named";
          }
        }
        fclose($dataFile);
      }
      else
      {
        die( "fopen failed for $filename" ) ;
      }
    PHP:
     
    baddot, May 26, 2007 IP
  8. krakjoe

    krakjoe Well-Known Member

    #8
    
    <?php
    // Fetch a page and collect the raw markup of a few tag types with regexes.
    // Returns false if the fetch fails, one tag group if $element is given, otherwise all groups.
    function html_to_array( $url, $element = null )
    {
    	if( !( $data = file_get_contents( $url ) ) )
    		return false;

    	// \b keeps <a from also matching <abbr or <area; the "s" flag lets . span newlines
    	preg_match_all( '~<img\b[^>]*>(</img>)?~si', $data, $page['img'] );
    	preg_match_all( '~<div\b[^>]*>.*?</div>~si', $data, $page['div'] );
    	preg_match_all( '~<style\b[^>]*>.*?</style>~si', $data, $page['Inline_Css'] );
    	preg_match_all( '~<link\b[^>]*>~si', $data, $page['Linked_Css'] );
    	preg_match_all( '~<meta\b[^>]*>~si', $data, $page['Meta'] );
    	preg_match_all( '~<a\b[^>]*>.*?</a>~si', $data, $page['Link'] );
    	return !is_null( $element ) ? $page[ $element ] : $page ;
    }
    // Build the list as a string so it can be passed to printf() below.
    function display_links( $links, $htmlentities = true )
    {
    	$out = '';
    	foreach( $links as $number => $link )
    	{
    		$out .= sprintf("Link number %d : [ %s ]<br />\n", $number + 1, $htmlentities ? htmlentities( $link ) : $link );
    	}
    	return $out;
    }
    // Count how many of each tag type were found.
    foreach( html_to_array( 'http://forums.digitalpoint.com' ) as $element => $html )
    {
    	printf( "I see %d %s tags<br />\n",
    		 	count( $html[0] ),
    			str_replace('_', ' ', $element )
    	);
    }
    // List every link on the page.
    foreach( html_to_array( 'http://forums.digitalpoint.com', 'Link' ) as $links )
    {
    	printf("I found %d links, here they are :<br />\n%s",
    		   count( $links ),
    		   display_links( $links )
    	);
    }
    ?>
    
    PHP:
    Something like that. I wouldn't use curl if you don't have to; it's marginally quicker than file_get_contents but totally uncalled for in most cases.
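
    On the `$myfile == "%.jpg"` check in post #7: `==` compares the whole line against the literal string "%.jpg", and "%" is an SQL LIKE wildcard rather than anything PHP string comparison understands, so that test never succeeds. A rough sketch of one alternative follows; find_jpg_names() and the sample line are invented for illustration.

```php
<?php
// Hypothetical helper: return every token in a line of HTML that ends in .jpg.
function find_jpg_names(string $line): array
{
    // One or more characters that are not whitespace, quotes or angle brackets,
    // followed by a literal ".jpg" (case-insensitive).
    preg_match_all('~[^\s"\'<>]+\.jpg~i', $line, $matches);
    return $matches[0];
}

// Made-up sample line standing in for one fgets() result.
$line = '<img src="http://example.com/pics/cat.jpg" alt="cat"><a href="dog.JPG">dog</a>';
foreach (find_jpg_names($line) as $name) {
    echo "I got a picture named " . htmlentities($name) . "<br>\n";
}
```

    Run the match on the raw $buffer rather than on the htmlentities() output, since encoding turns the quotes into &quot; and changes what the pattern sees.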
     
    krakjoe, May 26, 2007 IP
  9. baddot

    baddot Active Member

    #9
    Erm, but the script that you posted doesn't work. Why?
     
    baddot, May 27, 2007 IP
  10. baddot

    baddot Active Member

    #10
    Hmm, or is there something wrong with the permissions?
     
    baddot, May 27, 2007 IP
  11. coderbari

    coderbari Well-Known Member

    #11
    That's why curl is needed ;)
     
    coderbari, May 27, 2007 IP
  12. baddot

    baddot Active Member

    #12
    Erm, then can you show me where to start? Or is there an article about it?
     
    baddot, May 27, 2007 IP
  13. krakjoe

    krakjoe Well-Known Member

    #13
    Using curl is in most cases no different from using fopen, file_get_contents or file. curl is only useful if you need special control over the HTTP request you are making, like setting a user agent or referer string. It's also helpful if you have large files to download, as you can use callbacks to write data as it becomes available instead of waiting until the server has the whole file in its temporary filesystem.
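
    As an illustration of that extra control, here is a minimal sketch of a curl fetch that sets a user agent and an optional referer. It assumes the curl extension is installed; fetch_with_curl() and the default user-agent string are placeholders, not anything from this thread.

```php
<?php
// Hypothetical helper: fetch a URL with curl, setting a user agent and an optional referer.
// Requires the curl extension. Returns the body as a string, or false on failure.
function fetch_with_curl(string $url, string $userAgent = 'ExampleGrabber/0.1', ?string $referer = null)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);           // give up after 15 seconds
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    if ($referer !== null) {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }

    // The handle is released automatically when $ch goes out of scope.
    return curl_exec($ch);
}
```

    The returned string can be searched exactly like the result of file_get_contents(). For large downloads, CURLOPT_FILE can stream the body straight into an open file handle instead of holding it in memory.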

    That code DOES work; I tested it before I posted it. If you could tell me exactly what doesn't work for you, and post the exact code that fails, I'm sure someone can get it to work.
     
    krakjoe, May 27, 2007 IP
  14. baddot

    baddot Active Member

    #14
    Hi guys, I already wrote some code that can display the GIFs and JPGs, but how do I get the value out of each array? For instance:

    Array
    (
        [0] => Array
            (
                [0] => src="http://images.friendster.com/200703E/js/headernav.js"></script><script type="text/javascript" src="http://images.friendster.com/200703E/js/friendster_v1.js"></script><script type="text/javascript" src="http://images.friendster.com/200703E/js/home.js"></script><script type="text/javascript" src="http://images.friendster.com/200703E/js/modules_friendster.js"></script><style type="text/css">body {background-color:#000000; background-image:url(http://i26.photobucket.com/albums/c106/drmzer/DRM2ER/th80.gif
                [1] => src="http://images.friendster.com/images/global/friendster_nav_logo.gif" border="0" class="logo" width="130" height="18"></a><script type="text/javascript">if(typeof correctPNGImage == 'function') {correctPNGImage(document.getElementById('f_logo'), 130, 18, 'http://images.friendster.com/images/friendster_nav_logo.png
                [2] => src="http://images.friendster.com/images/global/search_go_on.png" alt="Search" border="0" class="globnav_inputbtn fakeLink" width="19" height="19"></a><script type="text/javascript">if(typeof correctPNGImage == 'function') {correctPNGImage(document.getElementById('globnav_search_img'), 19, 19, 'http://images.friendster.com/images/search_go_on.png
                [3] => src="http://images.friendster.com/images/spacer.gif
                [4] => src="http://images.friendster.com/images/spacer.gif
                [5] => src="http://images.friendster.com/images/spacer.gif
                [6] => src="http://photos.friendster.com/photos/02/73/20373720/738411726m.jpg
                [7] => src="http://photos.friendster.com/photos/02/73/20373720/654616460m.jpg
                [8] => src="http://photos.friendster.com/photos/02/73/20373720/280126803m.jpg
                [9] => src="http://photos.friendster.com/photos/02/73/20373720/841893688m.jpg
    
    HTML:
    How do I get the script to detect photos.friendster.com/photos?


    $data = file_get_contents("http://www.friendster.com/baddot");
    $pattern = "/src=[\”‘]?([^\”‘]?.*(png|jpg|gif))[\”‘]?/i";
    //$pattern="photos.friendster.com/photos";
    preg_match_all($pattern, $data, $images);
    print_r($images);
    PHP:
     
    baddot, May 28, 2007 IP
  15. krakjoe

    krakjoe Well-Known Member

    #15
    
    <?php
    // Return every photos.friendster.com .jpg URL found on a member's profile page.
    function grab_friendster_photos( $name )
    {
    	// Dots are escaped so they match a literal "." only
    	preg_match_all( '~src="(http://photos\.friendster\.com/photos/(.*?)\.jpg)"~si', file_get_contents( sprintf('http://www.friendster.com/%s', $name ) ), $img );
    	return $img[1];
    }
    // Same for several members. Keys are the names when $assoc is true, otherwise 0, 1, 2...
    function grab_several_friendster_photos( $names, $assoc = false )
    {
    	$returns = array();
    	foreach( array_values( $names ) as $index => $name )
    	{
    		$returns[ $assoc ? $name : $index ] = array_unique( grab_friendster_photos( $name ) );
    	}
    	return $returns ;
    }
    
    
    /**
     print_r( grab_friendster_photos( 'baddot' ) );
    **/
    /**
    foreach( grab_friendster_photos( 'baddot' ) as $image )
    {
    	printf("<img src='%s' />\n", $image );
    }
    **/
    /**
    // I don't have more than one name to work with, but this will work when you do, returns associative array of return[name] => array( photos )
    print_r( grab_several_friendster_photos( array(
    	'baddot',
    	'baddot',
    	'baddot'
    ), true ) );
    **/
    /**
    // returns array return[int] => array( photos )
    print_r( grab_several_friendster_photos( array(
    	'baddot',
    	'baddot',
    	'baddot'
    ), false ) );
    **/
    
    PHP:
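
    The runaway matches in the print_r dump from post #14 come from the greedy `.*`, which keeps consuming text until the last png, jpg or gif it can reach. A sketch of the same extraction run on an HTML string that has already been fetched, with a character class that cannot cross the closing quote; extract_photo_urls() is a hypothetical name and the sample markup is pieced together from URLs quoted in this thread.

```php
<?php
// Hypothetical helper: pull photos.friendster.com image URLs out of HTML that
// has already been fetched (for example with file_get_contents()).
function extract_photo_urls(string $html): array
{
    // [^"']+ cannot cross the closing quote, so each match is one attribute value.
    $pattern = '~src=["\'](http://photos\.friendster\.com/photos/[^"\']+\.(?:jpg|gif|png))["\']~i';
    preg_match_all($pattern, $html, $matches);

    // $matches[1] holds the captured URLs; drop duplicates and reindex from 0.
    return array_values(array_unique($matches[1]));
}

// Sample markup assembled for illustration only.
$html = '<script type="text/javascript" src="http://images.friendster.com/200703E/js/home.js"></script>'
      . '<img src="http://images.friendster.com/images/spacer.gif">'
      . '<img src="http://photos.friendster.com/photos/02/73/20373720/738411726m.jpg">'
      . '<img src="http://photos.friendster.com/photos/02/73/20373720/738411726m.jpg">'
      . '<img src="http://photos.friendster.com/photos/02/73/20373720/654616460m.jpg">';

print_r(extract_photo_urls($html));
```

    Keeping the fetch and the parsing in separate functions also makes the pattern easy to try against a saved copy of the page.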
     
    krakjoe, May 29, 2007 IP
  16. coderlinks

    coderlinks Peon

    #16
    coderlinks, May 29, 2007 IP
  17. baddot

    baddot Active Member

    #17
    Erm, what if I just need the value for photos.friendster.com only? How do I do that?

    $pattern = "/([^\"']?.*(photos.friendster.com))[\"']?/i";

    Correct?
     
    baddot, May 30, 2007 IP
  18. baddot

    baddot Active Member

    #18
    Can anyone tell me how rockyou.com built its own photo importer for friendster.com?
     
    baddot, May 30, 2007 IP
  19. baddot

    baddot Active Member

    #19
    For example, I managed to get the code from www.friendster.com/photos/memberid:



    
    
    $data = file_get_contents("http://www.friendster.com/photos/20373720");
    // "~" delimiters, because the pattern itself contains "/" (in "http://")
    $pattern = "~class=[\"']?http://([^\"']?.*(png|jpg|gif|\"))[\"']?~i";
    //$pattern="photos.friendster.com/photos";
    preg_match_all($pattern, $data, $images);
    //print_r($images);
    foreach ($images[0] as $key => $value)
    {
        // stripos() replaces eregi(), which was removed in PHP 7
        if (stripos($value, "photos.friendster.com") !== false)
        {
            echo "<img $value\"><BR>\n";
        }
    }
    
    
    PHP:


    But how do I remove the JavaScript code and detect just the http://filename.jpg?
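
    One possible way to drop the JavaScript before matching, sketched under assumptions: strip the <script> blocks first, then read the src of each <img> tag and keep only URLs from the wanted host. The function name image_urls_from_host() is hypothetical, and the real Friendster markup may need a different pattern.

```php
<?php
// Hypothetical helper: list image URLs from one host, ignoring anything inside
// <script> blocks.
function image_urls_from_host(string $html, string $host): array
{
    // 1. Remove <script>...</script> blocks so URLs inside JavaScript are not seen.
    $html = preg_replace('~<script\b[^>]*>.*?</script>~si', '', $html);

    // 2. Capture the quoted src value of every <img> tag.
    preg_match_all('~<img\b[^>]*?\bsrc=["\']([^"\']+)["\']~i', $html, $matches);

    // 3. Keep only URLs whose host part matches exactly (case-insensitive).
    $urls = array();
    foreach ($matches[1] as $url) {
        if (strcasecmp((string) parse_url($url, PHP_URL_HOST), $host) === 0) {
            $urls[] = $url;
        }
    }
    return $urls;
}
```

    Called as image_urls_from_host($data, 'photos.friendster.com'), it returns plain URLs that can be echoed into <img> tags one by one.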
     
    baddot, May 30, 2007 IP