curl website html grabber

baddot Active Member

Messages:: 309

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 58

#1

Hi, does anyone know how I can build an HTML grabber for a third-party website?

I know I will need some of these:
curl
fopen
php

etc.

My script would then read a given URL as its source, for example Yahoo, and what it would look for is something like:

<input type="text" name="test">

so that my script can then do some additional processing on it. I need help, please guide me.

baddot, May 22, 2007 IP

Free Directory Peon

Messages:: 89

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 0

#2

With curl you will get the entire page. Then you need a regexp to extract the content you want from it.
Check this link for a page grabber class: http://www.php.net/manual/en/ref.curl.php#65700
Another good reference for you: http://blog.lejer.ro/tag/curl/, which has a short article on curl books and basic, everyday curl usage.
It will surely help.
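
As a rough illustration of the "fetch, then regexp" idea: once the page source is in a string (from curl, fopen or anything else), preg_match_all() can pull out the tags you care about. This is only a sketch. The function name find_text_inputs and the sample HTML are made up for the example, and the pattern assumes double-quoted attributes with type written before name.

```php
<?php
// Hypothetical helper: return the name attribute of every
// <input type="text" name="..."> tag found in $html.
// Assumes double-quoted attributes with type before name.
function find_text_inputs($html)
{
    preg_match_all('~<input\s+type="text"\s+name="([^"]*)"[^>]*>~i', $html, $matches);
    return $matches[1]; // capture group 1 holds the names
}

// Sample markup standing in for a downloaded page.
$html = '<form><input type="text" name="test"><input type="submit" value="Go"></form>';
print_r(find_text_inputs($html));
```

Real pages are rarely this tidy, so expect to loosen the pattern for the site you are actually reading.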

Free Directory, May 22, 2007 IP

baddot Active Member

Messages:: 309

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 58

#3

Erm... I don't get you.

baddot, May 22, 2007 IP

Free Directory Peon

Messages:: 89

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 0

#4

I only gave you some advice on which articles to read to find example files; I'm not providing a finished script.
You want a website scraper, that's all. And I don't provide such things.

Free Directory, May 22, 2007 IP

baddot Active Member

Messages:: 309

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 58

#5

I mean, I tried some code, but it said it needs some sort of curl scripting to view a third party's source. Is that true?

baddot, May 25, 2007 IP

projectshifter Peon

Messages:: 394

Likes Received:: 7

Best Answers:: 0

Trophy Points:: 0

#6

I don't use curl very often. Generally fopen() is acceptable: fopen($url, 'r') will set it up, and you run a loop to grab the data. If you're on Linux, it's also possible to exec() wget and have the page stored as a file (that is also a way around it if your host blocks fopen() on URLs).
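
A minimal sketch of that fopen() loop, for illustration only. The helper name read_with_fopen is made up, and remote URLs will only open if allow_url_fopen is enabled in php.ini.

```php
<?php
// Hypothetical helper: read everything from $url (or a local path) with
// fopen() and a read loop. Returns false if the stream cannot be opened.
// Remote URLs only work when allow_url_fopen is enabled in php.ini.
function read_with_fopen($url)
{
    $handle = @fopen($url, 'r');
    if ($handle === false) {
        return false;
    }

    $contents = '';
    while (!feof($handle)) {
        $chunk = fread($handle, 4096);
        if ($chunk === false) {
            break; // read error: stop instead of looping forever
        }
        $contents .= $chunk;
    }
    fclose($handle);

    return $contents;
}
```

Calling read_with_fopen('http://www.example.com/') (a placeholder address) should give you the page source as one string, or false if the host blocks it.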

projectshifter, May 25, 2007 IP

baddot Active Member

Messages:: 309

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 58

#7

projectshifter said: ↑

I don't use curl very often. Generally fopen() is acceptable: fopen($url, 'r') will set it up, and you run a loop to grab the data. If you're on Linux, it's also possible to exec() wget and have the page stored as a file (that is also a way around it if your host blocks fopen() on URLs).

Erm, with fopen I managed to write my own code, but it can't detect the pictures. Can anyone help out?
  $filename = "http://www.baddot.com";
  $dataFile = fopen( $filename, "r" );

  if ( $dataFile )
  {
    while ( !feof( $dataFile ) )
    {
      $buffer = fgets( $dataFile, 4096 );
      //$myfile = html_entity_decode($buffer); full link without picture
      $myfile = htmlentities( $buffer );
      if ( $myfile == "%.jpg" ) {
        echo $myfile . "<br>";
        echo "I got a picture named";
      }
    }
    fclose( $dataFile );
  }
  else
  {
    die( "fopen failed for $filename" );
  }
PHP:
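
One thing worth noting about the check above: == compares the whole string, and % is not a wildcard in PHP (that is SQL LIKE syntax), so a line of HTML will never equal "%.jpg". A possible alternative, sketched here with a made-up helper name find_jpg_names, is to search each line with preg_match_all():

```php
<?php
// Hypothetical helper: return every .jpg file reference found in one line of HTML.
// The character class stops at quotes, whitespace and angle brackets,
// so a match cannot run past the end of the attribute value.
function find_jpg_names($line)
{
    preg_match_all('~[^"\'\s<>]+\.jpg~i', $line, $matches);
    return $matches[0];
}

// Sample line standing in for one fgets() result.
$buffer = '<img src="images/logo.jpg" alt="logo"> <a href="about.html">About</a>';
foreach (find_jpg_names($buffer) as $name) {
    echo "I got a picture named " . htmlentities($name) . "<br>\n";
}
```

The search runs on the raw line rather than on the htmlentities() output, because escaping turns the quotes into &quot; and the pattern would no longer see them.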

baddot, May 26, 2007 IP

krakjoe Well-Known Member

Messages:: 1,795

Likes Received:: 141

Best Answers:: 0

Trophy Points:: 135

#8


<?php
// Fetch $url and collect the tags of interest, grouped by type.
// Returns false if the page could not be fetched.
function html_to_array( $url, $element = null )
{
	if( !( $data = file_get_contents( $url ) ) )
		return false;

	$page = array();
	preg_match_all( '~<img.*?>(</img>)?~si', $data, $page['img'] );
	preg_match_all( '~<div.*?>.*?[^<]</div>~', $data, $page['div'] );
	preg_match_all( '~<style.*?>.*?[^<]</style>~', $data, $page['Inline_Css'] );
	preg_match_all( '~<link.*?>~', $data, $page['Linked_Css'] );
	preg_match_all( '~<meta.*?[^>]>~', $data, $page['Meta'] );
	preg_match_all( '~<a.*?[^>].*[^<]</a>~', $data, $page['Link'] );
	return !is_null( $element ) ? $page[ $element ] : $page ;
}
// Print each link on its own line, HTML-escaped unless $htmlentities is false.
function display_links( $links, $htmlentities = true )
{
	foreach( $links as $number => $link )
	{
		printf("Link number %d : [ %s ]<br />\n", $number + 1, $htmlentities ? htmlentities( $link ) : $link );
	}
}
// Count how many tags of each type were found.
foreach( html_to_array( 'http://forums.digitalpoint.com' ) as $element => $html )
{
	printf( "I see %d %s tags<br />\n",
			count( $html[0] ),
			str_replace('_', ' ', $element )
	);
}
// List the links. display_links() prints its own output, so it is called after the summary line.
foreach( html_to_array( 'http://forums.digitalpoint.com', 'Link' ) as $links )
{
	printf("I found %d links, here they are :<br />\n", count( $links ) );
	display_links( $links );
}
?>

PHP:

Something like that. I wouldn't use curl if you don't have to; it's marginally quicker than file_get_contents but totally uncalled for in most cases.

krakjoe, May 26, 2007 IP

baddot Active Member

Messages:: 309

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 58

#9

Erm, but the script you posted doesn't work. Why?

baddot, May 27, 2007 IP

baddot Active Member

Messages:: 309

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 58

#10

Hmm, or is there something wrong with the permissions?

baddot, May 27, 2007 IP

coderbari Well-Known Member

Messages:: 3,168

Likes Received:: 193

Best Answers:: 0

Trophy Points:: 135

#11

that's why curl is needed

coderbari, May 27, 2007 IP

baddot Active Member

Messages:: 309

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 58

#12

coderbari said: ↑

that's why curl is needed

Erm, then can you show me where to start? Or is there an article about it?

baddot, May 27, 2007 IP

krakjoe Well-Known Member

Messages:: 1,795

Likes Received:: 141

Best Answers:: 0

Trophy Points:: 135

#13

Using curl is in most cases NO different from using fopen, file_get_contents or file. curl is only useful if you need some special control over the HTTP request you are making, like setting a user agent or referer string. It's also helpful if you have large files to download, as you can use callbacks to write data as it becomes available instead of waiting until the server has the whole file in its temporary filesystem.
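
For illustration, a minimal sketch of those curl-only features: a custom user agent, a referer, and a write callback that receives data as it arrives. The helper name fetch_with_curl and its parameters are made up for the example, and it needs the curl extension to be installed.

```php
<?php
// Hypothetical helper: fetch $url with curl, sending a custom user agent and
// referer, and hand each chunk to $onChunk as soon as it arrives.
// Returns true on success, false on failure. Requires the curl extension.
function fetch_with_curl($url, $userAgent, $referer, $onChunk)
{
    $ch = curl_init($url);
    if ($ch === false) {
        return false;
    }

    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // The write callback must return the number of bytes it handled,
    // otherwise curl aborts the transfer.
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use ($onChunk) {
        $onChunk($chunk);
        return strlen($chunk);
    });

    return curl_exec($ch) !== false;
}
```

To save a large download, the callback passed as $onChunk could simply fwrite() each chunk to an open file handle, so the whole file never has to sit in memory.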

That code DOES work; I tested it before I posted it. If you can tell me exactly what doesn't work for you, and post the exact code that doesn't work for you, I'm sure someone can get it to work.

krakjoe, May 27, 2007 IP

baddot Active Member

Messages:: 309

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 58

#14

Hi guys, I've already written code that can display the GIFs and JPGs, but how do I get the value out of each array entry? For instance:

Array
(
    [0] => Array
        (
            [0] => src="http://images.friendster.com/200703E/js/headernav.js"></script><script type="text/javascript" src="http://images.friendster.com/200703E/js/friendster_v1.js"></script><script type="text/javascript" src="http://images.friendster.com/200703E/js/home.js"></script><script type="text/javascript" src="http://images.friendster.com/200703E/js/modules_friendster.js"></script><style type="text/css">body {background-color:#000000; background-image:url([url]http://i26.photobucket.com/albums/c106/drmzer/DRM2ER/th80.gif[/url]
            [1] => src="http://images.friendster.com/images/global/friendster_nav_logo.gif" border="0" class="logo" width="130" height="18"></a><script type="text/javascript">if(typeof correctPNGImage == 'function') {correctPNGImage(document.getElementById('f_logo'), 130, 18, 'http://images.friendster.com/images/friendster_nav_logo.png
            [2] => src="http://images.friendster.com/images/global/search_go_on.png" alt="Search" border="0" class="globnav_inputbtn fakeLink" width="19" height="19"></a><script type="text/javascript">if(typeof correctPNGImage == 'function') {correctPNGImage(document.getElementById('globnav_search_img'), 19, 19, 'http://images.friendster.com/images/search_go_on.png
            [3] => src="http://images.friendster.com/images/spacer.gif
            [4] => src="http://images.friendster.com/images/spacer.gif
            [5] => src="http://images.friendster.com/images/spacer.gif
            [6] => src="http://photos.friendster.com/photos/02/73/20373720/738411726m.jpg
            [7] => src="http://photos.friendster.com/photos/02/73/20373720/654616460m.jpg
            [8] => src="http://photos.friendster.com/photos/02/73/20373720/280126803m.jpg
            [9] => src="http://photos.friendster.com/photos/02/73/20373720/841893688m.jpg

HTML:

How do I make the script detect only photos.friendster.com/photos?

$data = file_get_contents("http://www.friendster.com/baddot");
$pattern = "/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i";
//$pattern="photos.friendster.com/photos";
preg_match_all($pattern, $data, $images);
print_r($images);

PHP:

baddot, May 28, 2007 IP

krakjoe Well-Known Member

Messages:: 1,795

Likes Received:: 141

Best Answers:: 0

Trophy Points:: 135

#15


<?php
// Return the URLs of all .jpg photos found on one Friendster profile page.
function grab_friendster_photos( $name )
{
	// Dots are escaped so they match a literal "." rather than any character.
	preg_match_all( '~src="(http://photos\.friendster\.com/photos/(.*?)\.jpg)"~si', file_get_contents( sprintf('http://www.friendster.com/%s', $name ) ), $img );
	return $img[1];
}
// Do the same for several profiles. Keys are the names if $assoc is true,
// otherwise the position of each name in $names.
function grab_several_friendster_photos( $names, $assoc = false )
{
	$returns = array();

	foreach( $names as $index => $name )
	{
		$returns[ $assoc ? $name : $index ] = array_unique( grab_friendster_photos( $name ) );
	}

	return $returns ;
}


/**
 print_r( grab_friendster_photos( 'baddot' ) );
**/
/**
foreach( grab_friendster_photos( 'baddot' ) as $image )
{
	printf("<img src='%s' />\n", $image );
}
**/
/**
// I don't have more than one name to work with, but this will work when you do, returns associative array of return[name] => array( photos )
print_r( grab_several_friendster_photos( array(
	'baddot',
	'baddot',
	'baddot'
), true ) );
**/
/**
// returns array return[int] => array( photos )
print_r( grab_several_friendster_photos( array(
	'baddot',
	'baddot',
	'baddot'
), false ) );
**/

PHP:

krakjoe, May 29, 2007 IP

coderlinks Peon

Messages:: 282

Likes Received:: 19

Best Answers:: 0

Trophy Points:: 0

#16

To do this scraping business you need to learn to use regular expressions. You can find some tutorials at:

http://www.regular-expressions.info
http://weblogtoolscollection.com/regex/regex.php

Regular expressions are standard and are supported in many programming languages. They are valuable knowledge.
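
For a taste of why this matters when scraping, here is a small sketch of greedy versus lazy matching, which is a common reason a pattern swallows far more of the page than intended. The sample string is made up.

```php
<?php
// Made-up sample: two image tags on one line.
$html = '<img src="a.jpg"><img src="b.jpg">';

// Greedy: .* grabs as much as possible, so the capture runs from the first
// src=" all the way to the last closing quote on the line.
preg_match('~src="(.*)"~', $html, $greedy);

// Lazy: .*? grabs as little as possible, so the capture stops at the
// first closing quote.
preg_match('~src="(.*?)"~', $html, $lazy);

echo $greedy[1], "\n";
echo $lazy[1], "\n";
```

The lazy version captures just the first file name, while the greedy one drags the markup between the two tags along with it.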

~
Thomas

coderlinks, May 29, 2007 IP

baddot Active Member

Messages:: 309

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 58

#17

Erm, what if I only need the entries from photos.friendster.com? How do I do that?

$pattern = "/([^\"']?.*(photos.friendster.com))[\"']?/i";

Is this correct?

baddot, May 30, 2007 IP

baddot Active Member

Messages:: 309

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 58

#18

Can anyone tell me how rockyou.com built its own photo importer for friendster.com?

baddot, May 30, 2007 IP

baddot Active Member

Messages:: 309

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 58

#19

For example, I managed to get the source from www.friendster.com/photos/memberid:



$data = file_get_contents("http://www.friendster.com/photos/20373720");
// "~" is used as the delimiter because the pattern itself contains "/"
$pattern = "~class=[\"']?http://([^\"']?.*(png|jpg|gif|\"))[\"']?~i";
//$pattern="photos.friendster.com/photos";
preg_match_all($pattern, $data, $images);
//print_r($images);
foreach ($images[0] as $key => $value)
{
    // stripos() replaces eregi(), which was removed in PHP 7
    if (stripos($value, "photos.friendster.com") !== false)
    {
        echo "<img $value\"><BR>\n";
    }
}

PHP:

But how do I strip out the JavaScript code so that it only picks up the http://filename.jpg URLs?
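
One way to approach this, sketched under the assumption that the photo URLs sit inside quoted attributes: match with a negated character class instead of .* so a match cannot run past the closing quote into the script code, then keep only the URLs whose host is photos.friendster.com. The helper name extract_image_urls and the sample markup are made up for the example.

```php
<?php
// Hypothetical helper: return the unique image URLs in $html that are
// served from $host. [^"'\s>]+ stops at the first quote, space or >,
// so surrounding script or markup never ends up in the match.
function extract_image_urls($html, $host)
{
    preg_match_all('~https?://[^"\'\s>]+\.(?:png|jpe?g|gif)~i', $html, $matches);

    $urls = array();
    foreach (array_unique($matches[0]) as $url) {
        if (strcasecmp((string) parse_url($url, PHP_URL_HOST), $host) === 0) {
            $urls[] = $url;
        }
    }
    return $urls;
}

// Made-up sample standing in for the downloaded page.
$html = '<script src="http://images.friendster.com/js/home.js"></script>'
      . '<img src="http://images.friendster.com/images/spacer.gif">'
      . '<img src="http://photos.friendster.com/photos/02/73/20373720/738411726m.jpg">';

foreach (extract_image_urls($html, 'photos.friendster.com') as $url) {
    printf("<img src=\"%s\"><br>\n", htmlentities($url));
}
```

Filtering on the host with parse_url() keeps the regexp simple and avoids having to spell the domain out inside the pattern.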

baddot, May 30, 2007 IP
