I don't know if this is possible or what I need to look into, and was hoping one of you could help me out. Basically I have a URL, for instance http://www.mysite.com/yeah.php, and on this page is a list of URLs and page text. I need to set something up that goes through the source of the page, grabs all the URLs containing a term such as 'pageno=', and spits them out in a list. What would I need to look at on php.net to accomplish this? Cheers in advance.
File handling functions or cURL. This usually does the job:
    $html = file_get_contents('http://site.com/page');
As for getting the URLs from the HTML that the above function returns, look up regex pattern matching. Something like:
    // dots escaped so they only match a literal "."
    preg_match_all('~http://site\.com/\?pageno=\d+~', $html, $m);
If I do that and then "echo $m;", the page just says "Array". Shouldn't it display every result it finds?
OK, I worked that part out: print_r($m); I'll see how I go from here. Thanks a tonne, mate! I will post when I have my project completed.
Gah, how do I get rid of all the Array ( [0] => Array ( [0] => stuff and just echo each result on its own with a <br /> after it?
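One way to do that with the preg_match_all() result from earlier: the matches you want are in $m[0], so loop over that instead of printing the whole array. This is only a rough sketch; the $html string below is made-up sample markup standing in for the fetched page, and the pattern assumes the links look like http://site.com/?pageno=N.

```php
<?php
// Made-up sample HTML standing in for the page source (assumption for illustration).
$html = '<a href="http://site.com/?pageno=1">1</a> <a href="http://site.com/?pageno=2">2</a>';

preg_match_all('~http://site\.com/\?pageno=\d+~', $html, $m);

// $m[0] holds every full match; with no capture groups it is the only entry in $m.
$lines = array();
foreach ($m[0] as $url) {
    // Escape the URL before printing it into an HTML page.
    $lines[] = htmlspecialchars($url);
}

// One URL per line, separated by <br />.
$output = implode("<br />\n", $lines);
echo $output;
```

If the pattern had a capture group in parentheses, the captured parts would be in $m[1] rather than $m[0].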
How about this one:
    $html = file_get_contents('http://site.com/page');
    $lines = explode("\n", $html);
    foreach ($lines as $line) {
        // print every source line that mentions pageno=
        if (strpos($line, 'pageno=') !== FALSE) {
            echo $line.'<br/>';
        }
    }
Thanks! That's almost perfect, but I see it echoes all the other HTML on the same line as the URL. Is there any way to pull out only the URLs from $lines and display those?
    <?php
    function extract_urls_from_html( $location, $inurl )
    {
        $returns = array( );
        if( ( $handle = fopen( $location, 'r' ) ) )
        {
            // read the page line by line until fgets() runs out of data
            while( ( $line = fgets( $handle ) ) !== false )
            {
                // preg_match_all so several links on one line are all caught
                if( preg_match_all( $inurl, $line, $matches ) )
                {
                    $returns = array_merge( $returns, $matches[1] );
                }
            }
            fclose( $handle );
        }
        // false when nothing matched, otherwise the list of captured URLs
        return count( $returns ) ? $returns : false ;
    }
    // [^"]* keeps each match inside a single href="..." attribute
    $urls = extract_urls_from_html( "http://yoursite.com/yeah.php", '~href="([^"]*pageno=[^"]*)"~' );
    if( is_array( $urls ) )
    {
        print( implode("<br />\n", $urls ) );
    }
    ?>
Something like that... Give me a link to the page if the regex doesn't work. Regex is a pretty specific tool: it's normally not enough to say the link contains "pageno=". You would have to give the exact structure of the link if you don't want the function to return anything else at all.
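If the regex turns out to be too fragile for the real page, a different approach (not the one posted above) is to let PHP's DOM extension parse the HTML and then filter the href attributes. A minimal sketch, assuming the DOM extension is enabled; the function name extract_hrefs_containing() and the sample markup are made up for illustration:

```php
<?php
// Hypothetical helper: return the href of every <a> tag whose URL contains $needle.
function extract_hrefs_containing($html, $needle)
{
    // Collect parser warnings internally; real-world HTML is rarely valid.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $urls = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, $needle) !== false) {
            $urls[] = $href;
        }
    }
    return $urls;
}

// Made-up sample markup; in practice $sample would come from file_get_contents().
$sample = '<p><a href="/list.php?pageno=2">next</a> <a href="/about.php">about</a></p>';
$urls = extract_hrefs_containing($sample, 'pageno=');
echo implode("<br />\n", $urls);
```

This only looks at href attributes of <a> tags, so text elsewhere on the page that happens to contain "pageno=" is ignored.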