How can I get the HTML source code of any webpage a browser can see, with PHP? This includes dynamic pages! If I use fopen, scripts can somehow be blocked from seeing the code... try getting the source code of mogaard.ath.cx for example (with a script).
If you want that, you need to ask someone to make a script for you that does the same thing. Clone it, as they say.
You can... just try this code:

    <?php
    $host = "www.site.com";
    $path = "/insidefolder/page.html";

    if ($fp = @fsockopen($host, 80, $errno, $errstr, 5)) {
        fputs($fp, "GET $path HTTP/1.1\r\n");
        fputs($fp, "Host: $host\r\n");
        fputs($fp, "User-Agent: {$_SERVER['HTTP_USER_AGENT']}\r\n");
        fputs($fp, "Connection: close\r\n"); // without this, HTTP/1.1 keep-alive makes feof() hang until timeout
        fputs($fp, "\r\n");
        $content = '';
        while (!feof($fp)) {
            $content .= fgets($fp, 1024);
        }
        fclose($fp);
        // grab every tag pair from the response
        preg_match_all('|<[^>]+>(.*)</[^>]+>|U', $content, $output);
        for ($i = 0; $i < count($output[0]); $i++) {
            echo $output[0][$i];
            echo "<br>";
        }
    } else {
        print "Unable to connect: $errno :: $errstr";
    }
    ?>

Just try it... change

    $host = "www.site.com";
    $path = "/insidefolder/page.html";

to match your link, i.e. if your link is www.site.com/insidefolder/page.html. PHP files work too, but they will appear as HTML output, not PHP source.

To control how the code is written to your page, change this piece of the script:

    for ($i = 0; $i < count($output[0]); $i++) {
        echo $output[0][$i];
        echo "<br>";
    }

Regards, Almrshal
You can get the HTML code with file_get_contents() (assuming you have PHP 5; otherwise you will have to use fopen()).
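For example, a minimal sketch (the URL is a placeholder, and allow_url_fopen must be enabled in php.ini):

    <?php
    // Fetch the rendered HTML of a remote page.
    $html = file_get_contents('http://www.example.com/page.html');

    if ($html === false) {
        // file_get_contents() returns FALSE on failure
        echo 'Could not fetch the page.';
    } else {
        // show the source instead of rendering it
        echo '<pre>' . htmlspecialchars($html) . '</pre>';
    }
    ?>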
He wants to view the PHP code. That's not possible; whenever you request a PHP file from a webserver, it'll be executed and only the output will be sent to the browser.
Yes, I agree. All server-side script is executed on the server before you see anything. That's basic security, as frankcow said.
NO... I want the HTML code, but I want it done by a script! Using fopen, the script doesn't work all the time. Someone I was talking to said something like the admin of the server could block scripts from getting the HTML code. So how do I get the HTML code of a page 100% of the time? I have a script and it could get the HTML of W3Schools and my own site, but not my friend's site (http://mogaard.ath.cx).
Another site can block your IP address from accessing it if they want to. Otherwise you can just use fopen or file_get_contents. If you were screen-scraping my site, I would block your IP.
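Worth knowing, too: some servers block the default PHP user agent string rather than your IP. A sketch of sending a browser-like User-Agent with a stream context (the UA string here is made up, and passing a context to file_get_contents() needs PHP 5):

    <?php
    // Some servers reject requests whose User-Agent looks like a script.
    $context = stream_context_create(array(
        'http' => array(
            'header' => "User-Agent: Mozilla/5.0 (compatible; MyCrawler/0.1)\r\n"
        )
    ));

    $html = file_get_contents('http://www.example.com/', false, $context);
    echo ($html === false) ? 'Blocked or unreachable.' : htmlspecialchars($html);
    ?>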
I think he is talking about the admin blocking the use of fopen() to open external URLs by disabling allow_url_fopen: http://in2.php.net/manual/en/ref.filesystem.php#ini.allow-url-fopen

You can try using cURL instead: http://www.php.net/curl

Here is an example of how to do it:

    <?php
    $ch = curl_init(); // create new cURL handle

    // set the URL to fetch
    curl_setopt($ch, CURLOPT_URL, "http://www.site.com/blah.php");
    // do not output the HTTP reply header
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // return output in a variable instead of sending it directly to the browser
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

    $output = curl_exec($ch); // send the request and get the output

    if ($output === false) {
        // read the error message before closing the handle;
        // curl_error() is useless on a closed handle
        echo curl_error($ch);
    }

    curl_close($ch); // close the cURL handle
    ?>

After that, $output contains the HTML code if the data transfer worked without errors, and FALSE if there was any error. I hope that was what you wanted.

Thomas
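If you go the cURL route, a couple of extra options are often useful too. A hedged sketch with the same placeholder URL (the user agent string is arbitrary):

    <?php
    $ch = curl_init('http://www.site.com/blah.php');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return output instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  // follow HTTP redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MyCrawler/0.1)');
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);        // give up after 10 seconds

    $output = curl_exec($ch);
    if ($output === false) {
        echo 'cURL error: ' . curl_error($ch);
    }
    curl_close($ch);
    ?>

CURLOPT_FOLLOWLOCATION matters for a crawler because many dynamic pages answer with a redirect rather than content.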
Indeed, cURL or simply file_get_contents() are easiest. It works in PHP 4 as well, btw: http://uk2.php.net/file_get_contents
Correction: if you want that, you need to pay someone to make a script for you that does the same thing.
Well, I'm developing a small PHP search engine... I was just running the ranking part of it and it had some problems getting the source code of a friend's site, so I knew he didn't block me (we even ran the script on his server). I'm the admin of my server... it's in my house! So I'm not blocking anything from myself.
If you ran it on the same server you wanted to crawl, then it's likely the DNS issue I had on my server when I tried to do the exact same thing. I forget the details, but it has to do with the firewall routing only external traffic to port 80; "internal" requests (you crawling your own site) can get blocked that way. Not really blocked, there's just no route to the content. It's beyond my knowledge of, and interest in, DNS stuff, but that might well be it. If so, you'll find that if you run the same script from a different server, it indexes that site just fine.
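If that's what's happening, one workaround is to skip the router entirely: connect to 127.0.0.1 and send the public host name in the Host header so the right virtual host still answers. A sketch, assuming the site is served from the same box on port 80 (using the thread's example host):

    <?php
    // Talk to the local web server directly, but keep the public
    // Host header so virtual hosting resolves to the right site.
    $fp = fsockopen('127.0.0.1', 80, $errno, $errstr, 5);
    if ($fp) {
        fputs($fp, "GET / HTTP/1.1\r\n");
        fputs($fp, "Host: mogaard.ath.cx\r\n");
        fputs($fp, "Connection: close\r\n\r\n");
        $content = '';
        while (!feof($fp)) {
            $content .= fgets($fp, 1024);
        }
        fclose($fp);
        echo htmlspecialchars($content);
    } else {
        echo "Unable to connect: $errno :: $errstr";
    }
    ?>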
Cool, that would most likely be the problem then! Can fopen() or file_get_contents() get dynamic pages, like http://example.com?page=013012 ?

-edit- I just tested it and it worked, so that's a yes... why do some people say search engines can't get dynamic pages (the stuff after the ?)?
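For what it's worth, the query string is just part of the URL as far as these functions are concerned. A small sketch using the URL from the post above (the parameter is made up, and http_build_query() needs PHP 5):

    <?php
    // Build the query string safely instead of concatenating by hand.
    $params = http_build_query(array('page' => '013012'));
    $html = file_get_contents('http://example.com/?' . $params);
    echo ($html === false) ? 'Fetch failed.' : htmlspecialchars($html);
    ?>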
It's not that they can't, it's that they won't. For various reasons, but one big one is that you can generate links and pages dynamically without limit, making your site look bigger even though it's all the same basic stuff. Somewhere on the Google site (don't make me look it up), they say they won't spider any page that has an 'id' parameter in it, for example.
I don't think so... check this out http://www.google.com/search?source...B2GGGL_en__177&q=site:forums.digitalpoint.com
I won't agree with you, Twist. My site uses id as a parameter, and so far my pages have been indexed in all search engines, including Google.
If you want to do something in PHP, it's probably been done already. I found this web spider written in PHP:

http://www.phpdig.net/
http://www.phpdig.net/navigation.php?action=download
http://sourceforge.net/projects/phpdig

All the above links go to the same thing.

Thomas
@PinoyIto and born2win: you might be right, but I'm just saying what Google says. See the last point under 'Technical guidelines' at http://www.google.com/support/webmasters/bin/answer.py?answer=35769