HOW TO: Watch Google, MSN, and Yahoo crawl your site!! (Perl and SSI required.)

Discussion in 'Site & Server Administration' started by Nintendo, Apr 29, 2005.

  1. #1
    If you've ever wanted to know when search engines like Google crawl your site without getting e-mail bombed or running grep on your logs, try this Perl add-on. It works if you use SSI on your sites.

    Add this code to your Perl script.

    
    $database = "/complete_path/site90/html/logs/logs.txt";
    $domain = "http://www.domain.com";
    
    $shortdate = `date +"%D %T %Z"`;
    chomp ($shortdate);
    
    if ($ENV{'HTTP_USER_AGENT'} =~ /google|msn|yahoo/i) {
        open (DATABASE,">>$database");
        print DATABASE "$ENV{'REMOTE_ADDR'} - $ENV{'HTTP_USER_AGENT'} - $domain$ENV{'REQUEST_URI'} - $shortdate\n";
        close(DATABASE);
    }
    
    Then create logs/logs.txt at the path that $database points to. This code will log Google, MSN, and Yahoo visits for any domain on the same server. Change http://www.domain.com to the domain the script runs on, or set $domain to an empty string if you only log one domain and don't want the http://www.domain.com prefix to show up in the log.
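
    For anyone who wants a slightly more defensive version, here is a rough sketch of the same idea wrapped in a function. The name log_crawler and its return values are my own invention for illustration, not part of the script above. It assumes Perl 5.10 or newer, uses strftime instead of shelling out to date, and takes a lock so two requests don't write over each other.

    ```perl
    use strict;
    use warnings;
    use Fcntl qw(:flock);
    use POSIX qw(strftime);

    # Hypothetical helper (not in the original script): append one line to
    # $logfile when the user agent looks like Google, MSN, or Yahoo.
    # Returns 1 if a line was written, 0 otherwise.
    sub log_crawler {
        my ($env, $logfile, $domain) = @_;
        my $agent = $env->{HTTP_USER_AGENT} // '';
        return 0 unless $agent =~ /google|msn|yahoo/i;

        # Same layout as `date +"%D %T %Z"`, without starting a shell.
        my $stamp = strftime('%m/%d/%y %H:%M:%S %Z', localtime);

        # Give up quietly if the log cannot be opened (missing file, bad permissions).
        open(my $fh, '>>', $logfile) or return 0;
        flock($fh, LOCK_EX);    # one writer at a time
        print {$fh} join(' - ',
            $env->{REMOTE_ADDR} // '-',
            $agent,
            $domain . ($env->{REQUEST_URI} // ''),
            $stamp), "\n";
        close($fh);
        return 1;
    }

    # In the SSI/CGI script itself, with the same placeholder path and domain as above:
    log_crawler(\%ENV, "/complete_path/site90/html/logs/logs.txt", "http://www.domain.com");
    ```

    Returning 0 instead of dying keeps a broken log path from breaking the page that includes the script.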


    To log only a certain search engine, use one of these lines.

    if ($ENV{'HTTP_USER_AGENT'} =~ /google/i) {
    if ($ENV{'HTTP_USER_AGENT'} =~ /msn/i) {
    if ($ENV{'HTTP_USER_AGENT'} =~ /yahoo/i) {

    Or for two search engines....

    if ($ENV{'HTTP_USER_AGENT'} =~ /name|name/i) {
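
    If you'd rather keep the engine names in a list than edit the regex by hand, something like this should work. It's only a sketch: crawler_pattern is a made-up helper name, and the user-agent strings are sample values for illustration, not lines from my log.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: build one case-insensitive pattern from a list of names.
    sub crawler_pattern {
        my @names = @_;
        my $alternation = join '|', map { quotemeta } @names;
        return qr/$alternation/i;
    }

    # Two search engines, same effect as /google|yahoo/i.
    my $two_engines = crawler_pattern('google', 'yahoo');

    # Sample user-agent strings, for illustration only.
    my @agents = (
        'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
        'msnbot/1.0 (+http://search.msn.com/msnbot.htm)',
        'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)',
    );

    # Show which agents the pattern would log and which it would skip.
    for my $agent (@agents) {
        printf "%-5s %s\n", ($agent =~ $two_engines ? 'LOG' : 'skip'), $agent;
    }
    ```

    In the logging script you would then test $ENV{'HTTP_USER_AGENT'} against the pattern in place of the hard-coded regex.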

    Example of log.

    If you make a new site, submit it to Yahoo and MSN. I submitted five new sites to them just two days ago, and as the log shows, Yahoo's already doing some nice crawling on one of them, while the other two bots looked and then left!

    If you don't get anything and want to make sure it's working,

    replace

    
    if ($ENV{'HTTP_USER_AGENT'} =~ /google|msn|yahoo/i) {
        open (DATABASE,">>$database");
        print DATABASE "$ENV{'REMOTE_ADDR'} - $ENV{'HTTP_USER_AGENT'} - $domain$ENV{'REQUEST_URI'} - $shortdate\n";
        close(DATABASE);
    }
    
    with
    
    open (DATABASE,">>$database");
    print DATABASE "$ENV{'REMOTE_ADDR'} - $ENV{'HTTP_USER_AGENT'} - $domain$ENV{'REQUEST_URI'} - $shortdate\n";
    close(DATABASE);
    
    That will log everything. Then change it back after you see logs show up!
     
    Nintendo, Apr 29, 2005
  2. Stin

    Stin Guest

    #2
    Interesting post. I feel like there must be an easier way to do this, though. Maybe an entry in the httpd.conf file or something.
     
    Stin, Apr 29, 2005
  3. honey

    honey Prominent Member

    #3
    Nice one, Nintendo. I tested it and it works perfectly. Thanks.
     
    honey, Apr 29, 2005
  4. Nintendo

    Nintendo ♬ King of da Wackos ♬

    #4
    If placing some code in a CGI file, getting the two lines right, and making the text file is not easy, then nothing is!!!!


    To empty the logs.txt file, create a log.php file in the same directory as logs.txt with
    
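    <?php
    // Opening in "w" mode truncates logs.txt to zero length.
    $file = fopen("logs.txt","w");
    if ($file === false) {
        die("Could not open logs.txt");
    }
    fclose($file);

    echo "File emptied";
    ?>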
    
    Go to that file and the log file will be emptied.
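
    If you don't have PHP on the server, the same thing can be done from Perl. This is only a sketch, and empty_log is a name I made up for it: opening a file for writing with '>' truncates it, so opening and closing is enough.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: truncate the log to zero bytes.
    # Returns 1 on success, 0 if the file could not be opened.
    sub empty_log {
        my ($logfile) = @_;
        open(my $fh, '>', $logfile) or return 0;    # '>' truncates an existing file
        close($fh);
        return 1;
    }

    # Example call, using the same placeholder path as the logging script:
    # empty_log("/complete_path/site90/html/logs/logs.txt");
    ```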
     
    Nintendo, Apr 29, 2005