Here's the scenario: We have a site focused on a niche US market. After registering with only an email address and password, a user can submit information to the site that essentially becomes a sales lead, which then requires significant in-person follow-up. The overwhelming majority of submissions from outside the US are not, for one reason or another, viable leads. To reduce the manual overhead of processing these unusable leads, we would like to block access for users who are not located within the US. We plan to do this based on the geographic location of the originating IP address, but we want to make sure we do it in a way that will not affect the search spiders or our current excellent search rankings.

I've been doing some research, and am looking at the following combination of solutions:

- Block as few non-US IP addresses as possible. If a large share of unusable submissions is coming from one or two geographic areas, block only those at first.
- Do not block the entire site; block only the areas requiring login, effectively preventing non-US users from registering in the first place and blocking existing profiles from logging in again. Any user, anywhere, could browse the public areas of the site, but only US users could access the login-only sections. Since the robots.txt file specifically blocks spiders from the logged-in areas, and since those areas are only reached via POSTed form submissions, I do not believe they are being crawled anyway.
- Provide blocked users with a fully functional, optimized, and spiderable page. It's user-friendly, of course, and if a spider does find the page for some reason (coming through a DNS server in a blocked IP range, for instance), it will not be stopped dead but can continue the crawl.

It seems to me that these steps would leave as much of the site as possible open to crawling while properly preventing non-US users from registering or logging in. Thoughts/opinions? Has anyone done something like this? What am I missing?
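To make the last point concrete, here is a minimal sketch of the kind of gate I have in mind, included at the top of the login and registration scripts only (the public site is untouched). It assumes the PECL geoip extension is installed; the blocked_region.php page name is just a placeholder for the friendly, spiderable page described above.

PHP:
<?php
// geo_gate.php -- minimal sketch of the gate described above.
// Assumes the PECL geoip extension; blocked_region.php is a placeholder.

$visitor_ip = $_SERVER['REMOTE_ADDR'];

// Look up the two-letter country code for the visitor's IP address.
// geoip_country_code_by_name() returns false if the address cannot be resolved.
$country = geoip_country_code_by_name($visitor_ip);

if ($country !== false && $country !== 'US') {
    // Serve the fully functional, spiderable "not available in your region" page
    // with a normal 200 response, so any spider that lands here keeps crawling.
    include 'blocked_region.php';   // hypothetical path
    exit;
}
// US visitors (and addresses that cannot be resolved) fall through to the
// normal login/registration flow -- a deliberate fail-open choice here.
?>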
If you are interested in specific spiders, just check for them explicitly by name. All respectable bots use a standard user agent string.

PHP:
if ($_SERVER['HTTP_USER_AGENT'] == 'msnbot/1.0 (+http://search.msn.com/msnbot.htm)') {
    $user_agent = 'MSNbot 1.0';
} elseif ($_SERVER['HTTP_USER_AGENT'] == 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)') {
    $user_agent = 'Googlebot';
} elseif ($_SERVER['HTTP_USER_AGENT'] == 'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)') {
    $user_agent = 'Yahoo Slurp';
}

After you are sure the visitor is not a bot you want to keep, filter the visitor by IP address using $_SERVER['REMOTE_ADDR']. This could all be done with an include that you just drop on the relevant pages. Since your market is US-only, it will probably be faster (processor-wise) to allow only US IP addresses rather than deny specific countries.
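Pulling the two checks together, a drop-in include along those lines might look like the sketch below. It assumes the same PECL geoip extension as the earlier example, the user agent strings are the ones quoted above, and blocked_region.php is a placeholder for whatever friendly page you serve. Keep in mind the user agent header is trivial to spoof, so the whitelist only keeps you from accidentally geo-blocking a legitimate spider; it is not a security control.

PHP:
<?php
// us_only_gate.php -- sketch of the drop-in include suggested above: let known
// spiders through by user agent, then allow only US IP addresses for everyone else.

$known_bots = array(
    'msnbot/1.0 (+http://search.msn.com/msnbot.htm)' => 'MSNbot 1.0',
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' => 'Googlebot',
    'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)' => 'Yahoo Slurp',
);

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (!isset($known_bots[$ua])) {
    // Not a whitelisted spider: allow only US IP addresses.
    $country = geoip_country_code_by_name($_SERVER['REMOTE_ADDR']);
    if ($country !== 'US') {
        include 'blocked_region.php';   // hypothetical friendly, spiderable page
        exit;
    }
}
// Whitelisted bots and US visitors continue on to the protected page.
?>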