I'm trying to crawl this site: http://www.stjohns.ca/business/businessdirectory/index.jsp (I then click the "List All" button; the URL for that is too long to post here). I'm using PHP's file_get_contents($url). However, it only retrieves the source for the side menus, header, and footer, and skips over the business information. To make sure this wasn't just a problem with PHP, I created an equivalent script in Ruby and got the same results. Does anyone know what might be going on here and how I can get around it?
Okay, I found what was causing the problem, but I don't know what to do about it. When you go to http://www.stjohns.ca/business/businessdirectory/index.jsp, you are given a randomly generated session variable. Then, when you click the "List All" button, the site checks whether this variable has been set. If not, you are redirected back to the index.jsp page. This prevents you from linking directly to search results. So, is there any way of getting around this?
Emulate a browser as closely as possible. First fetch the index page and keep the cookie it sets. Then request the results page, sending that cookie along with a Referer header. Use cURL or Snoopy; both let you set the referer and cookies.
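Since you mentioned you also have a Ruby script, here is a rough sketch of those two steps using Net::HTTP from the standard library. It assumes the session is tracked through a cookie set on the index page (if the site carries the session ID in the URL instead, this won't be enough). The helper names `cookie_header` and `fetch_with_session` and the User-Agent string are made up for illustration, and the "List All" URL is left as a placeholder because it wasn't posted.

```ruby
require 'net/http'
require 'uri'

# Turn the Set-Cookie header values from a response into a single
# Cookie request header, keeping only the name=value part of each.
def cookie_header(set_cookie_values)
  set_cookie_values.map { |c| c.split(';', 2).first.strip }.join('; ')
end

# Fetch the index page to obtain the session cookie, then request the
# results page with that cookie and a Referer header.
# Assumption: the server identifies the session by cookie.
def fetch_with_session(index_url, results_url)
  index_res = Net::HTTP.get_response(URI(index_url))
  cookies = cookie_header(index_res.get_fields('set-cookie') || [])

  results_uri = URI(results_url)
  request = Net::HTTP::Get.new(results_uri)
  request['Cookie'] = cookies unless cookies.empty?
  request['Referer'] = index_url
  # Hypothetical User-Agent; some sites reject requests without one.
  request['User-Agent'] = 'Mozilla/5.0 (compatible; example-crawler)'

  Net::HTTP.start(results_uri.host, results_uri.port,
                  use_ssl: results_uri.scheme == 'https') do |http|
    http.request(request).body
  end
end

# Example call (not run here; substitute the real "List All" URL):
# html = fetch_with_session(
#   'http://www.stjohns.ca/business/businessdirectory/index.jsp',
#   'LIST_ALL_URL_GOES_HERE'
# )
```

The same idea in PHP would be cURL with a cookie jar and a referer option rather than file_get_contents, which by default sends neither.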