PHP Web Crawler problems

bonecone Peon

#1

I'm trying to crawl this site: http://www.stjohns.ca/business/businessdirectory/index.jsp. From there I click the "List All" button (the URL for the resulting page is too long to post here).

I'm using PHP's file_get_contents($url). However, it only retrieves the source for the side menus, header, and footer, and skips over the business information.

To make sure this wasn't just a problem with PHP, I wrote an equivalent script in Ruby and got the same results. Does anyone know what might be going on here and how I can get around it?

bonecone, Nov 14, 2009 IP

bonecone Peon

#2

Okay I found what was causing the problem, but I don't know what to do about it.

When you go to http://www.stjohns.ca/business/businessdirectory/index.jsp, you are given a randomly generated session variable. Then when you click on the "List All" button it checks to see if this variable has been set. If not, then you are redirected back to the index.jsp page. This prevents you from linking directly to search results.

So, is there any way of getting around this one?

bonecone, Nov 14, 2009 IP

AsHinE Well-Known Member

#3

Emulate a browser as closely as possible.
Fetch the index page first and keep the cookie it sets. Then request the results page, sending that cookie along with a Referer header.
Use cURL or Snoopy; both support setting the referer and handling cookies.
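
The steps above could be sketched with PHP's cURL extension roughly as follows. This is a minimal illustration, not a tested crawler for that site: the helper names `session_curl_options` and `fetch_with_session` and the user-agent string are made up for the example, and it assumes the "List All" page can be fetched with a plain GET once the session cookie is present.

```php
<?php
// Sketch only: helper names and the user-agent value are assumptions for illustration.

// Hypothetical helper: cURL options that make the handle behave a bit more like a browser.
function session_curl_options(): array
{
    return [
        CURLOPT_RETURNTRANSFER => true, // return the response body as a string
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_COOKIEFILE     => '',   // empty string: keep cookies in memory on this handle
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleCrawler/1.0)',
    ];
}

// Hypothetical helper: visit the index page to pick up the session cookie,
// then request the results page on the same handle with a Referer header.
function fetch_with_session(string $indexUrl, string $resultsUrl): string
{
    $ch = curl_init();
    curl_setopt_array($ch, session_curl_options());

    // Step 1: fetch the index page; any cookie it sets is stored on the handle.
    curl_setopt($ch, CURLOPT_URL, $indexUrl);
    if (curl_exec($ch) === false) {
        throw new RuntimeException('Index request failed: ' . curl_error($ch));
    }

    // Step 2: request the results page with the stored cookie and a referer.
    curl_setopt($ch, CURLOPT_URL, $resultsUrl);
    curl_setopt($ch, CURLOPT_REFERER, $indexUrl);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('Results request failed: ' . curl_error($ch));
    }

    return $html;
}
```

Calling `fetch_with_session($indexUrl, $resultsUrl)` with the index.jsp address and the full "List All" URL should return the results HTML if the cookie and referer are all the site checks. Reusing one handle keeps the cookie between the two requests without writing a cookie file to disk.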

AsHinE, Nov 15, 2009 IP

bonecone Peon

#4

Thanks, that's what I did afterwards. I used PHPCrawl.

bonecone, Nov 15, 2009 IP
