Planning on starting a product-based search engine that will search multiple websites (it's a specific niche of products) for the best price. I'm planning on writing a custom cURL script for each site to display the latest prices and deals, since they're all written very differently from each other, but there are only 10 or 15 of them. The problem is that I don't want to become an annoyance to the merchant sites by using up all their bandwidth. If I understand correctly, cURL downloads the entire web page to my host in order to search in it. There are roughly 100 products I would be listing from each site, so if it's fetching 100 pages of 100 KB each, that's 10 MB of bandwidth per crawl (per site). There are a lot of short-term deals, stock changes on a daily basis, etc., and I want the sites to be indexed at least once every day to keep up. What's the best approach for this? Is there a way to download a portion of a web page, or does it have to be the whole freakin' thing? EDIT: A lot of the sites, though the merchants handle their business well, look like they were written by a 12-year-old. Things like RSS and Atom feeds are for the most part absent.
Rather than searching on a per-product basis, I would recommend starting with a list of products and their most likely location, along with a "last search date" for each. Then start at the top and search for product 001; when you find it, see if any other products from your list are on the same page, and if so, update them and their last search dates as well. Then move on to the next item on the list whose last search date is earlier than today. If you are only fetching a limited number of pages and saving the info (or caching the page), you shouldn't create too much of a load.
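The scheduling idea above can be sketched like this (a minimal sketch in Python rather than a PHP cURL script; the product list, URLs, and helper names are all hypothetical, and in practice the list and dates would live in a database). The point is that each page is fetched at most once per run, no matter how many products share it:

```python
from datetime import date

# Hypothetical product list; each entry holds the product id, the page
# it is most likely found on, and the date it was last searched.
products = [
    {"id": "001", "url": "http://example.com/widgets", "last_searched": None},
    {"id": "002", "url": "http://example.com/widgets", "last_searched": None},
    {"id": "003", "url": "http://example.com/gadgets", "last_searched": None},
]

def crawl_due_products(fetch, extract_prices, today=None):
    """Fetch each page at most once per day, updating every product found on it.

    `fetch(url)` returns the page HTML; `extract_prices(html)` is the
    site-specific scraping logic, returning {product_id: price}.
    """
    today = today or date.today()
    pages = {}  # per-run page cache: url -> html
    for product in products:
        if product["last_searched"] == today:
            continue  # already refreshed today, possibly via a shared page
        if product["url"] not in pages:
            pages[product["url"]] = fetch(product["url"])
        prices = extract_prices(pages[product["url"]])
        # Update every not-yet-searched product that shares this page.
        for other in products:
            if other["url"] == product["url"] and other["last_searched"] != today:
                other["price"] = prices.get(other["id"])
                other["last_searched"] = today
```

With the three example products above, one run performs only two fetches (one per unique URL) while updating all three products.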
There is something to remember when calculating the bandwidth used by cURL: yes, you do download the web page to your server, but this does NOT include any images, JavaScript, CSS, or other attached files. It will only be the HTML code of the requested page. I would suggest the following: there are plenty of public proxy servers on the net. Create a database of them and use cURL to fetch the pages you need from the websites. If you have a database of 200 proxies (easily possible, as there are thousands around), you can simply change the proxy every time you request a page. I have done this and it works a treat, even for crawling Yellow Pages-style sites.
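The proxy-rotation step might look like this (a sketch using Python's standard library; the proxy addresses are placeholders, and in a PHP cURL script the equivalent would be setting CURLOPT_PROXY before each request):

```python
import random
import urllib.request

# Placeholder proxy pool; in practice this would come from your proxy database.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:3128",
    "http://203.0.113.12:8080",
]

def opener_with_random_proxy(proxies=PROXIES):
    """Build a urllib opener that routes its requests through a randomly
    chosen proxy, so successive page fetches come from different addresses."""
    proxy = random.choice(proxies)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler), proxy

# Usage (actual network call omitted):
# opener, proxy = opener_with_random_proxy()
# html = opener.open("http://example.com/products").read()
```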
Well, like I said, one thing to consider is that it only loads the HTML code, not any other attached files, so I would expect the actual size to be more in the region of 25 KB (unless you've checked and the HTML itself really is 100 KB). There are curl_setopt options relevant here. 'CURLOPT_BUFFERSIZE' sets the read buffer size in bytes, though note this only controls how much is read per chunk, not how much is downloaded in total. 'CURLOPT_RESUME_FROM' sets the byte offset to start from; under the hood this sends an HTTP Range header, so it only works if the server supports range requests (CURLOPT_RANGE additionally lets you specify an end byte). So if there is a lot of code in the <head> section of the document, you could calculate its byte size and simply start after that.
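The same partial-download idea, shown with an explicit Range header (a Python standard-library sketch; the URL and byte offsets are made up, and this only returns a partial body if the server answers 206 Partial Content rather than a plain 200):

```python
import urllib.request

def build_range_request(url, start, end=None):
    """Build a request for bytes [start, end] of a page, which is the same
    mechanism CURLOPT_RESUME_FROM / CURLOPT_RANGE use under the hood."""
    byte_range = f"bytes={start}-" if end is None else f"bytes={start}-{end}"
    return urllib.request.Request(url, headers={"Range": byte_range})

def fetch_byte_range(url, start, end=None):
    """Fetch only the requested byte range. If the server ignores Range
    requests, the full page comes back with status 200 instead of 206."""
    req = build_range_request(url, start, end)
    with urllib.request.urlopen(req) as resp:
        return resp.status, resp.read()

# Usage (network call omitted): skip a head section known to be ~2 KB:
# status, body = fetch_byte_range("http://example.com/products.html", 2048)
```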
Thanks. Yeah, the file size was around 100 KB for the HTML source alone, not the whole page with images and whatnot. A lot of these sites still use JS for all their graphical stuff, which takes up a heck of a lot of space per file. Skipping the head should trim it down quite a bit.