Hi there, I have a new website I'm about to start on. Just wondered, to make the job much easier, is it possible to take the Amazon book directory, and in particular just a single category, so I don't have to reconstruct the entire list? Thanks, please get back to me.
Yes, it can be done with many different programming languages. In fact, any language that runs on your computer and can make network requests can be used to scrape pages from the web. However, there may be limits on the number of downloads from their site per hour, imposed to keep bandwidth use in check. There also might be something in their terms of service that would let them come after you legally for data theft.
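For what it's worth, "scraping a page" boils down to downloading it and keeping the raw HTML. A minimal Python sketch, where the URL is just a placeholder and not a real Amazon category page:

```python
# Minimal page-fetch sketch: download one page and save the raw HTML.
# The URL is a placeholder, not a real Amazon category page.
import urllib.request

url = "https://www.example.com/books/some-category"  # placeholder URL
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8", errors="replace")

with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)
```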
Any time you scrape someone else's pages you must consider "data theft." And there is no way to know whether that is a problem unless you read their terms of service. Generally, if there is a copyright notice on the page you want to scrape, you run the risk of legal action. The risk is very low unless you are overloading their servers with requests. Scraping for commercial purposes is a lot riskier than for personal purposes because of the "fair use" rule in copyright law. Requesting 12,000 pages is WAY too many. The Googlebot crawler typically limits itself to a maximum of 417 page requests per DAY per SITE (at least it does on my site of over 4,000 pages), and on most days it only makes about 70 page requests there. I suggest you keep to a similar pace when scraping pages.
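If you do go the scraping route, something like the sketch below keeps you under a Googlebot-style daily budget. The URL list, the 70-request cap, and the fixed delay are illustrative values only, not anything Amazon publishes:

```python
# Politeness sketch: cap the number of page requests per day and pause between them.
# The URL list, daily cap, and delay are illustrative values only.
import time
import urllib.request

urls = [
    "https://www.example.com/category/page1",  # placeholder URLs
    "https://www.example.com/category/page2",
]
MAX_REQUESTS_PER_DAY = 70                         # roughly Googlebot's pace on a small site
DELAY_SECONDS = 24 * 3600 / MAX_REQUESTS_PER_DAY  # spreads requests over the whole day

for i, url in enumerate(urls[:MAX_REQUESTS_PER_DAY]):
    with urllib.request.urlopen(url) as response:
        html = response.read()
    with open(f"page_{i}.html", "wb") as f:
        f.write(html)
    time.sleep(DELAY_SECONDS)                     # ~20 minutes between requests at 70/day
```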
Funny you want to steal data that you can get for free through the Amazon Web Services API. I am building a web site right now that extracts Amazon data based on certain parameters, and there is no need to scrape pages (given that they change often) when you can get better results through AWS. But good luck anyway.
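For comparison, here is a rough sketch of what an API request looks like instead of scraping rendered pages. The endpoint and parameter names below only mimic the old REST-style product interface and should be treated as placeholders; the live service requires a registered access key and signed requests, which this sketch deliberately leaves out:

```python
# Illustrative sketch of querying a product API instead of scraping HTML pages.
# The endpoint and parameter names mimic the old REST-style Amazon interface but
# are placeholders here; the live service needs a registered access key and
# signed requests, which this sketch does not implement.
import urllib.parse

params = {
    "Service": "AWSECommerceService",   # illustrative parameter names
    "Operation": "ItemSearch",
    "SearchIndex": "Books",
    "BrowseNode": "1000",               # hypothetical category (browse node) id
    "AWSAccessKeyId": "YOUR-KEY-HERE",  # placeholder credential
}
url = "https://webservices.amazon.com/onca/xml?" + urllib.parse.urlencode(params)

# Only prints the request URL to show its shape; it is not a working, signed call.
print(url)
```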
Not really. If you're loading an inventory for the first time it might be OK, as long as your server can handle having so much content loaded in the first place, and then all the search engine bots from all the search engines coming in to index it. Google would be the least of my worries here.
It's not a question of whether your server can handle the page load. It's a question of whether the bandwidth load on AMAZON'S servers would be a problem.
Are you trying to say Amazon's database servers won't handle yet another request for 12,000 records? Or am I laughing too early and have I misunderstood what you're asking?
You misunderstood, or did not think it through. Yes, I am sure that Amazon COULD handle a 12,000-record request coming from ONE person at a time. But what if 10,000 people requested 12,000 records at the same time? Could it happen? Yes. Could they then handle it? I don't know. So it is still a bandwidth issue for Amazon. If I were Amazon, I would be pissed if someone were making 12,000 requests, and I would put a reasonable limit of some kind on how many requests per hour anyone could make. WHY would I be pissed? Because the requester would be spending MY MONEY without my permission to enhance HIS LIFE. In other words, STEALING from ME for HIS benefit.
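For what it's worth, a per-hour limit like that is simple to sketch. The 100-requests-per-hour threshold below is an arbitrary illustration, not anything Amazon documents:

```python
# Minimal per-client hourly rate-limit sketch (in-memory, single process).
# The 100-requests-per-hour threshold is an arbitrary illustration.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600
MAX_REQUESTS_PER_HOUR = 100

_recent = defaultdict(deque)  # client id -> timestamps of recent requests

def allow_request(client_id: str) -> bool:
    """Return True if this client is still under its hourly budget."""
    now = time.time()
    timestamps = _recent[client_id]
    # Drop timestamps that have aged out of the one-hour window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) >= MAX_REQUESTS_PER_HOUR:
        return False
    timestamps.append(now)
    return True
```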