Possible? - Extract all books from an Amazon category?

Discussion in 'Databases' started by sjohal2006, Feb 7, 2009.

  1. #1
    Hi there,

    I have a new website which I'm going to start on. Just wondered, though, to make the job much easier: is it possible to take the Amazon book directory, and in particular just a single category, so I don't have to reconstruct the entire list?

    Thanks, please get back to me.
     
    sjohal2006, Feb 7, 2009 IP
  2. crivion

    crivion Notable Member

    #2
    If you know your way around data scraping with PHP regular expressions, then yes, it's possible.
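    For illustration, here is a minimal sketch of that approach: fetch a category page and pull titles out with preg_match_all(). The URL and the markup pattern are assumptions for the example; Amazon's real HTML is different and changes often, so the regex would have to be written against whatever the live pages actually serve.

    <?php
    // Minimal PHP scraping sketch. The URL and the <span class="title">
    // pattern are hypothetical; Amazon's actual markup differs and changes.
    $url  = 'http://www.amazon.com/b?node=283155'; // example Books node ID
    $html = file_get_contents($url);               // needs allow_url_fopen
    if ($html === false) {
        die("Failed to fetch page\n");
    }

    // Assume each book title sits inside <span class="title">...</span>.
    preg_match_all('#<span class="title">(.*?)</span>#s', $html, $matches);

    foreach ($matches[1] as $title) {
        echo trim(strip_tags($title)), "\n";
    }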
     
    crivion, Feb 7, 2009 IP
  3. sjohal2006

    sjohal2006 Active Member

    #3
    Do you? ;) - Interested in programming a design?
     
    sjohal2006, Feb 7, 2009 IP
  4. mmerlinn

    mmerlinn Prominent Member

    #4
    Yes, it can be done with many different programming languages. In fact, any language that will run on your computer and that can access low-level files can be used to scrape pages from the net.

    However, there may be limits imposed on the number of downloads from their site per hour to keep you from eating up their bandwidth. There also might be something in their terms of service that makes it possible for them to come after you legally for data theft.
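    To make that concrete in PHP (the language mentioned above), a polite fetcher would space its requests out. This is only a sketch; the five-second delay and the URL list are assumptions, not any published Amazon limit.

    <?php
    // Polite fetcher sketch: cURL with a fixed pause between requests.
    // The 5-second delay is an arbitrary, conservative assumption.
    function fetch_page($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_USERAGENT, 'MyBookSite/1.0');
        $html = curl_exec($ch);
        curl_close($ch);
        return $html;
    }

    $urls = array('http://www.amazon.com/...', 'http://www.amazon.com/...'); // placeholders
    foreach ($urls as $url) {
        $html = fetch_page($url);
        // ... parse and store $html here ...
        sleep(5); // pause so we don't hammer their servers
    }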
     
    mmerlinn, Feb 7, 2009 IP
  5. sjohal2006

    sjohal2006 Active Member

    #5
    Hmmm damn, data theft? Seriously...?
     
    sjohal2006, Feb 8, 2009 IP
  6. w0tan

    w0tan Peon

    #6
    You could also use their API. Just limit how many items you download with it and you'll be fine. :)
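    For reference, this is roughly what an API call looked like at the time, using the REST interface of Amazon's E-Commerce Service (the service behind what later became the Product Advertising API). The access key and browse node are placeholders, and Amazon began requiring signed requests in August 2009, so a real call needs a signature on top of this sketch.

    <?php
    // Sketch of an ItemSearch request against the ECS REST endpoint, circa 2009.
    // AWSAccessKeyId and BrowseNode are placeholders; unsigned requests like
    // this stopped working once request signing became mandatory.
    $params = array(
        'Service'        => 'AWSECommerceService',
        'Operation'      => 'ItemSearch',
        'SearchIndex'    => 'Books',
        'BrowseNode'     => '283155',            // example Books browse node
        'ResponseGroup'  => 'Small',
        'AWSAccessKeyId' => 'YOUR-ACCESS-KEY',
    );
    $url = 'http://ecs.amazonaws.com/onca/xml?' . http_build_query($params);

    $xml = simplexml_load_file($url);
    if ($xml === false) {
        die("Request failed\n");
    }
    foreach ($xml->Items->Item as $item) {
        echo $item->ItemAttributes->Title, "\n";
    }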
     
    w0tan, Feb 8, 2009 IP
  7. sjohal2006

    sjohal2006 Active Member

    #7
    Supposedly it's roughly 12,000. Is that far too many?
     
    sjohal2006, Feb 8, 2009 IP
  8. mmerlinn

    mmerlinn Prominent Member

    #8
    Any time you scrape someone else's pages you must consider "data theft." And there is no way to know whether that is a problem unless you read their terms of service. Generally, if there is a copyright notice on the page you want to scrape, you run the risk of legal action. The risk is very low unless you are overloading their servers with requests. Scraping for commercial purposes is a lot riskier than scraping for personal purposes because of the "fair use" rule in copyright law.

    Downloading 12,000 pages is WAY too many. The Googlebot scraper typically limits itself to a maximum of 417 page requests per DAY per SITE (at least it does on my site of over 4,000 pages). Typically Googlebot only makes 70 page requests per day on my site. I suggest you hold yourself to a similar budget when scraping pages.
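    As a minimal sketch of that kind of self-imposed budget in PHP: pick a daily cap and spread the requests evenly across the day. The 400-per-day figure just echoes the Googlebot numbers above, and $page_urls stands in for whatever URL list you are working through.

    <?php
    // Throttle sketch: cap requests at a Googlebot-like daily budget.
    $daily_cap  = 400;                            // assumed cap, not a rule
    $delay_secs = (int) ceil(86400 / $daily_cap); // ~216 seconds between requests

    foreach ($page_urls as $url) {                // $page_urls: your URL list
        $html = file_get_contents($url);
        // ... parse and store $html here ...
        sleep($delay_secs);
    }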
     
    mmerlinn, Feb 8, 2009 IP
  9. zealus

    zealus Active Member

    #9
    Funny that you want to steal data you can get for free through the Amazon Web Services API. I am building a web site right now that extracts Amazon data based on certain parameters, and there is no need to scrape pages (especially since they change often) when you can get better results through AWS.

    But good luck anyway ;)
     
    zealus, Feb 11, 2009 IP
  10. zealus

    zealus Active Member

    #10
    Not really. If you're loading an inventory for the first time it might be OK, as long as your server can handle the load of that much content being added in the first place, and then all the bots from all the search engines coming in to index it. Google would be the least of my worries here :)
     
    zealus, Feb 11, 2009 IP
  11. mmerlinn

    mmerlinn Prominent Member

    #11
    It's not a question of whether your server can handle the page load. It is a question of whether the bandwidth load on AMAZON'S servers would be a problem.
     
    mmerlinn, Feb 11, 2009 IP
  12. zealus

    zealus Active Member

    #12
    Are you trying to say Amazon's database servers won't handle yet another request for 12,000 records? Or am I laughing too early and misunderstanding what you're asking?
     
    zealus, Feb 17, 2009 IP
  13. mmerlinn

    mmerlinn Prominent Member

    #13
    You misunderstood or did not think it through.

    Yes, I am sure that Amazon COULD handle 12,000 record requests coming from ONE person at a time. But what if 10,000 people requested 12,000 records at the same time? Could it happen? Yes. Could they then handle it? I don't know. So, it still is a bandwidth issue for Amazon.

    If I were Amazon, I would be pissed if someone was making 12,000 requests. And I would put a reasonable limit of some kind on how many requests per hour anyone could make. WHY would I be pissed? Because the requester would be spending MY MONEY without my permission to enhance HIS LIFE. In other words, STEALING from ME for HIS benefit.
     
    mmerlinn, Jun 12, 2010 IP