Combining scraped data from different sources

Discussion in 'Programming' started by Lupi, Jul 8, 2008.

  1. #1
    Hi there,

    I am working on a price comparison engine for a niche, let's say iPods.

    The product feeds (CSV) provided by some of the affiliate programs are either outdated or don't contain all the product data that I need.

    So I decided to use Ruby's Hpricot library to scrape the product pages of the affiliate programs that provide unsuitable feeds. This way, I can gather all the data I require.

    I export the scraped data as CSV, for instance

    iPod Video 20GB, $150

    I have one scraper script for each merchant. Now, the problem is, the product title on the merchants' sites don't always match. Instead of 'iPod' (merchant 1), they might write 'IPod' (merchant 2), 'i Pod' (merchant 3), 'ipod' (merchant 4)...

    Anyway, let's say this are two sample CSV files:

    merchant1.csv
    iPod Video 20GB, $150
    iPod Video 80GB, $400
    iPod nano 8GB, $100
    iPod touch 60 GB, $300


    merchant2.csv
    I pod Video 20GB, $140
    I pod Video 80GB, $390
    I pod nano 8GB, $80
    I pod touch 60 GB, $275


    What I want to do is: create a price comparison for each model, e.g. one for Video 20GB, one for Video 80 GB, one for nano 4GB, one for nano 8GB and so on... As I said, this is just an example, and in reality, there are a whole lot more different combinations of product names and more merchants.

    My questions:

    1. Affiliate Links
    I will probably save all affiliate links and an identifier in a MySQL table, then use a PHP jump script. This will be some manual work I suppose, generating all the aff links, writing them to a database, adding the IDs... But I only have to do it once, and it's very flexible if aff links change at a later date.
    Example: www.example.com/links/jump.php?id=amazon-ipodvideo20
    MySQL database: (ID:amazon-ipodvideo20 | LINK:www.here-goes-the-aff-link.com)

    Is that a good idea?

    2. CSV to DATABASE, Combining data, unique IDs
    Now this is where my head aches start. I was thinking about using one database table for each product. For instance, one table for the Video 20GB, one for the Video 60GB, one for the nano 4GB, one for the nano 8GB and so on.

    I though about using an identifier like 'ipodvideo20' as a table name, then adding an extra field to all the CSV rows with 'iPod Video 20GB', 'IPOD video 20 GB' and so on for all the different merchants...

    This way, at least I could write a script to add the correct product to the correct table.

    What I haven't figured out is how to add "id"s: I thought about putting them to lowercase and deleting whitespaces, which in this example might actually work (e.g. 'iPod Video 20GB' -> ipodvideo20gb, 'IPOD video 20 GB' -> ipodvideo20gb), but not if the product name is 'iPod Video blue 20 Gigs'.

    Any ideas or suggestions how I can combine scraped data from different sources into the correct database tables?

    Cheers, Chris
     
    Lupi, Jul 8, 2008 IP
  2. Vooler

    Vooler Well-Known Member

    Messages:
    1,146
    Likes Received:
    64
    Best Answers:
    4
    Trophy Points:
    150
    #2
    as far shortening

    Secondly, I wrote a price comparison engine for foodquote. They had 100s of vendors, 1000s of products and brands, and even vendors could create their tyres in which they could provide discount. And finally the end-user could create a full report based on categories and products offered by different vendors, in the form of pdf, csv, xls, and word document. The system admin could import plain CSV files to populate database. All structure has already been created alongwith an installer. You can also limit time for generating reports. And alsmost any kind of statistics.

    You might not be in need of such huge system. But I can provide you a demo where you can test as admin, vendor, and end user. Review my profile to contact me.

    regards
     
    Vooler, Jul 8, 2008 IP