Hi there, I am working on a price comparison engine for a niche, let's say iPods. The product feeds (CSV) provided by some of the affiliate programs are either outdated or don't contain all the product data that I need. So I decided to use Ruby's Hpricot library to scrape the product pages of the affiliate programs that provide unsuitable feeds. This way, I can gather all the data I require. I export the scraped data as CSV, for instance iPod Video 20GB, $150 I have one scraper script for each merchant. Now, the problem is, the product title on the merchants' sites don't always match. Instead of 'iPod' (merchant 1), they might write 'IPod' (merchant 2), 'i Pod' (merchant 3), 'ipod' (merchant 4)... Anyway, let's say this are two sample CSV files: merchant1.csv iPod Video 20GB, $150 iPod Video 80GB, $400 iPod nano 8GB, $100 iPod touch 60 GB, $300 merchant2.csv I pod Video 20GB, $140 I pod Video 80GB, $390 I pod nano 8GB, $80 I pod touch 60 GB, $275 What I want to do is: create a price comparison for each model, e.g. one for Video 20GB, one for Video 80 GB, one for nano 4GB, one for nano 8GB and so on... As I said, this is just an example, and in reality, there are a whole lot more different combinations of product names and more merchants. My questions: 1. Affiliate Links I will probably save all affiliate links and an identifier in a MySQL table, then use a PHP jump script. This will be some manual work I suppose, generating all the aff links, writing them to a database, adding the IDs... But I only have to do it once, and it's very flexible if aff links change at a later date. Example: www.example.com/links/jump.php?id=amazon-ipodvideo20 MySQL database: (ID:amazon-ipodvideo20 | LINK:www.here-goes-the-aff-link.com) Is that a good idea? 2. CSV to DATABASE, Combining data, unique IDs Now this is where my head aches start. I was thinking about using one database table for each product. For instance, one table for the Video 20GB, one for the Video 60GB, one for the nano 4GB, one for the nano 8GB and so on. I though about using an identifier like 'ipodvideo20' as a table name, then adding an extra field to all the CSV rows with 'iPod Video 20GB', 'IPOD video 20 GB' and so on for all the different merchants... This way, at least I could write a script to add the correct product to the correct table. What I haven't figured out is how to add "id"s: I thought about putting them to lowercase and deleting whitespaces, which in this example might actually work (e.g. 'iPod Video 20GB' -> ipodvideo20gb, 'IPOD video 20 GB' -> ipodvideo20gb), but not if the product name is 'iPod Video blue 20 Gigs'. Any ideas or suggestions how I can combine scraped data from different sources into the correct database tables? Cheers, Chris
as far shortening Secondly, I wrote a price comparison engine for foodquote. They had 100s of vendors, 1000s of products and brands, and even vendors could create their tyres in which they could provide discount. And finally the end-user could create a full report based on categories and products offered by different vendors, in the form of pdf, csv, xls, and word document. The system admin could import plain CSV files to populate database. All structure has already been created alongwith an installer. You can also limit time for generating reports. And alsmost any kind of statistics. You might not be in need of such huge system. But I can provide you a demo where you can test as admin, vendor, and end user. Review my profile to contact me. regards