I have a site with thousands of pages. I would like a script that scrapes 6 pieces of content from each product page (approx. 50k products). The product pages are all 2 folders deep, in the form: http://www.domain.com/cat/subcat/product1.aspx. All other pages should be ignored. The data I want is structured the same way on all pages. First, I want the full URL. Then I want the product name, which is between the <h1> tags. There are no other h1 tags on the page. I also want the description, which is between the paragraph tags immediately below the HTML: "<h2>Product Description</h2>". And so on for the remaining fields (I'll explain later). Then I want all of that data exported to an XML file, basically creating a new "item" for each product. Each "item" will also need a unique ID, which can simply start at 1. I'll tell you the exact format I need for the XML file, and the URL of the site with the products, later. I would like it to be easy for me to be able to manipulate the structure of the XML data in case I need to add/edit elements. PM me with the following: Price, Turnaround Time. And please take timeouts into consideration. It's important that all the data is retrieved, and I don't want to crash the server, so let me know how those 2 potential issues will be handled. Thanks, Keith. P.S. Any programming language is fine.
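For illustration, the requirements in this post could be sketched roughly as follows in Python, using only the standard library. This is a minimal sketch, not a finished scraper: the XML element names (`items`, `item`, `url`, `name`, `description`), the `id` attribute, and the retry and delay values are all assumptions, since the exact XML format and the other fields have not been posted yet. Regular expressions are used only because the pages are said to share one structure; a real HTML parser would be more robust.

```python
import html
import re
import time
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Assumption: a product page is exactly /cat/subcat/name.aspx (two folders deep).
PRODUCT_PATH = re.compile(r"^/[^/]+/[^/]+/[^/]+\.aspx$", re.IGNORECASE)
H1 = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
DESC = re.compile(
    r"<h2[^>]*>\s*Product Description\s*</h2>\s*<p[^>]*>(.*?)</p>",
    re.IGNORECASE | re.DOTALL,
)
TAGS = re.compile(r"<[^>]+>")


def is_product_url(url):
    """True only for URLs shaped like /cat/subcat/product.aspx."""
    return bool(PRODUCT_PATH.match(urlparse(url).path))


def clean(fragment):
    """Strip inline tags, decode entities, and collapse whitespace."""
    return " ".join(html.unescape(TAGS.sub("", fragment)).split())


def extract_product(url, page):
    """Pull the URL, the <h1> name, and the first paragraph after the heading."""
    name = H1.search(page)
    desc = DESC.search(page)
    return {
        "url": url,
        "name": clean(name.group(1)) if name else "",
        "description": clean(desc.group(1)) if desc else "",
    }


def fetch(url, retries=3, timeout=30, delay=1.0):
    """Download a page, retrying on timeouts and backing off between tries."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except OSError:  # covers URLError and socket timeouts
            if attempt == retries:
                raise
            time.sleep(delay * attempt)  # wait longer after each failure


def build_xml(products):
    """Create one <item> per product, with IDs starting at 1."""
    root = ET.Element("items")
    for item_id, product in enumerate(products, start=1):
        item = ET.SubElement(root, "item", id=str(item_id))
        for field, value in product.items():
            ET.SubElement(item, field).text = value
    return ET.tostring(root, encoding="unicode")
```

A crawl loop would call `fetch` only for URLs that pass `is_product_url`, pause briefly between requests so the server is not overloaded, and log any URL that still fails after the last retry so it can be re-run later.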
I'll be able to do this, but I have one question: will all the websites you're going to input have the same coding structure? The script will look for the same formatted structure it has been coded for at the URL that is input. If you can answer this question, I'll send you a quote via PM. Thanks
Please re-read the thread. I edited my original post completely. The script will be for just 1 site, and all the product pages are structured the same.
Hi, I don't understand this line: "I would like it to be easy for me to be able to manipulate the structure of the XML data in case I need to add/edit elements." Also, I hope you know the starting and ending product IDs.
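One possible reading of the quoted line is that the XML layout should be defined in one place, so that adding or renaming an element does not require touching the scraping logic. A minimal Python sketch of that idea follows; the element names in `FIELDS`, the `items` root, the record keys, and the `start_id` parameter are all hypothetical, since the real XML format has not been posted.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: (XML element name, key in the scraped record).
# Adding, renaming, or reordering elements means editing only this list.
FIELDS = [
    ("link", "url"),
    ("title", "name"),
    ("description", "description"),
]


def build_tree(records, fields=FIELDS, start_id=1):
    """Build an <items> tree with one <item> per record, numbered from start_id."""
    root = ET.Element("items")
    for item_id, record in enumerate(records, start=start_id):
        item = ET.SubElement(root, "item")
        ET.SubElement(item, "id").text = str(item_id)
        for tag, key in fields:
            # Missing keys become empty elements rather than raising an error.
            ET.SubElement(item, tag).text = record.get(key, "")
    return root
```

The resulting tree can be saved with `ET.ElementTree(root).write("products.xml", encoding="utf-8", xml_declaration=True)`, where the file name is again only an example.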