Looking for HTML site scraper (for my own site) to scrape structured data

keithjameslock Peon

Messages: 416

Likes Received: 4

Best Answers: 0

Trophy Points: 0

#1

I have a site with thousands of pages. I would like a script that scrapes 6 pieces of content from each product page (approx. 50k products). The product pages are all 2 folders deep, in the form: http://www.domain.com/cat/subcat/product1.aspx. All other pages should be ignored. The data I want is structured the same way on all pages.
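A minimal sketch of how the "2 folders deep" rule could be checked in Python, so that everything except product pages is skipped. The function name `is_product_url` is hypothetical, and the rule assumes every product URL has exactly three path segments and ends in `.aspx`, as in the example URL above.

```python
from urllib.parse import urlparse


def is_product_url(url: str) -> bool:
    """Return True for URLs shaped like /cat/subcat/product.aspx."""
    path = urlparse(url).path
    # Drop empty pieces caused by leading or trailing slashes.
    segments = [s for s in path.split("/") if s]
    # Two folders plus the page itself, and the page must be an .aspx file.
    return len(segments) == 3 and segments[-1].lower().endswith(".aspx")
```

For example, `is_product_url("http://www.domain.com/cat/subcat/product1.aspx")` is true, while a category page such as `http://www.domain.com/cat/subcat/` is rejected.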

First, I want the full URL.
Then I want the product name which is between the <h1> tags. There are no other h1 tags on the page.
Also, I want the description. The description is in between the paragraph tags immediately below the HTML: "<h2>Product Description</h2>".
etc. (I'll explain later)
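The two fields described so far could be pulled out roughly like this. It is only a sketch using Python's standard `html.parser`; the names `ProductParser` and `extract_fields` are made up for illustration, and it assumes the description is the first <p> after the "Product Description" heading. The remaining fields are not covered because they have not been specified yet.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collects the <h1> text and the first <p> after the description heading."""

    def __init__(self):
        super().__init__()
        self.name = ""
        self.description = ""
        self._tag = None         # tag whose text is currently being collected
        self._h2_text = ""       # text of the <h2> being read
        self._want_desc = False  # True once "Product Description" was seen
        self._done_desc = False  # True once the description <p> has closed

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._tag = "h1"
        elif tag == "h2":
            self._tag = "h2"
            self._h2_text = ""
        elif tag == "p" and self._want_desc and not self._done_desc:
            self._tag = "p"

    def handle_data(self, data):
        if self._tag == "h1":
            self.name += data
        elif self._tag == "h2":
            self._h2_text += data
        elif self._tag == "p":
            self.description += data

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # closing tag of something nested, e.g. </b> inside the <p>
        if tag == "h2" and self._h2_text.strip() == "Product Description":
            self._want_desc = True
        elif tag == "p":
            self._done_desc = True
        self._tag = None


def extract_fields(html: str) -> dict:
    """Return the product name and description found in one page."""
    parser = ProductParser()
    parser.feed(html)
    return {"name": parser.name.strip(),
            "description": parser.description.strip()}
```

The full URL does not need parsing; it is simply the address the page was fetched from.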

Then I want all of that data exported to an XML file... basically creating a new "item" for each product. Each "item" will also need a unique ID, which can be just started at 1. I'll tell you the exact format I need for the XML file, and the URL of the site with the products later.

I would like it to be easy for me to be able to manipulate the structure of the XML data in case I need to add/edit elements.
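One way the export could look, as a sketch with Python's standard `xml.etree.ElementTree`. The element names (`items`, `item`, `url`, `name`, `description`) and the `FIELDS` list are placeholders, since the exact XML format has not been given yet. Keeping the layout in a single list is meant to make adding or renaming elements a one-line change.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: (XML element name, key in the scraped record).
# Edit this list to add, remove, or rename elements in the output.
FIELDS = [("url", "url"), ("name", "name"), ("description", "description")]


def build_xml(products):
    """Build an <items> tree with one numbered <item> per product dict."""
    root = ET.Element("items")
    # IDs simply start at 1 and count up in the order products arrive.
    for item_id, product in enumerate(products, start=1):
        item = ET.SubElement(root, "item", id=str(item_id))
        for element_name, key in FIELDS:
            ET.SubElement(item, element_name).text = product.get(key, "")
    return ET.ElementTree(root)
```

Writing the file is then `build_xml(products).write("products.xml", encoding="utf-8", xml_declaration=True)`; special characters such as `&` are escaped by the library.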

PM me with the following: price and turnaround time. Also, please take timeouts into consideration. It's important that all the data is retrieved, and I don't want to crash the server, so let me know how those 2 potential issues will be handled.
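For reference, the two concerns could be handled along these lines: a per-request timeout with retries so no page is silently lost, and a pause between requests so the server is not hammered. This is a sketch only; the function names and the default values (15 second timeout, 3 attempts, 1 second delay) are assumptions, not tested settings for this site.

```python
import time
import urllib.request


def fetch_once(url, timeout=15):
    """Download one page, giving up if the server is silent for `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retry(url, fetch=fetch_once, retries=3, backoff=5.0, sleep=time.sleep):
    """Try a URL up to `retries` times, waiting longer after each failure."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except OSError:  # covers timeouts, URLError, and HTTPError
            if attempt == retries:
                raise
            sleep(backoff * attempt)


def crawl(urls, delay=1.0, fetch=fetch_once, sleep=time.sleep):
    """Fetch URLs one at a time; return the pages and the URLs that still failed."""
    pages, failed = {}, []
    for url in urls:
        try:
            pages[url] = fetch_with_retry(url, fetch=fetch, sleep=sleep)
        except OSError:
            failed.append(url)  # kept so the run can be repeated for these only
        sleep(delay)  # pause between requests to keep load on the server low
    return pages, failed
```

Any URL left in `failed` can be fed back into `crawl` later, so the final export can be checked against the full product list.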

Thanks,
Keith

p.s. Any programming language is fine...

Last edited: Dec 22, 2009

keithjameslock, Dec 22, 2009

NetworkTown.Net Well-Known Member

Messages: 2,022

Likes Received: 28

Best Answers: 0

Trophy Points: 165

#2

I'll be able to do this, but I have one question: do all the websites you're going to be inputting have the same coding structure? I ask because the script will look for the same formatted structure it has been coded for in the URL that is input. If you answer this question, I'll send you a quote via PM.

Thanks

NetworkTown.Net, Dec 22, 2009

keithjameslock Peon

Messages: 416

Likes Received: 4

Best Answers: 0

Trophy Points: 0

#3

NetworkTown.Net said:

I'll be able to do this, but I have one question: do all the websites you're going to be inputting have the same coding structure? I ask because the script will look for the same formatted structure it has been coded for in the URL that is input. If you answer this question, I'll send you a quote via PM.

Thanks

Please re-read the thread. I edited it completely. The script will be for just 1 site, and all the product pages are structured the same way.

keithjameslock, Dec 22, 2009

frank007 Well-Known Member

Messages: 160

Likes Received: 2

Best Answers: 0

Trophy Points: 123

#4

Please check your PMs.

frank007, Dec 22, 2009

innovatewebs Well-Known Member

Messages: 194

Likes Received: 0

Best Answers: 0

Trophy Points: 101

#5

Hi,
I don't understand this line: "I would like it to be easy for me to be able to manipulate the structure of the XML data in case I need to add/edit elements."

I hope you know the starting and ending product IDs.

innovatewebs, Dec 23, 2009
