Hi, I notice that people sell content databases that have obviously been ripped/extracted from other websites. How is this done? Do they write some custom Perl/PHP code that spiders the website, extracts all the data from the web pages, cleans it up, and then inserts it into their own database? Or is there a piece of software already written that can be purchased for this purpose? For example, if I wanted to extract/rip all the recipes from allrecipes.com, how would I do it? Regards, Wyatt
They've evidently hacked into it. You can't just rip off a content database; if you could, basically every website's content would have been stolen by now...
I doubt they hack the MySQL server to get the data, although it's very possible. My bet is they spider the website and dump the content into a MySQL database of their own. I just want to know if this software exists before I spend the time writing my own. Wyatt
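In case it helps picture it, the spider-and-dump idea boils down to something like the rough PHP sketch below. The URL, the regex, the table name, and the database credentials are all made-up placeholders, not anything from a real site:

<?php
// Rough sketch of the spider-and-dump approach: fetch a page, pull out a field,
// and store it in a local MySQL table. Everything named here is a placeholder.
$url  = 'http://example.com/recipes/123';
$html = file_get_contents($url);

// Grab whatever sits between hypothetical <h1 class="title"> tags.
if (preg_match('/<h1 class="title">(.*?)<\/h1>/s', $html, $m)) {
    $title = trim(strip_tags($m[1]));

    // Assumed local database and table; adjust credentials and schema as needed.
    $db   = new PDO('mysql:host=localhost;dbname=scrape', 'user', 'password');
    $stmt = $db->prepare('INSERT INTO recipes (title, url) VALUES (?, ?)');
    $stmt->execute(array($title, $url));
}
?>

A real spider would also follow links from an index page and loop over each recipe URL, but the fetch / extract / insert cycle is the core of it.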
Most of the databases I see for sale come from hacking sites, not places like the DP forums; the ones sold on the DP forums are probably commercial databases created on a turnkey basis. It's also possible they spider the website, although a script would be needed to pick out the specific areas of each page. So I'd reckon they might not choose that option, because it would involve a lot of custom coding.
So, you want to steal another website's content. And you want us to help? It wouldn't be hard to write a program to steal content, but why not just develop your own unique information?
I'm not stealing anything. I'm simply trying to figure out how to extract data from a public website.
Public website? So the owner wouldn't mind you copying it? If so, why not just ask the webmaster for a copy?
Hey dude, slow down. I think he just wants the basic idea of how it's done. If he really did steal from the owner's website, that would be his own problem to handle. Although yes, I do agree: why not just ask the owner of the site for a copy?
I've had content stolen before... It takes me weeks to write dozens of good, quality posts, and then someone decides to post them on their site with no link back to me... For any coder, making a scraper is beginner's stuff, so writing one shouldn't be a problem for him... But still, it's not a great plan to build a website on other people's content...
I never said I was stealing or republishing data. You are making this assumption. I simply want to know if there is an easier way to extract data versus writing my own code.
Some people do it manually; yes, it takes time, so they hire others to do it. Others write scripts that automatically extract the data, but scripts don't always give 100% accurate results.
Scraping content can be used for perfectly legitimate reasons. To do it, you just write a spider in your favorite programming language. With PHP, most people opt for regular expressions to scrape the content; I think using explode() is far easier.
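For what it's worth, here's a small illustration of both approaches. The URL and the <div id="content"> markers are invented examples, so you'd swap in whatever delimiters the real pages actually use:

<?php
// Fetch a page and extract the same block two ways. The markup is hypothetical.
$html = file_get_contents('http://example.com/some-page');

// Regular-expression approach: capture everything inside <div id="content">.
if (preg_match('/<div id="content">(.*?)<\/div>/s', $html, $m)) {
    $viaRegex = trim($m[1]);
}

// explode() approach: split on the known markers before and after the content.
$parts = explode('<div id="content">', $html);
if (isset($parts[1])) {
    $inner      = explode('</div>', $parts[1]);
    $viaExplode = trim($inner[0]);
}
?>

The regex version is more flexible when the markup varies; the explode() version is easier to read when the page always uses the same fixed delimiters.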
Use vder software, it is free. Screenshot: http://binhgiang.sourceforge.net/xmlalbum/screenshots.html Download: http://binhgiang.sourceforge.net/site/download.jsp
Spidering or manual copy-and-paste are the only ways that come to mind right now. Check out this thread: www.daniweb.com/forums/thread10023.html Also check out this Zillman article: http://zillman.blogspot.com/2004/09/web-data-extractors.html and search for 'web data extractor'; I found tons of relevant links that way. Check out WebSundew and Mozenda, for example.