Hi, I notice that people sell content databases that have obviously been ripped/extracted from other websites. How is this done? Do they write some custom Perl/PHP code that spiders the website, extracts all the data from the web pages, cleans it up, and then inserts it into their own database? Or is there a piece of software already written that can be purchased for this purpose? For example, if I wanted to extract/rip all the recipes from allrecipes.com, how would I do it? Regards, Wyatt
They've evidently hacked into it. You can't just rip off a content database; if you could, basically every website's content would have been stolen by now...
I doubt they hack the MySQL server to get the data, although it's very possible. My bet is they spider the website and dump the content into a MySQL database of their own. I just want to know if this software exists before I spend the time writing my own. Wyatt
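In case it helps picture it, the spider-and-dump idea boils down to something like the rough PHP sketch below. The URL, the regex, the table name, and the database credentials are all made-up placeholders, not anything from a real site:

<?php
// Rough sketch of the spider-and-dump approach: fetch a page, pull out a field,
// and store it in a local MySQL table. Everything named here is a placeholder.
$url  = 'http://example.com/recipes/123';
$html = file_get_contents($url);

// Grab whatever sits between hypothetical <h1 class="title"> tags.
if (preg_match('/<h1 class="title">(.*?)<\/h1>/s', $html, $m)) {
    $title = trim(strip_tags($m[1]));

    // Assumed local database and table; adjust credentials and schema as needed.
    $db   = new PDO('mysql:host=localhost;dbname=scrape', 'user', 'password');
    $stmt = $db->prepare('INSERT INTO recipes (title, url) VALUES (?, ?)');
    $stmt->execute(array($title, $url));
}
?>

A real spider would also follow links from an index page and loop over each recipe URL, but the fetch / extract / insert cycle is the core of it.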
Most of the databases I see for sale come from hacking sites, not places like the DP forums; the ones sold on the DP forums are probably commercial databases created on a turnkey basis. It's also possible they spider the website, although a script would be needed to pick out the specific areas of each page. So I'd reckon they might not choose that option, because it would involve a lot of custom coding.
So, you want to steal another website's content. And you want us to help? It wouldn't be hard to write a program to steal content, but why not just develop your own unique information?
I'm not stealing anything. I'm simply trying to figure out how to extract data from a public website.
Public website? So the owner wouldn't mind you copying it? If so, why not just ask the webmaster for a copy?
Hey dude, slow down. I think he just wants the basic idea of how it's done. If he really did steal from the owner's website, that would be his own problem to handle. Although yes, I do agree: why not just ask the owner of the site for a copy?
I've had content stolen before... It takes me weeks to write dozens of good, quality posts, and then someone decides to post them on their site with no link back to me... For any coder, making a scraper is beginner's stuff, so writing one shouldn't be a problem for him... But still, it's not a great plan to build a website on other people's content...
I never said I was stealing or republishing data. You are making this assumption. I simply want to know if there is an easier way to extract data versus writing my own code.
Some people do it manually; yes, it takes time, so they hire others to do it. Others write scripts that automatically extract the data, but scripts don't always give 100% accurate results.
Scraping content can be used for perfectly legitimate reasons. To do it, you just write a spider in your favorite programming language. With PHP, most people opt for regular expressions to scrape the content; I think using explode() is far easier.
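For what it's worth, here's a small illustration of both approaches. The URL and the <div id="content"> markers are invented examples, so you'd swap in whatever delimiters the real pages actually use:

<?php
// Fetch a page and extract the same block two ways. The markup is hypothetical.
$html = file_get_contents('http://example.com/some-page');

// Regular-expression approach: capture everything inside <div id="content">.
if (preg_match('/<div id="content">(.*?)<\/div>/s', $html, $m)) {
    $viaRegex = trim($m[1]);
}

// explode() approach: split on the known markers before and after the content.
$parts = explode('<div id="content">', $html);
if (isset($parts[1])) {
    $inner      = explode('</div>', $parts[1]);
    $viaExplode = trim($inner[0]);
}
?>

The regex version is more flexible when the markup varies; the explode() version is easier to read when the page always uses the same fixed delimiters.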
Use vder software, it is free. Screenshot: http://binhgiang.sourceforge.net/xmlalbum/screenshots.html Download: http://binhgiang.sourceforge.net/site/download.jsp
Spidering or manual copy-and-paste are the only ways that come to mind right now. Check out this thread: www.daniweb.com/forums/thread10023.html Also check out this Zillman article: http://zillman.blogspot.com/2004/09/web-data-extractors.html and search for 'web data extractor'; I found tons of relevant links that way. Check out WebSundew and Mozenda, for example.