How to extract/rip data from websites databases?

Discussion in 'General Business' started by wyatt12, Oct 30, 2007.

  1. #1
    Hi,

    I notice that people sell content databases that have obviously been ripped/extracted from other websites. How is this done?

    Do they write some custom Perl/PHP code that spiders the website, extracts all the data from the web pages, cleans it up, and then inserts it into their own database?

    Or is there a piece of software that has already been written that can be purchased for such purpose?

    For example, if I wanted to extract/rip all the recipes from allrecipes.com, how would I do it?

    Regards,
    Wyatt
     
    wyatt12, Oct 30, 2007 IP
  2. wvccboy

    wvccboy Notable Member

    Messages:
    2,632
    Likes Received:
    81
    Best Answers:
    1
    Trophy Points:
    250
    #2
    They've evidently hacked into it... You can't just rip off a content database. If you could, basically every website would have had its content stolen by now...
     
    wvccboy, Oct 30, 2007 IP
  3. wyatt12

    wyatt12 Active Member

    Messages:
    148
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    53
    #3
    I doubt they hack the MySQL server to get the data, although it's certainly possible. My bet is they spider the website and dump the content into a MySQL database of their own. I just want to know whether this software already exists before I spend the time to write my own.

    Wyatt
     
    wyatt12, Oct 30, 2007 IP
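
The "spider the site, extract, clean up, insert into a database" idea described above might look roughly like the following. This is only a minimal sketch under several assumptions: it is written in Python rather than the Perl/PHP mentioned in the thread, it uses an in-memory SQLite database as a stand-in for MySQL, and the page markup, the `title`/`ingredient` class names, and every function name are hypothetical. It parses a hard-coded page instead of fetching one.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page markup; a real site will use different tags and classes.
SAMPLE_PAGE = """
<html><body>
  <h1 class="title">  Banana   Bread </h1>
  <ul>
    <li class="ingredient">3 ripe bananas</li>
    <li class="ingredient"> 2 cups  flour </li>
  </ul>
</body></html>
"""


class RecipeParser(HTMLParser):
    """Collects text inside tags whose class is 'title' or 'ingredient'."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.ingredients = []
        self._field = None  # which field the parser is currently inside

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "title":
            self._field = "title"
        elif css_class == "ingredient":
            self._field = "ingredient"
            self.ingredients.append("")

    def handle_endtag(self, tag):
        # Any closing tag ends the current field (fine for this flat markup).
        self._field = None

    def handle_data(self, data):
        if self._field == "title":
            self.title += data
        elif self._field == "ingredient":
            self.ingredients[-1] += data


def clean(text):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


def extract_recipe(html):
    """Return (title, [ingredients]) with whitespace cleaned up."""
    parser = RecipeParser()
    parser.feed(html)
    return clean(parser.title), [clean(i) for i in parser.ingredients]


def store_recipe(conn, title, ingredients):
    """Insert one recipe row; ingredients are stored newline-separated."""
    conn.execute("CREATE TABLE IF NOT EXISTS recipes (title TEXT, ingredients TEXT)")
    conn.execute("INSERT INTO recipes VALUES (?, ?)", (title, "\n".join(ingredients)))
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
title, ingredients = extract_recipe(SAMPLE_PAGE)
store_recipe(conn, title, ingredients)
```

The extraction step is the part that has to be rewritten for each site, since it depends entirely on how that site marks up its pages.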
  4. wvccboy

    #4
    Most people do that, because that's where I see many databases come from: hacking sites, not places like DP forums... The ones sold on DP forums are probably commercial databases created on a turnkey basis.

    It is also possible they spider the website, although a script would be needed to pull out the specific areas of each page. So I'd reckon they might not choose that option, because it would involve a lot of custom coding.
     
    wvccboy, Oct 30, 2007 IP
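
For a sense of how much custom coding "spider the website and pull out specific areas of each page" involves, here is a rough sketch. It assumes Python rather than the languages named in the thread, and it crawls a hypothetical three-page site held in a dictionary (`FAKE_SITE`) so that it runs without network access. The `id="content"` convention and all names are illustrative assumptions, not anything a real site is known to use.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical three-page site held in memory; a real spider would fetch over HTTP.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<div id="content">Page A text</div> <a href="/b">B</a>',
    "http://example.test/b": '<div id="content">Page B text</div> <a href="/">home</a>',
}


class PageParser(HTMLParser):
    """Gathers outgoing links and the text of the element with id="content"."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.content = ""
        self._in_content = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        if attrs.get("id") == "content":
            self._in_content = True

    def handle_endtag(self, tag):
        # Simplification: assumes the content div contains no nested div.
        if tag == "div":
            self._in_content = False

    def handle_data(self, data):
        if self._in_content:
            self.content += data


def crawl(start_url, fetch):
    """Breadth-first crawl; returns {url: extracted content} for pages that have any."""
    seen = {start_url}
    queue = deque([start_url])
    results = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page not found
            continue
        parser = PageParser()
        parser.feed(html)
        if parser.content:
            results[url] = parser.content.strip()
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return results


pages = crawl("http://example.test/", FAKE_SITE.get)
```

The crawl loop itself is generic; only `PageParser` encodes which area of the page to keep, which is the site-specific part.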
  5. tarponkeith

    tarponkeith Well-Known Member

    Messages:
    4,758
    Likes Received:
    279
    Best Answers:
    0
    Trophy Points:
    180
    #5
    So, you want to steal another website's content. And, you want us to help?

    It wouldn't be hard to write a program to steal content, but why not just develop your own unique information?
     
    tarponkeith, Oct 30, 2007 IP
  6. wyatt12

    #6
    I'm not stealing anything. I'm simply trying to figure out how to extract data from a public website.
     
    wyatt12, Oct 30, 2007 IP
  7. tarponkeith

    #7
    Public website? So the owner wouldn't mind you copying it? If so, why not just ask the webmaster for a copy?
     
    tarponkeith, Oct 30, 2007 IP
  8. wvccboy

    #8
    Hey dude slow down.

    I think he just wants the basic idea of how to do it.

    If he were to really steal from the owner's website it'd be his own problem to handle the situation.

    Although yes, I do agree: why not just ask the owner of the site for a copy?
     
    wvccboy, Oct 30, 2007 IP
  9. tarponkeith

    #9
    I've had content stolen before... Takes me weeks to write dozens of good, quality posts... Then someone decides to post them on their site with no link back to me...

    For any coder, making a scraper is beginner's stuff, so making one shouldn't be a problem for him... But still, it's not a great plan to make a website using other people's content...
     
    tarponkeith, Oct 30, 2007 IP
  10. wyatt12

    #10
    I never said I was stealing or republishing data. You are making this assumption. I simply want to know if there is an easier way to extract data versus writing my own code.
     
    wyatt12, Oct 30, 2007 IP
  11. tarponkeith

    #11
    You could ask the owner of the content... he might give it to you...
     
    tarponkeith, Oct 30, 2007 IP
  12. nullpointer

    nullpointer Peon

    Messages:
    274
    Likes Received:
    14
    Best Answers:
    0
    Trophy Points:
    0
    #12
    Try Web Content Extractor
     
    nullpointer, Feb 3, 2008 IP
  13. SNaRe

    SNaRe Well-Known Member

    Messages:
    1,132
    Likes Received:
    32
    Best Answers:
    0
    Trophy Points:
    115
    #13
    You can do it with preg_match in PHP. If you need a service like that, let me know.
     
    SNaRe, Feb 4, 2008 IP
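
As a rough illustration of the regular-expression approach suggested above, here is a small sketch. It is written in Python, using `re.search` and `re.findall` as stand-ins for PHP's `preg_match` and `preg_match_all`. The markup and the patterns are hypothetical and only fit pages shaped like the sample string.

```python
import re

# Hypothetical markup; the patterns below only fit pages shaped like this one.
html = '<h1 class="title">Banana Bread</h1><span class="time">45 min</span>'

# re.search stops at the first match, much as PHP's preg_match does.
match = re.search(r'<h1 class="title">(.*?)</h1>', html, re.DOTALL)
title = match.group(1).strip() if match else None

# re.findall collects every match, closer to PHP's preg_match_all.
minutes = re.findall(r'<span class="time">(\d+) min</span>', html)
```

Patterns like these tend to break whenever the site changes its markup, so they usually need checking against each new batch of pages.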
  14. angilina

    angilina Notable Member

    Messages:
    7,824
    Likes Received:
    186
    Best Answers:
    0
    Trophy Points:
    260
    #14
    People do it manually. Yes, it takes time, so people hire others to do it.

    Also, people write scripts that extract the data automatically,

    but scripts do not always give 100% accurate data.
     
    angilina, Feb 4, 2008 IP
  15. Alreadyinuse23

    Alreadyinuse23 Active Member

    Messages:
    73
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    91
    #15
    How the hell do you know he doesn't want to get the data from his own site?
     
    Alreadyinuse23, Jul 10, 2008 IP
  16. HBZSoftware.com

    HBZSoftware.com Peon

    Messages:
    88
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #16
    Scraping content can be used for perfectly legitimate reasons.

    To do this, you just write a spider in your favorite programming language.

    With PHP, most opt to use regular expressions to scrape content. I think using explode() is far easier.
     
    HBZSoftware.com, Jul 10, 2008 IP
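
The split-on-markers idea mentioned above (what `explode()` does in PHP) could be sketched like this. It is written in Python, with `str.split` standing in for `explode()`. The helper name `between` and the sample markup are made up for illustration.

```python
def between(text, start, end):
    """Return the text between the first `start` marker and the next `end`
    marker, or None if either marker is missing."""
    parts = text.split(start, 1)  # at most two pieces: before and after `start`
    if len(parts) < 2:
        return None
    rest = parts[1].split(end, 1)
    if len(rest) < 2:
        return None
    return rest[0]


# Hypothetical markup for illustration.
html = '<div id="recipe"><h1>Banana Bread</h1><p>Mash the bananas.</p></div>'
title = between(html, "<h1>", "</h1>")
step = between(html, "<p>", "</p>")
missing = between(html, "<h2>", "</h2>")
```

This avoids regular-expression syntax altogether, at the cost of only handling fixed, literal markers.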
  17. doibuon

    doibuon Peon

    Messages:
    1
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #17
    Use the vder software; it is free.

    Screenshot: http://binhgiang.sourceforge.net/xmlalbum/screenshots.html

    Download: http://binhgiang.sourceforge.net/site/download.jsp
     
    doibuon, Jul 23, 2009 IP
  18. Traffic-Bug

    Traffic-Bug Active Member

    Messages:
    1,866
    Likes Received:
    8
    Best Answers:
    0
    Trophy Points:
    80
    #18
    Spidering or manual copy + paste are the only ways that come to mind as of now.
    Check out this thread:
    www.daniweb.com/forums/thread10023.html

    Check out this Zillman article:
    http://zillman.blogspot.com/2004/09/web-data-extractors.html

    Also search for 'web data extractor'; I found tons of relevant links that way. Check out WebSundew and Mozenda, for example.
     
    Traffic-Bug, Oct 27, 2009 IP
  19. csharpp

    csharpp Peon

    Messages:
    7
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #19
    Try the new ScrapePro Web Scraper Designer from http://www.scrapepro.com
     
    csharpp, Aug 29, 2010 IP