Best way to spider/crawl content on another site?

Discussion in 'Programming' started by yo-yo, Jul 24, 2005.

  1. #1
    Hi,

    I'm wondering what methods most of you use for spidering content / crawling a site...

    I'm currently using PHP with the file_get_contents function, doing a few string replaces, and saving the data I want to MySQL... I need to do this about 3,000 times to get all the data I need from the site...

    I've been doing it in batches of 100 URLs at a time, and that takes 1-2 minutes each time. I'm a little worried about hitting this site too many times... are there complications?

    Are there better ways of doing this in PHP? What about other languages?

    Also, are there any legal implications when doing this? Not necessarily about having someone else's content, but about spidering someone else's site a ton of times...

    PS: The content is copyrighted, but I have permission to use it since I'm an affiliate...
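
    Roughly what I'm doing, boiled down (just a sketch -- the table name, URL file, and cleanup rules here are made up):

    <?php
    // One batch of the approach described above: fetch each URL,
    // do some crude string cleanup, and store the result in MySQL.
    $pdo = new PDO('mysql:host=localhost;dbname=scrape', 'user', 'pass');
    $stmt = $pdo->prepare('INSERT INTO pages (url, body) VALUES (?, ?)');

    // one batch of 100 URLs out of the full list
    $urls = array_slice(file('urls.txt', FILE_IGNORE_NEW_LINES), 0, 100);

    foreach ($urls as $url) {
        $html = file_get_contents($url);
        if ($html === false) {
            continue; // skip failures, retry them later
        }
        // crude cleanup: swap out the bits I don't want
        $body = str_replace(array('<br>', '&nbsp;'), array("\n", ' '), $html);
        $stmt->execute(array($url, $body));
        sleep(1); // pause between requests so I don't hammer the site
    }
    Code (markup):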
     
    yo-yo, Jul 24, 2005 IP
  2. Shoemoney

    Shoemoney $

    Messages:
    4,475
    Likes Received:
    588
    Best Answers:
    0
    Trophy Points:
    295
    #2
    There are always legal issues on the net... but unless you're impacting that site's services, I doubt it's much of a deal.

    When I grab a site I use wget, disguise the user agent, and use a rotating proxy setup like Tor and Privoxy.

    If you have permission, I would think you're golden.
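
    If you'd rather stay in PHP, the same user-agent trick looks roughly like this with cURL (a quick sketch -- the agent string is made up, and 8118 is just Privoxy's default port):

    <?php
    // Fetch a page through a local Privoxy/Tor proxy with a
    // disguised user agent, using PHP's cURL extension.
    $ch = curl_init('http://www.example.com/page.html');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MyCrawler/1.0)');
    curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:8118'); // Privoxy's default port
    $html = curl_exec($ch);
    curl_close($ch);
    Code (markup):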
     
    Shoemoney, Jul 24, 2005 IP
  3. yo-yo

    yo-yo Well-Known Member

    Messages:
    4,620
    Likes Received:
    205
    Best Answers:
    0
    Trophy Points:
    185
    #3
    I've heard good things about wget... Is it easy to use? What's the learning curve?
     
    yo-yo, Jul 24, 2005 IP
  4. Shoemoney

    Shoemoney $

    Messages:
    4,475
    Likes Received:
    588
    Best Answers:
    0
    Trophy Points:
    295
    #4
    It's really easy to use...

    It's all command-line based, but totally scriptable.

    I also use cURL and other stuff, just depending on what I'm doing.
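
    For example, you can drive wget from a quick PHP loop (a throwaway sketch -- the URL file and flags are just examples):

    <?php
    // Shell out to wget for each URL in a list, saving the
    // output under pages/ and pausing between requests.
    foreach (file('urls.txt', FILE_IGNORE_NEW_LINES) as $url) {
        shell_exec('wget -q -P pages/ ' . escapeshellarg($url));
        sleep(2); // be polite to the server
    }
    Code (markup):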
     
    Shoemoney, Jul 24, 2005 IP
  5. yo-yo

    yo-yo Well-Known Member

    Messages:
    4,620
    Likes Received:
    205
    Best Answers:
    0
    Trophy Points:
    185
    #5
    I'm right behind you.. on my way to $43k per month in AS :D
     
    yo-yo, Jul 24, 2005 IP
  6. Shoemoney

    Shoemoney $

    Messages:
    4,475
    Likes Received:
    588
    Best Answers:
    0
    Trophy Points:
    295
    #6
    =P lol... that was my 2nd lowest this year so far too =P
     
    Shoemoney, Jul 24, 2005 IP
  7. Wizard

    Wizard Member

    Messages:
    80
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    43
    #7
    I do this quite a bit using PHP. Data harvesting is one of my passions. There are a lot of web sites that I have spidered - a lot - and I have never had my IP address blocked.

    Most of the web sites have copyright notices about the content, so I will leave that decision to you. The data I extract is used for my own purposes, and I have always hesitated to sell it. I have several large projects I work on, and the data helps me gain some advantages in creating content, pricing, etc.

    Unethical - perhaps, but illegal - I do not think so, as it's usually for research and the info is available to anyone visiting their web site. I do not hack into databases; I just view the data that is presented to any visitor.
     
    Wizard, Jul 26, 2005 IP
  8. aboyd

    aboyd Well-Known Member

    Messages:
    158
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    138
    #8
    Unlike Shoemoney, I don't think wget is easy to use. Easy to use is "press 1 or 2 buttons, get what you want."

    However, I've been using wget for a while now, and I've saved some wget command-line text that I've used, just so I don't have to relearn the whole thing each time. So here is what I've saved.

    Grab every image from apinupsite.com, except thumbnails. If it asks for a login, give it. Verbose output, two second pause between requests:

    wget -r -k -nd --progress=dot:binary -Q1000m -w2 -t5 -l0 -H -A.jpg,.jpeg,.png -R"th_*,tn_*" -D.apinupsite.com --http-user=username --http-passwd=password http://www.apinupsite.com/members/
    Code (markup):
    Download a full Web site, including HTML and images, keep directory structure:

    wget -r --progress=dot:binary -w1 -Q1000m -v -t3 -nH -np -k -l10 -Dwww.yoursite.com http://www.yoursite.com/index.html
    Code (markup):
    Try those two on your own site first, to see the difference. Be verrrry careful to keep a pause in there (the -w stuff). Without a pause, wget will clobber a low-end server, and the admin will surely have a chat with your ISP.

    -Tony
     
    aboyd, Jul 27, 2005 IP
  9. nevetS

    nevetS Evolving Dragon

    Messages:
    2,544
    Likes Received:
    211
    Best Answers:
    0
    Trophy Points:
    135
    #9
    Putting together scripts with LWP::Simple is pretty easy as well if you need to do any parsing on the output. wget is by far the simplest way though.

    If you must use PHP, use cURL. PHP spidering seems much slower by comparison - although I don't know why.
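
    If speed is the complaint, cURL's multi interface lets PHP fetch several pages in parallel -- a rough sketch (the URL list is made up):

    <?php
    // Fetch a handful of URLs concurrently with curl_multi,
    // then read each result back out.
    $urls = array('http://www.example.com/a.html', 'http://www.example.com/b.html');
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    do { // run all transfers until none are still active
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);
    foreach ($handles as $url => $ch) {
        $html = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        // ... parse $html for $url here ...
    }
    curl_multi_close($mh);
    Code (markup):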
     
    nevetS, Jul 27, 2005 IP
  10. Shoemoney

    Shoemoney $

    Messages:
    4,475
    Likes Received:
    588
    Best Answers:
    0
    Trophy Points:
    295
    #10
    I use wget to suck down the pages and Python to parse ;)
     
    Shoemoney, Jul 27, 2005 IP
  11. Abilnet

    Abilnet Peon

    Messages:
    41
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #11
    Wget is powerful, but sometimes the command-line tools are "too powerful" if you don't know exactly what you're doing.

    I have not personally used this script, but if you don't mind paying a few bucks for quite an advanced data-mining script, you may want to take a look at "Unit Miner" from qualityunit.com.

    Again, I've not personally tested it, but I've bought a license for their affiliate script, and that at least is a very powerful script.

    ...2 cents :)
     
    Abilnet, Jul 28, 2005 IP
  12. senexom

    senexom Guest

    Messages:
    28
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #12
    There is a much easier way of pulling content off a website - not necessarily storing it in a database, but rather saving it as text...

    Check out cURL - it's a command-line tool that works on pretty much any platform and has great documentation.

    Sorry, can't post links just yet - the site is:
    http://curl.haxx.se/
     
    senexom, Aug 1, 2005 IP
  13. yabsoft

    yabsoft Active Member

    Messages:
    118
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    55
    #13
    I have coded a script to import data from dir.google.com and dir.yahoo.com.
    Does this violate their TOS?
     
    yabsoft, Aug 2, 2005 IP
  14. nevetS

    nevetS Evolving Dragon

    Messages:
    2,544
    Likes Received:
    211
    Best Answers:
    0
    Trophy Points:
    135
    #14
    If you hit their servers hard, your IP will be blocked.

    I believe it is against their TOS to scrape their results, but I honestly do not know the specifics behind what you can and cannot do.
     
    nevetS, Aug 2, 2005 IP
  15. yabsoft

    yabsoft Active Member

    Messages:
    118
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    55
    #15
    Does dmoz.org allow you to import data by scraping?
     
    yabsoft, Aug 2, 2005 IP
  16. nevetS

    nevetS Evolving Dragon

    Messages:
    2,544
    Likes Received:
    211
    Best Answers:
    0
    Trophy Points:
    135
    #16
    dmoz has a few solutions - you can download the entire directory in an .rdf file, or you can google around for a few "live data" solutions that use web services. They do not let you scrape, however.
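
    If you go the .rdf route, the dump is huge, so stream it line by line. A rough sketch, assuming you've already downloaded and gunzipped the content dump locally (the file name and the ExternalPage format are assumptions from memory):

    <?php
    // Print every external URL listed in a local copy of the
    // dmoz RDF content dump.
    $fh = fopen('content.rdf.u8', 'r');
    while (($line = fgets($fh)) !== false) {
        if (preg_match('#<ExternalPage about="([^"]+)">#', $line, $m)) {
            echo $m[1], "\n";
        }
    }
    fclose($fh);
    Code (markup):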
     
    nevetS, Aug 2, 2005 IP
  17. aboyd

    aboyd Well-Known Member

    Messages:
    158
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    138
    #17
    Note that a TOS on publicly accessible data is completely unenforceable. A TOS or any other legal agreement has to be agreed to in order to have any weight. If they put up a page and say "if you scrape it, you violate the TOS," well, big whoop.

    Of course, I'm not suggesting that violating the TOS is good. There are ways that it can haunt you. They can talk to your ISP, and many ISPs will take a TOS violation seriously, regardless of how legal it is. Also, if you do something with the screen scrape that violates copyright or trademarks, then forget the TOS, because you've got a whole boatload of other problems.

    -Tony
     
    aboyd, Aug 2, 2005 IP
  18. jyotsna

    jyotsna Peon

    Messages:
    5
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #18
    Hi,

    I started a new project wherein I have to fetch the dynamic content of two sites and save it in my database. Please help in this regard.

    Each item has a title, image, description, etc...

    Thanks in Advance.

    I am waiting for your reply.

    Jyotsna. Ch.
     
    jyotsna, Sep 13, 2005 IP
  19. aboyd

    aboyd Well-Known Member

    Messages:
    158
    Likes Received:
    17
    Best Answers:
    0
    Trophy Points:
    138
    #19
    But but but... you're posting in a thread where the answers are already given. Didn't you see the mentions of wget -- including sample lines -- and cURL, and Perl's LWP module, and Unit Miner? What more do you need?

    I guess if you're just asking how to parse data once you've got it, then the ultimate tool is regular expressions. PHP regex here:

    http://us2.php.net/preg_match

    And Perl regex here:

    http://search.cpan.org/dist/perl/pod/perlre.pod
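
    For instance, grabbing a page title with preg_match looks roughly like this (an off-the-cuff sketch; the pattern and URL are made up):

    <?php
    // Pull the <title> element out of a fetched page.
    $html = file_get_contents('http://www.example.com/');
    if (preg_match('#<title>(.*?)</title>#is', $html, $m)) {
        echo trim($m[1]);
    }
    Code (markup):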

    -Tony
     
    aboyd, Sep 13, 2005 IP