Hi, I'm wondering what methods most of you use for spidering content / crawling a site... I'm currently using PHP with the file_get_contents function, doing a few string replaces, and saving the data I want to MySQL. I need to do this about 3,000 times to get all the data I need from the site. I've been doing it in batches of 100 URLs at a time, and each batch takes 1-2 minutes. I'm a little worried about hitting this site too many times... are there complications? Are there better ways of doing this in PHP? What about other languages? Also, are there any legal implications when doing this? Not necessarily about having someone else's content, but about spidering someone else's site a ton of times... PS: the content is copyrighted, but I have permission to use it since I'm an affiliate.
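For anyone in the same boat, a minimal sketch of that kind of fetch-and-store loop might look like the following (the URL list, regex, and table/column names are made-up placeholders, not the actual site):

<?php
// Minimal sketch: fetch each URL, pull out the piece we want, save it to MySQL.
// The URL list, regex, and table/column names below are placeholders.
$db = new mysqli('localhost', 'db_user', 'db_pass', 'scrape_db');

$urls = array(
    'http://www.example.com/page1.html',
    'http://www.example.com/page2.html',
    // ... the rest of the ~3,000 URLs
);

foreach ($urls as $url) {
    $html = file_get_contents($url);
    if ($html === false) {
        continue; // skip pages that fail to load
    }

    // Crude extraction: grab whatever sits between two known markers.
    if (preg_match('#<div class="content">(.*?)</div>#s', $html, $m)) {
        $content = $m[1];
        $stmt = $db->prepare('INSERT INTO pages (url, content) VALUES (?, ?)');
        $stmt->bind_param('ss', $url, $content);
        $stmt->execute();
        $stmt->close();
    }

    sleep(2); // pause between requests so you don't hammer their server
}
?>

The sleep() call is the important bit if you're worried about hitting the site too hard.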
There are always legal issues on the net... but unless you're impacting that site's services, I doubt it's much of a deal. When I grab a site I use wget, disguise the user agent, and use a rotating proxy setup like Tor and Privoxy. If you have permission, I would think you're golden.
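A typical invocation along those lines might be something like this (the user-agent string and the 127.0.0.1:8118 Privoxy port are just common defaults, so adjust them to your own setup):

wget -w 2 --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1)" -e use_proxy=yes -e http_proxy=127.0.0.1:8118 http://www.example.com/page.html

The -w 2 keeps a two-second pause between requests, and the -e options route the traffic through the local Privoxy/Tor proxy.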
It's really easy to use... it's all command-line based but totally scriptable. I also use cURL and other stuff, just depending on what I'm doing.
I do this quite a bit using PHP. Data harvesting is one of my passions. There are a lot of web sites that I have spidered - a lot - and I have never had my IP address blocked. Most of the web sites have copyright notices about the content, so I will leave that decision to you. The data I extract is used for my own purposes, and I have always hesitated to try to sell it. I have several large projects I work on, and the data helps me gain some advantages on creating content, pricing, etc. Unethical - perhaps, but illegal - I do not think so, as it's usually for research and the info is available to anyone visiting their web site. I do not hack into databases; I just view the data that is presented to any visitor.
Unlike Shoemoney, I don't think wget is easy to use. Easy to use is "press 1 or 2 buttons, get what you want." However, I've been using wget for a while now, and I've saved some wget command-line text that I've used, just so I don't have to relearn the whole thing each time. So here is what I've saved.

Grab every image from apinupsite.com, except thumbnails. If it asks for a login, give it. Verbose output, two-second pause between requests:

wget -r -k -nd --dot-style=binary --no-host-lookup -Q1000m -w2 -t5 -l0 -H -A.jpg,.jpeg,.png -R"th_*,tn_*" -D.apinupsite.com --http-user=username --http-passwd=password http://www.apinupsite.com/members/

Download a full Web site, including HTML and images, keeping the directory structure:

wget -r --progress=binary -w1 -Q1000m -v -t3 -nH -np -k -l10 -Dwww.yoursite.com http://www.yoursite/index.html

Try those two on your own site first, to see the difference. Be verrrry careful to keep a pause in there (the -w stuff). Without a pause, wget will clobber a low-end server, and the admin will surely have a chat with your ISP. -Tony
Putting together scripts with LWP::Simple is pretty easy as well if you need to do any parsing of the output. wget is by far the simplest way, though. If you must use PHP, use cURL. PHP spidering seems much slower by comparison, although I don't know why.
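If you do go the PHP + cURL route, a bare-bones fetch looks something like this (the URL and user-agent string are just placeholders):

<?php
// Bare-bones page fetch using PHP's cURL extension.
$ch = curl_init('http://www.example.com/page.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of echoing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ExampleSpider/1.0)');
curl_setopt($ch, CURLOPT_TIMEOUT, 30);            // don't hang forever on a slow page
$html = curl_exec($ch);

if ($html === false) {
    echo 'cURL error: ' . curl_error($ch) . "\n";
}
curl_close($ch);
?>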
wget is powerful, but sometimes the command-line tools are "too powerful" if you don't know exactly what you're doing. I have not personally used this script, but if you don't mind paying a few bucks for quite an advanced data-mining script, you may want to take a look at "Unit Miner" from qualityunit.com. Again, I haven't tested it myself, but I've bought a license for their affiliate script, and at least that one is a very powerful script. ...2 cents
There is a much easier way of pulling content off a website, not necessarily storing it in a database, but rather saving it as text... check out cURL - it's a command-line tool, works on pretty much any platform, and has great documentation. Sorry, can't post links just yet: http://curl.haxx.se/
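For a single page, the command can be as simple as this (the URL, output file name, and user-agent string are placeholders):

curl -A "Mozilla/5.0 (compatible; ExampleSpider/1.0)" -o page.html http://www.example.com/page.html

The -o flag writes the response to a file and -A sets the user-agent header.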
If you hit their servers hard, your IP will be blocked. I believe it is against their TOS to scrape their results, but I honestly do not know the specifics of what you can and cannot do.
DMOZ has a few solutions - you can download the entire directory as an .rdf file, or you can Google around for a few "live data" solutions that use web services. They do not let you scrape, however.
Note that a TOS on publicly accessible data is completely unenforceable. A TOS and other legal agreements have to be agreed to in order to have any weight. If they put up a page and say "if you scrape it, you violate the TOS," well, big whoop. Of course, I'm not suggesting that violating the TOS is good. There are ways that it can haunt you. They can talk to your ISP, and many ISPs will take a TOS violation seriously, regardless of how legally binding it is. Also, if you do something with the screen scrape that violates copyright or trademarks, then forget the TOS, because you've got a whole boatload of other problems. -Tony
Hi, I started a new project in which I have to fetch the dynamic content of two sites and save it in my database. Each item has a title, image, description, etc. Please help me in this regard. Thanks in advance; I am waiting for your reply. Jyotsna. Ch.
But but but... you're posting in a thread where the answers are already given. Didn't you see the mentions of wget -- including sample lines -- and cURL, and Perl's LWP module, and Unit Miner? What more do you need? I guess if you're just asking how to parse data once you've got it, then the ultimate tool is regular expressions. PHP regex here: http://us2.php.net/preg_match And Perl regex here: http://search.cpan.org/dist/perl/pod/perlre.pod -Tony
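For example, pulling a page title out of fetched HTML with preg_match could look something like this (the URL and pattern are just an illustration):

<?php
// Example: extract the <title> of a fetched page with preg_match.
$html = file_get_contents('http://www.example.com/');

if ($html !== false && preg_match('#<title>(.*?)</title>#is', $html, $matches)) {
    echo trim($matches[1]) . "\n";
}
?>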