Hi, I'm wondering what methods most of you use for spidering content / crawling a site... I'm currently using PHP with the file_get_contents function, doing a few string replaces, and saving the data I want to MySQL. I need to do this about 3,000 times to get all the data I need from the site. I've been doing it in batches of 100 URLs at a time, and each batch takes 1-2 minutes. I'm a little worried about hitting this site too many times... are there complications? Are there better ways of doing this in PHP? What about other languages? Also, are there any legal implications when doing this? Not necessarily about having someone else's content, but about spidering someone else's site a ton of times... PS: the content is copyrighted, but I have permission to use it since I'm an affiliate.
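For anyone in the same boat, a minimal sketch of that kind of fetch-and-store loop might look like the following (the URL list, regex, and table/column names are made-up placeholders, not the actual site):

<?php
// Minimal sketch: fetch each URL, pull out the piece we want, save it to MySQL.
// The URL list, regex, and table/column names below are placeholders.
$db = new mysqli('localhost', 'db_user', 'db_pass', 'scrape_db');

$urls = array(
    'http://www.example.com/page1.html',
    'http://www.example.com/page2.html',
    // ... the rest of the ~3,000 URLs
);

foreach ($urls as $url) {
    $html = file_get_contents($url);
    if ($html === false) {
        continue; // skip pages that fail to load
    }

    // Crude extraction: grab whatever sits between two known markers.
    if (preg_match('#<div class="content">(.*?)</div>#s', $html, $m)) {
        $content = $m[1];
        $stmt = $db->prepare('INSERT INTO pages (url, content) VALUES (?, ?)');
        $stmt->bind_param('ss', $url, $content);
        $stmt->execute();
        $stmt->close();
    }

    sleep(2); // pause between requests so you don't hammer their server
}
?>

The sleep() call is the important bit if you're worried about hitting the site too hard.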
There are always legal issues on the net... but unless you're impacting that site's services, I doubt it's much of a deal. When I grab a site I use wget, disguise the user agent, and use a rotating proxy setup like Tor and Privoxy. If you have permission, I would think you're golden.
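A typical invocation along those lines might be something like this (the user-agent string and the 127.0.0.1:8118 Privoxy port are just common defaults, so adjust them to your own setup):

wget -w 2 --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1)" -e use_proxy=yes -e http_proxy=127.0.0.1:8118 http://www.example.com/page.html

The -w 2 keeps a two-second pause between requests, and the -e options route the traffic through the local Privoxy/Tor proxy.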
It's really easy to use... it's all command-line based but totally scriptable. I also use cURL and other stuff, just depending on what I'm doing.
I do this quite a bit using PHP. Data harvesting is one of my passions. There are a lot of web sites that I have spidered - a lot - and I have never had my IP address blocked. Most of the web sites have copyright notices about the content, so I will leave that decision to you. The data I extract is used for my own purposes, and I have always hesitated to try to sell it. I have several large projects I work on, and the data helps me gain some advantages on creating content, pricing, etc. Unethical - perhaps, but illegal - I do not think so, as it's usually for research and the info is available to anyone visiting their web site. I do not hack into databases; I just view the data that is presented to any visitor.
Unlike Shoemoney, I don't think wget is easy to use. Easy to use is "press 1 or 2 buttons, get what you want." However, I've been using wget for a while now, and I've saved some wget command-line text that I've used, just so I don't have to relearn the whole thing each time. So here is what I've saved.

Grab every image from apinupsite.com, except thumbnails. If it asks for a login, give it. Verbose output, two-second pause between requests:

wget -r -k -nd --dot-style=binary --no-host-lookup -Q1000m -w2 -t5 -l0 -H -A.jpg,.jpeg,.png -R"th_*,tn_*" -D.apinupsite.com --http-user=username --http-passwd=password http://www.apinupsite.com/members/

Download a full Web site, including HTML and images, keeping the directory structure:

wget -r --progress=binary -w1 -Q1000m -v -t3 -nH -np -k -l10 -Dwww.yoursite.com http://www.yoursite/index.html

Try those two on your own site first, to see the difference. Be verrrry careful to keep a pause in there (the -w stuff). Without a pause, wget will clobber a low-end server, and the admin will surely have a chat with your ISP. -Tony
Putting together scripts with LWP::Simple is pretty easy as well if you need to do any parsing of the output. wget is by far the simplest way, though. If you must use PHP, use cURL. PHP spidering seems much slower by comparison, although I don't know why.
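If you do go the PHP + cURL route, a bare-bones fetch looks something like this (the URL and user-agent string are just placeholders):

<?php
// Bare-bones page fetch using PHP's cURL extension.
$ch = curl_init('http://www.example.com/page.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of echoing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ExampleSpider/1.0)');
curl_setopt($ch, CURLOPT_TIMEOUT, 30);            // don't hang forever on a slow page
$html = curl_exec($ch);

if ($html === false) {
    echo 'cURL error: ' . curl_error($ch) . "\n";
}
curl_close($ch);
?>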
wget is powerful, but sometimes the command-line tools are "too powerful" if you don't know exactly what you're doing. I have not personally used this script, but if you don't mind paying a few bucks for quite an advanced data-mining script, you may want to take a look at "Unit Miner" from qualityunit.com. Again, I haven't tested it myself, but I've bought a license for their affiliate script, and at least that one is a very powerful script. ...2 cents
There is a much easier way of pulling content off a website, not necessarily storing it in a database, but rather saving it as text... check out cURL - it's a command-line tool, works on pretty much any platform, and has great documentation. Sorry, can't post links just yet: http://curl.haxx.se/
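For a single page, the command can be as simple as this (the URL, output file name, and user-agent string are placeholders):

curl -A "Mozilla/5.0 (compatible; ExampleSpider/1.0)" -o page.html http://www.example.com/page.html

The -o flag writes the response to a file and -A sets the user-agent header.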
If you hit their servers hard, your IP will be blocked. I believe it is against their TOS to scrape their results, but I honestly do not know the specifics of what you can and cannot do.
DMOZ has a few solutions - you can download the entire directory as an .rdf file, or you can Google around for a few "live data" solutions that use web services. They do not let you scrape, however.
Note that a TOS on publicly accessible data is completely unenforceable. A TOS and other legal agreements have to be agreed to in order to have any weight. If they put up a page and say "if you scrape it, you violate the TOS," well, big whoop. Of course, I'm not suggesting that violating the TOS is good. There are ways that it can haunt you. They can talk to your ISP, and many ISPs will take a TOS violation seriously, regardless of how legally binding it is. Also, if you do something with the screen scrape that violates copyright or trademarks, then forget the TOS, because you've got a whole boatload of other problems. -Tony
Hi, I started a new project in which I have to fetch the dynamic content of two sites and save it in my database. Each item has a title, image, description, etc. Please help me in this regard. Thanks in advance; I am waiting for your reply. Jyotsna. Ch.
But but but... you're posting in a thread where the answers are already given. Didn't you see the mentions of wget -- including sample lines -- and cURL, and Perl's LWP module, and Unit Miner? What more do you need? I guess if you're just asking how to parse data once you've got it, then the ultimate tool is regular expressions. PHP regex here: http://us2.php.net/preg_match And Perl regex here: http://search.cpan.org/dist/perl/pod/perlre.pod -Tony
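For example, pulling a page title out of fetched HTML with preg_match could look something like this (the URL and pattern are just an illustration):

<?php
// Example: extract the <title> of a fetched page with preg_match.
$html = file_get_contents('http://www.example.com/');

if ($html !== false && preg_match('#<title>(.*?)</title>#is', $html, $matches)) {
    echo trim($matches[1]) . "\n";
}
?>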