It's generally illegal - both on copyright grounds and as unauthorized use (the site's terms spell out how you may access it). Sites that do allow content scraping usually provide an API, so a common rule of thumb is: no API, no copying.
Has anybody heard of this case? http://petewarden.typepad.com/searchbrowser/2010/04/how-i-got-sued-by-facebook.html The highlight is that the only legal way to access any website with a crawler was to obtain prior written permission. Unless you have deep pockets and want an attorney knocking on your door, you'd better not do it. - Rufas
I just finished the article... It's a disturbing situation, but not all that unexpected from Facebook. Social networking sites are different from regular internet sites, so additional care has to be taken. Facebook, in particular, can be messy because private users want to keep their profiles private. The best way to handle Facebook and similar sites is to set up a fan page and make it clear that the whole intent of that page is to crawl and analyze the profiles that like the page. Make sure you have a clearly defined statement of how personal information will be used, or whether it will be aggregated into a larger picture. Post updates routinely and be transparent about the whole process.
The best way is to take the middle man (Facebook) out of the equation. No one needs someone else to tell them what they "like". Diaspora (@joindiaspora) does just that, and it includes encryption for a level of security Facebook will probably never have. It is open source and peer to peer.
That is not the highlight - that was Facebook's position and he didn't challenge it in any way. They called him and threatened him and he agreed with them. This is not even a case and stands for nothing whatsoever.
@browntwn Yes, you are right. It is not even a case, as the lawyers hadn't submitted the documents to the court yet, so it was more like an initial negotiation. But that guy did say, "Anyone can sue anybody for anything at any time, anywhere." Legal or not, I'll let the court decide. But be aware of any trouble you might get into. - Rufas
That case did not deal with just scraping; he was scraping and then publishing the data. It is the publication that is the distinction in all of these cases. I've yet to see anything that indicates scraping itself is illegal.
I agree there, but if you have to interact with Facebook from a business standpoint, there is a safe way to do it. Unfortunately, Diaspora doesn't have the footprint Facebook does.
We are getting off-topic... True about the footprint, but do you believe that software will not eventually replace every "middle man"? I believe it will, and therefore every "third person" mode of connection is unnecessary. If it is unnecessary, it will fall by the wayside, because we already do not have the wireless resources to support everyone coming online. LINK Ideas are not breakable, and the "peer to peer" idea is superior to the "peer to middle man to peer" idea.
I do agree with you to some extent, though I don't necessarily see the middle man going away completely. Web directories are a good example of the middle man staying part of the equation even though most of the process is completely automated. I believe peer to peer won't gain a strong footprint compared to distributed processing: search engine spiders distributed across large numbers of computers for crawling and scraping will easily overshadow any peer-to-peer methods that may be employed. While peer-to-peer methods have their place, there are severe restrictions involved in web scraping/crawling, especially with the legalities around the content that gets gathered. I think peer-to-peer web scraping/crawling is a lawsuit just waiting to happen, because there is no inherent central authority to guarantee how the content will be managed, stored, and used.
Any distributed processing software (spider) created by whatever group of people will need to be approved, will it not? By whom? And it is still created by humans, I hope. If the distributed processing is not "approved" by the central authority you mentioned, then there is only one other choice: the choice made by that so-called "central authority". We are created "free and independent", according to President Kennedy. Our spiders and software and web will need to reflect that basic fact. Distributed computing sounds great and aligns with who we are. Here is the video: [video=youtube;V9uDlOA_bNA]https://www.youtube.com/watch?v=V9uDlOA_bNA[/video]
I would be interested to see that tested, e.g. "You may only access our site with a browser that uses Times Roman font". Seeing as most scrapers fake a browser anyway, I'd like to see how a site could even determine it. A quick sketch of what I mean is below.
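For illustration only (the URL and User-Agent string are just placeholders, not from any real case): a scraper can send whatever User-Agent header it likes, so from the server's side a scripted request looks much like an ordinary browser visit, which is why a "browsers only" rule would be hard to enforce.

[code]
# Minimal sketch: a script that presents itself as an ordinary desktop browser
# by sending a browser-like User-Agent header. The target URL is a placeholder.
import urllib.request

url = "https://example.com/"

request = urllib.request.Request(
    url,
    headers={
        # Any string can be sent here, which is why a server can't reliably
        # tell a scraper from a real browser based on the User-Agent alone.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    },
)

with urllib.request.urlopen(request) as response:
    html = response.read().decode("utf-8", errors="replace")

print(html[:200])  # first few hundred characters of the fetched page
[/code]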
That is where the problems really begin: when you have distributed crawling without a central authority (Google, for example), there is no way to control the use of the information collected. I am in no way suggesting a government mess, but rather something like what SETI has done. SETI acts as a central authority for hundreds of thousands of computers maintained by individuals. Even with SETI's distributed power, its footprint is small compared to what a true peer-to-peer setting would require; there would simply be too much data throttling on a large scale. The only way I could see this approach working in a peer-to-peer methodology would be as a clusterized crawler, whereby small clusters connecting to "command nodes" handle the crawling. The command node would have to connect to the central data storage, or be part of a distributed data store, but not peer to peer at this level. Something like the sketch below.
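To make the "command node" idea concrete, here is a minimal sketch under my own assumptions (placeholder URLs, and the "central data storage" is just an in-memory dict): a coordinator hands URLs to a small cluster of workers and collects everything in one central store, so the data never floats around without a responsible owner.

[code]
# Minimal sketch of the clusterized idea: a "command node" queue feeds a small
# cluster of worker crawlers, and results land in one central store instead of
# being passed peer to peer. URLs and storage are placeholders.
import queue
import threading
import urllib.request

url_queue = queue.Queue()      # the command node's work list
central_store = {}             # central data storage (in-memory for the sketch)
store_lock = threading.Lock()

def worker():
    """A cluster member: pull a URL from the command node, fetch it, report back."""
    while True:
        try:
            url = url_queue.get(timeout=1)
        except queue.Empty:
            return
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                body = response.read()
            with store_lock:
                central_store[url] = len(body)   # store something simple: page size
        except OSError:
            pass                                 # skip unreachable pages in this sketch
        finally:
            url_queue.task_done()

# The command node seeds the queue and starts a small cluster of workers.
for seed in ["https://example.com/", "https://example.org/"]:
    url_queue.put(seed)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(central_store)
[/code]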
The software would enhance that. Google has massive servers, but not security. A distributed system has security because if the login credentials are encrypted, or changed to a new form (my idea), a hacker has nothing to work with. A central server will always be more vulnerable. This is no doubt a large topic, but I am sure the throttling problem could be addressed successfully. I would have any node/computer capable of making authorizations, similar to the way Bitcoin is set up. I would also use machine learning to improve the system over time.
Technically, it's not illegal to "scrape" the content (it's like reading the Yellow Pages and writing down the numbers/businesses that interest you, for example). But it is illegal if you reuse the data (for example, by publishing it somewhere else) without the original owner's consent - and every bit published online is the copyright of its writer, unless they explicitly gave up that right on a site like Flickr or Wikipedia.
So, Google appears to be able to trawl through websites, reading the content and recording keywords that are then matched when people search for the same or similar words. Google hasn't really "re-published" the content, but it does hold at least some of the content, separate from the original website, on its own servers.

What happens if I have an application that needs to read through a large amount of information, somewhat like Google does, but instead of just looking at keywords it examines concepts, ideas, and opinions about subjects and then uses that information to make recommendations to people? Just like Google, you aren't really "re-publishing" the information; you are just using the content as part of the body of knowledge you build up from multiple sources to form a recommendation. For instance, if the application crawled this website and "absorbed" all the conversations in the various forums, it might use some of the convergent opinions expressed in the "Legal Issues" forum to inform a recommendation in response to a question posed to it.

This application would improve in its ability to make accurate recommendations/suggestions as it absorbed more information, so ideally it could crawl the web in general, just like Google does, accumulating more and more information from various sources. How does Google trawl the web without getting into copyright issues at every site it visits?
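Not a full answer to the copyright question, but on the access side, part of how crawlers like Google stay on good terms with site owners is the robots.txt convention: each site can declare what may be fetched, and well-behaved crawlers check it before visiting. A minimal sketch of that check with Python's standard library (the bot name and URLs are made up for illustration, and this says nothing about how the content may be reused afterwards):

[code]
# Illustrative only: consult a site's robots.txt before crawling, the convention
# that well-behaved crawlers follow. The bot name and URLs are hypothetical.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder site
rp.read()                                      # fetch and parse robots.txt

page = "https://example.com/forums/legal-issues"
if rp.can_fetch("RecommenderBot/0.1", page):
    print("robots.txt allows fetching", page)
else:
    print("robots.txt disallows fetching", page)
[/code]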