It's generally illegal - both on copyright grounds and as unauthorized use (the site's terms spell out how you may access it). Sites that do allow content scraping usually provide an API, so a common rule of thumb is: no API, no copying.
Has anybody heard of this case? http://petewarden.typepad.com/searchbrowser/2010/04/how-i-got-sued-by-facebook.html The highlight is that the only legal way to access any website with a crawler was to obtain prior written permission. Unless you have deep pockets and want an attorney knocking on your door, you'd better not do it. - Rufas
I just finished the article... It's a disturbing situation, but not all that unexpected from Facebook. Social networking sites are different from regular internet sites, so additional care has to be taken. Facebook, in particular, can be messy because private users want to keep their profiles private. The best way to handle Facebook and similar sites is to set up a fan page and make it clear that the whole intent of that page is to crawl and analyze the profiles that like the page. Make sure you have a clearly defined statement of how personal information will be used, or whether it will be aggregated into a larger picture. Post updates routinely and be transparent about the whole process.
The best way is to take the middle man (Facebook) out of the equation. No one needs someone else to tell them what they "like". Diaspora (@joindiaspora) does just that, and it includes encryption for a level of security Facebook will probably never have. It is open source and peer to peer.
That is not the highlight - that was Facebook's position and he didn't challenge it in any way. They called him and threatened him and he agreed with them. This is not even a case and stands for nothing whatsoever.
@browntwn Yes, you are right. It is not even a case, as the lawyers hadn't submitted the documents to the court yet, so it was more like an initial negotiation. But that guy did say, "Anyone can sue anybody for anything at any time, anywhere." Legal or not, I'll let the court decide. But be aware of any trouble you might get into. - Rufas
That case did not deal with just scraping; he was scraping and then publishing the data. It is the publication that is the distinction in all of these cases. I've yet to see anything that indicates scraping itself is illegal.
I agree there, but if you have to interact with Facebook from a business standpoint, there is a safe way to do it. Unfortunately, Diaspora doesn't have the footprint Facebook does.
We are getting off-topic... True about the footprint, but do you believe that software will not eventually replace every "middle man"? I believe it will, and therefore every "third person" mode of connection is unnecessary. If it is unnecessary, it will fall by the wayside, because we already do not have the wireless resources to support everyone coming online. LINK Ideas are not breakable, and the "peer to peer" idea is superior to the "peer to middle man to peer" idea.
I do agree with you to some extent, though I don't necessarily see the middle man going away completely. Web directories are a good example of the middle man staying part of the equation even though most of the process is completely automated. I believe peer to peer won't gain a strong footprint compared to distributed processing: search engine spiders distributed across large numbers of computers for crawling and scraping will easily overshadow any peer-to-peer methods that may be employed. While peer-to-peer methods have their place, there are severe restrictions involved in web scraping/crawling, especially with the legalities around the content that gets gathered. I think peer-to-peer web scraping/crawling is a lawsuit just waiting to happen, because there is no inherent central authority to guarantee how the content will be managed, stored, and used.
Any distributed processing software (spider) created by whatever group of people will need to be approved, will it not? By whom? And it is still created by humans, I hope. If the distributed processing is not "approved" by the central authority you mentioned, then there is only one other choice: the choice made by that so-called "central authority". We are created "free and independent", according to President Kennedy. Our spiders and software and web will need to reflect that basic fact. Distributed computing sounds great and aligns with who we are. Here is the video: [video=youtube;V9uDlOA_bNA]https://www.youtube.com/watch?v=V9uDlOA_bNA[/video]
I would be interested to see that tested, e.g. "You may only access our site with a browser that uses Times Roman font". Seeing as most scrapers fake a browser anyway, I'd like to see how a site could even determine it. A quick sketch of what I mean is below.
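For illustration only (the URL and User-Agent string are just placeholders, not from any real case): a scraper can send whatever User-Agent header it likes, so from the server's side a scripted request looks much like an ordinary browser visit, which is why a "browsers only" rule would be hard to enforce.

[code]
# Minimal sketch: a script that presents itself as an ordinary desktop browser
# by sending a browser-like User-Agent header. The target URL is a placeholder.
import urllib.request

url = "https://example.com/"

request = urllib.request.Request(
    url,
    headers={
        # Any string can be sent here, which is why a server can't reliably
        # tell a scraper from a real browser based on the User-Agent alone.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    },
)

with urllib.request.urlopen(request) as response:
    html = response.read().decode("utf-8", errors="replace")

print(html[:200])  # first few hundred characters of the fetched page
[/code]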
That is where the problems really begin: when you have distributed crawling without a central authority (Google, for example), there is no way to control the use of the information collected. I am in no way suggesting a government mess, but rather something like what SETI has done. SETI acts as a central authority for hundreds of thousands of computers maintained by individuals. Even with SETI's distributed power, its footprint is small compared to what a true peer-to-peer setting would require; there would simply be too much data throttling on a large scale. The only way I could see this approach working in a peer-to-peer methodology would be as a clusterized crawler, whereby small clusters connecting to "command nodes" handle the crawling. The command node would have to connect to the central data storage, or be part of a distributed data store, but not peer to peer at this level. Something like the sketch below.
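To make the "command node" idea concrete, here is a minimal sketch under my own assumptions (placeholder URLs, and the "central data storage" is just an in-memory dict): a coordinator hands URLs to a small cluster of workers and collects everything in one central store, so the data never floats around without a responsible owner.

[code]
# Minimal sketch of the clusterized idea: a "command node" queue feeds a small
# cluster of worker crawlers, and results land in one central store instead of
# being passed peer to peer. URLs and storage are placeholders.
import queue
import threading
import urllib.request

url_queue = queue.Queue()      # the command node's work list
central_store = {}             # central data storage (in-memory for the sketch)
store_lock = threading.Lock()

def worker():
    """A cluster member: pull a URL from the command node, fetch it, report back."""
    while True:
        try:
            url = url_queue.get(timeout=1)
        except queue.Empty:
            return
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                body = response.read()
            with store_lock:
                central_store[url] = len(body)   # store something simple: page size
        except OSError:
            pass                                 # skip unreachable pages in this sketch
        finally:
            url_queue.task_done()

# The command node seeds the queue and starts a small cluster of workers.
for seed in ["https://example.com/", "https://example.org/"]:
    url_queue.put(seed)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(central_store)
[/code]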
The software would enhance that. Google has massive servers, but not security. A distributed system has security because if the login credentials are encrypted, or changed to a new form (my idea), a hacker has nothing to work with. A central server will always be more vulnerable. This is no doubt a large topic, but I am sure the throttling problem could be addressed successfully. I would have any node/computer capable of making authorizations, similar to the way Bitcoin is set up. I would also use machine learning to improve the system over time.
Technically, it's not illegal to "scrape" the content (it's like reading the Yellow Pages and writing down the numbers/businesses that interest you, for example). But it is illegal if you reuse the data (for example, by publishing it somewhere else) without the original owner's consent - and every bit published online is the copyright of its writer, unless they explicitly gave up that right on a site like Flickr or Wikipedia.
So, Google appears to be able to trawl through websites, reading the content and recording keywords that are then matched when people search for the same or similar words. Google hasn't really "re-published" the content, but it does hold at least some of the content, separate from the original website, on its own servers.

What happens if I have an application that needs to read through a large amount of information, somewhat like Google does, but instead of just looking at keywords it examines concepts, ideas, and opinions about subjects and then uses that information to make recommendations to people? Just like Google, you aren't really "re-publishing" the information; you are just using the content as part of the body of knowledge you build up from multiple sources to form a recommendation. For instance, if the application crawled this website and "absorbed" all the conversations in the various forums, it might use some of the convergent opinions expressed in the "Legal Issues" forum to inform a recommendation in response to a question posed to it.

This application would improve in its ability to make accurate recommendations/suggestions as it absorbed more information, so ideally it could crawl the web in general, just like Google does, accumulating more and more information from various sources. How does Google trawl the web without getting into copyright issues at every site it visits?
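Not a full answer to the copyright question, but on the access side, part of how crawlers like Google stay on good terms with site owners is the robots.txt convention: each site can declare what may be fetched, and well-behaved crawlers check it before visiting. A minimal sketch of that check with Python's standard library (the bot name and URLs are made up for illustration, and this says nothing about how the content may be reused afterwards):

[code]
# Illustrative only: consult a site's robots.txt before crawling, the convention
# that well-behaved crawlers follow. The bot name and URLs are hypothetical.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder site
rp.read()                                      # fetch and parse robots.txt

page = "https://example.com/forums/legal-issues"
if rp.can_fetch("RecommenderBot/0.1", page):
    print("robots.txt allows fetching", page)
else:
    print("robots.txt disallows fetching", page)
[/code]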