Is web scraping legal?

Discussion in 'Legal Issues' started by idmindia, Aug 16, 2011.

  1. nasimkhan

    nasimkhan Peon

    Messages:
    17
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #21
If it is posted on the Internet, it doesn't belong to anybody.

Thanks.
     
    nasimkhan, Aug 20, 2011 IP
  2. Slincon

    Slincon Well-Known Member

    Messages:
    1,319
    Likes Received:
    44
    Best Answers:
    0
    Trophy Points:
    180
    #22
It's generally illegal - both as copyright infringement and as unauthorized use (the site's terms set out how you may access it).
Sites that do allow content scraping usually have an API - so the common rule of thumb is: no API, no copying.
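To put that rule of thumb into practice, here's a rough Python sketch (the API endpoint and site are invented placeholders): try the documented API first, and only fall back to fetching HTML if robots.txt permits it - and even then the content is still copyrighted.

[code]
# Hypothetical sketch: prefer an official API; only fetch HTML if
# robots.txt allows it. The API URL below is an invented example.
import json
import urllib.request
import urllib.robotparser

API_URL = "https://api.example.com/v1/articles"   # hypothetical endpoint
PAGE_URL = "https://www.example.com/articles"

def fetch_via_api():
    # An official API comes with documented terms of use.
    with urllib.request.urlopen(API_URL) as resp:
        return json.load(resp)

def may_fetch_html(url, user_agent="MyBot/1.0"):
    # robots.txt is advisory rather than a contract, but honoring it
    # is the conventional signal of good faith.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

try:
    data = fetch_via_api()            # "no API, no copying": API first
except Exception:
    if may_fetch_html(PAGE_URL):
        with urllib.request.urlopen(PAGE_URL) as resp:
            html = resp.read()        # still subject to copyright and ToS
[/code]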
     
    Slincon, Aug 21, 2011 IP
  3. Rufas

    Rufas Peon

    Messages:
    27
    Likes Received:
    1
    Best Answers:
    3
    Trophy Points:
    0
    #23
    Rufas, Aug 21, 2011 IP
  4. Matthias

    Matthias Member

    Messages:
    88
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    48
    #24
I just finished the article... It's a disturbing situation, but not all that unexpected from Facebook. Social networking sites are different from regular internet sites, so additional care has to be taken. Facebook, in particular, can be messy because of private users wanting to keep their profiles private.

The best way to handle Facebook and similar sites is to set up a fan page and make it clear that the whole intent of that page is to crawl and analyze the profiles that like it. Make sure you have a clearly defined statement of how personal information will be used, or whether it will only be aggregated into a larger picture. Post updates routinely and be transparent about the whole process.
     
    Matthias, Aug 21, 2011 IP
  5. Blue Star Ent.

    Blue Star Ent. Well-Known Member

    Messages:
    1,989
    Likes Received:
    31
    Best Answers:
    0
    Trophy Points:
    160
    #25

The best way is to take the middle man (Facebook) out of the equation. No one needs someone else to tell them what they "like". Diaspora (@joindiaspora) does just that, and includes encryption for a level of security Facebook will probably never have. It is open source and peer-to-peer.
     
    Blue Star Ent., Aug 21, 2011 IP
  6. browntwn

    browntwn Illustrious Member

    Messages:
    8,347
    Likes Received:
    848
    Best Answers:
    7
    Trophy Points:
    435
    #26
That is not the highlight - that was Facebook's position, and he didn't challenge it in any way. They called him and threatened him, and he agreed with them. This is not even a case, and it stands for nothing whatsoever.
     
    browntwn, Aug 22, 2011 IP
  7. Rufas

    Rufas Peon

    Messages:
    27
    Likes Received:
    1
    Best Answers:
    3
    Trophy Points:
    0
    #27
@browntwn Yes, you are right. It is not even a case, as the lawyer hasn't submitted the documents to the court yet. So it is more like an initial negotiation.

But anyway, that guy did say,

As they say, "Anyone can sue anybody for anything at any time, anywhere." Legal or not, I'll let the court decide. But be aware of any trouble you might get into.

    - Rufas
     
    Rufas, Aug 22, 2011 IP
  8. browntwn

    browntwn Illustrious Member

    Messages:
    8,347
    Likes Received:
    848
    Best Answers:
    7
    Trophy Points:
    435
    #28
That case did not deal with just scraping; he was scraping and then he published the data. It is the publication that is the distinction in all of these cases. I've yet to see anything indicating that scraping itself is illegal.
     
    browntwn, Aug 22, 2011 IP
  9. Matthias

    Matthias Member

    Messages:
    88
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    48
    #29
I agree there, but if you have to interact with Facebook from a business standpoint, there is a safe way to do it. Unfortunately, Diaspora doesn't have the footprint Facebook does.
     
    Matthias, Aug 22, 2011 IP
  10. Blue Star Ent.

    Blue Star Ent. Well-Known Member

    Messages:
    1,989
    Likes Received:
    31
    Best Answers:
    0
    Trophy Points:
    160
    #30

    We are getting off-topic...


True about the footprint, but do you believe that software will not replace every "middle man"? I believe it will, and therefore every "third person" mode of connection is unnecessary. If it is unnecessary, it will fall by the wayside, because we already lack the wireless resources to support everyone coming online. LINK


Ideas are not breakable, and the "peer-to-peer" idea is superior to the "peer to middle man to peer" idea. :)
     
    Blue Star Ent., Aug 23, 2011 IP
  11. Matthias

    Matthias Member

    Messages:
    88
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    48
    #31
I do agree with you to some extent, though I don't necessarily see the middle man going away completely. Web directories are a good example of the middle man being part of the equation even though most of the process is completely automated. I believe peer-to-peer won't gain a strong footprint compared to distributed processing: search engine spiders distributed across large numbers of computers for web crawling and scraping will easily overshadow any peer-to-peer methods that may be employed.

While peer-to-peer methods have their place, there are severe restrictions involved in web scraping/crawling, especially given the legalities around the content that gets scraped. I think peer-to-peer web scraping/crawling is a lawsuit waiting to happen, because there is no inherent central authority to guarantee how the content will be managed, stored, and used.
     
    Matthias, Aug 23, 2011 IP
    Blue Star Ent. likes this.
  12. Blue Star Ent.

    Blue Star Ent. Well-Known Member

    Messages:
    1,989
    Likes Received:
    31
    Best Answers:
    0
    Trophy Points:
    160
    #32

Any distributed processing software (spider) created by whatever group of people will need to be approved, will it not? By whom? And... it is still created by humans, I hope.

If the distributed processing is not "approved" by the central authority you mentioned, then there is only one other choice: the choice made by the so-called "central authority" itself. We are created "free and independent", according to President Kennedy. Our spiders and software and web will need to reflect that basic fact. Distributed computing sounds great and... aligns with who we are.


    Here is the video :

https://www.youtube.com/watch?v=V9uDlOA_bNA
     
    Blue Star Ent., Aug 24, 2011 IP
  13. contentboss

    contentboss Peon

    Messages:
    3,241
    Likes Received:
    54
    Best Answers:
    0
    Trophy Points:
    0
    #33
Yeah, right. Go straight to jail.
     
    contentboss, Aug 24, 2011 IP
  14. contentboss

    contentboss Peon

    Messages:
    3,241
    Likes Received:
    54
    Best Answers:
    0
    Trophy Points:
    0
    #34
I would be interested to see that tested.

E.g. "You may only access our site with a browser that uses Times Roman font."

Seeing as most scrapers fake a browser anyway, I'd like to see how a site could even detect a violation.
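"Faking a browser" is usually nothing more than one request header, which is why a clause like that is so hard to police server-side. A rough sketch (the URL is a placeholder):

[code]
# Sketch: a scraper presenting itself as a desktop browser.
# The server only sees what the client chooses to send, so a
# "browsers only" access clause is effectively unverifiable
# from the request alone.
import urllib.request

req = urllib.request.Request(
    "https://www.example.com/",                      # placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:6.0) "
                           "Gecko/20110814 Firefox/6.0"},
)
with urllib.request.urlopen(req) as resp:
    html = resp.read()   # header-wise, indistinguishable from a browser hit
[/code]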
     
    contentboss, Aug 24, 2011 IP
  15. Matthias

    Matthias Member

    Messages:
    88
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    48
    #35
That is where the problems really begin: when you have distributed crawling without a central authority (Google, for example), there is no way to control the use of the information collected. I am by no means suggesting a government mess, but rather something like what SETI has done. SETI acts as a central authority for hundreds of thousands of computers maintained by individuals. Even with SETI's distributed power, its footprint is comparatively small in the context of a peer-to-peer setting. There would simply be too much data throttling on a large scale.

The only way I could see this approach working in a peer-to-peer methodology would be as a clustered crawler, whereby small clusters connecting to "command nodes" handle the crawling. The command node would have to connect to the central data storage or be part of a distributed data store, but not peer-to-peer at this level.
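A toy model of that topology, with all names invented: worker peers pull URLs from a frontier owned by the command node and hand results back to it, so policy on how content is stored and used stays in one place even though the crawling itself is distributed.

[code]
# Toy sketch of the "command node" idea. Not a real crawler: the
# coordinator owns the URL frontier and the gathered data, so the
# workers never keep copies of the content they fetch.
import queue
import threading
import urllib.request

frontier = queue.Queue()   # URL frontier, held by the command node
results = {}               # central store; never lives on the peers
lock = threading.Lock()

def worker():
    while True:
        url = frontier.get()
        if url is None:    # shutdown signal from the command node
            return
        try:
            with urllib.request.urlopen(url) as resp:
                body = resp.read()
            with lock:
                results[url] = len(body)  # keep a summary, not a copy
        finally:
            frontier.task_done()

for url in ["https://www.example.com/", "https://www.example.org/"]:
    frontier.put(url)                     # placeholder seed URLs

workers = [threading.Thread(target=worker) for _ in range(4)]
for t in workers:
    t.start()
frontier.join()                           # wait for the frontier to drain
for t in workers:
    frontier.put(None)                    # one shutdown signal per worker
[/code]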
     
    Matthias, Aug 24, 2011 IP
  16. Blue Star Ent.

    Blue Star Ent. Well-Known Member

    Messages:
    1,989
    Likes Received:
    31
    Best Answers:
    0
    Trophy Points:
    160
    #36

The software would enhance that. Google has massive servers, but not security. A distributed system has security because if the login credentials are encrypted, or changed to a new form (my idea), a hacker has nothing to exploit. A central server will always be more vulnerable. This is no doubt a large topic, but I am sure the throttling problem could be addressed successfully.

I would have any node/computer capable of making authorizations, similar to the way Bitcoin is set up. I would also use machine learning to improve the system over time.
     
    Blue Star Ent., Aug 25, 2011 IP
  17. webhustla

    webhustla Peon

    Messages:
    80
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #37
Technically, it's not illegal to "scrape" content (it's like reading the yellow pages and writing down the numbers/businesses that interest you, for example). But it is illegal to reuse the data (for example, publish it somewhere else) without the original owner's consent - every bit published online is the copyright of the writer, unless they explicitly gave up that right on a website like Flickr or Wikipedia.
     
    webhustla, Aug 29, 2011 IP
  18. nasimkhan

    nasimkhan Peon

    Messages:
    17
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #38
This really forces me to say: excellent. I'm really impressed - this is an awesome post. Thanks to the poster.

     
    nasimkhan, Dec 4, 2011 IP
  19. Graham W

    Graham W Peon

    Messages:
    1
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    1
    #39
So, Google appears to be able to trawl through websites, reading the content and recording keywords that are then matched when people search for the same or similar words. Google hasn't really "re-published" the content, but it does hold at least some of it, separate from the original website, on its own servers.
What happens if I have an application that needs to read through a large amount of information, sort of like what Google does, but instead of just looking at keywords it examines concepts, ideas, and opinions about subjects and then uses that information to make recommendations to people? Just like Google, you aren't really "re-publishing" the information; you are just using the content as part of the body of knowledge you build up from multiple sources to form a recommendation.
For instance, if the application crawled this website and "absorbed" all the conversations in the various forums, it might use some of the convergent opinions expressed in the "Legal Issues" forum to add to a recommendation in response to a question posed to it.
This application would improve in its ability to make accurate recommendations/suggestions as it absorbed more information, so it would be ideal if it could crawl the web in general, just as Google does, accumulating more and more info from various sources.
How does Google trawl the web without getting into copyright issues at every site it visits?
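My thinking is that the application would keep only derived signals (keywords, concepts, opinion counts) rather than the text itself - in the spirit of indexing without republishing. A rough sketch, with a placeholder URL and deliberately crude parsing:

[code]
# Sketch of "index, don't republish": keep derived data (term counts)
# and discard the original text. Placeholder URL; crude tag stripping.
import re
import urllib.request
from collections import Counter

with urllib.request.urlopen("https://www.example.com/") as resp:
    html = resp.read().decode("utf-8", errors="replace")

text = re.sub(r"<[^>]+>", " ", html)           # strip tags, crudely
terms = Counter(w.lower() for w in re.findall(r"[A-Za-z]{3,}", text))

index = dict(terms.most_common(50))            # derived data is all we keep
del html, text                                 # the copy itself is discarded
[/code]

Whether holding even a transient copy is itself infringement is, of course, exactly the open question.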
     
    Graham W, Dec 10, 2013 IP