Web Scraping - Advice Please

Discussion in 'Programming' started by ExoPaul, Sep 21, 2021.

  1. #1
    Hi there,

    Please bear in mind that I am not a web developer.

    My developer is running a web scraper to pull specific content from websites. It works well, except that it only scrapes public-facing content. It cannot reach content behind a login, even when we are logged in to an account.

    Can anyone give some tips or advice, or point me in the direction of a web scraper that can be customised to scrape the content we want AND get past the login requirement?

    Creating an account with the relevant site first is not a problem. The question is how to then scrape the data we need while logged in. How is that done?

    Again, I am not a developer, so anything I can pass on to him, such as an example, a code snippet or a product to look at, would be very helpful.

    Thanks guys, and thank you for allowing me to post.
     
    ExoPaul, Sep 21, 2021 IP
  2. sarahk

    sarahk iTamer Staff
    Messages: 28,500 | Likes Received: 4,460 | Best Answers: 123 | Trophy Points: 665
    #2
    You want advice on how to steal intellectual property?
    You're at the WRONG forum, buddy.
     
    sarahk, Sep 21, 2021 IP
  3. sarahk
    #3
    Ok, so this user objected to my conclusion that he was stealing intellectual property and likened his scraping to a search engine's indexing.

    His developer will have used curl and any number of the open-source packages that exist, and have existed for decades.

    That the developer has encountered a site where they've made the login process so complex that a standard curl login won't work suggests that the site owner has gone to some effort to prevent scrapers. I presume the developer has checked to see if there's an API. When you've contacted the site owner have they offered a solution? After all, you're not doing anything wrong so contacting the site owner won't raise any problems.
     
    sarahk, Sep 21, 2021 IP
  4. MrAEL

    MrAEL Peon
    Messages: 2 | Likes Received: 0 | Best Answers: 0 | Trophy Points: 1
    #4
    Hi,
    To scrape links or resources that require authorization, you need access (an account, a token, etc.). When you sign in to the website, it sends your browser additional information in the form of cookies.
    You only need to export these cookies from your browser and add them to your scraper tool (as a Cookie header).
    Your scraper can then access the links as an authenticated user.
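
    The cookie approach described in this post can be sketched roughly as follows, in Python using only the standard library. This is a minimal illustration, not what the original poster's developer actually runs: the helper names `parse_cookie_string` and `build_authenticated_request`, the cookie names and values, the `User-Agent` string and the example.com URL are all hypothetical placeholders. It assumes the cookies were copied from the browser as a single `name=value; name=value` string for an account you own.

```python
from urllib.request import Request, urlopen


def parse_cookie_string(raw):
    """Turn 'a=1; b=2' (as copied from the browser) into {'a': '1', 'b': '2'}."""
    cookies = {}
    for part in raw.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


def build_authenticated_request(url, cookies):
    """Build a GET request that carries the exported cookies in a Cookie header."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    headers = {
        "Cookie": cookie_header,
        "User-Agent": "example-scraper/0.1",  # hypothetical identifier
    }
    return Request(url, headers=headers)


if __name__ == "__main__":
    # Hypothetical cookie values; real ones come from your own logged-in browser session.
    exported = "sessionid=abc123; csrftoken=xyz789"
    request = build_authenticated_request(
        "https://example.com/members/page",  # placeholder URL
        parse_cookie_string(exported),
    )
    print(request.get_header("Cookie"))
    # Uncomment to actually send the request:
    # with urlopen(request) as response:
    #     html = response.read().decode("utf-8", errors="replace")
```

    Session cookies usually expire, so a scraper built this way may need the cookies re-exported from time to time. Sites that tie the session to other signals (for example JavaScript challenges) may still reject the request.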
     
    MrAEL, Sep 29, 2021 IP