How to scrape a site that requires a login to display data?

Discussion in 'PHP' started by Ashish Revar, Jan 23, 2014.

  1. #1
    I want to download the plain text in an HTML table, as well as the zip files linked from the <a> tags.

    The problem is that the site requires a login before it provides the PDF download links. I am using an HTML DOM parser. The link to the site is: http://www.gtucampus.com/exam-papers-cmbcourses-DegreeEngineering-cmbbranchname-AERONAUTICAL%20ENGINEERING-cmbsemester-Semester-I-stream-1-cmbcollegesearch-Search.html

    Without a login, the href of each <a> contains only #.

    I really need to download all the papers. Please reply.
     
    Ashish Revar, Jan 23, 2014 IP
  2. PoPSiCLe

    PoPSiCLe Illustrious Member

    Messages:
    4,623
    Likes Received:
    725
    Best Answers:
    152
    Trophy Points:
    470
    #2
    Do you have a login available that works?
     
    PoPSiCLe, Jan 24, 2014 IP
  3. Ashish Revar

    Ashish Revar Greenhorn

    Messages:
    25
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    23
    #3
    Yes, I have the user ID and password.

    Using cURL I can get the page source as the response. Will it be possible to run a DOM parser over it? I need to copy some text from a table, and some zip files as well.

    Any suggestions?
     
    Ashish Revar, Jan 24, 2014 IP
  4. Rob Siecor

    Rob Siecor Peon

    Messages:
    1
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    1
    #4
    You can build the form in PHP, then build the POST string and use cURL to post it. That will log you in. Then you can use the simple_html_dom class to parse the page and grab any links you want.
     
    Rob Siecor, Jan 29, 2014 IP
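As a rough sketch of the parsing step described in post #4: the page source that cURL returns is just a string, and a DOM parser can be built from a string. The example below uses PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom (which, as far as I know, offers str_get_html() for the same purpose). The function name parse_papers_page and the sample markup are invented for illustration; the real page layout is not shown in this thread.

```php
<?php
// Sketch: parse an HTML string (for example the body returned by curl_exec)
// with PHP's built-in DOM extension. Names and sample markup are hypothetical.

function parse_papers_page(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Plain text of every table cell, trimmed.
    $cells = [];
    foreach ($xpath->query('//table//td') as $td) {
        $cells[] = trim($td->textContent);
    }

    // href of every <a> whose target ends in .zip (case-insensitive).
    $zipLinks = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        if (preg_match('/\.zip$/i', $href)) {
            $zipLinks[] = $href;
        }
    }

    return ['cells' => $cells, 'zipLinks' => $zipLinks];
}

// Invented sample: one real link and one "#" placeholder like the logged-out page shows.
$sample = '<table>'
        . '<tr><td>Maths-I</td><td><a href="/papers/maths1.zip">Download</a></td></tr>'
        . '<tr><td>Physics</td><td><a href="#">Download</a></td></tr>'
        . '</table>';

print_r(parse_papers_page($sample));
```

The same function can be fed the string from curl_exec() once CURLOPT_RETURNTRANSFER is set, so no separate HTTP request by the parser is needed.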
  5. Ashish Revar

    Ashish Revar Greenhorn

    #5
    @Rob
    If I use the cURL method, then it returns only the page source. That can't be used as an object in HTML DOM, so this method is not useful. I have already tried it and it doesn't help.

    Any other suggestions?
     
    Ashish Revar, Jan 29, 2014 IP
  6. stephan2307

    stephan2307 Well-Known Member

    Messages:
    1,277
    Likes Received:
    33
    Best Answers:
    7
    Trophy Points:
    150
    #6
    I have plenty of scraping experience, and cURL is the way to go. However, depending on how the website is written, this can be easy or difficult. I will send you a private message to discuss.
     
    stephan2307, Jan 30, 2014 IP
  7. gaurav_

    gaurav_ Member

    Messages:
    54
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    41
    #7
    There are not many files. If you just log in once and download them one by one, it should be done within a few minutes.

    If you want to write a PHP script, then use cURL to log in, since you say you can already get the source code of the page.

    Did you log in to the site successfully with your cURL code?

    If yes, then simply use a regex if you are good with it. Otherwise, loop through the whole page string, find each <a and </a> pair, get the URL, and use a function to download the file at that URL.
     
    gaurav_, Jan 30, 2014 IP
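The regex route suggested in post #7 might look roughly like the sketch below. The function names, the sample markup, and the $cookieFile path are all hypothetical, and a regex is fragile against unusual markup, so treat this as an illustration of the idea and not a robust parser. download_file() needs network access and a logged-in cookie file, so it is defined but not called here.

```php
<?php
// Sketch of the regex approach: pull href values out of the raw page string.

function extract_hrefs(string $html): array
{
    // Matches href="..." or href='...' inside an <a ...> tag, case-insensitively.
    preg_match_all('/<a\b[^>]*\bhref\s*=\s*(["\'])(.*?)\1/is', $html, $matches);

    $hrefs = [];
    foreach ($matches[2] as $href) {
        $href = html_entity_decode(trim($href));
        // Skip placeholder links such as "#", which the site shows when not logged in.
        if ($href !== '' && $href[0] !== '#') {
            $hrefs[] = $href;
        }
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($hrefs));
}

// Download one URL to a local file, sending cookies from a logged-in session.
function download_file(string $url, string $target, string $cookieFile): bool
{
    $fp = fopen($target, 'wb');
    if ($fp === false) {
        return false;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp);               // stream the body straight to disk
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // send the saved session cookies
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);    // follow redirects
    curl_setopt($ch, CURLOPT_FAILONERROR, true);       // treat HTTP status >= 400 as failure
    $ok = curl_exec($ch);
    curl_close($ch);
    fclose($fp);
    return $ok !== false;
}

// Invented sample markup.
$page = '<a href="#">locked</a> '
      . '<a class="dl" href=\'papers/sem1.zip\'>Sem 1</a> '
      . '<A HREF="papers/sem2.zip">Sem 2</A>';

print_r(extract_hrefs($page));
```

Relative links such as papers/sem1.zip would still have to be resolved against the page URL before they are passed to download_file().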
  8. Ashish Revar

    Ashish Revar Greenhorn

    #8
    @gaurav_ There are around 100 pages on which I want to run the script, because I want to download all the exam papers.

    The site uses a login to verify users before letting them download papers.
    Is there any solution for this?
     
    Ashish Revar, Jan 31, 2014 IP
  9. gaurav_

    gaurav_ Member

    #9
    Yes, there is a way to get it done. Do you have a login ID and password? If yes, then please share.
     
    gaurav_, Feb 1, 2014 IP
  10. Ashish Revar

    Ashish Revar Greenhorn

    #10
    @gaurav_ Can you please provide the code?
     
    Ashish Revar, Feb 1, 2014 IP
  11. gaurav_

    gaurav_ Member

    #11
    The code will be something similar to the one posted above in this thread.
    I am actually on my weekend. I have this code on my office computer, so I can only provide it on Monday.
     
    gaurav_, Feb 1, 2014 IP
  12. ThePHPMaster

    ThePHPMaster Well-Known Member

    Messages:
    737
    Likes Received:
    52
    Best Answers:
    33
    Trophy Points:
    150
    #12
    This will not work, since the username and password are not sent via HTTP authentication; the site uses its own login system.

    You will need to:

    1) Submit a login request, which will return specific cookies. This only needs to be done once to get the cookie values for the session; after that you can make as many requests as that user as you like.
    2) Make the page request (a separate request) with those cookies, so that the page is served in logged-in mode.

    Cookie options are:

    
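    <?php
    // Ignore session cookies left over from a previous session.
    curl_setopt($ch, CURLOPT_COOKIESESSION, true);
    // Load cookies from this file; this also turns on cURL's cookie handling.
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
    // Save cookies to this file when the handle is closed.
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);

    Where $cookie_file is the local path of a file that is empty to start with.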
     
    ThePHPMaster, Feb 2, 2014 IP
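Putting the two steps from post #12 together, a minimal sketch could look like the following. Everything specific here is an assumption: the function names, the login URL, and the form field names are placeholders, and a real login form may also expect hidden fields or tokens that have to be read from the login page first. The network call is wrapped in a function that is not invoked, so only the option-building part runs.

```php
<?php
// Sketch of the two-step flow: log in once, then request pages with the same cookies.
// URLs and form field names passed to these functions are placeholders.

// Build the cURL options for the login POST (pure function, no network access).
function login_options(string $loginUrl, array $fields, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // URL-encoded form body
        CURLOPT_COOKIESESSION  => true,        // ignore old session cookies
        CURLOPT_COOKIEFILE     => $cookieFile, // enable cookie handling, load saved cookies
        CURLOPT_COOKIEJAR      => $cookieFile, // saved when the handle is closed
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow the post-login redirect, if any
    ];
}

// Step 1: log in. Step 2: fetch each page on the same handle, which keeps the cookies.
function fetch_logged_in(string $loginUrl, array $fields, array $pageUrls, string $cookieFile): array
{
    $ch = curl_init();
    curl_setopt_array($ch, login_options($loginUrl, $fields, $cookieFile));
    if (curl_exec($ch) === false) {
        throw new RuntimeException('Login request failed: ' . curl_error($ch));
    }

    $pages = [];
    foreach ($pageUrls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HTTPGET, true); // switch back from POST to GET
        $pages[$url] = curl_exec($ch);           // page source as a string, or false on error
    }
    curl_close($ch);
    return $pages;
}

// Usage (placeholders throughout, not executed here):
// $pages = fetch_logged_in(
//     'https://example.com/login',
//     ['username' => 'me', 'password' => 'secret'],
//     ['https://example.com/papers?page=1'],
//     __DIR__ . '/cookies.txt'
// );
```

Reusing one handle for the login and the page requests means the cookies stay in memory between requests; the cookie file matters mainly if a later script run wants to reuse the session.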