I want to download the plain text in an HTML table as well as the zip files linked from <a> tags. The problem is that the site requires a login before it provides the links to download the PDFs. I am using an HTML DOM parser. The link to the site is: http://www.gtucampus.com/exam-papers-cmbcourses-DegreeEngineering-cmbbranchname-AERONAUTICAL%20ENGINEERING-cmbsemester-Semester-I-stream-1-cmbcollegesearch-Search.html Without a login, it shows only # in the href of each <a>. I really need to download all the papers. Please reply.
Yes, I do have the user ID and password. Using curl I can get the page source as the response. Will it be possible to run the DOM parser over it? I need to copy some text from a table and download some zip files as well. Any suggestions?
You can build the form in PHP, then build the POST string and use curl to post it. That will log you in. Then you can use the simple_html_dom class to parse the page and grab any links you want.
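To illustrate the parsing step: a page fetched with curl is just a string, and a string can still be loaded into a DOM. Below is a minimal sketch using PHP's built-in DOMDocument (simple_html_dom provides str_get_html() for the same purpose). The helper name extractTableRows and the sample markup are made up for illustration; the real page's table structure may need a more specific XPath query.

```php
<?php
// Hypothetical helper: turn an HTML string (e.g. a curl response body)
// into an array of rows, each row an array of cell texts.
function extractTableRows(string $html): array
{
    $dom = new DOMDocument();
    // Real-world markup is often malformed; collect warnings instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows = [];
    foreach ($xpath->query('//table//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td|./th', $tr) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Sample markup standing in for the string returned by curl_exec().
$html = '<table><tr><th>Subject</th><th>Paper</th></tr>'
      . '<tr><td>Maths-I</td><td><a href="papers/m1.zip">Download</a></td></tr></table>';
print_r(extractTableRows($html));
```

So the curl response does not have to be an object up front; it becomes one once it is loaded into the parser.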
@Rob If I use curl, then it only returns the page as a string. That can't be used as an object in the HTML DOM parser, so this method is not useful. I tried it already and it didn't help. Any other suggestions?
I have plenty of scraping experience, and curl is the way to go. However, depending on how the website is written, this can be easy or difficult. I will send you a private message to discuss.
There aren't that many files; if you just log in once and download them one by one, it should be done within a few minutes. If you want to write a PHP script, use curl to log in; as you say, you can get the source code of the page that way. Did you log in to the site successfully with your curl code? If yes, then simply use a regex if you are good with them, or loop through the whole page string, find each <a ...> and </a>, take the URL, and use a function to download the file available at that URL.
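A rough sketch of that idea, assuming the login has already stored its cookies in a local file. The function names, the regex, and the sample markup are illustrative only; the pattern matches hrefs ending in .zip or .pdf and will miss links that carry a query string after the extension.

```php
<?php
// Hypothetical helper: collect href values that point at .zip or .pdf files.
function findDownloadLinks(string $html): array
{
    preg_match_all(
        '/<a\s[^>]*href\s*=\s*["\']([^"\']+\.(?:zip|pdf))["\']/i',
        $html,
        $m
    );
    return array_values(array_unique($m[1]));
}

// Hypothetical helper: stream one URL to disk, sending the login cookies.
function downloadFile(string $url, string $dest, string $cookieFile): bool
{
    $fp = fopen($dest, 'wb');
    if ($fp === false) {
        return false;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp);               // write the body straight to $fp
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // cookies saved at login
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);       // treat HTTP >= 400 as failure
    $ok = curl_exec($ch) !== false;
    unset($ch);
    fclose($fp);
    return $ok;
}

// Sample markup standing in for a logged-in page.
$page = '<a href="#">locked</a> <a href="files/p1.zip">p1</a> '
      . "<a class=\"d\" href='files/p2.PDF'>p2</a>";
print_r(findDownloadLinks($page));
```

Each link found this way can be passed to downloadFile() in a loop; relative links such as files/p1.zip would first need to be resolved against the site's base URL.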
@gaurav_ There are around 100 pages that I want to run the script on, because I want to download all the exam papers. The site uses a login to verify users before allowing them to download papers. Is there any solution for this?
The code will be something similar to the one posted above in this thread. I'm actually on my weekend and the code is on my office computer, so I can only provide it on Monday.
This will not work, since the username and password are not sent via HTTP authentication; the site uses a login form. You will need to:
1) Submit a login request, which will return session cookies. This only needs to be done once to get the cookie values for the session; after that you can make as many requests as that user as you like.
2) Make the page request (a separate request) with those cookies, so the page is served in its logged-in state.
The cookie options are:
<?php
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
where $cookie_file is the local path of a file that starts out empty. Set CURLOPT_COOKIESESSION only on the login request: on later requests it makes curl ignore the session cookies it loads from the file.
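The two steps could be wired together roughly as follows. This is only a sketch: the form field names username and password, the helper names, and the $loginUrl / $pageUrl variables in the usage comment are assumptions, since the site's actual login form is not shown in this thread. CURLOPT_COOKIESESSION is left out here because every request in the sketch belongs to the same session.

```php
<?php
// Field names 'username' and 'password' are assumptions; copy the real
// names from the <input> elements of the site's login form.
function buildLoginBody(string $user, string $pass): string
{
    return http_build_query(['username' => $user, 'password' => $pass]);
}

// Send one request that reads and writes the shared cookie file.
// Passing $postBody makes it a POST; null makes it a plain GET.
// Returns the response body as a string, or false on failure.
function request(string $url, string $cookieFile, ?string $postBody = null)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // send stored cookies
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // save received cookies
    if ($postBody !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postBody);
    }
    $body = curl_exec($ch);
    unset($ch); // releasing the handle writes the cookies to the jar file
    return $body;
}

// Usage (not executed here; $loginUrl and $pageUrl are placeholders):
//   $cookieFile = tempnam(sys_get_temp_dir(), 'cookies');
//   request($loginUrl, $cookieFile, buildLoginBody('myUser', 'myPass')); // step 1
//   $html = request($pageUrl, $cookieFile);                              // step 2
```

Step 1 runs once; step 2 can then be repeated for each of the pages, reusing the same cookie file.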