PHP cURL (Scraping a website)

Discussion in 'PHP' started by kidatum, Mar 25, 2012.

  1. #1
    Hey guys,

    I need a little help scraping the No Frills website. My main problem is sending the headers or cookies that select a store. On your first visit, the site asks you to choose a province, a city, and a store; only then can you view that store's items and prices. I've tried several approaches with cURL, but I keep getting a "Received HTTP code 403 from proxy after CONNECT" error.

    Here is the link: http://www.nofrills.ca/LCLOnline/flyers_landing_page.jsp - you can select any province, city and store for testing.

    Please help me. Thank you in advance,

    - kidatum
     
    kidatum, Mar 25, 2012
  2. sarahk

    #2
    It should be just a matter of setting up the fields that need to be submitted and posting the form.

    You may need to outline what exactly you have tried.
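
    As a rough illustration of "setting up the fields and posting the form", here is a minimal sketch. The field names (province, city, store) and the form URL in the usage note are made up for the example; the real ones have to be read from the form's HTML or from the browser's network tab.

```php
<?php
// Build the cURL options for a form POST. Nothing is sent here, so the
// options can be inspected before any request is made.
function buildFormPostOptions(string $url, array $fields, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,
        // Encode as application/x-www-form-urlencoded, like a plain HTML form.
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow the redirect many forms issue
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved here when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are loaded from here for this request
    ];
}

// Send the POST and return the response body, or throw on a transport error.
function postForm(string $url, array $fields, string $cookieFile): string
{
    $curl = curl_init();
    curl_setopt_array($curl, buildFormPostOptions($url, $fields, $cookieFile));
    $body = curl_exec($curl);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($curl));
    }
    return $body;
}
```

    A call might then look like postForm('http://example.com/select-store', ['province' => 'ON', 'city' => 'Toronto', 'store' => '123'], '/tmp/cookies.txt'), with every value there being a placeholder.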
     
    sarahk, Mar 26, 2012
  3. kidatum

    #3
    The site accepts POST variables and processes them with JavaScript, which makes this difficult. Either way, I found a solution after 4 hours of research :)

    Thanks for looking at the topic though,


    - kidatum
     
    kidatum, Mar 26, 2012
  4. Alex Roxon

    #4
    If you ever run into these issues again, the key is to mimic, as closely as you can, how a popular browser would access the website. You may have to consider cookies, user agents, POST/GET data, encoding, etc. If you do all of that properly, there's no real way for a website to tell you apart from a normal user (until you start hitting the server with a million requests, heh).
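
    To make the list above concrete, here is one possible set of cURL options covering cookies, user agent, and encoding. It is only a sketch: the user-agent string and Accept headers are sample values, not anything a particular site requires, and the cookie file path is up to the caller.

```php
<?php
// Options that make a request look closer to what a desktop browser sends.
function browserLikeOptions(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects as a browser would
        // Sample user agent; any current browser string would do.
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0',
        // Empty string: advertise every encoding this cURL build supports
        // and decode the response automatically.
        CURLOPT_ENCODING       => '',
        CURLOPT_COOKIEJAR      => $cookieFile, // save cookies when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieFile, // send previously saved cookies
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-CA,en;q=0.8',
        ],
    ];
}

// Fetch a page with those options; returns the body, or false on failure.
function fetchLikeBrowser(string $url, string $cookieFile)
{
    $curl = curl_init();
    curl_setopt_array($curl, browserLikeOptions($url, $cookieFile));
    return curl_exec($curl);
}
```

    Keeping the options in a plain array makes it easy to start with a few of them and add more only when the site turns out to need them.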
     
    Alex Roxon, Mar 26, 2012
  5. kidatum

    #5
    Good point, thanks. I start with as few requirements as possible and then add more as needed.
     
    kidatum, Mar 26, 2012
  6. ROOFIS

    #6
    To add to Alex's post: try to use a referer that matches the site's URI structure, e.g.:

    $url = "http://example.com/scrape-this-page";
    $ref = "http://example.com";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    // ... code ...
    curl_setopt($curl, CURLOPT_REFERER, $ref);
    // ... more code etc. ...

    This way it appears in their server logs that you've navigated from one link to another (presumably from the index to the page of interest), just like a browser would. ;)
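
    The same idea can be extended to a whole sequence of pages, where each request names the previously visited page as its referer. The sketch below is just one way to do it; the helper names refererChain and fetchInOrder are invented for this example.

```php
<?php
// Pair each URL with the one visited just before it (null for the first).
function refererChain(array $urls): array
{
    $pairs = [];
    $previous = null;
    foreach ($urls as $url) {
        $pairs[] = ['url' => $url, 'referer' => $previous];
        $previous = $url;
    }
    return $pairs;
}

// Fetch the URLs in order on a single handle so cookies carry over.
// Returns an array of url => body (false where a request failed).
function fetchInOrder(array $urls): array
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    // Empty string: read no cookie file, but enable in-memory cookie handling.
    curl_setopt($curl, CURLOPT_COOKIEFILE, '');

    $bodies = [];
    foreach (refererChain($urls) as $step) {
        curl_setopt($curl, CURLOPT_URL, $step['url']);
        if ($step['referer'] !== null) {
            curl_setopt($curl, CURLOPT_REFERER, $step['referer']);
        }
        $bodies[$step['url']] = curl_exec($curl);
    }
    return $bodies;
}
```

    Calling fetchInOrder(['http://example.com', 'http://example.com/scrape-this-page']) would request the index first and then the page of interest with the index as its referer.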




    ROOFIS
     
    ROOFIS, Mar 29, 2012