Error loading HTML in variable from remote site (scraping)

Discussion in 'PHP' started by pmf123, Feb 12, 2016.

  1. #1
    I am trying to scrape some content from a website, and it keeps returning errors, but if I change the URL it works OK on other sites.

    I have tried pretty much every method listed here:
    http://blog.oscarliang.net/six-ways-retrieving-webpage-content-php/

    The curl method is the only one that gets any result, and this is what you get:
    Object moved to here.

    Is there some method they may be using to prevent accessing the source via this method?

    I can send URL in PM if you have any suggestions.
     
    pmf123, Feb 12, 2016 IP
  2. PoPSiCLe

    PoPSiCLe Illustrious Member

    Messages:
    4,623
    Likes Received:
    725
    Best Answers:
    152
    Trophy Points:
    470
    #2
    Send the url AND the current, almost working code, and I can have a look.
     
    PoPSiCLe, Feb 12, 2016 IP
  3. deathshadow

    deathshadow Acclaimed Member

    Messages:
    9,732
    Likes Received:
    1,998
    Best Answers:
    253
    Trophy Points:
    515
    #3
    Without knowing the URL you are trying to parse it's hard to say, but it could simply be the site you are trying to scrape is such a poorly written steaming pile of crap, it can't be processed well if at all by a normal processor.

    You've pointed at methods for LOADING the page content to a text file, but how are you actually trying to PROCESS it? DOMDocument? regex?
     
    deathshadow, Feb 14, 2016 IP
  4. deathshadow

    #4
    Alright, he sent me the URI via PM, and this is the same issue someone else ran into last week.

    See the thread "preg_replace: How to remove it".

    The problem is that the URL is a redirect, and most of the techniques he linked to can't handle that. cURL can, but there are two ways of doing it and you have to code around the issue. The code presented in that thread SHOULD get you most if not all of the way to a solution: tell cURL to follow redirects, and if that fails, trap the HTTP response codes yourself.
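
    To give a rough idea of the "follow redirects, then check the response code" approach, here is a minimal sketch. It is NOT the code from that thread; fetchPage(), summarizeTransfer() and the user agent string are made-up names for illustration. Sending cookies and a user agent is only a guess at what a picky site might want, not something confirmed for the URL in question.

```php
<?php
// Hypothetical helper: turn the raw cURL outcome into a simple result array.
// $body is whatever curl_exec() returned (a string, or false on failure).
function summarizeTransfer($body, int $status, string $error): array
{
    // Only a 2xx status with a real body counts as success.
    $ok = $body !== false && $status >= 200 && $status < 300;
    return [
        'ok'     => $ok,
        'status' => $status,
        'error'  => $error,
        'body'   => $ok ? $body : null,
    ];
}

// Hypothetical helper: fetch a page, letting cURL follow redirects itself.
function fetchPage(string $url, int $maxRedirects = 5): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,          // return the body instead of echoing it
        CURLOPT_FOLLOWLOCATION => true,          // follow Location: headers
        CURLOPT_MAXREDIRS      => $maxRedirects, // stop after this many hops
        CURLOPT_TIMEOUT        => 15,
        // Assumption: some sites misbehave without cookies or a user agent.
        CURLOPT_COOKIEFILE     => '',            // empty string enables in-memory cookies
        CURLOPT_USERAGENT      => 'ExampleScraper/0.1',
    ]);

    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return summarizeTransfer($body, $status, curl_error($ch));
}
```

    Usage would be something like `$r = fetchPage($url);` followed by a check of `$r['ok']`. If `$r['status']` still comes back as a 3xx, the redirect was not followed and has to be handled by hand.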

    Note the method presented in that thread ONLY works if you have PHP 5.4 or newer. Older versions do not provide one of the cURL response variables -- you could code around that with a regex, but to be frank that's NOT the correct answer. The correct answer is to update to a version of PHP from THIS century.
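
    As a sketch of the second route (not following automatically, and asking cURL where the redirect points), something along these lines may help. I am assuming the "response variable" meant here is CURLINFO_REDIRECT_URL; that is a guess, since the thread's code isn't quoted. redirectTarget() and locationFromHeaders() are hypothetical names, and the header regex is only the fallback workaround, not the preferred path. The type hints need PHP 7.1 or newer.

```php
<?php
// Hypothetical fallback: pull the Location value out of a raw header block.
// Note the value may be a relative URL; it is returned exactly as sent.
function locationFromHeaders(string $rawHeaders): ?string
{
    if (preg_match('/^Location:\s*(\S+)/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: request $url once, without following redirects,
// and report where the server wants to send us (or null if nowhere).
function redirectTarget(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the response as a string
        CURLOPT_FOLLOWLOCATION => false, // do NOT follow; just inspect the reply
        CURLOPT_HEADER         => true,  // include response headers in the output
        CURLOPT_TIMEOUT        => 15,
    ]);

    $response = curl_exec($ch);
    if ($response === false) {
        return null; // transfer failed; see curl_error($ch) for details
    }

    // Preferred: let cURL report the redirect target itself.
    $target = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
    if (is_string($target) && $target !== '') {
        return $target;
    }

    // Fallback: parse the header block ourselves.
    $headerSize = (int) curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    return locationFromHeaders(substr($response, 0, $headerSize));
}
```

    Calling redirectTarget() in a loop, with a cap on the number of hops, lets you log each redirect and see exactly where the chain ends up.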

    That some people are too lazy or cheap to update their codebases disgusts me no end... particularly since it seems 99.999% of my code of the past decade runs unmodified in PHP 7 because I *SHOCK* paid attention to the things they told us to stop doing. That web hosts mollycoddle people by keeping decade-out-of-date PHP versions in circulation only exacerbates the situation.
     
    deathshadow, Feb 15, 2016 IP
  5. PoPSiCLe

    #5
    I tested with cURL's follow-redirect option and it doesn't work (well, at least it didn't when I tried, but I'm not very well versed in cURL, so there may be ways to make it work). As far as I can see, the redirect is done in code, not in .htaccess or similar (or was the code ASP? I don't remember). Anyway, the plain follow-redirect cURL option doesn't work.
     
    PoPSiCLe, Feb 15, 2016 IP
  6. pmf123

    pmf123 Notable Member

    Messages:
    1,447
    Likes Received:
    75
    Best Answers:
    0
    Trophy Points:
    215
    #6
    When I use cURL, I get a redirect to an ASP error message. How do you get the correct redirect URL?
     
    pmf123, Feb 15, 2016 IP