Curl Multi max download size

Discussion in 'PHP' started by EricBruggema, Jun 28, 2013.

  1. #1
    Hi there,

    I've been working on a little script that crawls the web, but I can't find a way to add a download size limit to it.

    My script
    
    $master = curl_multi_init();
    $curl_arr = array();
     
    // add additional curl options here
    $std_options = array(CURLOPT_RETURNTRANSFER => true,
                         CURLOPT_FOLLOWLOCATION => true);
    $options = ($custom_options) ? ($std_options + $custom_options) : $std_options;
     
    // start the first batch of requests
    foreach ($urls AS $uId => $url)
    {
        $ch = curl_init();
        $options[CURLOPT_URL] = $url['url'];
        curl_setopt_array($ch, $options);
        curl_multi_add_handle($master, $ch);
        
        // store the handle so we can look up the related data later
        $handles[$ch] = $uId;
    }
     
    do 
    {
        while(($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
     
        if($execrun != CURLM_OK)
            break;
     
        // a request was just completed -- find out which one
        while ($done = curl_multi_info_read($master)) 
        {   
            $info = curl_getinfo($done['handle']);
            $curHandle = $handles[$done['handle']];
            
            $urls[$curHandle]['code'] = $info['http_code'];
            
            switch ($info['http_code'])
            {
                case 200:
                    $output = curl_multi_getcontent($done['handle']);
                break;
                
                case 301:
                case 302:
                break;
                
                case 404:
                break;
                
                default:
                    $urls[$curHandle]['errno'] = curl_errno($done['handle']);
                    $urls[$curHandle]['error'] = curl_error($done['handle']);
                break;
            }
            
            // remove the curl handle that just completed
            curl_multi_remove_handle($master, $done['handle']);
        }
    } 
    while ($running);
     
    curl_multi_close($master);
    print_r($urls);
    
    I've found a piece of PHP code that would do the job, but I don't know how to integrate it so that it works as expected.

    URL: http://www.phpkode.com/source/s/multicurl-class-library/multicurl-class-library/MultiCurl.class.php
    Line: 136
    Code:
    
    if (!$active || $mrc != CURLM_OK || curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD) >= $this->maxSize) {
                $this->closeSession($i);
            }
    
    I'm missing something, but I can't work out how to add this (I've been looking for over two days now...). Can anyone help me here?
     
    EricBruggema, Jun 28, 2013
  2. edduvs
    #2
    As I explained to someone else who had the same problem on the SO community, there is no way to do this with PHP's built-in curl functions without making a separate request to the web server hosting the file. How about calling file_get_contents on the URL you're currently looping over and just checking its length?

    You could potentially make a request with curl_setopt($ch, CURLOPT_NOBODY, true); and read the Content-Length header, and then make a second request to download only if Content-Length is smaller than your max.
    This wouldn't be foolproof anyway, since a server can omit the Content-Length header or report a wrong value.
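
    A minimal sketch of the decision step in that two-request approach. It relies on curl reporting -1 for CURLINFO_CONTENT_LENGTH_DOWNLOAD when the server sent no Content-Length header; the function name shouldDownload and the $allowUnknown switch are assumptions made up for this illustration.

```php
<?php
// Hypothetical helper: decide from the HEAD response whether to fetch the body.
// $contentLength is the value of CURLINFO_CONTENT_LENGTH_DOWNLOAD, which curl
// reports as -1 when the length is unknown.
function shouldDownload($contentLength, $maxBytes, $allowUnknown = false)
{
    if ($contentLength < 0) {
        // No Content-Length header: the caller chooses whether to risk the download.
        return $allowUnknown;
    }
    return $contentLength <= $maxBytes;
}
```

    With $allowUnknown left at false, URLs that do not announce their size are simply skipped, which is the safe but lossy choice.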
     
    edduvs, Jun 28, 2013
  3. sorindsd
    #3
    Replace line 26 (the if($execrun != CURLM_OK) check) with if($execrun != CURLM_OK || curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD) >= size_to_replace), where size_to_replace is your limit in bytes.
     
    sorindsd, Jun 28, 2013
  4. EricBruggema
    #4
    That's one at a time, while curl handles multiple connections.

    I see, but it should be possible IMHO. Hopefully it is, somehow...

    Thanks, but it doesn't work; I need to close the connection to stop the data flow. When a handle has reached the limit, that handle should be ended and the content received so far should still be available.
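
    One way to get that behaviour, offered as a hedged sketch rather than a tested crawler component: give each handle its own CURLOPT_WRITEFUNCTION callback that collects the body and, once the limit is reached, returns a byte count different from the chunk length. libcurl treats that as a write error and aborts only that transfer, while the other handles keep running. The helper name makeLimitedWriter, the $bodies array and the 1 MB limit are assumptions for illustration; closures need PHP 5.3 or later.

```php
<?php
// Hypothetical helper: returns a CURLOPT_WRITEFUNCTION callback that appends
// incoming data to $buffer and aborts the transfer once $maxBytes is reached.
function makeLimitedWriter(&$buffer, $maxBytes)
{
    $buffer = '';
    return function ($ch, $chunk) use (&$buffer, $maxBytes) {
        $room = $maxBytes - strlen($buffer);
        $buffer .= substr($chunk, 0, $room);   // keep only what still fits
        if (strlen($buffer) >= $maxBytes) {
            return -1;                         // anything != strlen($chunk) aborts
        }
        return strlen($chunk);                 // tell libcurl the chunk was consumed
    };
}

// Sketch of the wiring inside the foreach loop from the first post:
//   curl_setopt_array($ch, $options);
//   curl_setopt($ch, CURLOPT_WRITEFUNCTION, makeLimitedWriter($bodies[$uId], 1048576));
```

    Because the callback takes over from CURLOPT_RETURNTRANSFER, the (possibly truncated) body is read from $bodies[$uId] rather than from curl_multi_getcontent(). The callback should be set after curl_setopt_array(), since whichever of the two options is applied last decides where the data goes.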
     
    Last edited: Jun 28, 2013
    EricBruggema, Jun 28, 2013
  5. ThePHPMaster
    #5
    Do you want to do a limit per file or overall (cumulative)?
     
    ThePHPMaster, Jun 28, 2013
  6. EricBruggema
    #6
    Per file (connection)
     
    EricBruggema, Jun 28, 2013
  7. ThePHPMaster
    #7
    It might just be me misreading your request, but your code already calls curl_getinfo. Is there a reason why you can't just check the size as well?

    
    $info = curl_getinfo($done['handle']);
    $size = curl_getinfo($done['handle'], CURLINFO_CONTENT_LENGTH_DOWNLOAD);
    
     
    ThePHPMaster, Jun 29, 2013
  8. ThePHPMaster
    #8
    In case you are wondering whether you can get the size before you make the request with curl, that is not possible. curl_getinfo only works after exec.

    Like edduvs said, you can take a different route: in your first foreach loop over $urls, make a HEAD request to get the content length. This will be lighter and faster:

    
    foreach ($urls AS $uId => $url)
    {
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url['url']);
        curl_setopt($curl, CURLOPT_FILETIME, true);
        curl_setopt($curl, CURLOPT_NOBODY, true);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_exec($curl);
        $size = curl_getinfo($curl, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
        curl_close($curl);
        if ($size <= X) {
             // your old code
        }
    }
    
    Additionally, I noticed an issue with your existing code. The way you are setting $options will raise a notice when $custom_options is not defined; it should be isset($custom_options) ? ... instead.
     
    ThePHPMaster, Jun 29, 2013
  9. EricBruggema
    #9
    Thanks, but the documents I'm trying to load don't send a Content-Length header, so that's not an option. Too bad... so I see that this isn't available in PHP (yet :p)

    And about the notice, I agree; this was just a quick example... :D
     
    EricBruggema, Jul 3, 2013
  10. ThePHPMaster
    #10
    How exactly would you get the size of a document in any programming or scripting language if you don't have the content length headers?
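
    For what it's worth, the size does not have to be known up front: libcurl counts the bytes as they arrive, and a progress callback can abort the transfer once a threshold is passed. A rough sketch, assuming PHP 5.5 or later (where the callback receives the handle as its first argument); makeSizeGuard and the 1 MB limit are made-up names and values. Progress callbacks only fire periodically, so a transfer may overshoot the limit somewhat before it is stopped.

```php
<?php
// Hypothetical helper: builds a CURLOPT_PROGRESSFUNCTION callback that aborts
// the transfer once more than $maxBytes have been downloaded.
function makeSizeGuard($maxBytes)
{
    return function ($ch, $downloadTotal, $downloadedNow, $uploadTotal, $uploadedNow) use ($maxBytes) {
        // A non-zero return value tells libcurl to abort this transfer.
        return ($downloadedNow > $maxBytes) ? 1 : 0;
    };
}

// Sketch of the wiring, per handle:
//   curl_setopt($ch, CURLOPT_NOPROGRESS, false);  // progress callbacks are off by default
//   curl_setopt($ch, CURLOPT_PROGRESSFUNCTION, makeSizeGuard(1048576));
```

    Whether the partial body is still retrievable after such an abort is worth verifying on the PHP version in use before relying on it.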
     
    ThePHPMaster, Jul 4, 2013