CFHTTP / CFImage Retrieving Photos From URL

Discussion in 'Programming' started by twalters84, Dec 18, 2009.

  1. #1
    Greetings,

    I have a rather interesting programming problem right now. I recently joined the Google Affiliate Network (GAN) and got accepted into some major affiliate programs such as KMart, Target, and Sears. These businesses provide product feeds, which I downloaded and wrote a script to input everything into my database. One of the fields is a product image URL.

    Using the product image URL, I had a script that retrieved the image, stored it on my local server, and then performed image manipulations such as resizing photos. Here is a quick example of this code:

    
    
    <cfset myPath = '#ExpandPath("..\img\products\2\")#'>
    
    <cfquery datasource="#dsnName#" name="PHOTO_LIST" maxrows="1">
    SELECT PRODUCT.PRODUCT_GOOGLE_IMAGE_URL, PRODUCT.PRODUCT_ID, BUSINESS.BUSINESS_NAME, PRODUCT.PRODUCT_NAME, BUSINESS.BUSINESS_URL
    FROM PRODUCT, BUSINESS
    WHERE PRODUCT.BUSINESS_ID = BUSINESS.BUSINESS_ID 
    AND PRODUCT.PRODUCT_PHOTO IS NULL
    AND PRODUCT.PRODUCT_GOOGLE_IMAGE_URL IS NOT NULL
    </cfquery>
    
    <cfif #PHOTO_LIST.RecordCount# NEQ 0>
    
      <cfloop index="i" from="1" to="#PHOTO_LIST.RecordCount#">
      
        <cfset businessURL = '#PHOTO_LIST.BUSINESS_URL[i]#'>
        <cfset photoURL = '#PHOTO_LIST.PRODUCT_GOOGLE_IMAGE_URL[i]#'>
        <cfset productID = '#PHOTO_LIST.PRODUCT_ID[i]#'>
        <cfset bizName = '#PHOTO_LIST.BUSINESS_NAME[i]#'>
        <cfset productName = '#PHOTO_LIST.PRODUCT_NAME[i]#'>
      
        <cfhttp method="get" url="#photoURL#" useragent="#CGI.http_user_agent#" getasbinary="no" result="objGET">    
         <cfhttpparam type="HEADER" name="referer" value="#businessURL#" />
        </cfhttp> 
    
        <cfif FindNoCase("200",objGET.StatusCode)>
          
          <cfset acceptChars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'>      
          <cfset photoNameTemp = '#productName#'>
          <cfset photoName = ''>
            
          <cfset photoNameTemp = #Replace(photoNameTemp, "&quot;", "", "ALL")#>
    
          <!--- Use j here so the outer loop's index i is not overwritten. --->
          <cfloop index="j" from="1" to="#Len(photoNameTemp)#" step="1">
             
            <cfset strChar = Mid(photoNameTemp,j,1)>
              
            <cfif #Find(strChar,acceptChars)#>          
              <cfset photoName = '#photoName##strChar#'>
            </cfif>
              
          </cfloop>
    
          <cfif #Len(photoName)# GT 45>
            <cfset photoName = '#Left(photoName,45)#'>
          </cfif>
                    
          <cfset loopIndex = 1>
          <cfset nameAccepted = false>
          <cfset photoNameTemp = '#photoName#'>
            
          <cfloop condition="nameAccepted eq false">
            
            <cfset photoNameTemp = '#photoName##loopIndex#.jpg'>
            <cfset myFile = '#myPath##photoNameTemp#'>
            
            <cfif NOT FileExists(myFile)>
              <cfset nameAccepted = true>
              <cfset photoName = '#photoNameTemp#'>
            <cfelse>
              <cfset loopIndex = #loopIndex#+1>
            </cfif>
            
          </cfloop>
     
          <cffile action="write" file="#myPath##photoName#" output="#objGET.FileContent#"/>
    
          <!--- IMAGE MANIPULATIONS AND DATABASE UPDATE CODE HERE --->
                
        </cfif>
           
      </cfloop>
    
    </cfif>
    
    
    The code above is slightly modified from the following URL:

    http://www.bennadel.com/blog/903-Passing-Referer-AS-ColdFusion-CFHttp-CGI-Value-vs-HEADER-Value-.htm
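
    As an aside, the character-by-character filtering loop in the script above could probably be collapsed into a regular expression. This is only a sketch: the sample product name is made up, and it assumes the goal is still "letters only, at most 45 characters".

    ```cfml
    <!--- Hypothetical sample input; in the script this would be PHOTO_LIST.PRODUCT_NAME[i]. --->
    <cfset productName = "Kenmore &quot;Elite&quot; 4.0 cu. ft. Washer">

    <!--- Drop the literal &quot; entity first, then strip everything that is not A-Z or a-z. --->
    <cfset photoName = Replace(productName, "&quot;", "", "ALL")>
    <cfset photoName = REReplace(photoName, "[^A-Za-z]", "", "ALL")>

    <!--- Left() returns the whole string when it is shorter than the count, so no length check is needed. --->
    <cfset photoName = Left(photoName, 45)>
    ```

    The uniqueness loop that appends a number and ".jpg" would stay as it is.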

    At first, I was using the CGI cfhttpparam type to retrieve the photos. This worked great for downloading 2,000 photos. However, the product catalogue for the major website I am working with has about 200,000 products. After I downloaded about 2,000 photos, I started getting 403 Forbidden responses in objGET.StatusCode.

    I was downloading 1 photo per minute from the affiliate server. I did not want to overload their server and thought that was a reasonable rate, but I guess I flipped a switch and got the forbidden messages.
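
    For reference, a throttled download loop that stops as soon as the host starts refusing requests might look roughly like the following. It is a sketch under assumptions: the delay, the request timeout, and the log file name photoDownload are placeholders I made up, and PHOTO_LIST is the query from the script above run without maxrows="1".

    ```cfml
    <!--- Assumed values; tune these to whatever the image host permits. --->
    <cfset delayMs = 60000>
    <cfsetting requesttimeout="3600">

    <cfloop query="PHOTO_LIST">

      <cfhttp method="get" url="#PHOTO_LIST.PRODUCT_GOOGLE_IMAGE_URL#" getasbinary="yes" timeout="30" result="objGET">
        <cfhttpparam type="header" name="referer" value="#PHOTO_LIST.BUSINESS_URL#">
      </cfhttp>

      <!--- StatusCode looks like "200 OK" or "403 Forbidden"; compare only the leading code. --->
      <cfset statusNumber = ListFirst(objGET.StatusCode, " ")>

      <cfif statusNumber EQ "200">
        <!--- Save and process the photo here. --->
      <cfelseif statusNumber EQ "403">
        <!--- The host is refusing requests; stop rather than keep hitting it. --->
        <cflog file="photoDownload" type="warning" text="403 for product #PHOTO_LIST.PRODUCT_ID#; stopping this run.">
        <cfbreak>
      </cfif>

      <!--- Pause between requests. Sleep() takes milliseconds. --->
      <cfset Sleep(delayMs)>

    </cfloop>
    ```

    If the script runs as a scheduled task that fetches one photo per run, the Sleep() call is unnecessary and only the status check matters.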

    At this point, I am wondering how to get the rest of the photos without having to download each one manually.

    I have tried something like this:

    Original Script Modification:

    
    
    <cfset myLink = 'http://www.#myDomain#/scripts/displayProductPhoto.cfm?productID=#productID#'>
    
    <cfhttp method="get" url="#myLink#" useragent="#CGI.http_user_agent#" getasbinary="no" result="objGET">    
      <cfhttpparam type="HEADER" name="referer" value="#businessURL#" />
    </cfhttp> 
    
    
    Display Product Photo Page:

    
    
    <cfparam name="URL.productID" default="-1" type="integer">
    <cfset URL.productID = '#HTMLEditFormat(URL.productID)#'>
    
    <cfif #URL.productID# NEQ -1>
    
      <cfquery datasource="#dsnName#" name="PHOTO_CHECK" maxrows="1">
      SELECT PRODUCT.PRODUCT_GOOGLE_IMAGE_URL
      FROM PRODUCT
      WHERE PRODUCT.PRODUCT_ID = <cfqueryparam cfsqltype="CF_SQL_NUMERIC" value="#URL.productID#">
      </cfquery>
      
      <cfif #PHOTO_CHECK.RecordCount# EQ 1>
      
        <cfset imageURL="#PHOTO_CHECK.PRODUCT_GOOGLE_IMAGE_URL#">
        
        <cfoutput>
        
          <img src="#imageURL#" />   
       
        </cfoutput>
        
      </cfif>
    
    </cfif>
    
    
    On the display product page, the image displays correctly in a browser. However, the content returned by the cfhttp call is the page itself (the HTML containing the img tag), not the actual product photo.

    Thus, I tried modifying the display product page as follows:

    
    
    <cfparam name="URL.productID" default="-1" type="integer">
    <cfset URL.productID = '#HTMLEditFormat(URL.productID)#'>
    
    <cfif #URL.productID# NEQ -1>
    
      <cfquery datasource="#dsnName#" name="PHOTO_CHECK" maxrows="1">
      SELECT PRODUCT.PRODUCT_GOOGLE_IMAGE_URL
      FROM PRODUCT
      WHERE PRODUCT.PRODUCT_ID = <cfqueryparam cfsqltype="CF_SQL_NUMERIC" value="#URL.productID#">
      </cfquery>
      
      <cfif #PHOTO_CHECK.RecordCount# EQ 1>
      
        <cfset imageURL="#PHOTO_CHECK.PRODUCT_GOOGLE_IMAGE_URL#"> 
        <cfimage source="#imageURL#" name="img" action="read">
        <cfset blob = ImageGetBlob(img)>
        <cfcontent type="image/jpeg" variable="#blob#">
        
      </cfif>
    
    </cfif>
    
    
    When it tries reading the cfimage it gives me the following error:

    The URL I am trying to receive the photo from appears to be on an image server. The actual URL for the product image has no file extension. When I view the source on the product image URL, it appears to be in binary format. It looks like an image might be created on the image server when somebody visits the URL and outputs it in binary format.
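
    One thing that might sidestep the missing file extension is to fetch the raw bytes with cfhttp and build the image from them, rather than handing the URL to cfimage. The sketch below assumes imageURL, businessURL, myPath, and photoName are set as in the earlier scripts, that photoName already ends in ".jpg", and that 300x300 is an acceptable bounding box (my placeholder). As far as I know ImageNew() accepts binary data, but I have not confirmed this against this particular image server.

    ```cfml
    <cfhttp method="get" url="#imageURL#" getasbinary="yes" timeout="30" result="objGET">
      <cfhttpparam type="header" name="referer" value="#businessURL#">
    </cfhttp>

    <!--- Only continue when the request succeeded and the server says it returned an image. --->
    <cfif ListFirst(objGET.StatusCode, " ") EQ "200" AND FindNoCase("image/", objGET.MimeType)>

      <!--- Build the image from the downloaded bytes; the URL's lack of an extension no longer matters. --->
      <cfset img = ImageNew(objGET.FileContent)>

      <!--- Assumed size: scale to fit inside 300x300 while keeping the aspect ratio. --->
      <cfset ImageScaleToFit(img, 300, 300)>

      <!--- The .jpg extension on the destination tells ImageWrite() which format to encode. --->
      <cfset ImageWrite(img, "#myPath##photoName#")>

    </cfif>
    ```

    This does nothing about the forbidden responses themselves; it only covers the case where the server does return image data.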

    So my question to you guys is how do I dynamically retrieve 198,000 photos at a reasonable rate and store the photos on my local server?

    Thanks in advance for any suggestions or assistance.

    Sincerely,
    Travis Walters
     
    twalters84, Dec 18, 2009 IP
  2. FCM

    #2
    What you are doing is considered crawling content, so you should also adhere to robots.txt. Do not copy images without permission; these images are copyrighted. Do you have express written consent to use the copyrighted images in the way that you are doing? Being a Google Affiliate does not by itself give you permission to take their content.

    Just Saying
     
    FCM, Dec 20, 2009 IP
  3. twalters84

    #3
    Hey there,

    I was looking at the following content:

    Product Feed Structure

    I was under the impression that when a business registers in the GAN, I can host the images specified by the Image URL in their product feeds.

    Sincerely,
    Travis Walters
     
    twalters84, Dec 21, 2009 IP