How to Scrape Pages with ColdFusion: Create over 20,000 records in 5 minutes

Discussion in 'Programming' started by digga121, Aug 26, 2008.

  1. #1

    Pretty self-explanatory:


    <cfloop from="500" to="5000" index="LoopCount">
    <cfhttp url="http://www.articles-hub.com/Article/#loopcount#.html" method="GET">
    <cfset sDoc = trim(cfhttp.fileContent)>
    <cfset regExp = '<span class="article_display_title" > 
    
            ([\s\S]*?)</span>[\s\S]*?<div align=[\s\S]*?
    </div>
        ([\s\S]*?)
              </div>
                </div>'>
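    <!--- Collect every title/article match on this page into a query object --->
    <cfset q_srch = queryNew("title,article")>
    <cfset start = 1>
    <!--- Search from the end of the previous match; start becomes 0 once nothing matches --->
    <cfloop condition="start GT 0">
      <cfset stResult = REFindNoCase(regExp, sDoc, start, true)>
      <cfif stResult.pos[1]>
         <cfset queryAddRow(q_srch)>
         <cfset querySetCell(q_srch,"article",mid(sDoc,stResult.pos[3],stResult.len[3]))>
         <cfset querySetCell(q_srch,"title",mid(sDoc,stResult.pos[2],stResult.len[2]))>
      </cfif>
      <cfset start = stResult.pos[1] + stResult.len[1]>
    </cfloop>
    <!--- Insert each matched row (none if the page did not match); cfqueryparam binds the scraped text as parameters --->
    <cfloop query="q_srch">
    <cfquery name="insert_data" datasource="localdev">
    INSERT INTO article_dump (title, content)
    VALUES (<cfqueryparam value="#q_srch.title#" cfsqltype="cf_sql_varchar">,
            <cfqueryparam value="#q_srch.article#" cfsqltype="cf_sql_longvarchar">)
    </cfquery>
    </cfloop>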
    </cfloop>
     
    digga121, Aug 26, 2008 IP
  2. cpamoney

    #2
    You could speed this up 100-1000 times by running the requests in parallel with cfthread.
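
    A rough sketch of what that might look like, assuming ColdFusion 8 or later (where cfthread is available). The batch size of 10, the scrape_ thread names, and the 60-second join timeout are illustrative assumptions, not values from the original post. The actual speedup will depend on the server's thread limits and on how quickly the remote site responds.

    ```cfml
    <!--- Sketch only: batchSize, thread names, and the timeout are assumptions --->
    <cfset batchSize = 10>
    <cfloop from="500" to="5000" step="#batchSize#" index="batchStart">
      <cfset threadNames = "">
      <cfloop from="#batchStart#" to="#min(batchStart + batchSize - 1, 5000)#" index="articleId">
        <cfset threadName = "scrape_#articleId#">
        <cfset threadNames = listAppend(threadNames, threadName)>
        <!--- Pass articleId as an attribute so each thread gets its own copy --->
        <cfthread action="run" name="#threadName#" articleId="#articleId#">
          <cfhttp url="http://www.articles-hub.com/Article/#attributes.articleId#.html" method="GET" result="httpResult">
          <!--- The thread scope makes the page content readable after the join --->
          <cfset thread.content = httpResult.fileContent>
        </cfthread>
      </cfloop>
      <!--- Wait (up to 60 seconds) for the whole batch to finish --->
      <cfthread action="join" name="#threadNames#" timeout="60000">
      <cfloop list="#threadNames#" index="threadName">
        <cfif cfthread[threadName].status EQ "COMPLETED">
          <cfset sDoc = trim(cfthread[threadName].content)>
          <!--- Run the same REFindNoCase parsing and INSERT from post #1 on sDoc here --->
        </cfif>
      </cfloop>
    </cfloop>
    ```

    Fetching in batches keeps the number of simultaneous requests bounded. The parsing and database inserts stay on the main request thread, so only the slow HTTP waits overlap.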
     
    cpamoney, Sep 12, 2008 IP
  3. digga121

    #3
    Nice, I'll have to look into it. Can you elaborate a little more on the subject, please? :D
     
    digga121, Sep 15, 2008 IP