Scraping RSS Feeds with Ruby on Rails

Discussion in 'Ruby' started by db_man, May 12, 2007.

  1. #1
    I've been promising on my blog to write a tutorial on how to scrape with Ruby on Rails, so in advance of writing it up there, I'm gonna show you all some real basic stuff on getting started with RoR. This is a method for scraping RSS feeds, swapping in your own content and link, and re-pinging the result to aggregators. Great way to get lots of links!

    I recommend you all read up on the following article so you understand the basics of Ruby and have all the software installed. That way, you will be able to understand what the hell I'm going on about when you read the tutorial.

    1) ONLamp.com -- Rolling with Ruby on Rails
    This is a CRUCIAL primer for installing and working with Rails. Make sure you go through this tutorial step by step and follow along. It will give you a solid understanding of Rails structure and syntax, as well as how to connect Rails apps to your database.
    This tutorial will also hold your hand through installing all the required software. You simply MUST read at least the first four pages of it to understand anything else I'm going to talk about.

    OK... that was a lot of stuff to wade through. Congrats if you read it all. Good luck if you didn't.

    The first step in setting up a Ruby application is telling Rails to set up the app's infrastructure. We are calling this application "feedscraper", and to set it up, we run the following at the command line (in your root directory, i.e. C:\):

    C:\> rails feedscraper

    This will automatically create the application infrastructure needed to run your app.

    Now that we have the app set up, we want to configure it to work with our database. Open up the following file in your notepad:

    C:\feedscraper\config\database.yml

    You'll see this:

    
    development:
      adapter: mysql
      database: feedscraper_development
      username: root
      password:
      host: localhost
    
    test:
      adapter: mysql
      database: feedscraper_test
      username: root
      password:
      host: localhost
    
    production:
      adapter: mysql
      database: feedscraper_production
      username: root
      password:
      host: localhost
    Code (markup):
    Let's change ALL the database names to just feedscraper, so your new file will look like this:

    
    development:
      adapter: mysql
      database: feedscraper
      username: root
      password:
      host: localhost
    
    test:
      adapter: mysql
      database: feedscraper
      username: root
      password:
      host: localhost
    
    production:
      adapter: mysql
      database: feedscraper
      username: root
      password:
      host: localhost
    Code (markup):
    Later on, when you are ready to take your tool live (which I will not cover in this tutorial), you can edit the production values to point at your website's live database.

    Now you have to open up whatever tool you use to interface with your local MySQL server (you should have installed HeidiSQL if you read the first tutorial link at the top of this post). Once it is open, create a database called feedscraper, and in it create a table called "feeds" with the following fields: id (set to int, auto increment), title, url, and post. Our program is going to grab the titles, put them in the DB, plunk our own content into the post field, and our own URL into the url field. Now remember how I said this was going to be a VERY basic tutorial? I meant it. There are a million other possibilities for how you can tweak this script, but for now, we are focusing on the bare-bones basics.
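    If you'd rather paste SQL than click around in HeidiSQL, something along these lines should do it (the column types here are my own guess; size them however you like):

     CREATE DATABASE feedscraper;
     USE feedscraper;

     CREATE TABLE feeds (
       id INT NOT NULL AUTO_INCREMENT,
       title VARCHAR(255),
       url VARCHAR(255),
       post TEXT,
       PRIMARY KEY (id)
     );
    Code (markup):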

    The last step in setting up our DB is creating a model that lets our Ruby app interface with the "feeds" table. Run the following from C:\feedscraper\:

    C:\feedscraper> ruby script/generate model feed

    NOTE: pay attention to the fact that the table is called "feeds" and the model is called "feed". I'm too lazy to get into the details of this; if you read the first tutorial at the top of this post, you'll already understand.
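    For reference, the file that command generates at C:\feedscraper\app\models\feed.rb is basically empty; Rails figures out the "feeds" table from the class name:

     class Feed < ActiveRecord::Base
     end
    Code (markup):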

    OK... so the DB is set up. Now we just have one more thing to do before we start some actual coding: we have to install the Hpricot library.

    Go to C:\feedscraper and run:

    C:\feedscraper> gem install hpricot

    And for good measure, you might as well install the following library too:

    C:\feedscraper> gem install mechanize

    OK! Down to some coding.

    Go to C:\feedscraper\app and right-click > New > Ruby Program.

    That will open up SciTE and you'll have a blank canvas.

    Here we go. The first thing we have to do is require all the necessary libraries and our database model, as follows:

     require 'rubygems'            # needed on my setup so the gems load
     require 'hpricot'             # does the actual parsing
     require 'open-uri'            # fetches the document we parse
     require 'active_record'       # lets us talk to the feeds table
     require '../app/models/feed'  # the Feed model we generated earlier
    Code (markup):
    To break it down: Hpricot performs the parsing, and open-uri performs the fetching of the document you are parsing (Mechanize performs much more sophisticated tasks, but we don't need it for this simple application). RubyGems is something I have to require on my setup in order for Hpricot to work; it may not be the same on yours. ActiveRecord is the library that lets us interface with our "feeds" table through the "feed" model we created earlier.

    After we have made those declarations, we have to tell it where our database is:

          db_config = YAML::load(File.open("../config/database.yml"))
          ActiveRecord::Base.establish_connection(db_config['development'])
    Code (markup):
    We just told it to look for our config file and to use the database values listed under development.
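    If you want to sanity-check the connection before going any further, a throwaway line like this (my addition, not part of the script) should print 0 on a fresh table instead of blowing up:

     puts Feed.count
    Code (markup):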

    Now, since we don't want (or at least I don't want) to accrue a totally massive database full of tons of rewritten links, I tell the app to wipe the table clean at the start of every run:

         Feed.delete_all "id > 0"
    Code (markup):
    And then we define the variable that contains the content we want to replace each post's content with. Also define your URL (you can set up much more sophisticated means of cycling URLs; I'll sketch one below):

     @content = 'YOUR CONTENT HERE'
     @url = 'YOUR URL HERE'
    Code (markup):
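    By "cycling URLs" I just mean something like picking at random from a list on each run. A minimal sketch, with placeholder URLs you'd swap for your own:

     @urls = ['http://yoursite1.com', 'http://yoursite2.com', 'http://yoursite3.com']
     @url = @urls[rand(@urls.size)]   # grab one at random each run
    Code (markup):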
    Forging ahead, we tell Hpricot to open up Google Blog Search and search for the keywords we want (just to be contrary, let's use kittens):

         doc = Hpricot(open("http://blogsearch.google.com/blogsearch_feeds?hl=en&as_drrb=q&as_qdr=t&q=kittens&ie=utf-8&num=100&output=rss"))
    
    Code (markup):
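    Before we pick the page apart, it helps to see what we just fetched. An RSS feed is just nested tags; trimmed way down, it looks roughly like this (a real feed carries a lot more):

     <rss version="2.0">
       <channel>
         <title>...</title>
         <item>
           <title>First post title</title>
           <link>http://example.com/first-post</link>
         </item>
         <item>
           ...
         </item>
       </channel>
     </rss>
    Code (markup):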
    Next, we start picking apart the page. RSS makes our life easy because everything is wrapped in a tag, so we just start from the outside and work our way in, tag by tag. Like this:

         main = doc.search("channel")
    Code (markup):
    So now, whatever comes out of a search performed on "main" will only be what is inside the <channel> tag. See how easy that was?
    Next, we want to tell it to find each <item> and perform a set of actions on the contents of each <item> tag it encounters:
    
        main.search("item").each { |post|
        title = (post/"title").inner_text
    Code (markup):
    This is telling it to find each <item> tag and give it the name "post". The second line says: for each "post" you encounter, search inside it for a <title> tag and grab the text from inside it. That is how easy it is to grab a post's title.
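    The same one-liner pulls out anything else inside an <item>. We don't use these in this tutorial, but just so you can see the pattern (these two lines are illustration only, not part of the script):

     link = (post/"link").inner_text          # the post's original URL
     desc = (post/"description").inner_text   # the post's summary text
    Code (markup):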
    Just to make sure we are grabbing titles while it's running, we tell it to print each one in the display window:

         puts title
    Code (markup):
    Now we want to save each newly scraped title and, along with it, a copy of the previously defined @url and @content variables. We want to plunk them into the database we set up earlier:

       db = Feed.new
       db.title = title
       db.url = @url
       db.post = @content
       db.save
     }
    Code (markup):
    That code lets us save every single scraped title, and its accompanying pre-defined URL and content, to our database!

    Now, to automate the pinging, I recommend the following:

     @ping = Hpricot(open("Take the URL you get from a ping result on pingomatic and put it here.."))
     puts @ping
    Code (markup):
    Go on! Press F5 and run your app! It'll run as long as I didn't make any typos... lol
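    If you want it all in one piece to check your work against, here's the whole scraper script exactly as built up above (same placeholders as before):

     require 'rubygems'
     require 'hpricot'
     require 'open-uri'
     require 'active_record'
     require '../app/models/feed'

     # point ActiveRecord at the development database
     db_config = YAML::load(File.open("../config/database.yml"))
     ActiveRecord::Base.establish_connection(db_config['development'])

     # wipe the feeds table clean at the start of every run
     Feed.delete_all "id > 0"

     @content = 'YOUR CONTENT HERE'
     @url = 'YOUR URL HERE'

     # fetch 100 blog search results for our keyword as RSS
     doc = Hpricot(open("http://blogsearch.google.com/blogsearch_feeds?hl=en&as_drrb=q&as_qdr=t&q=kittens&ie=utf-8&num=100&output=rss"))

     # walk in from <channel> to each <item>, saving as we go
     main = doc.search("channel")
     main.search("item").each { |post|
       title = (post/"title").inner_text
       puts title
       db = Feed.new
       db.title = title
       db.url = @url
       db.post = @content
       db.save
     }
    Code (markup):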


    Now the next step is to create a PHP file that reads that database and generates an XML feed, so you can ping it out to the aggregators.


    Here, just 'cuz I'm a nice guy:

    <?php
    header('Content-type: text/xml');
    // static channel header for the feed
    echo "<rss version=\"2.0\">\n";
    echo "   <channel>\n";
    echo "       <title></title>\n";
    echo "       <link></link>\n";
    echo "       <description></description>\n";
    echo "       <language>en-us</language>\n";
    // connect to the feedscraper database we filled up earlier
    $dbh = mysql_connect("localhost", "USERNAME", "PASSWORD") or die('I cannot connect to the database because: ' . mysql_error());
    mysql_select_db("feedscraper");
    $pop_query = "SELECT * FROM feeds";
    $pop_result = mysql_query($pop_query) or die("Couldn't execute query");
    // output one <item> per scraped row
    while ($poprow = mysql_fetch_array($pop_result)) {
        $titlex = $poprow["title"];
        $url = $poprow["url"];
        $post = $poprow["post"];
        $date = date("Y/m/d");
        echo "       <item>\n";
        echo "               <title>$titlex</title>\n";
        echo "               <link>$url</link>\n";
        echo "               <pubDate>$date</pubDate>\n";
        echo "               <description><![CDATA[ $post ]]></description>\n";
        echo "       </item>\n";
    }
    echo "   </channel>\n";
    echo "</rss>\n";
    ?>
    PHP:
    OK! That's it, fellas. I've put this whole thing on a silver platter for you. Excuse my typos, bad coding (I'm NOT a pro coder, just a hobbyist) and bad grammar.
     
    db_man, May 12, 2007
  2. #2
    Thanks for the nice program. I should try it out!
     
    butterfingers, May 15, 2007