Weird characters throwing off RSS feed scrape

Discussion in 'PHP' started by soggy, Feb 10, 2008.

  1. #1
    I was hoping someone could help modify this.

    The PHP below works well for scraping until it runs into a headline containing a special character such as "&" (and a few others). Is there a fix for the code below?

    Thanks in advance.

    <?php
    
    // Screen scraping your way into RSS
    // Example script, by Dennis Pallett
    // http://www.phpit.net/tutorials/screenscrap-rss
    
    // Get page
    $url = "http://www.urlgoeshere.com/";
    $data = implode("", file($url)); 
    
    // Get content items
    preg_match_all ("/<div class=\"headline\">([^`]*?)<\/a/", $data, $matches);
    
    // Begin feed
    header ("Content-Type: text/xml; charset=ISO-8859-1");
    echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
    ?>
    <rss version="2.0"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:content="http://purl.org/rss/1.0/modules/content/"
      xmlns:admin="http://webns.net/mvcb/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <channel>
            <title>News</title>
            <description>The latest news from</description>
            <link>http://www.urlgoeshere.com</link>
            <language>en-us</language>
    
    
    <?php
    // Loop through each content item
    foreach ($matches[0] as $match) {
        // First, get title
        preg_match ("/\>([^`]*?)<\/a/", $match, $temp);
        $title = $temp['1'];
        $title = strip_tags($title);
        $title = trim($title);
    
        // Second, get url
        preg_match ("/<a href=\"([^`]*?)\">/", $match, $temp);
        $url = $temp['1'];
        $url = trim($url);
    
        // Echo RSS XML
        echo "<item>\n";
            echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
            echo "\t\t\t<link>http://www.urlgoeshere.com" . strip_tags($url) . "</link>\n";
            echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
            echo "\t\t\t<content:encoded><![CDATA[ \n";
            echo $text . "\n";
            echo " ]]></content:encoded>\n";
            echo "\t\t\t<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
        echo "\t\t</item>\n";
    }
    ?>
    </channel>
    </rss>
    soggy, Feb 10, 2008
  2. imvain2

    #2
    Replace & with &amp; in the title before echoing it; a bare & is not valid inside XML.
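
    A minimal sketch of that fix, assuming the scraped headline may already contain HTML entities from the source page. The helper name xml_escape() and the sample headline are made up for illustration; html_entity_decode() and htmlspecialchars() are standard PHP functions. The charset matches the ISO-8859-1 declared in the feed header.

```php
<?php
// Hypothetical helper (not part of the original script): prepare scraped
// text for output inside an XML element such as <title>.
function xml_escape($text)
{
    // Decode entities the source page already used, so that an existing
    // "&amp;" does not end up double-escaped as "&amp;amp;".
    $decoded = html_entity_decode($text, ENT_QUOTES, 'ISO-8859-1');

    // Re-encode the characters that are special in XML: & < > " '
    return htmlspecialchars($decoded, ENT_QUOTES, 'ISO-8859-1');
}

// Sample headline (an assumption for illustration, not real scraped data).
$title = 'Rock & Roll: Tom &amp; Jerry <b>live</b>';
$title = trim(strip_tags($title));

echo "<title>" . xml_escape($title) . "</title>\n";
```

    In the loop from the first post, this would mean echoing xml_escape($title) instead of strip_tags($title) on the <title> line. It only covers characters that are special in XML; characters outside ISO-8859-1 would need separate handling.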
     
    imvain2, Feb 10, 2008