Parsing DMOZ XML files

Discussion in 'ODP / DMOZ' started by Circuitpreacher, Aug 1, 2007.

  1. #1
    Hello all,

    I'm new to this forum and new to XML (though I did read the book :), so please "type slowly" so I can follow :). I am trying to mine the data from the RDF files that DMOZ provides for its categories and links. I'm working with the small sample file they publish at http://rdf.dmoz.org/rdf/structure.example.txt.

    I wasn't joking about reading the book. It covers using XML and PHP together, and I copied a script out of it (see below). So far it parses other XML files but won't touch the DMOZ one. I get these error messages:

    Warning: xmldocfile(): xmlParsePI : no target name in /home/bungeebo/public_html/xml_test/run_test3.php on line 3

    Warning: xmldocfile(): Start tag expected, '<' not found in /home/bungeebo/public_html/xml_test/run_test3.php on line 3
    error in xml doc
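
    A note on the first warning: libxml reports "xmlParsePI : no target name" when "<?" is not immediately followed by a name. The DMOZ excerpt pasted at the end of this post begins with "<? xml" (a space after "<?"), which would produce exactly this pair of warnings if the file on disk matches what is pasted. The following is only a minimal sketch under that assumption. It uses the DOM extension from PHP 5 and later rather than the xmldocfile() call in the script, and both function names are made up for illustration.

```php
<?php
// Hypothetical helper (not part of the original script): removes any
// whitespace between "<?" and "xml" at the very start of the document.
function fix_xml_declaration($xml)
{
    return preg_replace('/^\s*<\?\s+xml\b/', '<?xml', $xml, 1);
}

// Hypothetical helper: parse a string and return a DOMDocument,
// or null when the input is not well-formed.
function parse_xml_string($xml)
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok ? $doc : null;
}

// A shortened stand-in for the DMOZ sample, with the same leading space.
$broken = "<? xml version='1.0' encoding='UTF-8' ?>\n"
        . "<RDF xmlns=\"http://dmoz.org/rdf\"><Topic/></RDF>";

$before = parse_xml_string($broken);                      // fails to parse
$after  = parse_xml_string(fix_xml_declaration($broken)); // parses

echo $before === null ? "raw input: not well-formed\n" : "raw input: parsed\n";
echo $after === null ? "fixed input: not well-formed\n" : "fixed input: parsed\n";
```

    If the space turns out to be only an artifact of pasting into the forum, this changes nothing and the cause lies elsewhere. Editing the first line of the file by hand would have the same effect as the helper.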



    This is going to be quite a bit of code, but showing it is probably the best way to figure this out. Three blocks follow, in this order:

    1) the PHP page that parses the XML and produces the error when run on the DMOZ data. The only thing I change between runs is the file name on line 2 (line 3, the next one, is the line named in the error message). With the file name used below (dmoz_cat_ex.xml) I get the error; with "test.xml" I don't.
    2) the XML that works (test.xml), and
    3) the start of the XML that causes the error (dmoz_cat_ex.xml). The rest of that file is the same as http://rdf.dmoz.org/rdf/structure.example.txt.

    <?php
    $xml_file = "dmoz_cat_ex.xml";
    if (!$doc = xmldocfile($xml_file))
    {
        die ('error in xml doc');
    }
    // Walk the tree from the root element, printing nested lists of tag names.
    $root = $doc->root();
    $children = get_children($root);
    $elementCount = 1; // 1 accounts for the root element
    print_tree($children);

    // Print one <ul> per level of the tree, recursing into each element.
    function print_tree($nodeCollection)
    {
        global $elementCount;
        echo '<ul>';
        for ($x = 0; $x < sizeof($nodeCollection); $x++)
        {
            $elementCount++;
            echo '<li>' . $nodeCollection[$x]->tagname;
            $nextCollection = get_children($nodeCollection[$x]);
            print_tree($nextCollection);
            echo '</li>';
        }
        echo '</ul>';
    }

    // Return only the element children of a node (skips text, comments, etc.).
    function get_children($node)
    {
        $temp = $node->children();
        $collection = array();
        for ($x = 0; $x < sizeof($temp); $x++)
        {
            if ($temp[$x]->type == XML_ELEMENT_NODE)
            {
                $collection[] = $temp[$x];
            }
        }
        return $collection;
    }
    echo 'Total number of elements : ', $elementCount;
    ?>

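    The script above relies on xmldocfile(), root(), children(), tagname and type, which come from PHP 4's DOM XML extension; that extension is no longer bundled with PHP 5 and later. On a newer PHP, roughly the same tree dump could be sketched with the DOMDocument class as below. This is an illustration, not the book's code: the function names are made up, and a tiny inline document stands in for test.xml.

```php
<?php
// Hypothetical helper: return only the element children of a node,
// mirroring get_children() in the script above.
function element_children(DOMNode $node)
{
    $collection = array();
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $collection[] = $child;
        }
    }
    return $collection;
}

// Hypothetical helper: build the nested <ul> markup as a string and
// count elements through a reference instead of a global variable.
function render_tree(array $nodes, &$count)
{
    $html = '<ul>';
    foreach ($nodes as $node) {
        $count++;
        $html .= '<li>' . htmlspecialchars($node->nodeName)
               . render_tree(element_children($node), $count)
               . '</li>';
    }
    return $html . '</ul>';
}

// Small inline stand-in for test.xml so the sketch is self-contained.
$doc = new DOMDocument();
$doc->loadXML(
    '<rss version="2.0"><channel><title>CBC</title>'
    . '<item><title>Story</title></item></channel></rss>'
);

$elementCount = 1; // 1 accounts for the root element
$treeHtml = render_tree(element_children($doc->documentElement), $elementCount);

echo $treeHtml;
echo 'Total number of elements : ', $elementCount;
```

    To read from a file instead, $doc->load($xml_file) takes the place of loadXML(). Passing the counter by reference is a design choice that avoids the global; the output structure is otherwise meant to match the original.
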


    <?xml version="1.0" encoding="ISO-8859-1"?>
    <rss version="2.0">
    <channel>
    <title>CBC | Top Stories News</title>
    <link>http://www.cbc.ca/news/</link>
    <description>FOR PERSONAL USE ONLY</description>
    <language>en-ca</language>
    <lastBuildDate>Fri, 12 Jan 2007 16:28:57 EST</lastBuildDate>
    <copyright>Copyright: (C) Canadian Broadcasting Corporation, http://www.cbc.ca/aboutcbc/discover/termsofuse.html#Rss</copyright>
    <docs>http://www.cbc.ca/rss/</docs>
    <image>
    <title>CBC.ca</title>
    <url>http://www.cbc.ca/rss/image/cbc_144.gif</url>
    <link>http://www.cbc.ca</link>
    </image>
    <item>
    <title>Pakistan objects to U.S. claims it shelters leaders of al-Qaeda</title>
    <link>http://www.cbc.ca/world/story/2007/01/12/pakistan-alqaeda-070112.html?ref=rss</link>
    <author>CBC</author>
    <pubDate>Fri, 12 Jan 2007 12:30:21 EST</pubDate>
    <description>Pakistani officials have sharply rejected U.S. claims that the country is harbouring senior al-Qaeda leaders and serving as a nerve centre controlling terror operations.</description>
    </item>
    <item>
    <title>Extreme wind chill shuts down Manitoba schools</title>
    <link>http://www.cbc.ca/canada/manitoba/story/2007/01/12/mba-cold.html?ref=rss</link>
    <author>CBC</author>
    <pubDate>Fri, 12 Jan 2007 11:14:31 EST</pubDate>
    <description>Classes were cancelled at many schools Friday as most of southern Manitoba woke up under a deep freeze, with a wind chill making it feel like -48 C.</description>
    </item>
    <item>
    <title>Ontario man accused of bilking investors of $8M</title>
    <link>http://www.cbc.ca/money/story/2007/01/12/spencer.html?ref=rss</link>
    <author>CBC</author>
    <pubDate>Fri, 12 Jan 2007 15:32:06 EST</pubDate>
    <description>Police in Toronto have issued a Canada-wide warrant for the arrest of a 25-year-old man who they say defrauded friends and relatives of $8 million.

    </description>
    </item>
    <item>
    <title>Beckham set to invade America</title>
    <link>http://www.cbc.ca/sports/soccer/story/2007/01/12/david-beckham.html?ref=rss</link>
    <author>CBC</author>
    <pubDate>Fri, 12 Jan 2007 14:33:15 EST</pubDate>
    <description>David Beckham is one of the most famous athletes in the world, but he's hardly a household name in the United States. He has five years to change that.</description>
    </item>
    <item>
    <title>Water isn't only source of Montezuma's revenge</title>
    <link>http://www.cbc.ca/consumer/story/2007/01/12/travellers-diarrhea.html?ref=rss</link>
    <author>CBC</author>
    <pubDate>Fri, 12 Jan 2007 15:29:28 EST</pubDate>
    <description>Many Canadians, adhering only to the old travellers' maxim of 'don't drink the water,' are putting themselves at risk by ignoring other potential diarrhea triggers, a new study suggests.</description>
    </item>
    </channel>
    </rss>





    <? xml version='1.0' encoding='UTF-8' ?>

    <RDF xmlns:r="http://www.w3.org/TR/RDF/"
    xmlns:d="http://purl.org/dc/elements/1.0/"
    xmlns="http://dmoz.org/rdf">

    <!-- Generated at 2004-05-04 01:05:15 GMT on dust -->

    <Topic r:id="Top">
    <catid>1</catid>
    <d:Title>Top</d:Title>
    <lastUpdate>2004-04-13 23:40:59</lastUpdate>
    <narrow r:resource="Top/Arts"/>
    <narrow r:resource="Top/Shopping"/>
    <narrow r:resource="Top/Science"/>
    <narrow r:resource="Top/Games"/>
    <narrow r:resourc
     
    Circuitpreacher, Aug 1, 2007
  2. PlantNut

    #2
    PlantNut, Aug 1, 2007