HTML parser

Discussion in 'Programming' started by ma0, May 28, 2007.

  1. #1
    I'm writing an HTML parser to get information from web pages, but I haven't found a parser that is both fast and decent.
    Any language is ok. Right now I'm testing PHP and Java.
    Any ideas?
     
    ma0, May 28, 2007
  2. krt

    krt Well-Known Member

    #2
    What are you trying to get from the web pages? I think this is close to what you are looking for:
    http://www.phpclasses.org/browse/package/1754.html
    Look around that site for plenty more related classes, I'm sure one will do the job.
     
    krt, May 31, 2007
  3. ma0

    ma0 Peon

    #3
    Again that site. :) I need to subscribe then.

    You can see an example of what I want to do on my blog. The Technorati Tool in the top menu of my blog was made by hand. I'd like something more general-purpose for that kind of thing. Just imagine a tree that you can search for specific ids.
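
The "tree where you can look for specific ids" idea can be sketched roughly as follows. This is only a minimal illustration, written in Python because the thread is open to any language, and it uses nothing but the standard library's `html.parser`. The names `Node`, `TreeBuilder`, `parse_html` and `SAMPLE`, as well as the `/technorati` link, are made up for the example and do not come from any of the libraries mentioned in this thread.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not become
# the "current" open element while the tree is being built.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the parsed tree (hypothetical helper class)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order

    def text(self):
        """Concatenate all text inside this element, in document order."""
        parts = []
        for child in self.children:
            parts.append(child if isinstance(child, str) else child.text())
        return "".join(parts)

    def find_by_id(self, wanted):
        """Depth-first search for the first element with id == wanted."""
        if self.attrs.get("id") == wanted:
            return self
        for child in self.children:
            if isinstance(child, Node):
                found = child.find_by_id(wanted)
                if found is not None:
                    return found
        return None


class TreeBuilder(HTMLParser):
    """Turns the parser's start/end/data events into a Node tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name;
        # stray end tags with no matching open element are ignored.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse_html(markup):
    """Parse a string of HTML and return the root Node of the tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


SAMPLE = """
<html><body>
  <div id="menu"><a id="tool" href="/technorati">Technorati Tool</a></div>
  <p id="intro">Hello<br>world</p>
</body></html>
"""

tree = parse_html(SAMPLE)
link = tree.find_by_id("tool")
print(link.tag, link.attrs["href"], link.text())
```

The lookup here is a plain depth-first walk, which is fine for a single page. If many ids have to be fetched from the same page, building a dictionary from id to node once during parsing would probably be the better trade-off. Badly broken markup is only handled in the crude way shown in `handle_endtag`, so a tidy-style cleanup pass would still be needed for messy real-world pages.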

    my blog
     
    ma0, May 31, 2007
  4. jetbrains

    jetbrains Well-Known Member

    #4
    There is an excellent HTML parser in Lucene, which is written entirely in Java.
     
    jetbrains, May 31, 2007
  5. ma0

    ma0 Peon

    #5
    I'm testing JTidy right now. I wonder if there is a speed comparison somewhere on the net.
    Also, a simple PHP parser would be helpful for my blog, because I'd like to explain to my readers what I can do with a parser, but I don't want to teach Java.
     
    ma0, Jun 1, 2007
  6. slawek

    slawek Peon

    #6
    I used the Internet Explorer COM object for parsing. It has performance problems, but since this COM object is part of the browser, it is the most decent parsing tool I have found.
     
    slawek, Jun 5, 2007