Regular Expression Help: HTML Scraping Project

Discussion in 'Programming' started by fernan, Oct 7, 2009.

  1. #1
    Hi all, I've been working on this for days but I can't solve it. This is a web scraping project.

    I want to extract the phrase below using a regular expression. It's in C# (.NET).

    Health Niche Blogs

    <TD class=info width="80%" noWrap><STRONG>Author:</STRONG> Health Niche Blogs | <STRONG>Published:</STRONG> Sep 04, 2009<BR><STRONG>License:</STRONG> FREE | <SPAN style="WHITE-SPACE: normal"><STRONG>O/S:</STRONG> Windows NT/2000/XP/2003/Vista </SPAN></TD>

    Here's my regex, but I think there's something wrong with it:

    (?<=<STRONG>Author.*(?=<STRONG>)

    Thank you
     
    fernan, Oct 7, 2009
  2. ohteddy

    ohteddy Member

    #2
    Try this:
    
    Author:<\/STRONG> ([^|]+) \| <STRONG
    
    I tested this regex on http://rubular.com/
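
    For anyone who wants to check the pattern outside Rubular, here is a minimal sketch in Python (not the C# the original question asks about). It shows two variants: a capture group with the pipe escaped so it matches a literal "|", and a lookaround form close to the original attempt. That attempt opens a lookbehind with "(?<=" and never closes it, so its parentheses are unbalanced. The names HTML, CAPTURE, LOOKAROUND and extract_author are made up for this illustration. The patterns should also carry over to .NET's Regex class, but that is an assumption and has not been tested here.

```python
import re

# Sample cell from the first post, shortened to the relevant part.
HTML = (
    '<TD class=info width="80%" noWrap><STRONG>Author:</STRONG> Health Niche Blogs | '
    '<STRONG>Published:</STRONG> Sep 04, 2009<BR>'
)

# Capture-group form: "\|" matches a literal pipe instead of acting as alternation.
CAPTURE = re.compile(r'Author:</STRONG>\s*([^|<]+?)\s*\|\s*<STRONG')

# Lookaround form: both groups are closed, and the lookbehind is fixed-width
# (Python's re module requires that; other engines may be more lenient).
LOOKAROUND = re.compile(r'(?<=<STRONG>Author:</STRONG> ).*?(?= \| <STRONG>)')


def extract_author(html):
    """Return the text between the Author label and the next pipe, or None."""
    match = CAPTURE.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_author(HTML))
    print(LOOKAROUND.search(HTML).group(0))
```

    The capture-group form tolerates extra whitespace around the pipe. The lookaround form expects exactly one space on each side.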
     
    ohteddy, Oct 7, 2009
  3. petividi

    petividi Greenhorn

    #3
    Scraping is much easier with XPath. There is a Firefox add-on for testing XPath expressions; search for it on Google.
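
    A rough sketch of the XPath idea, using only Python's standard library. Two assumptions are baked in: the cell has already been tidied into well-formed XML (the original markup has unquoted attributes and an unclosed <BR>, so a strict XML parser would reject it), and the wrapping <TR> plus the names CELL and author_from_cell are invented for the example. ElementTree supports only a limited subset of XPath, so a full XPath engine could express this more directly.

```python
import xml.etree.ElementTree as ET

# Assumption: a tidied, well-formed version of the cell from the first post.
CELL = (
    '<TR><TD class="info" width="80%">'
    '<STRONG>Author:</STRONG> Health Niche Blogs | '
    '<STRONG>Published:</STRONG> Sep 04, 2009<BR/>'
    '</TD></TR>'
)


def author_from_cell(xml_text):
    """Return the text that follows the 'Author:' label, or None if absent."""
    root = ET.fromstring(xml_text)
    # Limited XPath: every STRONG directly inside a TD whose class is "info".
    for label in root.iterfind(".//TD[@class='info']/STRONG"):
        if label.text == "Author:" and label.tail:
            # The author name is the text node right after the label element;
            # trim the surrounding spaces and the trailing pipe separator.
            return label.tail.strip(" |")
    return None


if __name__ == "__main__":
    print(author_from_cell(CELL))
```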
     
    petividi, Oct 8, 2009
  4. ohteddy

    ohteddy Member

    #4
    I agree. TagSoup is another great one, and I use it a lot.

    For example, the following Haskell code scrapes a page (txt holds its HTML) and collects all
    links that have .rar as an extension:
    
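    import Text.HTML.TagSoup (Tag(TagOpen), parseTags)
    import System.FilePath (takeExtension)
    rarLinks :: String -> [String]
    rarLinks txt = [rar | TagOpen "a" atts <- parseTags txt
                        , ("href",rar) <- atts
                        , takeExtension rar == ".rar"]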
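
    For readers who don't use Haskell, roughly the same idea can be sketched with Python's built-in html.parser, which also tolerates messy markup. The names RarLinkCollector and rar_links are invented for this example. Like the snippet above, it checks the extension of the raw href value, so a link with a query string after ".rar" would not match.

```python
from html.parser import HTMLParser
import posixpath


class RarLinkCollector(HTMLParser):
    """Collects href values of <a> tags whose extension is .rar."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and posixpath.splitext(value)[1] == ".rar":
                self.links.append(value)


def rar_links(html_text):
    """Return all .rar link targets found in html_text, in document order."""
    collector = RarLinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


if __name__ == "__main__":
    page = '<a href="a.rar">x</a><A HREF="b.zip">y</A><a href="/d/c.rar">z</a>'
    print(rar_links(page))
```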
    
     
    Last edited: Oct 9, 2009
    ohteddy, Oct 9, 2009