Regular Expression Help: HTML Scraping Project

fernan Well-Known Member

#1

Hi all, I've been working on this for days but I can't solve it. This is a web scraping project.

I want to extract the phrase below using a regular expression. It's in C# (.NET).

Health Niche Blogs

<TD class=info width="80%" noWrap>Author: Health Niche Blogs | Published: Sep 04, 2009 License: FREE | O/S: Windows NT/2000/XP/2003/Vista </TD>

Here's my regex, but I think there's something wrong with it:

(?<=Author.*(?=)

Thank you

fernan, Oct 7, 2009

ohteddy Member

#2

Try this:
Author:<\/STRONG> ([^|]+) \| <STRONG
I tested this regex on http://rubular.com/
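For the row exactly as posted in #1 (which shows no STRONG tags), a capture group that stops at the first pipe may be enough. The following is a minimal sketch in Python rather than C#, and the names `AUTHOR_RE` and `extract_author` are made up for illustration. The pattern uses only constructs that .NET's Regex class also supports, so it should carry over, but that is an assumption to verify against the real page.

```python
import re

# Sample row copied from post #1.
html = ('<TD class=info width="80%" noWrap>Author: Health Niche Blogs | '
        'Published: Sep 04, 2009 License: FREE | '
        'O/S: Windows NT/2000/XP/2003/Vista </TD>')

# Capture the text after "Author:" up to the first pipe, leaving the
# surrounding whitespace outside the group. The pipe is escaped so that
# it matches a literal "|" instead of acting as alternation.
AUTHOR_RE = re.compile(r"Author:\s*([^|<]+?)\s*\|")


def extract_author(text):
    """Return the author name, or None if the pattern does not match."""
    match = AUTHOR_RE.search(text)
    return match.group(1) if match else None


print(extract_author(html))  # Health Niche Blogs
```

Using a capture group avoids the lookbehind from the original attempt, which is harder to get right and is not needed when only the captured text matters.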

ohteddy, Oct 7, 2009

petividi Greenhorn

#3

Scraping is much easier with XPath. There is a Firefox add-on for testing XPath expressions; google it.
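To illustrate the parser-based idea, here is a rough sketch that uses Python's standard-library `html.parser` as a stand-in for an XPath engine (it is not XPath itself; the rough XPath equivalent would be something like `//td[@class='info']`). The class `InfoCellParser` and the helper `author_from_cell` are hypothetical names, and the sample row is the one from post #1.

```python
from html.parser import HTMLParser


class InfoCellParser(HTMLParser):
    """Collect the text of every <td class="info"> cell."""

    def __init__(self):
        super().__init__()
        self._in_info = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "td" and dict(attrs).get("class") == "info":
            self._in_info = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_info = False

    def handle_data(self, data):
        if self._in_info:
            self.cells[-1] += data


def author_from_cell(cell):
    """Pull the value after 'Author:' out of a 'key: value | ...' cell."""
    for field in cell.split("|"):
        key, _, value = field.partition(":")
        if key.strip() == "Author":
            return value.strip()
    return None


# Sample row copied from post #1.
html = ('<TD class=info width="80%" noWrap>Author: Health Niche Blogs | '
        'Published: Sep 04, 2009 License: FREE | '
        'O/S: Windows NT/2000/XP/2003/Vista </TD>')

parser = InfoCellParser()
parser.feed(html)
parser.close()
print(author_from_cell(parser.cells[0]))  # Health Niche Blogs
```

Selecting the cell by its tag and class first means the text handling no longer depends on how the surrounding markup is written.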

petividi, Oct 8, 2009

ohteddy Member

#4

petividi said:

Scraping is much easier with XPath. There is a Firefox add-on for testing XPath expressions; google it.

I agree. TagSoup is another great one, and I use it a lot.

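For example, the following Haskell code scrapes a page and collects all the links that have .rar as an extension:
-- rarLinks is an illustrative name; txt is the page source as a String
import System.FilePath (takeExtension)
import Text.HTML.TagSoup (Tag (TagOpen), parseTags)
rarLinks txt = [rar | TagOpen "a" atts <- parseTags txt
 , ("href",rar) <- atts
 , takeExtension rar == ".rar"]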
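For readers who do not use Haskell, the same idea can be sketched in Python with the standard-library `html.parser`. The names `LinkCollector` and `links_with_extension` and the sample `page` string are invented for this example. One deliberate difference from the TagSoup snippet: the extension is taken from the URL path, so a query string after the file name is ignored.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose path ends in an extension."""

    def __init__(self, extension):
        super().__init__()
        self.extension = extension
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Compare only the path part, so "?dl=1" is not counted
                # as part of the extension.
                path = urlparse(value).path
                if posixpath.splitext(path)[1] == self.extension:
                    self.links.append(value)


def links_with_extension(html, extension):
    collector = LinkCollector(extension)
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up sample page for illustration.
page = ('<a href="/files/a.rar">a</a> <a href="b.zip">b</a> '
        '<A HREF="http://example.com/c.rar?dl=1">c</A>')
print(links_with_extension(page, ".rar"))
```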

Last edited: Oct 9, 2009

ohteddy, Oct 9, 2009
