Let's say I have some code like this: <a target="_blank" href="http://link.com"> abc</a> </p> But of course I have a lot more. How can I automatically extract the URLs? I know the task is simply to return all values between href=" and ">, but I have no idea what code I could use for it.
You could use regular expressions, though you will almost certainly come across some HTML that the regex won't match. For fast development (but it might break easily): regular expressions. For bulletproof code (but harder to learn and longer to write): XPath or a selector library like phpQuery. Edit: sorry, I just came from the PHP forum and assumed you were talking about PHP.
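For anyone reading along, here is a minimal sketch of the parser-based route. It is written in Python rather than PHP, which is an assumption, since the thread never settles on a language. It relies only on the standard library's html.parser; the names HrefCollector and extract_hrefs are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(html):
    # Hypothetical helper: feed the page source in, get the URLs back.
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    sample = '<a target="_blank" href="http://link.com"> abc</a> </p>'
    for url in extract_hrefs(sample):
        print(url)
```

Because the parser reads attributes properly, it should not matter whether href comes before or after target, or whether the value is in double quotes, single quotes, or unquoted.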
Thanks for your reply. Well, I just need what I wrote above, because there's just one site I need to do it for. It's a site that checks backlinks, and I need the URLs by themselves, like www.link1.com www.link2.com www.link3.com. I don't think it's a hard task: I basically just need somewhere I can paste the code, and it then finds all values between href=" and "> and echoes them. But I'm no programmer, so it's not that easy for me.
I've made a tool that grabs all regex matches: http://rapidshare.com/files/432240761/GetMeAllMatches.exe Set the matcher to href="([^"]*?)" and the group to 1 (so it outputs the match in the brackets, i.e. the URL), paste the source code into the input box, and hit "get matches". Here is a short video: http://rapidshare.com/files/432242006/getmeallmatches.wmv
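If the tool is not an option, the same pattern and group 1 can be tried in a few lines of script. This is only a sketch in Python (my choice of language, not something the thread specifies), and the names HREF_PATTERN and find_hrefs are made up for illustration.

```python
import re

# Same pattern as in the post above: capture everything between
# href=" and the next double quote.
HREF_PATTERN = re.compile(r'href="([^"]*?)"')


def find_hrefs(source):
    # With exactly one capture group, findall returns the contents
    # of group 1 for every match.
    return HREF_PATTERN.findall(source)


if __name__ == "__main__":
    sample = (
        '<a target="_blank" href="http://www.link1.com"> abc</a> '
        '<a target="_blank" href="http://www.link2.com"> def</a>'
    )
    for url in find_hrefs(sample):
        print(url)
```

Note that this only matches double-quoted href values; links written with single quotes or no quotes are skipped, which is the kind of breakage mentioned earlier in the thread.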
Your tool doesn't work :/ But some other guys made a script that works perfectly fine, so I don't need it anymore.
You can make a scraper in a multitude of languages; for example, you could use Java + regex (the easiest way to search, IMO).