Hey guys, I need some type of script (Perl, PHP) that I can point at a website and that will spider the internal domain links and return whatever strings I need. For example, there is a site that has a ton of videos that I want to embed, so instead of manually navigating to each page, I want to scrape each page for code similar to <embed src="example"></embed>. I also want the script to use the page title or tags to identify each video when it outputs the URLs to a text file. Can someone point me towards a script I can customize (I can code Perl and PHP a tiny bit), or give me a quote on something? I have looked into BeautifulSoup with Python, which is supposed to work well, but I need something more tailored to this, as I don't want to spend a ton of time developing it.
Well, in my opinion, BeautifulSoup is the best way to do it. It's really simple to get info from HTML source. And Python itself is a very simple language, so if you know PHP/Perl, it shouldn't be a problem for you to code in Python. I don't think PHP is good for making such things, and Perl is (IMHO) much more complicated than Python.
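To give an idea of how little code that takes, here is a minimal sketch, assuming the third-party bs4 package (BeautifulSoup 4) is installed. The function name extract_embeds and the sample HTML are made up for illustration; it pulls the page title and every <embed src="..."> value out of one page's source.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_embeds(html):
    """Return (page title, list of <embed> src values) for one HTML page."""
    # "html.parser" is Python's built-in parser, so no extra parser is needed.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else "(no title)"
    # src=True keeps only <embed> tags that actually carry a src attribute.
    sources = [tag["src"] for tag in soup.find_all("embed", src=True)]
    return title, sources


# Hypothetical page source, standing in for a downloaded video page.
sample = """<html><head><title>Funny cat video</title></head>
<body><embed src="http://example.com/v/123.swf"></embed></body></html>"""

title, sources = extract_embeds(sample)
print(title, sources)
```

You would still need to download each page and feed its source to the function; this only covers the parsing part.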
But isn't it the convention to use SWFObject (or other JavaScript libraries) to do the Flash embedding dynamically? You may need to mechanize a browser in order to work with the generated DOM. There's JSBridge for Python (http://code.google.com/p/jsbridge/), which sets up a communications bridge using a Mozilla plugin over TCP, or you could use bindings for gtkmozembed (available in Perl, Ruby, Python). However, gtkmozembed doesn't have methods for communicating with the JavaScript engine, but I found it easy to hack my way around this by sending the browser to URIs using the javascript: scheme and then receiving results back (in JSON format) by setting document.title (and catching them with the title signal). Hope this helps. - Andy.
Now, while it may be best practice to use the SWF JavaScript code, the sites I'm looking to scrape links from are using standard <embed> tags for the media. That's why I was thinking I could just do a plain GET of the HTML and strip out those tags dynamically.
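For that plain fetch-and-extract approach, a rough sketch using only the Python standard library could look like the following. Everything named here (PageParser, fetch, crawl, save, the max_pages limit, the tab-separated output format) is an assumption for illustration, not an existing script. It walks same-host links from a start URL and records the page title next to each <embed> URL it finds.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects the page title, <embed src> values and <a href> links."""

    def __init__(self):
        super().__init__()
        self.title, self.embeds, self.links = "", [], []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "embed" and attrs.get("src"):
            self.embeds.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Plain HTTP GET; assumes UTF-8 pages, which may not hold for every site."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, get_page=fetch, max_pages=50):
    """Breadth-first walk of same-host links; returns (title, embed_url) pairs."""
    host = urlparse(start_url).netloc
    queue, seen, results, pages = [start_url], {start_url}, [], 0
    while queue and pages < max_pages:
        url = queue.pop(0)
        pages += 1
        try:
            html = get_page(url)
        except OSError:  # network errors from urlopen; skip the page
            continue
        parser = PageParser()
        parser.feed(html)
        for src in parser.embeds:
            results.append((parser.title.strip(), urljoin(url, src)))
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # resolve and drop fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return results


def save(results, path):
    """Write one 'title<TAB>embed_url' line per embed found."""
    with open(path, "w", encoding="utf-8") as out:
        for title, src in results:
            out.write(f"{title}\t{src}\n")


# Usage (hypothetical URL and file name):
# save(crawl("http://example.com/videos/"), "embeds.txt")
```

Note this only sees <embed> tags present in the raw HTML, so pages that insert the player with JavaScript would need the browser-based approach mentioned earlier in the thread.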