Hi, just for the sake of knowledge I'm trying to figure something out, and I hope you all can help. Suppose I have a website with registered publishers, and I give each publisher a JavaScript snippet to place on their site. I want to verify that the JavaScript is actually there on the publishers' websites, and I want to do it regularly, once a day, to confirm the code is still in place. My robot or bot (if those are the right terms) should be able to check for my JavaScript code on the different websites every day. What would I need to know to do this? Can you come up with suggestions or links to resources?
Your bot would read each publisher's URL from a database, request the URL, and scan the returned HTTP response for <script> tags. It would then extract the content of those tags and compare it with what you have on file for that publisher (you will need to normalize it first). If the database contains only site entry URLs and you want to check the entire website, you will also need to parse the links on each returned page and follow the internal ones, extracting the contents of every <script> tag you find. You can do this in any language that supports TCP/IP or HTTP: Java, C++, C#, VB, Perl, etc. J.D.
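A rough PHP sketch of that extract-normalize-compare step, since that's the language that comes up later in the thread. The function names and sample HTML are mine, not from the thread, and this only handles inline <script> bodies (a snippet loaded via src="..." would need a different check):

```php
<?php
// Sketch: pull out every inline <script> body from a page and compare
// each against the expected snippet after normalizing whitespace.

function normalize_js($code) {
    // Collapse runs of whitespace (including line endings) to one space.
    return trim(preg_replace('/\s+/', ' ', $code));
}

function page_contains_snippet($html, $expected) {
    // Grab the body of every <script>...</script> tag, attributes allowed.
    preg_match_all('#<script[^>]*>(.*?)</script>#is', $html, $matches);
    $want = normalize_js($expected);
    foreach ($matches[1] as $script) {
        if (normalize_js($script) === $want) {
            return true;
        }
    }
    return false;
}

// Illustrative page and snippet.
$html = '<html><head><script type="text/javascript">
    var pub = 42;   trackPageView(pub);
</script></head><body></body></html>';

$expected = "var pub = 42; trackPageView(pub);";
var_dump(page_contains_snippet($html, $expected)); // bool(true)
```

The normalization is what lets the check survive publishers reformatting the snippet without actually changing it.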
One thing about JavaScript, if I understand correctly: you can have it on the page and still have it never execute. You may want to think about that.
It won't get executed, yeah, but I'd think that's what he wants: to check that the source code he gave them is there on their pages verbatim. By the way, I'd recommend PHP/cURL. It would take you maybe eight lines of code.
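For what it's worth, a minimal version of that PHP/cURL fetch might look like the sketch below. The URL and snippet name are placeholders, and the timeout values are just reasonable guesses:

```php
<?php
// Sketch of the PHP/cURL approach: fetch a publisher page with sane
// timeouts, then check whether the snippet appears in the source.

function fetch_page($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);   // seconds to connect
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // seconds for whole request
    $body = curl_exec($ch);
    curl_close($ch);
    return $body === false ? null : $body;
}

$html = fetch_page('http://example.com/');
if ($html !== null && strpos($html, 'my-snippet.js') !== false) {
    echo "snippet found\n";
} else {
    echo "missing or unreachable\n";
}
```

A plain strpos() check only proves some matching text is present; the whitespace-normalized comparison discussed elsewhere in the thread is sturdier.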
No need for all that mess, maverick. Just include a tracker. For example, create a tracker page in PHP or Perl and put it on your server, then add a line to the JavaScript code you give your publishers that calls that page. I just woke up and my brain is slow, so I can't explain more. You can PM me and I will help you out.
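To flesh out the tracker idea a bit: the snippet you give publishers would request your tracker page, e.g. with a line like `(new Image()).src = 'http://yoursite.com/tracker.php?pid=PUBLISHER_ID';`, and each hit tells you the snippet executed on that publisher's page. A sketch of the PHP side, with made-up names and a log file standing in for the real database:

```php
<?php
// Sketch of a tracker page (tracker.php). Each request records
// "publisher $pid's page executed the snippet today".

function record_hit($pid, $referer, $logfile) {
    // Append one line per hit: date, publisher id, referring page.
    if ($pid <= 0) {
        return false;
    }
    $line = date('Y-m-d') . "\t" . $pid . "\t" . $referer . "\n";
    return file_put_contents($logfile, $line, FILE_APPEND) !== false;
}

$pid     = isset($_GET['pid']) ? (int) $_GET['pid'] : 0;
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '-';
record_hit($pid, $referer, '/tmp/tracker.log');

// Respond with a 1x1 transparent GIF so the image request completes.
header('Content-Type: image/gif');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
```

Note the trade-off versus crawling: a tracker proves the script *executed* somewhere, but a publisher could call the tracker without serving your exact snippet, so it answers a slightly different question than scanning their source.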
If you're going to do this and have a lot of people using your code, you're going to have to do it quickly; use some kind of data structure, maybe trees.
Is that a challenge?

mysql_connect(DB_HOST, DB_USER, DB_PASS);
mysql_select_db(DB_NAME);
$result = mysql_query('SELECT publisher_id, url FROM publishers');
while ($publisher = mysql_fetch_assoc($result)) {
    // JAVASCRIPT is a constant holding the expected snippet
    echo $publisher['publisher_id'] . ': '
       . (strpos(file_get_contents($publisher['url']), JAVASCRIPT) !== false
           ? 'Validates' : 'Failed')
       . '<br />';
}

Yeah, it's messy, but still lines to spare.
I recommend Perl & LWP, or Spidering Hacks from O'Reilly. You can check them out by signing up for a free trial at safari.oreilly.com (it's about 15 days). Both will give you enough sample code to get started. I have both books, but I recently subscribed to Safari anyway; it makes my life easier to be able to search through the books.
This doesn't do the job, though. The idea was to validate the script, not just to make sure there's a script on the page; that is, to make sure publishers use the provided script and to catch those who only pretend to. You would have to scan the script, ignore differences in line endings and whitespace, and compare it to the one that is correct for that publisher or publisher group. You also need to handle errors properly, or some publishers may get away with not using the script, or get penalized for no reason. You also have to implement a retry policy, in case a publisher is temporarily offline for some technical reason. And you will have to store the results somewhere so you can pinpoint the publishers that fail. Can you do all that in a dozen *readable* lines? J.D. P.S. Don't waste your time; it's not a challenge. It's simply not a 10-line project.
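The retry part, at least, is easy to sketch in PHP. The function name, attempt count, and delay below are mine; the validator it feeds (the fetch-and-compare step discussed above) is left out:

```php
<?php
// Sketch of a retry policy: try each publisher a few times before
// marking it as failed, so a temporary outage isn't penalized.

function fetch_with_retries($url, $attempts = 3, $delay_seconds = 60) {
    for ($i = 0; $i < $attempts; $i++) {
        $body = @file_get_contents($url);   // @ suppresses fetch warnings
        if ($body !== false) {
            return $body;                   // got a response; validate it
        }
        if ($i < $attempts - 1) {
            sleep($delay_seconds);          // wait before the next attempt
        }
    }
    return null;                            // all attempts failed
}
```

A real run would then record each outcome (publisher id, timestamp, pass/fail) in a results table, so the failures can be reviewed instead of scrolling off the screen.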
20 then, maybe... I wasn't bragging; I was trying to point out that PHP is ideally suited to this kind of thing. In C++ it would be significantly more bloated, I imagine, unless you used .NET's regular expression functionality. Actually, Perl would be just as good (duh). Hehe.
Thanks, all, for your input. I'll check out each post tomorrow due to some time constraints. I may come back with some questions, I think.