I'm slowly building a database of sites to collect information that could be useful in the future for spotting trends in technology and other things. I started it just to see how many people use WordPress vs. Blogger. What I am asking you is: do you know what information I could gather automatically from a site? I've already asked many people, but most of them don't have a "programmer mind", so they don't get what I'm asking. An example of what I'm reading right now is <meta name="generator" content="WordPress|Blogger">. I'm asking whether you think there is anything else useful to get from the code. Don't say email addresses. I'm not a spammer.
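For anyone who wants to try the generator check described above, here is a minimal sketch in Python using only the standard library. The names GeneratorFinder and find_generators are made up for this example, and it assumes you already have the page's HTML as a string. It only shows the idea and is not a robust crawler.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collects the content of every <meta name="generator"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "generator":
            self.generators.append(attr.get("content", ""))


def find_generators(html):
    """Return the generator strings found in an HTML document."""
    finder = GeneratorFinder()
    finder.feed(html)
    return finder.generators


if __name__ == "__main__":
    page = '<head><meta name="generator" content="WordPress 2.3" /></head>'
    print(find_generators(page))
```

Not every site emits a generator tag, and some themes strip it, so an empty result does not prove anything about the platform.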
You can gather whatever information you want from a site automatically. Getting it directly from the HTML output is easy; other elements can be more difficult.
Thanks, but what I'd like to know is what you consider useful to gather. I know it's easy with HTML. Actually, I was wondering if there is an easy way to get a "visual" representation of the page, so I can ask for distances between words in mm/inches.
Well, if you were using a desktop application, you could use a hidden Internet Explorer ActiveX component, load the page in it, take a screenshot of the Internet Explorer frame, and then analyze it from there... Or perhaps you could inject two invisible <div> tags at the positions you want to analyze, set their style.position to absolute, and then read their offsetLeft and offsetTop properties to get the values (style.left and style.top only return values you have set yourself, not the rendered position).
You cannot get the distance between words in mm, as this depends on screen resolution, font size and word-spacing settings. Font sizes may well be relative, in which case the result also depends on the text size set in the browser.
Right, but you can still assume some default settings. With those settings you get a distance that can be useful for extracting information from the site. You don't need absolute precision, just a rough idea. Anyway, I don't think such a library exists. It's better to think of other things to look for, like FeedBurner chiclets.
It would be possible to estimate it by looking at the font size and the number of spaces, and from these the pixel distance. It is then up to you to decide which resolution and screen size to base the estimate on.
I was wondering why no one has already done it. It could be useful for detecting spam/MFA (made-for-AdSense) sites. I think that Google is probably working on something like that.
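A rough sketch of that kind of estimate, in Python. Every default here is an assumption and not a measured value: a 16 px font size, an average character width of half an em, and a 96 DPI display. The function names are invented for this example. It also ignores line wrapping, so it only makes sense for words on the same line.

```python
def estimate_distance_px(text_between, font_size_px=16.0, avg_char_em=0.5):
    """Estimate the horizontal distance covered by a run of text, in pixels.

    Assumes every character (spaces included) is avg_char_em ems wide,
    which is only a crude average for proportional fonts.
    """
    return len(text_between) * font_size_px * avg_char_em


def px_to_mm(pixels, dpi=96.0):
    """Convert pixels to millimetres for an assumed display density."""
    return pixels / dpi * 25.4


if __name__ == "__main__":
    gap = estimate_distance_px(" is a popular blogging ")
    print(gap, "px is roughly", round(px_to_mm(gap), 1), "mm")
```

Changing the defaults shows how sensitive the result is, which is exactly the objection raised earlier in the thread.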
Just curious, could somebody give me a practical example of when distances between words on a webpage could be useful?
You can use PHP to fetch data from websites (note that the URL needs the http:// scheme, otherwise PHP looks for a local file): <?php echo file_get_contents('http://www.google.com'); ?>
Distances between words could be useful for detecting "bad" sites. If you have a spammy site that repeats the same word 20 times and a good site that repeats the same word 20 times, how do you detect which one is spam? If the words are too close together or if they appear at a fixed distance, it's very likely a spam site. It could also help a search engine see how two words are related. How many times have you searched for two words on Google, clicked the first link, and discovered that both words exist on the page but are very far apart (and unrelated to each other)?
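As a rough illustration of both ideas, here is a Python sketch that measures distance in words (token positions) instead of mm, since that is much easier to get from the HTML text. This swaps in a simpler measure than the visual distance discussed above. The function names are invented for this example, and what counts as "too regular" or "too far apart" is left for you to decide.

```python
import re
from statistics import pstdev


def word_positions(text, word):
    """Return the token indexes at which `word` occurs (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    return [i for i, tok in enumerate(tokens) if tok == word.lower()]


def gap_stats(text, word):
    """Summarize the gaps between consecutive occurrences of `word`.

    A small mean gap or a standard deviation near zero (fixed spacing)
    is the pattern described above as suspicious.
    """
    positions = word_positions(text, word)
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return None  # fewer than two occurrences, nothing to measure
    return {
        "count": len(positions),
        "mean_gap": sum(gaps) / len(gaps),
        "gap_stdev": pstdev(gaps),
    }


def min_distance(text, first, second):
    """Smallest distance, in words, between any occurrence of two words."""
    a = word_positions(text, first)
    b = word_positions(text, second)
    if not a or not b:
        return None
    return min(abs(i - j) for i in a for j in b)


if __name__ == "__main__":
    spammy = "cheap pills buy cheap pills now cheap pills here"
    print(gap_stats(spammy, "cheap"))
    print(min_distance("red apple and a green car", "red", "car"))
```

A real page would need its tags stripped first, and common words would have to be ignored, otherwise every page looks repetitive.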