Get data from sites

ma0 Peon

Messages: 218

Likes Received: 5

Best Answers: 0

Trophy Points: 0

#1

I'm slowly building a database of sites to gather information that could be useful in the future for spotting trends in technology and other things. I started it just to see how many people use WordPress vs. Blogger.

What I am asking is:
Do you know what information I could gather automatically from a site? I've already asked many people, but most of them don't have a "programmer mind", so they don't get what I'm asking.
An example of what I'm reading right now is the generator meta tag:

<meta name="generator" content="WordPress|Blogger">

I'm asking whether you think there is anything else useful to get from the code.
Don't say email addresses. I'm not a spammer.
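
For reference, here is a minimal sketch of reading the generator tag with Python's standard html.parser module. The names GeneratorFinder and detect_platform are made up for illustration, and the sample HTML string is invented. It is not the exact code behind the database, just one way it could be done.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collects the text of every <meta name="generator"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "generator":
            # The standard attribute is "content"; "value" is only a fallback
            # for pages that use the non-standard form.
            text = attr.get("content") or attr.get("value")
            if text:
                self.generators.append(text.strip())


def detect_platform(html):
    """Return 'wordpress', 'blogger', or None based on the generator tag."""
    finder = GeneratorFinder()
    finder.feed(html)
    for generator in finder.generators:
        lowered = generator.lower()
        for platform in ("wordpress", "blogger"):
            if platform in lowered:
                return platform
    return None


# Invented sample page, for illustration only.
sample = '<html><head><meta name="generator" content="WordPress 2.2.1"></head></html>'
print(detect_platform(sample))
```

Sites can remove or fake this tag, so a missing result only means the tag was not found, not that the platform is something else.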

ma0, Aug 7, 2007

AstarothSolutions Peon

Messages: 2,680

Likes Received: 77

Best Answers: 0

Trophy Points: 0

#2

You can gather whatever information you want from a site automatically. Anything directly in the HTML output is easy; other elements can be more difficult.

AstarothSolutions, Aug 7, 2007

ma0 Peon

#3

Thanks, but what I'd like to know is what you consider useful to gather.

I know it's easy with HTML. Actually, I was wondering whether there is an easy way to get a "visual" representation of the page, so I can ask for distances between words in mm/inches.

ma0, Aug 8, 2007

domado16 Active Member

Messages: 152

Likes Received: 1

Best Answers: 0

Trophy Points: 53

#4

Well, if you were using a desktop application, you could use a hidden Internet Explorer ActiveX component, load the page in there, take a screenshot of the Internet Explorer frame, and analyze it from there. Or perhaps you could inject two invisible <div> tags at the positions you want to analyze, set their style.position to absolute, and then read their offsetLeft and offsetTop values (style.left and style.top only return values you have set yourself, not the rendered position).

domado16, Aug 8, 2007

AstarothSolutions Peon

#5

You cannot get the distance between words in mm, as this depends on screen resolution, font size, and word-spacing settings. Font sizes may well be relative, in which case the result also depends on the text size set in the browser.

AstarothSolutions, Aug 8, 2007

ma0 Peon

#6

Right, but you can still assume some default settings. With those settings you can get a distance that is useful for extracting information from the site. You don't need absolute precision, just a rough idea.
Anyway, I don't think such a library exists.

It's better to think of other things to look for, like FeedBurner chicklets.

ma0, Aug 9, 2007

AstarothSolutions Peon

#7

It would be possible to estimate it by looking at the font size and the number of spaces, and deriving the pixel distance from those. It is then up to you to decide which resolution and screen size to assume for the estimate.
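
As a rough sketch of that estimate in Python: the function names are hypothetical, and the defaults are assumptions (a 16 px font size, an average character width of half the font size, and 96 pixels per inch). It only makes sense for two words on the same line, since line wrapping is ignored.

```python
def estimate_distance_px(chars_between, font_size_px=16.0, avg_char_width_em=0.5):
    """Estimate the horizontal gap in pixels for a run of characters.

    avg_char_width_em is an assumed average glyph width as a fraction of
    the font size; real fonts vary, so this is only a ballpark figure.
    """
    return chars_between * font_size_px * avg_char_width_em


def px_to_mm(pixels, dpi=96.0):
    """Convert pixels to millimetres for an assumed display density."""
    return pixels / dpi * 25.4


# Invented sample text, for illustration only.
text = "cheap widgets are sold here"
start = text.index("cheap") + len("cheap")  # first character after "cheap"
end = text.index("sold")                    # first character of "sold"
gap_px = estimate_distance_px(end - start)
print(gap_px, round(px_to_mm(gap_px), 1))
```

Changing font_size_px or dpi is how you would plug in a different assumed screen setup.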

AstarothSolutions, Aug 9, 2007

ma0 Peon

#8

I was wondering why no one has done it already. It could be useful for detecting spam/MFA sites.

I think that Google is probably working on something like that.

ma0, Aug 10, 2007

henryb Member

Messages: 65

Likes Received: 2

Best Answers: 0

Trophy Points: 43

#9

Just curious: could somebody give me a practical example of when distances between words on a web page could be useful?

henryb, Aug 10, 2007

exesteam Guest

Messages: 27

Likes Received: 1

Best Answers: 0

Trophy Points: 0

#10

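You can use PHP to fetch data from websites:
<?php
// file_get_contents() needs a full URL including the scheme; without
// "http://" PHP looks for a local file. Requires allow_url_fopen to be on.
echo file_get_contents('http://www.google.com');
?>

exesteam, Aug 10, 2007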

ma0 Peon

#11

Distances between words could be useful for detecting "bad" sites. If a spammy site and a good site both contain the same word 20 times, how do you detect which one is spam? If the occurrences are too close together or at a fixed distance from each other, it's 90% likely a spam site.
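
A minimal sketch of that check in Python, measuring distance in words rather than mm since that needs no rendering. The function names and all thresholds (min_hits, max_mean_gap, max_spread) are arbitrary assumptions picked for illustration, not tested values.

```python
import re
from statistics import mean, pstdev


def word_gaps(text, word):
    """Return the distances, in words, between consecutive occurrences of word."""
    tokens = re.findall(r"\w+", text.lower())
    positions = [i for i, tok in enumerate(tokens) if tok == word.lower()]
    return [b - a for a, b in zip(positions, positions[1:])]


def looks_stuffed(text, word, min_hits=5, max_mean_gap=4.0, max_spread=0.5):
    """Flag text where word repeats too closely or at a near-fixed distance.

    The thresholds are guesses; they would need tuning on real pages.
    """
    gaps = word_gaps(text, word)
    if len(gaps) + 1 < min_hits:
        return False  # too few occurrences to judge
    too_close = mean(gaps) <= max_mean_gap
    too_regular = pstdev(gaps) <= max_spread
    return too_close or too_regular


# Invented sample text, for illustration only.
spam = "buy pills now " * 10
print(word_gaps(spam, "pills")[:3], looks_stuffed(spam, "pills"))
```

This works on plain text, so the HTML tags would have to be stripped from the page first.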

It could also be useful for a search engine to see how two words are related. How many times have you searched for two words on Google, clicked the first result, and discovered that both words exist on the page but are far apart from (and unrelated to) each other?
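
For the two-word case, here is a small Python sketch that returns the smallest distance, in words, between any occurrence of one word and any occurrence of the other. The name min_word_distance is hypothetical, and it compares every pair of positions, which is fine for a single page but not tuned for speed.

```python
import re


def min_word_distance(text, first, second):
    """Return the smallest gap in words between first and second, or None.

    None means at least one of the two words does not appear in the text.
    """
    tokens = re.findall(r"\w+", text.lower())
    pos_first = [i for i, tok in enumerate(tokens) if tok == first.lower()]
    pos_second = [i for i, tok in enumerate(tokens) if tok == second.lower()]
    if not pos_first or not pos_second:
        return None
    return min(abs(a - b) for a in pos_first for b in pos_second)


# Invented sample text, for illustration only.
page = "red apple pie and green apple"
print(min_word_distance(page, "red", "apple"))
```

A large result would suggest the two search terms are on the page but probably unrelated.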

ma0, Aug 12, 2007
