how to verify if a URL is a web page

adolix Peon

Messages:: 787

Likes Received:: 32

Best Answers:: 0

Trophy Points:: 0

#1

Hi,

I'm working on a new site, and I need to verify that a URL given by the user is a web page and not a PDF, MP3 or some other format. I have to work with the given URL, and I would rather not parse unsuitable files that inexperienced users might enter.

Thanks,
adolix

adolix, Dec 13, 2006 IP

T0PS3O Feel Good PLC

Messages:: 13,219

Likes Received:: 777

Best Answers:: 0

Trophy Points:: 0

#2

Look for a common HTML tag like <head>, <html>, <body>, <p>, <div>, <h1> etc. You can also check the extension of the file.

T0PS3O, Dec 13, 2006 IP

Barti1987 Well-Known Member

Messages:: 2,703

Likes Received:: 115

Best Answers:: 0

Trophy Points:: 185

#3

T0PS3O said: ↑

Look for a common HTML tag like <head>, <html>, <body>, <p>, <div>, <h1> etc. You can also check the extension of the file.

That won't work; not all HTML pages have HTML markup.

The only way I see is to download the file and then check the file type.

Peace,

Barti1987, Dec 13, 2006 IP

T0PS3O Feel Good PLC

Messages:: 13,219

Likes Received:: 777

Best Answers:: 0

Trophy Points:: 0

#4

That won't work, not all HTML pages have extensions that would indicate the filetype. And doing it manually would be no-go on bulk indexing.

I'd confidently say 99% of HTML pages contain HTML mark-up so it's a safe bet.

T0PS3O, Dec 13, 2006 IP

clancey Peon

Messages:: 1,099

Likes Received:: 63

Best Answers:: 0

Trophy Points:: 0

#5

The first step is to make sure the submitted URL does not contain the extension of a PDF, MP3 or similar file. This lets you tell the user up front that links must point to ordinary web pages and not to PDFs, music, video, graphics and similar files.
After that you need to do as azizny suggests: download and check the file type.
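The extension check in the first step could be sketched roughly like this. The function name `hasBlockedExtension` and the list of extensions are assumptions made for illustration, not something from this thread, and the type hints assume PHP 7 or later. It only looks at the path part of the URL, so a query string such as `?id=5` does not confuse it.

```php
<?php
// Hypothetical helper: returns true when the URL path ends in an extension
// that is already known not to be a web page.
// The list below is illustrative, not exhaustive.
function hasBlockedExtension(string $url): bool
{
    $blocked = ['pdf', 'mp3', 'avi', 'mpg', 'zip', 'jpg', 'jpeg', 'gif', 'png'];

    // Only the path matters; the query string and fragment are ignored.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // no path or malformed URL: nothing to judge by
    }

    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return in_array($ext, $blocked, true);
}
```

A URL that passes this check is not guaranteed to be HTML (many URLs have no extension at all), so it only works as a cheap first filter before the download step.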

clancey, Dec 13, 2006 IP

adolix Peon

Messages:: 787

Likes Received:: 32

Best Answers:: 0

Trophy Points:: 0

#6

Downloading is exactly what I want NOT to do, because if the link points to a 15 MB PDF, the user will wait a long time and then just be given an error.

Any other ideas?
Thanks

adolix, Dec 13, 2006 IP

T0PS3O Feel Good PLC

Messages:: 13,219

Likes Received:: 777

Best Answers:: 0

Trophy Points:: 0

#7

clancey has the best order:

1. Check extension
2. file_get_contents() (Or fopen and read in just N KB) and check for <html>
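
Step 2 with the "read in just N KB" variant might look something like the sketch below. The names `looksLikeHtml` and `readFirstBytes` and the 4096-byte default are assumptions for illustration, and the code assumes `allow_url_fopen` is enabled and PHP 7 or later. It stops reading after the first few KB, so a 15 MB file is never downloaded in full.

```php
<?php
// Hypothetical helper: true if the chunk contains a common HTML marker
// (<!DOCTYPE html>, <html>, <head> or <body>, case-insensitive).
function looksLikeHtml(string $chunk): bool
{
    return preg_match('/<(!doctype\s+html|html|head|body)\b/i', $chunk) === 1;
}

// Hypothetical helper: read at most $maxBytes from the start of the URL.
// Assumes allow_url_fopen is enabled; returns '' if the URL cannot be opened.
function readFirstBytes(string $url, int $maxBytes = 4096): string
{
    $handle = @fopen($url, 'rb');
    if ($handle === false) {
        return '';
    }

    $data = '';
    while (!feof($handle) && strlen($data) < $maxBytes) {
        $part = fread($handle, $maxBytes - strlen($data));
        if ($part === false || $part === '') {
            break; // read error or nothing more to read
        }
        $data .= $part;
    }
    fclose($handle);

    return $data;
}
```

Usage would be along the lines of `looksLikeHtml(readFirstBytes($url))`. As pointed out earlier in the thread, a page without any of these tags would be rejected, so this is a heuristic rather than a guarantee.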

T0PS3O, Dec 13, 2006 IP

crazyden Guest

Messages:: 15

Likes Received:: 0

Best Answers:: 0

Trophy Points:: 0

#8

You can use the cURL library. It lets you fetch only the headers for your request (to the page you want to verify); you can set this option with curl_setopt(). If that's not clear, write back and I will describe it in more detail.

By the way, PHP 5 has cURL bug fixes (announcement here)
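
A headers-only check with cURL could be sketched as follows. The function names `fetchContentType` and `isHtmlContentType`, the 10-second timeout and the redirect limit of 5 are assumptions for illustration; the cURL options themselves (`CURLOPT_NOBODY`, `CURLINFO_CONTENT_TYPE`) are standard. With `CURLOPT_NOBODY` the request is sent as HEAD, so the body of a large file is never transferred.

```php
<?php
// Hypothetical helper: true if a Content-Type header value names an HTML
// media type. Parameters such as "; charset=utf-8" are ignored.
function isHtmlContentType(?string $contentType): bool
{
    if ($contentType === null || $contentType === '') {
        return false;
    }
    $mediaType = strtolower(trim(explode(';', $contentType)[0]));
    return in_array($mediaType, ['text/html', 'application/xhtml+xml'], true);
}

// Hypothetical helper: send a HEAD request and return the Content-Type the
// server reports, or null if the request failed or no type was sent.
function fetchContentType(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // headers only, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // do not echo the response
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);          // assumed limit
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // assumed timeout in seconds

    $ok = curl_exec($ch);
    $type = ($ok === false) ? null : curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    return is_string($type) ? $type : null;
}
```

Usage would be something like `isHtmlContentType(fetchContentType($url))`. Some servers do not answer HEAD requests properly or report a wrong Content-Type, so it may be worth falling back to reading the first few KB of the body when the header is missing.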

crazyden, Dec 13, 2006 IP

adolix Peon

Messages:: 787

Likes Received:: 32

Best Answers:: 0

Trophy Points:: 0

#9

crazyden, cURL is exactly what I am using, because I need to search for a certain string in the file. Right now I am doing this:

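// Fetch the full body of $url and return it as a string (false on failure).
function parseit($url)
{
    set_time_limit(0); // no PHP execution time limit while downloading
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_MAXREDIRS, 100);       // give up after 100 redirects
    curl_setopt($ch, CURLOPT_URL, $url);
    $buffer = curl_exec($ch); // downloads the entire file, whatever its size
    return $buffer;
}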

But I would like to check whether the file really is a normal web page, and not a PDF, MP3, etc., without downloading the entire file, which can be 15 MB.

I am looking forward to your help.

thanks,
adolix

adolix, Dec 14, 2006 IP
