Is there any code to extract URLs without getting broken links?

Discussion in 'PHP' started by ramysarwat, Sep 3, 2010.

  1. MyVodaFone

    MyVodaFone Well-Known Member

    Messages:
    1,048
    Likes Received:
    42
    Best Answers:
    10
    Trophy Points:
    195
    #2
    What do you mean: links that are not working, or links that are incomplete when you extract them?

    Try
    urlencode();
    PHP:
     
    MyVodaFone, Sep 3, 2010 IP
  2. ramysarwat

    ramysarwat Peon

    Messages:
    164
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #3
    I mean broken URLs like

    /language_tools?hl=en

    In this example I don't have the full URL.
     
    ramysarwat, Sep 3, 2010 IP
  3. MyVodaFone

    MyVodaFone Well-Known Member

    #4
    Yeah, well then you just use
    
    urlencode('http://www.google.ie/search?q=digitalpoint&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a');
    
    PHP:
    When that gets passed through urlencode(), it looks like this:
    
    http%3A%2F%2Fwww.google.ie%2Fsearch%3Fq%3Ddigitalpoint%26ie%3Dutf-8%26oe%3Dutf-8%26aq%3Dt%26rls%3Dorg.mozilla%3Aen-GB%3Aofficial%26client%3Dfirefox-a 
    
    PHP:
    That is what your script needs to keep the URL together, so to speak.

    You can store the URLs encoded or decoded, depending on how you want to look at them. urldecode() turns an encoded URL back into the original.
     
    MyVodaFone, Sep 3, 2010 IP
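
A minimal sketch of the urlencode()/urldecode() round trip described above. The shortened URL is only an illustration. Note that encoding changes how a URL is written, not what it points to, so it will not turn a relative link such as /language_tools?hl=en into a full URL.

```php
<?php
// Illustrative URL (a shortened form of the one in the post above).
$url = 'http://www.google.ie/search?q=digitalpoint&ie=utf-8';

// urlencode() percent-encodes reserved characters such as ':', '/', '?', '=' and '&'.
$encoded = urlencode($url);

// urldecode() reverses the encoding and gives back the original string.
$decoded = urldecode($encoded);

echo $encoded, "\n"; // http%3A%2F%2Fwww.google.ie%2Fsearch%3Fq%3Ddigitalpoint%26ie%3Dutf-8
echo $decoded, "\n"; // http://www.google.ie/search?q=digitalpoint&ie=utf-8
```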
  4. danx10

    danx10 Peon

    Messages:
    1,179
    Likes Received:
    44
    Best Answers:
    2
    Trophy Points:
    0
    #5
    @MyVodaFone

    I believe (if I understood correctly) the OP wants a way to extract links that contain the full URL (i.e. including the host/domain), and not just paths to pages.
     
    danx10, Sep 3, 2010 IP
  5. ThePHPMaster

    ThePHPMaster Well-Known Member

    Messages:
    737
    Likes Received:
    52
    Best Answers:
    33
    Trophy Points:
    150
    #6
    This should pick up all pages, scripts, etc.:

    
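    // Download the page whose links we want to extract.
    $content = file_get_contents('http://forums.digitalpoint.com/showthread.php?t=1927248');
    // Capture the value of every double-quoted href attribute (U = ungreedy).
    preg_match_all('/href="(\/?.*)"/Uism',$content,$results);
    // $results[1] holds the captured href values.
    print_r($results[1]);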
    
    PHP:
     
    ThePHPMaster, Sep 3, 2010 IP
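
The regular expression above only matches double-quoted href attributes, and it matches them on any tag. As an alternative sketch (a swapped-in technique, not what the poster used), PHP's DOMDocument can parse the markup and return the href of each `<a>` element, including single-quoted ones. The function name `extractHrefs` and the sample HTML are hypothetical, and the DOM extension is assumed to be enabled.

```php
<?php
// Hypothetical helper: returns the href value of every <a> element in $html.
function extractHrefs(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip anchors that have no href attribute (e.g. <a name="top">).
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}

// Made-up sample markup; in practice this would come from file_get_contents().
$html = <<<'HTML'
<p>
  <a href="/preferences?hl=en">Preferences</a>
  <a href='http://www.google.com/'>Home</a>
  <a name="top">No href here</a>
</p>
HTML;

$hrefs = extractHrefs($html);
print_r($hrefs); // two entries: the root-relative link and the absolute one
```

The links come back exactly as written in the page, so relative links are still relative at this point.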
  6. ramysarwat

    ramysarwat Peon

    #7
    Good code, but I still get broken URLs.
     
    ramysarwat, Sep 3, 2010 IP
  7. ramysarwat

    ramysarwat Peon

    #8
    This is exactly what I mean.

    So instead of getting the link like this: "/preferences?hl=en"

    I want to get the link like this: "http://www.google.com/preferences?hl=en"
     
    ramysarwat, Sep 3, 2010 IP
  8. danx10

    danx10 Peon

    #9
    OK, how exactly are you getting these links? From a remote site, or what?
     
    danx10, Sep 3, 2010 IP
  9. MyVodaFone

    MyVodaFone Well-Known Member

    #10
    Looks like he's only getting the query string up to the &, which is why I thought he needed urlencode(), although it's not very clear what the OP wants or is getting from his script.

    @ramysarwat please post your code here, so people can help you more efficiently.
     
    MyVodaFone, Sep 3, 2010 IP
  10. ThePHPMaster

    ThePHPMaster Well-Known Member

    #11
    I get it now. You will have to use this, then:

    
    
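    $link = 'http://forums.digitalpoint.com/showthread.php?t=1927248';
    // Scheme plus host: everything before the first '/' that follows "http://".
    $hostURL = substr($link,0,strpos($link,'/',10));
    
    $content = file_get_contents($link);
    preg_match_all('/href="(\/?.*)"/Uism',$content,$results);
    
    foreach($results[1] as &$curLink)
    {
    	// Leave links that already start with http:// or https:// alone.
    	if(!preg_match('#^https?://#i',$curLink))
    	{
    		// ltrim() avoids a double slash when the link already starts with '/'.
    		$curLink = $hostURL.'/'.ltrim($curLink,'/');
    	} 
    }
    unset($curLink); // break the reference left over from the foreach
    print_r($results[1]);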
    
    PHP:
     
    ThePHPMaster, Sep 3, 2010 IP
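
Prepending the host works for root-relative links such as /preferences?hl=en. Links like member.php?u=1 (relative to the current directory) or //host/path (protocol-relative) need a little more care. The sketch below shows one way to handle those cases with parse_url(). The function name `toAbsolute` is hypothetical, `$base` is assumed to be a well-formed absolute URL, and this is a simplification rather than a full RFC 3986 resolver ("../" segments are left untouched).

```php
<?php
// Hypothetical helper: turn $href into an absolute URL, using $base as the page it was found on.
function toAbsolute(string $href, string $base): string
{
    // Already absolute (has a scheme such as http: or mailto:): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $href)) {
        return $href;
    }

    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host']
           . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path  = $parts['path'] ?? '/';

    // Empty or fragment-only link: same document, so keep the base path and query.
    if ($href === '' || $href[0] === '#') {
        return explode('#', $base, 2)[0] . $href;
    }
    // Protocol-relative link (//example.com/page): reuse the base scheme.
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;
    }
    // Root-relative link (/preferences?hl=en): append to scheme and host.
    if ($href[0] === '/') {
        return $root . $href;
    }
    // Query-only link (?page=2): keep the base path, replace the query.
    if ($href[0] === '?') {
        return $root . $path . $href;
    }
    // Document-relative link: replace everything after the last '/' of the base path.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

echo toAbsolute('/preferences?hl=en', 'http://www.google.com/'), "\n";
// http://www.google.com/preferences?hl=en
echo toAbsolute('member.php?u=1', 'http://forums.digitalpoint.com/showthread.php?t=1927248'), "\n";
// http://forums.digitalpoint.com/member.php?u=1
```

Each extracted href can be passed through such a helper together with the URL of the page it came from.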
  11. ramysarwat

    ramysarwat Peon

    #12
    This is exactly what I needed. Thank you very much, "ThePHPMaster", and thank you very much, everybody, for the help.
     
    ramysarwat, Sep 4, 2010 IP