Crawling text from another website

Discussion in 'PHP' started by khan11, Apr 3, 2012.

  1. #1
    Hello,

    I need some help. I am creating an educational website where I will keep information about universities. I would like to create a script that visits the university websites and fetches the admission date from a specified page.

    Could someone please guide me on how I can achieve this?


    Thanks
    khan11, Apr 3, 2012 IP
  2. denisjames

    denisjames Peon

    Messages:
    46
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #2
    You need a good developer to build such a script. You can find free scripts, but I wouldn't count on them.
     
    denisjames, Apr 3, 2012 IP
  3. khan11

    khan11 Active Member

    Messages:
    615
    Likes Received:
    15
    Best Answers:
    0
    Trophy Points:
    58
    #3
    Thanks. Can't I use curl_init() to get the text?
     
    Last edited: Apr 3, 2012
    khan11, Apr 3, 2012 IP
  4. BMR777

    BMR777 Well-Known Member

    Messages:
    145
    Likes Received:
    8
    Best Answers:
    1
    Trophy Points:
    140
    #4
    If I were to do this, I would use something like PHP's file_get_contents() to read the content of the web page. Then I would use something like stristr() to get any content that comes after phrases such as "Admission Date", "Start Date", etc.

    Then once you have everything past "Admission Date" it's a matter of checking the rest of the string for the next available date. You'll have to do some logic to grab the date, and you may not know if you're going to get a date like 03-30-2012 or March 30th, 2012 so you'll have to do something to account for all of the variations in dates out there.
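
    A minimal sketch of the idea above, assuming PHP 7.1 or later. The helper name find_date_after() and the three date patterns are illustrative assumptions, not a complete list of formats; a literal string stands in for the page that file_get_contents() would return.

```php
<?php
// Hypothetical helper: find the first date that follows a label such as "Admission Date".
function find_date_after(string $text, string $label): ?string
{
    // stristr() is case-insensitive and returns the text from the label onward,
    // or false when the label is not present.
    $tail = stristr($text, $label);
    if ($tail === false) {
        return null;
    }

    // A few common layouts; this list is an assumption and far from complete.
    $patterns = [
        '/\b\d{1,2}[-\/]\d{1,2}[-\/]\d{4}\b/',                      // 03-30-2012 or 03/30/2012
        '/\b[A-Za-z]{3,9}\s+\d{1,2}(?:st|nd|rd|th)?,?\s+\d{4}\b/',  // March 30th, 2012
        '/\b\d{1,2}\s+[A-Za-z]{3,9}\s+\d{4}\b/',                    // 30 March 2012
    ];

    // Keep the match that starts earliest in the remaining text.
    $best = null;
    $bestOffset = PHP_INT_MAX;
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $tail, $m, PREG_OFFSET_CAPTURE) && $m[0][1] < $bestOffset) {
            $best = $m[0][0];
            $bestOffset = $m[0][1];
        }
    }
    return $best;
}

// In real use $html would come from file_get_contents($url).
$html = '<p>Admission date: March 30th, 2012. Classes start 04-15-2012.</p>';
$date = find_date_after(strip_tags($html), 'admission date');
```

    Trying every pattern and keeping the earliest match avoids picking up a later date just because its format happens to be checked first.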
     
    BMR777, Apr 3, 2012 IP
    khan11 likes this.
  5. chanif.alfath

    chanif.alfath Active Member

    Messages:
    148
    Likes Received:
    4
    Best Answers:
    0
    Trophy Points:
    88
    #5
    Try using the simple_html_dom parser class.

    I'm using it for a lot of data scraping projects,
    it's very easy to use :)
     
    chanif.alfath, Apr 4, 2012 IP
    khan11 likes this.
  6. khan11

    khan11 Active Member

    Messages:
    615
    Likes Received:
    15
    Best Answers:
    0
    Trophy Points:
    58
    #6
    Thank you guys. I'll try these and post the results. Sorry for the late reply.
     
    khan11, Apr 5, 2012 IP
  7. stephan2307

    stephan2307 Well-Known Member

    Messages:
    1,268
    Likes Received:
    30
    Best Answers:
    7
    Trophy Points:
    150
    #7
    I do a lot of scraping and I would suggest the following.

    1) Use curl to get the page.
    2) Check the HTML and run a simple preg_match on the date plus a little of the markup before and after it.
    3) If it can't find the date, send yourself an email so you can check whether the website layout has changed.

    I would keep it really simple because you will have to custom code the script for every website as every website is structured differently.

    If you are smart then you can create a small system that lets you reuse as much of the code as possible.
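
    The three steps above could be sketched roughly as follows, assuming PHP 7.1 or later with the curl extension. The function names, the $sites array, the example URL, the pattern and the email address are all hypothetical placeholders; only the URL and pattern would change from site to site.

```php
<?php
// Step 1: fetch a page with curl. Returns null on any failure.
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => 15,    // give up after 15 seconds
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Step 2: apply a site-specific pattern; capture group 1 holds the date.
function extract_date(string $html, string $pattern): ?string
{
    return preg_match($pattern, $html, $m) ? trim($m[1]) : null;
}

// Hypothetical per-site configuration.
$sites = [
    'example-university' => [
        'url'     => 'http://www.example.edu/admissions',
        'pattern' => '/Last date to apply:\s*<b>([^<]+)<\/b>/i',
    ],
];

// Steps 1 to 3 combined for one site.
function check_site(string $name, array $site): ?string
{
    $html = fetch_page($site['url']);
    $date = $html === null ? null : extract_date($html, $site['pattern']);
    if ($date === null) {
        // Step 3: alert yourself; the address is a placeholder.
        mail('you@example.com', "Scraper: no date found for $name", $site['url']);
    }
    return $date;
}

// Offline demonstration of step 2 on a sample snippet.
$sample = '<p>Last date to apply: <b>20 April 2012</b></p>';
$found = extract_date($sample, $sites['example-university']['pattern']);
```

    Keeping the fetch and email logic shared and putting only the URL and pattern in the configuration is one way to reuse most of the code across sites.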
     
    stephan2307, Apr 5, 2012 IP
    khan11 likes this.
  8. dropdrop

    dropdrop Active Member

    Messages:
    142
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    53
    #8
    Yes, I agree with this answer. I think this way is the simplest and best.
     
    dropdrop, Apr 6, 2012 IP
  9. webdove

    webdove Peon

    Messages:
    58
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #9
    You can use curl to get the content directly from their website, and you can then store it in a MySQL table so your website can fetch the data later.
     
    webdove, Apr 6, 2012 IP
    khan11 likes this.
  10. khan11

    khan11 Active Member

    Messages:
    615
    Likes Received:
    15
    Best Answers:
    0
    Trophy Points:
    58
    #10
    OK guys, this is the approach I ended up using.

    I used Simple HTML DOM to fetch the plain text from the website. Then I cut the text at a word that appears just before the date, and used preg_match on the remainder to pull out the required date.

    Here's the code:
    
    include('../simple_html_dom.php');

    $url = "";           // page to scrape
    $text_check = "";    // text that appears just before the date, e.g. "Added on"
    $format = 0;         // 1 = "September 01, 2003", 2 = "05 Mar 2000"
    $pattern = "";       // regular expression for the chosen date format
    $display_date = "none";

    // form data
    if(isset($_POST['submit']))
    {
    	$url = $_POST['url'];
    	$text_check = $_POST['text'];
    	$format = (int) $_POST['format'];
    	// end form data

    	// Example pages:
    	// http://www.ilmkidunya.com/admission_notices/admission-in-fsc-icom-ics-9356.aspx
    	// http://www.jobz.pk/opf-girls-college-islamabad-admissions-_admissions-95.html

    	$html = file_get_html($url);   // may return false if the page cannot be loaded
    	$plain_text = $html ? $html->plaintext : "";
    	//echo $plain_text;

    	// keep only the text from $text_check onwards (false if it is not found)
    	$after_pruning = strstr($plain_text, $text_check);

    	if($format == 1)
    	{
    		$pattern = "/[a-z]+\s+[0-9]{1,2}[,\s]+[0-9]{1,4}/i"; // September 01, 2003
    	}
    	elseif($format == 2)
    	{
    		$pattern = "/[0-9]{1,2}\s+[a-z0-9]{3}\s+[0-9]{1,4}/i"; // 05 Mar 2000
    	}

    	// only search when a pattern was chosen and the marker text was found
    	if($pattern !== "" && $after_pruning !== false && preg_match($pattern, $after_pruning, $res))
    	{
    		$display_date = $res[0];
    	}
    }
    

    Can you please tell me if there is a better approach than this? Also, I have a small problem: if the website has two dates side by side, how do I fetch the second date? My approach works for a date that has some alphabetic text beside it. But what if the dates look like this: july 20, 2010 july 10, 2012

    Thanks
     
    Last edited: Apr 7, 2012
    khan11, Apr 7, 2012 IP
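
    For the side-by-side dates question in the last post, one possible approach is to swap preg_match for preg_match_all, which collects every match rather than only the first. This is a sketch assuming PHP 7 or later; the helper name find_all_dates() is hypothetical and the pattern covers only the "Month DD, YYYY" layout.

```php
<?php
// Collect every "Month DD, YYYY" date in the text, in order of appearance.
function find_all_dates(string $text): array
{
    $pattern = '/[a-z]{3,9}\s+[0-9]{1,2},?\s+[0-9]{4}/i';
    preg_match_all($pattern, $text, $matches);
    return $matches[0]; // full matches only
}

$dates = find_all_dates('Admissions open july 20, 2010 july 10, 2012 for all programs');
$second = $dates[1] ?? null; // null when fewer than two dates are present
```

    With the sample text from the question, $dates holds both dates and $second is the later one in the text.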