Retrieve website's sitemap programmatically C# Windows.Forms

Discussion in 'Programming' started by andreimc, Aug 6, 2010.

  1. #1
    Hello,

    Is there a way to browse the full content of a website in C#?
    I mean, I do know how to retrieve the homepage of a website and put its contents into a string using WebRequest/WebResponse objects, but I need to find a way to retrieve ALL pages of a website and put them in an array.

    I struggled a bit with this and realised that I could make use of a site's sitemap file and parse it - then I would have a list of all pages and their paths. But how could I retrieve the sitemap of a website? I can't just assume the sitemap file is located in the root folder:

    http://www.example.com/sitemap.xml

    .. that would be hardcoding, which I don't want.

    So my question is: could I locate the sitemap file of a website programmatically?

    Thank you for any suggestion in advance.
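    One convention I came across (widely followed, but not guaranteed) is that a site's robots.txt can declare sitemap locations with a `Sitemap:` directive. A minimal sketch of that idea could look like this - the `SitemapLocator` name and its methods are just illustrative, not an existing API:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Net;

    static class SitemapLocator // illustrative helper, not a real library
    {
        // Pull "Sitemap:" directives out of a robots.txt body (case-insensitive).
        public static List<string> ParseSitemapLines(string robotsTxt)
        {
            var sitemaps = new List<string>();
            foreach (string rawLine in robotsTxt.Split('\n'))
            {
                string line = rawLine.Trim();
                if (line.StartsWith("Sitemap:", StringComparison.OrdinalIgnoreCase))
                    sitemaps.Add(line.Substring("Sitemap:".Length).Trim());
            }
            return sitemaps;
        }

        // Fetch <siteRoot>/robots.txt and return any sitemap URLs it declares,
        // falling back to the conventional /sitemap.xml location.
        public static List<string> Discover(string siteRoot)
        {
            try
            {
                WebRequest request = WebRequest.Create(siteRoot.TrimEnd('/') + "/robots.txt");
                using (WebResponse response = request.GetResponse())
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    List<string> found = ParseSitemapLines(reader.ReadToEnd());
                    if (found.Count > 0) return found;
                }
            }
            catch (WebException) { /* no robots.txt - fall through to the default */ }
            return new List<string> { siteRoot.TrimEnd('/') + "/sitemap.xml" };
        }
    }
    ```

    The fallback to /sitemap.xml is still a guess, of course - some sites have neither a robots.txt entry nor a sitemap at all.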

    Regards,

    /Andrei
     
    andreimc, Aug 6, 2010 IP
  2. s_ilyas786

    s_ilyas786 Active Member

    #2
    I am not sure about going the sitemap.xml route, as it can be difficult to locate and some sites might not have one at all...

    But I guess you can try this:
    -> WebRequest to the homepage
    -> WebResponse gives you the page's source code
    -> Use a regular expression to find all links (inner pages) in the response
    -> Keep only links that belong to this site, i.e. ones containing "http://website.com/"
    -> Store these links in an array, file, or DB (wherever)
    -> Take each link from the step above and do a WebRequest on it to find more inner pages
    Repeat till all are done...
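    The steps above can be sketched as a breadth-first crawl - this is only a rough sketch (the `SiteCrawler` name is illustrative, and the `StartsWith` check assumes all inner links are absolute URLs):

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Net;
    using System.Text.RegularExpressions;

    static class SiteCrawler // illustrative helper name
    {
        // Pull every href target out of a chunk of HTML (the regex step).
        public static List<string> ExtractLinks(string html)
        {
            var links = new List<string>();
            foreach (Match m in Regex.Matches(html,
                     @"<a.*?href=[""'](?<url>.*?)[""']", RegexOptions.IgnoreCase))
                links.Add(m.Groups["url"].Value);
            return links;
        }

        // Breadth-first crawl: fetch each page, extract its links, keep only
        // those under siteRoot, and queue pages we have not seen before.
        public static List<string> Crawl(string siteRoot, int maxPages)
        {
            var seen = new HashSet<string> { siteRoot };
            var queue = new Queue<string>();
            queue.Enqueue(siteRoot);
            var pages = new List<string>();

            while (queue.Count > 0 && pages.Count < maxPages)
            {
                string url = queue.Dequeue();
                string html;
                try
                {
                    using (WebResponse response = WebRequest.Create(url).GetResponse())
                    using (var reader = new StreamReader(response.GetResponseStream()))
                        html = reader.ReadToEnd();
                }
                catch (WebException) { continue; } // skip pages that fail to load

                pages.Add(url);
                foreach (string link in ExtractLinks(html))
                    if (link.StartsWith(siteRoot) && seen.Add(link))
                        queue.Enqueue(link);
            }
            return pages;
        }
    }
    ```

    The HashSet is what stops it from crawling the same page twice, and the maxPages cap keeps it from running forever on big sites.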

    If you search the net you should be able to find code for step 3. Let me know if you can't find it, and I'll try to write some quick and dirty code for your reference (I can only write C#.NET). The rest is your logic...

    Hope that helps...
     
    s_ilyas786, Aug 6, 2010 IP
  3. andreimc

    andreimc Peon

    #3
    Hello s_ilyas786,

    Your solution is exactly what I was looking for, I will try this.
    Thank you!
     
    andreimc, Aug 6, 2010 IP
  4. andreimc

    andreimc Peon

    #4
    Hi, could you please help me with step 3 you mentioned? I searched but could not find a regex that extracts links from a given domain.
    Afterwards I'll need to make this regex dynamic, since I don't know in advance what the domain name will be - I'll be iterating over a list of URLs, so I'd need to rebuild the regex each time, for each URL.

    Or, I think you are right - I could simply store ALL the links it finds in a List<string>, and then once I have the full list, go through it and remove items that don't belong to the same domain. That way I can avoid the dynamic regex.

    Thank you.

    Regards,
     
    andreimc, Aug 7, 2010 IP
  5. s_ilyas786

    s_ilyas786 Active Member

    #5
    andreimc,

    I'm sorry for the delay, I was busy until now...

    Below is the code you were looking for:
    using System.IO;
    using System.Net;
    using System.Text.RegularExpressions;

    string url = "http://forums.digitalpoint.com/"; // Change url as needed
    WebRequest request = WebRequest.Create(url);
    WebResponse response = request.GetResponse();
    StreamReader reader = new StreamReader(response.GetResponseStream());
    string data = reader.ReadToEnd();
    reader.Close();
    response.Close();

    // Capture the href target and anchor text of every <a> tag
    string RegexPattern = @"<a.*?href=[""'](?<url>.*?)[""'].*?>(?<name>.*?)</a>";
    MatchCollection matches = Regex.Matches(data, RegexPattern, RegexOptions.IgnoreCase);
    string[] MatchList = new string[matches.Count];
    int c = 0;

    foreach (Match match in matches)
    {
        MatchList[c] = match.Groups["url"].Value;
        c++;
    }

    The next step is to remove all links in MatchList which are not related to the url - hope you can code that yourself.
    Hint: loop through MatchList and keep each item where item.Contains("http://forums.digitalpoint.com/") and you should get it...
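    One thing the Contains check misses is relative links like "/f10/" that the regex will also pick up, even though they belong to the same site. A sketch using the Uri class to resolve each link against the page it came from and compare hosts - the `LinkFilter` name is just illustrative:

    ```csharp
    using System;
    using System.Collections.Generic;

    static class LinkFilter // illustrative helper name
    {
        // Resolve each extracted href against the page it came from and keep
        // only http(s) URLs that live on the same host.
        public static List<string> SameSiteLinks(string baseUrl, IEnumerable<string> hrefs)
        {
            var baseUri = new Uri(baseUrl);
            var kept = new List<string>();
            foreach (string href in hrefs)
            {
                Uri resolved;
                if (!Uri.TryCreate(baseUri, href, out resolved)) continue;       // unparsable href
                if (resolved.Scheme != "http" && resolved.Scheme != "https") continue; // mailto:, javascript:, etc.
                if (resolved.Host == baseUri.Host)
                    kept.Add(resolved.AbsoluteUri);
            }
            return kept;
        }
    }
    ```

    Resolving first means "/thread.html" and "http://forums.digitalpoint.com/thread.html" both end up as the same absolute URL, so the dedupe step later catches them too.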

    Lemme know if you need more details...

    Warm Regards,
    ilyas
     
    s_ilyas786, Aug 12, 2010 IP
  6. andreimc

    andreimc Peon

    #6
    Hi ilyas,

    Thank you very much for your help.
    In the meantime I figured it out myself, and my method is almost the same as your code snippet, except that my regex did not use IgnoreCase :)

    Thanks again.
    Best regards,

    /Andrei
     
    andreimc, Aug 12, 2010 IP
  7. krishnaswarna

    krishnaswarna Peon

    #7
    Can I know how to recursively follow each sublink in the HTML of a website that points back to the same site, until reaching pages with no further links, and store them all in an array?
     
    krishnaswarna, Nov 10, 2010 IP
  8. krishnaswarna

    krishnaswarna Peon

    #8
    Can I know how to remove all links in MatchList which are not related to the url?
     
    krishnaswarna, Nov 11, 2010 IP