ASP.NET/C# Web Spider (in the making)

Discussion in 'C#' started by Ferbal, Jul 21, 2006.

  1. #1
    Heylo all,

    Okay, it's not a web spider or anything like that *YET*. Right now it takes in a URL, makes a new
    instance of WebClient and downloads the web page into a string.

    
            string url = Text1.Value;
            WebClient browser = new WebClient();
            UTF8Encoding enc = new UTF8Encoding();
            string fContents = enc.GetString(browser.DownloadData(url));
            int len = fContents.Length;
            char c;
            string linkList = "";
    
            for (int i = 0; i < len; i++)
            {
                c = Convert.ToChar(fContents.Substring(i, 1));
                if (c == 'a')
                {
                    i++;
                    c = Convert.ToChar(fContents.Substring(i, 1));
                    if (c == ' ')
                    {
                        i++;
                        c = Convert.ToChar(fContents.Substring(i, 1));
                        if (c == 'h')
                        {
                            i = i + 6; // move our string counter to after the quotes ref="h
                            c = Convert.ToChar(fContents.Substring(i, 1));
                            while (c != '"')
                            {
                                c = Convert.ToChar(fContents.Substring(i, 1));
                                if (c == '"')
                                {
                                    break;
                                }
                                linkList = linkList + c;
                                i++;
                            }
                            linkList = linkList + "\n";
                            TextArea1.Value = linkList;
                            
                        }
                    }
                }
            }
            
           
        }
    
    As you can see, it starts off with the entire string and just goes through it character by character. On some sites it works and displays each link; however, about half the time it fails with this error:

    Index and length must refer to a location within the string.


    The error occurs at THIS line:

    
    c = Convert.ToChar(fContents.Substring(i, 1)); // <-- errors here
    if (c == '"')
    
    Now I don't understand why this would throw an error. i is the position of the current character in the string (the web page), and 1 is the number of characters to put into my variable c.

    Thanks in advance for all help, much appreciated!!
     
    Ferbal, Jul 21, 2006 IP
  2. Ferbal

    #2
    Anyone have any ideas at all?
     
    Ferbal, Jul 22, 2006 IP
  3. benjymouse

    #3
    You need to learn something about parsing. What you are doing is neither top-down nor bottom-up parsing.

    You need to divide the task. Usually compilers etc. will use a scanner to divide the input stream into symbols, so that the parser rules are free to concern themselves with grammar without having to deal with individual characters.

    A scanner will also recognize whitespace correctly. Your attempt will fail on this text:

    a  h

    That is, an "a", two spaces and an "h".
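
    To illustrate the scanner idea, here is a minimal sketch. The `Scanner` class and its method names are hypothetical and not part of the original code; the point is only that whitespace gets handled in one place, so "a", any number of spaces and "h" is recognised.

```csharp
using System;

// Hypothetical helper for illustration; not part of the original post's code.
public static class Scanner
{
    // Advances past any run of whitespace starting at pos and returns the
    // index of the next non-whitespace character (text.Length at the end).
    public static int SkipWhitespace(string text, int pos)
    {
        while (pos < text.Length && char.IsWhiteSpace(text[pos]))
        {
            pos++;
        }
        return pos;
    }

    // Returns true when the 'a' at index pos is followed by one or more
    // whitespace characters and then an 'h'.
    public static bool IsAThenH(string text, int pos)
    {
        if (pos < 0 || pos >= text.Length || text[pos] != 'a')
        {
            return false;
        }
        int next = SkipWhitespace(text, pos + 1);
        // next > pos + 1 means at least one whitespace character was skipped.
        return next > pos + 1 && next < text.Length && text[next] == 'h';
    }
}
```

    Every index is compared against text.Length before it is used, so input that ends right after the "a" or in the middle of the spaces simply returns false.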
     
    benjymouse, Jul 24, 2006 IP
  4. Ferbal

    #4
    Parsing requires a lot more work that doesn't need to be done with what I am trying to accomplish, thanks though!

    And if it finds an 'a', it then looks at the next character, and if that is a space, it checks whether the character after it is an 'h'. If that character is another space instead, it just continues with the loop, looking for the next 'a'.
     
    Ferbal, Jul 24, 2006 IP
  5. Free Born John

    #5
    Why not put a couple of displays in to show the value of i and the length of the string?
     
    Free Born John, Jul 24, 2006 IP
  6. benjymouse

    #6
    A couple of points then.

    You do not need to use the Substring method. It returns a string. If all you are interested in is going character by character, just use the [index] indexer of the string. It returns the character at the position given by index (counted from 0, I believe).

    Your code will *not* just continue. You advance the index beyond what you know is safe. If the string ends right after an "a" you'll have an indexing error.

    At the very least you should guard the condition with a short-circuit boolean AND, like:

    if (i < fContents.Length && fContents[i] == 'a')
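
    Applied to the whole loop, the guard might look like the sketch below. The `LinkScanner` class and `ExtractLinks` name are hypothetical, and it assumes links are written exactly as `a href="` with double quotes, as the original code does. Substring(i, 1) throws the "Index and length must refer to a location within the string" error as soon as i reaches the string's length, which is what happens when a closing quote is missing or the jump past `href="` lands beyond the end.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; names are not from the original post.
public static class LinkScanner
{
    // Collects the text between the quotes of every `a href="..."` in html.
    public static List<string> ExtractLinks(string html)
    {
        const string marker = "a href=\"";
        var links = new List<string>();
        int i = 0;

        // Only look for the marker while it can still fit in what is left.
        while (i + marker.Length <= html.Length)
        {
            if (string.CompareOrdinal(html, i, marker, 0, marker.Length) != 0)
            {
                i++;
                continue;
            }

            i += marker.Length; // first character after the opening quote
            int start = i;

            // The length check comes first, so html[i] is never read past
            // the end even if the closing quote is missing.
            while (i < html.Length && html[i] != '"')
            {
                i++;
            }

            // Keep the link only if a closing quote was actually found.
            if (i < html.Length)
            {
                links.Add(html.Substring(start, i - start));
            }
        }
        return links;
    }
}
```

    An unterminated link at the end of the page is dropped here instead of raising an exception; whether that is the right behaviour depends on what the spider should do with broken markup.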
     
    benjymouse, Jul 24, 2006 IP
  7. Ferbal

    #7
    Thanks, I did change the code to not use substring, however, now I cannot connect to any remote website, kinda odd since it was not doing this before :(. If I somehow get it working I will see if the original error continues happening, thanks again!

    -Ferbal
     
    Ferbal, Jul 24, 2006 IP
  8. Darrin

    #8
    Have you looked at using regular expressions? The Regex class in C# is really good and really fast at finding patterns. The syntax is a little tricky at first, but once you get it, you can pass in a large string and it will return a collection of all the matches for your pattern.

    It's very flexible and might work really well for what you are doing...
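
    As a rough sketch of the Regex approach: the pattern below is only an assumption about what the links look like (an `<a ...>` tag whose href value sits in double or single quotes), and `RegexLinks` is a made-up name. A regular expression is not a full HTML parser, so odd markup can still slip past it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the pattern is an assumption.
public static class RegexLinks
{
    // "<a", whitespace, any other attributes, then href = "value" or 'value'.
    // Group 1 captures the value between the quotes.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\s[^>]*?href\s*=\s*[""']([^""']*)[""']",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            links.Add(m.Groups[1].Value);
        }
        return links;
    }
}
```

    Regex.Matches returns every non-overlapping match, so there is no index arithmetic to get wrong and nothing to run off the end of the string.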
     
    Darrin, Jul 25, 2006 IP
  9. Ferbal

    #9
    Yeah, I recently stumbled upon some stuff on Regex. However, I still need to be able to connect to websites first lol :) Thanks guys!
     
    Ferbal, Jul 25, 2006 IP