some messy... but useful code

Discussion in 'C#' started by camjohnson95, Nov 6, 2008.

  1. #1
    I have just written this VB.NET (ASP.NET) code for handling HTML source code.
    It retrieves HTML elements, and the attributes of those elements, from a remote (or local) web page.
    It gets a little messy, but it should work most of the time (it may misbehave if the HTML is poorly formed).
    I have coded an example in Page_Load to demonstrate its use.

    Here is the code. First, the imports:
    Imports System.Net
    Imports System.IO

    
        Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
            'an example of how to use the code:
            'outputs the actual links, anchor texts and HREF attributes of all links on www.google.com
    
            Dim i As Integer = 0
            Dim google As HTMLDoc
            Dim alink As HTMLElement
    
            google.Source = GetWebPage("http://www.google.com")
            alink = google.getElementByTagName("a")
    
            For i = 0 To alink.count - 1
                Response.Write("Actual Link: " & alink.Item(i).outerHTML & "<br>")
                Response.Write("Anchor Text: " & alink.Item(i).innerHTML & "<br>")
                Response.Write("HREF: " & alink.Item(i).getAttributeValue("href") & "<br><br>")
            Next
    
        End Sub
    
        Function GetWebPage(ByVal strURI As String) As String
            'Using blocks dispose the response and the reader even if an exception is thrown
            Using r As WebResponse = WebRequest.Create(New Uri(strURI)).GetResponse()
                Using sr As New StreamReader(r.GetResponseStream())
                    'ReadToEnd reads the whole stream in one call, so no loop is needed
                    Return sr.ReadToEnd()
                End Using
            End Using
        End Function
    
    
        Structure HTMLDoc
            Dim Source As String
            Function getElementByTagName(ByVal tagname As String) As HTMLElement
                Dim p1, p2, p3, p4, p5 As Integer
                Dim c As Integer
                c = 0
                p1 = 0
                tagname = LCase(tagname)
    
                With getElementByTagName
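                    Do
                        p1 = InStr(p1 + 1, LCase(Source), "<" & tagname)
                        If p1 = 0 Then Exit Do
                        p2 = InStr(p1, Source, ">")
                        'stop if the tag is never closed; Mid below would throw
                        If p2 = 0 Then Exit Do
                        'skip longer tag names with the same prefix, e.g. <abbr> when searching for <a>
                        If InStr(" >/" & vbTab & vbCrLf, Mid(Source, p1 + Len(tagname) + 1, 1)) = 0 Then Continue Do
                        p3 = InStr(p2, LCase(Source), "</" & tagname)
                        ReDim Preserve .Item(c)
                        If Mid(Source, p2 - 1, 1) = "/" Or p3 = 0 Then
                            'self-closing tag, or no closing tag found: keep only the opening tag
                            .Item(c).innerHTML = ""
                            .Item(c).outerHTML = Mid(Source, p1, (p2 + 1) - p1)
                        Else
                            p4 = p3 + Len(tagname) + 3
                            .Item(c).innerHTML = Mid(Source, p2 + 1, p3 - (p2 + 1))
                            .Item(c).outerHTML = Mid(Source, p1, p4 - p1)
                        End If
                        c = c + 1
                    Loop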
                    .count = c
                End With
            End Function
        End Structure
    
        Structure HTMLElement
            Dim count As Integer
            Dim Item() As HTMLElementItem
        End Structure
    
        Structure HTMLElementItem
            Dim outerHTML As String
            Dim innerHTML As String
            Function getAttributeValue(ByVal attr As String) As String
                Dim p1, p2, p3 As Integer
                Dim i As Integer = 0
                Dim formats(2) As String
                Dim endchars(3) As String
                attr = LCase(attr)
    
                formats(0) = attr & "=" & Chr(34)
                formats(1) = attr & "='"
                formats(2) = attr & "="
                endchars(0) = Chr(34)
                endchars(1) = "'"
                endchars(2) = Chr(32)
                endchars(3) = ">"
                For i = 0 To 2
                    p1 = InStr(LCase(outerHTML), formats(i))
                    If p1 > 0 Then
                        p2 = InStr(p1 + Len(formats(i)), outerHTML, endchars(i))
                        If i = 2 Then
                            p3 = InStr(p1 + Len(formats(i)), outerHTML, endchars(3))
                            If p3 < p2 And p3 > 0 Or p2 = 0 And p3 > 0 Then
                                p2 = p3
                            End If
                        End If
                        If p2 > 0 Then
                            getAttributeValue = Mid(outerHTML, p1 + Len(formats(i)), p2 - (p1 + Len(formats(i))))
                            Exit For
                        End If
                    End If
                Next
            End Function
        End Structure
    

    I know it's ugly, but it seems to work all right so far. I had been looking for something like this for a while and couldn't find it, so hopefully it is useful to someone else too.
     
    camjohnson95, Nov 6, 2008 IP
  2. ranabra

    #2
    ranabra, Nov 7, 2008 IP
  3. web-bod

    #3
    Ugly, but necessarily so. Nice code, thanks.
     
    web-bod, Nov 15, 2008 IP
  4. camjohnson95

    #4
    Yeah, I've implemented it and ended up just writing separate functions most of the time for different tags. It's a little bit buggy, but the idea is there (it works great for <a> tags, though).
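
A tag-specific function of the kind mentioned here might look roughly like the sketch below. This is not the poster's actual code: the module name LinkExtractor, the function name GetLinkHrefs and the regular expression are all assumptions made for illustration. It handles only <a> tags and pulls out the href value whether it is double-quoted, single-quoted or unquoted.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkExtractor
    'Hypothetical helper, not part of the code posted in #1.
    'Returns the href value of every <a> tag found in html.
    Function GetLinkHrefs(ByVal html As String) As List(Of String)
        Dim hrefs As New List(Of String)
        'Matches <a ... href="value">, <a ... href='value'> or <a ... href=value>
        Dim pattern As String = "<a\b[^>]*?\bhref\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s>]+))"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            'Exactly one of the three groups captures, depending on the quoting style
            If m.Groups(1).Success Then
                hrefs.Add(m.Groups(1).Value)
            ElseIf m.Groups(2).Success Then
                hrefs.Add(m.Groups(2).Value)
            Else
                hrefs.Add(m.Groups(3).Value)
            End If
        Next
        Return hrefs
    End Function

    Sub Main()
        'Sample input covering the three quoting styles
        Dim sample As String = "<a href=""http://example.com/"">One</a> <A HREF='/two'>Two</A> <a class=x href=three>Three</a>"
        For Each href As String In GetLinkHrefs(sample)
            Console.WriteLine(href)
        Next
    End Sub
End Module
```

Because it looks for one tag and one attribute only, a function like this avoids the general open/close tag matching that makes the generic version fragile. It is still pattern matching rather than real HTML parsing, so attributes such as data-href, or markup inside comments and scripts, can produce false matches.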
     
    camjohnson95, Nov 15, 2008 IP