Collecting Internet Favorites


Doug Steele          

This month, Doug Steele continues to show how you can gather data from unexpected places. In this article, he gets URLs from the Internet and a file of exported Opera bookmarks.
 

There are some Web pages on the Internet that are essentially collections of URLs. Can I read the links from those pages and store them in a table?

Some of you might remember that I showed a technique to get the contents of a Web page using the XMLHTTP object in my November 2003 column. It turns out that if you can read the HTML from the page into a variable, you can use the technique I mentioned in last month's column virtually unchanged to extract URLs on the page.

If you're using the XMLHTTP object (which is installed with Internet Explorer), then you must include error checking in your code to handle those cases where the URL is invalid:

Private Function GetHTMLFromURL( URL As String _

) As String

On Error GoTo Err_GetHTMLFromURL

Dim objWeb As Object

Dim strHTML As String

  Set objWeb = CreateObject("Microsoft.XMLHTTP")

  objWeb.Open "GET", URL, False

  objWeb.Send

  strHTML = objWeb.responseText

End_GetHTMLFromURL:

  GetHTMLFromURL = strHTML

  Set objWeb = Nothing

  Exit Function

Err_GetHTMLFromURL:

  ' If the request fails (an invalid or unreachable URL, for
  ' example), fall through and return a zero-length string.

  strHTML = vbNullString

  Resume End_GetHTMLFromURL

End Function
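
A quick way to confirm the error handling is to call the function from the Immediate window with a URL you know is bad; the address below is deliberately nonsense:

?Len(GetHTMLFromURL("http://no.such.address.invalid/"))

With the handler in place, this prints 0 instead of stopping on a run-time error.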

Hopefully, you read last month's column, so that I don't need to reproduce the code that reads through the HTML, parsing out each of the links and their associated titles. If you didn't, the code in the accompanying sample database is sufficiently well commented that you should be able to figure it out. What needs to be done to the ReadBookmarkFile routine from last month is to change it so that, rather than getting its HTML from the GetFileContent helper function, it gets it from the GetHTMLFromURL function above. This routine expects to be handed a URL for the page to be analyzed:

Sub ReadWebpage(URL As String)

Dim lngLoop As Long

Dim strCurrElement As String

Dim strFile As String

Dim strHREF As String

Dim strHTML As String

Dim strName As String

Dim strOutput As String

Dim strSQL As String

Dim strURL As String

Dim varElements As Variant

The first thing to do is ensure that the string passed to the routine has some characters to use as the URL. This check doesn't, by the way, guarantee that the string is a valid URL: GetHTMLFromURL handles that case by returning a zero-length string if there's no page at the URL:

  If Len(URL) > 0 Then

Once the page is retrieved, I split it into an array of text strings that each begin with an Anchor element, using the MySplit helper function:

    strHTML = GetHTMLFromURL(URL)

    If Len(strHTML) > 0 Then

      varElements = MySplit(strHTML, _

        "<A", -1, vbTextCompare)

For each Anchor element found, I use the helper functions from last month's article to get the URL being pointed to and, if it's a URL in which we're interested (see the discussion that follows), the text associated with that URL. I write that information to a table:

      For lngLoop = LBound(varElements) To _

        UBound(varElements)

        strCurrElement = varElements(lngLoop)

        If InStr(1, strCurrElement, "</A>", _

          vbTextCompare) > 0 Then

          If InStr(1, strCurrElement, "HREF", _

            vbTextCompare) > 0 Then

            strHREF = GetHREF(strCurrElement)

            If StrComp(Left(strHREF, 4), _

              "http", vbTextCompare) = 0 Then

              strName = _

                GetLinkName(strCurrElement)

              If Len(strName) > 0 Then

                strSQL = "INSERT INTO " & _

                  "Favorites " & _

                  "(FavoriteFolderNm, " & _

                  "SiteNm, URLTx) " & _

                  "VALUES(" & FixText(URL) & _

                  ", " & _

                  FixText(strName) & ", " & _

                  FixText(strHREF) & ")"

                CurrentDb.Execute strSQL, _

                  dbFailOnError

              End If

            End If

          End If

        End If

      Next lngLoop

    End If

  End If

End Sub
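
To load a page's links into the Favorites table, you hand the routine the URL of the page. For example, using the page mentioned later in this article:

ReadWebpage "http://www.accessmvp.com/djsteele/AccessIndex.html"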

What I do need to point out, though, is an issue that can occur when parsing the Anchors from "live" HTML. In the bookmark files from last month's column, I could be confident that the Anchor elements contained only URLs. Unfortunately, you can't make that same assumption when you're reading an actual Web page, since the specifications for the Anchor element allow for any URI to be used, not just URLs. So, in the previous paragraph when I said that I was retrieving URLs, I was making a naïve simplification.

What kinds of URIs aren't URLs? Some HREF elements may point to addresses within the same page, not to other pages (<A HREF="#Section2">Go to Section 2</A>). Others may point to e-mail addresses (<A HREF="mailto:AccessHelp@rogers.com">Send Doug an email</A>). That means we need some additional logic to ensure that only "legitimate" URLs are stored.

Depending on what you want the list for, you could try to come up with a comprehensive list of items to ignore (# in position 1, mailto:, news:, and so on), or you could simply allow only those records that start with http: or https:. If you look at the code I showed earlier, you'll see I took the latter approach.
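
If you'd rather take the first approach, a small helper function along these lines would do the job. This is only a sketch (it's not part of the accompanying sample database), and you'd extend the list of prefixes to ignore to suit the pages you're reading:

Private Function IgnoreHREF(HREF As String) As Boolean

' Returns True if the HREF is a URI type we don't want to store

  If Len(HREF) = 0 Then

    IgnoreHREF = True

  ElseIf Left(HREF, 1) = "#" Then

    IgnoreHREF = True   'address within the same page

  ElseIf StrComp(Left(HREF, 7), "mailto:", vbTextCompare) = 0 Then

    IgnoreHREF = True   'e-mail address

  ElseIf StrComp(Left(HREF, 5), "news:", vbTextCompare) = 0 Then

    IgnoreHREF = True   'newsgroup link

  Else

    IgnoreHREF = False

  End If

End Function

In ReadWebpage, you'd then test Not IgnoreHREF(strHREF) in place of the StrComp test on "http".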

Another issue is that sometimes the "name" of the Anchor link is more than just text, or not even text at all. For instance, even though I know it's not recommended, on my page on the Access MVP Web site (www.accessmvp.com/djsteele/AccessIndex.html), I have an anchor tag with an image tag embedded inside it, rather than any text:

<A HREF="http://support.microsoft.com/support/mvp/"

TARGET="_blank">

<IMG SRC="Graphics/logomvp.gif" WIDTH=108 HEIGHT=89

BORDER=0 ALT='MVP logo'></A>

In my defense, I also have the same link with a text label and no logo elsewhere on the page. For anchor tags without text, you may want to have the GetLinkName function only return the content of the link if it contains usable text. So if you're processing the link on my Access MVP page, the function would return a zero-length string. I'll leave including this enhancement as an exercise for you.
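
If you want a starting point for that exercise, one possibility is a thin wrapper around GetLinkName. This is only a sketch, and it assumes that GetLinkName returns whatever sits between the anchor's closing > and the </A>:

Private Function GetUsableLinkName(Element As String) As String

Dim strName As String

  strName = GetLinkName(Element)

  'If the "name" still contains markup (an embedded <IMG> tag,

  'for instance), treat the link as having no usable text

  If InStr(1, strName, "<", vbTextCompare) > 0 Then

    strName = vbNullString

  End If

  GetUsableLinkName = strName

End Function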

What about bookmarks from Opera? Can I get them into a table as well?

I have to confess that, even in the interests of this column, I didn't bother downloading and installing the Opera browser. However, a colleague of mine indicated that while she wasn't able to find a bookmark file per se, she was able to export the bookmarks from Opera into a text file (with an .adr extension).

The .adr file she gave me looks something like this:

Opera Hotlist version 2.0

Options: encoding = utf8, version=3

#FOLDER

  ID=11

  NAME=Trash

  TRASH FOLDER=YES

-

#FOLDER

  ID=12

  NAME=-- Opera Software --

#URL

  ID=13

  NAME=Buy Opera

  URL=http://www.opera.com/buy/

#URL

  ID=14

  NAME=Distribute Opera

  URL=http://distribute.opera.com/distribution/

#URL

  ID=15

  NAME=Download Opera

  URL=http://www.opera.com/download/

Since that's just a plain text file, it's relatively straightforward to read it and load the results into a table. The only "trick" is that you need to keep track of the folder with which each URL is associated. Since the name of the folder always precedes the URL(s) associated with it, though, it's a simple matter of storing the Folder name when you read it, and then using that Folder name for each of the following URLs until you encounter the next Folder.

In case you found that description hard to follow, here's the code to clarify it:

Sub ReadOperaADRFile(ADRFile As String)

Dim booReadingFolder As Boolean

Dim intFile As Integer

Dim intKeyword As Integer

Dim strBuffer As String

Dim strCurrFolder As String

Dim strCurrName As String

Dim strCurrURL As String

Dim strSQL As String

I then check to ensure that a legitimate file name was passed. If it was, I grab the next available FreeFile number, and use it to open the file. The major part of the code loops through the file, reading one line at a time:

  If Len(ADRFile) > 0 Then

    If Len(Dir(ADRFile)) > 0 Then

      intFile = FreeFile

      Open ADRFile For Input As #intFile

      Do While Not EOF(intFile)

        Line Input #intFile, strBuffer

I next remove any unnecessary white space from the line of text that I just read and then look at its content. Since both the Folder and URL sections use NAME= to identify the information, you have to keep track of which type of section is being read. One way to do this is with a Boolean variable. This code sets the Boolean to True every time a #FOLDER row is read, and to False every time a #URL row is read:

        strBuffer = Trim(strBuffer)

        If InStr(1, strBuffer, "#FOLDER", _

          vbTextCompare) > 0 Then

          booReadingFolder = True

        ElseIf InStr(1, strBuffer, "#URL", _

          vbTextCompare) > 0 Then

          booReadingFolder = False

        End If

Now, anytime a row with NAME= is read, I know whether it's a Folder name or a URL name and can store the data in the appropriate variable:

        intKeyword = InStr(1, strBuffer, _

          "NAME=", vbTextCompare)

        If intKeyword > 0 Then

          If booReadingFolder Then

            strCurrFolder = _

              Mid(strBuffer, intKeyword + 5)

          Else

            strCurrName = _

              Mid(strBuffer, intKeyword + 5)

          End If

        End If

Due to the nature of how the data's stored, if a row with URL= is read, I know that I should now have the name of the folder, the name of the link, and the link itself, which means that I can update the table. I toyed with the idea of resetting strCurrName and strCurrURL to zero-length strings after each insert into the table, and then writing to the table when I had values for the two variables, but that seemed unnecessarily complicated:

        intKeyword = InStr(1, strBuffer, _

          "URL=", vbTextCompare)

        If intKeyword > 0 Then

          strCurrURL = _

            Mid(strBuffer, intKeyword + 4)

          strSQL = "INSERT INTO Favorites " & _

            "(FavoriteFolderNm, SiteNm, " & _

            "URLTx) " & _

            "VALUES(" & _

            FixText(strCurrFolder) & ", " & _

            FixText(strCurrName) & ", " & _

            FixText(strCurrURL) & ")"

          CurrentDb.Execute strSQL, _

            dbFailOnError

        End If

      Loop

    End If

  End If

End Sub
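
Calling the routine is just a matter of handing it the full path to the exported file:

'Path is an example only; point it at wherever you saved the export

ReadOperaADRFile "C:\Bookmarks\opera6.adr"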

Now that I have all these URLs in a table, is there some way to determine if any of them have expired?

You've already seen the use of the MSXML library in the GetHTMLFromURL function. In addition to returning the HTML of the page in its responseText property, the request object exposes status and statusText properties that give you the HTTP status code returned for the request and the text associated with that code. They'll have one of the pairs of values shown in Table 1.

Table 1. Status codes from the XMLHTTP object.

Status   Status text

100      Continue
101      Switching Protocols
200      OK
201      Created
202      Accepted
203      Non-Authoritative Information
204      No Content
205      Reset Content
206      Partial Content
300      Multiple Choices
301      Moved Permanently
302      Found
303      See Other
304      Not Modified
305      Use Proxy
307      Temporary Redirect
400      Bad Request
401      Unauthorized
402      Payment Required
403      Forbidden
404      Not Found
405      Method Not Allowed
406      Not Acceptable
407      Proxy Authentication Required
408      Request Timeout
409      Conflict
410      Gone
411      Length Required
412      Precondition Failed
413      Request Entity Too Large
414      Request-URI Too Long
415      Unsupported Media Type
416      Requested Range Not Satisfiable
417      Expectation Failed
500      Internal Server Error
501      Not Implemented
502      Bad Gateway
503      Service Unavailable
504      Gateway Timeout
505      HTTP Version Not Supported

Typically, any status code of less than 300 is considered to be okay, while a code of 300 or more indicates a problem. I take advantage of this in some additional helper functions. The first function, when passed a URL, returns True if the URL points to an active page:

Public Function IsActiveURL( _

  URL As String) As Boolean

On Error GoTo Err_IsActiveURL

Dim booStatus As Boolean

Dim objWeb As Object

  booStatus = False

  Set objWeb = CreateObject("Microsoft.XMLHTTP")

  objWeb.Open "GET", URL, False

  objWeb.Send

  booStatus = (objWeb.status < 300)

End_IsActiveURL:

  IsActiveURL = booStatus

  Set objWeb = Nothing

  Exit Function

Err_IsActiveURL:

  Resume End_IsActiveURL

End Function

This function, when passed a URL, returns the text associated with the status code that comes back from the request:

Public Function GetStatusTextForURL( _

  URL As String _

) As String

On Error GoTo Err_GetStatusTextForURL

Dim strStatus As String

Dim objWeb As Object

  strStatus = vbNullString

  Set objWeb = CreateObject("Microsoft.XMLHTTP")

  objWeb.Open "GET", URL, False

  objWeb.Send

  strStatus = objWeb.statusText

End_GetStatusTextForURL:

  GetStatusTextForURL = strStatus

  Set objWeb = Nothing

  Exit Function

Err_GetStatusTextForURL:

  strStatus = Err.Description

  Resume End_GetStatusTextForURL

End Function

Armed with these two functions, I can create a query that will let me know whether there's a problem with any of the Favorites in my table and display a message describing the problem:

SELECT SiteNm

, URLTx

, IsActiveURL([URLTx]) AS IsActive

, GetStatusTextForURL([URLTx]) AS StatusText

FROM Favorites;

You can restrict this to return only those URLs that need looking into, or you can create a report to show you the details, as I did in the accompanying sample database. Be aware that the query can take a long time to run, depending on how many URLs you have stored in the Favorites table and what their status is. For some error conditions, MSXML simply waits for a specified period of time and, if it doesn't get a response, assumes that something's gone wrong. Lots of URLs in that state will give you a long-running query.
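
For instance, to return only the rows that need attention, you can add a WHERE clause on the IsActiveURL result (the same caveat about running time applies, since the functions still have to request each URL):

SELECT SiteNm

, URLTx

, GetStatusTextForURL([URLTx]) AS StatusText

FROM Favorites

WHERE IsActiveURL([URLTx]) = False;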

Okay, I've talked about Favorites and the like for long enough now. I hope you found it useful. Next month, I have some new questions to answer.

Your download file is called 509STEELE.ZIP in the file SA2005-10down.zip
