Automating Internet Explorer to Find All Links on a Web Page

By: Corbin Dunn

Abstract: You may want to find all links on a given web page. This document has source code and directions on how to do this.

Delphi Developer Support

You may face a situation where you need to find all links on a given web page. This document describes how to find them by automating Internet Explorer (IE). It was written with IE 5, but the concepts should work with IE 4 too.

Download the complete source for this project from Borland CodeCentral
Note: MSHTML_TLB.pas is not included with the above project because it is so large and you can import it yourself. Read on for how to import it.

  1. First, create a new application in Delphi. I saved Form1 as MainFrm.pas, and the application as FindingLinks.dpr.
  2. To get the most power out of automating IE, first import the MSHTML type library.
    To do this, select Project->Import Type Library.

    You should now see "Microsoft HTML Object Library (Version 4.0)" in the list of type libraries.

    The version shown may differ if you have a different version of Internet Explorer (I have 5.0). If you don't see it listed, search your computer for mshtml.tlb, then click the "Add..." button and select that file. If you can't find mshtml.tlb, you may not have IE installed, or your version is outdated (in which case, update IE before continuing).

    Finally, select "Microsoft HTML Object Library" and click the "Create Unit" button. Creating the unit will take quite a while because the type library is very large. The MSHTML_TLB file it created for me is 241,899 lines long! This should give you an idea of how much of Internet Explorer you can automate.
  3. Once Delphi is responsive again, go back to Form1 and lay out the following components:

    Component Class   Name           Caption or Text
    TLabel            lblURL         URL:
    TEdit             edtURL         http://www.borland.com
    TButton           btnFindLinks   Find All Links
    TListBox          lstbxLinks     n/a
  4. Add OleCtrls, SHDocVw, and OleServer to the uses clause in the interface section of Form1's unit. This allows us to create an instance of TInternetExplorer, which wraps the Internet Explorer ActiveX object. We don't want to show the Internet Explorer window in this application. If you do want to show it, place a TWebBrowser control on the form instead; it works in the same way and lets you see the resulting web page.
  5. Add the following to the private section of your form:
        FInternetExplorer: TInternetExplorer;
        procedure WebBrowserDocumentComplete(Sender: TObject; var pDisp: OleVariant;
          var URL: OleVariant);      
    
    Press Ctrl-Shift-C to complete the class declaration.
  6. Add
      uses MSHTML_TLB, ComObj;
    under the implementation section of Form1.
  7. Double-click Form1 to create the form's OnCreate event handler. Add the lines:
      FInternetExplorer := TInternetExplorer.Create(Self);
      FInternetExplorer.OnDocumentComplete := WebBrowserDocumentComplete;
    
  8. For TForm1.WebBrowserDocumentComplete add the following code:
    procedure TForm1.WebBrowserDocumentComplete(Sender: TObject;
      var pDisp: OleVariant; var URL: OleVariant);
    var
      Doc: IHTMLDocument2;
      ElementCollection: IHTMLElementCollection;
      HtmlElement: IHTMLElement;
      I: Integer;
      AnchorString: string;
    begin
      lstbxLinks.Clear;
      // We will process the document at this time. Trying to do
      // so earlier won't work because it hasn't fully loaded.
      Doc := FInternetExplorer.Document as IHTMLDocument2;
      if Doc = nil then
        raise Exception.Create('Couldn''t convert the ' +
          'FInternetExplorer.Document to an IHTMLDocument2');
      // First, grab all the elements on the web page
      ElementCollection := Doc.all;
      for I := 0 to ElementCollection.length - 1 do
      begin
        // Get the current element
        HtmlElement := ElementCollection.item(I, '') as IHTMLElement;
        // Next, check to see if it is a link (tagName will be A).
        // You could easily find other tags (such as TABLE, FONT, etc.)
        if HtmlElement.tagName = 'A' then
        begin
          // Now grab the innerText for this particular link. The innerText is
          // all text that is inside of the particular tag. For example,
          // this would give us "Go To Borland" from the HTML:
          // <a href="http://www.borland.com"><b>Go To Borland</b></a>.
          // If you want "<b>Go To Borland</b>" use innerHTML.
          AnchorString := HtmlElement.innerText;
          if AnchorString = '' then
            AnchorString := '(Empty Name)';
          // We know that the element is an IHTMLAnchorElement since the tagName
          // is 'A'. 
          AnchorString := AnchorString + ' -  ' +
            (HtmlElement as IHTMLAnchorElement).href;
          lstbxLinks.Items.Add(AnchorString);
        end;
      end;
    end;
  9. Next, double-click the btnFindLinks button and add the following code to its OnClick event handler:
      // Simply browse to the given page
      FInternetExplorer.Navigate(edtURL.Text, EmptyParam, EmptyParam,
        EmptyParam, EmptyParam);
  10. Compile the application and run it. Compiling will probably take a while because of the large MSHTML_TLB.pas file, and you will probably get a lot of warnings about that file, which you can ignore. You should then be able to find all links on a given web page by clicking the button. If you get a Variant exception when running in debug mode, you can ignore it (it doesn't affect anything).
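
The handler in step 8 walks every element on the page and tests each tagName. As a variation, you can ask the collection for only the A elements by calling its tags method. The following is a minimal sketch of that idea, not part of the CodeCentral project. It assumes the same form, the same fields (FInternetExplorer, lstbxLinks), and the same MSHTML_TLB import as the steps above, and that Supports is available from SysUtils in your Delphi version. The local names Anchors and Anchor are made up for this illustration.

```pascal
// Alternative body for the handler from step 8 (a sketch under the
// assumptions stated above, not the original project code).
procedure TForm1.WebBrowserDocumentComplete(Sender: TObject;
  var pDisp: OleVariant; var URL: OleVariant);
var
  Doc: IHTMLDocument2;
  Anchors: IHTMLElementCollection;
  Anchor: IHTMLAnchorElement;
  I: Integer;
  AnchorString: string;
begin
  lstbxLinks.Clear;
  Doc := FInternetExplorer.Document as IHTMLDocument2;
  if Doc = nil then
    Exit;
  // Ask MSHTML for only the <a> elements instead of testing every
  // element in Doc.all ourselves.
  Anchors := Doc.all.tags('A') as IHTMLElementCollection;
  for I := 0 to Anchors.length - 1 do
  begin
    // Supports does the QueryInterface and returns False instead of
    // raising if the element is not an anchor.
    if Supports(Anchors.item(I, ''), IHTMLAnchorElement, Anchor) then
    begin
      AnchorString := (Anchor as IHTMLElement).innerText;
      if AnchorString = '' then
        AnchorString := '(Empty Name)';
      lstbxLinks.Items.Add(AnchorString + ' -  ' + Anchor.href);
    end;
  end;
end;
```

This should fill the list box with the same entries as the version in step 8. The difference is that the filtering happens inside MSHTML, and an unexpected element is skipped quietly rather than raising on the "as" cast.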
