In my project I need to calculate some results using a specific external web site.
I checked what the .NET Framework offers for performing operations on websites:
1) System.Web.HttpRequest
http://msdn.microsoft.com/en-us/library/system.web.httprequest.aspx
2) System.Net.WebClient
http://msdn.microsoft.com/en-us/library/system.net.webclient(VS.80).aspx
3) System.Windows.Forms.WebBrowser
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.aspx
You can use any of these three classes when you just need to read the HTML of a specific website.
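For that simple read-only case a few lines are enough; here is a minimal sketch using WebClient (the encoding is an assumption, adjust it to the target site):

// Sketch only: downloads the raw HTML of a page as a string.
private string DownloadHtml(string url)
{
    using (var client = new System.Net.WebClient())
    {
        client.Encoding = System.Text.Encoding.UTF8; // assumption: the site serves UTF-8
        return client.DownloadString(url);
    }
}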
I used the most advanced of them, the WebBrowser class from the System.Windows.Forms namespace.
It allows more advanced operations on the website by accessing its elements through the HTML DOM, and it also works as a JavaScript runtime.
With the WebBrowser control I can perform the following operations (see the short sketch after this list):
- input data into website controls
- click buttons
- read values
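As a minimal sketch of what this DOM access looks like (the element ids "value", "Absenden" and "Ergebnis" are the ones used later in this post; "recalc" is a purely hypothetical JavaScript function name):

// Sketch only: assumes _webBrowser has already loaded the page.
private void SketchDomAccess()
{
    HtmlElement input = _webBrowser.Document.GetElementById("value");
    input.SetAttribute("value", "42");                                     // input data into a control

    _webBrowser.Document.GetElementById("Absenden").InvokeMember("click"); // click a button

    string result = _webBrowser.Document.GetElementById("Ergebnis").GetAttribute("value"); // read a value
    Log.DebugFormat("SketchDomAccess() result: {0}", result);

    _webBrowser.Document.InvokeScript("recalc");                           // call a JavaScript function on the page (hypothetical name)
}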
My web scraping implementation:
First, the main method that calculates the results on the external website with the help of the WebBrowser control:
[STAOperationBehavior]
public XYZResultDO Calculate(XYZRequestDO xyzRequestDO)
{
    Log.DebugFormat("Calculate: {0}", xyzRequestDO);

    var result = new XYZResultDO();
    try
    {
        // Navigate to the external tool and make sure all required elements are present.
        InitBrowser();
        LoadPage(PageUrl);

        if (!ValidatePage())
        {
            Log.Fatal("online tool is not available or not valid!");
            throw new ApplicationException("online tool is not available or not valid!");
        }

        // Fill the web form with every input item from the request.
        foreach (XyzDO xyzDO in xyzRequestDO.xyzs)
        {
            InsertInput(xyzDO);
        }

        result = ReadResults();
        Log.DebugFormat("Page web scraping. Calculated result {0}", result);

        return result;
    }
    catch (Exception e)
    {
        Log.FatalFormat("Error during calculation: {0}", e);
        return result;
    }
    finally
    {
        // Always release the WebBrowser control, it wraps an ActiveX component.
        if (_webBrowser != null)
        {
            _webBrowser.Dispose();
            _webBrowser = null;
        }
    }
}
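The WebBrowser control wraps an ActiveX component and can only be created and used on an STA thread; in this WCF service that is what the [STAOperationBehavior] attribute is there for. If you host the same code outside WCF, a minimal sketch (the wrapper name is a placeholder) would be to run the calculation on a dedicated STA thread:

// Sketch only: runs Calculate on a dedicated STA thread,
// because the WebBrowser control cannot be created on an MTA thread.
private XYZResultDO CalculateOnStaThread(XYZRequestDO request)
{
    XYZResultDO result = null;
    var thread = new System.Threading.Thread(() => result = Calculate(request));
    thread.SetApartmentState(System.Threading.ApartmentState.STA); // WebBrowser requires STA
    thread.Start();
    thread.Join();
    return result;
}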
This is how I initialize the browser control:
private void InitBrowser()
{
    Log.Debug("InitBrowser() InitializeBrowser");
    _webBrowser = new WebBrowser();

    _webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;
    _webBrowser.ProgressChanged += webBrowser_ProgressChanged;
}
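The ProgressChanged handler is not part of the original listings; a minimal sketch that just logs the download progress could look like this:

// Sketch only: logs how much of the page has been downloaded so far.
private void webBrowser_ProgressChanged(object sender, WebBrowserProgressChangedEventArgs e)
{
    Log.DebugFormat("webBrowser_ProgressChanged() {0} of {1} bytes loaded.",
        e.CurrentProgress, e.MaximumProgress);
}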
Loading the page into the browser control:
private void LoadPage(string url)
{
    if (_webBrowser != null)
    {
        Log.DebugFormat("LoadPage() Loading page from {0}", url);
        _webBrowser.Navigate(url);
        WaitToLoadDocument();
    }
}
Waiting for the document to load, with a timeout to handle loading problems:
private void WaitToLoadDocument()
{
    _browserBusy = true;
    DateTime startSeiteLaden = DateTime.Now;
    Log.Debug("WaitToLoadDocument() Loading web page into the browser.");

    // Pump Windows messages until DocumentCompleted fires or 30 seconds have passed.
    while (_browserBusy && startSeiteLaden.AddSeconds(30) > DateTime.Now)
    {
        Application.DoEvents();
    }
}

private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    _browserBusy = false;
    Log.DebugFormat("WebBrowser_DocumentCompleted() Document {0} loaded into the browser.", e.Url);
}
Validating the loaded page to ensure that all required elements are present:
private bool ValidatePage()
{
    Log.Debug("ValidatePage() Validating the loaded page");

    // Check if a document is loaded into the browser.
    if (_webBrowser == null || _webBrowser.Document == null)
    {
        Log.Warn("ValidatePage() Browser or page is missing.");
        return false;
    }

    // Check if all required document elements exist.
    if (_webBrowser.Document.GetElementById("Ergebnis") == null
        || _webBrowser.Document.GetElementById("fehler") == null
        || _webBrowser.Document.GetElementById("bild1") == null
        || _webBrowser.Document.GetElementById("bild2") == null
        || _webBrowser.Document.GetElementById("value") == null
        || _webBrowser.Document.GetElementById("neu") == null
        || _webBrowser.Document.GetElementById("Absenden") == null)
    {
        Log.Warn("ValidatePage() Page is not valid.");
        return false;
    }

    return true;
}
Performing operations on the website through the HTML DOM:
- reading an element value
- setting an element value
- clicking an element
private string GetPageElementValue(string elementName)
{
    HtmlElement element = _webBrowser.Document.GetElementById(elementName);
    return element != null ? element.GetAttribute("value") : null;
}

private void SetPageElementValue(string elementName, string elementValue)
{
    HtmlElement element = _webBrowser.Document.GetElementById(elementName);
    if (element != null)
    {
        element.SetAttribute("value", elementValue);
    }
}

private void ClickPageElement(string elementName)
{
    HtmlElement element = _webBrowser.Document.GetElementById(elementName);
    if (element != null)
    {
        element.InvokeMember("click");
    }
}
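The InsertInput and ReadResults methods called from Calculate are not shown in this post; here is a hypothetical sketch of how they could use the helpers above (the element ids are the ones checked in ValidatePage, the XyzDO and XYZResultDO members are placeholders):

// Sketch only: InsertInput and ReadResults are not part of the original listings.
private void InsertInput(XyzDO xyzDO)
{
    SetPageElementValue("value", xyzDO.Value);   // hypothetical property: the input value
    ClickPageElement("Absenden");                // submit the form
    WaitToLoadDocument();                        // wait for the page to refresh
}

private XYZResultDO ReadResults()
{
    return new XYZResultDO
    {
        Result = GetPageElementValue("Ergebnis"), // hypothetical property: calculated result
        Error  = GetPageElementValue("fehler")    // hypothetical property: error message, if any
    };
}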