Getting Only The Text Displayed On A Webpage Using C#

After months of looking at various ways to get only the text displayed in a web browser using C#, it all boiled down to a few simple lines of code.  I looked at several very robust open source .NET solutions, such as the HTML Agility Pack and Majestic 12.  However, for applications that only need tag-free, HTML-free text from a web page, these solutions seemed to be overkill, at least in my case.

Here are three very simple ways to get only the displayed text on a web page:

Method 1 – In Memory Cut and Paste

Use a WebBrowser control to load and render the web page, and then copy the text from the control.

Use the following code to download the web page:

//Create the WebBrowser control
WebBrowser wb = new WebBrowser();

//Add a new event handler to process the document when the download is completed
wb.DocumentCompleted +=
    new WebBrowserDocumentCompletedEventHandler(DisplayText);

//Download the web page (Navigate accepts either a URL string or a Uri)
wb.Navigate(urlPath);

Use the following event code to process the downloaded web page text:

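private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;

    //Select everything on the rendered page and copy it to the clipboard
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);

    //CleanText is a user-defined helper (not a framework method) that post-processes the copied text
    textResultsBox.Text = CleanText(Clipboard.GetText());
}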
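
The CleanText method called above is not listed in this post.  As a rough idea of what such a helper could look like, here is a minimal sketch.  The class name TextCleaner and the cleanup rules (trim each line, drop blank lines) are assumptions for illustration only, not the original implementation.

```csharp
using System;
using System.Collections.Generic;

public static class TextCleaner
{
    // Hypothetical stand-in for CleanText: normalizes line endings,
    // trims each line, and drops lines that are empty after trimming.
    public static string CleanText(string raw)
    {
        if (string.IsNullOrEmpty(raw))
        {
            return string.Empty;
        }

        // Treat \r\n, \r and \n uniformly before splitting into lines
        string[] lines = raw.Replace("\r\n", "\n").Replace('\r', '\n').Split('\n');

        List<string> kept = new List<string>();
        foreach (string line in lines)
        {
            string trimmed = line.Trim();
            if (trimmed.Length > 0)
            {
                kept.Add(trimmed);
            }
        }

        return string.Join(Environment.NewLine, kept.ToArray());
    }
}
```

With this sketch, the handler would call TextCleaner.CleanText(Clipboard.GetText()), or you could paste the method straight into your form class so the call stays exactly as written above.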

Method 2 – In Memory Selection Object

This is a second method of processing the downloaded web page text.  It seems to take just a bit longer (very minimal difference).  However, it avoids using the clipboard and the limitations associated with that.

//Requires a reference to Microsoft.mshtml and "using mshtml;"
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    //Get the WebBrowser control and its underlying IHTMLDocument2
    WebBrowser wb = (WebBrowser)sender;
    IHTMLDocument2 htmlDocument =
        wb.Document.DomDocument as IHTMLDocument2;

    //Select all the text on the page and create a selection object
    wb.Document.ExecCommand("SelectAll", false, null);
    IHTMLSelectionObject currentSelection = htmlDocument.selection;

    //Create a text range and send the range's text to your text box
    IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
    textResultsBox.Text = range.text;
}

Method 3 – The Elegant, Simple, Slower XmlDocument Approach

A good friend shared this example with me.  I am a huge fan of simple, and this example wins the simplicity contest hands down.  It was unfortunately very slow compared to the other two approaches.

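The XmlDocument object will load and process HTML files with only 3 simple lines of code, provided the page is well-formed XML (XHTML); Load throws an XmlException on markup it cannot parse:

XmlDocument document = new XmlDocument();
//Include the scheme (http://); without it, Load treats the string as a local file path
document.Load("http://www.yourwebsite.com");
string allText = document.InnerText;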
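
One caveat is worth illustrating: XmlDocument only parses well-formed XML, so it works for XHTML-style pages but throws an XmlException on typical loose HTML (for example, an unclosed <br>).  The sketch below shows both cases using in-memory strings rather than a live URL.  The class and method names (XmlTextExtractor, TryGetInnerText) are made up for this example.

```csharp
using System;
using System.Xml;

public static class XmlTextExtractor
{
    // Tries to parse the markup as XML and return its concatenated text.
    // Returns false (and sets text to null) when the markup is not well-formed.
    public static bool TryGetInnerText(string markup, out string text)
    {
        XmlDocument document = new XmlDocument();
        try
        {
            // LoadXml parses a string; Load would fetch from a file path or URL
            document.LoadXml(markup);
            text = document.InnerText;
            return true;
        }
        catch (XmlException)
        {
            // Raised for markup such as an unclosed <br> or a mismatched end tag
            text = null;
            return false;
        }
    }
}
```

For example, TryGetInnerText("<html><body><p>Hello</p><p>World</p></body></html>", out text) returns true with text set to "HelloWorld", while the same markup with a bare <br> inside a paragraph returns false.  Note that InnerText simply concatenates the text of every element, with no separators added between them.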

There you have it!  Three simple ways to scrape only displayed text from web pages with no external “packages” involved.

Packages

I have recently used the WatiN web application testing package to get website text using C#. WatiN was not the easiest package to set up for website text retrieval from C#, as it required references to the WatiN core DLL, Microsoft.mshtml, and System.Windows.Forms, plus several additional classes included in my project. However, I still think it is worth mentioning, because I like the results it produces. The package is stable and very simple to use once you get it set up. In fact, the website text can be obtained using only 3 lines of code:

var browser = new MsHtmlBrowser();
browser.GoTo("www.YourURLHere.com");
commandLog.Text = browser.Text;

I have included a simple Visual Studio ASP.NET project for download here.


6 thoughts on “Getting Only The Text Displayed On A Webpage Using C#”

  1. Could you please elaborate on your line of code in method one.

    textResultsBox.Text = CleanText(Clipboard.GetText());

    How do I get the code for CleanText? What does this method do? Could you provide the code for that method?

  2. How would I go about using this in a console app? This is what I have come up with so far, but the problem I am having is that it's displaying all the text more than once. Here is my code:

    public class Program
    {
    private bool completed = false;
    private static WebBrowser wb;

    [STAThread]
    private static void Main(string[] args)
    {
    Program p = new Program();

    wb = new WebBrowser();
    wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(Displaytext);

    wb.Navigate("www.pcwatch.cc");

    while (!p.completed)
    {
    Application.DoEvents();
    Thread.Sleep(1);
    }

    Console.WriteLine();
    Console.ReadLine();

    }

    private static void Displaytext(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
    WebBrowser wb = (WebBrowser)sender;
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);
    Console.WriteLine(Clipboard.GetText().ToString());
    }

    }
    }
