After looking around for months at various ways to get only the text displayed in a web browser using C#, I found that it all boiled down to a few simple lines of code. I looked at several very robust open source .NET solutions such as the HTML Agility Pack and Majestic 12. However, for applications which only require tag-free, HTML-free text from a web page, these solutions seemed to be overkill, at least in my case.
Here are three very simple ways to get only the displayed text on a web page:
Method 1 – In Memory Cut and Paste
Use a WebBrowser control to process the web page, and then copy the text from the control.
Use the following code to download the web page:
    //Create the WebBrowser control
    WebBrowser wb = new WebBrowser();
    //Add a new event to process document when download is completed
    wb.DocumentCompleted +=
        new WebBrowserDocumentCompletedEventHandler(DisplayText);
    //Download the webpage (Url takes a System.Uri; if urlPath is a string,
    //use wb.Navigate(urlPath) instead)
    wb.Url = urlPath;
Use the following event code to process the downloaded web page text:
    private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        WebBrowser wb = (WebBrowser)sender;
        wb.Document.ExecCommand("SelectAll", false, null);
        wb.Document.ExecCommand("Copy", false, null);
        textResultsBox.Text = CleanText(Clipboard.GetText());
    }
Method 2 – In Memory Selection Object
This is a second method of processing the downloaded web page text. It seems to take just a bit longer (very minimal difference). However, it avoids using the clipboard and the limitations associated with that.
    //Requires a reference to Microsoft.mshtml and "using mshtml;"
    private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        //Get the WebBrowser control and its IHTMLDocument2
        WebBrowser wb = (WebBrowser)sender;
        IHTMLDocument2 htmlDocument =
            wb.Document.DomDocument as IHTMLDocument2;
        //Select all the text on the page and create a selection object
        wb.Document.ExecCommand("SelectAll", false, null);
        IHTMLSelectionObject currentSelection = htmlDocument.selection;
        //Create a text range and send the range's text to your text box
        IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
        textResultsBox.Text = range.text;
    }
Method 3 – The Elegant, Simple, Slower XmlDocument Approach
A good friend shared this example with me. I am a huge fan of simple, and this example wins the simplicity contest hands down. It was unfortunately very slow compared to the other two approaches.
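The XmlDocument object will load and process a page with only 3 simple lines of code, provided the page is well-formed XML (XHTML):
    XmlDocument document = new XmlDocument();
    //Include the scheme; without it the string is treated as a local file path
    document.Load("http://www.yourwebsite.com");
    string allText = document.InnerText;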
There you have it! Three simple ways to scrape only displayed text from web pages with no external “packages” involved.
Packages
I have recently used the WatiN web application testing package to get website text using C#. WatiN was not the easiest package to set up for website text retrieval from C#, as it required references to the WatiN core dll, Microsoft.mshtml, and System.Windows.Forms, plus several additional classes included in my project. However, I still think it is worth mentioning, because I like the results it produces. The package is stable and very simple to use once you get it set up. In fact, the website text can be obtained using only 3 lines of code:
    var browser = new MsHtmlBrowser();
    browser.GoTo("www.YourURLHere.com");
    commandLog.Text = browser.Text;
I have included a simple Visual Studio ASP.NET project for download here.
Links
- Learn more about me at: http://www.jakemdrew.com
- Other articles you might be interested in: http://www.jakemdrew.com/Blog.aspx
- Download the demo here: http://jakemdrew.com/blog/WebSiteText.zip
Method 3 will not work if there are html errors
Thanks Fred! Good to know 🙂 I am not a fan of method 3 myself. XmlDocument was very slow anytime I tried to use it.
Could you please elaborate on your line of code in method one?
    textResultsBox.Text = CleanText(Clipboard.GetText());
How do I get the code for CleanText? What does this method do? Could you provide the code for that method?
CleanText() is just an example method that could be used to remove things such as control characters and duplicate spaces from the text that you download. I don't really have code for such a method. However, two things come to mind: char.IsWhiteSpace() will identify whitespace characters in text, and char.IsControl() will identify control characters in text. You could loop through each character, removing these and other characters as needed.
See:
http://msdn.microsoft.com/en-us/library/system.char.iswhitespace.aspx
http://msdn.microsoft.com/en-us/library/18zw7440.aspx
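As a rough illustration of the loop described above, here is a minimal sketch of what a CleanText() method could look like. It is not code from the demo project: the TextCleaner class name and the exact cleaning rules (drop control characters, collapse each run of whitespace into one space, trim both ends) are assumptions made for this example.

```csharp
using System;
using System.Text;

public static class TextCleaner
{
    // Hypothetical helper: removes control characters and collapses
    // every run of whitespace (spaces, tabs, line breaks) into one space.
    public static string CleanText(string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        StringBuilder result = new StringBuilder(input.Length);
        bool pendingSpace = false;

        foreach (char c in input)
        {
            if (char.IsWhiteSpace(c))
            {
                // Remember that whitespace was seen; emit it lazily so
                // leading and trailing whitespace is never written.
                pendingSpace = true;
            }
            else if (char.IsControl(c))
            {
                // Non-whitespace control character: skip it entirely.
            }
            else
            {
                if (pendingSpace && result.Length > 0) result.Append(' ');
                pendingSpace = false;
                result.Append(c);
            }
        }

        return result.ToString();
    }
}
```

Because tabs and line breaks count as whitespace, this version flattens the page into a single line. If paragraph breaks matter, the whitespace branch would need to treat '\n' separately.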
Thanks! Awesome work! Been looking around for something like this after endless problems with html agility pack and regex
How would I go about using this in a console app? This is what I have come up with so far, but the problem I am having is that it's displaying all the text more than once. Here is my code:
    using System;
    using System.Threading;
    using System.Windows.Forms;

    public class Program
    {
        private bool completed = false;
        private static WebBrowser wb;

        [STAThread]
        private static void Main(string[] args)
        {
            Program p = new Program();
            wb = new WebBrowser();
            wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(Displaytext);
            wb.Navigate("www.pcwatch.cc");
            while (!p.completed)
            {
                Application.DoEvents();
                Thread.Sleep(1);
            }
            Console.WriteLine();
            Console.ReadLine();
        }

        private static void Displaytext(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            WebBrowser wb = (WebBrowser)sender;
            wb.Document.ExecCommand("SelectAll", false, null);
            wb.Document.ExecCommand("Copy", false, null);
            Console.WriteLine(Clipboard.GetText().ToString());
        }
    }
First off I want to say superb blog! I had a quick question which I'd like to ask if you don't mind. I was curious to know how you center yourself and clear your head before writing. I've had a tough time clearing my mind and getting my thoughts out. I do enjoy writing, but it just seems like the first 10 to 15 minutes are usually lost just trying to figure out how to begin. Any ideas or hints? Appreciate it!
My biggest suggestion would be to iterate as many times as needed until you are happy with the content. You have to start somewhere before you can improve anything. Just get your scattered thoughts on paper, and then take a break, organize, and repeat until satisfied!
none of these examples will work if you cut and paste into visual studio, WHY????
Of course they work 😉 There's a demo app link at the bottom of the page as well. Download it and run the demo. Enter the URL you want text from and click the button. I suggest using the code in the demo app as it performs best. Here is the link in case you missed it the first time:
http://jakemdrew.com/blog/WebsiteText.zip
I think the admin of this web site is truly working hard in support of his web page, because every piece of material here is quality information.