How to Select Nodes in C# Using HtmlAgilityPack's SelectSingleNode and SelectNodes Methods?


(afree) #1

Hi there.

I am in the middle of building a data scraper that scans a website for some specific information. I am using the HtmlAgilityPack library for the basic HTML parsing, but I am stuck on one thing: selecting nodes inside a foreach block while collecting the results in a list declared outside the loop.

Here is the code to understand my issue a bit better:

string html;
string url = "http://stopbyte.com";
WebClient wc = new WebClient();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
 
html = wc.DownloadString(url);
doc.LoadHtml(html);

string header_title = doc.DocumentNode.SelectSingleNode("//h1[@class='header-title']").InnerText;
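List<string> vb = new List<string>();
// nodes: a collection of HtmlNode objects selected earlier (not shown in this snippet).
foreach (HtmlNode node in nodes)
{
    HtmlNode nd = node.SelectSingleNode(".//img/@alt");
    // Read the attribute's string value; the HtmlAttribute itself cannot go into a List<string>.
    vb.Add(nd.Attributes["alt"].Value);
}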

The code works fine except for this line:

HtmlNode nd = node.SelectSingleNode(".//img/@alt");

It doesn’t always return a node. Am I missing something, or is the XPath wrong?

Can you please point me to a helpful resource for better understanding how to use XPath with HtmlAgilityPack?


(SAM) #2

Here are some tips:

  • To select an element, or a set of elements, by the value of one of its attributes, use this syntax:
    Element-Tag-Name[@Attribute-Name='Value-Desired']

  • I have never seen the syntax "//img/@alt" used this way. In XPath it addresses the alt attribute itself rather than the img element. If you are trying to select an img HTML element that has an alt attribute, then I believe this is the correct syntax:
    Element-Tag-Name[@Attribute-Name]
    This matches any element with that tag name that has the given attribute, whatever its value. In your case that makes it:
    img[@alt]

I guess that should put you on the right track.
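
To make the two predicate forms concrete, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the HTML string and the class name PredicateDemo are made up for illustration. SelectSingleNode returns null when nothing matches, so each result is checked before use.

```csharp
using System;
using HtmlAgilityPack;

class PredicateDemo
{
    static void Main()
    {
        // Hypothetical markup, used only for this example.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><img src='a.png'><img src='b.png' alt='logo'></div>");

        // img[@alt]: the first img that has an alt attribute, whatever its value.
        HtmlNode withAlt = doc.DocumentNode.SelectSingleNode("//img[@alt]");

        // img[@alt='logo']: the first img whose alt attribute equals "logo".
        HtmlNode logo = doc.DocumentNode.SelectSingleNode("//img[@alt='logo']");

        // No img in the markup has alt='missing', so nothing matches here.
        HtmlNode none = doc.DocumentNode.SelectSingleNode("//img[@alt='missing']");

        if (withAlt != null)
            Console.WriteLine("has alt: " + withAlt.GetAttributeValue("src", ""));
        if (logo != null)
            Console.WriteLine("alt is logo: " + logo.GetAttributeValue("src", ""));
        if (none == null)
            Console.WriteLine("no match for alt='missing'");
    }
}
```

Using GetAttributeValue with a default avoids a second null check on the attribute itself.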


(afree) #3

Great, thanks a lot, that answered my exact question.
This is the final line of code that worked for me:

HtmlNode nd = node.SelectSingleNode(".//img[@alt]");
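
For anyone landing here later, this is a rough sketch of how that line can sit inside the loop while the results are collected in a list declared outside it. It is not my exact scraper code: the sample HTML, the "//div[@class='item']" selector, and the class name AltCollector are placeholders. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class AltCollector
{
    static void Main()
    {
        // Placeholder markup: one container with an image, one without.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class='item'><img src='a.png' alt='first'></div>" +
            "<div class='item'><p>no image here</p></div>");

        // The list lives outside the loop so it keeps every result.
        var altTexts = new List<string>();

        // SelectNodes returns null when nothing matches, so guard the loop.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//div[@class='item']");
        if (items != null)
        {
            foreach (HtmlNode item in items)
            {
                // The leading "." keeps the search relative to the current item.
                HtmlNode img = item.SelectSingleNode(".//img[@alt]");
                if (img == null)
                    continue; // this container has no img with an alt attribute

                altTexts.Add(img.GetAttributeValue("alt", ""));
            }
        }

        Console.WriteLine(string.Join(", ", altTexts));
    }
}
```

The null check on img is what handles the containers where the XPath finds nothing.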

(this.is.sparta) #4

Here is how to read, parse, and save an HTML page from scratch using HtmlAgilityPack:

 // Create a new, empty HTML document.
 HtmlDocument doc = new HtmlDocument();

 // Load the page (an HTML file) from a local or UNC path.
 // To fetch a web URL, use HtmlWeb.Load(url) instead.
 doc.Load(@"C:\website1\data\pages\home.html");

 // Get all <img> elements in the page that have a "src" attribute.
 // SelectNodes returns null when nothing matches.
 var my_img_nodes = doc.DocumentNode.SelectNodes("//img[@src]");

 if (my_img_nodes != null)
 {
     // Iterate through the img elements and do whatever you need with each one.
     foreach (HtmlNode img in my_img_nodes)
     {
         HtmlAttribute src = img.Attributes["src"];

         // One useful case, as here, is switching every image URL in the page to https.
         src.Value = src.Value.Replace("http:", "https:");
     }
 }

 // Finally, save the page back to a local path (optionally the same one).
 doc.Save(@"C:\website1\data\pages\home.html");