I am in the middle of building a data scraper that scans a website for specific information. I am using the HtmlAgilityPack library for the basic HTML parsing, but I am stuck on one thing: selecting nodes inside a foreach block while collecting the results in a collection declared outside the loop.
Here is the code to understand my issue a bit better:
string html;
string url = "http://stopbyte.com";
WebClient wc = new WebClient();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
html = wc.DownloadString(url);
doc.LoadHtml(html);
string header_title = doc.DocumentNode.SelectSingleNode("//h1[@class='header-title']").InnerText;
List<string> vb = new List<string>();
foreach (HtmlNode node in nodes)
{
    HtmlNode nd = node.SelectSingleNode(".//img/@alt");
    vb.Add(nd.SelectSingleNode("img").Attributes["alt"]);
}
The code works fine, except that this line:
HtmlNode nd = node.SelectSingleNode(".//img/@alt");
doesn't always return a node. Am I missing something, or is the XPath wrong?
Can you also point me to a helpful resource for better understanding how to use XPath with HtmlAgilityPack?
The correct way to select an element, or a set of elements, by attribute is this syntax: Element-Tag-Name[@Attribute-Name='Value-Desired']
I have never seen the syntax "//img/@alt", and I believe that if you are trying to select an img element that has an alt attribute, the correct syntax is:
Element-Tag-Name[@Attribute-Name] ==> this matches any element with that tag name that has the given attribute, regardless of its value.
In your case that gives:
img[@alt]
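To tie this back to the loop in the question, here is a minimal sketch of how the predicate might be used relative to each node. The inline HTML, the 'item' class, and the names AltTextExample, items, and alts are made up for illustration; only LoadHtml, SelectNodes, SelectSingleNode, and GetAttributeValue come from HtmlAgilityPack itself.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class AltTextExample
{
    static void Main()
    {
        // Hypothetical markup, used only for illustration.
        string html =
            "<div class='item'><img src='a.png' alt='First'></div>" +
            "<div class='item'><p>no image here</p></div>" +
            "<div class='item'><img src='b.png'></div>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> alts = new List<string>();

        // SelectNodes returns null when nothing matches, so guard before looping.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//div[@class='item']");
        if (items != null)
        {
            foreach (HtmlNode item in items)
            {
                // The leading "." keeps the search relative to the current node.
                HtmlNode img = item.SelectSingleNode(".//img[@alt]");
                if (img == null)
                {
                    continue; // no <img> with an alt attribute inside this item
                }
                alts.Add(img.GetAttributeValue("alt", string.Empty));
            }
        }

        Console.WriteLine(string.Join(", ", alts));
    }
}
```

With the sample markup above, only the first item contains an img with an alt attribute, so the other two are skipped by the null check instead of throwing a NullReferenceException.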
Here is how to read, parse, and save an HTML page from scratch using HtmlAgilityPack:
// Create a new, empty HTML document.
HtmlDocument doc = new HtmlDocument();
// Load your page from a local or UNC file path. (To load from a web URL, use HtmlWeb.Load instead.)
doc.Load(@"C:\website1\data\pages\home.html");
// Get all <img> elements in the page that have a "src" attribute.
// Note: SelectNodes returns null when nothing matches.
var my_img_nodes = doc.DocumentNode.SelectNodes("//img[@src]");
if (my_img_nodes != null)
{
    // Iterate through the img elements and do whatever you need with each one.
    foreach (HtmlNode img in my_img_nodes)
    {
        HtmlAttribute src = img.Attributes["src"];
        // A useful use case, as here, is switching all image URLs in the page to https.
        src.Value = src.Value.Replace("http:", "https:");
    }
}
// Finally, save the page back to a local path (optionally the same one).
doc.Save(@"C:\website1\data\pages\home.html");