Friday 15 June 2007

HtmlAgilityPack rocks (again)

I've said it before, but have to say it again... the HtmlAgilityPack is awesome.

If you are doing any sort of scraping or other HTML processing and aren't using it, you're writing too much code!

Today's task was parsing 38 HTML pages of tabular data into SQL INSERT statements (37,919 statements in the end). Create a console application, reference HtmlAgilityPack, add using directives for System.IO and HtmlAgilityPack, and use this Main, redirecting the output to a text file (program.exe > inserts.sql):

using System;
using System.IO;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string[] filenames = Directory.GetFiles(@"C:\Temp\", "*.htm", SearchOption.TopDirectoryOnly);
        foreach (string filename in filenames)
        {
            // ReadAllText opens, reads and closes the file in one call
            string file = File.ReadAllText(filename);

            // parsing out the <table> I want here: the second one on the page
            int tableStart = file.IndexOf("<table");
            tableStart = file.IndexOf("<table", tableStart + 1);
            if (tableStart < 0) continue; // no second <table> in this file

            // search for the closing tag from the table's start, not from the top of the file
            int tableEnd = file.IndexOf("</table>", tableStart) + "</table>".Length;

            string table = file.Substring(tableStart, tableEnd - tableStart) + Environment.NewLine;
            HtmlDocument hd = new HtmlDocument();
            hd.LoadHtml(table); // you can load a fragment, in this case just a <table></table>

            HtmlNode tableNode = hd.DocumentNode.ChildNodes[0];
            foreach (HtmlNode rowOrText in tableNode.ChildNodes)
            {
                if (rowOrText.NodeType == HtmlNodeType.Element)
                {
                    if (rowOrText.Name == "tr")
                    {
                        if (rowOrText.Attributes["class"] == null) // only 'header' rows have a class in my example
                        {
                            // child indexes are specific to my source markup
                            String Place = rowOrText.ChildNodes[0].InnerText.Trim(); // read but not inserted
                            String Name = rowOrText.ChildNodes[3].InnerText.Trim();
                            String Time = rowOrText.ChildNodes[4].InnerText.Trim();

                            // {0} and {1} match the two arguments; doubling ' keeps names like O'Brien valid SQL
                            Console.WriteLine("INSERT INTO RaceRunner ([Name], [Time]) VALUES ('{0}', '{1}')", Name.Replace("'", "''"), Time);
                        }
                    }
                }
            }
        }
    }
}
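
The nested loop above walks ChildNodes by hand. As a rough alternative, the same row filtering can probably be expressed with HtmlAgilityPack's XPath support. This is only a sketch: the sample table, the class name XPathRowsSketch, the helper SqlQuote and the cell positions (name in the second cell, time in the third) are all made up for illustration and will not match the real race pages.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathRowsSketch
{
    // Hypothetical helper: wraps a value in single quotes and doubles any
    // embedded single quote so the value stays inside its SQL string literal.
    public static string SqlQuote(string value)
    {
        return "'" + value.Replace("'", "''") + "'";
    }

    // Writes one INSERT per <tr> that has no class attribute.
    public static void WriteInserts(string tableHtml)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(tableHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr[not(@class)]");
        if (rows == null) return;

        foreach (HtmlNode row in rows)
        {
            // Only <td> children, so whitespace text nodes never shift the indexes.
            HtmlNodeCollection cells = row.SelectNodes("td");
            if (cells == null || cells.Count < 3) continue;

            // Assumed layout: place, name, time.
            string name = cells[1].InnerText.Trim();
            string time = cells[2].InnerText.Trim();

            Console.WriteLine(
                "INSERT INTO RaceRunner ([Name], [Time]) VALUES ({0}, {1})",
                SqlQuote(name), SqlQuote(time));
        }
    }

    public static void Main()
    {
        // Made-up sample data, not taken from the original pages.
        string sample =
            "<table>" +
            "<tr class=\"header\"><td>Place</td><td>Name</td><td>Time</td></tr>" +
            "<tr><td>1</td><td>Pat O'Brien</td><td>0:31:07</td></tr>" +
            "</table>";

        WriteInserts(sample);
    }
}
```

Selecting td elements directly means the cell indexes no longer depend on how much whitespace sits between the tags, which is the main reason to prefer this over counting ChildNodes.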

p.s. in case you were wondering, the source HTML was NOT valid XML that could be loaded into XmlDocument or another built-in .NET class
p.p.s. I really must find a reliable code prettifier...
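
To illustrate the p.s. about invalid XML, here is a minimal sketch of how the strict built-in parser reacts to typical tag soup. The sample strings are invented (they are not the original race pages), and XmlStrictnessDemo and IsWellFormedXml are hypothetical names used only for this example.

```csharp
using System;
using System.Xml;

public static class XmlStrictnessDemo
{
    // Returns true only when the markup parses as well-formed XML.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // Unclosed or mismatched tags end up here.
            return false;
        }
    }

    public static void Main()
    {
        // Unclosed <td> and <br>: fine for a browser, fatal for an XML parser.
        string tagSoup = "<table><tr><td>1<td>Jane<br></tr></table>";
        string wellFormed = "<table><tr><td>1</td><td>Jane</td></tr></table>";

        Console.WriteLine(IsWellFormedXml(tagSoup));    // prints False
        Console.WriteLine(IsWellFormedXml(wellFormed)); // prints True
    }
}
```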
