If you are doing any sort of scraping or other HTML processing and aren't using HtmlAgilityPack, you're writing too much code!
Today's task was parsing 38 pages of tabular HTML data into SQL INSERT statements (37,919 statements in the end). Create a console application, add using System.IO and using HtmlAgilityPack, and use the code below as the body of Main, redirecting the output to a text file (program.exe > inserts.sql):
string[] filenames = Directory.GetFiles(@"C:\Temp\", "*.htm", SearchOption.TopDirectoryOnly);
foreach (string filename in filenames)
{
    // ReadAllText opens, reads and closes the file in one call
    string file = File.ReadAllText(filename);

    // parsing out the <table> I want here: the second one in the file
    int tableStart = file.IndexOf("<table");
    tableStart = file.IndexOf("<table", tableStart + 1);
    // search from tableStart so the end index can never fall before the start
    int tableEnd = file.IndexOf("</table>", tableStart) + "</table>".Length;
    string table = file.Substring(tableStart, tableEnd - tableStart) + Environment.NewLine;

    HtmlDocument hd = new HtmlDocument();
    hd.LoadHtml(table); // you can load a fragment, in this case just a <table></table>
    HtmlNode tableNode = hd.DocumentNode.ChildNodes[0];

    foreach (HtmlNode rowOrText in tableNode.ChildNodes)
    {
        // skip text nodes and anything that is not a <tr>
        if (rowOrText.NodeType != HtmlNodeType.Element || rowOrText.Name != "tr") continue;
        // only 'header' rows have a class in my example
        if (rowOrText.Attributes["class"] != null) continue;

        // ChildNodes includes text nodes, so these indexes depend on the source markup
        string place = rowOrText.ChildNodes[0].InnerText.Trim(); // parsed, but not part of this INSERT
        string name = rowOrText.ChildNodes[3].InnerText.Trim();
        string time = rowOrText.ChildNodes[4].InnerText.Trim();

        // placeholders are zero-based; double any single quotes so a name like O'Brien stays valid SQL
        Console.WriteLine("INSERT INTO RaceRunner ([Name], [Time]) VALUES ('{0}', '{1}')",
            name.Replace("'", "''"), time.Replace("'", "''"));
    }
}
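The same extraction can probably be done without slicing the string by hand, by letting HtmlAgilityPack pick the rows with XPath. What follows is only a minimal sketch, not the code that produced the inserts: RaceRowReader and ReadRows are names made up for illustration, and it assumes each data row holds exactly place, name and time in three <td> cells, which may not match the real pages.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RaceRowReader // hypothetical helper, not part of HtmlAgilityPack
{
    // Returns one { name, time } pair per class-less <tr> in the second <table>.
    public static List<string[]> ReadRows(string html)
    {
        var rows = new List<string[]>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection trs =
            doc.DocumentNode.SelectNodes("(//table)[2]//tr[not(@class)]");
        if (trs == null) return rows;

        foreach (HtmlNode tr in trs)
        {
            // Assumed layout: <td>place</td><td>name</td><td>time</td>.
            HtmlNodeCollection cells = tr.SelectNodes("td");
            if (cells == null || cells.Count < 3) continue;

            rows.Add(new[] { cells[1].InnerText.Trim(), cells[2].InnerText.Trim() });
        }
        return rows;
    }
}
```

Selecting <td> elements directly sidesteps the whitespace text nodes that make the ChildNodes indexes fragile, at the cost of hard-coding the assumed column order.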
p.s. in case you were wondering, the source HTML was NOT valid XML, so it could not be loaded into XmlDocument or any other built-in .NET XML class
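To illustrate the difference, here is a small sketch that feeds one sloppy fragment to both XmlDocument and HtmlAgilityPack. The fragment used in the usage note is invented for the example (an unquoted attribute and an unclosed <br>), and MarkupCheck is a hypothetical helper name; the real pages may well have been broken in other ways.

```csharp
using System.Xml;
using HtmlAgilityPack;

public static class MarkupCheck // hypothetical helper for this illustration
{
    // True if the built-in XML parser accepts the markup as well-formed XML.
    public static bool LoadsAsXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // Returns the trimmed text of the first <td>, or null if there is none.
    public static string FirstCellText(string markup)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(markup); // tolerant parser: does not throw on sloppy markup
        HtmlNode cell = doc.DocumentNode.SelectSingleNode("//td");
        return cell == null ? null : cell.InnerText.Trim();
    }
}
```

For a fragment such as <table><tr class=row><td>Ann<br>Smith</td></tr></table>, LoadsAsXml should return false because the markup is not well-formed XML, while FirstCellText still gets at the cell's text.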
p.p.s. I really must find a reliable code prettifier...