Came across this article today: "Parsing html markup text using MSHTML" by Hendrik Swanepoel. I'm wondering whether it's a good solution for improving Searcharoo.net, both its ability to "spider" the web by finding links in HTML and its parsing of the HTML into words for indexing (e.g. pulling out the META tags).
I really want something lightweight that will parse HTML for (a) links for spidering and (b) words for indexing. Other than some complex Regex, MSHTML is the only option I've come across so far.
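
To make the two jobs concrete, here's a rough sketch of what I mean by "links for spidering" and "words for indexing". It is not MSHTML and not Searcharoo code: it's written in Python using the standard library's `html.parser` purely as an illustration, and the names `PageParser` and `parse_page` are made up for this example. The word splitting is deliberately naive (ASCII letters and digits only).

```python
import re
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Hypothetical helper: collects links, META tags and words from one page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []       # href values, for the spider to follow
        self.meta = {}        # META name -> content, e.g. keywords/description
        self.words = []       # lower-cased words, for the index
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies; they are not words a user would search for.
        if not self._skip_depth:
            self.words.extend(w.lower() for w in re.findall(r"[A-Za-z0-9]+", data))


def parse_page(html):
    """Parse one HTML string and return the populated PageParser."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser
```

A spider would resolve each entry in `links` against the page's URL before queuing it, and the indexer would take `words` plus whatever it wants from `meta`. Whether MSHTML ends up simpler or more robust than this kind of event-based parsing is exactly what I'd want to find out.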