Parsing text content out of different file formats for Searcharoo (or other search engines) can be accomplished in a number of ways, including writing your own parser (e.g. not too difficult for HTML) or using an IFilter loader.
However, there are always going to be new document types or formats where you want to build a custom parser... and today the new Word 2007 DOCX format is an example: I don't have Word 2007 installed on my PC, so I doubt there are any IFilter implementations for it lying around here either.
A bit of background: the DOCX format is basically a ZIP file containing a directory tree of XML files, and from what I can gather the main body of a (Word 2007) DOCX file is located in word/document.xml within the main ZIP archive.
Using a .NET ZIP library based on System.IO.Compression it's relatively simple to open a DOCX file, extract the document.xml and read the InnerText, like this:
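using System;
using System.IO;
using System.Xml;
using Ionic.Zip; // assumption: a DotNetZip-style library; the original ZIP library import is not shown
// ... your code to populate the DOCX filename here ...
using (ZipFile zip = ZipFile.Read(filename))
{
    MemoryStream stream = new MemoryStream();
    zip["word/document.xml"].Extract(stream); // assumption: DotNetZip-style call; adjust to your ZIP library
    stream.Seek(0, SeekOrigin.Begin); // don't forget
    XmlDocument xmldoc = new XmlDocument();
    xmldoc.Load(stream); // parse the extracted XML before reading InnerText
    string PlainTextContent = xmldoc.DocumentElement.InnerText;
}

If you're using .NET 3.0, the System.IO.Packaging.ZipPackage class is probably a better bet than the open source ZIP library for 2.0.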
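For anyone on .NET 3.0 or later, here is a rough sketch of the System.IO.Packaging route. It is untested against real Word 2007 output and assumes the main body really is at /word/document.xml, as described above. The DocxText class and its FromFile and FromXml methods are hypothetical names made up for this example. FromXml also appends a line break after each w:p paragraph, because plain InnerText runs the last word of one paragraph straight into the first word of the next, which isn't great for a search index.

```csharp
using System;
using System.IO;
using System.IO.Packaging; // .NET 3.0+: add a reference to WindowsBase.dll
using System.Text;
using System.Xml;

public static class DocxText
{
    // WordprocessingML main namespace, used by elements such as w:p and w:t.
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: opens the DOCX as a package and reads word/document.xml.
    public static string FromFile(string filename)
    {
        using (Package package = Package.Open(filename, FileMode.Open, FileAccess.Read))
        {
            // Assumption: the main body part is /word/document.xml.
            Uri partUri = PackUriHelper.CreatePartUri(
                new Uri("/word/document.xml", UriKind.Relative));
            PackagePart part = package.GetPart(partUri);
            using (Stream stream = part.GetStream(FileMode.Open, FileAccess.Read))
            {
                XmlDocument xmldoc = new XmlDocument();
                xmldoc.Load(stream);
                return FromXml(xmldoc);
            }
        }
    }

    // Hypothetical helper: emits the text of each w:p paragraph followed by a
    // line break, so words at paragraph boundaries are not fused together.
    public static string FromXml(XmlDocument xmldoc)
    {
        XmlNamespaceManager ns = new XmlNamespaceManager(xmldoc.NameTable);
        ns.AddNamespace("w", W);
        StringBuilder text = new StringBuilder();
        foreach (XmlNode paragraph in xmldoc.SelectNodes("//w:p", ns))
        {
            text.AppendLine(paragraph.InnerText);
        }
        return text.ToString();
    }
}
```

Calling DocxText.FromFile(filename) should then give you one line of text per paragraph. One caveat: paragraphs nested inside other paragraphs (text boxes, for example) would be picked up twice by the //w:p query, so treat this as a starting point rather than a finished parser.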
Now to do some reading on XLSX and PPTX formats...