Parsing text content out of different file formats for Searcharoo (or other search engines) can be accomplished in a number of ways, including writing your own parser (e.g. not too difficult for HTML) or using an IFilter loader.
However, there are always going to be new document types or formats where you want to build a custom parser, and today the new Word 2007 DOCX format is an example: I don't have Word 2007 installed on my PC, so I doubt there are any IFilter implementations for it lying around here either.
A bit of background: the DOCX format is basically a ZIP file containing a directory tree of XML files, and from what I can gather the main body of a (Word 2007) DOCX file is located in word/document.xml within the main ZIP archive.
Using an open source .NET ZIP library built on System.IO.Compression, it's relatively simple to open a DOCX file, extract document.xml and read its InnerText, like this:
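using System;
using System.IO;
using System.Xml;
using Ionic.Utils.Zip; // C# is case-sensitive; this is the casing I believe the library uses

// ... your code to populate the DOCX filename here ...
using (ZipFile zip = ZipFile.Read(filename))
using (MemoryStream stream = new MemoryStream())
{
    // Copy the main document part out of the archive into memory
    zip.Extract(@"word/document.xml", stream);
    stream.Seek(0, SeekOrigin.Begin); // don't forget to rewind before reading
    XmlDocument xmldoc = new XmlDocument();
    xmldoc.Load(stream);
    // InnerText concatenates every text node under the root element
    string PlainTextContent = xmldoc.DocumentElement.InnerText;
}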
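One thing to watch with InnerText is that it joins every text node together, so paragraph boundaries disappear and the last word of one paragraph runs into the first word of the next. Here is a minimal sketch of a paragraph-aware alternative. It assumes the standard WordprocessingML namespace, and the class and method names (DocxText, ExtractParagraphs) are made up for illustration rather than being part of Searcharoo:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class DocxText
{
    // Main WordprocessingML namespace used by word/document.xml
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns one string per <w:p> paragraph element.
    public static List<string> ExtractParagraphs(Stream documentXml)
    {
        XmlDocument xmldoc = new XmlDocument();
        xmldoc.Load(documentXml);

        XmlNamespaceManager ns = new XmlNamespaceManager(xmldoc.NameTable);
        ns.AddNamespace("w", W);

        List<string> paragraphs = new List<string>();
        foreach (XmlNode p in xmldoc.SelectNodes("//w:p", ns))
        {
            StringBuilder sb = new StringBuilder();
            // <w:t> elements hold the visible text of each run
            foreach (XmlNode t in p.SelectNodes(".//w:t", ns))
            {
                sb.Append(t.InnerText);
            }
            paragraphs.Add(sb.ToString());
        }
        return paragraphs;
    }
}
```

Calling string.Join("\n", DocxText.ExtractParagraphs(stream).ToArray()) on the rewound MemoryStream would then give text with line breaks between paragraphs. I haven't tested this against documents with text boxes, where paragraphs can be nested inside other paragraphs and their text may come out twice.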
If you're using .NET 3.0, the System.IO.Packaging.ZipPackage class is probably a better bet than the open source ZIP library for 2.0.
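For what it's worth, a rough sketch of the System.IO.Packaging route might look like the following. It goes through the abstract Package class (Package.Open hands back a ZipPackage by default), needs a reference to WindowsBase.dll, and still assumes the body lives at /word/document.xml. The class and method names (DocxPackageReader, ReadPlainText) are hypothetical:

```csharp
using System;
using System.IO;
using System.IO.Packaging; // requires a reference to WindowsBase.dll
using System.Xml;

public static class DocxPackageReader
{
    // Hypothetical helper; assumes the main body is the part /word/document.xml
    public static string ReadPlainText(string filename)
    {
        using (Package package = Package.Open(filename, FileMode.Open, FileAccess.Read))
        {
            Uri partUri = PackUriHelper.CreatePartUri(
                new Uri("/word/document.xml", UriKind.Relative));

            // Return nothing rather than throw if the part is missing
            if (!package.PartExists(partUri))
            {
                return String.Empty;
            }

            PackagePart part = package.GetPart(partUri);
            using (Stream stream = part.GetStream(FileMode.Open, FileAccess.Read))
            {
                XmlDocument xmldoc = new XmlDocument();
                xmldoc.Load(stream);
                return xmldoc.DocumentElement.InnerText;
            }
        }
    }
}
```

The part stream can be handed straight to XmlDocument.Load, so the intermediate MemoryStream and the Seek call are no longer needed. A more robust version would probably look up the main document part through the package relationships instead of hard-coding the path, but I haven't tried that here.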
Now to do some reading on XLSX and PPTX formats...