However, there are always going to be new document types or formats where you want to build a custom parser, and today the new Word 2007 DOCX format is an example: I don't have Word 2007 installed on my PC, so I doubt there are any IFilter implementations for it lying around here either.
A bit of background: the DOCX format is basically a ZIP file containing a directory tree of XML files, and from what I can gather the main body of a (Word 2007) DOCX file is located in word/document.xml within the main ZIP archive.
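To see that directory tree for yourself, here is a minimal sketch that lists the parts inside a DOCX file. It assumes .NET 3.0 or later with a reference to WindowsBase.dll (for System.IO.Packaging); the class and method names (DocxParts, PartUris) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging; // assumption: WindowsBase.dll is referenced (.NET 3.0+)

public static class DocxParts
{
    // Hypothetical helper: returns the URI of every part in the package,
    // e.g. "/word/document.xml" for the main body described above.
    public static List<string> PartUris(string filename)
    {
        List<string> uris = new List<string>();
        using (Package package = Package.Open(filename, FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
            {
                uris.Add(part.Uri.ToString());
            }
        }
        return uris;
    }
}
```

Printing the result with a simple foreach over DocxParts.PartUris(filename) should show word/document.xml alongside whatever other parts the file contains.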
Using a .NET ZIP library based on System.IO.Compression, it's relatively simple to open a DOCX file, extract document.xml and read its InnerText, like this:
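using System;
using System.IO;
using System.Xml;
using Ionic.Utils.Zip; // assumption: namespace of the open source ZIP library; adjust to the one you use
... your code to populate the DOCX filename here ...
using (ZipFile zip = ZipFile.Read(filename))
{
    MemoryStream stream = new MemoryStream();
    zip.Extract("word/document.xml", stream); // assumption: the exact extract call depends on your ZIP library version
    stream.Seek(0, SeekOrigin.Begin); // don't forget
    XmlDocument xmldoc = new XmlDocument();
    xmldoc.Load(stream); // parse the extracted XML
    string PlainTextContent = xmldoc.DocumentElement.InnerText;
}
If you're using .NET 3.0, the System.IO.Packaging.ZipPackage class is probably a better bet than the open source ZIP library for 2.0.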
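For the System.IO.Packaging route mentioned above, a rough equivalent might look like the sketch below. It assumes .NET 3.0 or later with a reference to WindowsBase.dll, and that the main body really is at /word/document.xml as described earlier; the DocxText class and its Extract method are hypothetical names, not part of any library.

```csharp
using System;
using System.IO;
using System.IO.Packaging; // assumption: WindowsBase.dll is referenced (.NET 3.0+)
using System.Xml;

public static class DocxText
{
    // Assumption: the main body part is always at this location.
    private static readonly Uri MainPartUri =
        new Uri("/word/document.xml", UriKind.Relative);

    // Hypothetical helper: opens the DOCX as a package and returns the
    // concatenated text of every node in the main document part.
    public static string Extract(string filename)
    {
        using (Package package = Package.Open(filename, FileMode.Open, FileAccess.Read))
        {
            PackagePart part = package.GetPart(MainPartUri);
            using (Stream stream = part.GetStream(FileMode.Open, FileAccess.Read))
            {
                XmlDocument xmldoc = new XmlDocument();
                xmldoc.Load(stream);
                return xmldoc.DocumentElement.InnerText;
            }
        }
    }
}
```

Calling DocxText.Extract(filename) gives the same kind of result as the ZIP library version. Either way, InnerText simply concatenates the text of all child nodes, so paragraphs run together with no separator; that is probably fine for feeding a search index but not for display.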
Now to do some reading on XLSX and PPTX formats...