using System;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Windows.Documents;
using System.Windows.Xps.Packaging;
using System.Windows.Media.Imaging;
using System.Windows.Media;
// class definition...
static public void SaveXpsToPng(string xpsFileName)
{
    XpsDocument xpsDoc = new XpsDocument(xpsFileName, System.IO.FileAccess.Read);
    try
    {
        FixedDocumentSequence docSeq = xpsDoc.GetFixedDocumentSequence();
        for (int pageNum = 0; pageNum < docSeq.DocumentPaginator.PageCount; pageNum++)
        {
            DocumentPage docPage = docSeq.DocumentPaginator.GetPage(pageNum);
            // gather the text of each Glyphs element that is a direct child of the page
            string txtPage = string.Empty;
            foreach (System.Windows.UIElement uie in ((FixedPage)docPage.Visual).Children)
            {
                if (uie is System.Windows.Documents.Glyphs)
                {
                    txtPage += ((System.Windows.Documents.Glyphs)uie).UnicodeString;
                }
            }
            // page sizes are in 1/96 inch units, so scale by 300/96 to render at 300dpi
            RenderTargetBitmap renderTarget =
                new RenderTargetBitmap((int)(docPage.Size.Width * 300/96),
                                       (int)(docPage.Size.Height * 300/96),
                                       300,
                                       300,
                                       PixelFormats.Pbgra32);
            renderTarget.Render(docPage.Visual);
            BitmapEncoder encoder = new PngBitmapEncoder();
            encoder.Frames.Add(BitmapFrame.Create(renderTarget));
            string filename = xpsFileName + ".Page" + pageNum;
            // 'using' closes the stream even if the encoder throws
            using (FileStream pageOutStream = new FileStream(filename + ".png", FileMode.Create, FileAccess.Write))
            {
                encoder.Save(pageOutStream);
            }
            // oh, and save the text too
            System.IO.File.WriteAllText(filename + ".txt", txtPage);
        }
    }
    finally
    {
        // release the file handle on the .xps package
        xpsDoc.Close();
    }
}
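One caveat with the loop above: it only looks at the direct children of the FixedPage, so any Glyphs elements nested inside a Canvas would be skipped. A rough sketch of a recursive version follows. The GlyphText class and its Collect method are made-up names for illustration (they are not part of the code above), and I'm assuming the page's visual tree is populated by the time it is walked.

```csharp
using System.Text;
using System.Windows;
using System.Windows.Documents;
using System.Windows.Media;

static class GlyphText
{
    // Hypothetical helper: depth-first walk of the visual tree,
    // appending the UnicodeString of every Glyphs element found.
    public static void Collect(DependencyObject node, StringBuilder text)
    {
        Glyphs glyphs = node as Glyphs;
        if (glyphs != null)
        {
            text.Append(glyphs.UnicodeString);
        }
        // recurse into whatever children this visual has (Canvas, etc.)
        int count = VisualTreeHelper.GetChildrenCount(node);
        for (int i = 0; i < count; i++)
        {
            Collect(VisualTreeHelper.GetChild(node, i), text);
        }
    }
}
```

It could stand in for the foreach loop along these lines: create a StringBuilder, call GlyphText.Collect(docPage.Visual, sb), then use sb.ToString() as the page text.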
The output files look like this. Two things to note:
1) the PNGs are between 1 and 3MB in size (the code renders them at 300dpi)
2) there is a .txt file for each image (look carefully in the code above).
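To make the 300dpi arithmetic concrete, here is a small sketch of the same scaling the RenderTargetBitmap call uses. PageRender and PixelsAtDpi are names I've invented for the example, and the 816 x 1056 page size (US Letter at 96 units per inch) is just an assumed sample, not something measured from the documents above.

```csharp
static class PageRender
{
    // Hypothetical helper: convert a length in 1/96 inch units
    // to a pixel count at the requested dpi (truncating, like the cast above).
    public static int PixelsAtDpi(double size96, double dpi)
    {
        return (int)(size96 * dpi / 96);
    }
}
```

For the assumed Letter page, PageRender.PixelsAtDpi(816, 300) gives 2550 and PageRender.PixelsAtDpi(1056, 300) gives 3300, so each page is rendered as a 2550 x 3300 pixel bitmap before PNG compression.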
Here is one of the images and its associated .txt file
Why output a text file you ask? For indexing and searching! It's debatable whether it makes sense to use DeepZoom as a mechanism for publishing 'documents' when you can use PDF or the Silverlight 2 XPS Viewer; however, for graphically heavy content (say... magazines, photo books) a searchable, indexable DeepZoom collection could actually be a better user experience (particularly for browsing).
Anyway, this is only half the solution - we've got images and text OUT of the XPS document with that code, but we haven't yet processed them IN to DeepZoom via the DeepZoom Composer (Seadragon/Mermaid)... stay tuned...
Hi Craig,
Thanks for posting this code. It is very useful.
I got particularly interested in the way you loop through the UIElements of the FixedPage visual to extract the text from each page.
I am trying to do the same with a FlowDocumentPageViewer... in which case the equivalent to FixedPage is the PageVisual, which is part of an MS.Internal namespace.
Do you have any idea how to do the same - extract the page text - from a FlowDocumentPageViewer?
Thanks,
Adriano
Adriano, sorry I am not really an expert in XPS - I just hacked around with the internal documents from some XPS samples I generated (renamed to .ZIP and opened up all the XML) to figure that out.
Sorry I can't really offer any better help than Googling. If I come across any ideas I'll be sure to post them up.
foreach (System.Windows.UIElement uie in ((FixedPage)docPage.Visual) gives me this error:
foreach cannot operate on variables of type System.Windows.Documents.FixedPage because System.Windows.Documents.FixedPage does not contain a public definition for GetEnumerator()
I notice that there's a closing parenthesis missing on that statement, am I missing something?
Thanks
Charlymoon, unfortunately the fixed-width blog template has made the rest of the line 'invisible'. If you'd copy-pasted the code rather than re-typing, you should have got the rest of the line...
foreach (System.Windows.UIElement uie
    in ((FixedPage)docPage.Visual).Children)
{
    if (uie is System.Windows.Documents.Glyphs)
    {
        txtPage += ((System.Windows.Documents.Glyphs)uie).UnicodeString;
    }
}