Thursday 15 May 2008

DeepZoom "Publisher" with Xps (part 1)

Further to this post about turning a PDF document into a Silverlight 2.0 DeepZoom image, here is some code to parse an XPS document into individual PNG images which can then be turned into DeepZoom content. It's slightly updated from this MSDN forum post: How to convert xps documents to other formats, for example bmp.

using System.IO;
using System.Text;
using System.Windows.Documents;
using System.Windows.Media;
using System.Windows.Media.Imaging;
using System.Windows.Xps.Packaging;
// class definition...
// Needs references to PresentationCore, PresentationFramework, WindowsBase and ReachFramework.
// WPF visuals must be created on an STA thread.
public static void SaveXpsToPng(string xpsFileName)
{
    const double dpi = 300;
    XpsDocument xpsDoc = new XpsDocument(xpsFileName, FileAccess.Read);
    try
    {
        FixedDocumentSequence docSeq = xpsDoc.GetFixedDocumentSequence();
        DocumentPaginator paginator = docSeq.DocumentPaginator;

        for (int pageNum = 0; pageNum < paginator.PageCount; pageNum++)
        {
            DocumentPage docPage = paginator.GetPage(pageNum);
            StringBuilder txtPage = new StringBuilder();

            // Collect text from the page's top-level Glyphs elements only;
            // Glyphs nested inside other containers are not visited.
            foreach (System.Windows.UIElement uie in ((FixedPage)docPage.Visual).Children)
            {
                Glyphs glyphs = uie as Glyphs;
                if (glyphs != null)
                {
                    txtPage.Append(glyphs.UnicodeString);
                }
            }

            // Page size is in 1/96 inch units, so scale by dpi/96 to get pixels.
            RenderTargetBitmap renderTarget =
                new RenderTargetBitmap((int)(docPage.Size.Width * dpi / 96),
                                       (int)(docPage.Size.Height * dpi / 96),
                                       dpi,
                                       dpi,
                                       PixelFormats.Pbgra32);
            renderTarget.Render(docPage.Visual);

            BitmapEncoder encoder = new PngBitmapEncoder();
            encoder.Frames.Add(BitmapFrame.Create(renderTarget));
            string filename = xpsFileName + ".Page" + pageNum;
            using (FileStream pageOutStream = new FileStream(filename + ".png", FileMode.Create, FileAccess.Write))
            {
                encoder.Save(pageOutStream);
            }
            // oh, and save the text too
            File.WriteAllText(filename + ".txt", txtPage.ToString());
        }
    }
    finally
    {
        xpsDoc.Close();
    }
}

The output files look like this. Two things to note:
1) the PNGs are between 1 and 3 MB each, rendered at 300 dpi (the value passed to RenderTargetBitmap in the code above)
2) there is a .txt file for each image (written by the File.WriteAllText call at the end of the loop).
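
For reference, the pixel dimensions of each PNG come from the page size (WPF measures it in units of 1/96 inch) scaled by the target dpi. Here is a minimal sketch of that arithmetic, assuming a US Letter page of 816 x 1056 units; the page size and the helper name ToPixels are illustrative assumptions, not part of the code above.

```csharp
using System;

static class PageSizeMath
{
    // WPF page sizes are expressed in device-independent units of 1/96 inch.
    const double UnitsPerInch = 96.0;

    // Hypothetical helper: converts a length in WPF units to pixels at the given dpi.
    public static int ToPixels(double units, double dpi)
    {
        return (int)(units * dpi / UnitsPerInch);
    }

    static void Main()
    {
        // Assumed example page: US Letter, 8.5 x 11 inches = 816 x 1056 units.
        int width = ToPixels(816, 300);
        int height = ToPixels(1056, 300);
        Console.WriteLine(width + " x " + height);  // prints "2550 x 3300"
    }
}
```

For this assumed page size, an uncompressed 32-bit bitmap would be 2550 x 3300 x 4 bytes, roughly 33.7 MB, so PNGs of 1 to 3 MB reflect substantial compression.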



Here is one of the images and its associated .txt file



Why output a text file, you ask? For indexing and searching! It's debatable whether it makes sense to use DeepZoom as a mechanism for published 'documents' when you can use PDF or the Silverlight 2 XPS Viewer. However, for graphically heavy content (say... magazines, photo books), a searchable, indexable DeepZoom collection could actually be a better user experience (particularly for browsing).
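
To make the indexing idea a bit more concrete, here is a minimal sketch of a word-to-page lookup built from per-page text. The PageIndex class, its method names, and the sample strings are assumptions for illustration only; in practice the input would be the contents of the .txt files written by SaveXpsToPng, and since that code concatenates text runs without separators, real output may need smarter tokenizing than this.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original code: maps each word to the pages containing it.
class PageIndex
{
    private readonly Dictionary<string, SortedSet<int>> index =
        new Dictionary<string, SortedSet<int>>(StringComparer.OrdinalIgnoreCase);

    // Splits the page text on whitespace and common punctuation, then records the page number for each word.
    public void AddPage(int pageNum, string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            SortedSet<int> pages;
            if (!index.TryGetValue(word, out pages))
            {
                pages = new SortedSet<int>();
                index[word] = pages;
            }
            pages.Add(pageNum);
        }
    }

    // Returns the page numbers containing the word (case-insensitive), or an empty list if there are none.
    public IList<int> Find(string word)
    {
        SortedSet<int> pages;
        return index.TryGetValue(word, out pages)
            ? new List<int>(pages)
            : new List<int>();
    }
}

static class Program
{
    static void Main()
    {
        var idx = new PageIndex();
        // Sample strings standing in for the per-page .txt files.
        idx.AddPage(0, "Deep Zoom for magazines");
        idx.AddPage(1, "Zoom into photo books");
        Console.WriteLine(string.Join(",", idx.Find("zoom")));  // prints "0,1"
    }
}
```

A search result from such an index gives a page number, which maps straight back to the matching PNG (and so to the matching DeepZoom image).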

Anyway, this is only half the solution - we've got images and text OUT of the Xps document with that code, but we haven't yet processed them IN to DeepZoom via the DeepZoom Composer (Seadragon/Mermaid)... stay tuned...

4 comments:

  1. Hi Craig,

    Thanks for posting this code. It is very useful.

    I got particularly interested in the way you loop through the UIElements of the FixedPage visual to extract the text from each page.

    I am trying to do the same with a FlowDocumentPageViewer... in which case the equivalent of FixedPage is the PageVisual, which is part of an MS.Internal namespace.

    Do you have any idea how to do the same - extract the page text - from a FlowDocumentPageViewer?

    Thanks,

    Adriano

  2. Adriano, sorry I am not really an expert in XPS - I just hacked around with the internal documents from some XPS samples I generated (renamed to .ZIP and opened up all the XML) to figure that out.

    Sorry I can't really offer any better help than Googling. If I come across any ideas I'll be sure to post them up.

  3. foreach (System.Windows.UIElement uie in ((FixedPage)docPage.Visual) gives me this error:
    foreach cannot operate on variables of type System.Windows.Documents.FixedPage because System.Windows.Documents.FixedPage does not contain a public definition for GetEnumerator()

    I notice that there's a closing parenthesis missing on that statement. Am I missing something?
    Thanks

  4. Charlymoon, unfortunately the fixed-width blog template has made the rest of the line 'invisible'. If you'd copy-pasted the code rather than re-typing, you should have got the rest of the line...

    foreach (System.Windows.UIElement uie
        in ((FixedPage)docPage.Visual).Children)
    {
        if (uie is System.Windows.Documents.Glyphs)
        {
            txtPage += ((System.Windows.Documents.Glyphs)uie).UnicodeString;
        }
    }

