Saturday, October 15, 2016

Preparing an EPUB as a Web project

[minor edits on 5 Feb 2017]

Having explored various approaches to publishing an e-book, as described in my previous post, I came to the conclusion that I needed to manually produce Thursday’s Lotus according to an EPUB standard.  This meant working directly on the HTML and CSS sources, which I broadly describe here with a little illustration — the discussion assumes some experience of Web development.

To get started, I browsed the KDP online help on formatting files. The guidance is more about the overall workflow, with only basic information about HTML.  For the details it directs you to the Kindle Publishing Guidelines, which should serve as the main reference source.  I was slow to come across this guide, which helpfully discusses the pathways that I determined in a meandering way for myself, but I don’t think it would have affected the path I took.

However, these guidelines are not so practical as a means to actually learn how to produce EPUB-compliant files as a novice.  To fill this gap I bought a copy of Paul Salvette’s The eBook Design and Development Guide, which is very much a hands-on manual, informative and well laid out.  What’s more his company generously provides many helpful resources for developers, including boilerplates, some of which I used as a basis for my own templates. The main limitation is that it was published in 2012 and oriented around the EPUB2 standard; version 3 of the standard has since emerged and the previous version IDPF marks as obsolete. Nevertheless, this guide clearly describes the principles behind the EPUB standards, so that I’m quite sure that it will be relatively straightforward to take on board future iterations.  Hence I had no hesitation in writing a positive review.  In practice so far I’ve found the online publishers, not just Amazon, readily support this (it’s rather like web browsers will happily display old Web pages, though in most cases this doesn’t go as far as the distracting <blink> tag!)

Document styles in MS Word and Filtered HTML outputs

Salvette is quite adamant that one should write an e-book from the ground up.  Certainly it makes for a much cleaner document from the outset.  However, on this occasion I acted contrary to his advice, starting with a Word document and exporting it as filtered HTML (but not saving as filtered HTML, which keeps a lot of internal MS Office code and can increase the file size severalfold).  One of the reasons I did this was to avoid having to re-enter over 1000 hyperlinked bookmarks in the index or else modify the Word document in ways that are best left to a text editor.  Besides, I could learn a lot about the structure of a Word document and its characteristics.  I then gradually simplified the HTML.  Along the way I realised that the wysiwyg interface made it easy to develop bad authoring habits such as the manual insertion of spaces and the inconsistent use of styles.

I can illustrate this with a small example.  Thursday’s Lotus includes a glossary of terms with entries in Thai and Romanised Pali (this is included Amazon’s ‘Look Inside’ preview).  The first entry is for Abhidhamma and in Word it looks like:

I had realised that it makes sense to define a style for every distinct type of content, so here I defined one for ‘Glossary Definition’ and chose to display it as indented.  The ‘Modify Styles’ dialog, shows these and other details.

Word style dialog box for 'glossary definition' with various customisation options

An indication of the potential issues that would arise in exporting to HTML becomes readily apparent.  For instance, I chose the font face, Gentium Plus, because it has a classical look that works will on paper and it has broad support for glyphs, including characters with diacritics. However, it’s not included as an option in the Kindle.  Underneath the preview window there is a specification of the font support, with explicit reference to Thai language and exact line spacing of 12pt, which I had deliberately chosen as a means to help control the vertical layout because it includes many photographs.  These definitions would not make sense for an e-book. 

When saved by Word as filtered HTML this glossary item was rendered as:

Word filtered HTML code for a glossary entry

It’s a little bloated and the use of physical markup tags, and  , are slightly deprecated, but it is not unmanageable and the main thing is that the structure is sound and it retains all the semantics, including denoting the Thai language instance with a lang=“th” attribute inside a tag.  In fact, surveying the entire document, it appeared that every Thai language instance was denoted this way — useful because it can be used by browsers, e-readers and so on to interpret Thai content distinctly.   One other thing to notice: actually, I had defined two styles: one for the entry term and another for the actual definition, but it became evident that I never got round to using the Glossary entry style.

For a full document, even one that’s of modest sized, the HTML export produces a large single file that can look quite intimidating because the content appears to be engulfed by CSS formatting, starting with masses of font definitions, with numerous built-in Microsoft classes, defining long hierarchies of selectors with many declarations.  Furthermore, as even this small example suggests, there was extensive use of tags, liberally sprinkled across the document.  A lot of this very fine-grained control is to ensure precise print-ready documents, most of which is unwanted for reflowable content. But there extraneous material is applied consistently and it’s a matter of understanding the logic of what is applied and how it was applied.

Cleaning and Preparing (X)HTML for EPUB

So I then began the process of cleaning the HTML, removing font face definitions, unused CSS selectors (there are tools like Dust-Me selectors that give the lowdown) and gradually applying regular expression search/replace to successively removing excess spaces.  There are quite a few tools available that clean HTML, but for EPUB, there are additional assumptions that can be made that affect what to delete and what to replace.  For example, fonts may be converted to use relative sizing (‘em’ units) from fixed points (apart perhaps from titles and headings being positioned relative to top of screen).

After a number of such transformations the glossary definition became:

Cleaned HTML code for a glossary entry

I deleted almost all style definitions from the <span> tags, but not the language attribute, whilst declarations that I thought relevant, such as indentation, I folded into the corresponding named classes.


The output also includes a folder of images, which are generally rather small.  Compared with the proliferation in the main text, this may seem incidental, but I decided to attend to this early on because the images, of which there are about 100, are an important part of the book.  It’s important to note that whereas there’s practically no limit on image file size for CreateSpace and they thus should ideally be of high resolution (Amazon recommends at least 300dpi), there is a caveat with KDP. Depending on the royalty options, the resulting file incurs a region-dependent ‘delivery cost’, e.g. 10p per MB for books sold in the UK.

To keep costs modest, starting with high resolution sources, I used Irfanvew to carry out some batch image conversions to output images to have long side with at most 800 pixels and I also applied some compression.  The net result was a folder with a total of about 5MB, but I suspect I could have reduced this further because I forgot to strip metadata.  It was dealing with issues like this that already indicated it was going to take me a few weeks rather than a few days!

(X)HTML Validation

I initially worked towards HTML5, with the intention of ensuring that as a late step I correctly serialize this as XML.  Only later did I discover that EPUB2 specifies as its base content format XHTML1.1, whereas EPUB3 specifies XML serialisation of HTML5, i.e. XHTML5 (aka “polyglot” HTML5).  Fortunately, for the relatively simple structure in my book, it meant little more than changing the Web document headers.

For Web standards I generally make a beeline for the World Wide Web Consortium (W3C), which has for a long time helpfully provided an online validator, currently However, after using this repeatedly I felt I ought not to keep draining its resources, so I downloaded and installed a copy from GitHub, via the project’s About page, following the instructions to build the service

One particular error I ran across because of my use of accented characters was:

“Warning: Text run is not in Unicode Normalization Form C.”

This refers to one or more characters that can be generated in multiple ways – notably as a precomposed (single) character or as a decomposed sequence of characters.  There are several normalization methods; NFC normalized text, which is recommended for Web pages, opts for precomposed.  I found a helpful thread with a nice solution on Stackoverflow: in BabelPad just copy and paste across and then select: Convert -> Normalisation form -> To NFC.

EPUB Preparation and Validation

Having refined the HTML, previewed and validated, I then considered the production of the EPUB as a package, i.e. a nested series of folders with sources and metadata. It was time to write some PHP and I loaded up Netbeans, which is my usual IDE, partly because it is cross-platform.  The book itself had to be split into a number of files and so I wrote a script to do that according to sections whilst ensuring that anchors continued to work afterwards.  To carry out manipulations on a DOM object created from the single file (before it was split), I used PHP Simple HTML DOM Parser,  and then to assemble the XML files I used the SimpleXML extension.   I created a logical table of contents and also duly specified basic core metadata such as title, publisher, unique identifier and subjects, though I’m not sure how much these are actually used.  In addition to Salvette’s guide, I browsed various fora, finding some succinct and sage advice, such as that from Micah Sisk and others.

For EPUB the final step in packaging is to bundle everything as a ZIP file with a .epub extension; one caveat here is to ensure that the mimetype file is not compressed (there is usually an option available to do this, e.g -0 for compression-level 0).  Then to validate what I had, I turned to the EPUB Validator,  Unfortunately at the times I tried the online service it seemed to be weighed down and couldn’t even complete the processing on the files I submitted.  So, as with the W3C validator, I proceeded to download and install my own copy, in this case epubcheck.  Once in place, I had it available to run on the command line and added an alias to .bash_profile for convenience.  The tool provides precise feedback about the errors: what they are and where they occur in your code, and with the help of occasional online searches I found all the answers I needed, learning, for example, that every file used in the publication must be included correctly in the manifest!

Conversion to Kindle formats and Preview Tools

Whilst correctness is reassuring, it still remains necessary to inspect what’s presented and I used a number of applications.  Even before it gets converted I ran some initial tests on a desktop Web browser.  Although this interprets the HTML quite differently, it’s a familiar interface and can highlight issues of inconsistency in fonts and images, notably in their sizes and alignments.  

Once I had prepared an ePub file I could then upload to KDP, which carries out a conversion to its own formats and offers a browser-based preview tool.  I found it slightly disconcerting process because it would give the impression of some errors that I hadn’t expected.  Then I read various comments on fora recommending downloading the resulting .mobi file and a preview application, Kindle Previewer.  Furthermore, once downloaded you can use ‘Send to Kindle’ for a more reliable test on actual hardware.  For the conversion, Amazon provides a tool called KindleGen from various files (as listed in its page on supported formats).  It’s freely available to download and use and comes bundled with the offline previewer.  Taken together there are numerous preview options and I don’t think I used them so effectively; next time I need to be more systematic and thorough.  

Note there are two versions of KindlePreviewer (2 and 3); working on a Mac, I was reluctant to install version 2 because of the requirement to use an old version of Java (6) and it seemed a bit fiddly.  So I was relieved to find the beta version 3, which has no such issue and the installation worked fine.  It was then a matter of refining the HTML and CSS and rebuilding, which was simple to do with the scripts in place the e-book was ready for publication, well almost...

Wednesday, October 12, 2016

Publication of Thursday’s Lotus as an e-book: finding a suitable approach

Thursday’s Lotus is now available as an e-book through Kindle Store and Kobo, with others to follow. I’m glad I made the effort and hope this post will encourage other newcomers to explore further some of the options to contribute to the world of electronic publishing.

As someone whose career has been largely in technology, I found the preparation and publication of a hardcopy book a fascinating process, resulting in the very satisfying experience of holding a physical copy in my hands. But, of course, nowadays it’s expected that books be made available in digital formats for reading on handheld devices, i.e. as e-books. As I had other work commitments I concentrated initially on the paperback, making it as polished as I could; I felt pursuing an electronic version in parallel would have negated against that, so I put that thought to one side and merely indicated that I expected an e-book version to emerge in 6-12 months. Then, after a little break, I decided to start focusing on the e-book version in August and I published it in September.

Thursday's Lotus paperback alongside Kindle viewer on Android tablet and a Kindle e-reader

For those who have already self-published a paperback, I’ll share a few general observations, but my thoughts are mainly for those who are interested in do-it-yourself (DIY) and have some experience of creating Web pages. It actually took me quite a bit longer than I had anticipated largely because I chose to pursue the DIY route and along the way I wrote some software to do some ‘heavy lifting’, but I found this rewarding and it means I can now help other authors produce both hard copy and electronic copy.

Just as Amazon’s CreateSpace offers a dedicated service for publishing paperbacks, Kindle Direct Publishing (KDP) is the analogue for e-books, and unsurprisingly CreateSpace points to KDP, easing the transition by offering to share the front cover and metadata. So, yes, it seemed a natural next step. However, one of the first tasks was to understand what exactly I was meant to be preparing for it wasn’t immediately obvious. After some searching, I learnt about an open technical standard called EPUB, the work of the International Digital Publishing Forum, a trade and standards association for the digital publishing industry. It consists of a markup specification, where document content is specified as (X)HTML, so it’s like preparing a special class of Web pages. Then there’s the specification of how everything should be packaged, table of contents, navigation etc., which is in a series of XML files. These files get bundled in a ZIP and renamed with .epub extension, and voilà! You have an EPUB instance, or ePub file. Just as there are a range of authoring tools for Web development in general, so there are various options for producing EPUB.

All well and good, except the list of members, doesn’t appear to include Amazon. Indeed Amazon has its own proprietary formats, Mobipocket (abbreviated .mobi) and Kindle Format 8 (.kf8). However, Amazon is pragmatic, and its KDP service supports uploads in various formats, including MS Word, HTML, ePub, MOBI and so on. Nevertheless, expectations need to be tempered as the range of options hides the fact that for all but a simple text document, without revision of that document, these options will produce results that vary a great deal in quality. In almost all cases the original manuscript, typically authored in Word, will need to be carefully edited to ensure tidy and consistent formatting.

Plenty of pointers are available; Amazon provides a Simplified Formatting Guide and in the beginning I found it instructive to experiment with a transitional version, applying one of the freely available Word templates designed for Kindles. If I had been in a hurry, then I might have followed this path, but for a quite complex book layout, I was not confident that I could have sufficient control over the final output.

Irrespective of the method eventually chosen, I recommend becoming familiar with how KDP processes and prepares a file (.mobi, that actually includes the .kf8 version as well) that it would use as a publication candidate. Within the KDP dashboard, you can preview the result online using the Web-based previewer and download the file to inspect on your device (just use Send to Kindle). When I tried this, downloading the first drafts for viewing on my Kindle e-reader (7th generation) all the words were still there and I was impressed to see that the endnotes (which I had arranged on a per-chapter basis) had been converted to footnotes with working hyperlinks between text and notes.

Beyond this was not so good — it had numerous issues:

  • table of contents, list of figures etc had ragged layouts
  • in fact, generally the layout was all over the place, with chapter headings not properly aligned
  • the images had become small and were variously centred or left-aligned
  • varied font sizes
  • large gaps in text
  • no logical table of contents (i.e. the one available via an e-reader menu item)
  • the index, which had originally been designed for a fixed layout, was replicated and in its static form with reference to hardcopy page numbers made little sense

In pondering these issues I learnt quite a lot about going from fixed layouts to reflowable layouts. I pondered quite a lot the last one in particular and eventually realised that there might be a path to a meaningful solution. I had originally created an index the traditional way and been faced with the traditional problem afterwards: on completion of a draft I had worked my way through the book to laboriously compile the index, but subsequently needed to revise it quite considerably. I then discovered the method of indexing in Word using bookmarks as the targets of index entries and the (not particularly robust) Dexter add-in, for managing these entries and generating automatically the index from the bookmarks. Thanks to Robert Papini, the index was re-made this way, to keep evolving without great effort. Furthermore, for the e-book I used another tool from the makers of Dexter, IndexLinker, which turned the index into a set of hyperlinks to the respective bookmarks. Saving this as (filtered) HTML preserves the links and this has been successfully carried over to the e-books.

Much as I could solve some particular problems, tweaking the Word document and uploading would not be completely satisfactory, so I focused attention on the EPUB format, reassured by various messages in fora that Amazon has invested considerable effort in converting to .mobi/.kf8 At this stage, I had to decide how far I wanted to proceed (really how fussy I was to be about the final rendering). For many, particularly those who don’t want to immerse themselves in the technicalities of HTML and XML, there are authoring environments that facilitate the process and can produce good results. Probably the most popular of these is Calibre, which is a whole environment for production and management of e-books.

I gave it a quick go and it could correctly display chapter headings, with the lotus image properly aligned, but inevitably it reflected the quirks in the original and some issues would have to be corrected in the in-built editor. At this point you are exposed to editing HTML and CSS, and whilst the import of a Word document does a lot of tidying (and smartly splits a large file into smaller files based on sections), it does retain much of Word’s original markup and superfluous spacing; also in the translation, it adds a lot of its own CSS without using semantic labels.

As I wanted to learn more about EPUB and edit/process HTML and CSS, I opted to leave Calibre for another day and use tools and techniques with which I’m familiar. I’ll describe some of these in my next post …