Thursday, February 09, 2017

Using EpubCheck 4.0.x as a Web-based EPUB validator

With the assistance of the EpubCheck Web Archive build tool, I was able to deploy EpubCheck as a Web app, ready to be used as a rudimentary web service.  There have long been intentions to make a more fully-fledged web service for EpubCheck, but for whatever reasons, as far as I can tell, these don’t appear to have been realized.  So Jason Darwin’s build tool is really useful for the likes of me who have very limited Java knowledge.

The web app merely echoes EpubCheck’s output: diagnostic information (warnings and errors) is sent to standard error, but the output is structured consistently, so we can work with it.  In this post I show a simple incorporation in PHP to carry out checks as part of a nascent system to convert from Word documents (saved as filtered HTML) to ePub.  As I develop the system, particularly in the early stages, I am using EpubCheck’s output to identify outstanding issues with the .epub file that results from conversion so far; the Web reporting adds a level of convenience.

Specifying and executing a request to the web service

To use the web service a file must be uploaded and the response processed.  For this I use curl and set up the request as follows (procedural coding style).
  $cfile = curl_file_create($epub_file, 'application/epub+zip', 'file');
  $postfield = array('file' => $cfile);
  $ch = curl_init($url);
The first line uses CURLFile (via curl_file_create), which is the recommended way to upload files from PHP 5.5.0 onwards (the @ prefix is deprecated); $epub_file is the file location of the ePub file.  Line 2 defines the field name associated with the file in the HTTP POST body.  Line 3 sets up a handle for the web service at $url, in my case, http://localhost:8080/epubcheck/Check.

A number of options are available to configure the behaviour of the request and the transfer, i.e. what data gets returned.  These options are described in the PHP documentation for curl_setopt.
// do regular POST
  curl_setopt($ch, CURLOPT_POST, 1);
// fields to POST (defined above)
  curl_setopt($ch, CURLOPT_POSTFIELDS, $postfield);
//  omit headers in the response
  curl_setopt($ch, CURLOPT_HEADER, 0); 
// track handle’s request string
  curl_setopt($ch, CURLINFO_HEADER_OUT, 1);  
// define standard headers for form upload
  curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: multipart/form-data'));
// return the transfer as a string
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  
// Timeout in seconds
  curl_setopt($ch, CURLOPT_TIMEOUT, 60); 
Then the actual request is executed using:
  $response = curl_exec($ch);
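
Pulling these steps together, a minimal sketch with basic error handling might look like the following. The function name check_epub and its exception messages are my own illustrative choices, not part of EpubCheck or epubcheck-web; the field name 'file' and the options are those used above. The Content-Type header is left out here on the understanding that cURL encodes an array passed to CURLOPT_POSTFIELDS as multipart/form-data by itself.

```php
<?php
// Hypothetical helper: POST an EPUB file to the EpubCheck web app and
// return the raw text report. Names and messages are illustrative only.
function check_epub($url, $epub_file)
{
    // Fail early, before any network activity, if the file is missing.
    if (!is_readable($epub_file)) {
        throw new InvalidArgumentException("Cannot read file: $epub_file");
    }

    $cfile = curl_file_create($epub_file, 'application/epub+zip', 'file');

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, 1);
    // An array value is sent as multipart/form-data.
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('file' => $cfile));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);

    $response = curl_exec($ch);
    if ($response === false) {
        // curl_exec() returns false on transport errors (timeout, refused connection, ...).
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }

    // Treat anything other than 200 as a failure (an assumption about the service).
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status !== 200) {
        throw new RuntimeException("Unexpected HTTP status: $status");
    }

    return $response;
}
```

With this in place the request reduces to a single call, check_epub($url, $epub_file), wrapped in a try/catch so that a stopped Tomcat shows up as a readable error rather than an empty report.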

Configuring for display in PHP

At this stage in system development I’m seeking to know what kinds of errors are being generated more than individual instances. The data output is structured as follows:

 <status type>: <file path>(<line>,<character>): <message>

We can carry out global matching with a regular expression (preg_match_all finds every match, not just the first) to split each row of data into five chunks:
  $pattern="/([a-zA-Z]+): ($pub_folder_name\/[\/a-z_\-\s0-9\.]+)\(([0-9]+),([0-9]+)\): (.+)".PHP_EOL."/";
  preg_match_all($pattern,$response,$matches);
  $response=$matches;
(For $pub_folder_name I’m actually using ‘OEBPS’, the conventional folder name in ePub files, short for Open eBook Publication Structure.)

This generates a nested array from which we can easily pick out the messages.
  • the number of issues is given by count($response[0])  
  • a list of the distinct kinds of issue (i.e. unique messages) is given by array_unique($response[5])
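
As a worked illustration, here is a minimal sketch that runs the same kind of pattern over a few made-up report lines and derives the summary figures. The sample file names and messages are invented. The pattern is a variant rather than the one above: it anchors each line with the m modifier instead of appending PHP_EOL, on the assumption that the service may emit either \n or \r\n line endings.

```php
<?php
// Made-up sample in the "<status type>: <file path>(<line>,<character>): <message>" shape.
$pub_folder_name = 'OEBPS';
$response = "ERROR: OEBPS/chapter_1.xhtml(12,34): example message one\n"
          . "WARNING: OEBPS/styles/main.css(5,2): example message two\n"
          . "ERROR: OEBPS/chapter_2.xhtml(7,10): example message one\n";

// Five capture groups: status, path, line, character, message.
// The "m" modifier makes ^ and $ match at each line boundary.
$pattern = '/^([a-zA-Z]+): (' . preg_quote($pub_folder_name, '/')
         . '\/[\/a-z_\-\s0-9.]+)\(([0-9]+),([0-9]+)\): (.+?)\r?$/m';

preg_match_all($pattern, $response, $matches);

// $matches[0] holds whole rows; $matches[1]..[5] hold the five chunks.
$issue_count   = count($matches[0]);
$by_status     = array_count_values($matches[1]);
$distinct_msgs = array_values(array_unique($matches[5]));

printf("%d issues\n", $issue_count);
foreach ($by_status as $status => $n) {
    printf("  %s: %d\n", $status, $n);
}
foreach ($distinct_msgs as $msg) {
    echo "  - $msg\n";
}
```

array_count_values on the first group gives a per-status tally, which is a small extra beyond the two figures listed above; array_values simply renumbers the keys that array_unique leaves behind.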
Add a <div> tag with a bit of CSS to define its height plus a JavaScript show/hide for neatness and we can then include a summary and the details without the inconvenience of obligatory scrolling.


Clicking on 'show report details...' expands to produce the detailed report with errors listed in numerical order:


At the moment, there are many errors to fix, but by identifying each case we can deal with them systematically to make them disappear...!

Sunday, February 05, 2017

Deploying EpubCheck 4.0.x as a Web-based EPUB validator


(10 Sept '17): Epubcheck-web updated on GitHub

I'm very pleased to report that Jason Darwin has merged some small changes I submitted (to the Ant .war build file) and made his project up-to-date, ready to use with the latest version of epubcheck.jar.

https://github.com/jcdarwin/epubcheck-web


Whilst manually creating an EPUB file for Thursday’s Lotus, I wrote some PHP code for the ‘heavy lifting’, particularly for assembling the final package. I have recently continued work on the automated support with a view to creating a general-purpose web-based system. An initial aim is to reach the stage where I can upload a book authored in MS Word (saved as filtered HTML) and use the service to generate a valid EPUB file that I can then manually tweak. Ideally, it will be good enough to publish straightaway. As part of the process, the assembled EPUB needs to be validated. Previously, I had run this separately at the command line using epubcheck.jar, as made available by the EpubCheck project, but now I needed to set this up as a web application providing a basic web service, which is the main focus of this post.

EpubCheck is a Java application; the GitHub repository shows the current release at 4.0.2. The project home page states, "EpubCheck can be run as a standalone command-line tool or used as a Java library", and the wiki provides some guidance on usage in a variety of contexts, including some GUIs, but makes no explicit mention of web usage. However, a tantalising hint is found in the distribution README file in the source code (epubcheck/src/main/assembly/README-dist.txt), which mentions "EpubCheck can be run as a standalone command-line tool, installed as a web application or used as a library." (The emphasis is my own.)

My experience of Java is limited to coding very elementary programs and deploying a few web application archive (.war) files, so realistically I need a web application archive (or sources ready to build). Whilst EpubCheck doesn’t include these, fortunately, Jason Darwin has addressed this very problem. A few years ago he wrote about the procedure on his blog, with a post entitled, Creating a WAR file for epubcheck. So it could be done. However, those instructions are for epubcheck up to version 3 and were written when the repository was using Subversion on Google Code (from which it has moved to GitHub). Accordingly he then set up a GitHub project, epubcheck-web, for version 4. By following the instructions I eventually got it working on my laptop after a few tweaks. Conveniently my development environment is also a Mac with the Homebrew package manager, and I am running the Oracle-supplied JDK, currently 1.8.

There are two separate build processes:
  1. epubcheck.jar - the standalone validator built using Maven (alternatively, this can be downloaded ready-built from the IDPF site).
  2. epubcheck.war - the web application archive built using Ant for deploying in a servlet container (I installed Tomcat locally).
For the epubcheck.jar build, one of the unit tests failed:
remote_Test(com.adobe.epubcheck.test.single_file_Test)  
  Time elapsed: 0.238 sec  <<< FAILURE!
  junit.framework.AssertionFailedError: Missing message
  at junit.framework.Assert.fail(Assert.java:50)
This concerns the processing of single files that are not ePubs. After seeing what it was attempting I skipped it, trusting that it wasn’t an issue, by specifying an exclude in pom.xml. The build then completed successfully. Having built epubcheck.jar, I turned to the second build process: the web interface to invoke the validator. Again, I generally followed the instructions, though to actually download the sources I used:
$ git clone https://github.com/jcdarwin/epubcheck-web.git
I found the key to getting it working was to ensure that epubcheck.jar is included in the right place. I simply copied the file to the webapp’s lib/ folder and then referenced it in build-war.xml, alongside the other .jar files:
<path id="epubcheckServlet.classpath">
...
  <fileset dir="${epubcheck.web.includelibs}">
    <include name="Saxon-*.jar" />
    <include name="epubcheck.jar" />
  </fileset>
I also commented out any reference to building epubcheck.jar, which I think is superfluous as far as building the web interface is concerned. Then I proceeded to build with ant and copy over the .war, as instructed. Tomcat duly deployed the webapp, with the minimalist web form:

When supplying an ePub file to validate, I was initially getting blank output. Wondering why, I started thinking it had to do with the Java classpath used by Tomcat. On reading Understanding The Tomcat Classpath - Common Problems And How To Fix Them, I examined WEB-INF/classes and WEB-INF/lib more closely and realised I was missing epubcheck.jar! It was then that I was prompted to add this to the epubcheck-web src/lib/ folder and rebuild. The resulting .war file duly increased in size by about 1MB. On redeploying the app and applying it to my EPUB file, I got a reassuring pause for processing before the output came through as expected.

Now it was ready for use as a basic web service for my PHP-based system, which I'll describe in the next post.