Thursday, February 09, 2017

Using EpubCheck 4.0.x as a Web-based EPUB validator

With the assistance of the EpubCheck Web Archive build tool, I was able to deploy EpubCheck as a Web app, ready to be used as a rudimentary web service.  There have long been plans for a more fully-fledged web service for EpubCheck but, as far as I can tell, they have not been realized.  So Jason Darwin’s build tool is really useful for people like me who have very limited Java knowledge.

The web app merely echoes EpubCheck’s output: diagnostic information (warnings and errors) is sent to standard error, but the output is structured consistently, so we can work with it.  In this post I show a simple incorporation in PHP to carry out checks as part of a nascent system to convert from Word documents (saved as filtered HTML) to ePub.  As I develop the system, particularly in the early stages, I am using EpubCheck’s output to identify outstanding issues with the .epub file that the conversion currently produces; the Web reporting adds a level of convenience.

Specifying and executing a request to the web service

To use the web service a file must be uploaded and the response processed.  For this I use curl and set up the request as follows (procedural coding style).
  $cfile = curl_file_create($epub_file, 'application/epub+zip', 'file');
  $postfield = array('file' => $cfile);
  $ch = curl_init($url);
The first line uses CURLFile (via curl_file_create), which is the recommended approach from PHP 5.5.0 onwards (the older @ prefix is deprecated); $epub_file is the location of the ePub file.  Line 2 defines the name of the form field that carries the file in the HTTP POST.  Line 3 sets up a handle for the web service at $url, which in my case is http://localhost:8080/epubcheck/Check.

A number of options are available to configure the behaviour of the request and the transfer, i.e. what data gets returned.  These options are described in the PHP documentation for curl_setopt.
// do regular POST
  curl_setopt($ch, CURLOPT_POST, 1);
// fields to POST (defined above)
  curl_setopt($ch, CURLOPT_POSTFIELDS, $postfield);
//  omit headers in the response
  curl_setopt($ch, CURLOPT_HEADER, 0); 
// track handle’s request string
  curl_setopt($ch, CURLINFO_HEADER_OUT, 1);  
// define standard headers for form upload
  curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: multipart/form-data'));
// return the transfer as a string
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  
// Timeout in seconds
  curl_setopt($ch, CURLOPT_TIMEOUT, 60); 
Then the actual request is executed using:
  $response = curl_exec($ch);
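
The steps above can be gathered into one function that also checks whether the request succeeded, since curl_exec returns false on failure.  This is only a sketch: check_epub is a hypothetical helper name, the optional header-related settings are left out for brevity, and treating any HTTP status of 400 or above as a failure is my assumption about how the web app behaves rather than something it documents.

```php
<?php
// Sketch only: check_epub() is a hypothetical helper, not part of EpubCheck
// or of the web app built by the Web Archive tool.
function check_epub(string $url, string $epub_file): string
{
    // Fail early if the ePub file cannot be read.
    if (!is_readable($epub_file)) {
        throw new RuntimeException("Cannot read ePub file: $epub_file");
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, array(
        'file' => curl_file_create($epub_file, 'application/epub+zip', 'file'),
    ));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body as a string
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);       // timeout in seconds

    $response = curl_exec($ch);                  // false if the transfer failed
    $error    = curl_error($ch);
    $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($response === false) {
        throw new RuntimeException("Request failed: $error");
    }
    // Assumption: a status of 400 or above means the check did not run.
    if ($status >= 400) {
        throw new RuntimeException("Unexpected HTTP status: $status");
    }
    return $response;
}
```

A call would then look like $response = check_epub('http://localhost:8080/epubcheck/Check', $epub_file); wrapped in try/catch if a failed request should not halt the page.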

Configuring for display in PHP

At this stage in system development I’m seeking to know what kinds of errors are being generated more than individual instances. The data output is structured as follows:

 <status type>: <file path>(<line>,<character>): <message>

We can apply a regular expression repeatedly across the output to split each row of data into five chunks:
  $pattern="/([a-zA-Z]+): ($pub_folder_name\/[\/a-z_\-\s0-9\.]+)\(([0-9]+),([0-9]+)\): (.+)".PHP_EOL."/";
(For $pub_folder_name I’m actually using ‘OEBPS’, the conventional folder name in ePub files, short for Open eBook Publication Structure.)

This generates a nested array from which we can easily pick out the messages.
  • the number of issues is given by count($response[0])  
  • a list of issue types is given by array_unique($response[5])
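
As a rough illustration of the matching and the two summaries, here is a self-contained sketch.  It assumes preg_match_all is used for the matching and stores the result in a separate $matches array; the three input lines are made up to follow the format above and are not real EpubCheck output.

```php
<?php
// Illustrative input only: these lines follow the format above but are made up.
$pub_folder_name = 'OEBPS';
$response = implode(PHP_EOL, array(
    'ERROR: OEBPS/chapter_1.xhtml(12,34): example message A',
    'WARNING: OEBPS/chapter_2.xhtml(5,10): example message B',
    'ERROR: OEBPS/chapter_2.xhtml(40,2): example message A',
)) . PHP_EOL;

// Same pattern as above: status, file path, line, character, message.
$pattern = "/([a-zA-Z]+): ($pub_folder_name\/[\/a-z_\-\s0-9\.]+)\(([0-9]+),([0-9]+)\): (.+)" . PHP_EOL . "/";

// $matches[0] holds the full rows; $matches[1] to $matches[5] hold the five chunks.
preg_match_all($pattern, $response, $matches);

$issue_count = count($matches[0]);                       // number of issues
$issue_types = array_values(array_unique($matches[5]));  // distinct messages
$by_status   = array_count_values($matches[1]);          // tally per status type

echo $issue_count . ' issues, ' . count($issue_types) . ' distinct messages' . PHP_EOL;
```

The tally per status type is an extra that is not in the list above; it is one line with array_count_values and helps separate errors from warnings.  Note that the path part of the pattern only allows lowercase letters, so file names containing capitals would need the character class widened.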
Add a <div> tag with a bit of CSS to define its height plus a JavaScript show/hide for neatness and we can then include a summary and the details without the inconvenience of obligatory scrolling.
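
One way to produce such a collapsible report from PHP is sketched below.  The function name render_report, the element id and the inline styles are all hypothetical choices for illustration, not the markup actually used on my page; the messages are passed through htmlspecialchars because they can contain markup fragments.

```php
<?php
// Sketch only: render_report() and the element id 'report' are hypothetical names.
function render_report(array $messages): string
{
    $types = array_unique($messages);

    // Summary line with a link that toggles the details block.
    $html  = '<p>' . count($messages) . ' issues, ' . count($types) . ' distinct types. ';
    $html .= '<a href="#" onclick="var d=document.getElementById(\'report\');'
           . 'd.style.display=(d.style.display===\'none\')?\'block\':\'none\';return false;">'
           . 'show report details...</a></p>' . "\n";

    // Details block: hidden at first, height-limited so it scrolls on its own.
    $html .= '<div id="report" style="display:none; max-height:20em; overflow:auto;"><ol>' . "\n";
    foreach ($messages as $message) {
        $html .= '<li>' . htmlspecialchars($message) . '</li>' . "\n";
    }
    return $html . '</ol></div>' . "\n";
}
```

Given the array of messages from the matching step, echo render_report($messages); would output the summary with the details hidden until the link is clicked.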

Clicking on 'show report details...' expands to produce the detailed report with errors listed in numerical order.

At the moment, there are many errors to fix, but by identifying each case we can deal with them systematically to make them disappear...!
