Automated EBook Publishing for Kindle from Latex source with CFEngine

Continuous delivery in content publishing

Automation is a powerful thing, in the right hands. Suddenly you can document and scale up processes that are both arduous and complicated. Moving from manual work to automation does require an investment though. One of the hard things about building content is that none of the tools do exactly what you want, so you are constantly working around them.

I recently wrote a book which I formatted for Kindle. As someone who uses Latex for its superior print quality, I was disappointed (but not surprised) to learn that the tools for publishing on Kindle do not include Latex support. However, a little searching online pointed to some ways forward via HTML. The trouble with that was that the tools for making HTML from Latex are also in a sorry state of disrepair. The closest thing to an HTML converter for Latex is tex4ht.

Using trial, tribulation, and a great deal of error, I groped my way to a result that was acceptable, that could handle images and Greek display equations. Tex4ht cannot use pdflatex, so all figures had to be rendered in old latex EPS format. Thank goodness I used xfig for sourcing. No matter, I was up for the challenge. A greater issue was that none of the tools did exactly what was needed. They did almost what was needed. A lot of hacking was needed.

Like most people, while groping forwards I used the closest tools to hand, and went for Unix shell commands. I could get part of the way using the following `cache' of scripted commands, plus a lot of manual editing in the middle. What a mess, and quite incomprehensible! More importantly, you can only run this once. If you run it again, you lose all your manual changes, so it is very fragile.

#!/bin/sh
cd ..
for f in *.tex
do
 echo Converting $f
 cat $f | sed s/subsubsection\\*{}/section\\*{XXX}/g > Ebook/$f
done
cp InSearchOfCertainty.bbl InSearchOfCertainty.idx InSearchOfCertainty.ind Ebook
cd Ebook
mv InSearchOfCertainty.html InSearchOfCertainty.html.old
yes q | htlatex InSearchOfCertainty.tex
tex '\def\filename{{InSearchOfCertainty}{idx}{4dx}{ind}} \input idxmake.4ht'
makeindex -o InSearchOfCertainty.ind InSearchOfCertainty.4dx
gcc -o FixIndex FixIndex.c
./FixIndex < InSearchOfCertainty.ind > xxx.tmp
cp xxx.tmp InSearchOfCertainty.ind 
# Change graphics to graphics-600 for hi res
yes q | mk4ht htlatex InSearchOfCertainty.tex 'xhtml,charset=utf-8,pmathml,frames-,graphics-300,index=2,2' ' -cunihtf -utf8 -cvalidate'
echo "p.indent{ margin-top: 1px; margin-bottom: 1px; }" > tmpfile.css
cat InSearchOfCertainty.css | sed s/monospace/serif/g >> tmpfile.css
mv tmpfile.css InSearchOfCertainty.css
echo  "div.figure p { font-size: 90%; }" >>  InSearchOfCertainty.css
cat InSearchOfCertainty.html | sed s/XXX/\<p\>\<\\/p\>/g >> tmp.html
cat tmp.html | sed s/InSearchOfCertainty\</\</g > tmp2.html
cat tmp2.html | sed s/x1x/\<a\ href=\"\#/g > tmp3.html
cat tmp3.html | sed s/x2x/\"\>/g > tmp4.html
cat tmp4.html | sed s/x3x/\<\\/a\>/g > tmp5.html
cat tmp5.html | sed s/width=[^/]*/width=\"70%\"/g > tmp6.html
cat tmp6.html | sed s/Fig\\./\<p\>Fig./g > tmp7.html
mv tmp7.html InSearchOfCertainty.html
rm tmp.html tmp2.html tmp3.html tmp4.html tmp5.html tmp6.html InSearchOfCertainty.dvi

# Now roll up your sleeves for manual editing...
# hr width 0px
# Add spacing around chapters, titles
# Adjust display text and figure captions
# Add Kindle <mbp:pagebreak/> commands
# Then generate kindle file
#
# ./kindlegen -gif -c2  InSearchOfCertainty.html

Even with this script, an hour of manual editing was required afterwards to fix many of the formatting details:

The index formatting is broken in tex4ht.
Hyperlinks have to be parsed and generated.
Arbitrary `academic' styling and spacing needs to be fixed to look professional.
Tex4ht doesn't know about mysterious secret metadata magic for Kindle.

It simply wasn't plausible to automate all the details using shell. That meant that any edit to the source would be followed by an arduous process of rebuilding the final file for Kindle.

Even as I was typing all this, I was thinking how much easier some of the text editing would be using CFEngine, not to mention how much easier it would be to document what was going on. Actually, I already want to do this all over again for the next book! So I set about making a self-healing "makefile" for CFEngine to build the whole thing automatically from Latex/EPS sources. This took a while, as I had to remember what all the strange sed transformations were for. The result ended up being classic CFEngine hands-free, self-repairing automation: a chain of build operations that ends with a desired state, namely a working, deployable package for Kindle.

CFEngine can do the whole thing...and make it comprehensible

Starting again, after deciphering the hieroglyphics above, I decided to invest a little effort into structuring the sequence of processing in a comprehensible way. Part of CFEngine's goal is to be a documentation language---to practise sound knowledge management. Separating out the meta-data is always a good start.

bundle common my
{
vars:

  "title"           string => "In Search Of Certainty: the science of our information infrastructure";
  "author"          string => "Mark Burgess";
  "category"        string => "Popular science, computer science";

  "filename"        string => "InSearchOfCertainty",
                   comment => "The main latex entry point of the book";

  "coverfile"       string => "Cover/$(filename).jpg",
                   comment => "This is a pre-formatted image for use as the cover";

  "source_dir"      string => "/home/mark/books/Certainty",
                   comment => "The directory from which we pull in the current version";

  "container"       string => "/tmp/build_dir",
                   comment => "The container in which we assemble sources for build";

  "project"         string => "$(my.container)/$(filename)";
}

Main build-pipeline

After separating out the basic meta-data, everything else is pretty much reusable code. I split the main assembly line into three stages, each represented by a bundle of promises: setup, compilation and post-processing.

bundle agent do
{
methods:

   "Preparation"         usebundle => AssembleContainer,
                           comment => "Pull in latest writing updates into an isolated environment";

   "Autogen XHTML"       usebundle => Latex_2_XHTML,
                           comment => "Tex4ht source build processing is quite complex";

   "Post process and QA" usebundle => PostfixAutoGeneratedObjects,
                           comment => "Fix the spacing in auto-generated files";
}

The assembly methods' promise bundles

In CFEngine, the documentation is the code itself. It is a sequence of promises to bring about a desired outcome to a resource (or a pattern of resources to match). Every statement in CFEngine has the same structure:

 resource_type:

    "affected resource(s)"
        specified_property => detailed_outcome ... ;

The methods are fairly self-documenting with a little context. We start by `assembling a container' for the package, by copying all the source files into a new directory in which to build the package, so we can keep different versions apart, and not risk messing up the original sources. To perform the entire transformation from Latex files *.tex to Kindle *.mobi a lot of magical dancing is required. There is no simple way of doing it, but here is the magic documented as plainly as possible, as CFEngine promises:

bundle agent AssembleContainer
{
files:

  "$(my.container)"
       copy_from => development("$(my.source_dir)"),
    depth_search => including_subdirs,
     file_select => just_build_files,
         comment => "Get latest changes from development (CD)",
         classes => if_ok("pre_process_files");

  "$(my.container)/FixIndex.c"
          create => "true",
       edit_line => SourceFixIndex,
         comment => "Keep the C code transformer in this script for adaptability";

 pre_process_files::

  "$(my.container)/.*.tex"
       edit_line => add_post_processing_tags,
         comment => "Add tags for xhtml commands into every latex source file as tokens
                     because placing them in the autogenerated XHTML later is too hard";
classes:

  "compile_patch_tool" 
      expression => makerule("$(my.container)/FixIndex","$(my.container)/FixIndex.c");

commands:

   pre_process_files.compile_patch_tool::

    "/usr/bin/gcc -o FixIndex FixIndex.c" 
       contain => build_in_container,
       comment => "Make sure the custom patch tool is built";
}

After this, we are in possession of a set of source files for building the book, which have been processed by adding some magic tags for later, and a tool for fixing the automatically generated index. The tags are added because Latex and XHTML (used by Kindle) do not share the same character set, so any (X)HTML we add to the sources would be destroyed during conversion. By tagging certain Latex constructions with plain text, we can search and replace with UTF-8 XHTML later. This kind of editing is very easy in CFEngine.

 bundle edit_line add_post_processing_tags
 {
 replace_patterns:

    # Use ^ to make patterns convergent

    "^\\\chapter{"        replace_with => value("MAGICTAG1 \chapter{");
    "^\\\part{"           replace_with => value("MAGICTAG1 \part{");

    "^\\\subsubsection*"  replace_with => value("\section*{MAGICTAG2}"),
                             comment => "Tex4ht doesn't treat subsubsections properly";
 }
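
The remark about ^ is worth a small illustration. The following is only a sketch in shell, with sed standing in for CFEngine's replace_patterns and a made-up sample line, of why anchoring the pattern makes the edit safe to repeat: once the tag is in place, the line no longer begins with \chapter{, so a second pass finds nothing to change.

```shell
#!/bin/sh
# Illustration only: sed stands in for CFEngine's replace_patterns,
# and the sample line is hypothetical.
line='\chapter{Introduction}'

# Unanchored: every run finds \chapter{ again and adds another tag.
tag_unanchored() { sed 's/\\chapter{/MAGICTAG1 \\chapter{/'; }

# Anchored: after tagging, the line no longer starts with \chapter{,
# so a second run matches nothing and leaves the line alone.
tag_anchored() { sed 's/^\\chapter{/MAGICTAG1 \\chapter{/'; }

twice_unanchored=$(printf '%s\n' "$line" | tag_unanchored | tag_unanchored)
twice_anchored=$(printf '%s\n' "$line" | tag_anchored | tag_anchored)

printf 'unanchored, run twice: %s\n' "$twice_unanchored"
printf 'anchored, run twice:   %s\n' "$twice_anchored"
```

The unanchored version piles up one tag per run, which is exactly the "you can only run this once" problem of the original script.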

The second stage is compilation, which is unaffected by the magic preprocessing and runs entirely inside the special container. This is CFEngine's answer to a Makefile. The result is a set of HTML and CSS output files, including an index with working hyperlinks that mimics the print book index.

bundle agent Latex_2_XHTML
{
classes:

  "verify_latex" 
          expression => makerule("$(my.project).dvi","$(my.project).tex");

  "assemble_final_xhtml" 
          expression => makerule("$(my.project).html","$(my.project).tex");

commands:

 verify_latex::

  "/usr/bin/yes q | /usr/bin/htlatex $(my.filename).tex"
     contain => build_in_container,
     comment => "1. Convert input masterfile $(my.filename).tex into DVI with embedded references, with yes to all questions";

  "/usr/bin/tex '\def\filename{{$(my.filename)}{idx}{4dx}{ind}} \input idxmake.4ht'"
     contain => build_in_container,
     comment => "2. Convert the index format via DVI for latex4ht - magic from the internet!?";

  "/usr/bin/makeindex -o $(my.filename).ind $(my.filename).4dx"
     contain => build_in_container,
     comment => "3. Build the tex4ht index format for final conversion, but this is broken so next step is crucial";

  "/usr/bin/mv $(my.filename).ind fixme; $(my.container)/FixIndex < fixme > $(my.filename).ind"
      handle => "FixIndex",
     contain => build_in_container,
     comment => "4. hack to extract hrefs from ind/4dx conversion, then convert back to ind";

 assemble_final_xhtml::

  "/usr/bin/yes q | /usr/bin/mk4ht htlatex $(my.filename).tex 'xhtml,charset=utf-8,pmathml,frames-,graphics-300,index=2,2' ' -cunihtf -utf8 -cvalidate'"
     contain => build_in_container,
     comment => "5. Build html from dvi and index objects, accepting all questions, graphics-300 
                  for low res on kindle renderer";
}
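
For readers who have not met makerule, the idea behind the two class promises above can be sketched in shell. This is an approximation under my own assumptions, not CFEngine's implementation: needs_rebuild is a hypothetical helper, the file names are placeholders, and touch stands in for the real build command. A build step should run only when its target is missing or older than its source.

```shell
#!/bin/sh
# Hypothetical helper sketching the make-style test that makerule() is
# used for here: succeed (exit 0) when the target must be (re)built.
needs_rebuild() {
    target=$1
    src=$2
    # Rebuild if the target is missing, or the source is newer than it.
    [ ! -e "$target" ] || [ "$src" -nt "$target" ]
}

workdir=$(mktemp -d)
touch "$workdir/book.tex"

if needs_rebuild "$workdir/book.html" "$workdir/book.tex"; then
    first=rebuild
    touch "$workdir/book.html"   # stand-in for the real build command
else
    first=skip
fi

# Nothing has changed since the build, so the second check should skip.
if needs_rebuild "$workdir/book.html" "$workdir/book.tex"; then
    second=rebuild
else
    second=skip
fi

echo "first run: $first, second run: $second"
rm -rf "$workdir"
```

Running the check twice in a row is the point: the first pass builds, the second finds the target up to date and does nothing.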

Finally, post-processing involves editing the generated HTML and CSS files to replace the magic tags with HTML formatting, followed by a call to Amazon's kindlegen command on the XHTML.

bundle agent PostfixAutoGeneratedObjects
{
classes:
   "final_packaging"   expression => makerule("$(my.filename).mobi","$(my.filename).html");

files:
   "$(my.project).css"  edit_line => patch_css,
                          comment => "Edit the style settings suitable for Kindle";

   "$(my.project).html" edit_line => patch_html,
                          comment => "Clean up the workaround magic used to trick autobuild",
                          classes => if_ok("files_patched");
commands:

 final_packaging&files_patched::

   "$(my.container)/kindlegen -gif -c2 $(my.filename).html"
      contain => build_in_container,
      comment => "kindlegen binary supplied by Amazon.com, takes several minutes,
                  gif option recommended for proper figure rendering on old kindles";
}

The editing methods (which can be extended for even more tweaking of the style) could not be done directly with shell tools like sed:

  bundle edit_line patch_html
  {
  insert_lines:

      "<title>$(my.title)</title>"                             select_region => HTML_header;
      "<meta name=\"Author\" content=\"$(my.author)\" />"        select_region => HTML_header;
      "<meta name=\"Description\" content=\"$(my.category)\" />" select_region => HTML_header;
      "<meta name=\"cover\" content=\"$(my.coverfile)\" />"      select_region => HTML_header;

  replace_patterns:

     "MAGICTAG1"  replace_with => value("<mbp:pagebreak/> "),
                         comment => "Add kindle specific pagebreaks around chapters";
     "MAGICTAG2"  replace_with => value("<p></p>"),
                         comment => "Add extra spacing formatting around chapters";

      # Cleanup buggy tex4ht tagging

     "$(my.filename)<"    replace_with => value("<"),
                         comment => "remove text tex4ht embeds wrongly (bug workaround)";

     "x1x"                replace_with => value("<a href=\"#"),
                          comment => "Inverse transform of FixIndex workaround";

     "x2x"                replace_with => value("\">"),
                          comment => "Inverse transform of FixIndex workaround";

     "x3x"                replace_with => value("</a>"),
                          comment => "Inverse transform of FixIndex workaround";

     # Figures and captions, regex's must converge

     "width=\"[0-9]+\"" replace_with => value("width=\"70%\""),
                             comment => "Scale figures for screen width";

     ">Fig\."      replace_with => value("><p/> Fig."),
                        comment => "Add additional spacing under figures";

     "<hr[^>]*>"   replace_with => value(""),
                        comment => "Remove ad hoc <hr> around figures";

     "^<span"     replace_with => value("<p/><span"),
                  select_region => bibliography,
                        comment => "Put line breaks between bibitems, i.e. before each [REF]. (tex4ht workaround)";
  }

 #########################################################################

 bundle edit_line patch_css
 {
 insert_lines:

  "p.indent{ margin-top: 1px; margin-bottom: 1px; }"
     location => start,
      comment => "The spacing around highlighted sections is cramped";

  "div.figure p { font-size: 90%; }"
     comment => "Reduce font size in figure captions so we can distinguish";

 replace_patterns:

   "monospace" replace_with => value("serif"),
                    comment => "Strange use of monospace chars in conversion";
 }
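
To see what the tag replacement amounts to, here is a small shell sketch, again with sed standing in for replace_patterns. The input line is a made-up fragment, not real tex4ht output; the replacement strings are the ones used in patch_html above.

```shell
#!/bin/sh
# Hypothetical fragment of generated HTML containing the plain-text tokens.
html='<h2>MAGICTAG1 Chapter 1</h2> see {x1xsec-1x2x12x3x}'

# Replace each token with the markup it stands for.
patched=$(printf '%s\n' "$html" | sed \
    -e 's|MAGICTAG1|<mbp:pagebreak/> |g' \
    -e 's|x1x|<a href="#|g' \
    -e 's|x2x|">|g' \
    -e 's|x3x|</a>|g')

printf '%s\n' "$patched"
```

Because the replacement text contains none of the tokens, running the same edits a second time changes nothing, which is the convergence property the CFEngine version relies on.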

Automating a business process

Automating the transformation of content from one format to another is a pretty common task in information systems. Not only can CFEngine document the meaning of all of the bizarre black magic contortions one has to go through, it can do them more reliably and at scale. How hard would it be to turn this into a service for building kindle books from latex sources? A very useful function for the academic world. Clearly using CFEngine as part of a business process needs to be explored further -- so I'll return to that in a later post.

RAW CFENGINE SOURCE FILE

Appendices to squirrel away

We don't want all the details of the continuous delivery process to be in our faces. Some things should be kept under the hood. So, some final bureaucracy. In CFEngine you can configure all the details. Of course, some of the details are not strictly necessary. We could simplify a bit, but when you are betting your business on an automation investment, you want things to be as optimal as possible.

body common control
{
bundlesequence => { "do" };
inputs         => { "cfengine_stdlib.cf" };
}

#

body agent control
{
ifelapsed => "0";     # No need to protect run frequency
editfilesize => "2m"; # XHTML is large and verbose
}

#

body copy_from development(from)
{
source => "$(from)";
}

#

body file_select just_build_files
{
leaf_name => { ".*tex", ".*bbl", ".*idx", ".*ind", ".*toc", "kindlegen", ".*eps", ".*png", ".*cls", ".*sty"  };
file_result => "leaf_name"; 
}

#

body depth_search including_subdirs
{
depth => "inf";
exclude_dirs => { "Ebook", "EStore", "QIC", "Articles" };
}

#

body contain build_in_container
{
no_output => "true";
useshell => "useshell";
umask => "077";
exec_timeout => "500";
chdir => "$(my.container)";
}

body select_region HTML_header
{
select_start => "<head>.*";
select_end => "</head>.*";
}

body select_region bibliography
{
select_start => ".*References.*";
select_end => ".*Index.*";
}

Yes, we can even keep the source code for the C transformation tool inside the CFEngine code, so the two do not have to be kept in separate files:

bundle edit_line SourceFixIndex
{
vars:

 "bs" string => "\\\\", # escaped '\' !!
     comment => "PCRE + C requires a quadruple escape of backslash!";

insert_lines:

"
#include <stdio.h>
#include <ctype.h>
#define true 1

int main()
{ char buffer[2048];

 while (true)
    {
    int ref = 0;
    char link[1024];
    int ch;   /* int, so the value returned by getchar() is not truncated */
    int match;

    if (feof(stdin))
       {
       break;
       }

    while (ch = getchar())
       {
       if (feof(stdin))
          {
          break;
          }

       if (isspace(ch))
          {
          putchar(ch);
          }
       else
          {
          ungetc(ch, stdin);
          break;
          }
       }

    link[0] = '\0';
    buffer[0] = '\0';
    ref = 0;
   
    if (match = scanf(\"$(bs)LNK{$(my.filename).html}{%[^}]}{}{%d}\", link, &ref))
       {
       if (ref)
          {
          printf(\"{x1x%sx2x%dx3x}\",link,ref);

          // x1x -> <a href=\"
          // x2x -> \">
          // x3x -> </a>
          }
       }
    else
       {
       if (ch == '$(bs)')
          {
          putchar('$(bs)');
          }
       scanf(\"%s\", buffer);
       printf(\"%s\",buffer);
       }
    }
}
"
comment => "Fix the .ind generated files to generate sane URLs";
}
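
For orientation, the core rewrite that FixIndex performs on each index link can be approximated with a single sed expression. This is a rough sketch inferred from the scanf and printf format strings above, not a replacement for the C tool (it ignores the whitespace handling and the ref == 0 case), and the sample index line is hypothetical.

```shell
#!/bin/sh
# Hypothetical line from the generated .ind file.
ind_line='\item promises, \LNK{InSearchOfCertainty.html}{ref-7}{}{45}'

# \LNK{file.html}{anchor}{}{page}  ->  {x1x anchor x2x page x3x}
# The x1x/x2x/x3x tokens survive the Latex run as plain text and are
# turned into <a href="#anchor">page</a> during post-processing.
fixed=$(printf '%s\n' "$ind_line" | sed \
    's/\\LNK{InSearchOfCertainty\.html}{\([^}]*\)}{}{\([0-9][0-9]*\)}/{x1x\1x2x\2x3x}/g')

printf '%s\n' "$fixed"
```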
