Showing posts with label technology. Show all posts

Wednesday, October 12, 2016

Publication of Thursday’s Lotus as an e-book: finding a suitable approach

Thursday’s Lotus is now available as an e-book through the Kindle Store and Kobo, with others to follow. I’m glad I made the effort and hope this post will encourage other newcomers to explore some of the options for contributing to the world of electronic publishing.

As someone whose career has been largely in technology, I found the preparation and publication of a hardcopy book a fascinating process, resulting in the very satisfying experience of holding a physical copy in my hands. But, of course, nowadays it’s expected that books be made available in digital formats for reading on handheld devices, i.e. as e-books. As I had other work commitments, I concentrated initially on the paperback, making it as polished as I could; I felt pursuing an electronic version in parallel would have detracted from that, so I put the thought to one side and merely indicated that I expected an e-book version to emerge in 6-12 months. Then, after a little break, I started focusing on the e-book version in August and published it in September.

Thursday's Lotus paperback alongside Kindle viewer on Android tablet and a Kindle e-reader

For those who have already self-published a paperback, I’ll share a few general observations, but my thoughts are mainly for those who are interested in do-it-yourself (DIY) and have some experience of creating Web pages. It actually took me quite a bit longer than I had anticipated, largely because I chose to pursue the DIY route and along the way wrote some software to do some ‘heavy lifting’, but I found this rewarding and it means I can now help other authors produce both hard copy and electronic copy.

Just as Amazon’s CreateSpace offers a dedicated service for publishing paperbacks, Kindle Direct Publishing (KDP) is the analogue for e-books, and unsurprisingly CreateSpace points to KDP, easing the transition by offering to share the front cover and metadata. So, yes, it seemed a natural next step. However, one of the first tasks was to understand what exactly I was meant to be preparing, for it wasn’t immediately obvious. After some searching, I learnt about an open technical standard called EPUB, the work of the International Digital Publishing Forum, a trade and standards association for the digital publishing industry. It consists of a markup specification, where document content is specified as (X)HTML, so it’s like preparing a special class of Web pages. Then there’s the specification of how everything should be packaged (table of contents, navigation etc.), which is expressed in a series of XML files. These files get bundled in a ZIP and renamed with a .epub extension, and voilà! You have an EPUB instance, or ePub file. Just as there is a range of authoring tools for Web development in general, so there are various options for producing EPUB.
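
To make that concrete, here is a rough sketch of what a minimal EPUB package looks like once unzipped (the file names under OEBPS are illustrative; the one firm rule I’d highlight is that the mimetype file must be the first entry in the ZIP and stored uncompressed):

  mimetype                  (a single line: application/epub+zip)
  META-INF/container.xml    (points to the package document below)
  OEBPS/content.opf         (package document: metadata, manifest, reading order)
  OEBPS/toc.ncx             (the logical table of contents used by readers)
  OEBPS/chapter01.xhtml     (the content itself, as XHTML)
  OEBPS/styles.css
  OEBPS/images/cover.jpg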

All well and good, except that the list of members doesn’t appear to include Amazon. Indeed Amazon has its own proprietary formats, Mobipocket (abbreviated .mobi) and Kindle Format 8 (.kf8). However, Amazon is pragmatic, and its KDP service supports uploads in various formats, including MS Word, HTML, ePub, MOBI and so on. Nevertheless, expectations need to be tempered: for all but a simple text document, these options will produce results that vary a great deal in quality unless the source document is revised. In almost all cases the original manuscript, typically authored in Word, will need to be carefully edited to ensure tidy and consistent formatting.

Plenty of pointers are available; Amazon provides a Simplified Formatting Guide and in the beginning I found it instructive to experiment with a transitional version, applying one of the freely available Word templates designed for Kindles. If I had been in a hurry, then I might have followed this path, but for a quite complex book layout, I was not confident that I could have sufficient control over the final output.

Irrespective of the method eventually chosen, I recommend becoming familiar with how KDP processes and prepares the file it would use as a publication candidate (a .mobi that actually includes the .kf8 version as well). Within the KDP dashboard, you can preview the result online using the Web-based previewer and download the file to inspect on your device (just use Send to Kindle). When I tried this, downloading the first drafts for viewing on my Kindle e-reader (7th generation), all the words were still there and I was impressed to see that the endnotes (which I had arranged on a per-chapter basis) had been converted to footnotes with working hyperlinks between text and notes.

Beyond this, things were not so good; there were numerous issues:

  • table of contents, list of figures etc had ragged layouts
  • in fact, generally the layout was all over the place, with chapter headings not properly aligned
  • the images had become small and were variously centred or left-aligned
  • varied font sizes
  • large gaps in text
  • no logical table of contents (i.e. the one available via an e-reader menu item)
  • the index, which had originally been designed for a fixed layout, was reproduced in its static form, and its references to hardcopy page numbers made little sense

In pondering these issues I learnt quite a lot about going from fixed layouts to reflowable layouts. I pondered the last one in particular quite a lot and eventually realised that there might be a path to a meaningful solution. I had originally created an index the traditional way and been faced with the traditional problem afterwards: on completion of a draft I had worked my way through the book to laboriously compile the index, but subsequently needed to revise it quite considerably. I then discovered the method of indexing in Word using bookmarks as the targets of index entries and the (not particularly robust) Dexter add-in for managing these entries and generating the index automatically from the bookmarks. Thanks to Robert Papini, the index was re-made this way, so that it could keep evolving without great effort. Furthermore, for the e-book I used another tool from the makers of Dexter, IndexLinker, which turned the index into a set of hyperlinks to the respective bookmarks. Saving this as (filtered) HTML preserves the links, and this has carried over successfully to the e-books.
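
For illustration (with made-up bookmark names rather than IndexLinker’s actual output), an index entry saved as HTML ends up looking something like this, each locator being a hyperlink to a bookmark anchor in the text:

  <p class="IndexEntry">meditation, <a href="#bm_meditation1">link</a>, <a href="#bm_meditation2">link</a></p>

As long as the conversion keeps those bookmark anchors intact, the entries remain clickable on the device, which makes the index genuinely useful in a reflowable layout.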

Much as I could solve some particular problems, tweaking the Word document and uploading would not be completely satisfactory, so I focused attention on the EPUB format, reassured by various messages in fora that Amazon has invested considerable effort in converting to .mobi/.kf8. At this stage, I had to decide how far I wanted to proceed (really, how fussy I was going to be about the final rendering). For many, particularly those who don’t want to immerse themselves in the technicalities of HTML and XML, there are authoring environments that facilitate the process and can produce good results. Probably the most popular of these is Calibre, which is a whole environment for the production and management of e-books.

I gave it a quick go and it could correctly display chapter headings, with the lotus image properly aligned, but inevitably it reflected the quirks in the original, and some issues would have to be corrected in the built-in editor. At this point you are exposed to editing HTML and CSS, and whilst the import of a Word document does a lot of tidying (and smartly splits a large file into smaller files based on sections), it retains much of Word’s original markup and superfluous spacing; in the translation it also adds a lot of its own CSS without using semantic labels.
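
To illustrate what I mean by non-semantic markup (the class names here are merely representative of machine-generated output, not anyone’s exact product), a chapter heading can come through as a styled paragraph:

  <p class="calibre12"><span class="calibre13">Chapter One</span></p>

whereas what I was after is markup that says what the element is, with the presentation kept in the stylesheet:

  <h2 class="chapter-title">Chapter One</h2>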

As I wanted to learn more about EPUB and edit/process HTML and CSS, I opted to leave Calibre for another day and use tools and techniques with which I’m familiar. I’ll describe some of these in my next post …

Saturday, April 02, 2011

Bye bye Thinkpad (R51) Hello Thinkpad (Edge 13)

[updated 10 April '11]

After almost 6 years, the LCD display on my IBM Thinkpad R51 has become faulty with thin coloured vertical lines, the hard disk is getting full and generally it's not as sprightly as it once was.

So I decided to replace it, and as I've been generally impressed with the Thinkpad range, I've bought another one, though this time I'm much more on a budget. After reading a number of online reviews, I settled on a Thinkpad Edge 13, which balances reasonable screen size with portability, the latter particularly enhanced by processors that consume less power; I went with the AMD Turion II Neo K685 CPU as I felt I couldn't afford one with the Intel Core i3. I ordered directly through the online Lenovo store on St. Patrick's Day (for a 10% discount) and it was duly delivered by UPS within the advertised 1-2 week time slot. Some retailers were offering a cash-back deal that would make it cheaper, but it meant purchasing an external DVD writer, which would be surplus to my requirements. No, I didn't choose heatwave red or glossy options - the traditional boxy design is fine by me!

I opted for Windows 7 Professional, which allows me to run, under XP Mode, software that I was using previously that isn't supported by Windows 7. As this would run in a virtual PC, I figured I would probably need 4GB to be comfortable, but I initially ordered just the minimal 2GB from Lenovo as extra memory seemed to be charged at something of a premium. I then turned to Crucial Memory (UK), but hit a slight snag. The first batches of Thinkpad Edge PCs had been shipped with an earlier version of the AMD chipset and processor, so when I used their Memory Advisor Tool for the Edge 13, it only offered DDR2 RAM (and 4GB max). Checking the spec from the Lenovo site and elsewhere, I could establish that this machine comes with the AMD M780G chipset, which supports DDR3 RAM, viz: 204-pin PC3-10600 DDR3 SDRAM 1333MHz SODIMM. On that basis I could find a suitable match and bought a 4GB module, part number CT51264BC1339. It was easy to install and has been accepted by the OS, which now reports 6GB of installed memory, of which 5.75GB is usable (0.25GB gets swallowed up by the graphics card!)

First Impressions

I'm generally pleased with this new Thinkpad so far. It's still solid, has a streamlined yet familiar keyboard layout, a somewhat smaller form factor, lighter weight and much better battery life. This makes it more practical to take with me, though it's some way from the Eee PCs I'm used to. :-) Anyway, last week I duly popped it in my work bag and kept it on during a two and a half hour meeting, for which I took the minutes, with reduced brightness and hardly any wifi. Afterwards 65% charge remained, which I thought was quite impressive. I do miss, however, having some indicator lights, particularly for hard disk and wireless connections. Sometimes, to check that something is happening, I find myself listening for hard disk activity.

The OS looks okay and operates smoothly and I think the visual appearance and general look and feel of Windows 7 is better than XP. Here's a screenshot: in the background is a standard desktop theme of rotating scenes and in the foreground I'm running a virtual machine (see below).

Windows 7 desktop with VirtualBox running Debian Squeeze featuring WMI2

As someone who does a lot of reading on screen, I find the widescreen display (WXGA or 1366*768) offers significant benefits over the R51's XGA (1024*768), particularly useful for comparing documents side by side. It reminds me of when the HP320LX offered 640*240 instead of the previous norm of 480*240 for handheld PCs - the limited height mattered surprisingly little.

In terms of software, one of the first sighs of relief was being able to use MozBackup to migrate my Thunderbird email accounts and preferences.

Curious about the performance, I checked the Windows Experience Index, which rates components from 1.0 (lowest) to 7.9 (highest); it indicates middling performance:

  • Processor, calculations per second: 4.8
  • Memory (RAM), memory operations per second: 6.5
  • Graphics, desktop performance for Windows Aero: 3.3
  • Gaming graphics, 3D business and gaming graphics performance: 5.1
  • Primary hard disk, disk data transfer rate: 5.9

Curiously, immediately after I upgraded to service pack 1, I was prompted to update the index and the memory operations per second increased to 7.0. However, after a further update it reverted to 6.5!

Virtualization

I have tended to run Windows as my day-to-day OS, but I also use Linux for system administration and Web development. This has usually led me to create a dual boot setup, but this time I am experimenting with running virtual machines. In fact I'm already running two systems and teetered on running a third!

My first VM was installed through necessity - a few years ago I bought a combined package of an HP Deskjet 950C printer and an HP Scanjet 5370C scanner, but it turns out that only the printer is supported in Windows 7. To solve this, I needed to run MS Virtual PC and install Windows XP as a guest OS (XP Mode). Even then, care is needed to ensure that when the scanner is plugged in as a USB device, it is duly attached under XP and not routed to Windows 7: when running XP Mode, there is a row of menu items along the top of the VM; click on the USB menu and attach the unidentified device corresponding to the scanner.

I installed the second VM by choice - I'm looking to use it for web application development on a Linux platform and am hoping that it will be robust as well as offer reasonable performance. First up is VirtualBox, with VMWare Player waiting in the wings, so how does it fare so far...? Well, installation of VirtualBox itself is nice and convenient. I like the way that it can grow to use resources as required, and I can readily tweak the allocations of RAM and video RAM. So, I've dived in and installed Debian Lenny off DVD, followed by an upgrade to Squeeze. I strongly recommend installing the Guest Additions since without them I was finding the mouse pointer control rather unpredictable.
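
For anyone who prefers the command line, VirtualBox ships with the VBoxManage tool, and a VM roughly like mine could be sketched out as follows (I actually used the GUI; the names and sizes here are illustrative, and the exact flags vary between VirtualBox versions):

  VBoxManage createvm --name "Debian" --ostype Debian --register
  VBoxManage modifyvm "Debian" --memory 1024 --vram 32
  VBoxManage createhd --filename Debian.vdi --size 10240
  VBoxManage storagectl "Debian" --name "IDE" --add ide
  VBoxManage storageattach "Debian" --storagectl "IDE" --port 0 --device 0 --type hdd --medium Debian.vdi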

The first real test was to install Web Mathematics Interactive 2 (WMI2), a computer algebra system with a very user-friendly calculator-style interface. On the Web site, PHP scripts take the input and issue Ajax calls that are communicated to Maxima, which handles all the calculations and returns output for display. The screenshot above shows it in action - rather than having to remember some markup language like TeX, you can build up formulae by hitting the buttons. The system checks your input as you go along. I downloaded the package from SourceForge and followed the instructions (just one thing to note: timeout is already included in the coreutils package).
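
As a quick sanity check that Maxima itself is working inside the VM, you can feed it an expression straight from the shell (a simple illustration, not how WMI2 actually calls it), e.g.:

  $ echo 'integrate(sin(x), x);' | maxima --very-quiet

which should print something close to - cos(x).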

There has been a lot of software and data to transfer, but I am gradually emerging from this liminal state of migration... :-)

Friday, October 01, 2010

Installing MySource Matrix on a Win XP Netbook (for evaluation)

Among the crop of open source content management systems that are deployed at Oxford, I recently had my first encounter with MySource Matrix, developed by Squiz and released under the GPL. I decided to take a closer look by installing it on a netbook running Windows XP Home, just to have a poke around. The official requirements specify a UNIX-like operating system, but this is a Web application, so in principle I think it ought to work, just as I can be confident in installing WordPress or Drupal. This should be the case even though MySource Matrix is definitely a more substantial proposition, in that a default install seems to give you 'the kitchen sink'. The following is a screenshot after I had created my first page:

screenshot: MySource Matrix admin panel

As I couldn't find anyone else writing about it, I thought I'd share an outline of the process I went through in case others would like to evaluate it on this platform. I apologize in advance for not offering any support or follow-ups because I've since moved on/back to other systems for now; for queries I think the best place would be the support forums.

Requirements

I carried out the installation at the beginning of September 2010, mainly following the steps in the installation guide provided by Squiz. The first thing that I'd stress is that it doesn't support the latest version of PHP. I really should have read the requirements more carefully, as I did try 5.3.x and progressed only so far before I hit an issue reported on the forum. Afterwards I dropped back to version 5.2.14.

For the Web server I am running Apache 2.2 and this appears to be the recommended choice for working with the other components.

For the database back-end, MySQL is not supported, but there is a Windows distribution of the freely available PostgreSQL.

Installation

Apache (preliminary): A standard install should be fine as a basis to work with installing PHP and its libraries. For the Matrix config itself, see later.

PostgreSQL: For reference, I used Squiz's page on database installation. I installed Postgres 8.4.x using the Windows easy installer, accepting the defaults. Then for the installation of the Squiz database I used pgAdmin, specifically pgAdmin III (v. 1.10.3). From the command line I issued the following commands to create two users:

  $ createuser -SRDU postgres matrix
  $ createuser -SRDU postgres matrix_secondary

where -S: NOT a superuser, -R: NOT allowed to create roles, -D: NOT allowed to create databases, -U: connect as username.

Then I created the database:

  $ createdb -U postgres -O matrix -E SQL_ASCII mysource_matrix

where -O: owner, -E: encoding. Note that it's important that the right template database (template0) is used; it can be specified with createdb's -T option.

Comment: pgAdmin warns that storing data using the SQL_ASCII encoding means that the encoding is defined for 7-bit characters only. So it is dependent on the Web application to do the conversions (since content served to the Web is typically 8-bit UTF-8). I wonder if this is an indication of the longevity of the software...?

At some stage, you need to create the PLPGSQL language for Matrix. I did this in pgAdmin, once I had run the following from the command line:

  C:\PostgreSQL\8.4\bin>createlang.exe -U postgres -d mysource_matrix plpgsql

but for a while I had some difficulties and needed some guidance.

PHP: When installing PHP (I did this via the MSI installer), I did the custom configuration to ensure support for PDO. There are other bits, like SMTP support, that should also be included, but selecting everything is not a good idea because, for instance, it will then expect that an Oracle client is installed. The most important of these is the PEAR Package Manager, since MySource Matrix depends a lot on the additional PHP libraries that PEAR provides.

I actually did the PEAR installation separately from the main PHP installation, by running the gopear.php script from within a browser and subsequently installing packages through its Web front end. A lot of modules are needed and I'm not sure that my list is complete, but for what it's worth, here is a list of what I've installed so far from the channel pear.php.net: Archive_Tar, Auth_SASL, Cache_Lite, Config 1.10.11, Console_Getopt, HTML_Template_IT, HTTP, HTTP_Client, HTTP_Request, I18N, Image_Canvas, Image_Color, MIME_Type, Mail, Mail_Mime, Mail_Queue, Mail_mimeDecode, Math_BigInteger, Math_Stats, Net_SMTP, Net_Socket, Net_URL, Numbers_Roman, Numbers_Words, PEAR, PEAR_Frontend_Web, Structures_Graph, Text_Diff, Text_Highlighter, XML_HTMLSax, XML_Parser, XML_RPC, XML_Tree. Note that all these are marked 'stable' apart from Image_Canvas (alpha); Numbers_Words and PEAR_Frontend_Web (beta).
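
If you prefer the command line to the Web front end, the same packages can be pulled in with the pear tool, along these lines (a sketch: lowering preferred_state lets it fetch the alpha and beta packages as well as the stable ones):

  pear config-set preferred_state alpha
  pear install Archive_Tar Cache_Lite Mail_Mime Net_SMTP Numbers_Words Image_Canvas PEAR_Frontend_Web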

Basic Config Settings

With everything in place, I was ready to run the PHP install scripts. Here I emphasize that the PHP PDO and PDO_PGSQL modules must be installed for this to complete. For step 2, you need to ensure that there is the right access to Postgres by editing pg_hba.conf to have lines roughly like:

host    all         all         your_local_IP/24          trust

Then at the end of running the script it should report: all secondary and tertiary user permissions fixed.

For step 3 I found I needed to specify a locale, like:

  C:\www\home\websites\mysource_matrix>c:\PHP\php.exe install\compile_locale.php c:\www\home\websites\mysource_matrix --locale=en

Apache (config for Matrix): a virtual host needs to be created, which for my setup of Apache is in conf/extra/httpd-vhosts.conf. I created a named virtual host for Matrix with the lines:

  NameVirtualHost site1.local
  NameVirtualHost 127.0.0.1
(apparently using localhost won't be sufficient).
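
For a made-up name like site1.local to resolve on the local machine, it also needs an entry in the Windows hosts file, C:\Windows\system32\drivers\etc\hosts (an assumption about my setup rather than something from the Squiz guide), along the lines of:

  127.0.0.1    site1.local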

The installation directories are a matter of personal preference, but as I could see some long directory paths, I've set the Matrix Home directory to C:/www/home/. This allows a fairly close following of the suggested config for UNIX, though file paths still need to be edited to cater for Windows directories. The relevant virtual container starts:

<VirtualHost 127.0.0.1>
 ServerName site1.local
 DocumentRoot "C:/www/home/websites/mysource_matrix/core/web"
 Options -Indexes FollowSymLinks

I also found that I needed to insert further Alias directives into the Apache httpd.conf, for asset_types and for the root:

  Alias /asset_types "C:/www/home/websites/mysource_matrix/data/public/asset_types"
  Alias / "C:/www/home/websites/mysource_matrix/core/web/index.php/"

I then followed a quick start guide and was able to complete the steps there.

Conclusion

Not a 5 minute install like WordPress, but it is fairly straightforward, at least in hindsight!

Thursday, March 25, 2010

Translating Thai with help from electronic tools

With the advent of various electronic tools, translations from one language to another should be greatly facilitated, improved, and made faster. However, I’ve found the initial preparation is no trivial matter. Furthermore, as I hope to show, when it comes to attempting a reasonably reliable translation, you need to draw on your wits and whatever knowledge you’ve tucked away in the recesses of your memory – so having a good memory is a good start!

I’ll indicate some particular issues with respect to Thai, with a few comparisons with other languages. I make no claims about my general linguistic ability and with Thai I consider myself a novice both in speaking and writing, though I’m gradually acquiring more skills – without any language aids I would not be able to get very far at all! Even so, having heard my mother speak to me as a child, I have some sense of how Thai ‘sounds’ and its structure.

Assuming that an electronic document is available, automated assistants, like humans, have to contend with the following general problems:

  • There’s no punctuation in Thai – this means more effort is required in parsing the text, particularly in chunking: working out where the divisions lie between clauses and sentences. I’ve struggled with this and sometimes depend on the tools’ suggestions.
  • There are no tenses in Thai apart from a few designators (token words added in) – it’s not always obvious what tense to use, and if making an arbitrary choice, then consistency is needed across the text as a whole.
  • Phonetic transcriptions are helpful for aiding a quicker reading, but there’s no single standard – I think it’s partly because Thai is tonal, and Romanised phonetics either look clumsy or just omit the tones; it’s also partly because of the sound combinations, many of which could be transcribed in more than one way.

A Suite of Translation Tools

But let’s not be too pessimistic – as Benny the Irish polyglot would say, the language cup is half full! Having created an electronic document, perhaps via scanning, OCR, and manual corrections, it’s time to find the tools to help you read it!

When it comes to electronic assistants, the temptation is to take the easiest route: locate one tool, preferably free and on the Web, and just use that. However, it’s essential to have at least a second opinion! The first electronic tool that I have used in earnest is the Lingvosoft Talking Dictionary Thai to English, though the pronunciation even in the 2010 version is still only in English. :-( This is basically a large conventional dictionary with a simple interface – you type in your word letter by letter, and if you’re not sure of the ending, then it will list words that start with that combination. I originally bought the Windows CE version thinking that it would be handy to have with me on my travels in Thailand, but I’ve not really got used to inputting on a small screen.

The most useful tool amongst all those I’ve tried is Thai2English. A version of the software is available on the Web site http://thai2english.com. I have purchased the full copy, though it should be noted that it only runs on Windows. You can see from exploring the Web site that it goes well beyond a simple dictionary and has quite an array of pedagogic building blocks that support those who are learning Thai.

However, the first thing that can be done is to get a quick sense of what the text is about, and it’s here that I’ve turned to the Web by uploading content into Google Translate. This free service, which has only been available since January 2009, provides a very convenient interface offering a number of ways to get content translated automatically – technically it’s called machine translation. You can enter text into the box, upload a document or enter the URL (Web address) of a page that you’d like translated. You specify the language into which you’d like it translated and then just press the [Translate] button. You can also bookmark combinations, e.g. Thai to English:
http://translate.google.com/?th&tl=en#
(For newcomers, you can get a flavour from a quick overview provided by Google, which covers a lot of ground in a little over a minute, but you can pause, rewind and replay to take it all in...)

Google Translate does set a limit of a few pages per go, so if you have more than a slender booklet, you’d need to repeat this process a number of times, but for most purposes I don’t think that’s going to be very troublesome.

TIP: When running MS Windows (XP), I notice that there’s much better support for Firefox than Internet Explorer, especially when copying from the browser window into a word processor, even into MS Word, where I would intuitively expect more information to be retained from IE.

An example

I’ll consider the title and opening paragraph from my mother’s article about her experience of the Hampshire Buddhist Society. The URL is: http://www.chezpaul.org.uk/fuengsin/dhamma/hants60s.htm.

Here is what Google currently makes of it:

Google Translate's translation into English of a Thai title and paragraph

Room for improvement, yes? I think it’s quite instructive of the challenges facing language learners, so let’s take a closer look at this paragraph.

You can do this using the text box entry form or alternatively, you can actually enter the above URL into Google and ask for English to be returned. Wherever it encounters what it thinks is Thai, Google has a go at translating, so it generally leaves the English untouched, though not completely(!) In this interface, moving my mouse pointer over the translated title reveals the original Thai, ส่วนหนึ่งของชาวพุทธในอังกฤษ:

Google rollover revealing source text in Thai

Here is the phonetic transcription provided by Thai2English:

Phonetic transcription of a sentence generated by Thai2English

Right at the start there’s a lot of scope for differing translations. Let’s compare what Google and I make of it. I’ll do this chunk by chunk:

Title:
ส่วนหนึ่งของชาวพุทธในอังกฤษ
Google’s English:
Part of the Buddhist in England.
Paul’s English:
Some Buddhists in England.

Comments:

  • With Thai, there is no written designation for plural – here Google has interpreted ชาวพุทธ (chaao put) as singular, but should it be in the plural?
  • It opens with a figure of speech ส่วนหนึ่ง (suan neung), a construct recognised by Thai2English:
    Thai2English parsing Thai, recognising a phrase
    Lingvosoft also lists it as a phrase:

    Lingvosoft definition of ส่วนหนึ่ง

However, it’s still grammatically correct to assume that the two words are distinct: ส่วน หนึ่ง. Then a whole host of meanings are possible for ส่วน, which could be one of a number of parts of speech. Lingvosoft indicates:

  • Adverb: apropos
  • Conjunction: as for, as to
  • Noun: fragment, denominator, form, lineament, member, part, portion, proportion, quota, region, section, segment, while, zone, bit, body
  • Preposition: as of

Thus it could be translated: Concerning a Buddhist ... , i.e. about a [single] Buddhist’s experiences in the UK.

So I’ve had to weigh up these alternatives. How to home in on the right meaning? One approach I adopt is to shorten the phrase, which should draw on a larger statistical sample so that the translation is based on more occurrences. Thus I can try ส่วนหนึ่งของชาว (sùan nèung kŏng chaao). Google renders this as 'Part of the people.' This helps persuade me to settle on 'Some people' as the main sense. Yet even with some more pointers it’s still largely guesswork until I’ve had a native or fluent speaker to check it for me.

Having pondered enough over just the title, let’s move onto the first sentence(!)

Sentence 1

นับตั้งแต่ข้าพเจ้าออกจากบ้านเมืองมาอยู่ในประเทศอังกฤษเป็นเวลาเกือบ ๕ ปีไม่มีโอกาสไปวัดทำบุญตักบาตรและฟังพระธรรมเทศนา

Google:
Since I come from homes in the UK for nearly 5 years, no opportunity to measure merits, and put listening preaching.
PT:
Ever since I left my homeland to be in England nearly 5 years ago I have not had the opportunity to go to a temple to make merits, to put almsfood in a monk's bowl, or to listen to the Buddha's teachings.

Comments on Google’s effort:

  • The subject of the sentence almost gets lost at ไม่มี – literally ‘there wasn’t’, but in English it’s clearer to turn this into the first person
  • Google omits the translation of ไป วัด (go to the temple), yet it’s a very common activity.
  • There’s a lack of contextual awareness with “measure merits” – it just doesn’t make sense here!
  • Google translates ตักบาตร as just ‘put’, but it’s a construction, which Thai2English renders as “to put almsfood in a monk's bowl” and Lingvosoft offers: “give food offerings to a Buddhist monk.” Perhaps the latter is safer, but the former really conveys the Thai tradition!
  • The resulting sentence offered by Google is grammatically very poor. If you look at it, there’s a distinct absence of Buddhist-related vocabulary, which suggests a significant gap in the corpora (assuming it is using statistical methods).

Afterwards I made a few more stylistic changes such as changing ‘home’ to ‘homeland’ to emphasize the change in culture.

Sentence 2

ข้าพเจ้ายังมีความเลื่อมใสในพุทธธศาสนาอยู่เสมอ

Google:
I also have a sequin. Enter the Buddhist religious path always.
PT:
Yet I still have faith in the Buddha's teachings.

Comments:

  • Whereas Thai2English translates ความเลื่อมใส as a phrase meaning ‘faithfulness, believability, conviction’, Google errs in its chunking and decides to apply a full stop in the middle of a word, i.e. after ความเลื่อม which literally means ‘glossy things,’ hence ‘sequin’!
  • Google doesn’t retain a single voice – it jumps from first person indicative to imperative(?)
  • The phrase พุทธธศาสนา is just the Thai transcription from the Pali of Buddha Sasana, which just means ‘teachings of the Buddha’. Although ‘Buddhist religious path’ sounds okay, to use the word 'religious' arguably brings with it a lot of unnecessary cultural baggage.

Sentence 3 (first part)

ในยามว่างได้พยายามอ่านหนังสือเกี่ยวกับธรรมนั่งสมาธิวิปัสสนา

Google:
The guard was busy trying to read books about the fair. Insight meditation.
PT:
In my free time I am always trying to read books on Dhamma, sit and practise Vipassana meditation.

Comments

  • Google has split this into two sentences.
  • Google has not recognised that ยาม ว่าง is a phrase; Lingvosoft confirms that on its own ยาม means ‘gatekeeper, guardian, ...’, but Thai2English both defines it as ‘time; hour; period’ and groups this word with ว่าง (‘free, empty, vacant’)
  • Google renders ธรรม as ‘the fair’, but that’s completely out of context. Thai2English helpfully offers, amongst others: ‘dharma’ or ‘[to be] natural, lawful, normal’.
  • It has taken นั่ง สมาธิ วิปัสสนา as just the practice (noun) of insight meditation, rather than as a verb. I’ve emphasized the activity by a longer rendering.

Sentence 3 (second part)

และปฏิบัติธรรมเท่าที่สามารถจะทำได้ในใจนั้นเฝ้าแต่คิดว่าคงจะได้พบกับชาวพุทธเข้าสักวันหนึ่ง

Google:
and practice as they can do but keep in mind that think that would be found to be a Buddhist one day.
PT:
and practise the Dhamma to the best of my ability. I keep these in mind, thinking that I might yet some day get to meet with other Buddhists.

Comments:

  • I found this a difficult clause and am not really sure about the translation.
  • Google’s clause is all over the place
  • Google again fails to translate the key word of ธรรม

As you can see, at present Google’s rendering is very variable, not coherent, and doesn’t make much sense. It seems to chop up sentences and make clauses into short sentences, giving a staccato effect! I’m guessing that Thai is not one of its stronger languages.

Evaluation

I have found that the most helpful translation tool is Thai2English and I copy chunks of Thai there. It gives meanings and phonetic transcriptions word by word, together with help concerning Thai grammar. Occasionally it also fails to chunk correctly and sometimes lacks some vocabulary, but most of the time it does a good job, so where there are doubts or blank spaces, I have often found typographical errors in the original text (or mistakes in the OCR/copy typing).

Google Translate is quick and useful for giving some features, but it’s not fit for translating anything substantial. I’ve found that close-reading is required, for which Thai2English, supplemented by another electronic dictionary – here Lingvosoft – is far more productive.

Whilst Google struggles to provide accurate translations, it does provide a very useful template structure for working on documents: it splits up translations into bite-sized segments of Thai followed by English. At the moment I don't pay too much attention to its translation, but retain it whilst I’m working since sometimes it does offer useful clues. I'm sure that it will improve quite rapidly as it's an important project for them.

At the end of the day the notice pinned onto the board would be: "All translations may be subject to change!"

Sunday, March 21, 2010

Translating Thai: Some Experiences of Digitisation

Is it possible to produce a reasonably accurate translation from Thai into English with only a basic knowledge of the language and the aid of electronic tools? I’m not going to make great claims as my experiences are from home-grown experimentation over a few months. However, having recently completed a few translations, I think there are promising signs. At least I’m quite satisfied with a translation of my mother's article concerning Buddhism in Hampshire in the '60s, which runs to about 2,000 words. So there may be some pointers that others find helpful.

Setting this post in the context of biographical research, I’ll first describe some broad considerations and then discuss digitisation (scanning and optical character recognition). One tip I’d offer is that there needs to be attention to detail – rough and ready methods won’t yield very much that's of value. Certainly, there’s been more involved than I anticipated!

I’ll start with a list of very basic questions - as much for my own benefit as anyone else’s :-)

  • What are you trying to learn? Why is it significant? Even when carrying out research entirely in one’s native language, time often forces choices with regard to the materials that you examine closely. If they are in a foreign language, then that imposes further constraints.
  • Is there anyone who can help? It may be that you can effectively form a team.
  • Of the materials available, which ones are going to shed most light in key areas?
  • Among these materials, which ones are amenable to analysis? Are they easy to access physically? Are they printed or hand-written?

All these points apply to any language, but then each language has further characteristics that can make the situation more or less difficult.

With regard to Thai, its alphabet (44 consonants and 28 vowel forms) is much more elaborate, particularly with the use of diacritics. Even Thais will tell you that looking up words in a dictionary can be quite a chore. Yet, if the letters are clearly formed then actually reading it is not so hard because it’s generally phonetic. As someone with a limited vocabulary, needing to look up many words, I soon decided that it’d be much more convenient to have an accurate transcription in electronic form so that I can use software-based dictionaries.

A note on reading handwriting

So what about Thai handwriting?! In the Thai education system, primary school children learn to write by copying individual Thai printed letters – I’ve seen one of my cousins do this repeatedly when she was 5 years old. When they leave primary school they then learn cursive script, and that stage can mark a huge departure. It’s a similar approach to the one I learnt for English, but I don’t know whether children develop their own style or are guided to adopt one of a number of standard styles. I’ve shown relatives and friends sets of photos with Thai writing on the back – quite often there is a struggle to read what’s written, so it appears to be no easier than English. It’s a daunting prospect, but assuming that the writing is consistent, it becomes a question of recognising patterns, and perhaps understanding its topology will help. So for a given author, it may suffice for someone to translate a sample for me and I can try to figure out the rest.

Anyway, at the moment I can’t read much beyond the printed word, which means I have to ask others to copy type what I can’t read. For general documents concerning work that’s quite feasible, at least for someone in Europe the costs of getting this done in Thailand are affordable. However, a biography containing personal items (which are often of greater interest) requires more care – until their contents are known they should be read only by people you can trust.

So in the remainder of what I share here I’ll confine my attention to printed documents as I indicate a methodology I’m adopting for their translation.

Copy type or scan for OCR?

Technology-assisted translations often start with flatbed scanners that can convert the physical page into an image that then gets ‘read’ using optical character recognition software (OCR). In theory, since the printed word generates letters uniformly, software can accurately interpret them. In practice, results are imperfect for most kinds of sources and can take longer than expected. It may be better simply to copy type.

So when should OCR be used? Whatever language you are trying to read, the utility depends upon the nature and condition of the original document – if it is a fragile pocket volume with hundreds of faded pages with tiny letters in an obscure font, then even if you manage to safely scan the page, you may find OCR yields very poor results.

However, this kind of discussion assumes that there actually is some decent software for any language, when in fact for languages that don’t use Roman script, the situation seems to be very varied...

Available OCR options for Thai (very few!)

For Thai the available options have been very few. On asking a few Thai friends, I drew only blanks, and when I carried out a quick investigation it seemed that until only a few years ago the options were not far out of the university laboratory and didn’t look very amenable. An example is NEC-0006 อ่านไทย เวอร์ชัน 2.5 (OCR), which is inexpensive, but it doesn’t get very good experience reports in a Thai OCR discussion thread.

The larger well-established commercial products such as Omnipage and Abbyy seemed for a long time to have ignored Thai until a couple of years ago when additional language support for Abbyy FineReader Pro was introduced for Thai in version 9. Trusting the claims of accuracy I took the plunge and bought a copy - quite an investment, even with an educational discount.

I’m glad I did as the results are generally good, although its accuracy is inferior to that for languages based on Roman script. For someone like me who types Thai very slowly it is a useful start, but unless the lettering in the documents is very clear so that the accuracy is close to 100%, its utility will fall away for anyone who can type reasonably quickly and accurately.

(In case you are wondering, there have been efforts to recognise handwriting, but it’s a much harder task – I was interested to note, though, that a fairly recent paper, Maximization of Mutual Information for Offline Thai Handwriting Recognition, in IEEE Transactions on Pattern Analysis and Machine Intelligence, makes use of a toolkit that is primarily used for speech recognition research. It prompts the question of the relationship of Thai speech to writing. From my very rudimentary knowledge of Thai linguistics, I gather that it has roots in Sanskrit, where the letters of the alphabet are ordered according to where in the throat/mouth/lips they are formed. Thai reflects this ordering quite substantially, though not completely.)

Undertaking the OCR

I think getting the best results is an art and worth persevering at to make improvements. For all but a few cases with one or two small documents, the whole scanning workflow ought to be considered, as a successful process requires a good rhythm. Washington State Library has a useful checklist and there are some good tips on the OCR process provided by About.com. These cover physical aspects including the selection of the scanner itself, keeping it clean, the placement of the source document, the scan settings (resolution, colour contrast, expected language(s)), and how the scanned image is divided up for the actual process of scanning.

One particular aspect that many software packages provide is training. For text recognition this is basically the process of chopping up the scanned image into a sequence of glyphs (character elements) and assigning glyphs to character names – see e.g. Wikipedia for a detailed entry. As you feed in multiple samples and specify the assignments, it learns how particular characters should be interpreted. There’s a training tutorial for a software library called Gamera, which I found very helpful in explaining the concepts.

I’ve not yet used training, probably because I’ve been a bit lazy to make the effort to learn how to make it learn!

FineReader’s Thai OCR Performance and Correcting the OCR

Here’s a sample of FineReader's output.

Thai OCR in Abbyy FineReader Pro 9

As you can see, it’s a long way from perfection! Here it obviously doesn’t handle the English. I actually set it to interpret everything as Thai – although I could have included English as an additional language, doing so seemed to have the net effect of adversely affecting the Thai rendering, so since English is easy for me to recognise and type, I prefer to let it get that part wrong.

A Thai person might well be dissatisfied with the results, but overall I was quite happy given my very slow Thai typing speed. There were one or two characters that FineReader seemed to really struggle with, but correction was not difficult as the suggested match was often a character used here and not elsewhere – so I could do a ‘search and replace.’ More challenging was the handling of the small diacritical marks – in Thai they are all glyphs since they each contribute towards meaning, either as vowel sounds or tone marks. Instances where there are two such marks on a single letter are common and FineReader often struggled to pick out -่ ไม้เอก (mai ek) – it looks like a hyphen, but its placement varies a lot. If you look at the screenshot carefully, you can see that FineReader simply omits quite a few of these, perhaps because the original source document was not clear enough.

Even if you train an OCR package, there will still be imperfections, so the output needs to be corrected. This process is tedious, but helpful – not least in learning to read! It helps you to familiarise yourself with the alphabet and especially pay attention to the way letters are formed.

If you have a large screen, particularly with widescreen dimensions, then it’s probably easiest to use the scanned image, set the zoom as needed, and place it next to the OCR’ed version that you’re editing.

Conclusion

Although a quick and perfect system is far away, for printed texts a few OCR options are emerging that I find helpful in digitising printed Thai texts. Alternative suggestions are very welcome – I’m keen to improve what I’m doing, even though it’s already been quite an effort and I haven’t yet started talking about the translation itself...!

Wednesday, December 30, 2009

Recalling Memories through Pictures (using multimedia tools)

The processes of contact, feelings, perception and memory are closely interlinked. They are mediated through our senses and for most people the sense that usually predominates is sight. So in trying to put together the early life of my mother, the late Fuengsin Trafford, it's been helpful to carry out interviews based on sets of photographs. I haven't done much planning really, but rather have made things up as I've gone along, working intuitively; it's only now I can see more of the methodology that I've actually followed! I'll report here on that methodology and also on some of the technical tools that I've used to assist me.

My mother left hundreds of photos, which I've tried to arrange in sets according to distinct periods: early childhood, University days, her first years of teaching and so on. I created an index for each set and have pencilled in an incrementing number on the back of each photo, so that they are uniquely identified and there's some order to them, though (as I later would frequently find out) it's not chronological! I then scanned in the photos at a fairly high resolution (on an HP Scanjet 5370C, quite old now) and saved the files using the index as part of the file name. Having done this for a fair proportion of the collection, I've put copies in many places - on laptop hard drives, an external backup disk and memory sticks.

However, merely creating an archive without any descriptions is not much use! For some while I had intended to ask relatives and friends of my mother to enlighten me as to the context and details concerning the photos. I was finally able to set off for my mini fieldwork earlier this month (December), with a copy of the photos on my netbook, an Eee PC. When I met the 'interviewees' in Thailand I recorded the conversations using a digital voice recorder, saving copies of the recordings as files on the netbook.

It was the first time I had properly used such a recording device and my experience of conducting interviews was minimal (though I once did an interview with a Big Issue seller as part of a one day digital video course). So earlier this year I explored the world of digital audio recorders (a process that's familiar for me as I've purchased quite a lot of electronic devices :-) I settled on an Olympus WS-110, which is a compact device, somewhat smaller and lighter than e.g. a Nokia 8210 mobile phone. I chose it based on reviews of its audio quality - good microphone and high quality sampling (see e.g. reviews on Amazon); file format wasn't a concern for me. These devices are evolving rapidly and already Olympus lists this as an archived product, which means you should be able to find it new at a very good price on ebay (which is where I purchased it). Operating the device was very simple.

Then the netbook would serve as a digital lightbox and a basic means of navigation - for a given photo set all the photos would be in the same folder and I'd run a slideshow using the wonderful Irfanview! The major handicap with the netbook is the relatively small screen - in many cases I needed to zoom in (my audio recording has a lot of tapping sounds!) When I was in conversation, I'd start with a preamble about what I was intending to do and ask for permission (it's worth confirming this afterwards as well). Although sometimes you know that everyone is happy, it's a good habit to get into in case I go on to do academic fieldwork, which is something I am deliberating. My main role felt like being a catalyst, with some general encouragement and a few questions sprinkled here and there to elicit a few more details. There's no doubt a large swathe of literature on conducting such interviews, but I didn't read any.

On my return to the UK it was time to transcribe what had been said. To facilitate this, I wanted to associate the audio with the respective pictures (a tradeoff of using a separate recording device rather than doing the recording directly on the netbook). The intended result would be a video consisting of the photos that I had shown with each photo accompanied by the respective audio commentary, i.e. the comments from friends and relatives.

The solution I adopted was to use a video editing tool, Windows Movie Maker (WMM for short), which comes as part of the Windows operating system. I guess it is similar in functionality, if not in elegance, to Apple's iMovie. My familiarity with WMM is very limited, so it's probably best if I just summarise. The basic idea is to create one WMM file for each interview (WMM only provides a single audio track) so that for any given interview, when playing back, you know what was said about a particular picture. Here's a screenshot:

Windows Movie Maker screenshot showing a composition of photos synchronised with an audio track

There are basically three areas: top left is the collection of files that I used to create the composition - this is where you import the photos and the audio and in this case I could import audio straightaway without conversion as it was in WMA format. Top right is the playback for the composition as a whole. However, the work is carried out below in the storyboard/timeline, which consists of parallel tracks. All I used was the Video and Audio tracks, dragging and dropping photos from the collection area, moving them about until there was approximate synchronisation.

However, in writing a biography I need words as well as pictures! The next step in the process is thus transcription. The method I'm using here is to create a large table with the first column containing the photos, one photo per row. Each of the other columns records the transcription from a particular interview. With reference to the WMM files, I'm transcribing what was said about a particular photo in the corresponding cell of the table. Again I'm not being particularly sophisticated about the implementation - it's one mammoth table in an MS Word document. As long as it works, it is okay. For a formal research project I expect this would be better implemented in a database.

Handwriting bonus!

There have been some nice extras in undertaking this exercise. My mother penned many documents in Thai, including a diary kept over several years. It's one thing to learn how to read the printed word, but a further step to decipher Thai handwriting! With these compositions I now have some samples that have been read out (and with the aid of a dictionary I can slowly spell them out myself). To be systematic, for each letter I can build up a set of samples that I can use later on.

For a few hours of recording, there are many more hours in organising and interpreting, but I find it fun to do and along the way I learn a little more about Thai history generally. For anyone contemplating learning more about their own family history, I'd recommend this as a stimulating and informative exercise.

Acknowledgements

I mustn't forget to thank everyone who has kindly provided information in the December interviews, including: Pah Vasana, Khun Jamras, Pah Umpai, P' Laem, P' Darunee & her mother, Khun Chaiwat, P' Yui, P' Ead, Na Tewee, Na Tun, and Pah Jah. If I could contact all those my mother knew well, this list would be very long ...

Monday, October 05, 2009

Social SVG?

A few years ago I was pondering whether SVG could allow more than text-oriented approaches to blogging.

I'm thinking about it again because:

  • more mobiles have touch screens, which encourage doodling
  • updated standard - SVG Tiny 1.2
  • better browser support for displaying and more recently editing SVG
  • Google Wave (and similar initiatives) is presenting a more flexible messaging paradigm

SVG has been around a long time now, but in day-to-day online content-creation it remains rather hidden: whether sending an e-mail or contributing to social networking sites, it's generally text, photos and videos that are created and circulated, with other activities bolted on via apps.

And yet there's already software that makes it easy to draw, to doodle, and not consume lots of computing resources (disk space, processing power etc). Berners-Lee conceived a read/write Web, with his Amaya Web editor/browser having a toggle button between browse and edit. Now the latest version has a very nice SVG editor built-in. And gradually momentum has been building for mobile initiatives built on SVG, generally based on open standards, leading to solutions such as Ikivo.

It seems to me that the time is ripe for all kinds of SVG-based communications. With its graphical nature the replies could be more about editing the images you've been sent - so when you receive an SVG message, you can edit it and send it back. A simple example would be a game of Os and Xs, but it can apply to any scenario where people are sketching a design. It becomes even more attractive with multi-touch. For implementation purposes I guess you could have some form of version control both to make it more efficient and to support animations.
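
To make the idea a bit more tangible, a doodle message in SVG need not be much more than a few path and shape elements, so a 'move' in a game of Os and Xs could amount to appending one element to the file you were sent (a hand-rolled sketch, not any particular messaging format):

  <svg xmlns="http://www.w3.org/2000/svg" width="120" height="120">
    <path d="M40 0 V120 M80 0 V120 M0 40 H120 M0 80 H120" stroke="grey" fill="none"/>  <!-- the grid -->
    <circle cx="20" cy="20" r="14" stroke="blue" fill="none"/>                         <!-- my O -->
    <path d="M46 46 L74 74 M74 46 L46 74" stroke="red" fill="none"/>                   <!-- your X, added in the reply -->
  </svg>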

So basically this is aiming at a drawing equivalent/extension of SMS, blogs, twitter etc.

Google Wave is obviously developing messaging a great deal and no doubt can demonstrate its potential; already there are efforts to incorporate SVG as a gadget, such as Vidor Hokstad's Google Wave Gadget API Emulator. It reminds me of some promising CSCW research into shared authoring widgets/X Windows toolkits that I saw being carried out at Kingston University in the early to mid 90s by Maria Winnett, a former research colleague (can I say 'colleague'? We were actually a diverse group of PhD students sharing a research lab in the Sopwith Building). And it looks like there's been renewed interest that involves the mobile scenario.

However, SVG editing could be as ubiquitous as e-mail so should not be dependent on Google or any other single provider for a transport.

There must be a simpler more universal solution (perhaps there already is ...)

Sunday, April 29, 2007

Earth Day viewed on a mobile phone

Last Sunday, 22 April, all around the world there were gatherings to commemorate and reflect on our custodianship of this world - Earth Day 2007.

At Wat Phra Dhammakaya in Thailand, Earth Day is a special occasion at the temple, as it is also the birthday of the Abbot, the Most Ven. Phra Rajabhavanavisudh (Ven. Dhammajayo bhikkhu). The focus on such occasions is always on inner cultivation, through dana (generosity), sila (virtue) and samadhi (meditation), which can help us to have a sense of proper perspective, and thus how best to help. This is explained in the temple's official programme (sorry, the English is 'Thai' English, but I hope the message is clear enough). I have attended Earth Day in person at the temple in the past, but on this occasion I just joined through the Webcast (which, as usual, started at 9.30am Thai time) broadcast by DMC, the Dhammakaya Media Channel. The more usual alternative is to receive the transmission via satellite, dish and receiver, but I don't have a TV (or PC tuner card)!

This year Ven. Dhammajayo used the occasion of Earth Day to invite monks from 20,000 temples to gather in solidarity for the troubled Southern provinces in Thailand. Through the tsunami and other troubles he has often initiated and supported various efforts to try and ease the difficult situations there. I've frequently heard or watched the Abbot refer to the South - he is evidently very concerned about the welfare of that region, especially concerning his fellow monks.

So on Earth Day, Ven. Dhammajayo expressed particular appreciation to the representatives from more than 200 temples in the affected area who travelled to Pathum Thani province to join the ceremonies, since for many such a journey carried considerable risk. The Abbot devised ceremonies in which the offerings people made to the monks, such as medicines, were chosen so as to safeguard their welfare and thence the welfare of the whole community that supports the Sangha, because even today the Sangha and lay communities work together like an ecosystem.

Although most participants were from Thailand, some came from other countries. For major occasions like this, invitations are extended around the world, and there are some links especially with other monastic orders. For example, Wat Phra Dhammakaya has a sister temple relationship with Fo Kuang Shan in Taiwan.

Also on Earth Day there was the ceremony to cast Buddha images to go inside the Maha Dhammakaya Cetiya. Sponsoring a Buddha image is a beneficial thing to do, not least because when you come to sit in meditation, you can start by recollecting your good deed and so begin in a happier state of mind :-).

DMC has been broadcasting over the Internet for a while, and I had previously only ever accessed the webcasts using a desktop or laptop computer. However, after the ceremony was over I tried to see whether I could view anything using a mobile phone, an HTC P3600, to which I upgraded quite recently. I accessed the standard Web page and navigated to the video streaming page (click on the little red banner with a satellite dish) and, to my surprise, I was indeed able to watch the video, embedded in the browser. The two screenshots above are from the DMC site.

The images are somewhat smaller than might be expected on, say, a small laptop, though there may be a way of increasing the display size on the Web site. I found I could get a slightly better view by rotating the display, but the width of the video seemed to remain the same. In due course, I intend to take a closer look at using Internet video on small computers, but for now it's wonderful that it actually works!

Sunday, January 14, 2007

ZX Spectrum and Scrabble Nostalgia

I've been prompted to delve back into my teenage years by a surprising article in the February issue of PC Pro which urges us to 'Forget 3D games, says Dick Pountain, it's Scrabble that PCs really need to get to grips with.'

From an early age I had a penchant for words and numbers, their calculation and manipulation. This manifested in several ways during my time at secondary school: I became keen on Scrabble and I started teaching myself to program computers.

Like many of my generation, I owe my first steps to Sir Clive Sinclair: I started with a Sinclair ZX81, which my parents kindly bought me for Christmas. With 1K of RAM, I was limited in what I could develop, though I was able to validate UK VAT registration numbers! Nevertheless, it was enough to introduce me to a new world, and I could write my first programs in Sinclair's implementation of BASIC. Within a year there was another breakthrough, the ZX Spectrum, and soon afterwards I persuaded my parents to invest again, this time in a new toy that boasted 16K RAM, 16 colours, sound and a wider range of software titles.

I then made a concerted effort to produce a Scrabble program, where the computer could act as one of the players. Whereas previously I had been content to write everything in BASIC, in this instance, I learnt sufficient Z80 machine code to be able to convert the main 'thinking' algorithms. Result: the computer responded in a few seconds rather than a couple of minutes!

It was quite a work of art to cram as much as possible into the 16K. The computer had a 500-word vocabulary and included an algorithm to ensure it played the highest-scoring move, but generally it wasn't a strong opponent. In actuality, 7K was devoted to the screen display, which could be set up as part of the procedure to load the program. Normally that leaves you with just 9K, but I allotted this space to display a detailed logo in one section and a simple blocky 'Welcome' title in another, which 'hid' about 2K of instructions that could be read as a scrolling message along the bottom of the screen. Once the game had started, though, the whole screen display was refreshed, so these instructions could not be revisited unless the game was reloaded.

Having played Scrabble competitively, I wanted to see a version developed that played much more strongly. After a while, a highly polished product appeared: Psion Scrabble. I wrote to Psion at the beginning of '86 describing tactics that could enhance the software. Three months later I received a kind response thanking me for the ideas and wishing me well in my 'A' levels, but the overall message was that Psion was going to concentrate on the development of its hardware products. (Perhaps I should have bought some shares?!)

I stopped development of the code around that time, but retained some interest in how tactics could be encoded. About 10 years later, I happened to come across a journal article by Steven Gordon concerning Scrabble algorithms. I corresponded a little by email and learnt that he had implemented a number of similar ideas, but I think far more systematically! So he's probably a good contact for Mr. Pountain.

Over 20 years later...

I've come across an old cassette tape with a copy of the program on it and, having invested in an external sound card to digitise these tapes, I decided to undertake a conversion. After some fiddling, I worked out the right settings and the process works fine: the Creative Player is able to sample at the right frequency and bit depth, and I used MakeTZX to convert the recording into a tape archive format.

If you're curious to see the program (and don't have high expectations!) you are welcome to download a copy, available as a zip package. When you've unzipped the package, you will see a number of files, including a readme and instructions. Look through those and then launch the .tzx file in an emulator; I've found emuZwin works very well on a PC.

Sunday, July 16, 2006

RAMBLE Project blog - hiatus and archival

This post concerns a work-related blog I have been maintaining, which disappeared off the radar for a couple of weeks or so. This is to explain what has happened.

From Autumn 2004 until Spring 2005 I managed a small externally-funded project in mobile learning called RAMBLE, which concerned blogging on PDAs and other handheld devices and linking them with institutional learning environments. A readable overview was published in an online journal called Ariadne.

As part of the process, I maintained a project blog, and the budget covered all the hosting needs; but once the project had finished - as so often happens - the blog could only be maintained on goodwill and very minimal resources. Even so, the blog server software, Pebble weblog, impressed several colleagues, and even the Director hosted his blog there... But alas, we were hit by spam, which escalated in magnitude; it was decided to remove the service, and I don't think it will come back online :-(

For a while none of the blogs were available at all, but I've found a way of creating an archive that, all being well, preserves the original addresses of the posts, i.e. the permalinks. Pebble stores everything to do with each blog in flat files, so I simply copied the files across to a fresh local installation of Pebble and ran a spidering tool (wget) to grab a static snapshot, which the sys admin could then copy to the server. As I type, there's a wget-generated archive available, but it's not yet complete and still retains options for posting comments etc.

Another blog, pault@LTG, has suffered the same problems, and I need to find a replacement; I'm thinking of setting up on Educause, as I'm a registered member and due to attend the 2006 conference in Dallas in October.

Sunday, April 30, 2006

A Research Genealogy Project?

The Mathematics Genealogy project provides a field to categorise dissertations according to the Math Subject Class. Seeing that the selection is very broad (covering computer science, for example), I was prompted to wonder about genealogy projects for other subjects. There appear to be a few ideas and initiatives, including Thomas Witten's proposal for a Physics PhD Genealogy project, the High Energy Physics directory, the Software Engineering Academic Genealogy, the Theoretical Computer Science Genealogy and the Notre Dame University academic genealogy, which covers current members of its departments of Chemistry & Biochemistry and Physics.

It's a very fragmented picture, with independently developed systems, very partial coverage of researchers and yet already some duplication. It will become even more so as subject disciplines keep growing...

So it makes sense to me to take a fundamentally more integrated view that incorporates research in any field, one with a richer model that takes into account different kinds of research qualifications, not just PhDs, and different kinds of relationships, not just the formal supervisor-student one, thereby responding to issues raised in the Mathematics PhD in the United Kingdom.
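Purely as an illustrative sketch (the field names and example categories below are my own invention, not taken from any existing project), such a richer model might distinguish qualifications and relationship types along these lines:

    # An illustrative sketch of a richer genealogy data model; the fields and
    # example categories are hypothetical, not from the Mathematics Genealogy Project.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Researcher:
        name: str
        institutions: List[str] = field(default_factory=list)

    @dataclass
    class Qualification:
        holder: Researcher
        kind: str          # e.g. "PhD", "DPhil", "Habilitation", "MSc by research"
        subject: str
        year: int
        institution: str

    @dataclass
    class Relationship:
        senior: Researcher
        junior: Researcher
        kind: str          # e.g. "formal supervisor", "informal advisor", "examiner"
        qualification: Optional[Qualification] = None   # the thesis it relates to, if any

The essential point is that the relationship, not just the degree, becomes a first-class record.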

The findings yielded on this broader base will be fascinating, showing among other things how disciplines evolve over the generations, shedding light on questions such as: What happened to descendants of those who studied classics? What did the ancestors of computer scientists research? Many trends can be observed. There's a lot of talk in the UK about lifelong learning, so how about considering lifelong and generational research?

Another aspect that needs attention is the quality of entries. It's a tall order for just one central team to be responsible for verifying the information received and compiling the database, which is the current arrangement at the Mathematics Genealogy Project. It would be better to distribute the workload and make use of local expert knowledge wherever possible, with local experts authorised to update data in the areas with which they are familiar, whilst still allowing for as wide a public participation as possible.

So what's the solution?

I'm quite sure that the biggest consideration is organisational, not technical. It's probably a workflow problem, and perhaps it can be addressed by appealing to other international networks, most likely business networks. The quality control needs to rest with academic departments, and it seems sensible that they should deal with information relating first to their own department, then their institution and then neighbouring institutions. So I envisage an international network of genealogy research nodes where public contributions would be submitted through their nearest research node, rather like 'contact your nearest reseller'.

A few days ago I attended a presentation by someone who has done work for the World Wide Web Consortium, and he reiterated the point that if there's one technical issue affecting software above all others, it's scalability. So any proposal probably ought to design and develop a system that distributes the processing (CPU and resources) as well as the administration, though the computing power need not be distributed per site (big companies typically use a few data centres containing large numbers of rack-mounted PCs). This suggests an application for a parallel computing grid.

I don't know what the implementation itself should look like: it could well be underpinned by a relational database or might even be a special kind of wiki (thinking about how that can really grow rapidly). However, the data model should certainly be given careful consideration. How to deploy it on the Internet? How to authenticate and authorise? Lots of questions will pop up if one investigates further!

Mathematics Genealogy: Indexing

Exploring some of the entries in the Mathematics Genealogy project has led me to learn about some interesting and unexpected connections, but it has also highlighted quite a number of limitations regarding accuracy and promptness of updates. Saying this is really just an indication that if you offer something good, people will be looking for more!

One particular issue is that computing the total number of descendants requires a separate process to be run, as explained by the FAQ, which says:

Because of the time required to run the descendant counting program, it is only run once per week (early morning US Central Time on Sundays), while our data is updated nightly.

That surprises me somewhat: with around 100,000 people, not very many details stored per person and few relations, it's not a big, complex database. The issue here is probably that it's a relational database and the advisor-student relationship is hierarchical, somewhat like a tree structure. However, it's not a tree, because a person can have multiple parents (multiple advisors); it is rather a directed graph, where the nodes represent the mathematicians and the edges correspond to the advisory relationship. [I'm taking definitions from MathWorld, an encyclopaedia that provides clear and nicely formatted explanations with diagrams.] Further, I think there is a fair chance that it is more general than a simple directed graph, given the scenario of the same advisor supervising the same candidate for more than one thesis - although that might sound unlikely today, it was quite plausible a few centuries ago, when a researcher could be at the forefront of several fields. I'd also expect it to be an oriented graph, in that supervision is expected to go in one direction, but it's not inconceivable that a student produces theses separately in two fields under two supervisors and then shares the knowledge back across.

Returning to the problem of counting: hierarchical relationships are easy to model in a relational database, but retrieving even summary counts may mean a lot of spidering through the hierarchy, which can be very slow. The key consideration is how to index the database. I'm not a database expert, but I have seen this issue in my daily work as an administrator of WebLearn, an e-learning system based on software called Bodington, which is essentially a web database application. The system contains various resources arranged hierarchically, in trees, so it is a more specific case than the genealogy one.

Jon Maber, the original developer, had started work on Bodington in the mid 90s and had thought about the issue of efficient queries about resources within a given branch; he reviewed approaches to indexing and decided to adopt the tree visitation model devised by Joe Celko. Celko had given consideration to this graph theory problem and came up with SQL for Smarties: A Look at SQL Trees, an article that appeared in DBMS, March 1996. Basically, each node or vertex has two indices - left and right - that are numbered according to a complete tour of all the nodes, visiting each twice. It means that selecting the number of descendants of a resource is a simple SQL statement involving just the two indices at the given node (the count is (right - left - 1) / 2). However, there is a trade-off in that every time you update the database you need to update the index, so if lots of changes are being made it can be a major performance issue.
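To make the idea concrete, here is a small sketch in Python (not Celko's original SQL, and using a made-up toy tree) of assigning the left/right indices in a single depth-first tour and then reading off descendant counts with no traversal at all; it applies directly only to the tree case, as in Bodington:

    # A small sketch of Celko's nested set ("tree visitation") numbering;
    # the toy advisor -> student tree below is made up purely for illustration.
    def assign_nested_sets(children, root):
        """Give each node (left, right) indices from a single depth-first tour."""
        index = {}
        counter = 0

        def visit(node):
            nonlocal counter
            counter += 1
            left = counter
            for child in children.get(node, []):
                visit(child)
            counter += 1
            index[node] = (left, counter)

        visit(root)
        return index

    def descendant_count(index, node):
        # With nested sets, no traversal is needed: count = (right - left - 1) / 2.
        left, right = index[node]
        return (right - left - 1) // 2

    tree = {"A": ["B", "C"], "B": ["D", "E"]}
    idx = assign_nested_sets(tree, "A")
    print(descendant_count(idx, "A"))  # 4
    print(descendant_count(idx, "B"))  # 2

The trade-off mentioned above is visible here too: any insertion or deletion means renumbering, i.e. re-running the tour or shifting a range of indices.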

Celko's solution may not be appropriate in this case, but it looks like the kind of approach that may lead to a suitable index allowing real-time queries of descendant counts. The article was published more than 10 years ago, so I expect research has progressed a fair bit since then.

Saturday, April 22, 2006

Structure and Flow - an example in XML editing

Carrying on with the balance of structure and flow, although I left formal methods research a long time ago, I still come across it as a recurring theme in IT work.

At my present workplace, as a software developer I found myself with the task of extending a web-based system to allow anyone to use the web to edit some data encoded in XML. XML (short for 'eXtensible Markup Language') is a hot topic that promises the recording of meaningful information, its long-term preservation, and wonderful exchange and interoperability among software systems (not least because it's stored in a text file, so you can read an XML file in Notepad). An XML file is a data file, basically a hierarchical structure of tags and content. It's structure and data in one.

So where's the flow? That comes in the editing, because to edit the documents I devised a system that used a functional programming language called XSLT [well, it looks like it should be functional, though proof "by example" doesn't look like proof!]. Every change to an XML document is carried out in terms of XSLT: suppose we have XMLDOC1; we apply an XSLT stylesheet, xslt1, to get XMLDOC2; then apply xslt2 to get XMLDOC3; and so on. In practice, each stylesheet defines a slight change in the document, with all else remaining the same. The operative verbs are simply: add, delete, and update. Perhaps you could use the word 'perturbation' for this?

As it happens, an XSLT stylesheet is itself an XML document, so it has all those nice qualities described above. This means the system gives you a convenient text-based history not only of the documents but also of the transformations, and I can share not just the data but also the transformations needed to carry out the changes. There's a walkthrough illustrating what I mean through a number of screenshots.

Now there's an irony in using XSLT for change: because the stylesheets are themselves XML documents, the transformations are themselves data and structure. So in one sense we have a sequence of data structures, where the data and the way it changes are in the same format. But what I haven't addressed is how you actually generate and carry out the transformations. This requires a processor!
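As a minimal sketch of such a pipeline (using Python's lxml as the XSLT processor and hypothetical file names, rather than the system I actually built), chaining the stylesheets looks like this:

    # A minimal sketch of applying a sequence of XSLT 'perturbations' to an XML document.
    # File names are hypothetical; lxml is used here simply as a convenient XSLT processor.
    from lxml import etree

    doc = etree.parse("xmldoc1.xml")

    # Each stylesheet encodes one small change (an add, delete or update);
    # applying them in order replays the edit history of the document.
    for stylesheet in ["xslt1.xsl", "xslt2.xsl", "xslt3.xsl"]:
        transform = etree.XSLT(etree.parse(stylesheet))
        doc = transform(doc)

    # The document after all the perturbations have been applied.
    print(etree.tostring(doc, pretty_print=True).decode())

Because both the documents and the stylesheets are plain XML text files, the whole history can be kept, inspected and shared as ordinary files.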

I've grown to appreciate this system, for it gives me a sense of holism - a stream of documents and transformations in one flow. It certainly intrigued one of my colleagues, for whom XML and XSLT are very much bread and butter!