Sunday, March 28, 2010

Invitation to a Middle Way retreat in Surrey

The UK has just welcomed the prospect of summer as it enters 'British Summer Time' - the clocks have just been set forward one hour ('spring forward ... fall behind' is how I remember it), which means that the evenings are extended bringing a greater opportunity to enjoy the natural environment after work :-)

This summer there's a special opportunity to learn meditation over a long weekend - a 4 day retreat (2-5 July 2010) organised by the Middle Way team, which has for several years successfully run meditation retreats for Westerners in Northern Thailand. I've been helping with its organisation: the location is the Ladywell Retreat centre, which was recommended by a friend, and I think it will be an excellent venue.

It's aimed primarily at those who have some experience of meditation, particularly in the dhammakaya tradition, but I think it's open to anyone who is in reasonably good health and keen to learn. For those who have continued practising since attending sessions organised by one of our temples, it will be a great opportunity to intensify the practice under the guidance of experienced monks. Even if you haven't been able to continue meditating, or not as much as you would have liked, this will be an excellent way of getting back into the practice, purifying and calming the mind, finding inner peace and giving you a firm basis for further spiritual development.

Interested? You can find out some details in an accompanying leaflet and obtain instructions on how to join from Wat Phra Dhammakaya (London) (Not many Thais have heard yet of Woking!)

UPDATE: Further details, including photos from the retreat centre, have been posted in the Wat's blog.

Thursday, March 25, 2010

Translating Thai with help from electronic tools

With the advent of various electronic tools, translation from one language to another should be greatly facilitated, improved, and made faster. However, I’ve found the initial preparation is no trivial matter. Furthermore, as I hope to show, when it comes to attempting a reasonably reliable translation, you need to draw on your wits and whatever knowledge you’ve tucked away in the recesses of your memory – so having a good memory is a good start!

I’ll indicate some particular issues with respect to Thai, with a few comparisons with other languages. I make no claims about my general linguistic ability and with Thai I consider myself a novice both in speaking and writing, though I’m gradually acquiring more skills – without any language aids I would not be able to get very far at all! Even so, having heard my mother speak to me as a child, I have some sense of how Thai ‘sounds’ and its structure.

Assuming that an electronic document is available, automated assistants, like humans, have to contend with the following general problems:

  • There’s no punctuation in Thai – it means that more effort is required in parsing the text and particularly in chunking, i.e. working out where divisions lie between clauses and sentences. I’ve struggled with this and sometimes depend on the tools’ suggestions.
  • There are no tenses in Thai apart from a few designators (token words added in) – it’s not always obvious what mode of voice to use and if making an arbitrary choice, then consistency is needed across the text as a whole.
  • Phonetic transcriptions are helpful for aiding a quicker reading, but there’s no single standard – I think it’s partly because Thai is tonal, and Romanised phonetics either look clumsy or just omit the tones; it’s also partly because of the sound combinations, many of which could be transcribed in more than one way.

A Suite of Translation Tools

But let’s not be too pessimistic – as Benny the Irish polyglot would say, the language cup is half full! Having created an electronic document, perhaps via scanning, OCR, and manual corrections, it’s time to find the tools to help you read it!

When it comes to electronic assistants, the temptation is to take the easiest route: locate one tool, preferably free and on the Web, and just use that. However, it’s essential to have at least a second opinion! The first electronic tool that I have used in earnest is Lingvosoft Talking Dictionary Thai to English, though the pronunciation even in the 2010 version is still only in English. :-( This is basically a large conventional dictionary with a simple interface – you type in your word letter by letter, and if you’re not sure of the ending, then it will list words that start with that combination. I originally bought the Windows CE version thinking that it would be handy to have with me on my travels in Thailand, but I’ve not really got used to inputting on a small screen.

Of all the tools I’ve tried, the one I’ve found most useful is Thai2English. A version of the software is available on the Web site http://thai2english.com. I have purchased the full copy, though it should be noted that it only runs on Windows. You can see from exploring the Web site that it goes well beyond a simple dictionary and has quite an array of pedagogic building blocks that support those who are learning Thai.

However, the first thing that can be done is to get a quick sense of what the text is about and it’s here that I’ve turned to the Web by uploading content into Google Translate. This free service, which has only been available since January 2009, provides a very convenient interface offering a number of ways to get content translated automatically – technically it’s called machine translation. You can enter text into the box, upload a document or enter the URL (Web address) of a page that you’d like translated. You specify the language into which you’d like it translated and then just press the [Translate] button. You can also bookmark combinations, e.g. Thai to English:
http://translate.google.com/?th&tl=en#
(For newcomers, you can get a flavour from a quick overview provided by Google, which covers a lot of ground in a little over a minute, but you can pause, rewind and replay to take it all in...)

Google Translate does set a limit of a few pages per go, so if you have more than a slender booklet, you’d need to repeat this process a number of times, but for most purposes I don’t think that’s going to be very troublesome.
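
For a longer document, the repeated copy-and-paste can be prepared in advance by splitting the text into batches. Here is a minimal sketch in Python, assuming the paragraphs are separated by blank lines; the function name and the figure of 3000 characters are my own placeholders, not a limit published by Google:

```python
def split_into_batches(text, max_chars=3000):
    """Group paragraphs into batches of at most max_chars characters.

    max_chars is a hypothetical limit chosen for illustration.
    A single paragraph longer than max_chars is kept whole in its own batch.
    """
    batches, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # Adding this paragraph would exceed the limit: close the batch
            batches.append(current)
            current = para
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for number, batch in enumerate(split_into_batches(sample, max_chars=40), 1):
        print(number, len(batch))
```

Splitting at paragraph boundaries rather than at a fixed character count avoids cutting a Thai sentence in half, which would make the chunking problem described earlier even worse.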

TIP: When running MS Windows (XP), I notice that there’s much better support for Firefox than Internet Explorer, especially when copying from the browser Window into a Word Processor, even to MS Word, when I intuitively expect more information to be retained from IE.

An example

I’ll consider the title and opening paragraph from my mother’s article about her experience of the Hampshire Buddhist Society. The URL is: http://www.chezpaul.org.uk/fuengsin/dhamma/hants60s.htm.

Here is what Google currently makes of it (click on the image to see the full size version):

Google Translate's translation into English of a title and paragraph of Thai

Room for improvement, yes? I think it’s quite instructive of the challenges facing language learners, so let’s take a closer look at this paragraph.

You can do this using the text box entry form or alternatively, you can actually enter the above URL into Google and ask for English to be returned. Wherever it encounters what it thinks is Thai, Google has a go at translating, so it generally leaves the English untouched, though not completely(!) In this interface, moving my mouse pointer over the translated title reveals the original Thai, ส่วนหนึ่งของชาวพุทธในอังกฤษ:

Google rollover revealing source text in Thai

Here is the phonetic transcription provided by Thai2English:

Phonetic transcription of a sentence generated by Thai2English

Right at the start there’s a lot of scope for differing translations. Let’s compare what Google and I make of it. I’ll do this chunk by chunk:

Title:
ส่วนหนึ่งของชาวพุทธในอังกฤษ
Google’s English:
Part of the Buddhist in England.
Paul’s English:
Some Buddhists in England.

Comments:

  • With Thai, there is no written designation for plural – here Google has interpreted ชาวพุทธ (chaao put) as singular, but should it be in the plural?
  • It opens with a figure of speech ส่วนหนึ่ง (suan neung), a construct recognised by Thai2English:
    Thai2English parsing Thai, recognising a phrase
    Lingvosoft also lists it as a phrase:

    Lingvosoft definition of ส่วนหนึ่ง

However, it’s still grammatically correct to assume that the two words are distinct: ส่วน หนึ่ง. Then a whole host of meanings are possible for ส่วน, which could be one of a number of parts of speech. Lingvosoft indicates:

Adverb.
Apropos;
Conjunction.
As for, as to
Noun.
Fragment, denominator, form, lineament, member, part, portion, proportion, quota, region, section, segment, while, zone, bit, body
Preposition.
As of

Thus it could be translated: Concerning a Buddhist ... , i.e. about a [single] Buddhist’s experiences in the UK.

So I’ve had to weigh up these alternatives. How to home in on the right meaning? One approach I adopt is to shorten the phrase, which should draw on a larger statistical sample so that the translation is based on more occurrences. Thus I can try ส่วนหนึ่งของชาว (sùan nèung kŏng chaao). Google renders this as 'Part of the people.' This helps persuade me to settle on 'Some people' as the main sense. Yet even with some more pointers it’s still largely guesswork until I’ve had a native or fluent speaker to check it for me.

Having pondered enough over just the title, let’s move onto the first sentence(!)

Sentence 1

นับตั้งแต่ข้าพเจ้าออกจากบ้านเมืองมาอยู่ในประเทศอังกฤษเป็นเวลาเกือบ ๕ ปีไม่มีโอกาสไปวัดทำบุญตักบาตรและฟังพระธรรมเทศนา

Google:
Since I come from homes in the UK for nearly 5 years, no opportunity to measure merits, and put listening preaching.
PT:
Ever since I left my homeland to be in England nearly 5 years ago I have not had the opportunity to go to a temple to make merits, to put almsfood in a monk's bowl, or to listen to the Buddha's teachings.

Comments on Google’s effort:

  • The subject of the sentence almost gets lost at ไม่มี – literally ‘there wasn’t’, but in English it’s clearer to turn this into the first person
  • Google omits the translation of ไป วัด (go to the temple), yet it’s a very common activity.
  • There’s a lack of contextual awareness with “measure merits” – it just doesn’t make sense here!
  • Google translates ตักบาตร as just ‘put’, but it’s a construction, which Thai2English renders as “to put almsfood in a monk's bowl” and Lingvosoft offers: “give food offerings to a Buddhist monk.” Perhaps the latter is safer, but the former really conveys the Thai tradition!
  • The resulting sentence offered by Google is grammatically very poor. If you look at it, there’s a distinct absence of Buddhist-related vocabulary, which suggests a significant gap in the corpora (assuming it is using statistical methods).

Afterwards I made a few more stylistic changes such as changing ‘home’ to ‘homeland’ to emphasize the change in culture.

Sentence 2

ข้าพเจ้ายังมีความเลื่อมใสในพุทธธศาสนาอยู่เสมอ

Google:
I also have a sequin. Enter the Buddhist religious path always.
PT:
Yet I still have faith in the Buddha's teachings.

Comments:

  • Whereas Thai2English translates ความเลื่อมใส as a phrase meaning ‘faithfulness, believability, conviction’, Google errs in its chunking and decides to apply a full stop in the middle of a word, i.e. after ความเลื่อม which literally means ‘glossy things,’ hence ‘sequin’!
  • Google doesn’t retain a single voice – it jumps from first person indicative to imperative(?)
  • The phrase พุทธธศาสนา is just the Thai transcription from the Pali of Buddha Sasana, which just means ‘teachings of the Buddha’. Although ‘Buddhist religious path’ sounds okay, to use the word 'religious' arguably brings with it a lot of unnecessary cultural baggage.
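
To illustrate why chunking matters so much here, below is a minimal sketch in Python of a dictionary-based "longest match" word segmenter. It is not a description of how Google Translate or Thai2English work internally (I don't know that); the function name and the tiny lexicon are hypothetical, made up purely for the example:

```python
def longest_match_segment(text, lexicon):
    """Greedy left-to-right segmentation preferring the longest lexicon entry."""
    longest = max((len(word) for word in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon:
                words.append(piece)
                i += size
                break
        else:
            # Nothing in the lexicon matches here: emit a single character
            words.append(text[i])
            i += 1
    return words


phrase = "ความเลื่อมใส"

# Hypothetical lexicon that knows the whole phrase as one entry
with_phrase = {"ความ", "เลื่อม", "ใส", phrase}
# The same lexicon without the phrase entry
without_phrase = with_phrase - {phrase}

print(longest_match_segment(phrase, with_phrase))
print(longest_match_segment(phrase, without_phrase))
```

With the full phrase in the lexicon, the function returns it as one word; without that entry the same text falls apart into smaller pieces, each of which may then be translated on its own. Something along those lines would be enough to turn 'faith' into 'sequin'.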

Sentence 3 (first part)

ในยามว่างได้พยายามอ่านหนังสือเกี่ยวกับธรรมนั่งสมาธิวิปัสสนา

Google:
The guard was busy trying to read books about the fair. Insight meditation.
PT:
In my free time I am always trying to read books on Dhamma, sit and practise Vipassana meditation.

Comments

  • Google has split this into two sentences.
  • Google has not recognised that ยาม ว่าง is a phrase; Lingvosoft confirms that on its own ยาม means ‘gatekeeper, guardian, ...’, but Thai2English both defines it as ‘time; hour; period’ and groups this word with ว่าง (‘free, empty, vacant’).
  • Google renders ธรรม as ‘the fair’, but that’s completely out of context. Thai2English helpfully offers amongst others: ‘dharma’ or ‘[to be] natural, lawful, normal’.
  • It has taken นั่ง สมาธิ วิปัสสนา as just the practice (noun) of insight meditation, rather than as a verb. I’ve emphasized the activity by a longer rendering.

Sentence 3 (second part)

และปฏิบัติธรรมเท่าที่สามารถจะทำได้ในใจนั้นเฝ้าแต่คิดว่าคงจะได้พบกับชาวพุทธเข้าสักวันหนึ่ง

Google:
and practice as they can do but keep in mind that think that would be found to be a Buddhist one day.
PT:
and practise the Dhamma to the best of my ability. I keep these in mind, thinking that I might yet some day get to meet with other Buddhists.

Comments:

  • I found this a difficult clause and am not really sure about the translation.
  • Google’s clause is all over the place
  • Google again fails to translate the key word of ธรรม

As you can see, at present Google’s rendering is very variable, not coherent, and doesn’t make much sense. It seems to chop up sentences and make clauses into short sentences, giving a staccato effect! I’m guessing that Thai is not one of its stronger languages.

Evaluation

I have found that the most helpful translation tool is Thai2English and I copy chunks of Thai there. It gives meanings and phonetic transcriptions word by word, together with help concerning Thai grammar. Occasionally it also fails to chunk correctly and sometimes lacks some vocabulary, but most of the time it does a good job, so that where there are doubts or blank spaces, I have often found that there are typographical errors in the original text (or mistakes in the OCR/copy typing).

Google Translate is quick and useful for giving some features, but it’s not fit for translating anything substantial. I’ve found that close-reading is required, for which Thai2English, supplemented by another electronic dictionary – here Lingvosoft – is far more productive.

Whilst Google struggles to provide accurate translations, it does provide a very useful template structure for working on documents: it splits up translations into bite-sized segments of Thai followed by English. At the moment I don't pay too much attention to its translation, but retain it whilst I’m working since sometimes it does offer useful clues. I'm sure that it will improve quite rapidly as it's an important project for them.

At the end of the day the notice pinned onto the board would be: "All translations may be subject to change!"

Sunday, March 21, 2010

Translating Thai: Some Experiences of Digitisation

Is it possible to produce a reasonably accurate translation from Thai into English with only a basic knowledge of the language and the aid of electronic tools? I’m not going to make great claims as my experiences are from home-grown experimentation over a few months. However, having recently completed a few translations, I think there are promising signs. At least I’m quite satisfied with a translation of my mother's article concerning Buddhism in Hampshire in the '60s, which runs to about 2,000 words. So there may be some pointers that others find helpful.

Setting this post in the context of biographical research, I’ll first describe some broad considerations and then discuss digitisation (scanning and optical character recognition). One tip I’d offer is that there needs to be attention to detail – rough and ready methods won’t yield very much that's of value. Certainly, there’s been more involved than I anticipated!

I’ll start with a list of very basic questions - as much for my own benefit as anyone else’s :-)

  • What are you trying to learn? Why is it significant? Even when carrying out research entirely in one’s native language, time often forces choices with regard to the materials that you examine closely. If they are in a foreign language, then that imposes further constraints.
  • Is there anyone who can help? It may be that you can effectively form a team.
  • Of the materials available, which ones are going to shed most light in key areas?
  • Among these materials, which ones are amenable to analysis? Are they easy to access physically? Are they printed or hand-written?

All these points apply to any language, but then each language has further characteristics that can make the situation more or less difficult.

With regard to Thai, its alphabet (44 consonants and 28 vowel forms) is much more elaborate, particularly with the use of diacritics. Even Thais will tell you that looking up words in a dictionary can be quite a chore. Yet, if the letters are clearly formed then actually reading it is not so hard because it’s generally phonetic. As someone with a limited vocabulary, needing to look up many words, I soon decided that it’d be much more convenient to have an accurate transcription in electronic form so that I can use software-based dictionaries.

A note on reading handwriting

So what about Thai handwriting?! In the Thai education system, primary school children learn to write by copying individual Thai printed letters – I’ve seen one of my cousins do this repeatedly when she was 5 years old. When they leave primary school they then learn cursive script and that stage can mark a huge departure. It’s a similar approach to the one I learnt for English, but I don’t know whether children develop their own style or are guided to adopt one of a number of standard styles. I’ve shown sets of photos to relatives and friends with Thai writing on the back – quite often there is a struggle to read what’s written, so it appears to be no easier than English. It’s a daunting prospect, but assuming that the writing is consistent, then it becomes a question of recognising patterns and perhaps understanding its topology will help. So for a given author, it may suffice for someone to translate a sample for me and I can try to figure out the rest.

Anyway, at the moment I can’t read much beyond the printed word, which means I have to ask others to copy type what I can’t read. For general documents concerning work that’s quite feasible: at least for someone in Europe, the costs of getting this done in Thailand are affordable. However, a biography containing personal items (which are often of greater interest) requires more care – until their contents are known they should be read only by people you can trust.

So in the remainder of what I share here I’ll confine my attention to printed documents as I indicate a methodology I’m adopting for their translation.

Copy type or scan for OCR?

Technology-assisted translations often start with flatbed scanners that can convert the physical page into an image that then gets ‘read’ using optical character recognition software (OCR). In theory, since the printed word generates letters uniformly, software can accurately interpret them. In practice, results are imperfect for most kinds of sources and can take longer than expected. It may be better simply to copy type.

So when should OCR be used? Whatever language you are trying to read, the utility depends upon the nature and condition of the original document – if it is a fragile pocket volume with hundreds of faded pages with tiny letters in an obscure font, then even if you manage to safely scan the page, you may find OCR yields very poor results.

However, this kind of discussion assumes that there actually is some decent software for any language, when in fact for languages that don’t use Roman script, the situation seems to be very varied...

Available OCR options for Thai (very few!)

For Thai the available options have been very few. On asking a few Thai friends, I drew only blanks and when I carried out a quick investigation it seemed that until only a few years ago, the options were not far out of the university laboratory and didn’t look very amenable. An example is NEC-0006 อ่านไทย เวอร์ชัน 2.5 (OCR), which is inexpensive, but it doesn’t get very good experience reports from a Thai OCR discussion thread.

The larger well-established commercial products such as Omnipage and Abbyy seemed for a long time to have ignored Thai until a couple of years ago when additional language support for Abbyy FineReader Pro was introduced for Thai in version 9. Trusting the claims of accuracy I took the plunge and bought a copy - quite an investment, even with an educational discount.

I’m glad I did as the results are generally good, although its accuracy is inferior to that for languages based on Roman script. For someone like me who types Thai very slowly it is a useful start, but unless the lettering in the documents is very clear so that the accuracy is close to 100%, its utility will fall away for anyone who can type reasonably quickly and accurately.
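
To put a rough number on 'close to 100%', one option is to compare the raw OCR output with the version you have corrected by hand. The following is a minimal sketch using Python's standard difflib module; the function name and the definition of accuracy are my own assumptions for illustration, and this is not how FineReader itself measures or reports accuracy:

```python
from difflib import SequenceMatcher


def character_accuracy(ocr_text, corrected_text):
    """Fraction of the corrected text's characters matched, in order, by the OCR output."""
    if not corrected_text:
        return 1.0 if not ocr_text else 0.0
    matcher = SequenceMatcher(None, ocr_text, corrected_text, autojunk=False)
    # Sum the sizes of all matching blocks found between the two strings
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(corrected_text)


if __name__ == "__main__":
    corrected = "ส่วนหนึ่งของชาวพุทธในอังกฤษ"
    # Simulated OCR output with the mai ek tone marks dropped
    ocr_output = corrected.replace("\u0e48", "")
    print(round(character_accuracy(ocr_output, corrected), 3))
```

Since Thai vowel and tone marks count as characters in their own right, a page that looks nearly right at a glance can still score noticeably below 100% on a measure like this.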

(In case you are wondering, there have been efforts to recognise handwriting, but it’s a much harder task – I was interested to note, though, that a fairly recent paper, Maximization of Mutual Information for Offline Thai Handwriting Recognition, in IEEE Transactions on Pattern Analysis and Machine Intelligence, makes use of a toolkit that is primarily used for speech recognition research. It prompts the question of the relationship of Thai speech to writing. From my very rudimentary knowledge of Thai linguistics I gather that it has roots in Sanskrit, where the letters of the alphabet are placed according to where in the throat/mouth/lips they are formed. Thai reflects this ordering quite substantially, though not completely.)

Undertaking the OCR.

I think getting the best results is an art and it is worth persevering to make improvements. For all but a few cases with one or two small documents, the whole scanning workflow ought to be considered, as a successful process requires a good rhythm. Washington State Library has a useful checklist and there are some good tips on the OCR process provided by About.com. These cover physical aspects including the selection of the scanner itself, keeping it clean, the placement of the source document, the scan settings (resolution, colour contrast, expected language(s)), and how the scanned image is divided up for the actual process of scanning.

One particular aspect that many software packages provide is training. For text recognition this is basically the process of chopping up the scanned image into a sequence of glyphs (character elements) and assigning glyphs to character names – see e.g. Wikipedia for a detailed entry. As you feed in multiple samples and specify the assignments, it learns how particular characters should be interpreted. There’s a training tutorial for a software library called Gamera, which I found very helpful in explaining the concepts.

I’ve not yet used training, probably because I’ve been a bit lazy to make the effort to learn how to make it learn!

FineReader’s Thai OCR Performance and Correcting the OCR.

Here’s a sample of FineReader's output.

Thai OCR in Abbyy FineReader Pro 9

As you can see, it’s a long way from perfection! Here it obviously doesn’t handle the English. I actually set it to interpret everything as Thai – although I could have included English as an additional language, it seems to have a net effect of adversely affecting the Thai rendering, so since English is easy for me to recognise and type, I prefer to let it get that part wrong.

A Thai person might well be dissatisfied with the results, but overall I was quite happy given my very slow Thai typing speed. There were one or two characters that FineReader seemed to really struggle with, but correction was not difficult as the suggested match was often a character used here and not elsewhere – so I could do a ‘search and replace.’ More challenging was the handling of the small diacritical marks – in Thai they are all glyphs since they each contribute towards meaning, either as vowel sounds or tone marks. Instances where there are two such marks on a single letter are common and FineReader often struggled to pick out -่ ไม้เอก (mai ek) – it looks like a hyphen, but its placement varies a lot. If you look at the screenshot carefully, you can see that FineReader simply omits quite a few of these, perhaps because the original source document was not clear enough.
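
The 'search and replace' step and a quick check on the diacritical marks can both be scripted once the OCR output is saved as plain text. Below is a minimal sketch in Python; the function names and the sample correction (a Latin 'U' standing in for บ) are hypothetical, invented for illustration rather than taken from FineReader's actual mistakes:

```python
import unicodedata

MAI_EK = "\u0e48"  # Unicode name: THAI CHARACTER MAI EK


def apply_corrections(text, corrections):
    """Apply each wrong -> right replacement in turn across the whole text."""
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text


def count_thai_marks(text):
    """Count Thai nonspacing marks (vowels written above/below and tone marks)."""
    return sum(
        1
        for ch in text
        if "\u0e00" <= ch <= "\u0e7f" and unicodedata.category(ch) == "Mn"
    )


if __name__ == "__main__":
    print(unicodedata.name(MAI_EK))
    # Hypothetical systematic misreading: Latin 'U' where the page has บ
    print(apply_corrections("Uาน", {"U": "บ"}))
    line = "ส่วนหนึ่ง"
    print(line.count(MAI_EK), count_thai_marks(line))
```

Comparing the mark counts of a line before and after manual correction gives a rough idea of how many of these small glyphs the OCR dropped. The replacements are applied one after another, so a correction table is safest when no 'right' value contains another entry's 'wrong' value.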

Even if you train an OCR package, there will still be imperfections, so the output needs to be corrected. This process is tedious, but helpful – not least in learning to read! It helps you to familiarise yourself with the alphabet and especially pay attention to the way letters are formed.

If you have a large screen, particularly with widescreen dimensions, then it’s probably easiest to use the scanned image, set the zoom as needed, and place it next to the OCR’ed version that you’re editing.

Conclusion

Although a quick and perfect system is far away, a few OCR options are emerging that I find helpful in digitising printed Thai texts. Alternative suggestions are very welcome – I’m keen to improve what I’m doing, even though it’s already been quite an effort and I haven’t yet started talking about the translation itself...!