Translating technical documentation with Google Translate
Introduction
This post summarizes my insights from working my way through the translation of some technical documents, written in LaTeX, into Chinese, Japanese and Korean. The obvious first approach was to feed Google Translate with the pdf documents, but not only are the results ugly; there are also a lot of technical terms in the documents which are better left untranslated. Even worse, there are code examples explained in the text, file names, references to variable names and other elements that become meaningless if translated.
One of the advantages of having the document written in LaTeX to begin with is that the LaTeX text formatting commands effectively flag the parts that aren’t just plain language in the text, so it’s relatively easy to spot and protect them. But that alone was a long way from the finish line, as elaborated in this unexpectedly long post.
A different post discusses the technical aspects of talking with Google Cloud’s API as well as creating documents in these languages with LaTeX.
I also did something similar with translating web pages. For example, the translation of this post to Chinese, Japanese and Korean.
This post was written in the summer of 2022, and odds are that things will change dramatically over the course of time.
Is translation by a human better?
The short answer: Yes, as of 2023, human translation is much better. It’s mainly because there is no way to give the translating tool hints about the context. For example, the word “driver” could be a car driver or a term related to a computer. All translation tools just pick one meaning. Some tools allow choosing a specific dictionary, and there are ways to shape the behavior of the translator. But the results are far from satisfactory.
However, both options have their disadvantages: Working with a human necessarily requires trusting that a specific person will perform the job thoroughly, and well, that’s anything but taken for granted. It’s extremely difficult to verify that the work was done well, in particular when the document is technical, as it’s not possible to hand it to just anyone and ask if it’s well written. An automatic reverse translation will miss some poor translations (in particular poor translations of technical terms) and at the same time raise false alarms.
But the worst problem with human translation is that every future change in the text requires contacting the people who made the translation and asking them to make the adjustments. They may not be so willing to do that. So unless you employ these people full-time, it may be difficult to translate small edits.
Another problem with humans is that significant errors in the meaning of the text might occur. It’s easy to reverse or otherwise obscure the meaning of a sentence because of a simple human error. “Be sure not to turn on the power supply” can easily turn into “Be sure to turn on the power supply”. Automatic reverse translation can reveal this, but it’s easy to miss an error like this when the person who verifies the text already knows what it should say.
Automatic translation should be less likely to make a mistake of this sort, but the truth is that Google Translate, with all its Neural Network magic, turns out to be more human than desired in this matter: It’s not completely unusual that the meaning of the text changes, sometimes to the complete opposite.
It also has a variety of passive-aggressive behaviors, in particular ignoring whole sentences or part of them, mostly when the text becomes a bit rambling.
I had a case where the automatic translation ignored a “non-” prefix on a noun, and by doing so reversed the meaning of the sentence. I’ve also had a case where “must not” was translated into the equivalent of “doesn’t have to”.
The clear disadvantage of an automatic translation is poor expression and grammar. However, if the technique explained below is adopted, it’s possible to end up with a fairly good result, even if the language is a bit off at times.
But this disadvantage can be mitigated by letting someone who knows the target language well proofread the result. This person doesn’t need to know English well, only to be sensitive to the target language, so it’s easier to find someone for that job. In particular when translating to Asian languages, it’s easy to tell the people involved to ignore technical terms, as they are easily distinguishable, being written in Latin script.
The results of this proofreading session are only slight changes in word choice or ordering, and they can be verified against automatic translation as well as by another person. In fact, in most cases, the best way is to improve the wording in the original language, until the person checking the text confirms it sounds OK.
Whether it’s worth the effort and cost to make this language cleanup is an open question. It’s a matter of how much the target audience appreciates the fact that the documentation is available in their language vs. how badly the language defects come across.
Another issue with automatic translation is that words with more than one meaning can be mistranslated, in particular when the intended meaning is the less common one for a specific word (examples of this below). A back-translation doesn’t necessarily reveal a problem of this sort.
So with the possibility of having someone read through the translated text, the only remaining problem is when the meaning is changed unnoticed during the translation. Frankly speaking, I don’t know which option, human or machine, is better regarding this problem. The only real solution anyhow is to back-translate the text and read it through. Good luck with that.
General insights on automatic translation
Google Translate is based upon a neural network machine learning algorithm, which means that it’s chaotic by nature (in the scientific sense). That gives it a bit of a human touch, which surely makes the translations better, but also makes it quite unpredictable. In particular, it’s capable of making any possible mistake, no matter how pointless and unexpected. It’s impossible to be 100% sure that it won’t do this or that, and it’s not even a bug when a phrase in the original text just disappears, or when a meaningless string of characters gets translated to something else, also completely meaningless. Those small glitches are part of the game, and it makes automated processing of the translated text quite challenging.
Having said that, the general rule is that if Google Translate does weird things, it’s because it was fed with something it found hard to digest. So even if the weirdness doesn’t appear to be related to language, the best way to rectify this is to change the original text into a simpler, more common way to express the same idea. Unfortunately, this calls for dull, play-it-safe English. On the other hand, it ends up with far fewer silly typos and grammar mistakes.
If I were to speculate how Google Translate’s algorithm works, I would say something like this: Attempt to find recognizable words in the sentence, fix spelling mistakes (“did-you-mean” style) and try to match the words that are recognized with a known phrase from the huge training corpus. Pick the known translation into the desired language of the word pattern that fits best. Fill in the words that were unknown in the original language in the translated text in their natural positions — these are treated as names (of persons, places etc.).
Punctuation like full stops and commas, as well as enclosure in parentheses, makes the translator treat each part separately, more or less.
The destination language matters a lot regarding the interpretation of the meaning of the text. It doesn’t seem like the question is “what does the original text mean” but “which language pattern was most common in the training data for translating into language X”.
The main takeaways from this speculation, regarding how to write for translation, are:
- Use common expressions and language patterns. In particular, use the most commonly used words to express a certain meaning.
- Be super-careful with trivial spelling mistakes, as they break the statistics for the desired language pattern.
- If the translation into one language is successful, in the sense that the original meaning was “understood”, it doesn’t necessarily mean it will be as successful into another one. The same goes for failure. It seems to depend on what the translations between the two languages are usually used for. In other words, judging by the results, it seems like translations into Hebrew are used more for everyday text, but translations into east Asian languages are more for technical documents. Hence the selection of meaning tends to be more technical with the latter.
- As there is no analysis of the semantics of the original sentence, anything can happen, including a translation that says the opposite of the original.
Interestingly enough, I’m under the impression that the translation with Google Lens is much better than the cloud-based translation service. In particular, the cloud translation is more likely to produce nonsense translations because of small disturbances in the middle of the text, where Google Lens’ translation seems to have extra wisdom to overcome such disturbances.
Translating to a foreign language
How do you know a translation is OK, when you don’t know the language it’s translated into? The short answer is that one can’t really know. It helps a lot knowing another language, even if it’s different from the target language, because it allows spotting misinterpretations of certain words, in particular technical ones. But often a word is poorly translated into one language, but fine with another.
There’s the possibility to translate it back to English, but that doesn’t always spot problems. Technical words like “clock”, “bus”, “sink”, “assertion” are translated to ridiculous words in Hebrew, for example, but the translation back looks OK in English. In particular, a word like “sink” translates into the word in Hebrew that means kitchen sink, and then goes back to the correct word in English, of course.
But then comes the question: Why translate these words at all?
Quality of translation
Among the three target languages, the translation to (simplified) Chinese is the best by far. Probably because the natural flow of Chinese is the closest to western languages. The runner-up is Korean, and the worst is Japanese.
The worst problem with both Korean and Japanese is that parts of the original text can just disappear. This happens often when the semantic structure gets too complicated, or if there’s no normal way to say something in Japanese. For example, the sentence “you’re absolutely welcome to mess up completely, the tools won’t stand in your way” lost the entire first part in Japanese. So it just says “no tools get in the way”. If only the first part is translated separately, it turns into “completely ruined is welcome” (it had to give me something back when that sentence stood alone).
So short and plainly informative sentences are best translated into Japanese and Korean. Chinese seems to work with anything.
As for words like “it”, Chinese tolerates that best too. The two other languages are more likely to need repeating the word that “it” refers to, and hence possibly pick the wrong word to repeat.
Testing by translating to Hebrew
Since I happen to speak Hebrew fluently, I checked the translation to Hebrew of all documents, not for the purpose of publishing this translation, but because I soon found out that Google Translate struggles with Hebrew. So the idea was that if it’s OK in Hebrew, it’s probably OK in any language.
To illustrate this, here are two cases where the translation went wrong, as indicated by the result in Hebrew.
The first sentence that failed was “Close the window after a successful generation”. The problem was that the word “generation” was interpreted as the relationship between age groups, and not derived from the word “generate” as intended. This, in itself, is easily fixed by changing it into “Close the window after a successful generation of the file”. It was a matter of fitting the entire sentence into a different pattern of words.
Surprisingly enough, the translation into Chinese, Japanese and Korean was correct even without the fix. This can be verified by looking at the translation back to English, and isolating the word or couple of words of interest.
The next problematic phrase was “The non-X81X are ignored by almost all X82X computers”. In the translation to Hebrew, the “non-” part was ignored, so the sentence’s meaning was reversed. Once again, the translation into the three other languages was correct (the X81X thing is explained below).
So if I once had the speculation that the machine translates the words into an intermediate format that somehow contains the meaning, and takes it into the target language from there, it’s definitely not the case. Whether there’s a misunderstanding or not in the translation depends on the target language.
I’m optimistic and hope that Hebrew is particularly prone to errors, so if I clean up the translation to Hebrew, it will hopefully work with other languages. Odds are, however, that each language has its own pitfalls, even though it really seems like the translation to Hebrew from any language is particularly bad, including changing the meaning of the text. Also, I’ve found that plain idioms like “it doesn’t hurt” are often translated horribly to Hebrew but come out perfectly OK in CJK languages. But then, I don’t know about misses in CJK languages that were OK in Hebrew…? And yet, after checking numerous expressions (“bite back”, “copy-paste” and a lot of this sort) it really seems like Hebrew is badly off.
One way or another, the only sure benefit of checking the translation to Hebrew is that it does, after all, remove some ambiguities, whether that is necessary or not. Actually, I found tons of plain typos by looking at this translation, so that alone justifies this phase. It’s difficult to proofread text exactly as it was written, but reading it again in another language feels as if someone else wrote it.
I also had the opportunity to have a translation into Japanese by a helpful person, and it was quite clear that the problems were in the places where the Hebrew translation also limped.
Hands-on insights
After quite some back and forth, I learned that the best way to work with Google Translate on text is to feed it with paragraphs of text in HTML, enclosed in <p> (or <hN>) tags. Plain formatting tags are fine (<b>, <i> and even <span> etc.) but it’s important not to insert anything that breaks the continuity of the sentences: No <br> or <img> tags in the middle, or anything else that isn’t expected in the middle of a sentence. Such a break makes Google Translate translate the parts before and after it as separate sentences, and that’s a disaster.
Newlines are ignored in the cloud interface with HTML, as they should be. This is contrary to the web interface of Google Translate, which is extremely sensitive to newlines, so copy-pasting a chunk of text from a pdf document can result in a horrible translation, because there are newlines between the rows of the original text, which make the translator treat each row as a separate phrase.
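For the record, this is roughly what such a request looks like with the Python client for the cloud API (a minimal sketch, assuming the google-cloud-translate package and valid credentials; my actual pipeline was a Perl script, and the paragraph content here is made up):

    from google.cloud import translate_v2 as translate

    client = translate.Client()

    # One paragraph per <p>, with an id for mapping the result back to the source.
    chunk = '<p id="par-42">The <b>X7X</b> must be loaded before the X8X starts.</p>'

    result = client.translate(
        chunk,
        source_language="en",
        target_language="ko",
        format_="html",  # HTML mode: tags survive, newlines are ignored
    )
    print(result["translatedText"])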
But the real difficulty is the fact that the translated text is technical. Google Translate is trained with mainly non-technical text (I guess), so its interpretation of technical terms that happen to also have a non-technical meaning is naturally inclined towards the non-technical meaning. Words like “driver”, “compile”, “assert” and “board” are not only likely to be translated incorrectly, but also stir a mess in that imaginary brain that holds all those neurons, resulting in a completely unintelligible translation.
The problematic words are those that have a possible non-technical meaning. The word “boot” could mean a piece of footwear, and to boot a computer could be mistaken for “to give the computer the boot”, but to reboot a computer could only mean one thing. So it’s not so much about the word being technical, as about whether it could be remotely confusing.
Other ambiguities occur with words like “target”. Using it in any form, i.e. “to target” or “targeting” as well as “targeted” as in “depending on which software version is targeted” leads to a completely wonky translation, at least into Hebrew.
Surprisingly enough, it copes quite well with sentences that contain untranslatable items. I guess it treats anything it can’t handle as a name. Since it’s supposed to be able to translate “Joseph prefers Paris over Berlin”, it works fine with “X prefers Y over Z” as well. So the trick is to remove all technical terms from the text, and replace them with something that Google Translate will treat as a name, something it can’t translate. And then return those words into the translated text.
This means that all technical terms remain in English in the translated text, which is perfectly fine, because a technical reader is expected to know these terms. It’s the blah-blah part that needs translation, and with the technical words out of the way, Google Translate does a good job on that.
The problem that remains is how to feed the translator with these untranslatable X, Y and Z placeholders, when there can be thousands of these, and they must all be left intact in the translation (well, except for Russian and Greek, see below). The section below on placeholders tells the full story, but the spoiler is that I used X0X, X1X, X2X, and so on. It’s not watertight, but it works best. I tried quite a few options.
The main thing to keep in mind is that it’s all about word patterns: If Google Translate recognizes the structure of the sentence, based upon words that are commonly used together for a certain meaning, it translates that part correctly, and then puts the placeholders in the right places, treating them as names.
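To make this concrete, here is a minimal sketch of the replace-and-restore idea in Python (the function names and glossary are my own inventions; whitespace handling is simplified here and discussed in detail further below):

    import re

    def protect(text, glossary):
        """Swap glossary terms for opaque XnX markers before translation."""
        stash = {}
        def hide(match):
            key = "X%dX" % len(stash)
            stash[key] = match.group(0)
            return " " + key + " "  # whitespace on both sides of the marker
        # Longest terms first, so "user-space application" beats "application".
        pattern = "|".join(re.escape(t)
                           for t in sorted(glossary, key=len, reverse=True))
        protected = re.sub(pattern, hide, text)
        return re.sub(r" {2,}", " ", protected), stash

    def restore(translated, stash):
        """Put the stashed terms back. Unknown markers are left in place,
        so a renumbered marker is easy to spot by eye."""
        return re.sub(r"X\d+X",
                      lambda m: stash.get(m.group(0), m.group(0)),
                      translated)

For example, protecting “Load the kernel driver before you run the compiler.” with the glossary terms “kernel driver” and “compiler” yields “Load the X0X before you run the X1X .”, plus a stash for putting the terms back after the translation returns.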
I should mention that Google Translate offers a “notranslate” style, which can be used to enclose e.g. <span> segments of text that shouldn’t be translated. I didn’t attempt using it, in particular as people out there on the web have complained that it disrupts the translation exactly like that. Another problem is that chunks that shouldn’t be translated often have a different formatting (e.g. Courier font for variable names), and Google Translate tends to behave in an unpredictable manner, making it difficult to rely on its output for feeding LaTeX directly.
Also worth mentioning is that Google offers an advanced version of the translation API, with the ability to train the learning machine and possibly feed it with specific word translations, but that would require knowing the correct term in the target language. How do you say “compile” in Chinese or Japanese? It could have been helpful, though, for solving the problem with verbs that have a technical meaning (“compile”, “boot”, “implement”, “overflow”, you name it).
How I actually did it
The idea was to extract translatable text from the LaTeX source, and feed Google Translate’s cloud API with it in HTML mode. Then take the translated text and implant it back into the LaTeX doc.
The overall goal is to feed Google Translate with hollow phrases, albeit with a solid and common semantic structure, of the form “X with Y is more important than Z”. This makes it easy for the translator to detect the structure of the phrase, and translate it into a meaningful sentence in the foreign language. That gives good odds for a meaningful sentence when the placeholders are replaced with the actual technical words in the translated phrase.
In more detail:
- Fetch paragraphs of text and enclose them in <p>, <h1>, <h2> or <h3> tags. Each of these tags has a unique “id” attribute, so when the translation returns, it’s possible to track which text segments should be written back to which part of the LaTeX file. This is where HTML mode came in handy. I haven’t had a single case of these attributes being messed up (yet?).
- Turn some LaTeX formatting into plain HTML tags, e.g. <b>, <i> etc. Then do the opposite when implanting the text back. The advantage is that this doesn’t break Google Translate’s view of the text as a contiguous sentence. Once again, HTML mode is required for this stunt.
- Anything that shouldn’t be translated — technical terms, references to variables, file names, references to paragraphs, labels etc. — is replaced with a unique identifier (“placeholder”) that Google Translate doesn’t attempt to translate. The major caveat with this method is that it works only with nouns. This requires rewording, in particular turning verbs into nouns (e.g. “perform a compilation” instead of “compile”). More on this below.
Note that some parts of the LaTeX document are completely out of this game, as they aren’t even given to the translator to look at. For example, verbatim environment chunks, and even the newlines between the text paragraphs. They remain the same because they aren’t overwritten when the translated text is transformed back and implanted in the relevant segment.
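As a toy illustration of the id-tagged round trip (hypothetical helper functions; the real script handled far more LaTeX markup, such as mapping \textbf{} to <b>):

    import html
    import re

    def paragraphs_to_html(paragraphs):
        """paragraphs: list of (unique_id, text) pairs pulled from the LaTeX source."""
        return "\n".join('<p id="%s">%s</p>' % (pid, html.escape(text))
                         for pid, text in paragraphs)

    def html_to_paragraphs(translated_html):
        """Map translated text back to its id, for implanting into the LaTeX doc."""
        pairs = re.findall(r'<p id="([^"]+)">(.*?)</p>', translated_html, re.S)
        return {pid: html.unescape(text) for pid, text in pairs}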
Workflow
I wrote a Perl script for the back-and-forth manipulations between LaTeX and HTML, but I won’t get into that too much, because it’s complicated and really specific to the documents at hand. Among other things, this script loaded a list of words that are always replaced with placeholders, and I also added a LaTeX command, \notranslate{}, which just leaves the content as is when interpreted by LaTeX, but tells the script that the entire chunk should be replaced with a placeholder as well.
Writing scripts and all that is nice, but there’s still some manual preparation required. So this was the procedure I adopted:
- Run the script that creates the HTML file that is sent to Google Translate. View that file with a web browser, and look for words that are technical and can be mistranslated. When such are found, either add the word or phrase to the list of words to automatically replace with placeholders, or fix it specifically with \notranslate{} LaTeX statements.
- In fact, I also wrote a script that puts \notranslate{} on certain words and patterns (e.g. sets of upper case characters); I ran this script, and then verified each such occurrence (a sketch of this idea appears after this list). This is faster than finding them manually, and is useful for words that may have a non-technical meaning, or otherwise require human attention to get 100% right. For example, the word “image” should be translated when it refers to a picture in the text, but not when it’s an image of a disk.
- Go through the text manually, and apply the guidelines listed below (the do’s and don’ts).
- Translate the text into Hebrew, and read through the result. If something ends up unclear, fix it. The further the language is from English, the better. The one sure benefit of this check is that small typos are spotted (e.g. “in” instead of “is”) because the translation gets weird. The fact that the order of words changes in the translation also helps spotting ambiguities, which are often solved with words like “which is” or punctuation.
- Translate into the target language. Make the necessary fixes. Don’t bother to find out why certain placeholders are missing in the translation. Rather, look at the original text and try to figure out why it was difficult to translate, and fix that instead. Sometimes a missing placeholder is due to a whole sentence being dropped off, in particular with Korean. It’s as if the algorithm said “I have no idea how to reorganize this sentence into something that makes sense in Korean, so I’ll just skip it”.
- Maybe attempt to translate the document back as a pdf file (with Google Translate’s web interface) or use Google Lens’ translation feature for a more sporadic check. I’m not sure if this is worth the time.
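As for the auto-tagging script mentioned above, the idea boils down to something like this sketch (the pattern is just an example of the kind of patterns I used, and every hit still needs a human eye, as said):

    import re

    # Naive heuristic: a run of two or more capitals/digits (FPGA, UART, X86)
    # is probably a technical token. Wrap it for later manual review.
    CANDIDATE = re.compile(r"\b[A-Z][A-Z0-9]+\b")

    def tag_candidates(latex_source):
        # Run this on the LaTeX source before placeholders are assigned,
        # or the XnX markers themselves would get wrapped too.
        return CANDIDATE.sub(lambda m: r"\notranslate{" + m.group(0) + "}",
                             latex_source)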
The order of translation is Korean first, then Japanese and finally Chinese. This is because the translation to Korean is the most troublesome, but fixing its problems often consists of changes that are likely to benefit the other translations as well.
All in all, it appears like using placeholders instead of technical terms actually improved the translation regardless of these specific words. It seems like these words confused the translation machinery, which made it create obscure phrasing. With the technical words out of the way, inserted as opaque symbols, it seems like Google Translate managed much better to handle the rest, which now consisted of commonly spoken language.
So my ultimate approach was to put placeholders instead of virtually all technical terms that are nouns. That’s quite a lot of them, and the translated documents ended up full of terms in English. I’m not sure what Chinese readers are going to think about this, but if they have the same problem as in Hebrew — weird “official words” for technical terms — it’s going to be just fine.
The do’s and don’ts
Based upon quite some trial and error, these are the guidelines I ended up with for producing text with placeholders that translates well.
- The text should consist of hollow sentences like “If the X113X process fails, the X641X’s output may supply hints on what went wrong. The X102X application on the computer should be configured for X114X, no X115X and no X116X ( X640X )”. However, sentences like “The X219X for X220X on X221X or X222X is part of the majority of X223X and distributions, as explained below” should be fixed by inserting some meaningful words between those four placeholders, which have just one word between each. In this example, it’s not clear whether the alternative after the last “or” applies to X221X alone or to all three placeholders before it. If the translation requires word reordering, this will obscure the meaning of the sentence.
- Use punctuation (in particular commas) and parentheses to chop up long sentences into segments. This prevents ambiguity. In particular, text inside parentheses is translated into parentheses, so this is a good tool for breaking up long and complicated sentences.
- Try to keep phrases short and concise (and somewhat boring), partly because sentences are short in the target languages. If the sentence is long, try to mitigate the damage with punctuation.
- Use plain and explicit English. Don’t leave out “which”, “that” and all those words that explicitly define the role of each word. Even a simple expression like “for the curious” can go wrong, but works perfectly well when changed into “for those who are curious”. Yuck, but translates well.
- Avoid words that refer back to something earlier in the sentence, unless it’s very obvious. In particular, the word “it” is often replaced with the word it’s supposed to refer to during the translation, and sometimes the wrong word is chosen in the translation. When this happens, the translation explicitly changes the meaning. Because the translation into CJK languages often involves splitting a long sentence into shorter ones, without a possibility to use a word like “it”, implicit references of this sort easily translate into nonsense. To make things worse, the back-translation may bring back the “it”, so there’s no way to spot the mistaken translation. There are cases where these duplications are safe, for example expressions like “one thing to another” (which is often translated into “one thing to another thing”).
- Prefer “the red book and the blue book” over “the red and blue books”. The order of the words may be changed during the translation, and in that case, it’s possible that only the “blue books” is moved to the new position in the sentence, so the result is rubbish. These overly explicit sentences are less awkward to read than they are to write, but they are nevertheless slightly awkward, as the same word is repeated over and over again.
- Avoid idioms. Even the simplest ones, like “out of the box”, may or may not translate into something that makes sense. Because of the statistical nature of the translations, idioms might get translated with the right spirit into a certain language, and fail completely with another. So dull language it is.
- Avoid verbs in passive form, in particular if it comes with a “by”. Passive form is useful for not naming the doer, but if it’s named anyhow, use the active form. A lot of times, the passive form, and the tangled sentences that it usually creates, were the reason for problems in translation.
- Use possessive form for clarification. For example, if the word “register” is replaced with a placeholder, “register modification” should change to “modification of registers” or likewise “registers’ modification”. Using the ’s suffix works great, so use it as long as it doesn’t create an ambiguity about who the owner is.
- In fact, there’s no problem at all with segments like “X400X’s X401X”, as possessive form. This translates well, surprisingly enough.
- Don’t replace partial expressions with placeholders. For example, in the expression “the user-space application”, don’t replace just “user-space”, but rather “user-space application”. Word ordering might be different in another language, which can at worst lead to a complete disassociation between the placeholder and its related word in English, with a completely unclear result.
- Avoid replacement of parts of expressions with placeholders. For example, in “VGA port”, if only “VGA” is replaced, it’s not certain that this will translate fine. “VGA-port” increases the chances. If it’s a common language pattern, e.g. “VGA-based”, there’s a good chance for a proper translation. Same goes with “the X500X file”, because it’s a common language pattern.
- Don’t use “non-” as a prefix. It’s often missed, reversing the meaning.
- Look out for ambiguous words. For example, the word “signals” could be the verb (to signal) but also the plural of the noun. Avoid less common uses of words, such as “writes” to say several write operations, and use “write operations” instead.
- Be extra careful with trivial spelling mistakes and typos, in particular mixing “is” with “it” and such. These are overlooked when reading the text in English, but they kill the translation, sometimes by changing the meaning significantly, and sometimes by just confusing the translation algorithm into producing something weird.
- Bonus: Check all duplication of placeholders, and verify that the correct one is duplicated. Because these duplications are usually the result of a word that refers back to something (“which”, “that”, “it” etc.), it’s a good idea to verify that the reference goes to the correct placeholder. In theory, this should be done with all uses of back referencing, but that means proofreading the entire text. So with placeholders it’s less work (and less gain). Having run through a checkup of my own translations, I’d say about 10% of these duplications garble the meaning, by explicitly duplicating the wrong word.
Caching translation results?
Since the document is chopped into paragraphs, each within a <p> enclosure, does it matter if each is sent separately or if all are sent in one API transaction as a concatenated string? Does it matter if the translator sees the entire text?
Because if each <p> enclosure is treated separately, it’s possible to cache the pieces of text that have already been translated.
Caching is more than just a money saver. It allows making manual changes in Google Translate’s output (in particular if it messed up the placeholders) and then not having to repeat this every time the entire document is translated.
Even more important, avoiding the repeated translation of parts that have already been translated means avoiding the possible mishaps that may suddenly occur (like suddenly dropping a sentence). Think about making a small change, and then the translation fails on something completely different. But it worked last time!
This is also important if there’s feedback from readers that corrects a poor translation at a specific place. So caching is very helpful for the incremental kind of work that is necessary to maintain the document in the long run.
So I tried this with translating from English to Hebrew, and a bit with Chinese as well (by looking at the translation back to English). As it turns out, there are occasional differences between the translation of an isolated paragraph and that made with a context. But it doesn’t seem like an intelligent use of the context. Comparing the results, the isolated translation was sometimes better, sometimes worse, with a very slight difference in most cases. So it looks more like the algorithm randomly picked another wording, for no apparent reason. It was usually exchanging equally valid synonyms, or choosing to translate the name “Linux” to Hebrew or not.
Another observation I made is that the use of context is poor. For example, the word “call” is translated to the word in Hebrew that means a phone call, but “function call” is translated correctly. So what if there’s a sentence saying something about a “function call”, and a sentence afterwards uses the word “caller”? In the same <p> enclosure, that is. Well, the translation of “caller” still relates to a phone call. The neural network clearly didn’t learn anything from the first sentence.
So it makes perfect sense to cache translations at a paragraph level. If the original document changes, request a translation only on the enclosure that actually changed.
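A sketch of such a cache, keyed by the paragraph’s content and the target language (the file name and layout are made up for the example):

    import hashlib
    import json
    import os

    CACHE_FILE = "translation_cache.json"

    def cached_translate(paragraphs, target, translate_fn):
        """paragraphs: list of (id, text). translate_fn(text, target) hits the API.
        Only paragraphs whose text changed since the last run are sent out."""
        cache = {}
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as f:
                cache = json.load(f)
        result = {}
        for pid, text in paragraphs:
            key = target + ":" + hashlib.sha256(text.encode("utf-8")).hexdigest()
            if key not in cache:          # new or edited paragraph
                cache[key] = translate_fn(text, target)
            result[pid] = cache[key]      # manual fixes in the cache file stick
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f, ensure_ascii=False, indent=1)
        return result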
Finding the right kind of placeholder
This is a long explanation of why I ended up with the XnX placeholders. I would skip this part if I were you.
As mentioned above, the main problem with translating a technical document is that some technical terms are translated in an unhelpful, sometimes ridiculous way, and that they confuse the translation algorithm. As the reader of the document is most likely familiar with the English term, it’s safer to leave these words as is. The problem is how to insert these terms in a way that ensures they don’t get translated, and at the same time retain their position in the context.
As it turned out, the main problem with inserting an untranslated chunk into the text is that it may disrupt the translation, in particular as Google Translate tends to treat the part before and after the chunk as separate sentences, which results in a poor translation that misses the point of the sentence.
I began with adding markers in plain text (like <%103%>, [^29^] and ^^26^^), however Google Translate inserted a space in the middle of some of these (so it turned out to be e.g. “< %103%>”) and also threw in some markups where they shouldn’t be. A complete disaster, in short. This could have worked with non-HTML translation, but well, it didn’t work.
Another attempt was to use translation of HTML, with <b id="n23">P</b> markers as placeholders. The id allowed identifying which placeholder to insert, and the “P” gave the translator something to consider as a word. This failed as well, in many ways: The fact that the “P” part sometimes got translated into “PP” (why on earth?) didn’t matter much, because it’s not really important. The real problem was that at times other words were inserted into the <b> enclosure as well (for no apparent reason). Even worse, sometimes a completely different word, somewhere else in the sentence, got into a <b> enclosure with the same id. So processing this would have been complicated.
Another thing I tried was to use <var>n</var> enclosures, where n is the number of the placeholder. That failed, partly because some of these disappeared for no clear reason, and others were manipulated (for example, characters from previously outside the enclosure went into it).
To ensure that the placeholder is fully opaque, I tried <img id=n23>. The clear advantage was that Google Translate didn’t duplicate these nor modify them, but they broke the sentence into fragments. Google Translate assumed that no sentence would have an image in the middle of it.
So if not images, what about emoticons? Or even better, I made an attempt to use the Unicode range U+13000 to U+1342e (Egyptian Hieroglyphs) as placeholders instead of <img> markups. The idea was that Google Translate would have to pass them through as is, and that they would be considered to be names. In order to make this work, there had to be a whitespace on both sides of the Hieroglyph, but even with that, Google Translate would mess up and occasionally add completely unrelated characters instead.
In the end, I went for inserting words like X0X, X1X, X2X, and so forth. These remain intact through translation, however they are occasionally duplicated, in particular with sentences like “that is possible with X, which is the best option”, which can turn into “that is possible with X, and X is the best option”. The word “it” is also sometimes translated into the placeholder instead. But that’s actually a correct translation, and it’s easy to process. Even though this worked almost flawlessly, there were occasional surprises, including rare cases where Google Translate changed the number between the Xs, without my being able to figure out why, or why that specific change. So there’s always a certain amount of manual cleanup after the translation.
These duplications are common with east Asian languages, and usually occur when a long sentence is chopped into several shorter ones. In these languages, it’s more common to repeat the word than to use “it”, “which” and such.
When translating to Russian and Greek, the “X” character was occasionally replaced with the Russian capital letter “Ha” (Unicode U+0425) or the Greek capital letter “Chi” (Unicode U+03A7). Both look exactly like an “X”, so the replacement is understandable. Once this issue is known, it’s quite easy to handle, so it’s not a big deal.
As for the quality of the translation, this worked well, and Google Translate combined these placeholders nicely into the translation, even when changing the word ordering was necessary. However, this works only when the placeholder is used as a noun. So it doesn’t solve the problem with verbs like “assert” or “raise”. In some cases, a word like “overflow”, used as a verb, can be replaced with something like “cause an overflow”, so it can be translated properly.
Another thing with these XnX placeholders is that there must be a whitespace on either side, or Google Translate gets confused. To ensure that the placeholder is restored properly, the strategy was to include any surrounding whitespace in the string that was stored for replacing the placeholder later on, and then add a whitespace on either side of the XnX string. When reverting the process, all whitespace around the XnX string was removed before restoring the original string. This results in a perfectly consistent back-and-forth, even if the translator adds or removes whitespace (which happens a lot).
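In code, the restore side of this strategy might look like the following sketch, which also folds back the look-alike Cyrillic and Greek letters mentioned above (the stash layout is my own):

    import re

    # The translator occasionally swaps "X" for the identical-looking Cyrillic
    # letter (U+0425) or Greek letter (U+03A7); fold those back first.
    HOMOGLYPHS = str.maketrans({"\u0425": "X", "\u03a7": "X"})

    # Absorb whatever whitespace the translator put (or left) around the marker.
    MARKER = re.compile(r"\s*X(\d+)X\s*")

    def restore(translated, stash):
        """stash[n] holds the original chunk *with* its original whitespace,
        so the round trip is consistent no matter what the translator did.
        A KeyError here means a marker was renumbered or invented."""
        return MARKER.sub(lambda m: stash[int(m.group(1))],
                          translated.translate(HOMOGLYPHS))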
As a side note, Google charges for all characters, even those not translated. Hence it’s a good idea to keep the placeholders short. Not a big deal, but still.
Sanity checks on placeholders
The natural expectation is that any placeholder in the text for translation will result in a single placeholder in the translation. I’ve already mentioned above that some placeholders turned into two in the translated text, and it was actually correct. But what if the placeholder disappears?
The answer is that it’s always an error, and it has to be fixed manually. In fact, it’s often an indication that something worse happened, which would have been left unspotted had it not been for the missing placeholder. Sometimes the number between the Xs is changed arbitrarily, but it happens in conjunction with other placeholders in the vicinity being messed up.
Sometimes the absent placeholder was the result of a part of a sentence that was completely eliminated. The small piece of information it contained was simply absent in the translation. This can happen for several reasons, but the most recurring one seems to be when it’s not clear what “which” or “that” refers to, earlier in the same sentence. One can get away with that in translations to European languages, but because the sentence is built differently in east Asian languages, the translator is forced to make a pick. So instead of doing that, it just eliminates the part it can’t decide upon. A neural network algorithm showing a bit of human behavior, I would say.
It also seems that a colon sign (‘:’) tends to eliminate what comes immediately after it, fully or partly. Changing it to a full stop often returned chunks of text from the dead in Korean and Japanese. Another option is splitting the text, so that the part after the colon is in a separate enclosure (note to self: possibly with a \skipthis{}).
Same thing with a sentence starting with “likewise”.
Another somewhat weird phenomenon with Korean and Japanese is that a whole sentence was sometimes dropped. The really weird thing was that when the same sentence was put in a separate <p> enclosure, it was translated properly. So it was like Google Translate said “nah, this is too much rubbish, I’ll drop the last sentence”.
So in this sense, the placeholders help spot other problems with the translation. I got an error of this sort for every few thousand translated words, which practically means a bit of fixing for each document. What’s really worrying is how many sentences without any placeholders have vanished unnoticed.
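A check along these lines is easy to automate. Here is a sketch; since duplicated placeholders are legitimate, it compares sets rather than counts (verifying that the right placeholder was duplicated still needs a human, as mentioned in the do’s and don’ts):

    import re

    MARKER = re.compile(r"X\d+X")

    def check_placeholders(source, translated):
        """Return (missing, unexpected) markers. A missing marker usually means
        a dropped clause; an unexpected one usually means the translator
        changed the number between the Xs."""
        before = set(MARKER.findall(source))
        after = set(MARKER.findall(translated))
        return sorted(before - after), sorted(after - before)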
Placeholders that contain a word in plural
One problem that is inevitable with placeholders is that the information on the word’s plural vs. singular form is hidden away from the translator. So if the word that is hidden is “compilers”, the surrounding text in the translation might refer to it in singular, and that makes the sentence sound a bit off.
In some cases, the translator can deduce it from the surrounding words (e.g. if “is” or “are” is used in reference to it), but sometimes there are no hints. Luckily, the plural-singular thing isn’t very present in Chinese, Japanese and Korean, so the effect of this ambiguity is expected to be small. Try, for example, to translate and back-translate “He gave me the books” with these languages, and you get “he gave me a book” — the indication of plural is lost. But there’s also a flip side to this: The fact that the original word in English appears in its plural form will probably feel odd to an East Asian reader. I’m not sure about this, but it appears like they would use the English word in singular form anyhow, even if it refers to several pieces of whatever it is. So any use of plural will probably feel wrong to them.
Surprisingly, this can be fixed by using a placeholder like X205Xs (with the “s” in the end). This appears to be translated correctly into plural, and even the possessive form (e.g. X205Xs’) seems to work well into Hebrew.
But this hack creates a new problem: The translation might add suffixes and other grammatical elements to mark the plural form of the hidden word. If this happens, it will create a double plural. In German, for example, there are many ways to go from singular to plural, so this extra “s” just remains when it comes after an XnX placeholder. If it isn’t removed, the result is “compilerss” (with a double “s” at the end). In Norwegian, it may add “-er” for plural (with the dash).
OK, so remove anything alphanumeric that comes after a placeholder, so that if the “s” remains, it’s gone? That may not work well either. For example, the possessive form in Swedish is expressed with a “:s” suffix and “:n” in Finnish (at least on a placeholder), so removing suffixes blindly takes its toll as well.
So even though it’s appealing, the “s” method won’t work as a clean way to hint that the word is plural, in particular because the placeholder might get conjugated into plural in the translation. And there’s no catch-all solution for getting rid of this possible conjugation.
Given that the problem with plural is a relatively minor nuisance, which occurs only when the context doesn’t reveal that it’s plural, it’s not worth the risk of adding garbage characters, or mistakenly removing relevant conjugation characters.
On the wishlist: The possibility to tell the translator that a blob is a noun in plural. Actually, wouldn’t it be nice to be able to do that with verbs as well, saying which tense and person?
Placeholders and Korean particles
In English, we have this thing that we say “a book” and “an orange”. The choice of the indefinite article, “a” or “an”, depends on whether the word that comes after it starts with a vowel or consonant sound.
In Korean, there are particles that are added after a noun to mark whether it’s the subject, the topic or the object of the sentence. The particle is chosen according to whether the preceding word ends with a consonant or a vowel:
- Topic particles: 은 or 는 (eun or neun)
- Subject particles: 이 or 가 (i or ga)
- Object particles: 을 or 를 (eul or leul)
Not surprisingly, the particles that come after a vowel begin with a consonant, so there’s always a consonant in the game. Same principle as the English indefinite article.
And here’s the crux: When a placeholder is used instead of a noun, Google Translate gets XnX instead of the real word, so the particle is chosen according to the “word” at hand.
So “I read the book” is translated by Google to 난 책을 읽는다 (book is 책, chaeg, ends with a consonant, hence the choice of the object particle 을, eul). But if “book” is replaced with “X10X”, we get 나는 X10X를 읽었다. “X” sounds like “eksae” in Korean, so it ends with a vowel, hence the 를 particle was used. (The word that means “I” changed from 난 to 나는, but the former is just a contraction of the latter, so it’s like “I’m” vs. “I am”)
This can be fixed automatically by looking for these particles: They always come immediately after a placeholder, with a whitespace after them. The tricky part is to identify whether the replaced word ends with a consonant or a vowel, the way it’s pronounced in Korean (which may be different from the English pronunciation?).
The possessive particle, 의, as well as several other particles are indifferent to this matter.
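Here is a sketch of such a fix-up, applied while the placeholders are restored. The consonant test is a crude spelling-based guess and the weakest link; real code would need pronunciation data for the hidden words:

    import re

    # For each particle: (form used after a final consonant, form after a vowel).
    PAIRS = {"은": ("은", "는"), "는": ("은", "는"),
             "이": ("이", "가"), "가": ("이", "가"),
             "을": ("을", "를"), "를": ("을", "를")}

    # A particle directly after a marker, followed by whitespace or end of text.
    MARKED = re.compile(r"X(\d+)X([은는이가을를])(?=\s|$)")

    def ends_in_consonant(word):
        # Crude guess from spelling; a word like "cache" would fool it.
        return word[-1].lower() not in "aeiou"

    def restore_with_particles(translated, stash):
        def fix(match):
            word = stash[int(match.group(1))]
            after_consonant, after_vowel = PAIRS[match.group(2)]
            return word + (after_consonant if ends_in_consonant(word)
                           else after_vowel)
        return MARKED.sub(fix, translated)

Markers without a trailing particle would still go through the plain restore shown earlier.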
It doesn’t seem like there’s a similar problem with Japanese or Chinese, but I reached that conclusion based upon not finding anything related in a Google search. I would be really surprised if there were anything like this in Chinese, because its script is generally unrelated to pronunciation. But with Japanese, I’m not that sure.
Maybe use a word in the target language?
I haven’t experimented a lot with this option, but maybe it will work: If a text is translated into Hebrew, and there is a Hebrew word in the middle of the text, it’s used correctly in the translation. So for example, “I ran back to בית quickly” (בית means “house”) is translated to “רצתי בחזרה לבית במהירות” (“I ran back to the house quickly”). This isn’t perfect (הביתה, “homeward”, would have been better), but it shows that a word in Hebrew is conjugated slightly and correctly.
So this opens up the possibility of replacing technical terms with the relevant word in the target language. It seems like the grammar in CJK languages is exceptionally forgiving regarding nouns: There is generally no plural form, and it also seems like other conjugations are made with separate words (e.g. possessive form).
Even more interesting, it works with verbs as well. “I רץ back to בית quickly” (רץ means “ran”) was translated into “אני חוזר מהר לבית”, which means “I quickly return home”. The word for “run” (רץ) was magically replaced with “return”, which is an interesting interpretation.
So maybe this can work. Not sure how much it improves, though.
Reader Comments
Very sensible suggestions, but clearly a lot of work!
Have you found any of the new chat bots helpful in translating technical documents (PDFs)? I’m particularly interested in translating research papers, leaving the equations and figures unchanged and just translating the words.
I don’t use chat bots a lot. For pdf docs, I would try Google Translate’s capability of translating documents, but the results aren’t always that impressive. Surprisingly enough, Google Lens tended to give better results last time I checked.