Google Translate, LaTeX and Asian languages: Technical notes
Introduction
This post contains a few technical notes on using Google Translate for translating LaTeX documents into Chinese, Japanese and Korean. Insights on the language-related issues are written up in a separate post.
Text vs. HTML
Google’s cloud translator can be fed either plain text or HTML, and it returns the same format. Plain text is out of the question for anything but short sentences, as it becomes impossible to maintain the text’s formatting. So I went for the HTML interface.
The thing with HTML is that whitespace can take different forms and shapes, and it is redundant in many situations. For example, a newline is often equivalent to a plain space, and neither makes any difference between two paragraphs enclosed in <p> tags.
Google Translate takes this notion to the extreme, and typically removes all newlines from the original text. OK, that’s understandable. But it also adds and removes whitespace where it has no business doing anything, in particular around meaningless segments that aren’t translated anyhow. This makes it quite challenging to feed the results into further automatic processing.
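One consequence is that a naive character-by-character comparison of the HTML that was sent against the HTML that came back is useless. A minimal sketch of a whitespace-insensitive comparison (in Python purely for illustration; the function name is mine):

```python
import re

def squash_ws(html):
    # Collapse every run of whitespace into a single space: a newline
    # and a plain space usually render identically in HTML anyway.
    return re.sub(r'\s+', ' ', html).strip()

before = "<p>Hello,\nworld</p>\n"
after = "<p>Hello, world</p>"  # the kind of reshuffling Google does
print(squash_ws(before) == squash_ws(after))  # → True
```

Anything that survives this normalization is a real change; anything it erases is whitespace noise.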
Setting up a Google Cloud account
First, create a new Google Cloud account and enable the Google Translate API.
A new Google Cloud account comes with an automatic credit of $300 to spend over three months, so there’s plenty of room for much-needed experimenting. To see the status of the evaluation period, go to Billing > Cost Breakdown and wait a minute or so for the “Free trial status” strip to appear at the top of the page. There’s no problem with “activating full account” immediately: the free trial credits remain, but real billing kicks in once the credits are consumed and/or the trial period is over.
I went for Basic (v2) translation, not Advanced (v3). Their pricing is the same, but v3 isn’t allowed with an API key, and I really wasn’t into setting up a service account and struggling with OAuth2. The main advantage of v3 is the possibility of training the engine to adapt to a specific language pattern, but as mentioned in that separate post, I’m hiding away anything but common English language patterns.
As for authentication, I went for API keys. I don’t need any personalized info, so that’s the simple way to go. To obtain a key, go to the main menu (hamburger icon) > APIs and services > Credentials, pick Create Credentials, and choose an API key. Copy the string and use it as the key=API_KEY parameter in POST requests. It’s possible to restrict the usage of this key in various ways (HTTP referrer, IP address etc.), but that wasn’t relevant in my case, because the script runs only on my computer.
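For illustration, here’s a minimal Python sketch of what the form-encoded body of such a POST request looks like. The field names are those of the v2 API; the target language is an arbitrary choice and the key value is obviously a placeholder, and nothing is actually sent:

```python
from urllib.parse import urlencode

URL = "https://translation.googleapis.com/language/translate/v2"

params = {
    "source": "en",
    "target": "zh-CN",          # arbitrary example target
    "format": "html",
    "key": "YOUR_API_KEY",      # placeholder, not a real key
    "q": "<p>Hello, world</p>",
}
body = urlencode(params)  # ordinary application/x-www-form-urlencoded
print(body)
```

The actual request is a plain HTTP POST of this body to the URL above, as the script in the next section does.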
The web interface for setting up cloud services is horribly slow, which is slightly ironic and a bit odd for a company like Google.
The translation script
I wrote a simple script for taking a piece of text in English and translating it into the language of choice:
#!/usr/bin/perl
use warnings;
use strict;

use LWP::UserAgent;
use JSON qw[ from_json ];

our $WASTEMONEY = 0; # Prompt before making request
my $MAXLEN = 500000;
my $chars_per_dollar = 50000; # $20 per million characters

our $APIkey = 'your API key here';

my ($outfile, $origfile, $lang) = @ARGV;

die("Usage: $0 outfile origfile langcode\n")
  unless (defined $lang);

my $input = readfile($origfile);

askuser() unless ($WASTEMONEY);

my $len = length $input;

die("Cowardly refusing to translate $len characters\n")
  if ($len > $MAXLEN);

writefile($outfile, translate($input, $lang));

################## SUBROUTINES ##################

sub writefile {
  my ($fname, $data) = @_;

  open(my $out, ">", $fname)
    or die "Can't open \"$fname\" for write: $!\n";
  binmode($out, ":utf8");
  print $out $data;
  close $out;
}

sub readfile {
  my ($fname) = @_;
  local $/; # Slurp mode

  open(my $in, "<", $fname)
    or die "Can't open $fname for read: $!\n";
  my $input = <$in>;
  close $in;
  return $input;
}

sub askuser {
  my $len = length $input;
  my $cost = sprintf('$%.02f', $len / $chars_per_dollar);

  print "\n\n*** Approval to access Google Translate ***\n";
  print "$len bytes to $lang, $cost\n";
  print "Source file: $origfile\n";
  print "Proceed? [y/N] ";

  my $ans = <STDIN>;

  die("Aborted due to lack of consent to proceed\n")
    unless ($ans =~ /^y/i);
}

sub translate {
  my ($text, $lang) = @_;

  my $ua = LWP::UserAgent->new;
  my $url = 'https://translation.googleapis.com/language/translate/v2';

  my $res = $ua->post($url,
    [
      source => 'en',
      target => $lang,
      format => 'html', # Could be 'text'
      key => $APIkey,
      q => $text,
    ]);

  die("Failed to access server: " . $res->status_line . "\n")
    unless ($res->is_success);

  my $data = $res->content;
  my $json = from_json($data, { utf8 => 1 });

  my $translated;

  eval {
    my $d = $json->{data};
    die("Missing \"data\" entry\n") unless (defined $d);
    my $tr = $d->{translations};
    die("Missing \"translations\" entry\n")
      unless ((defined $tr) && (ref $tr eq 'ARRAY') &&
	      (ref $tr->[0] eq 'HASH'));
    $translated = $tr->[0]->{translatedText};
    die("No translated text\n")
      unless (defined $translated);
  };

  die("Malformed response from server: $@\n") if ($@);

  # Add a newline after each closing paragraph / header tag
  $translated =~ s/(<\/(?:p|h\d+)>)[ \t\n\r]*/"$1\n"/ge;

  return $translated;
}
The substitution at the end of the translate() function adds a newline after each closing tag of a paragraph or header (</p>, </h1> etc.), so that the HTML is more readable in a text editor. Otherwise, it all ends up on one single line.
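For those who don’t read Perl, the same cleanup can be sketched in Python; the regex is a direct transcription of the substitution above:

```python
import re

html = "<p>第一段</p>  <h1>标题</h1><p>第二段</p>"
# Append a newline after each closing </p> or </hN> tag, swallowing
# whatever whitespace the translator left there.
readable = re.sub(r'(</(?:p|h\d+)>)[ \t\n\r]*', r'\1\n', html)
print(readable)
```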
Protecting your money
By obtaining an API key, you effectively give your computer permission to spend money. That’s fine as long as everything works as intended, but a plain bug in a script that leads to an infinite loop or recursion, or maybe just feeding the system a huge file by mistake, can end up with consequences well beyond the CPU fan spinning a bit.
So there are two protection mechanisms in the script itself:
- The script prompts for permission, stating how much it will cost (based upon $20 / million chars).
- It limits a single translation to 500k chars (to prevent a huge file from being processed accidentally).
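In rough Python terms (names are mine; the constants mirror the script’s), the two guards amount to:

```python
CHARS_PER_DOLLAR = 50_000   # $20 per million characters (Basic v2 pricing)
MAX_LEN = 500_000           # refuse anything bigger in one go

def preflight(text):
    # Return the estimated cost in dollars, or refuse outright.
    n = len(text)
    if n > MAX_LEN:
        raise ValueError(f"cowardly refusing to translate {n} characters")
    return n / CHARS_PER_DOLLAR

print(f"${preflight('hello ' * 1000):.2f}")  # 6000 chars → $0.12
```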
Another safety mechanism is to set up budgets and budget alerts: go to Main menu (hamburger) > Billing > Budgets & Alerts. Be sure to check “Email alerts to billing admins and users”. If I got it right, budgets don’t protect against spending; they only send notifications. So I selected a sum and enabled only the 100% threshold. It also seems to make sense to check all the Discounts and Promotions options in the Credits part, which ensures that the alert reflects the money actually to be spent, after deducting all promotional credits.
On top of that, it’s a good idea to set quota limits: Go to Main menu (hamburger) > IAM & Admin > Quotas. Set the filter to Translation to get rid of a lot of lines.
It’s also the place to get an accurate figure for the current consumption.
Enable the quota for “v2 and v3 general model characters per day”, which is the only character limit that isn’t per minute, and set it to something sensible, for example 2 million characters if you’re a modest user like myself. That’s $40, which is fairly acceptable damage if the computer goes crazy, yet high enough not to hit the roof in normal use.
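As a quick sanity check, the $40 figure is just the daily cap multiplied by the price:

```python
price_per_million = 20      # dollars per million characters (Basic v2)
daily_cap = 2_000_000       # characters per day quota
worst_case = daily_cap / 1_000_000 * price_per_million
print(worst_case)  # → 40.0 dollars, the worst possible daily bill
```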
Also do something about “v3 batch translation characters using general models per day”, and the same for AutoML custom models. I don’t use these, so I set both to zero, just to be safe.
There’s an “Edit Quotas” button at the top right, which didn’t work for me, probably because I did this during the trial period, when quotas are meaningless and apparently disabled anyhow (or more precisely, fixed at preset limits).
So the way to do it was somewhat tricky (and probably pointless): to enable a quota, right-click the “Cloud Translation API” link to the left of the quota item and open it in a new tab, then set the quota figure there. This description might not be accurate for real-life use, though. Actually, the system ignored my attempts to impose limits: they appeared on the editing page, but not on the main page.
Supporting CJK in LaTeX
I’m wrapping up this post with notes on how to feed LaTeX (pdflatex, more precisely) with Chinese, Japanese and Korean in UTF-8 encoding, and get a hopefully reasonable result.
So first grab a few packages:
# apt install texlive-lang-european
# apt install texlive-lang-chinese
# apt install texlive-lang-korean
# apt install texlive-cjk-all
Actually, texlive-lang-european isn’t related to CJK, but as its name implies, it’s useful for European languages.
I first attempted with
\usepackage[UTF8]{ctex}
but pdflatex failed miserably with an error saying that the fontset ‘fandol’ is unavailable in the current mode, whatever that means. After trying a few options back and forth, I eventually went for the rather hacky solution of using CJKutf8. The problem is that CJK characters are allowed only within
\begin{CJK}{UTF8}{gbsn}
[ ... ]
\end{CJK}
but I want it to apply to the whole document, and I need the language setting to be made in a file that is included by the main LaTeX file (a different include file for each language). So I went for this simple hack:
\AtBeginDocument{\begin{CJK}{UTF8}{gbsn}}
\AtEndDocument{\end{CJK}}
As for the font, it appears that the gbsn and gkai fonts should be used with Simplified Chinese, and bsmi or bkai with Traditional Chinese. Since my translation was into Simplified Chinese, some characters simply vanished from the output document when I tried bsmi and bkai. The back-translation into English of a document typeset with bsmi was significantly worse, so these dropped characters had a clear impact on the intelligibility of the Chinese text.
I got this LaTeX warning saying
LaTeX Font Warning: Some font shapes were not available, defaults substituted.
no matter which of these fonts I chose, so it doesn’t mean much.
So the choice is between gbsn and gkai, but which one? To decide, I copy-pasted Chinese text from up-to-date Chinese websites and compared the output of LaTeX, based upon the TeX file shown below. It was quite clear that gbsn is closer to the fonts used on those sites, even though I suspect it’s a bit of a Times New Roman: the fonts used on the web have fewer serifs than gbsn. So gbsn it is, even though a font with fewer serifs would have been nicer.
For Japanese, there are “min”, “maru” and “goth” fonts. “Min” is a serif font with a traditional (calligraphy-style) look, and judging from Japanese websites, it appears to be used primarily for logos and formal text (the welcoming words of a university’s president, for example).
“Maru” and “goth” are based upon simple lines, similar to plain text on Japanese websites. The latter is a bit of a bold version of “maru”, but it’s the one that seems to be popular. So I went with “goth”, which has a clean and simple appearance, similar to the vast majority of Japanese websites, even though its boldness can get a bit messy with densely drawn characters. “Maru” just looks a bit thin compared with what is commonly preferred.
Korean has two fonts in theory, “mj” and “gt”. “mj” is a serif font with an old-fashioned look, and “gt” is once again the plain, gothic version. I first failed to use the “gt” font even though it was clearly installed (there were a lot of “gt” files in the same directories where the “mj” files were installed). Trying the “gt” font instead of “mj” failed with
LaTeX Font Warning: Font shape `C70/gt/m/it' undefined
(Font)              using `C70/song/m/n' instead on input line 8.

! Undefined control sequence.
\try@size@range ...extract@rangefontinfo \font@info
                                                  <-*>\@nil <\@nnil
But as it turns out, it should be referred to as “nanumgt”, e.g.
\begin{CJK}{UTF8}{nanumgt}
나는 멋진 글꼴을 원한다
\end{CJK}
It’s worth mentioning XeLaTeX, which allows using an arbitrary TrueType font within LaTeX, so the font selection is less limited.
See this page on fonts in Japanese and Korean.
For these tests, I used the following LaTeX file, compiled with e.g.
$ pdflatex test.tex
\documentclass{hitec}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{CJKutf8}
\newcommand{\thetext}
{
它说什么并不重要,重要的是它是如何写的。
}
\AtBeginDocument{}
\AtEndDocument{}
\title{This document}
\begin{document}
gbsn:
\begin{CJK}{UTF8}{gbsn}
\thetext
\end{CJK}
gkai:
\begin{CJK}{UTF8}{gkai}
\thetext
\end{CJK}
bsmi:
\begin{CJK}{UTF8}{bsmi}
\thetext
\end{CJK}
bkai:
\begin{CJK}{UTF8}{bkai}
\thetext
\end{CJK}
\end{document}