Cool Tool - ExamDiff Pro
Yeah, I am still alive, but barely, and I believe my brain is quite worn out. We have released several new products and launched a separate website specifically for the photo market (http://www.xritephoto.com).
I also designed a webpage (http://www.mm-translations.com), I turned 40, we took a few short trips during the summer, and we just had nice weather, so I would rather stay outside.
During one of my projects I had to send out Word documents for review. I told everyone to please turn on “Tracking” in Word and correct the files so I can update my translation memories. These were long (70 pages), big (60 MB) documents with a lot of formatting, grouped elements and images.
Unfortunately, people either don’t like or are not familiar with the tracking function (and are afraid to ask), so I got a variety of documents back, each marked up in a creative way - or not at all.
Now, I know that Word has a Compare function, but I have never been happy with the results. The view is very messy, it highlights every little “font hiccup” and some ominous language changes, and I’d say it is not usable in general.
Fortunately, I remembered the little compare tool I use for XML files. It is called ExamDiff, and I wondered what it would do with a binary format like a Word document. Well, it handles them like a champ! I created a doc with a partial Microsoft EULA and changed a copy. ExamDiff shows both documents in a window split vertically or horizontally, highlighting the lines that changed and marking the particular change.
But for me the best part was the report function. Since I had to communicate all the changes back to my Italian translator, this was very important. ExamDiff created an HTML document (other formats are also available) that basically looked exactly like the comparison I saw in the application window. It had a little legend that explains what the colors mean, and I was able to just send this document to my translator, who could easily update his translation memory with these changes.
Maybe there are a lot of tools like this out there, but ExamDiff has always been wonderful for comparing XML files or source code with 10,000+ lines. Until now, though, I was not aware that it can handle other files as well.
Now, as for versions, we are using the older version, ExamDiff Pro 3.5. The new version 4.5 costs $35 and is available at PrestoSoft.
Context please!
I am just going through my old mailing list emails. Since I filter them into different directories, they don’t bother my inbox and I don’t care much if they accumulate. They never have attachments, so even a few thousand emails don’t amount to any significant file size.
One of my pet peeves is translation questions without context. Actually, it applies not only to translation mailing lists but to all types of online communication and all sorts of topics. If you have a computer question, you should always include your computer specs, what exactly the problem is and how it started. The question “What should I do if my computer doesn’t boot anymore?” is impossible to answer.
I am never sure why people do not add the specifics or the sentence before and/or after to their questions. Sometimes, if you are in a hurry, you may forget it and just slap the sentence into an email and send it off. It happens, that’s OK. Unfortunately, you sometimes see an elaborate email that omits the context and in that case you know it was not written in a rush. So, what makes those people omit the oh-so-important context?
A few possible reasons come to mind:
- The asker feels the context is not relevant
This is a poor judgment call given the fact that the asker didn’t understand the sentence well enough to translate it in the first place. To a native speaker, the sentence before and after is often the deciding factor for a translation. This is not a “need-to-know” situation, unless the asker is translating highly confidential material for the CIA or FBI. Let the mailing list decide how much additional information they need, but give them as much as possible.
- The asker feels the context is obvious
This may be true in some cases, but in most cases it is just obvious to the asker because his mind is “in the text”. The people on the mailing list have no idea what the translation is about. And even if they know the general context, it is usually important to know who said something, the frame of mind of the narrator, the time frame - or for technical items, if it is a description, a caption, a menu item, a catalog entry etc.
- The asker finds it more appropriate to explain the context
This is my favorite. In this case, the mailing list has to trust that the asker actually understood the parts before and after, which can be doubtful. He didn’t understand the sentence he is asking about, so how can he be sure he understood the sentence before and after properly? The target language explanation is a translation of his understanding, which doubles the error rate. First the context is possibly misunderstood, and second it is possibly mistranslated.
In the end, it is a big waste of time for everyone involved. People who are trying to help send possible solutions that are totally useless within the context - which they couldn’t tell because there was no context. Members of the mailing list get annoyed to the point where they don’t even answer anymore, but just roll their eyes and hit the “Delete” key. The mailing list gets swamped with a back-and-forth of emails as the context slowly unfolds and people “re-answer” with the new information.
In one of those discussions a couple of months back, a fellow translator and ATA/GLD member Karin Bauchrowitz replied with a great comment (she was quoting someone from a conference):
Beginners translate word by word, then it goes up to sentence by sentence, then paragraph by paragraph, but only the experienced translators go by the entire text, meaning that they take the entire text under consideration.
Considering the topic of this post - do your fellow list members a favor: don’t ask for help by posting just a word or a sentence. For a proper translation, give them at least a paragraph of the text!
The Linguists - Tonight (Feb. 26) on PBS
I just saw this in a magazine - tonight on PBS (10pm EST on my PBS station):
The Linguists is a hilarious and poignant chronicle of two scientists—David Harrison and Gregory Anderson—racing to document languages on the verge of extinction. In Siberia, India, and Bolivia, the linguists confront head-on the very forces silencing languages: racism, humiliation, and violent economic unrest. David and Greg’s journey takes them deep into the heart of the cultures, knowledge, and communities at risk when a language dies.
Before airing on PBS, The Linguists world premiered at the 2008 Sundance Film Festival and screened at more than 40 festivals worldwide. The Linguists is produced and directed by Seth Kramer, Daniel A. Miller, and Jeremy Newberger of Ironbound Films, and based upon work supported by the National Science Foundation under Grants No. 0452417 and 0438121 and by the Nonprofit Media Group.
Cold callers asking for call-back
Is this normal? I received an unsolicited call from a translation agency which I have never done business with. The name appeared on the caller ID, so I didn’t pick up. The caller left a message, asking me to call him back. Is this normal? Why would I want to call him back? Usually, cold callers will just try again, but not ask you to call them.
I hope this doesn’t go around, and next I have Planned Parenthood, the local Police Department, the local Fire Department, Clean Water Action, the Juvenile Diabetes Research Foundation and what not call and ask me to call them back. Now, don’t get me wrong - I would like to get this message and then be able to decide whether I call back or not. And of course, if I DON’T call back, they should take the hint that I am not interested. But they don’t and just keep calling.
In recent days, this procedure and the whole “charities begging for money thing” has actually turned me away from giving money to anyone. If you give something once, they will call you every other month and ask for more. And they will not take No for an answer.
Fun with character encodings
What do ASCII, ANSI, Latin-1, Windows-1252, Unicode and UTF have in common?
They are a pain in the neck for translators - but also, they are ways to encode characters in files, even in plain text files that usually seem as “un-encoded” as possible. Most of the time, you don’t have a problem with it: you open a txt file and you don’t really know (or need to know) what character format it has. The only reason why most people even know about this is the “Bush hid the facts” (see below) trick in Notepad. I am not going into the history and details of the various formats. At the bottom are some links to other pages that deal with that if you want to learn more. I am merely looking at the consequences it can have for me during translation.
What I care more about is the fact that it can really break your neck during translation of string files. I run into that on and off, and every time it happens, I learn a little bit more about it. I have wanted to write about it for quite a while, and since the whole thing came down again earlier this week, I think it is time now.
We have a little update tool for an application that is written in Java. Java programs usually have their strings in .properties files. Those files are traditionally encoded in the 8-bit characters of ISO 8859-1 (aka Latin-1), which contains the “regular” characters and the Western European ones like ü Ü é or ñ, but nothing beyond that (no Greek, Cyrillic or Asian characters). To be on the safe side, everything outside of plain ASCII is usually converted into Unicode escape characters, sometimes referred to as Java escape characters. I think most of us have experienced other escape characters, for example the \n for a new line or \t for a tab. Unicode escape characters are a little more involved, using a \uHHHH notation, where HHHH is the hex index of the character in the Unicode character set. So, for example, the ß in a Java properties file is encoded as \u00df. To convert those characters, I use Rainbow, which is part of the Okapi Framework. It has a handy Encoding Conversion Utility that allows you to convert files from one encoding to another.
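To make the \uHHHH notation concrete, here is a minimal sketch in Python. It is not what Rainbow does internally; the function name to_java_escapes is made up for this illustration. It leaves plain ASCII alone and replaces everything else with its escape.

```python
def to_java_escapes(text: str) -> str:
    """Replace every non-ASCII character with a Java-style \\uHHHH escape."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # Plain ASCII stays as it is.
            out.append(ch)
        elif code <= 0xFFFF:
            # One escape with exactly four hex digits.
            out.append(f"\\u{code:04x}")
        else:
            # Characters beyond U+FFFF need a UTF-16 surrogate pair.
            code -= 0x10000
            high = 0xD800 + (code >> 10)
            low = 0xDC00 + (code & 0x3FF)
            out.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(out)


print(to_java_escapes("Straße"))  # Stra\u00dfe
```

One character in, one escape out: that is the rule of thumb that makes problems easy to spot later.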
Sounds really easy, right? Right? Now what is this woman complaining about again? Well, it’s not that easy. The conversion tool is designed to work with 8-bit ASCII-based encodings. So what IS the problem, if Java properties files use exactly such an encoding? Well, TagEditor takes the file and when you “Save as Target” after translation, it converts the file into UTF-8. And even that is not the problem. The problem is that it writes UTF-8 without a BOM (Byte Order Mark). The BOM is an (invisible) byte sequence at the very beginning of a file (3 bytes for UTF-8, 2 bytes for UTF-16) which basically tells a program “This is a Unicode file”. Without the BOM, some programs do not recognize the encoding of the file and assume an 8-bit encoding instead - and that is the problem with Rainbow (and also with Passolo, a program that just got bought by SDL).
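Checking for a BOM is just a matter of looking at the first few bytes of the file. The following Python sketch uses the BOM constants from the standard codecs module; the helper name sniff_bom is an assumption for this example. The UTF-8 BOM is the three bytes EF BB BF.

```python
import codecs
from typing import Optional

# UTF-32 has to be tested before UTF-16, because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding announced by a BOM, or None if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


with_bom = codecs.BOM_UTF8 + "Straße".encode("utf-8")
without_bom = "Straße".encode("utf-8")
print(sniff_bom(with_bom))     # utf-8
print(sniff_bom(without_bom))  # None
```

A result of None does not mean the file is not Unicode. It only means the file does not say so itself, and a program has to guess.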
If you try to convert the encoding of a BOMless Unicode file, it goes terribly wrong. As I mentioned, the correct conversion of ß will give you \u00df. Converting a BOMless file will “double escape” the extended characters, and you get \u00c3\u0178 - clearly not the same. The “double escape” is actually a good indicator: if you check your file and see that your extended characters are represented by two escape sequences each, you know something went wrong. Of course, that can be difficult when dealing with languages like Greek, Russian or Asian languages, simply because every single character is escaped. I usually try to find a short string and count.
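The “double escape” can be reproduced in a few lines of Python. This is only a sketch of the effect, not of Rainbow itself, and the helper escape_non_ascii is made up for the example. The same two UTF-8 bytes are decoded once correctly and once as windows-1252, then escaped.

```python
def escape_non_ascii(text: str) -> str:
    # Simplified escaping, good enough for characters up to U+FFFF.
    return "".join(ch if ord(ch) < 128 else f"\\u{ord(ch):04x}" for ch in text)


raw = "ß".encode("utf-8")      # two bytes: C3 9F

right = raw.decode("utf-8")    # one character: ß
wrong = raw.decode("cp1252")   # two characters: Ã and Ÿ

print(escape_non_ascii(right))  # \u00df
print(escape_non_ascii(wrong))  # \u00c3\u0178
```

The second result is the sequence quoted above: byte C3 becomes Ã (U+00C3) and byte 9F becomes Ÿ (U+0178) in windows-1252, so one character turns into two escapes.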
Now, how do you know how a file is encoded? Right now, I use Notepad++ to check. It has a handy little Format menu that allows you to see which encoding is used and also to convert from one encoding to another. Supported formats are Windows, UNIX, Mac, ANSI, UTF-8 w/o BOM, UTF-8 and UCS-2 Big and Little Endian. Surprisingly, Windows Notepad is one of the few programs that actually manages to decipher the Unicode encoding even without a BOM: just open the BOMless file in Windows Notepad and save it without change. Unfortunately, you usually just don’t know how a file is encoded, because most of the time it isn’t even an issue.
I actually happened to get to talk to Yves Savourel, who is working at ENLASO and with the Okapi Framework (and about a gazillion other things related to localization), and he has been very helpful. He explained a few things to me a little better.
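For a file without a BOM, one common heuristic is to simply try a strict UTF-8 decode: text in an 8-bit encoding like windows-1252 that contains extended characters almost never happens to be valid UTF-8. The Python sketch below shows that idea. It is an assumption for illustration and not necessarily how Notepad or Notepad++ decide; the function name is made up as well.

```python
def guess_bomless_encoding(data: bytes) -> str:
    """Rough guess for a file that has no BOM."""
    try:
        data.decode("ascii")
        return "ascii"  # no extended characters at all, nothing to worry about
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # strict by default: invalid bytes raise an error
        return "utf-8 without BOM"
    except UnicodeDecodeError:
        return "some 8-bit encoding (possibly windows-1252)"


print(guess_bomless_encoding("Straße".encode("utf-8")))
print(guess_bomless_encoding("Straße".encode("cp1252")))
```

The first call reports UTF-8 without BOM, the second one falls through to the 8-bit guess, because a lone DF byte followed by a plain letter is not a valid UTF-8 sequence.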
- The issue:
- a BOMless UTF-8 file is recognized as “windows-1252” encoding
- a UTF-8 file uses two or more bytes to encode the extended characters
- the application thinks each of those bytes is a separate character and converts each into a Unicode escape sequence
- The solution:
- in Rainbow, manually force the encoding of the source file to UTF-8
- in Rainbow, use the Add/Remove BOM utility to set the BOM properly
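Outside of Rainbow, the same two steps (read the file explicitly as UTF-8, then write it back with a BOM) can be sketched in Python. The function name add_utf8_bom and the demo file are assumptions for this example, not part of any of the tools mentioned.

```python
import codecs
import os
import tempfile


def add_utf8_bom(path: str) -> None:
    """Rewrite a UTF-8 file so that it starts with exactly one BOM."""
    with open(path, "rb") as f:
        data = f.read()
    # "utf-8-sig" forces UTF-8 and drops a BOM if one is already there.
    # The decode is strict, so a file that is not UTF-8 raises an error
    # instead of being silently damaged.
    text = data.decode("utf-8-sig")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + text.encode("utf-8"))


# Small demo with a temporary BOMless properties file.
fd, demo_path = tempfile.mkstemp(suffix=".properties")
os.close(fd)
with open(demo_path, "wb") as f:
    f.write("greeting=Grüße\n".encode("utf-8"))

add_utf8_bom(demo_path)

with open(demo_path, "rb") as f:
    result = f.read()
os.remove(demo_path)

print(result.startswith(codecs.BOM_UTF8))  # True
```

Running it twice on the same file does no harm, since an existing BOM is removed before the new one is written.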
If you got through all this stuff, you may now wonder if you’ll ever run into this issue. It is also not just about BOM or no BOM; the whole file encoding topic raises issues in other applications too. To be honest, I don’t know how often freelance translators are confronted with these types of files, but here are the situations where I keep my eyes peeled:
- Java files (.properties)
This was the most recent issue that triggered this post.
- String export files (often XML files or even plain txt)
I tend to get the strings for REALBasic applications in XML files, though I believe they are created by RegexBuddy.
- Non-Windows files or Windows files that will be used on other OSs
We run into this issue with txt files that were created on a Mac and that will be used in InstallShield-type applications, for example to display the license agreement or a readme file.
- All files
Haha, very funny - I know. What I mean is, I have experienced various issues with files when I have to process them through different applications in order to get CAT-translatable files, for example if we receive a weird string file that Trados doesn’t understand and where we need to find a manageable way to extract translatable text.
Anyway, maybe this will help someone else in the situation where the client comes back and claims the files are corrupt or so. Otherwise, I apologize for boring the heck out of you. You should have stopped reading my post a long time ago.
Some interesting links with related information:
Okapi Framework
Notepad++
Bush hid the facts hoax and Bush hid the facts on Wikipedia
Mojibake
How to Determine Text File Encoding
Cast of Characters: ASCII, ANSI, UTF-8 and all that
