My home on the web - featuring my real-life persona!

OMG - It’s full of mistakes!

So, we just had one of our applications translated into Greek. It is a very big application, with a total of 13,000 words in strings alone. Initially, we had about a month, so time was not a big issue, and the translation got started on May 8th. Of course, these things change and all of a sudden, we needed it not by the beginning of June but for a show on May 20th. That means 12 days for the translation, cleaning up the bilingual files, importing the strings, fixing truncations and other issues, testing functionality and compiling DLLs. Of course we made it!

Now I was waiting for feedback. Nothing at all from the guys who were at the show. No “good job”, no “shame on you” - nothing. After a week, I inquired and I got the reply back that there were “a big number of errors”. That sent a shiver down my spine. We don’t have many translations into Greek, only one other application so I don’t know this translator very well. We don’t have any Greek reference material, but I asked and he confirmed that he knew the subject matter. And I myself can of course not check anything in Greek.

Turns out, it wasn’t all that bad. We had issues for all languages because unless you are a printing press operator, you really can’t figure out some things. I remember asking our German guys questions and they had no clue either. Unfortunately, some terms that were wrong occurred 50 or even 100 times so yeah, it looks like a lot. Correcting all strings took me a couple of hours of manual copy/paste, which is not bad at all.

It just irks me that the only feedback I got was that there are a lot of errors (which wasn’t even true). He never acknowledged that we did the impossible by turning this around so fast and that it worked fine. Only the tester mentioned that this must have been the fastest turn-around we had for any language but I am also getting a lot better at handling languages I know nothing about. The last translation we did for that was Russian - I am fine navigating through French, Spanish, Italian and Portuguese, but Russian and Greek are a whole different animal. If I see a truncation at runtime, I can’t just type in the text I see and search for it - I need a virtual keyboard and have to type in a keyword letter by letter to search for it. And I am amazed how nicely Trados and TagEditor handle the different character sets. I don’t think many people know what an ordeal it can be to have an application ready for non-Western character sets.

Ah well, believe it or not, I still love doing it - it’s a big girl puzzle and I am getting paid to solve it!

Databases are boring

As some may know, I love computers, I love everything about them. I like writing and reading on the computer - the wealth of information is amazing, I like smaller programming tasks (for now), I enjoyed Java, HTML, CSS, my current excursion into ASP.NET and VB.NET, I enjoy working with images and video. But there is one thing that I could never warm up to and that is anything related to databases. Databases bore me - just the thought of gigabytes of data makes me yawn. Unfortunately, the ASP.NET class at Davenport also covers some database stuff. It makes sense I guess, most web-based programs are connected to a database somewhere be it an online shop or a customer list, a service list or what-not. Have I just not found the right approach to it to see the beauty of databases?

So far, I have enjoyed every single chapter of this book, but now I have to force myself through pages and pages covering SQL, ADO, ODBC, COM, OLE and I can’t seem to finish. What’s up with all the acronyms? I have to fight through at least another 50 pages going on and on about data connections and sources, queries and relationships, propertiessssssss - sorry, my head hit the keyboard when I fell asleep typing this. Can anyone help me? What am I not getting? Is there a way to make this more enjoyable? It’s not like I need much - heck, I have spent hours of my life watching my hard drive defrag and download progress bars grow.

Oh well, back to the ADO.NET model….

Discuss why the data types are the same across languages in the MSIL.

The data types are the same across the different languages in the Microsoft Intermediate Language (MSIL) to provide compatibility. In .NET, various languages can be used but in the end they have to be able to interface with each other. The .NET compiler compiles the languages into MSIL and it can only do this without problems since the data types are the same. It would be incredibly difficult if each language had its own data types or its own definitions of data types. A string for example is a “group” of characters or digits and every program understands it as such. Now imagine another programming language defined a string only as a group of characters, did not consider digits, and called a combination of characters and digits a word. Then the VB string cisp238 would be considered a word in the other language and the .NET compiler would have to examine all strings to see if they need to be converted to be compatible.

I found a real-world example: In VB.NET, the data type Long is a 64-bit number while in VB6 it was a 32-bit number. So, it would require a lot of care to bring those two languages together and every time one language uses a data type Long you would have to exactly specify what it means. The rules for the different .NET languages are specified in the Common Language Specification (CLS).
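To make the Long example a bit more tangible, here is a small sketch in Python (not VB and not MSIL, just an illustration of the two sizes). It uses the standard struct module to store the same number as a 64-bit and as a 32-bit signed integer. The helper name fits_in_32_bits and the sample value are made up for this example.

```python
import struct

# Sample value: fits easily into 64 bits, but is too large for a signed 32-bit number.
value = 3_000_000_000

# "<q" = 64-bit signed integer (the size of Long in VB.NET)
as_64_bit = struct.pack("<q", value)


def fits_in_32_bits(number: int) -> bool:
    """Hypothetical helper: can this number be stored as a 32-bit signed integer?"""
    try:
        # "<l" = 32-bit signed integer (the size of Long in VB6)
        struct.pack("<l", number)
        return True
    except struct.error:
        return False


print(len(as_64_bit))              # number of bytes used by the 64-bit version
print(fits_in_32_bits(value))      # the same value does not fit into 32 bits
```

Same name, two different sizes: a value that one side happily calls a Long simply does not fit into what the other side calls a Long.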

I found an interesting link about how different languages define the primitive data type Boolean: http://en.wikipedia.org/wiki/Boolean_datatype which is a good example of the advantages of uniform data type definitions.

It also seems like MSIL is the great “equalizer”. People used to have arguments about the speed of different programming languages, or better the speed of the applications coded in those languages. The argument was that C++ was faster than VB but in .NET everything is going through the MSIL and that determines the speed.

Corporate Email Woes

I am usually a strong believer in the theory that an email that has been sent will also arrive in my inbox. Emails don’t get lost just like that, even though every now and then someone may claim it happened. In all honesty, I too have claimed to have sent an email when in reality, I forgot. In the past, it was an email to my Mom to send her a photo, recipe, or something - I am pretty sure I am not the only one who has used this white lie.

Now, when it comes to business email, it’s quite different to me. The information or attachments in a business related email are usually critical to a job or a project and can have consequences. For example, if I receive a set of strings, I usually tell my developer that he can expect them back by the end of the week. My freelancers are usually very fast and I am able to estimate how long it should take them. They will let me know if they cannot make it within a reasonable amount of time but I don’t expect them to confirm every email I send. At their own discretion, they reply with an estimated delivery date or they just return the files. Considering the previous premise that emails reach their destination, I think it works fine both ways. Both sides cut down on the chit-chat back and forth a little - it’s not like we all don’t send enough emails anyway.

Unfortunately, my system has been shattered by our new overzealous email filter system. All of a sudden, emails I send are not going out, I don’t receive emails that my freelancers send - and the notification system is lacking to non-existent. First, my translator returned a translation in TTX files on Thursday. On Monday, I carefully inquired if he had received the files to which he replied he had delivered on Thursday. Quick check with IT, of course it got caught in the mail filter because of “inappropriate language”. Haha, that would mean that either the help system contains foul language which it doesn’t (especially not since I was able to send the file out fine) or a word in the Spanish translation happens to match an English term which is on the index. No one knows, I was told there is no log listing which word was the offender. A few hours later, the translator received a note that the email he had sent Thursday could not be delivered.

Then all of a sudden, we cannot send or receive compressed attachments anymore - yes, a regular zip file is held back because who knows what’s in it. And again, no one gets a notification. The sender believes it went through, the receiver has no idea anything was blocked and the lonely email is sitting in quarantine. Apparently, this is now handled on a case-by-case basis and the IT department checks the emails with attachments and patches them through if appropriate - a system which apparently doesn’t work very well. The reason by the way is protection against viruses I was told.

Next thing, the SDL Trados Synergy translation packages are blocked - same as with the zip files, no notification, they are just quarantined. The packages are pkzipped so the system recognizes them as zip files and blocks them. At least with those, we found a solution: because of the unique file extensions, IT could write a filter rule that allows stppk out and strpk in. But even one of those was blocked again recently because of offensive language.

The whole thing is a major pain now. Not only do I have to confirm all email people send to me, I also need the translators to confirm they received my emails until IT gets the out-of-control email filter configured properly. In between, I also got mocked by an IT guy for being a “troublemaker” because I insist on receiving my email. Yeah, the audacity - I insist on receiving professional business emails sent by associates. Not sure if he was kidding, but I do believe before implementing a system like this, it should be looked at a little closer. I don’t even want to know how many customer emails got lost.

I am sure some freelancers have wished for an IT department that takes care of all the computer woes. Believe me, it really doesn’t work like that - woes sometimes aren’t eliminated but created :-)

Fun with character encodings

What do ASCII, ANSI, Latin-1, Windows-1252, Unicode and UTF have in common?

They are a pain in the neck for translators - but also, they are ways to encode characters in files, even in plain text files that usually seem as “un-encoded” as possible. Most of the time, you don’t have a problem with it, you open a txt file, you don’t really know (or need to know) what character format it has. The only reason why most people even know about this is because of the “bush hid the facts” (see below) trick in Notepad. I am not going into the history and details of the various formats, at the bottom are some links to other pages that deal with that if you want to learn more. I am merely looking at the consequences it can have for me during translation.

What I care more about is the fact that it can really break your neck during translation of string files. I run into that on and off and every time it happens, I learn a little bit more about it. I have wanted to write about it for quite a while, and since the whole thing came down again earlier this week, I think it is time now.

We have a little update tool for an application that is written in Java. Java programs usually have their strings in .properties files. Those files are usually encoded in ISO 8859-1 (aka Latin-1), an 8-bit encoding that covers most Western European characters but nothing beyond them. To stay on the safe side, language specific characters like ü Ü é or ñ (and anything outside Latin-1, such as Greek or Russian) are converted into Unicode escape characters, sometimes referred to as Java escape characters. I think most of us have experienced other escape characters, for example the \n for a new line, \t for a tab. Unicode escape characters are a little more involved, using a \uHHHH notation, where HHHH is the hex index of the character in the Unicode character set. So, for example the ß in a Java properties file has to be encoded into \u00df. To convert those characters, I use Rainbow which is part of the Okapi Framework. It has a handy Encoding Conversion Utility that allows you to convert files from one encoding to another.
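For the curious, the \uHHHH notation can be sketched in a few lines of Python. This is only an illustration of the idea, not what Rainbow does internally, and the function name to_java_escapes is made up for this example. It escapes everything outside plain ASCII, which I am assuming is the safe choice for a properties file.

```python
def to_java_escapes(text: str) -> str:
    """Hypothetical helper: replace every non-ASCII character with \\uHHHH."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # Plain ASCII stays as it is.
            pieces.append(ch)
        elif code <= 0xFFFF:
            # One escape with the four-digit hex index of the character.
            pieces.append("\\u%04x" % code)
        else:
            # Characters beyond U+FFFF need two escapes (a surrogate pair).
            code -= 0x10000
            high = 0xD800 + (code >> 10)
            low = 0xDC00 + (code & 0x3FF)
            pieces.append("\\u%04x\\u%04x" % (high, low))
    return "".join(pieces)


print(to_java_escapes("Straße"))  # prints Stra\u00dfe
```

Note that the text has to be read with the correct encoding before it gets here. The function only sees characters, it has no idea which bytes they came from.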

Sounds really easy, right? Right? Now what is this woman complaining about again? Well, it’s not that easy. The conversion tool is designed to work with 8-bit ASCII-based encodings. Now, so what IS the problem - it was just stated that Java properties files use an ASCII-based encoding? Well, TagEditor takes the ASCII file and when you “Save as Target” after translation, it converts the file into UTF-8. And that is still not the problem, the problem is that it uses a UTF-8 format without a BOM (Byte Order Mark). The BOM is an (invisible) byte sequence at the beginning of a file (3 bytes for UTF-8, 2 bytes for UTF-16) which basically tells a program “This is a Unicode file”. Without the BOM, some programs do not recognize the encoding of the file and assume ASCII - and that is the problem with Rainbow (and also with Passolo, a program that just got bought by SDL).

If you try to convert the encoding of a BOMless Unicode file, it goes terribly wrong. As I mentioned, the correct conversion of ß will give you \u00df. Converting a BOMless file will “double escape” the extended characters, and you get \u00c3\u0178 - clearly not the same. The “double escape” is actually a good indicator that something went wrong, if you check your file and see that your extended characters are represented by two escape sequences, you know something went wrong. Of course, that can be difficult when dealing with languages like Greek, Russian or Asian languages, simply because every single character is escaped. I usually try to find a short string and count.
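The “double escape” is easy to reproduce. The following Python sketch takes the UTF-8 bytes of ß and reads them once correctly and once as Windows-1252, which is what happens to a BOMless file. The helper name escape is made up for this example, and the sketch is not taken from Rainbow or any other tool.

```python
def escape(text: str) -> str:
    """Hypothetical helper: write non-ASCII characters as \\uHHHH."""
    return "".join(ch if ord(ch) < 128 else "\\u%04x" % ord(ch) for ch in text)


# The two bytes that UTF-8 uses for ß
raw = "ß".encode("utf-8")

# Read correctly as UTF-8: one character, one escape
correct = escape(raw.decode("utf-8"))

# Read wrongly as Windows-1252: each byte becomes its own character
wrong = escape(raw.decode("cp1252"))

print(raw.hex())   # prints c39f
print(correct)     # prints \u00df
print(wrong)       # prints \u00c3\u0178
```

The bytes never change, only the assumption about what they mean. That is why the result looks like two unrelated characters instead of one.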

Now, how do you know how a file is encoded? Right now, I use Notepad++ to check. It has a handy little Format menu and allows you to see which encoding is used and it also allows you to convert from one encoding to another. Supported formats are Windows, UNIX, Mac, ANSI, UTF-8 w/o BOM, UTF-8 and UCS-2 Big and Little Endian. Surprisingly, Windows Notepad is one of the few programs that actually manages to decipher the Unicode encoding even without a BOM, just open the BOMless file in Windows Notepad and save it without change. Unfortunately, you usually just don’t know and usually it isn’t even an issue.
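If you would rather check with a script than with an editor, here is a rough sketch in Python. It looks for a BOM first and then makes an educated guess. The function name sniff_encoding and the wording of the results are my own invention, and without a BOM the answer is always just a guess (it also does not try to tell UTF-32 apart from UTF-16).

```python
import codecs

# Known byte order marks, taken from Python's codecs module
BOMS = [
    (codecs.BOM_UTF8, "UTF-8 with BOM"),
    (codecs.BOM_UTF16_LE, "UTF-16 little endian"),
    (codecs.BOM_UTF16_BE, "UTF-16 big endian"),
]


def sniff_encoding(data: bytes) -> str:
    """Hypothetical helper: guess the encoding from the raw bytes of a file."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("ascii")
        return "plain ASCII"
    except UnicodeDecodeError:
        pass
    try:
        # Bytes that happen to be valid UTF-8 are very likely UTF-8.
        data.decode("utf-8")
        return "probably UTF-8 without BOM"
    except UnicodeDecodeError:
        return "some 8-bit encoding (Latin-1, Windows-1252, ...)"


print(sniff_encoding("Straße".encode("utf-8")))
print(sniff_encoding("Straße".encode("latin-1")))
```

To use it on a real file, read the file in binary mode and pass the bytes to the function.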

I actually happened to get to talk to Yves Savourel, who is working at ENLASO and with the Okapi Framework (and about a gazillion other things related to localization), and he has been very helpful. He explained a few things to me a little better.

    The issue:

  • a BOMless UTF-8 file is recognized as “windows-1252” encoding
  • a UTF-8 file uses two or more bytes to encode the extended characters
  • the application thinks each of those bytes is a separate character and converts each into a Unicode escape sequence
    The solution:

  • in Rainbow, manually force the encoding of the source file to UTF-8
  • in Rainbow, use the Add/Remove BOM utility to set the BOM properly
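Outside of Rainbow, the second fix can be sketched in Python as well. This is not the Rainbow utility, just the same idea in script form, and the function name add_utf8_bom is made up. It assumes the file really is UTF-8 and refuses to touch it otherwise.

```python
import codecs
from pathlib import Path


def add_utf8_bom(path) -> bool:
    """Hypothetical helper: put a UTF-8 BOM in front of a BOMless UTF-8 file.

    Returns True if the file was changed, False if it already had a BOM.
    """
    file_path = Path(path)
    data = file_path.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return False
    # Raises UnicodeDecodeError if the content is not valid UTF-8,
    # so a Latin-1 file does not get mislabeled by accident.
    data.decode("utf-8")
    file_path.write_bytes(codecs.BOM_UTF8 + data)
    return True
```

Calling add_utf8_bom on the translated file before the encoding conversion gives the converter the hint it is missing. Work on a copy, in case the program that finally reads the file does not expect a BOM.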

If you got through all this stuff, you may now wonder if you’ll ever run into this issue. It is also not just about BOM or not, the whole file encoding raises issues in other applications too. To be honest, I don’t know how often freelance translators are confronted with these types of files, but here are the situations where I keep my eyes peeled:

  • Java files (.properties)
    This was the most recent issue that triggered this post.
  • String export files (often XML files or even plain txt)
    I tend to get the strings for REALBasic applications in XML files, though I believe they are created by RegexBuddy.
  • Non-Windows files or Windows files that will be used on other OSs
    We run into this issue with txt files that were created on a Mac and that will be used in InstallShield-type applications, for example to display the license agreement or a readme file.
  • All files
    Haha, very funny - I know. What I mean is, I have experienced various issues with files if I have to process them through different applications in order to get CAT-translatable files, for example if we receive a weird string file that Trados doesn’t understand and where we need to find a manageable way to extract translatable text.

Anyway, maybe this will help someone else in the situation where the client comes back and claims the files are corrupt or so. Otherwise, I apologize for boring the heck out of you. You should have stopped reading my post a long time ago :-)

Some interesting links with related information:

Okapi Framework
Notepad++
Bush hid the facts hoax and Bush hid the facts on Wikipedia
Mojibake
How to Determine Text File Encoding
Cast of Characters: ASCII, ANSI, UTF-8 and all that
