5 Min. Read | Nathan Lucas | July 13, 2022 |
When online digital content is translated from one language to another, an unfortunate (and common) side effect can occur when the translated content is moved to a different medium.
Simple sentences that contain accented letters or special formatting can appear malformed when copied from one file to another. Specific characters and punctuation elements are usually rendered as a series of question marks or random non-standard characters.
Why is this happening? Character encoding.
Character encoding tells computers how to interpret digital data as letters, numbers, and symbols. This is done by assigning a specific numeric value to each letter, number, or symbol. These letters, numbers, and symbols are classified as “characters”. Characters are grouped together into specific “character sets” or “repertoires” that associate each one with a numerical value called a “code point”. These code points are then stored as one or more bytes.
When you input characters through a keyboard or other means, character encoding maps them to the associated bytes in the computer memory. This allows the computer to display the characters properly. Without the proper encoding, the computer will not be able to make sense of the characters and display the proper information.
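As a minimal sketch of the mapping described above, the following Python snippet looks up the code point of a character and the bytes that store it. The helper name `describe` is made up for this illustration, and UTF-8 is assumed as the encoding.

```python
def describe(char: str) -> tuple:
    """Return the Unicode code point and the UTF-8 bytes of one character."""
    return ord(char), char.encode("utf-8")


for ch in "Að€":
    code_point, raw = describe(ch)
    # raw.hex(" ") prints the stored bytes as space-separated hex pairs
    print(f"{ch!r}: code point U+{code_point:04X}, UTF-8 bytes {raw.hex(' ')}")
```

The plain letter is stored as a single byte, while the accented letter and the euro sign need two and three bytes respectively.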
To properly render translated digital content, the correct character encoding must be used. For example, text with special characters that should look like this:
Character Encoding 101 by Kaðlín Örvardóttir
may instead display like this:
Character Encoding 101 by Ka▯l?n ▯rvard?ttir
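One way to reproduce this kind of damage, sketched in Python: the text is forced through an encoding that has no slot for the accented letters. The function name `force_ascii` is hypothetical, and ASCII is only one of several encodings that could cause the problem.

```python
def force_ascii(text: str) -> str:
    """Encode to ASCII, substituting '?' for every character ASCII lacks."""
    return text.encode("ascii", errors="replace").decode("ascii")


title = "Character Encoding 101 by Kaðlín Örvardóttir"
print(force_ascii(title))
```

Each accented letter in the author name is replaced by a question mark, while the unaccented text passes through unchanged.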
Here’s some history on character sets, followed by some tips on how to properly leverage them for your website translation projects.
Until the early 1960s, computer programmers created ad-hoc conventions to represent characters internally. Some computers distinguished between upper- and lower-case letters, but most did not. The technique worked because the information was typically processed from end to end in a single machine. Hence, there was no need for standardized character encoding.
However, once information exchange became an important consideration, programmers needed a standard code that allowed data to move between different computer models. This led to the development of ASCII (American Standard Code for Information Interchange).
In 1963, the ASCII character encoding scheme was established as a common code for representing English text, with each character assigned a numeric value from 0 to 127.
Most modern character encodings are based on the ASCII scheme and support many additional characters.
When the Windows operating system emerged in 1985, a new standard known as the ANSI character set was quickly adopted. The term “ANSI” referred to the Windows code pages (such as Code Page 1252), even though it had nothing to do with the American National Standards Institute.
Windows-1252 or CP-1252 (code page 1252) character encoding became popular with the advent of Microsoft Windows, but was eventually superseded when Unicode was implemented within Windows. Unicode, which was first released in 1991, assigns a universal code to every character and symbol for all the languages in the world.
The ISO-8859-1 (also known as Latin-1) character encoding shares most of its characters with Windows-1252. The two differ in the byte range 0x80 to 0x9F, where Windows-1252 adds an extended set of punctuation and business symbols and ISO-8859-1 reserves the values for control codes. This standard was easily transportable across multiple word processors, and even newly released versions of HTML 4.
The first edition was published in 1987, and was a direct extension of the ASCII character set. While support was extensive for its time, the format was still limited.
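To see how closely Windows-1252 and ISO-8859-1 are related, the sketch below decodes the same byte with both. It uses Python codec names `cp1252` and `latin-1`; the helper name `compare` is made up for this example. The two encodings agree on most bytes, but in the range 0x80 to 0x9F Windows-1252 places printable punctuation and symbols where ISO-8859-1 has control codes.

```python
def compare(byte_value: int) -> tuple:
    """Decode one byte as Windows-1252 and as ISO-8859-1."""
    raw = bytes([byte_value])
    return raw.decode("cp1252"), raw.decode("latin-1")


for b in (0x41, 0xF0, 0x80, 0x93):
    win, iso = compare(b)
    print(f"0x{b:02X}: Windows-1252 {win!r}, ISO-8859-1 {iso!r}")
```

Bytes 0x41 and 0xF0 decode to the same letters in both encodings. Bytes 0x80 and 0x93 decode to the euro sign and a curly quotation mark in Windows-1252, but to invisible control characters in ISO-8859-1.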
After the debut of ISO-8859-1, the Unicode Consortium regrouped to develop more universal standards for transportable character encoding.
UTF-8 (Unicode Transformation Format, 8-bit) is now the most widely used character encoding on the web. It is a method for mapping Unicode code points to bytes. UTF-8 was declared mandatory for website content by the Web Hypertext Application Technology Working Group (WHATWG), a community of people interested in evolving the HTML standard and related technologies.
UTF-8 was designed for full backward compatibility with ASCII.
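The backward compatibility can be checked directly. In this small Python sketch (the helper name `utf8_lengths` is hypothetical), pure ASCII text produces identical bytes under both encodings, and only characters outside ASCII take more than one byte.

```python
def utf8_lengths(text: str) -> list:
    """Return the number of UTF-8 bytes used by each character in text."""
    return [len(ch.encode("utf-8")) for ch in text]


ascii_text = "Character Encoding 101"
# Pure ASCII text yields the same bytes whether encoded as ASCII or UTF-8
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))
# Characters outside ASCII are stored as multi-byte sequences
print(utf8_lengths("Að€"))
```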
So it’s clear that each character set uses its own table of identification codes to present a specific character to a user. If you edit a document using the ISO-8859-1 character set and then save it as a UTF-8 encoded document without declaring that the content is UTF-8, special characters and business symbols will render as unreadable text.
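The effect of an undeclared or wrongly declared encoding can be simulated in a few lines of Python. This is only a sketch: `misread` is a made-up helper that saves text in one encoding and reads the bytes back in another, which is roughly what happens when a reader assumes the wrong character set.

```python
def misread(text: str, saved_as: str, read_as: str) -> str:
    """Encode text with one encoding, then decode the bytes with another."""
    return text.encode(saved_as).decode(read_as, errors="replace")


name = "Kaðlín"
# Saved as UTF-8 but read as ISO-8859-1: each accented letter becomes two characters
print(misread(name, "utf-8", "latin-1"))
# Saved as ISO-8859-1 but read as UTF-8: invalid bytes become replacement characters
print(misread(name, "latin-1", "utf-8"))
# Saved and read with the same encoding: the text survives intact
print(misread(name, "utf-8", "utf-8"))
```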
Most modern web browsers support legacy character encodings, so a website can contain pages encoded in ISO-8859-1, or Windows-1252, or any other type of encoding. The browser should properly render those characters based on the character encoding reported by the server.
However, if the character set is not correctly declared at the time the page is rendered, the web server usually falls back to sending the page without any specific character encoding (in practice, this is usually treated as ASCII).
This forces your browser or mobile device to determine the page’s proper type of character encoding. Based on the WHATWG specifications adopted by W3C, the most typical default fallback is UTF-8. However, some browsers will fall back to ASCII.
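A rough sketch of that decision in Python: read the charset declared in a `Content-Type` header and fall back to a default when none is present. The function names `declared_charset` and `decode_body` are hypothetical, the UTF-8 default is an assumption taken from the fallback described above, and real browsers apply more elaborate detection rules than this.

```python
from email.message import Message


def declared_charset(content_type: str, fallback: str = "utf-8") -> str:
    """Return the charset named in a Content-Type value, or the fallback."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter in lowercase,
    # or failobj when the header declares no charset
    return msg.get_content_charset(failobj=fallback)


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a response body using the declared (or fallback) charset."""
    return body.decode(declared_charset(content_type), errors="replace")


print(declared_charset("text/html; charset=ISO-8859-1"))
print(declared_charset("text/html"))
print(decode_body("Kaðlín".encode("latin-1"), "text/html; charset=ISO-8859-1"))
```

When the header names the encoding, the body decodes correctly. When it does not, the result depends entirely on whether the fallback happens to match how the page was saved.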
To ensure your users are always seeing the correct content on your HTML production pages, be sure:
Following these specifications will facilitate website translation into various languages without the need to decode and re-encode content into other character encodings across the multichannel media used on the web today.
While character encoding is essential for website localization, it’s actually part of a process known as internationalization. Often shortened to i18n, internationalization enables applications to input, process, and output international text. For multilingual websites, it ensures web pages are successfully localized into the target languages.
In the 1990s, internationalization support meant that an application could input, store, and output data in different character sets and encodings. For example, an English-speaking user could converse with you in Latin-1 while a Russian-speaking user could do so in KOI8-R.
Yet this system presented a few problems, such as the inability to present data in two different character sets on the same page. Additionally, each piece of data needed to be tagged with the character set it was stored in. That meant the HTML and all the content on the web page had to be output using the correct character set.
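The tagging requirement can be illustrated with a short Python sketch. Each stored message carries the name of the character set it was saved in, and the bytes are decoded with that tag before being combined. The sample messages and the function name `to_utf8_page` are invented for this example, and converting everything to UTF-8 is shown only as one way to place both users’ text on a single page.

```python
def to_utf8_page(tagged_messages: list) -> bytes:
    """Decode each (charset, bytes) pair with its tag and join them as UTF-8."""
    lines = [raw.decode(charset) for charset, raw in tagged_messages]
    return "\n".join(lines).encode("utf-8")


# Each piece of data is stored together with the charset it was saved in
messages = [
    ("latin-1", "Kaðlín says hello".encode("latin-1")),
    ("koi8_r", "Привет".encode("koi8_r")),
]

page = to_utf8_page(messages)
print(page.decode("utf-8"))
```

Without the tags, the Latin-1 and KOI8-R bytes would be indistinguishable, and decoding either message with the wrong character set would garble it.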