6.6a Guidelines for language encoding

Guideline

What is character encoding?

A character encoding assigns a numeric code to every meaningful character (letter, digit, space, punctuation mark, etc.) of a writing system so that text can be stored in memory. Each character is stored as a number (e.g. A = 65). When a user types, the keystrokes are converted to character codes. When the characters are displayed on the screen, the character codes are converted to the glyphs of a font.
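As a minimal illustration (a sketch in Python, which is not part of the original guide; the function names are hypothetical), the built-in ord() and chr() functions expose the number behind each character:

```python
def char_to_code(ch: str) -> int:
    """Return the numeric code stored for a single character."""
    return ord(ch)


def code_to_char(code: int) -> str:
    """Return the character that a numeric code stands for."""
    return chr(code)


print(char_to_code("A"))   # 65
print(code_to_char(65))    # A
print([char_to_code(c) for c in "Web"])
```

Which glyph is finally drawn for a code still depends on the font in use; the code only identifies the character.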

Most older character encoding standards share the same first 128 codes (ASCII) but assign the upper codes (128 to 255) differently for each language or platform, so the upper-level characters may include symbols, accented Roman letters, Cyrillic or other characters, depending on the character encoding chosen. For example, code 205 is Õ in the Macintosh Standard Roman Character Set, Í in the Windows Western encoding (windows-1252), and the double-line box-drawing character ═, which resembles an equal sign, in the DOS extended ASCII encoding (code page 437).
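The effect can be checked with a short Python sketch. It decodes the single byte 205 with three of Python's standard codecs ("mac_roman", "cp1252" and "cp437"); the helper name decode_in is hypothetical and used only for illustration:

```python
# One stored byte, value 205 (hexadecimal CD).
BYTE = bytes([205])


def decode_in(codec: str, data: bytes = BYTE) -> str:
    """Interpret the same bytes according to the given character encoding."""
    return data.decode(codec)


for codec in ("mac_roman", "cp1252", "cp437"):
    char = decode_in(codec)
    # Show the codec, the resulting character and its Unicode code point.
    print(f"{codec:10} {char}  U+{ord(char):04X}")
```

The byte is identical in all three cases; only the encoding used to read it changes, which is why a page must state its encoding.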

Character sets

The character set used for English, French and Spanish (and most other Western European languages) is ISO-8859-1.

For Asian languages that use ideographs instead of letters, different character sets should be used. The most common one used for Simplified Chinese is gb_2312-80 (usually declared as GB2312).

For Arabic, the character set is different for Windows (windows-1256) and Macintosh (ISO-8859-6). In addition, it is important to specify in the Web page that the text should be displayed from right to left:

<html dir="rtl" lang="ar">
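To see why the declared Arabic character set must match the one actually used, here is a small Python sketch. The sample word and the function names are assumptions chosen for illustration; "windows-1256" and "iso-8859-6" are codec names that Python accepts:

```python
# Hypothetical sample text: the Arabic word "salam", written with escapes
# so the source stays readable in left-to-right editors.
TEXT = "\u0633\u0644\u0627\u0645"


def encode_both(text: str):
    """Encode the same text in the two Arabic character sets."""
    return text.encode("windows-1256"), text.encode("iso-8859-6")


def misread(text: str) -> str:
    """Store text as windows-1256, then wrongly read it as ISO-8859-6."""
    return text.encode("windows-1256").decode("iso-8859-6", errors="replace")


win_bytes, iso_bytes = encode_both(TEXT)
print(win_bytes.hex(), iso_bytes.hex())  # the two byte sequences differ
print(misread(TEXT) == TEXT)             # False: the text comes out garbled
```

Some letters share a code in both sets and others do not, so a wrongly declared page can look partly correct and partly garbled.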

Unicode

UTF-8 (Unicode Transformation Format, 8-bit encoding form) is an encoding form of the Unicode Standard, the universal character encoding standard used to represent text for computer processing.

The Unicode Standard has the capacity to encode all of the characters used for the written languages of the world. Each Unicode code point refers unambiguously to a given character.
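As a rough sketch of what this means in practice, Python's standard unicodedata module can report the code point and official name of any character, and str.encode shows the UTF-8 bytes used to store it. The function names describe and utf8_bytes are hypothetical:

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the Unicode code point and official name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def utf8_bytes(ch: str) -> bytes:
    """Return the bytes that store the character in UTF-8."""
    return ch.encode("utf-8")


# One accented Roman letter, one Arabic letter, one Chinese character.
for ch in ("\u00e9", "\u0639", "\u4e2d"):
    print(describe(ch), utf8_bytes(ch).hex())
```

Unlike the language-specific character sets, each code point identifies the same character whatever the language of the page.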

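One disadvantage of Unicode is that text outside the basic ASCII range can take more space to store than in a language-specific character set, so transmitting it can use more bandwidth. In UTF-8, for instance, an accented Roman letter or an Arabic letter takes two bytes instead of one, and a Chinese character takes three bytes instead of two. Editing Web pages in Arabic or Chinese is also difficult if these languages are not installed on the computer in use, as the Unicode characters are then less intuitive to work with than accented characters.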
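The size difference can be measured directly. The following Python sketch compares the stored size of three short sample strings (assumptions chosen for illustration) in a language-specific character set and in UTF-8; the helper name byte_sizes is hypothetical:

```python
# (sample text, language-specific character set) pairs; the samples are
# hypothetical and chosen only to show the size difference.
SAMPLES = [
    ("\u00e9cole", "iso-8859-1"),                  # French "ecole" with e-acute
    ("\u0633\u0644\u0627\u0645", "windows-1256"),  # Arabic "salam"
    ("\u4e2d\u6587", "gb2312"),                    # Chinese "zhongwen"
]


def byte_sizes(text: str, legacy_codec: str):
    """Return (size in the legacy character set, size in UTF-8) in bytes."""
    return len(text.encode(legacy_codec)), len(text.encode("utf-8"))


for text, codec in SAMPLES:
    legacy, utf8 = byte_sizes(text, codec)
    print(f"{codec:13} {legacy} bytes   utf-8 {utf8} bytes")
```

Plain English text is unaffected, since ASCII characters take one byte in UTF-8 as well.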

Nevertheless, Unicode is being adopted more widely. When all software and operating system developers adopt Unicode, we will end up with the best standardized format for processing and exchanging multilingual documents.

Remember... Character Encoding Tips!

  • Declare the correct encoding in all your Web pages (for example, <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> in the page head), otherwise they may not display correctly on some computers! This is particularly important for Arabic and Chinese.
  • Make sure that all the accented characters are written as character entities (e.g. &eacute; instead of é), otherwise they will not display correctly on some computers! This applies particularly to French and Spanish.
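
The second tip can be automated. Below is a minimal Python sketch, not a definitive tool: the function name to_entities is hypothetical, and it relies on the codepoint2name table from Python's standard html.entities module. Characters without a named entity fall back to a numeric character reference.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with an HTML character entity.

    ASCII characters, including markup such as < and &, are left untouched,
    so the function can be applied to text that already contains HTML tags.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                           # plain ASCII: keep as is
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # named entity, e.g. &eacute;
        else:
            out.append(f"&#{code};")                 # numeric reference
    return "".join(out)


print(to_entities("caf\u00e9"))    # caf&eacute;
print(to_entities("se\u00f1or"))   # se&ntilde;or
```

The output contains only ASCII characters, so it displays the same way whichever character set the browser assumes.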