The Concept of Code Pages
Definitions
A code page is a table of values that describes the character set used for encoding a particular set of glyphs, usually combined with a number of control characters.
A character encoding represents a repertoire of characters by mapping each character to a number, which is then stored as one or more bytes.
A glyph is an elemental symbol within an agreed set of symbols intended to represent a readable character for the purposes of writing.
A control character, or non-printing character, is a code point (a number) in a character set that does not represent a written symbol. Control characters are used as in-band signaling to cause effects other than the addition of a symbol to the text.
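To make the definitions above concrete, here is a minimal Python sketch. It decodes the same byte value through three code pages that ship with Python's standard codecs (cp1252, cp437 and the EBCDIC-based cp037), and it uses the Unicode general category "Cc" as one possible test for control characters. The helper names `decode_with` and `is_control` are illustrative only.

```python
import unicodedata

# One byte value, interpreted through three different code pages.
RAW = bytes([0xE9])


def decode_with(code_page: str, data: bytes = RAW) -> str:
    """Look the bytes up in the table defined by the named code page."""
    return data.decode(code_page)


def is_control(ch: str) -> bool:
    """True if Unicode classifies the character as a control character (Cc)."""
    return unicodedata.category(ch) == "Cc"


if __name__ == "__main__":
    for name in ("cp1252", "cp437", "cp037"):
        # The same byte maps to a different character in each table.
        print(name, repr(decode_with(name)))
    # Tab and newline are control characters; a letter is not.
    print(is_control("\t"), is_control("\n"), is_control("A"))
```

The byte 0xE9 comes out as "é" under cp1252, "Θ" under cp437 and "Z" under cp037, which is why a byte stream is only meaningful together with the code page it was written in.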
Nowadays, most locales are UTF-8 based, which means a single character can take from 1 to 4 bytes. When you use text utilities on data that is meant to be treated as bytes, set LC_ALL=C. Doing so can also improve performance noticeably, because parsing UTF-8 data has a cost.
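The variable width of UTF-8 can be checked with a short Python sketch. The sample characters and the helper name `utf8_length` are chosen here for illustration; the point is only that the number of characters and the number of bytes differ as soon as the text leaves ASCII.

```python
# Sample characters that need 1, 2, 3 and 4 bytes in UTF-8.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji


def utf8_length(ch: str) -> int:
    """Number of bytes the given text occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")

    text = "na\u00efve caf\u00e9"
    # Character count and byte count diverge for non-ASCII text.
    print(len(text), "characters,", utf8_length(text), "bytes")
```

A tool that runs in a UTF-8 locale has to do this decoding work to find character boundaries, while a tool that runs in the C locale can treat every byte as one character.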
Unicode
SQL Server
Unicode is a standard for mapping code points to characters. Because it is designed to cover all the characters of all the languages of the world, there is no need for different code pages to handle different sets of characters. If you store character data that reflects multiple languages, always use Unicode data types (nchar, nvarchar, and ntext) instead of the non-Unicode data types (char, varchar, and text).
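The limitation of a single code page can be illustrated outside SQL Server with a small Python sketch. It is not SQL Server code: the code pages cp1252 and cp1251 stand in for non-Unicode column types, UTF-16 stands in for the encoding of a Unicode column, and the helper name `can_encode` is hypothetical.

```python
# German, Russian and Japanese greetings in one string.
TEXT = (
    "Gr\u00fc\u00dfe, "
    "\u041f\u0440\u0438\u0432\u0435\u0442, "
    "\u3053\u3093\u306b\u3061\u306f"
)


def can_encode(text: str, encoding: str) -> bool:
    """True if every character of the text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for encoding in ("cp1252", "cp1251", "utf-16-le", "utf-8"):
        print(encoding, can_encode(TEXT, encoding))
```

Neither the Western European code page (cp1252) nor the Cyrillic one (cp1251) can hold the whole string, while both Unicode encodings can.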
Non-Unicode data types come with significant limitations, because a non-Unicode computer is limited to a single code page. You might see a performance gain by using Unicode, because fewer code-page conversions are required. Unicode collations must be selected individually at the database, column, or expression level, because they are not supported at the server level.
The code pages that a client uses are determined by the operating system settings. To set client code pages on the Windows operating system, use Regional Settings in Control Panel.
Code Pages
- EBCDIC-based code pages
- ISO/IEC 646-related code pages
- ISO/IEC 10646 / Unicode code pages
Linux
To determine the active locale, and with it the character encoding in use, on a Linux system, run:
locale
The output will look like:
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
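If the output has to be read by a script rather than by a person, it can be split into name/value pairs. The following Python sketch assumes output in the NAME=value form shown above; `parse_locale_output` and `codeset` are hypothetical helper names, and the sample text is a shortened copy of the output, not the result of a live call.

```python
# Shortened copy of the sample output; a real script would capture it
# from the `locale` command instead.
SAMPLE = 'LANG=en_US.UTF-8 LC_CTYPE="en_US.UTF-8" LC_ALL=en_US.UTF-8'


def parse_locale_output(text: str) -> dict:
    """Turn NAME=value tokens (one per line or space-separated) into a dict."""
    settings = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        if sep:  # skip anything that is not NAME=value
            settings[name] = value.strip('"')
    return settings


def codeset(locale_name: str) -> str:
    """Return the codeset part of a name like 'en_US.UTF-8' ('' if absent)."""
    _, _, rest = locale_name.partition(".")
    # Drop an optional modifier such as '@euro'.
    return rest.partition("@")[0]


if __name__ == "__main__":
    settings = parse_locale_output(SAMPLE)
    print(settings["LC_CTYPE"], "->", codeset(settings["LC_CTYPE"]))
```

The codeset part after the dot is what corresponds most closely to a code page: here it is UTF-8.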
Some applications use the following variables:
- LC_ALL
- LC_CTYPE
- LANG
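When you set LC_ALL, its value overrides LANG and every individual LC_* category variable (LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME, and so on).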
LANG
This variable determines the locale category for native language, local customs and coded character set in the absence of the LC_ALL and other LC_* (LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME) environment variables. This can be used by applications to determine the language to use for error messages and instructions, collating sequences, date formats, and so forth.
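The lookup order described above (LC_ALL first, then the category variable, then LANG) can be written down as a short Python sketch. The function name `effective_locale` is hypothetical, the environment is passed in as a plain dictionary so that the example does not depend on the machine it runs on, and the fallback to "C" when nothing is set is an assumption: the default in that case is up to the implementation.

```python
def effective_locale(category: str, env: dict) -> str:
    """Resolve the locale used for one category such as 'LC_COLLATE'.

    Precedence: LC_ALL, then the category variable itself, then LANG.
    Empty values are treated like unset variables.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    # Assumption: fall back to the C locale when nothing is set.
    return "C"


if __name__ == "__main__":
    env = {"LANG": "en_US.UTF-8", "LC_COLLATE": "de_DE.UTF-8"}
    print(effective_locale("LC_COLLATE", env))  # category variable wins over LANG
    print(effective_locale("LC_TIME", env))     # no LC_TIME, so LANG is used
    env["LC_ALL"] = "C"
    print(effective_locale("LC_COLLATE", env))  # LC_ALL overrides everything
```

In a real program the dictionary would typically be `os.environ`.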
LANG=C
The C locale is a special locale that is meant to be the simplest locale. You could also say that while the other locales are for humans, the C locale is for computers. In the C locale, characters are single bytes, the charset is ASCII (it is not required to be, but in practice it will be on the systems most of us will ever get to use), the sorting order is based on the byte values, the language is usually US English (though for application messages, as opposed to things like month or day names or messages by system libraries, that is at the discretion of the application author), and things like currency symbols are not defined.
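The byte-value sorting order of the C locale can be imitated in Python by comparing the encoded bytes instead of the characters. This is only a sketch of the ordering rule, not a call into the system's locale machinery; the word list and the helper name `c_locale_sort` are made up for the example.

```python
WORDS = ["banana", "Apple", "apple", "Zebra", "\u00e9clair"]


def c_locale_sort(items):
    """Sort strings by the raw values of their UTF-8 bytes."""
    return sorted(items, key=lambda s: s.encode("utf-8"))


if __name__ == "__main__":
    # Uppercase ASCII letters (0x41-0x5A) come before lowercase ones
    # (0x61-0x7A), and non-ASCII characters (bytes >= 0x80) come last.
    print(c_locale_sort(WORDS))
```

A locale for humans, such as en_US.UTF-8, would typically place "apple" next to "Apple" and treat the accented word close to the letter "e", which is exactly the kind of language-aware behavior the C locale leaves out.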