iconv_intro, iconv - Introduction to codeset conversion
Conversion of character encoding from one coded character set (codeset) to another is an operation that often has to be performed by the operating system and some applications. For example, the man command supports codeset conversion to allow one set of reference page files to meet the needs of locales that support the same language and territory but different codesets (see man(1)).
The following commands and library interfaces give users and application developers direct access to codeset conversion operations:
    The iconv command converts characters in a data file from one codeset to another (see iconv(1)).
    The iconv(), iconv_open(), and iconv_close() functions convert a string of characters from one codeset to another (see iconv(3), iconv_open(3), and iconv_close(3)). The iconv command uses these interfaces to convert characters.
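The from-codeset/to-codeset idea behind these interfaces can be sketched in a few lines. The example below is only an illustration in Python, whose built-in codecs stand in for the C iconv(), iconv_open(), and iconv_close() functions. The convert() helper is hypothetical, and the codeset names are Python's, which may differ from the names accepted on a given system.

```python
def convert(data: bytes, from_codeset: str, to_codeset: str) -> bytes:
    """Convert bytes from one codeset to another (illustrative helper)."""
    # Step 1: interpret the input bytes according to the source codeset.
    text = data.decode(from_codeset)
    # Step 2: re-encode the same characters in the destination codeset.
    return text.encode(to_codeset)


# "café" encoded in ISO 8859-1: the last character is the single byte 0xE9.
latin1_bytes = b"caf\xe9"
utf8_bytes = convert(latin1_bytes, "iso-8859-1", "utf-8")
```

In UTF-8 the same character occupies two bytes, so the converted string is one byte longer than the input.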
There are two types of codeset converters: algorithmic and table. Algorithmic converters, which reside in the /usr/lib/nls/loc/iconv directory, are shared libraries with a predefined entry point for invocation by functions in the libiconv.so library. Algorithmic converters are needed for the conversion of multibyte codesets, in part because table converters cannot handle the required number of character values and also because some of these codesets require complex handling (see NOTES). Algorithmic converters are supplied as part of the operating system product; the internal interfaces that they require are not published for external use.
Table converters, which reside in the /usr/lib/nls/loc/iconvTable directory, can be created by using the genxlt command (see genxlt(1)). These converters can support single-byte codesets and up to 256 encoded character values.
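The 256-value limit follows from how a single-byte table works: each possible input byte indexes one slot that holds the output byte. The sketch below shows that principle in Python. It does not reproduce the genxlt source format or the on-disk converter format. The build_table() helper, the choice of the cp437 and iso-8859-1 codecs, and the "?" substitute are all assumptions made for illustration.

```python
def build_table(from_codeset: str, to_codeset: str,
                substitute: bytes = b"?") -> bytes:
    """Build a 256-entry byte-to-byte table (illustrative helper)."""
    table = bytearray(256)
    for value in range(256):
        try:
            char = bytes([value]).decode(from_codeset)
            out = char.encode(to_codeset)
        except UnicodeError:
            # No counterpart in one of the codesets: use the substitute.
            out = substitute
        # A single-byte table can hold exactly one output byte per slot.
        table[value] = out[0] if len(out) == 1 else substitute[0]
    return bytes(table)


table = build_table("cp437", "iso-8859-1")
# 0x82 is "é" in code page 437; ISO 8859-1 encodes it as 0xE9.
converted = b"caf\x82".translate(table)
```

Conversion is then a single table lookup per byte, which is why this approach cannot serve multibyte codesets.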
Names of codeset converters are in the following form:
from-codeset_to-codeset
For example, the following converter converts values from Super DEC Kanji to Japanese Extended UNIX Code:
sdeckanji_eucJP
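As a small illustration of the naming form, a converter name can be assembled from the two codeset names. The converter_name() helper below is hypothetical and covers only the name itself; it makes no assumption about file suffixes or lookup order.

```python
def converter_name(from_codeset: str, to_codeset: str) -> str:
    """Join two codeset names in the from-codeset_to-codeset form."""
    return f"{from_codeset}_{to_codeset}"


name = converter_name("sdeckanji", "eucJP")
```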
The codeset converters produce an invalid character error in response to characters that cannot be converted from the source codeset to the destination codeset. This error is always produced for character codes that are invalid in the source codeset. However, if the error results from characters that are valid in the source codeset but have no counterparts in the destination codeset, you can eliminate the error by defining the ICONV_DEFSTR environment variable to specify a substitute output string. See the ENVIRONMENT VARIABLES section for more information about using the ICONV_DEFSTR variable.
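The two error cases can be told apart in a short sketch. The Python example below only emulates the behavior described for ICONV_DEFSTR. It does not read that variable, and the convert_with_default() helper and its defstr parameter are hypothetical. Encoding one character at a time is an assumption that suits stateless codesets such as ISO 8859-1.

```python
from typing import Optional


def convert_with_default(data: bytes, from_codeset: str, to_codeset: str,
                         defstr: Optional[str] = None) -> bytes:
    """Convert data, optionally substituting defstr (illustrative helper)."""
    # Bytes that are invalid in the source codeset always raise an error
    # here; the substitute string never applies to them.
    text = data.decode(from_codeset)
    out = bytearray()
    for char in text:
        try:
            out += char.encode(to_codeset)
        except UnicodeEncodeError:
            # Valid in the source codeset, but no counterpart in the
            # destination codeset.
            if defstr is None:
                raise
            out += defstr.encode(to_codeset)
    return bytes(out)


# UTF-8 input "naïve €": the euro sign has no counterpart in ISO 8859-1.
source = b"na\xc3\xafve \xe2\x82\xac"
result = convert_with_default(source, "utf-8", "iso-8859-1", defstr="?")
```

Without a substitute string the same call raises an error at the euro sign, which corresponds to the invalid character error described above.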
It is possible to convert data directly between two codesets or by way
of an intermediate codeset, such as UCS-2, UCS-4, or UTF-8. For conversion
of Chinese characters, be aware that the results of converting a Traditional
Chinese codeset directly to a Simplified Chinese codeset may not be the same
as the results of converting Traditional Chinese first to UCS-2, UCS-4, or
UTF-8 and then to Simplified Chinese.
Some codeset converters require more complex algorithms than can be provided through tables.
The following environment variables provide control over conversion behavior for different kinds of codeset converters:
ICONV_ACTION
    Controls the behavior for the many-to-one value conversions for conversion of Traditional Chinese (except for Traditional Chinese encoded in Telecode) to Simplified Chinese. The valid settings for this environment variable are as follows:
        batch
            Specifies that the preferred mapping value (the first one in the one-to-many mapping list) is always taken. The batch setting is the ICONV_ACTION default.
        conv_all
            Specifies that all the possible values are printed to the standard output, enclosed by braces ({ }), so that the user can later manually edit the converted file and select the one to use.
        conv_all_nosym
            Specifies that all the possible values are printed to the standard output except for punctuation symbols, for which only the preferred mapping value is printed. As is true for conv_all, the conv_all_nosym setting prints value choices enclosed by braces so that the converted file can later be edited.
A second variable sets byte ordering for UCS-2 or UCS-4 converters only. Valid values are little-endian (the default) or big-endian. Setting this environment variable may be necessary when producing UCS-2 or UCS-4 output that will be processed by codeset converters on platforms other than Tru64 UNIX.
ICONV_DEFSTR
    Defines the default string to be substituted in output for valid input characters that cannot be converted from the source codeset to the destination codeset. The variable value can be an arbitrary string or a code number. If the value is a code number (for example, 10, 07, 0x10, or, for Unicode converters, U+1234), the corresponding character in the output codeset (to-codeset) is printed.
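The effect of the byte-ordering setting on UCS-2 output can be seen by encoding the same character both ways. The Python sketch below is an illustration only. The encode_ucs2() helper is hypothetical, and its byteorder parameter merely borrows the two values named above; it does not read any environment variable.

```python
import struct


def encode_ucs2(text: str, byteorder: str = "little-endian") -> bytes:
    """Encode text as UCS-2 in the given byte order (illustrative helper)."""
    # "<H" and ">H" pack an unsigned 16-bit value in little-endian and
    # big-endian order respectively.
    fmt = {"little-endian": "<H", "big-endian": ">H"}[byteorder]
    out = bytearray()
    for char in text:
        code = ord(char)
        if code > 0xFFFF:
            raise ValueError("UCS-2 cannot represent code points above U+FFFF")
        out += struct.pack(fmt, code)
    return bytes(out)


little = encode_ucs2("\u1234")               # low-order byte first
big = encode_ucs2("\u1234", "big-endian")    # high-order byte first
```

A reader that assumes the opposite order would interpret these two bytes as U+3412 rather than U+1234, which is why the setting matters when output moves between platforms.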
Algorithmic converters: /usr/lib/nls/loc/iconv
Table converters: /usr/lib/nls/loc/iconvTable
Phrase conversion databases
Commands: genxlt(1), iconv(1), phrase(1)
Functions: iconv(3), iconv_close(3), iconv_open(3)
Others: i18n_intro(5), l10n_intro(5)