Glossary

abstract character repertoire

a set of characters that are used in a writing system to be encoded. Types of characters include letters of the alphabet, digits, and logographs. Examples of abstract character repertoires are Western European alphabets and the Japanese kanji and kana syllabaries.

accent mark

an auxiliary glyph that is combined with a letter to alter the pronunciation of the letter. For example, the character á results from combining the letter a with the acute accent (´). The character ñ is formed from the letter n with the tilde (~).

accented character

a type of character that is modified by the addition of an accent mark that alters the pronunciation of the character. An example is ñ, which results from combining the tilde (~) with the character n.

alphabet

a set of characters that represent the phonemes of a language in writing. A true alphabet contains separate letters for both consonants and vowels; examples are the Latin and the Cyrillic alphabets.

American Standard Code for Information Interchange

a 7-bit encoding standard that provides a basic set of 128 characters, supporting a variety of computer systems. ASCII encodes the uppercase and lowercase letters of the English alphabet, punctuation marks, the digits 0–9, and control characters. This set of 128 characters is also included in most other encodings. Short form: ASCII.

Arabic script

a script for writing the Arabic, Urdu, or Persian (Farsi) languages. The Arabic script is bidirectional.

ASCII

See American Standard Code for Information Interchange.

bidirectional text

pertaining to a writing system such as Arabic and Hebrew that generally runs from right to left, except for numbers and embedded text written in other languages that run from left to right. Example: Abdel dit, “كيف الحال” (Ça va?).

bidi

See bidirectional text.

Big5

the name of an encoding for the Traditional Chinese character set. It is used in Taiwan, Hong Kong, and Macau.

BOM

See byte-order mark.

byte-order mark

the Unicode character that indicates the byte order of the Unicode text that follows in the text file or stream. The code point of the byte-order mark is U+FEFF (hexadecimal). Short form: BOM.
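
As a minimal sketch of how the byte-order mark signals byte order, the Python snippet below encodes U+FEFF in both UTF-16 byte orders and checks which form a byte stream starts with. The helper name `detect_utf16_order` is hypothetical, and the sketch deliberately ignores UTF-8 and UTF-32 signatures.

```python
import codecs

BOM = "\ufeff"  # the byte-order mark code point

# The same code point serializes differently depending on byte order.
big_endian = BOM.encode("utf-16-be")     # bytes FE FF
little_endian = BOM.encode("utf-16-le")  # bytes FF FE


def detect_utf16_order(data: bytes) -> str:
    """Hypothetical helper: report the UTF-16 byte order implied by a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return "unknown"


stream = (BOM + "A").encode("utf-16-le")
print(detect_utf16_order(stream))
```

A reader that finds no BOM has to rely on other information, such as a declared encoding, which is why the helper returns "unknown" rather than guessing.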

CCS

See coded character set.

CCSID

See coded character set identifier.

CEDA

See Cross-Environment Data Access.

CES

See character encoding scheme.

character

the smallest component of a writing system that has a semantic value, such as the letters of an alphabet, digits, or logographs.

character encoding

See coded character set.

character encoding scheme

a map of a coded character set to a sequence of binary codes. The ISO Latin-1 coded character set provides the Western European alphabet and symbols and their nonnegative integer representations; a character encoding scheme supplies their binary equivalents. For example, the letter Ä has the code point 00C4 (hexadecimal). The ISO Latin-1 encoding scheme stores it as the single byte 11000100 (binary), whereas the UTF-8 encoding scheme stores it as the two bytes 11000011 10000100 (binary). Short form: CES.
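
The distinction between a code point and its binary form can be illustrated with a short Python sketch. It assumes Python's built-in `latin-1` and `utf-8` codecs as stand-ins for two encoding schemes; the helper `to_binary` is a hypothetical name used only here. Note that the two-byte pattern 11000011 10000100 is the UTF-8 form of Ä, while single-byte ISO Latin-1 stores the same code point as one byte.

```python
letter = "Ä"

# The coded character set assigns the number; the encoding scheme decides the bytes.
code_point = ord(letter)                 # 0xC4
latin1_bytes = letter.encode("latin-1")  # one byte
utf8_bytes = letter.encode("utf-8")      # two bytes


def to_binary(data: bytes) -> str:
    """Hypothetical helper: show each byte as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in data)


print(f"code point: {code_point:04X}")
print("ISO Latin-1:", to_binary(latin1_bytes))
print("UTF-8:      ", to_binary(utf8_bytes))
```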

character set

See coded character set.

Chinese, Japanese, and Korean

the collective group of languages Chinese, Japanese, and Korean. Short form: CJK.

Chinese, Japanese, Korean, and Vietnamese

the collective group of languages Chinese, Japanese, Korean, and Vietnamese. Short form: CJKV.

Chinese National Standards

the standards that are promoted by the National Bureau of Standards of the Republic of China (Taiwan). The CNS are analogous to such international standards as ISO and IEC. Short form: CNS.

CJK

See Chinese, Japanese, and Korean.

CJKV

See Chinese, Japanese, Korean, and Vietnamese.

CNS

See Chinese National Standards.

code page

See coded character set.

code table

See coded character set.

coded character repertoire

See coded character set.

coded character set

a map of an abstract character repertoire to a set of numeric values. The ISO Latin-1 coded character set provides the Western European alphabet and symbols and their numeric representations. For example, the letter Ä is represented as C4 (hexadecimal). Short forms: CCS, character set.

coded character set identifier

in IBM terminology, a 16-bit number that represents a specific set of encoding scheme identifiers, character set identifiers, code page identifiers, and other information that uniquely identifies the coded graphic character representation. Example: CCSID 37 (decimal) encompasses the encoding scheme 1100, the character set 00697, and the code page 00037. Short form: CCSID.

code point

the position of a character in a coded character set marked by a unique number (nonnegative integer).

code position

a synonym for a code point used in the ISO character encoding standards.

Cross-Environment Data Access

a feature of SAS software that enables a SAS data file that was created in any directory-based operating environment (for example, Solaris, Windows, HP-UX, OpenVMS) to be read by a SAS session that is running in another environment. You can access the SAS data files without using any intermediate conversion steps. Short form: CEDA.

CS

See Simplified Chinese characters.

DATA step

in a SAS program, a group of statements that begins with a DATA statement and that ends with either a RUN statement, another DATA statement, a PROC statement, or the end of the job. The DATA step enables you to read raw data or other SAS data sets and to create SAS data sets.

DBCS

See double-byte character set.

diacritic

See accent mark.

digraph

(1) two characters used together to represent a single uttered sound, such as sh in the English language.

(2) in computing, a two-character combination used for representing a single character that is not on the keyboard. For example, in SAS, the CHARCODE system option enables the display of the grave accent (`) using the question mark and colon (?:) keystrokes.

double-byte character set

a character set that requires a variable-width encoding because many characters occupy two bytes of memory. The term DBCS, as traditionally applied to languages such as Japanese, Korean, and Chinese, is somewhat misleading because some DBCS characters actually occupy only one byte. Short form: DBCS.
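
The variable width can be observed directly. The sketch below assumes Python's `shift_jis` codec as one representative "double-byte" encoding and measures how many bytes a few sample characters occupy; the sample set and the helper `shift_jis_width` are illustrative choices, not part of any standard.

```python
# Sample characters, written as escapes so the code points are unambiguous.
samples = {
    "A": "Latin capital letter A",
    "\uff71": "half-width katakana A",
    "\u3042": "hiragana A",
    "\u5c71": "kanji for mountain",
}


def shift_jis_width(ch: str) -> int:
    """Hypothetical helper: number of bytes one character occupies in Shift-JIS."""
    return len(ch.encode("shift_jis"))


widths = {ch: shift_jis_width(ch) for ch in samples}
for ch, description in samples.items():
    print(f"{description}: {widths[ch]} byte(s)")
```

The Latin letter and the half-width katakana occupy one byte each, while the hiragana and kanji occupy two, which is why the encoding is variable-width despite the "double-byte" label.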

EBCDIC

See Extended Binary Coded Decimal Interchange Code.

encode

to represent data in a particular character encoding scheme. For example, in ASCII, the letter A is represented as 41 (hexadecimal).
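
A brief Python sketch of the same idea: the letter A is encoded with the built-in `ascii` codec and, for contrast, with `cp037`, an EBCDIC code page that Python ships. The choice of `cp037` is an assumption made for illustration.

```python
letter = "A"

# Encoding turns the character into bytes under a specific scheme.
ascii_hex = letter.encode("ascii").hex()   # ASCII representation
ebcdic_hex = letter.encode("cp037").hex()  # EBCDIC (code page 37) representation

print("ASCII: ", ascii_hex)
print("EBCDIC:", ebcdic_hex)
```

The same character yields different byte values under the two schemes, so data must be decoded with the scheme that was used to encode it.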

encoding

a mapping of a coded character set to code values.

encoding method

the application of established industry rules to a coded character set to produce an encoded character scheme. Such rules prescribe the number of bits required for storing the numeric representation of a specific character and its code position in the encoding. ISO 2022 and UTF-8 are examples of encoding methods.

EUC

See extended UNIX code.

extended ASCII characters

See extended characters.

Extended Binary Coded Decimal Interchange Code

a family of single-byte and multi-byte encodings for the representation of data on IBM mainframe and mid-range computers. EBCDIC encodes the uppercase and lowercase letters of the English alphabet, punctuation marks, the digits 0–9, and an extended set of control characters. Short form: EBCDIC.

extended characters

the grouping of 8-bit characters that occupy code points outside the traditional 128 code-point range in 7-bit ASCII. Such extensions support characters from non-English languages. Various proprietary extensions have been developed by manufacturers of computers sold in international markets. For example, the ISO/IEC 8859-1 encoding includes the extended character À (the Latin letter A with the grave accent) in code location 192.
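
The following Python sketch checks the code location of À in ISO/IEC 8859-1 and shows that the character has no place in 7-bit ASCII. As an additional assumption for illustration, it decodes the same byte value with `cp437`, one vendor code page, to show that extended positions are not interpreted uniformly.

```python
extended = "À"

# In ISO/IEC 8859-1 the character occupies a position above the 7-bit range.
position = extended.encode("iso-8859-1")[0]

# 7-bit ASCII has no code point for it.
try:
    extended.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# The same byte value means something else in another 8-bit code page.
same_byte_in_cp437 = bytes([position]).decode("cp437")

print(position, fits_in_ascii, same_byte_in_cp437)
```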

extended UNIX code

a multi-byte encoding scheme used to encode, in most cases, the Simplified Chinese, Traditional Chinese, Japanese, and Korean writing systems on UNIX systems. Short form: EUC.

font

a typeface with a specific character shape, spacing, weight, and size. The characters in a font can be figures, symbols, or alphanumeric.

GB

See Guojia Biaozhun.

GB 18030

the name of the coded character set that supports all writing systems used in the People’s Republic of China, including the Simplified Chinese characters as well as the Traditional Chinese characters plus some characters from ethnic minority groups such as Mongolian and Tibetan.

GB 2312

the name of a character set of the People’s Republic of China, preceding GB 18030 and GBK, but still in widespread use. GB 2312 includes Simplified Chinese characters, Japanese kana, the Greek and Cyrillic alphabets, and pinyin letters with tone marks.

GBK

the name of an extension of the GB 2312 character set. It includes all the characters of GB 2312, and additionally Traditional Chinese characters and characters that were simplified after the establishment of GB 2312 in 1981.

glyph

the most basic element (a grapheme or combination of graphemes) of a typeface or font that is used to render text in a writing system.

glyph collection

the set of glyphs that constitute the particular font.

glyph image

an image of a glyph that is rendered from a glyph representation on a presentation medium, such as on paper, in printed output, or on a computer screen.

glyph metrics

the information from a glyph representation that is used for defining the dimensions and positioning of the glyph shape on a presentation medium such as a computer screen.

glyph presentation

the delivery of a glyph to an output medium such as a writing system, a printer, or a computer screen.

glyph representation

the shape and metrics that are associated with a specific glyph in a font resource.

glyph shape

the information in a glyph representation for defining the shape of a particular glyph.

grapheme

the arrangement of one or more consecutive characters in a writing system that represents a phoneme. For example, in the English language, the graphemes f in “fin,” the ph in “phantom,” and the gh in “laugh” represent the f phoneme.

graphic character

a type of character that can be written, printed, or displayed.

Guojia Biaozhun

the pinyin transcription of the Chinese words 国家标准, which means “National Standards.” Short form: GB.

han

the name of the comprehensive set of characters that are common to the writing systems of the Chinese, Japanese, Korean, and Vietnamese languages.

Han unification

the identification and collection of all the han characters that are common to the writing systems of the Chinese, Japanese, Korean, and Vietnamese languages.

hangul

the name of the native writing system for the Korean language. Hangul contrasts with the hanja writing system that was borrowed from China. Hangul characters are alphabetic but are grouped as syllabic blocks.

hanja

the name of the set of Chinese characters that are used in the Korean language. Hanja characters are used to represent words or morphemes rather than syllables.

hiragana

the cursive form of the Japanese syllabary. Hiragana (along with katakana, kanji, and, in some cases, the Latin alphabet) is a component of the Japanese writing system.

high ASCII characters

See extended characters.

I18N

See internationalization.

IBM 3270

a family of block-oriented terminals (information display systems) made by IBM until the middle of the 1990s. They were used to communicate with IBM mainframe computers. The IBM 3270 protocol is still commonly used via terminal emulation to access mainframe-based applications.

ICU

the open-source project containing C/C++ and Java libraries that provide Unicode and globalization support for software applications.

ideogram

a graphic symbol that represents an idea or concept. Often but inaccurately used as a synonym for logogram.

ideograph

See ideogram.

IEC

See International Electrotechnical Commission.

IME

See input method editor.

input method editor

a type of software that facilitates the entry of characters from an alternate writing system through a keyboard. The Dayi keyboard is an example of an input method that maps Chinese characters to a U.S. English keyboard. Short form: IME.

International Components for Unicode

See ICU.

international configuration option

the group of SAS options that relate to customizations for language, country, and cultural conventions in the user’s environment. LOCALE= is one such option.

International Electrotechnical Commission

a non-profit, non-governmental organization for the preparation and publication of International Standards for all electrical, electronic and related technologies—collectively known as “electrotechnology.” Short form: IEC.

International Organization for Standardization

an organization that promotes the development of standards and that sponsors related activities that help to disseminate products and services among nations. Also, it supports the exchange of intellectual, scientific, and technological information. Short form: ISO.

internationalization

the process of designing a software product without making assumptions that are based on a single language or locale. Internationalization ensures that international conventions (including rules for sorting strings and for formatting dates, times, numbers, and currencies) are supported. It also facilitates a consistent user experience across different language editions of a product. Short form: I18N.

ISO

See International Organization for Standardization.

ISO 646

a 7-bit encoding that is defined by the ISO 646 standard. The encoding contains both the 116 invariant ASCII code positions and the 12 variant code positions that can be replaced by national characters. For example, code position 23 (hexadecimal) is reserved for a variant character. In the U.S., the pound sign (#) occupies this position. In the UK, the pound monetary symbol (£) is specified. The national variants of ISO 646 are largely obsolete.

ISO 8859 family

a group of 16 8-bit encodings that are defined in the ISO 8859 standard. Each encoding contains both the 128 ASCII characters and the 128 extended characters, which are used in the language or languages that are supported by the encoding. For example, ISO 8859-1, also called Latin-1, is a commonly used encoding in the ISO 8859 family that contains the ASCII characters as well as characters used by Western European languages.

jamo

the name of the individual character unit in the hangul alphabet. Two or more jamo are combined to form a syllable. Examples of jamo: ㅎ (h), ㅏ (a), ㄴ (n).

kana

the name that encompasses the two Japanese syllabaries, katakana and hiragana, which are components of the modern Japanese writing system.

kanji

the name of the set of Chinese characters that are used in the modern Japanese writing system. Kanji (along with hiragana, katakana, and, in some cases, the Arabic numerals) is a component of the modern Japanese writing system.

katakana

the name of the syllabary that is used to write words of foreign origin in the Japanese language. Katakana (along with hiragana, kanji, and, in some cases, the Latin alphabet) is a component of the Japanese writing system.

L10N

See localization.

locale

a setting that reflects the language, local conventions, and culture for a geographic region. Local conventions can include specific formatting rules for paper sizes, dates, times, and numbers, and a currency symbol for the country or region. Some examples of locale values are French_Canada, Portuguese_Brazil, and Chinese_Singapore.

localization

the process of adapting software for a particular geo-cultural region (locale). Translation of the user interface, system messages, and documentation is a large part of the localization process. Short form: L10N.

logical order

bidirectional text stored in memory in the order in which it is normally typed (in contrast with visual order). The software needs to make sure it is rendered in the correct visual display.
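
As a small illustration of logical order, the Python sketch below stores a mixed Latin, Hebrew, and digit string in the order it would be typed and looks up each character's bidirectional class with `unicodedata.bidirectional`. The sample string is an assumption chosen for illustration. The Python standard library does not reorder text for display, so the sketch only inspects the properties that a rendering layer would use.

```python
import unicodedata

# Stored in logical (typing) order: Latin letters, three Hebrew letters, digits.
text = "abc \u05d0\u05d1\u05d2 123"

# Each character carries a bidirectional class that a renderer consults.
classes = [unicodedata.bidirectional(ch) for ch in text]

for ch, bidi_class in zip(text, classes):
    print(f"U+{ord(ch):04X} {bidi_class}")
```

The Hebrew letters report a right-to-left class while the Latin letters and digits do not, which is the information a display layer needs to produce the visual order.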

logogram

a visual symbol that represents a word or morpheme rather than a speech sound. An example of a logogram in the Chinese language is 山, for the word “mountain.”

logograph

See logogram.

MBCS

See multi-byte character set.

mojibake

the occurrence of incorrect, unintelligible characters displayed on a screen or printed on paper when software fails to correctly handle encodings or fonts in the data stream. The term comes from the Japanese 文字 (moji) “character” + 化け (bake) “change,” and literally means “ghost characters” or “disguised characters.”
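
One common way mojibake arises is decoding bytes with a different encoding than the one used to write them. The Python sketch below reproduces this with a sample word, assuming UTF-8 as the writer's encoding and Windows code page 1252 as the reader's mistaken choice.

```python
original = "café"

# The writer stores the text as UTF-8 bytes.
stored = original.encode("utf-8")

# The reader wrongly assumes a single-byte Western European code page.
garbled = stored.decode("cp1252")

# If no bytes were lost, reversing the wrong step recovers the text.
recovered = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(recovered)
```

Recovery works here only because every byte happened to map to a cp1252 character; in general the damage may be irreversible once the garbled text has been saved or transformed.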

monospaced

used to describe a font whose characters are all the same width. Example: the characters l and m are the same width in Courier New.

morpheme

the smallest unit of speech that has semantic meaning; it may be a word such as man, or a component of a word, such as -ed in walked.

multi-byte character set

a character set that requires a variable-width encoding because some characters occupy more than one byte of memory. DBCS and MBCS are sometimes used interchangeably, but MBCS is more accurate for describing the character sets of languages such as Japanese, Korean, and Chinese. Short form: MBCS.

multilingual

software that supports more than one natural language simultaneously, enabling the user to access multilingual content, but usually from a particular locale.

national character

a type of character that is specific to a language as it is written in a nation or group of nations. For example, in the Spanish language, the character ñ is produced by inserting the tilde (~) over the letter n.

national language support

the set of features that enable a software product to function properly in every global market for which the product is targeted. Short form: NLS.

natural language

language used by humans that can be spoken, written, or visually transmitted (by sign patterns). Natural language is distinguished from a controlled language or a computer language.

NLS

See national language support.

nonspacing character

an accent character that is meaningful only when combined with an alphabetic character in the same position in a code table. An example is the tilde (~) that appears over the letter n to produce the character ñ in a particular coded character set.
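
In Unicode, the closest counterpart is a combining mark. The Python sketch below assumes the Unicode combining tilde (U+0303) as the nonspacing character and uses `unicodedata.normalize` to compose it with the letter n; this illustrates the Unicode mechanism, not the behavior of any particular legacy code table.

```python
import unicodedata

base = "n"
combining_tilde = "\u0303"  # nonspacing mark in Unicode

# Two code points that render as one accented character.
decomposed = base + combining_tilde

# NFC normalization composes them into the single precomposed character.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(unicodedata.category(combining_tilde))  # general category of the mark
```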

phoneme

the smallest unit of uttered sound that corresponds to the distinct pronunciation of a letter in a specific language or dialect. Example: In the English language, the f in “fat” and the p in “pat” produce two different sounds.

pictogram

See pictograph.

pictograph

a sign or symbol that visually represents an object. To be understood in an international context, a pictograph needs to be culturally neutral and give unambiguous information. Pictographs are often used in hospital, airport, and other public signage systems. The term is sometimes applied to scripts such as the Egyptian hieroglyphs or Chinese characters in their earliest forms, which consist of symbols depicting objects or actions. A standard set of pictographs was defined in the international standard ISO 7001: Public Information Symbols. Unicode Block “Miscellaneous Symbols And Pictographs” was introduced with version 6.0.0 of the Unicode Standard, and is located in the Supplementary Multilingual Plane: U+1F300 – U+1F5FF.

pinyin

the name of the writing system that uses the Latin alphabet to transcribe Chinese text.

presentation

See glyph presentation.

presentation form

a ligature or variant glyph that has been encoded as a character for compatibility.

proportional

used to describe a font in which the width of characters varies depending on the letter or symbol. Example: The character m is wider than the character l in Times New Roman.

radical

a basic component (building block) of a Chinese character. The most commonly used Chinese character set contains 214 radicals. For example, radical 102 corresponds to 田, which means “rice paddy.”

RDBMS

See relational database management system.

relational database management system

a database management system that organizes and accesses data according to relationships between data items. The main characteristic of a relational database management system is the two-dimensional table. Examples of relational database management systems are DB2, Oracle, Sybase, and Microsoft SQL Server. Short form: RDBMS.

romaji

the name of the set of Roman or Latin characters that are used in the phonetic transcription of the Japanese language. Examples of romanization systems that implement romaji are Hepburn romanization, Kunrei-shiki Rōmaji (ISO 3602), and Nihon-shiki Rōmaji (ISO 3602 Strict).

romanization

the representation of a word or language using the Roman (Latin) alphabet, where the original word or language uses a different writing system. For example, pinyin is the name of the official writing system that uses the Latin alphabet to transcribe Chinese text.

SAS data set

a file whose contents are in one of the native SAS file formats. There are two types of SAS data sets: SAS data files and SAS data views. SAS data files contain data values in addition to descriptor information that is associated with the data. SAS data views contain only the descriptor information plus other information that is required for retrieving data values from other SAS data sets or from files whose contents are in other software vendors’ file formats.

SBCS

See single-byte character set.

script

the graphic form of a writing system. This term is often used interchangeably with writing system, from which it should be distinguished. A writing system needs a script for its physical representation, but these terms are conceptually independent of each other. The same writing system may be written in a variety of scripts. For example, the Latin, Cyrillic, and Greek scripts are different graphic instantiations of the same writing system, the alphabet.

setinit

the product authorization code that is referenced in the Master License Agreement. The setinit contains information on the product, expiration, and site. It is the unlock code that enables SAS software to operate through a specific date. The code consists of a unique password that is derived from the Global Installation Database (GIDB) components that create the setinit.

Shift-JIS

the name of an encoding of the Japanese standard JIS X 0208, widely used in PCs.

Simplified Chinese characters

one of the two groupings of standardized characters used in Chinese writing systems. Besides Simplified Chinese characters, there are the Traditional Chinese characters. Simplified Chinese characters are used by the People’s Republic of China, Singapore, and Malaysia. Simplified Chinese characters are formed with fewer strokes than their Traditional Chinese counterparts. Short form: CS.

single-byte character set

a type of encoding for which each character is represented using one byte of computer memory. An example of a single-byte character set is ISO Latin 1. Short form: SBCS.

software globalization

the processes that encompass both internationalization and localization, including the development, manufacture, and marketing of software products for worldwide distribution.

special character

a type of character other than alphanumeric characters, the underscore (_), and the blank. An example is the asterisk (*).

syllabary

a set of written symbols for representing syllables, which make up words. A symbol in a syllabary typically represents an optional consonant sound followed by a vowel sound. The Japanese language uses syllabic writing.

terminal emulator

software, hardware, or a combination of the two that simulates the operation of a particular model of terminal. For example, terminal emulators simulate VT100 and IBM 3270 terminals.

Traditional Chinese characters

one of the two groupings of standardized characters used in Chinese writing systems. Besides Traditional Chinese characters, there are the Simplified Chinese characters. Traditional Chinese characters are used in Taiwan, Hong Kong, Macau, and in many overseas Chinese communities.

transcription

the process (or the result of the process) that maps the sounds from one language to the best matching letters of another language. For example, because these Greek letters η, ι, υ, ει, oι, υι are pronounced as the Latin sound i, they are all transcribed as i.

transliteration

the process (or the result of the process) that systematically maps characters from one writing system to another. Examples of Greek to Latin transliterations are η (ē), ι (i), υ (y), ει (ei), oι (oi), and υι (yi).
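
The contrast between transliteration and transcription can be sketched with two lookup tables built from the Greek examples in these entries. The function `convert` and both table names are hypothetical, the tables cover only those six examples, and the longest-match rule is a simplifying assumption rather than a complete romanization scheme.

```python
# Transliteration keeps the source spelling distinguishable.
TRANSLITERATION = {
    "ει": "ei", "οι": "oi", "υι": "yi",
    "η": "ē", "ι": "i", "υ": "y",
}

# Transcription follows the sound, so all six collapse to the same letter.
TRANSCRIPTION = {key: "i" for key in TRANSLITERATION}


def convert(text: str, table: dict) -> str:
    """Hypothetical helper: map text through a table, preferring two-letter matches."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in table:
            out.append(table[pair])
            i += 2
        elif text[i] in table:
            out.append(table[text[i]])
            i += 1
        else:
            out.append(text[i])  # leave unmapped characters unchanged
            i += 1
    return "".join(out)


print(convert("ειοιυι", TRANSLITERATION))
print(convert("ειοιυι", TRANSCRIPTION))
```

The transliterated output can be mapped back to the Greek letters, whereas the transcribed output cannot, because distinct spellings have been merged.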

Unicode

a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world’s writing systems. Unicode includes more than 110,000 characters covering dozens of scripts, plus standards for character properties such as uppercase and lowercase, for rendering bidirectional script, and a number of related items.

Unicode consortium

an organization that develops and promotes the Unicode standard.

Unicode server

a workspace server that is configured in SAS to handle Unicode.

Unicode Transformation Format 8

See UTF-8.

Unicode Transformation Format 16

See UTF-16.

Unicode Transformation Format 32

See UTF-32.

UTF-8

a multi-byte encoding that represents each Unicode character with 1 to 4 bytes. It is backward-compatible with ASCII.

UTF-16

a multi-byte encoding that represents each Unicode character with 2 or 4 bytes. It is not backward-compatible with ASCII.

UTF-32

a multi-byte encoding that represents each Unicode character with 4 bytes. It is not backward-compatible with ASCII.
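
The byte counts for the three Unicode encodings can be compared with a short Python sketch. The sample characters are illustrative choices, and the little-endian codec variants are used so that no byte-order mark is added to the counts; `byte_lengths` is a hypothetical helper name.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

# ASCII letter, Latin-1 letter, CJK character, supplementary-plane pictograph.
samples = ["A", "Ä", "山", "\U0001F300"]


def byte_lengths(ch: str) -> dict:
    """Hypothetical helper: bytes needed for one character in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


lengths = {ch: byte_lengths(ch) for ch in samples}
for ch in samples:
    print(f"U+{ord(ch):04X}", lengths[ch])

# Backward compatibility with ASCII holds only for UTF-8.
ascii_compatible = "A".encode("utf-8") == "A".encode("ascii")
```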

variant character

a character that uses different code positions across IBM EBCDIC code pages. The following characters are considered variant: ! # $ @ [ ] ^ ` { } | ~.

ISO 646 also has variant characters whose representation differs in some of the national variant character sets.

visual order

bidirectional text stored in memory in the same order that you would expect to see it displayed (in contrast with logical order). Legacy systems frequently stored text in visual order to avoid reordering for display.

writing system

a system to record human languages by means of visible or tactile signs. The term is used in two different senses. One refers to the basic types of systems designed to represent language, that is, logographic (ideographic) systems, syllabaries, and alphabets (consonant alphabets and full alphabets). The other refers to the set of rules for using one or more scripts to write a particular language. For instance, the Japanese writing system uses several scripts; the British English and the French writing systems use different subsets of the Latin script; Russian and Ukrainian use different subsets of the Cyrillic script; and so on.
