Using Linguistic Sorts and Searches

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

10.4. Using Linguistic Sorts and Searches

Different languages follow different rules when it comes to sorting text. Unfortunately, that means that there is no "one-size-fits-all" algorithm that can be used. Instead, Oracle's global support functionality allows not only binary sorting methods, but also linguistic sorting and searching methodologies to provide the flexibility to support the needs of many languages.

In the following sections, you will learn the methods that Oracle uses when it performs text-sorting operations. You will also learn about the different NLS parameters that impact linguistic sorting and searching. Next, you will learn about the different types of linguistic sorts supported by Oracle. Lastly, you will learn about linguistic text searches.

10.4.1. An Overview of Text Sorting

There are many ways in which text sorting can be accomplished. Sort order can be case-sensitive or case can be ignored. Diacritics(accent marks) can be considered or ignored. Sorting can be done phonetically or based on the appearance of the character.

Some languages even consider groupings of characters to have a specific sort order. For example, traditional Spanish treats ch as a character that sorts after the letter C. Therefore, when sorting the words cat, dog, cow, and chinchilla, the correct sort sequence would be cat, cow, chinchilla, and dog.

To support these different sorting methods, Oracle offers two basic categories of sorting: binary and linguistic.

10.4.1.1. Binary Sorts

Binary sorts are the fastest and most efficient sorting method offered by Oracle. However, they are also the most limited. Binary sorts perform a numeric sort based on the encoded value (in the character set) for each character. As long as the character set encodes all of the characters in the proper sort order, this method works very well. The performance is also exceptional.

For example, in the US7ASCII character set, alphabetical characters are encoded as shown in Table 10.4. Note that this is just a subset of the character set.

Because the encoded values ascend in correlation with the characters, a binary sort will always perform a proper alphabetical sort. In addition, uppercase characters will always sort higher than lowercase characters.

However, different languages may share the same alphabet (and therefore, the same character set), yet utilize a sort order that deviates from the encoding order of the character set. In this situation, binary sorting will fail to produce an acceptable result.

NOTE

Binary sorting is Oracle's default sorting method.

Table 10.4. US7ASCII Alphabetical Characters
Char	Value	Char	Value	Char	Value	Char	Value
A	65	N	78	a	97	n	110
B	66	0	79	b	98	o	111
C	67	P	80	c	99	p	112
D	68	Q	81	d	100	q	113
E	69	R	82	e	101	r	114
F	70	S	83	f	102	s	115
G	71	T	84	g	103	t	116
H	72	U	85	H	104	u	117
I	73	V	86	i	105	V	118
J	74	w	87	i	106	w	119
K	75	X	88	K	107	X	120
L	76	Y	89	I	108	y	121
M	77	z	90	m	109	z	122

10.4.1.2. Linguistic Sorts

Linguistic sorts, unlike binary sorts, operate independently of the underlying encoded values. Instead, they allow character data to be sorted based on the rules of specific languages. Linguistic sorts offer the flexibility to deal with the caveats imposed by different languages.

Oracle provides a rich set of linguistic sort definitions that cover most of the languages of the world. However, it also provides the ability to define new definitions or to modify existing definitions. The Oracle Locale Builder, a graphical tool that ships with Oracle 10g, can be used to view, create, and modify sort definitions and other locale definitions. However, as mentioned earlier, Locale Builder is not covered in this book.

Linguistic sorts are defined using a variety of rules that govern the sorting process. The following elements are available and can be used to create a comprehensive linguistic sort:

Base letters Base letters are the individual letters upon which other characters are based. The derived characters would map back to the base letter to determine the sorting value. For example, the character A is a base letter, while À Á Ã Ä a àdä would all map to A as a base letter.

Ignorable characters Ignorable characters, just as the name implies, can be defined as having no effect on sort order. Diacritics, such as the umlaut, can be classified as ignorable, as can certain punctuation characters such as the hyphen. Therefore, a word such as e-mail would be sorted the same as email.

Contracting characters Contracting characters represent two or more characters that are treated linguistically as a single character. An example—the traditional Spanish ch string—was explained earlier in this section. Contracting characters require flexibility in the sorting algorithm to read ahead to the next character to determine if a contracting character has been found.

Expanding characters Some locales choose to compress repeating or commonly occurring strings of characters into a single character for brevity's sake. When sorting, however, the character needs to sort as if all characters are present. These are referred to as expanding characters.

For example, the ö character should be treated as the string oe for sorting purposes.

Context-sensitive characters Certain languages sort characters differently based upon their relationship to other characters. These are generally characters that modify the preceding character. For example, a Japanese length mark is sorted according to the vowel that precedes it.

Canonical equivalence When using a Unicode character set, the character ö and the string ö can be considered equal from a sorting perspective. This is because the code points of the two-character string match the code point of the individual character. The two are said to have canonical equivalence.

In situations of canonical equivalence, the value of the CANONICAL_EQUIVALENCE linguistic flag (with a value of TRUE or FALSE) determines whether the rules of canonical equivalence should be followed.

Reverse secondary sorting In some languages, strings containing diacritics will be sorted from left to right on the base characters, and then from right to left on the diacritics. This is referred to as reverse secondary sorting. For example, resumé would sort before résume because of the position of the diacritic from left to right.

The REVERSE_SECONDARY=TRUE linguistic flag enables this functionality.

Character rearrangement In certain languages (notably Thai and Laotian dialects), sorting rules declare that certain characters should switch places with the following character before sorting. This generally happens when a consonant is preceded by a vowel sound. In this case, the consonant is given priority, forcing the characters to be switched before the sort begins. The SWAP_WITH_NEXT linguistic flag can determine whether character rearrangement will occur within a sort definition.

NOTE

Don't confuse linguistic flags with Oracle parameter settings. Linguistic flags are defined for specific sort order definitions when they are created.

These different sorting methods represent the toolset available to sort and search text. Different languages may use one or more of the linguistic methods listed here. But as long as the rules of a language can be described using these methods, Oracle is able to perform linguistic sorts, either through an existing sort definition or through a custom sort definition.

10.4.2. Using Linguistic Sort Parameters

Linguistic sorts are generally applicable to a specific language or to a specific character set. And, as mentioned previously, there are many pre-defined linguistic sort definitions that may be utilized. Therefore, it is unlikely that you would ever need to define your own.

You can instruct Oracle to utilize specific linguistic sorts by setting the appropriate NLS sort parameters. In this section, you will learn about the NLS_SORT and NLS_COMP parameters, and how they affect linguistic sorting operations.

10.4.2.1. NLS_SORT

The NLS_SORT parameter defines which type of sorting—binary or linguistic—should be performed for SQL sort operations. By default, the value for NLS_SORT is the default sort method defined for the language identified in the NLS_LANGUAGE parameter. For example, if the NLS_LANGUAGE parameter is set to AMERICAN, the default value for the NLS_SORT parameter will be BINARY.

To instruct Oracle to use linguistic sorting, this parameter can be set to the name of any valid linguistic sort definition, as shown here:

SQL> alter session set NLS_SORT = German;
Session altered.

A list of valid sort definition names is shown here. You could also query the V$NLS_VALID_VALUES views (as shown earlier in this chapter).

ARABIC                       GERMAN                    SWISS
ARABIC_ABJ_MATCH             GERMAN_DIN                TCHINESE_RADICAL_M

ARABIC_ABJ_SORT              GREEK                     TCHINESE_STROKE_M
ARABIC_MATCH                 HEBREW                    THAI_DICTIONARY
ASCII7                       HKSCS                     THAI_M
AZERBAIJANI                  HUNGARIAN                 THAI_TELEPHONE
BENGALI                      ICELANDIC                 TURKISH
BIG5                         INDONESIAN                UKRAINIAN
BINARY                       ITALIAN                   UNICODE_BINARY
BULGARIAN                    JAPANESE                  VIETNAMESE
CANADIAN FRENCH              JAPANESE_M                WEST_EUROPEAN
CANADIAN_M                   KOREAN_M                  XAZERBAIJANI
CATALAN                      LATIN                     XCATALAN
CROATIAN                     LATVIAN                   XCROATIAN
CZECH                        LITHUANIAN                XCZECH
CZECH_PUNCTUATION            MALAY                     XCZECH_PUNCTUATION
DANISH                       NORWEGIAN                 XDANISH
DANISH_M                     POLISH                    XDUTCH
DUTCH                        PUNCTUATION               XFRENCH
EBCDIC                       ROMANIAN                  XGERMAN
EEC_EURO                     RUSSIAN                   XGERMAN_DIN
EEC_EUROPA3                  SCHINESE_PINYIN_M         XHUNGARIAN
ESTONIAN                     SCHINESE_RADICAL_M        XPUNCTUATION
FINNISH                      SCHINESE_STROKE_M         XSLOVAK
FRENCH                       SLOVAK                    XSLOVENIAN
FRENCH_M                     SLOVENIAN                 XSPANISH
GBK                          SPANISH                   XSWISS
GENERIC_BASELETTER           SPANISH_M                 XTURKISH
GENERIC_M                    SWEDISH                   XWEST_EUROPEAN

By using the NLS_SORT parameter, you can make the following changes to Oracle's default functionality:

Set the default sort method for all ORDER BY operations.
Set the default sort value for the NLSSORT function.

It is important to note that not all SQL functionality supports linguistic sorting. Certain functions support only binary sorts. However, most of the commonly used methods are supported. Also, all NLS-specific SQL functions (for example, NLSSORT) will support linguistic sorts.

The methods listed here support linguistic sorting:

ORDER BY
BETWEEN
CASE WHEN
HAVING
IN/OUT
START WITH
WHERE

By default, all of these operations will perform binary sorts. By setting the NLS_SORT parameter, SQL statements using the WHERE operation will perform linguistic sorts by default.

The following example demonstrates the use of the NLS_SORT parameter. Initially, you can see that your session has no value set for NLS_SORT. Therefore, it will inherit the default setting from the NLS_LANGUAGE parameter (BINARY).

SQL> show parameters NLS_LANGUAGE

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------
nls_language                         string      AMERICAN

SQL> show parameters NLS_SORT

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------
nls_sort                             string

As you learned earlier, the default sort for the language AMERICAN is BINARY. The default setting for NLS_COMP is BINARY as well. Therefore, you can expect that, by default, Oracle will perform a binary sort unless otherwise specified, as shown here:

SQL> select * from sort_test
   order by name;

NAME
--------------------------------
Finsteraarhornhutte
Grünhornlücke
einschließlich
finsteraarhornhütte
grünhornlücke

5 rows selected.

As expected, Oracle sorted the rows based on the encoded value of the characters in the US7ASCII character set. Therefore, all uppercase characters sort before lowercase characters.

Because the words in this table are of German origin, it is logical that you might want to sort them using a German sorting method. To change the sorting method for a GROUP BY clause, the NLS_SORT parameter can be set as shown here:

SQL> alter session set NLS_SORT=German_din;

Session altered.

SQL> select * from sort_test
  order by name;

NAME
--------------------------------------------
einschließlich
Finsteraarhornhutte
finsteraarhornhütte
Grünhornlücke
grünhornlücke

5 rows selected.

As you can see, setting the NLS_SORT parameter changed the default sorting method to a linguistic sort instead of a binary sort.

In the next step, another query is executed, this time using a WHERE condition rather than an ORDER BY clause:

SQL> select * from sort_test
  2 where name < 'einschließlich';

NAME
---------------------------------------
finsteraarhornhütte
grünhornlücke

2 rows selected.

The result of this query might not be what you expect. Instead of the expected four rows (which a linguistic sort would have returned), only two rows are returned, indicating that a binary sort took place instead.

Remember, the NLS_SORT parameter overrides the default sorting method for ORDER BY operations and for the NLSSORT function, but has no effect on other sort operations, such as WHERE conditions. Therefore, this query ignored the parameter entirely.

To perform a linguistic sort, you call the NLSSORT function. Normally, the function would be called like this:

SQL> select * from sort_test
  where nlssort(name, 'NLS_SORT=German_din') >
  nlssort('einschließlich','NLS_SORT=German_din');


NAME
--------------------------------------------------
Finsteraarhornhutte
finsteraarhornhütte
Grünhornlücke
grünhornlücke

4 rows selected.

However, because the NLS_SORT parameter defines the default sort for the NLSSORT function, specifying the sort inside the function is unnecessary. The following method works as well:

SQL> select * from sort_test
  where nlssort(name) < nlssort('einschließlich');

NAME
--------------------------------------------------
Finsteraarhornhutte
finsteraarhornhütte
Grünhornlücke
grünhornlücke

4 rows selected.

As you can see, in both queries, the sort was performed linguistically and returned the expected rows.

The NLS_SORT parameter is very limited in relation to linguistic sorting. The NLS_COMP parameter, on the other hand, makes linguistic sorting much easier.

10.4.2.2. NLS_COMP

The NLS_COMP parameter works in conjunction with the NLS_SORT parameter to make linguistic sorts easier to use. When the NLS_COMP parameter is set to a value of ANSI, all of the following SQL operations will default to linguistic sorting (using the language specified in NLS_SORT parameter):

ORDER BY
BETWEEN
CASE WHEN
HAVING
IN/OUT
START WITH
WHERE

Setting the NLS_COMP parameter makes it unnecessary to call the NLSSORT function when using these sort operations. As you can guess, this makes linguistic sorting much easier to perform. The NLS_COMP parameter can be set to either BINARY (the default) or ANSI. The following example shows the usage of the NLS_COMP parameter:

SQL> alter session set NLS_SORT=German_din;

Session altered.

SQL> alter session set NLS_COMP=ANSI;

Session altered.

SQL> select * from sort_test
  2 where name > 'einschließlich';

NAME
------------------------------------------
Finsteraarhornhutte
finsteraarhornhütte
Grünhornlücke
grünhornlücke

4 rows selected.

As you can see, the query performed the linguistic sort and returned the expected results, even without using the NLSSORT function.

If the NLS_COMP parameter is set back to BINARY, binary sorting occurs once again:

SQL> alter session set NLS_COMP=BINARY;

Session altered.

SQL> select * from sort_test
  2 where name > 'einschließlich';

NAME
--------------------------------------------
finsteraarhornhütte
grünhornlücke

2 rows selected.

10.4.3. Linguistic Sort Types

When performing linguistic sorts, Oracle uses different methodologies, depending upon the number of languages involved in the sort. If character data in only one language is being sorted, this is classified as a monolingual linguistic sort. If more than one language is involved in the sort, it is classified as a multilingual linguistic sort.

In the following sections, you will learn the methodology that Oracle implements in performing both monolingual and multilingual linguistic sorts. You will also learn about accent-insensitive and case-insensitive linguistic sorts.

10.4.3.1. Monolingual Linguistic Sorts

When dealing with only a single language inside a sort, Oracle performs a two-step process to compare character strings. First, the major value of the strings is compared. Next, if necessary, the minor value of the strings is compared.

Major and minor values are sort values assigned to letters in the character set. A base letter and those derived from it will generally share a common major value, but they will have different minor values based on the desired sort order.

Here's an example:

Letter	Major Value	Minor Value
a	30	10
A	30	20
ä	30	30
Ä	30	40
b	40	10

The example shows that all four variations of the letter A have identical major values, but differing minor values. When two letters share a major value, the minor value will determine the sort order.

The major value numbers are assigned to a Unicode code point(a 16-bit binary value that defines a character in a Unicode character set).

10.4.3.2. Multilingual Linguistic Sorts

Multilingual sorts offer the ability to sort mixed languges within the same sort. For example, if you have a table that stores names in both English and Spanish, a multilingual sort should be used.

Multilingual sorts perform three levels of evaluation: primary, secondary, and tertiary.

Primary sorts assign a primary sort value based on the base letter of each character (diacritics and case are ignored). If a character is defined as ignorable, it is assigned a primary value of zero.

Secondary level sorts consider diacritics to differentiate accented letters from base letters in assigning a secondary sort level. For example, A and ä share the same base letter, so they have the same primary sort level, but they will have different secondary levels.

Tertiary level sorts consider character case to differentiate uppercase and lowercase letters. Tertiary sorts also handle special characters such as *, +, and −.

Consider the following words:

Fahrvergnügen
Fahrvergnugen
farhrvergnugen
fahrvergnügen

Because all of these words share the same base letters in the same order, all of them would be considered equivalent at the primary sort level. After a secondary level sort, they would be ordered similarly to the following list:

Fahrvergnugen
farhrvergnugen
Fahrvergnügen
fahrvergnügen

The secondary level sort is concerned only with diacritics, so it will sort characters without diacritics above those with diacritics. After that, the words are displayed in their primary sort order. Because the words in this example have identical primary sort orders, there is no guarantee which word will appear before the other. The only guarantee is that those with diacritics will sort after those without.

After the tertiary level sort is performed, the list should look exactly like this:

farhrvergnugen

Fahrvergnugen

fahrvergnügen

Fahrvergnügen

The tertiary level sort applies the case rule to the data, forcing lowercase letters to sort before uppercase letters. This is the final result of the sort after applying all three multilingual sorting levels.

In keeping with the ISO 14651 standard for multilingual sorting, Oracle appends an _M to the sort name to identify it as multilingual. For example, FRENCH_M identifies a French multilingual sort, whereas FRENCH identifies a French monolingual sort. Table 10.5 shows the multilingual sorts predefined in Oracle 10g.

Table 10.5. Multilingual Sorts Available in Oracle 10g
Multilingual Sort Name	Description
`CANADIAN_M`	Canadian French
`DANISH_M`	Danish
`FRENCH_M`	French
`GENERIC_M`	Generic based on ISO14651
`JAPANESE_M`	Japanese
`KOREAN_M`	Korean
`SPANISH_M`	Traditional Spanish
`THAI_M`	Thai
`SCHINESE_RADICAL_M`	Simplified Chinese
`SCHINESE_STROKE_M`	Simplified Chinese
`SCHINESE_PINYIN_M`	Simplified Chinese
`TCHINESE_RADICAL_M`	Traditional Chinese
`TCHINESE_STROKE_M`	Traditional Chinese

10.4.3.3. Case-Insensitive and Accent-Insensitive Linguistic Sorts

Oracle, by default, will always consider both the case of the characters and any diacritics when performing sort operations. As you've seen in previous examples, linguistic sorts have rules to govern precedence between uppercase and lowercase words, as well as those words containing diacritics.

However, you may wish to override this functionality from time to time and choose to ignore case and/or diacritics. Oracle 10g offers case-insensitive and accent-insensitive sorting options to allow for these cases.

To specify case-insensitive or accent-insensitive sorts, the NLS_SORT parameter is used, but with the following changes:

For a case-insensitive linguistic sort, append the string _CI to the sort name.
For an accent-insensitive linguistic sort, append the string _AI to the sort name.

NOTE

Accent-insensitive sorts are also case-insensitive by default.

For example, to specify a French, multilingual, accent-insensitive sort, use the following:

NLS_SORT = French_M_AI

Here is how to specify a German, monolingual, case-insensitive sort:

NLS_SORT = German_CI

10.4.3.4. Case-Insensitive and Accent-Insensitive Binary Sorts

Binary sorts can also be designated as case-insensitive or accent-insensitive. The NLS_SORT parameter can be set to BINARY_CI or BINARY_AI. Table 10.6 shows how the sort will be affected by these settings.

Table 10.6. Binary Case and Accent-insensitive Sort Options
Sort Named	Sort type	Case-insensitive?	Accent-Insensitive?
`BINARY_CI`	Binary	Yes	No
`BINARY_AI`	Binary	Yes	Yes

As you can see, using the BINARY_AI sort will result in both an accent-insensitive and case-insensitive sort.

10.4.4. Searching Linguistic Strings

Linguistic searches are closely related to linguistic sorts and are directly affected by the NLS_SORT setting. To accomplish linguistically meaningful searches, Oracle must apply the same rules it applies for linguistic sorts.

Earlier in this section, you saw several examples of linguistic string searching, including the following:

SQL> select * from sort_test
  2 where name > 'einschließlich';

NAME
------------------------------------------
Finsteraarhornhutte
finsteraarhornhütte
Grünhornlücke
grünhornlücke

4 rows selected.

By setting the NLS_COMP parameter to ANSI and the NLS_SORT parameter to the desired sort language, the WHERE operator (as well as several others) will perform linguistic searching by default.

And, just as you did with sorts, if the NLS_SORT is set to ignore case or accents, linguistic searches will follow suit, as in this example:

SQL> alter session set NLS_COMP=ANSI;

Session altered.
SQL> alter session set NLS_SORT=German_din_ci;

Session altered.

SQL> select * from sort_test
  2 where name = 'Grünhornlücke';

NAME
-----------------------------------------
Grünhornlücke
grünhornlücke

As you can see, the search ignored the case and returned both rows that matched. This is the expected functionality of the case-insensitive search.

In the next example, the NLS_SORT parameter will be set to define an accent-insensitive search:

SQL> alter session set NLS_SORT=German_din_ai;

Session altered.

SQL> select * from sort_test
  where name = 'Finsteraarhornhutte';


NAME
-----------------------------------------------
Finsteraarhornhutte
finsteraarhornhütte

When the NLS_SORT parameter defined an accent-insensitive search, both accents and case were ignored. This is the expected functionality for accent-insensitive searches.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Using Linguistic Sorts and Searches

Create new playlist

Sign In

Sign Up

10.4. Using Linguistic Sorts and Searches

10.4.1. An Overview of Text Sorting

10.4.1.1. Binary Sorts

10.4.1.2. Linguistic Sorts

10.4.2. Using Linguistic Sort Parameters

10.4.2.1. NLS_SORT

10.4.2.2. NLS_COMP

10.4.3. Linguistic Sort Types

10.4.3.1. Monolingual Linguistic Sorts

10.4.3.2. Multilingual Linguistic Sorts

10.4.3.3. Case-Insensitive and Accent-Insensitive Linguistic Sorts

10.4.3.4. Case-Insensitive and Accent-Insensitive Binary Sorts

10.4.4. Searching Linguistic Strings

Table of Contents for
Using Linguistic Sorts and Searches