Chapter 5. C# Text Manipulation and File I/O

Topics in This Chapter

  • Characters and Unicode: By default, .NET stores a character as a 16-bit Unicode character. This enables an application to support international character sets, a technique referred to as localization.

  • String Overview: In .NET, strings are immutable. To use strings efficiently, it is necessary to understand what this means and how immutability affects string operations.

  • String Operations: In addition to basic string operations, .NET supports advanced formatting techniques for numbers and dates.

  • StringBuilder: The StringBuilder class offers an efficient alternative to concatenation for constructing strings.

  • Regular Expressions: The .NET Regex class provides an engine that uses regular expressions to parse, match, and extract values in a string.

  • Text Streams: Stream classes permit data to be read and written as a stream of bytes that can be encrypted and buffered.

  • Text Reading and Writing: The StreamReader and StreamWriter classes make it easy to read from and write to physical files.

  • System.IO: Classes in this namespace enable an application to work with the underlying directory structure of the host operating system.

This chapter introduces the string handling capabilities provided by the .NET classes. Topics include how to use the basic String methods for extracting and manipulating string content; the use of the String.Format method to display numbers and dates in special formats; and the use of regular expressions (regexes) to perform advanced pattern matching. Also included is a look at the underlying features of .NET that influence how an application works with text. Topics include how the Just-In-Time (JIT) compiler optimizes the use of literal strings; the importance of Unicode as the cornerstone of character and string representations in .NET; and the built-in localization features that permit applications to automatically take into account the culture-specific characteristics of languages and countries.

This chapter is divided into two major topics. The first topic focuses on how to create, represent, and manipulate strings using the System.Char, System.String, and Regex classes; the second takes up a related topic of how to store and retrieve string data. It begins by looking at the Stream class and how to use it to process raw bytes of data as streams that can be stored in files or transmitted across a network. The discussion then moves to using the TextReader/TextWriter classes to read and write strings as lines of text. The chapter concludes with examples of how members of the System.IO namespace are used to access the Microsoft Windows directory and file structure.

Characters and Unicode

One of the watershed events in computing was the introduction of the ASCII 7-bit character set in 1968 as a standardized encoding scheme to uniquely identify alphanumeric characters and other common symbols. It was largely based on the Latin alphabet and contained 128 characters. The subsequent ANSI standard doubled the number of characters—primarily to include symbols for European alphabets and currencies. However, because it was still based on Latin characters, a number of incompatible encoding schemes sprang up to represent non-Latin alphabets such as the Greek and Arabic languages.

Recognizing the need for a universal encoding scheme, an international consortium devised the Unicode specification. It is now a standard, accepted worldwide, that defines a unique number for every character “no matter what the platform, no matter what the program, no matter what the language.”[1]

Unicode

.NET fully supports the Unicode standard. Its internal representation of a character is an unsigned 16-bit number that conforms to the Unicode encoding scheme. Two bytes enable a character to represent up to 65,536 values. Figure 5-1 illustrates why two bytes are needed.

Figure 5-1. Unicode memory layout of a character

The uppercase character on the left is a member of the Basic Latin character set that consists of the original 128 ASCII characters. Its decimal value of 75 can be depicted in 8 bits; the unneeded bits are set to zero. However, the other three characters have values that range from 310 (0x0136) to 56,069 (0xDB05), which can be represented by no fewer than two bytes.

Unicode characters have a unique identifier made up of a name and value, referred to as a code point. The current version 4.0 defines identifiers for 96,382 characters. These characters are grouped in over 130 character sets that include language scripts, symbols for math, music, OCR, geometric shapes, Braille, and many other uses.

Because 16 bits cannot represent the nearly 100,000 characters supported worldwide, more bytes are required for some character sets. The Unicode solution is a mechanism by which two sets of 16-bit units define a character. This pair of code units is known as a surrogate pair. Together, this high surrogate and low surrogate represent a single 32-bit abstract character into which characters are mapped. This approach supports over 1,000,000 characters. The surrogates are constructed from values that reside in a reserved area at the high end of the Unicode code space so that they are not mistaken for actual characters.
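The surrogate mechanism described above can be illustrated with a small sketch. The class name SurrogateDemo, the helper ToSurrogatePair, and the sample code point U+1D11E are assumptions chosen for illustration; the arithmetic follows the standard UTF-16 rule, and the framework calls char.ConvertFromUtf32 and char.ConvertToUtf32 are used only to cross-check the result.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: split a code point above U+FFFF into its
    // high and low 16-bit surrogate code units.
    public static (char High, char Low) ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
        char low  = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }

    public static void Main()
    {
        int clef = 0x1D11E;  // sample code point, chosen only as an illustration
        var (high, low) = ToSurrogatePair(clef);
        Console.WriteLine("{0:X4} {1:X4}", (int)high, (int)low);    // D834 DD1E

        // The framework performs the same mapping.
        string s = char.ConvertFromUtf32(clef);
        Console.WriteLine(s.Length);                                 // 2
        Console.WriteLine(char.IsHighSurrogate(s[0]));               // True
        Console.WriteLine(char.ConvertToUtf32(s[0], s[1]) == clef);  // True
    }
}
```

Note that a single character outside the 16-bit range occupies two positions in a string, which is why its Length is reported as 2.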

As a developer, you can pretty much ignore the details of whether a character requires 16 or 32 bits because the .NET API and classes handle the underlying details of representing Unicode characters. One exception to this—discussed later in this section—occurs if you parse individual bytes in a stream and need to recognize the surrogates. For this, .NET provides a special object to iterate through the bytes.

Core Note

Unicode characters can only be displayed if your computer has a font supporting them. On a Windows operating system, you can install a font extension (ttfext.exe) that displays the supported Unicode ranges for a .ttf font. To use it, right-click the .ttf font name and select Properties. Console applications cannot print Unicode characters because console output always displays in a non-proportional typeface.

Working with Characters

A single character is represented in .NET as a char (or Char) structure. The char structure defines a small set of members (see char in Chapter 2, “C# Language Fundamentals”) that can be used to inspect and transform its value. Here is a brief review of some standard character operations.

Assigning a Value to a Char Type

The most obvious way to assign a value to a char variable is with a literal value. However, because a char value is represented internally as a number, you can also assign it a numeric value. Here are examples of each:

string klm = "KLM";
byte     b = 75;
char k;
// Different ways to assign 'K' to variable k
k = 'K';
k = klm[0];              // Assign "K" from first value in klm
k = (char) 75;           // Cast decimal
k = (char) b;            // cast byte
k = Convert.ToChar(75);  // Converts value to a char

Converting a Char Value to a Numeric Value

When a character is converted to a number, the result is the underlying Unicode (ordinal) value of the character. Casting is the most efficient way to do this, although Convert methods can also be used. In the special case where the char is a digit and you want to assign the linguistic value—rather than the Unicode value—use the static GetNumericValue method.

// '7' has Unicode value of 55
char k = '7';
int n = (int) k;             // n = 55
n = (int) char.GetNumericValue(k);   // n = 7

Characters and Localization

One of the most important features of .NET is the capability to automatically recognize and incorporate culture-specific rules of a language or country into an application. This process, known as localization, may affect how a date or number is formatted, which currency symbol appears in a report, or how string comparisons are carried out. In practical terms, localization means a single application would display the date May 9, 2004 as 9/5/2004 to a user in Paris, France and as 5/9/2004 to a user in Paris, Texas. The Common Language Runtime (CLR) automatically recognizes the local computer's culture and makes the adjustments.

The .NET Framework provides more than a hundred culture names and identifiers that are used with the CultureInfo class to designate the language/country to be used with culture sensitive operations in a program. Although localization has a greater impact when working with strings, the Char.ToUpper method in this example is a useful way to demonstrate the concept.

// Include the System.Globalization namespace
// Using CultureInfo – Azerbaijan
char i = 'i';
// Second parameter is false to use default culture settings
// associated with selected culture
CultureInfo myCI = new CultureInfo("az", false );
i = Char.ToUpper(i,myCI);

An overload of ToUpper() accepts a CultureInfo object that specifies the culture (language and country) to be used in executing the method. In this case, az stands for the Azeri language of the country Azerbaijan (more about this follows). When the Common Language Runtime sees the CultureInfo parameter, it takes into account any aspects of the culture that might affect the operation. When no parameter is provided, the CLR uses the system's default culture.

Core Note

On a Windows operating system, the .NET Framework obtains its default culture information from the system's country and language settings. It assigns these values to the Thread.CurrentThread.CurrentCulture property. You can set these options by choosing Regional Options in the Control Panel.

So why choose Azerbaijan, a small nation on the Caspian Sea, to demonstrate localization? Among all the countries in the world that use the Latin character set, only Azerbaijan and Turkey capitalize the letter i not with I (U+0049), but with an I that has a dot above it (U+0130). To ensure that ToUpper() performs this operation correctly, we must create an instance of the CultureInfo class with the Azeri culture name—represented by az—and pass it to the method. This results in the correct Unicode character—and a satisfied population of 8.3 million Azerbaijani.

Characters and Their Unicode Categories

The Unicode Standard classifies Unicode characters into one of 30 categories. .NET provides a UnicodeCategory enumeration that represents each of these categories and a Char.GetUnicodeCategory() method to return a character's category. Here is an example:

Char k  = 'K';
int iCat = (int) char.GetUnicodeCategory(k);   // 0
Console.WriteLine(char.GetUnicodeCategory(k)); // UppercaseLetter
char cr = (Char)13;
iCat = (int) char.GetUnicodeCategory(cr);      // 14
Console.WriteLine(char.GetUnicodeCategory(cr)); // Control

The method correctly identifies K as an UppercaseLetter and the carriage return as a Control character. As an alternative to the unwieldy GetUnicodeCategory, char includes a set of static methods as a shortcut for identifying a character's Unicode category. They are nothing more than wrappers that return a true or false value based on an internal call to GetUnicodeCategory. Table 5-1 lists these methods.

Table 5-1. Char Methods That Verify Unicode Categories

| Method | Unicode Category | Description |
|---|---|---|
| IsControl | 14 | Control code whose Unicode value is U+007F, or in the range U+0000 through U+001F, or U+0080 through U+009F. |
| IsDigit | 8 | Decimal digit (for example, 0–9). |
| IsLetter | 0, 1, 2, 3, 4 | Letter. |
| IsLetterOrDigit | 0, 1, 2, 3, 4, 8 | Union of letters and digits. |
| IsLower | 1 | Lowercase letter. |
| IsUpper | 0 | Uppercase letter. |
| IsPunctuation | 18, 19, 20, 21, 22, 23, 24 | Punctuation symbol, for example DashPunctuation(19), OpenPunctuation(20), or OtherPunctuation(24). |
| IsSeparator | 11, 12, 13 | Space separator, line separator, paragraph separator. |
| IsSurrogate | 16 | Value is a high or low surrogate. |
| IsSymbol | 25, 26, 27, 28 | Symbol. |
| IsWhiteSpace | 11, 12, 13 | A space, line, or paragraph separator, or one of these control characters: carriage return (0x0D), horizontal tab (0x09), line feed (0x0A), form feed (0x0C), or vertical tab (0x0B). |

Using these methods is straightforward. The main point of interest is that they have overloads that accept a single char parameter, or two parameters specifying a string and index to the character within the string.

Console.WriteLine(Char.IsSymbol('+'));         // true
Console.WriteLine(Char.IsPunctuation('+'));    // false
string str = "black magic";
Console.WriteLine(Char.IsWhiteSpace(str, 5));  // true
char p = '.';
Console.WriteLine(Char.IsPunctuation(p));      // true
int iCat = (int) char.GetUnicodeCategory(p);   // 24
p = '(';                                       // reuse p for a second test
Console.WriteLine(Char.IsPunctuation(p));      // true
iCat = (int) char.GetUnicodeCategory(p);       // 20
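To show the string-and-index overload doing real work, here is a minimal sketch that tallies the Unicode categories found in a string. The class name CategoryDemo, the method CountCategories, and the sample text are assumptions introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CategoryDemo
{
    // Hypothetical helper: count how many characters of each
    // Unicode category a string contains.
    public static Dictionary<UnicodeCategory, int> CountCategories(string text)
    {
        var counts = new Dictionary<UnicodeCategory, int>();
        for (int i = 0; i < text.Length; i++)
        {
            // Overload that takes a string and an index into it
            UnicodeCategory cat = char.GetUnicodeCategory(text, i);
            counts.TryGetValue(cat, out int current);   // current is 0 if absent
            counts[cat] = current + 1;
        }
        return counts;
    }

    public static void Main()
    {
        // Prints one line per category found, with its count
        foreach (var entry in CountCategories("Kubla Khan, 1816!"))
            Console.WriteLine("{0,-20} {1}", entry.Key, entry.Value);
    }
}
```

For the sample text, the tally contains 2 uppercase letters, 7 lowercase letters, 4 decimal digits, 2 space separators, and 2 OtherPunctuation characters.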

The String Class

The System.String class was introduced in Chapter 2. This section expands that discussion to include a more detailed look at creating, comparing, and formatting strings. Before proceeding to these operations, let's first review the salient points from Chapter 2:

  • The System.String class is a reference type having value semantics. This means that unlike most reference types, string comparisons are based on the value of the strings and not their location.

  • A string is a sequence of Char types. Any reference to a character within a string is treated as a char.

  • Strings are immutable. This means that after a string is created, it cannot be changed at its current memory location: You cannot shorten it, append to it, or change a character within it. The string value can be changed, of course, but the modified string is stored in a new memory location. The original string remains until the Garbage Collector removes it.

  • The System.Text.StringBuilder class provides a set of methods to construct and manipulate strings within a buffer. When the operations are completed, the contents are converted to a string. StringBuilder should be used when an application makes extensive use of concatenation and string modifications.
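The immutability and StringBuilder points above can be seen in a short sketch. The class name ImmutabilityDemo and the helper Join are assumptions made for this illustration; they are not part of the framework.

```csharp
using System;
using System.Text;

public static class ImmutabilityDemo
{
    // Hypothetical helper: build a comma-separated list in one buffer
    // instead of creating a new string on every concatenation.
    public static string Join(string[] items)
    {
        var sb = new StringBuilder();
        foreach (string item in items)
        {
            if (sb.Length > 0) sb.Append(", ");
            sb.Append(item);
        }
        return sb.ToString();   // converted to a string only once, at the end
    }

    public static void Main()
    {
        string original = "Kubla";
        string alias = original;       // both variables reference one object
        original += " Khan";           // creates a new string; alias is untouched
        Console.WriteLine(alias);      // Kubla
        Console.WriteLine(original);   // Kubla Khan
        Console.WriteLine(object.ReferenceEquals(alias, original));   // False

        Console.WriteLine(Join(new[] { "Xanadu", "Alph", "Kubla" })); // Xanadu, Alph, Kubla
    }
}
```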

Creating Strings

A string is created by declaring a variable as a string type and assigning a value to it. The value may be a literal string or dynamically created using concatenation. This is often a perfunctory process and not an area that most programmers consider when trying to improve code efficiency. In .NET, however, an understanding of how literal strings are handled can help a developer improve program performance.

String Interning

One of the points of emphasis in Chapter 1, “Introduction to .NET and C#,” was to distinguish how value and reference types are stored in memory. Recall that value types are stored on a stack, whereas reference types are placed on a managed heap. It turns out that the CLR also sets aside a third area in memory called the intern pool, where it stores all the string literals during compilation. The purpose of this pool is to eliminate duplicate string values from being stored.

Consider the following code:

string poem1 = "Kubla Khan";
string poem2 = "Kubla Khan";
string poem3 = String.Copy(poem2); // Create new string object
string poem4 = "Christabel";

Figure 5-2 shows a simplified view of how the strings and their values are stored in memory.

Figure 5-2. String interning

The intern pool is implemented as a hash table. The hash table key is the actual string and its pointer references the associated string object on the managed heap. When the JIT compiler compiles the preceding code, it places the first instance of "Kubla Khan" (poem1) in the pool and creates a reference to the string object on the managed heap. When it encounters the second string reference to "Kubla Khan" (poem2), the CLR sees that the string already exists in memory and, instead of creating a new string, simply assigns poem2 to the same object as poem1. This process is known as string interning. Continuing with the example, the String.Copy method creates a new string poem3 and creates an object for it in the managed heap. Finally, the string literal associated with poem4 is added to the pool.

To examine the practical effects of string interning, let's extend the previous example. We add code that uses the equivalence (==) operator to compare string values and the Object.ReferenceEquals method to compare their addresses.

Console.WriteLine(poem1 == poem2);                // true
Console.WriteLine(poem1 == poem3);                // true
Console.WriteLine(ReferenceEquals(poem1, poem3)); // false
Console.WriteLine(ReferenceEquals(poem1,
                  "Kubla Khan"));                 // true

The first two statements compare the values of the variables and, as expected, return true. The third statement compares the memory locations of the variables poem1 and poem3. Because they reference different objects in the heap, a value of false is returned. The fourth statement returns true because the literal "Kubla Khan" resolves to the same interned object that poem1 references.

The .NET designers decided to exclude dynamically created values from the intern pool because checking the intern pool each time a string was created would hamper performance. However, they did include the String.Intern method as a way to selectively add dynamically created strings to the literal pool.

string khan = " Khan";
string poem5 = "Kubla" + khan;
Console.WriteLine(ReferenceEquals(poem5, poem1)); // false
// Place the contents of poem5 in the intern pool—if not there
poem5 = String.Intern(poem5);
Console.WriteLine(ReferenceEquals(poem5, poem1)); // true

The String.Intern method searches for the value of poem5 ("Kubla Khan") in the intern pool; because it is already in the pool, there is no need to add it. The method returns a reference to the already existing object (Object1) and assigns it to poem5. Because poem5 and poem1 now point to the same object, the comparison in the final statement is true. Note that the original object created for poem5 is released and swept up during the next Garbage Collection.

Core Recommendation

Use the String.Intern method to allow a string variable to take advantage of comparison by reference, but only if it is involved in numerous comparisons.

Overview of String Operations

The System.String class provides a large number of static and instance methods, most of which have several overload forms. For discussion purposes, they can be grouped into four major categories based on their primary function:

  • String Comparisons: The String.Equals, String.Compare, and String.CompareOrdinal methods offer different ways to compare string values. The choice depends on whether an ordinal or lexical comparison is needed, and whether case or culture should influence the operation.

  • Indexing and Searching: A string is an array of Unicode characters that may be searched by iterating through it as an array or by using special index methods to locate string values.

  • String Transformations: This is a catchall category that includes methods for inserting, padding, removing, replacing, trimming, and splitting character strings.

  • Formatting: .NET provides format specifiers that are used in conjunction with String.Format to represent numeric and DateTime values in a number of standard and custom formats.

Many of the string methods—particularly for formatting and comparisons—are culture dependent. Where applicable, we look at how culture affects the behavior of a method.

Comparing Strings

The most efficient way to determine if two string variables are equal is to see if they refer to the same memory address. We did this earlier using the ReferenceEquals method. If two variables do not share the same memory address, it is necessary to perform a character-by-character comparison of the respective values to determine their equality. This takes longer than comparing addresses, but is often unavoidable.

.NET attempts to optimize the process by providing the String.Equals method that performs both reference and value comparisons automatically. We can describe its operation in the following pseudo-code:

If string1 and string2 reference the same memory location
   Then strings must be equal
Else
   Compare strings character by character to determine equality

This code segment demonstrates the static and instance forms of the Equals method:

string poem1 = "Kubla Khan";
string poem2 = "Kubla Khan";
string poem3 = String.Copy(poem2);
string poem4 = "kubla khan";
//
Console.WriteLine(String.Equals(poem1,poem2));  // true
Console.WriteLine(poem1.Equals(poem3));         // true
Console.WriteLine(poem1 == poem3);      // equivalent to Equals
Console.WriteLine(poem1 == poem4);      // false – case differs

Note that the == operator, which calls the Equals method underneath, is a more convenient way of expressing the comparison.
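The pseudo-code given earlier for Equals can be sketched in C# as follows. This is an illustration of the idea only, not the framework's actual implementation; the class EqualsSketch and method ValueEquals are hypothetical names, and the length check is an extra shortcut that the pseudo-code does not mention.

```csharp
using System;

public static class EqualsSketch
{
    // Illustrative only: mirrors the pseudo-code, not the framework source.
    public static bool ValueEquals(string a, string b)
    {
        if (object.ReferenceEquals(a, b)) return true;  // same object (or both null)
        if (a is null || b is null) return false;       // only one is null
        if (a.Length != b.Length) return false;         // cheap rejection (added shortcut)

        for (int i = 0; i < a.Length; i++)              // character-by-character comparison
            if (a[i] != b[i]) return false;
        return true;
    }

    public static void Main()
    {
        string poem1 = "Kubla Khan";
        string poem3 = new string(poem1.ToCharArray());  // distinct object, same value
        Console.WriteLine(ValueEquals(poem1, poem1));         // True  (reference match)
        Console.WriteLine(ValueEquals(poem1, poem3));         // True  (character match)
        Console.WriteLine(ValueEquals(poem1, "kubla khan"));  // False (case differs)
    }
}
```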

Although the Equals method satisfies most comparison needs, it contains no overloads that allow it to take case sensitivity and culture into account. To address this shortcoming, the string class includes the Compare method.

Using String.Compare

String.Compare is a flexible comparison method that is used when culture or case must be taken into account. Its many overloads accept culture and case-sensitive parameters, as well as supporting substring comparisons.

Syntax:

int Compare (string str1, string str2)
int Compare (string str1, string str2, bool IgnoreCase)
int Compare (string str1, string str2, bool IgnoreCase,
             CultureInfo ci)
int Compare (string str1, int index1, string str2, int index2,
             int len)

Parameters:

str1 and str2

Specify strings to be compared.

IgnoreCase

Set true to make comparison case-insensitive (default is false).

index1 and index2

Starting position in str1 and str2.

ci

A CultureInfo object indicating the culture to be used.

Compare returns an integer value that indicates the results of the comparison. If the two strings are equal, a value of 0 is returned; if the first string is less than the second, a value less than zero is returned; if the first string is greater than the second, a value greater than zero is returned.

The following segment shows how to use Compare to make case-insensitive and case-sensitive comparisons:

int result;
string stringUpper = "AUTUMN";
string stringLower = "autumn";
// (1) Lexical comparison: "A" is greater than "a"
result = string.Compare(stringUpper,stringLower);       // 1
// (2) IgnoreCase set to false
result = string.Compare(stringUpper,stringLower,false); // 1
// (3) Perform case-insensitive comparison
result = string.Compare(stringUpper,stringLower,true);  // 0

Perhaps even more important than case is the potential effect of culture information on a comparison operation. .NET contains a list of comparison rules for each culture that it supports. When the Compare method is executed, the CLR checks the culture associated with it and applies the rules. The result is that two strings may compare differently on a computer with a US culture vis-à-vis one with a Japanese culture. There are cases where it may be important to override the current culture to ensure that the program behaves the same for all users. For example, it may be crucial that a sort operation order items exactly the same no matter where the application is run.

By default, the Compare method uses culture information based on the Thread.CurrentThread.CurrentCulture property. To override the default, supply a CultureInfo object as a parameter to the method. This statement shows how to create an object to represent the German language and country:

CultureInfo ci = new CultureInfo("de-DE");  // German culture

To explicitly specify a default culture or no culture, the CultureInfo class has two properties that can be passed as parameters—CurrentCulture, which tells a method to use the culture of the current thread, and InvariantCulture, which tells a method to ignore any culture.

Let's look at a concrete example of how culture differences affect the results of a Compare() operation.

using System.Globalization;   // Required for CultureInfo

// Perform case-insensitive comparisons under different cultures
string s1 = "circle";
string s2 = "chair";
int result = string.Compare(s1, s2,
         true, CultureInfo.CurrentCulture);        //  1
result = string.Compare(s1, s2,
         true, CultureInfo.InvariantCulture);      //  1
// Use the Czech culture
result = string.Compare(s1, s2,
         true, new CultureInfo("cs-CZ"));          //  -1

The string values "circle" and "chair" are compared using the current culture (assumed here to be US), no culture, and the Czech culture. The first two comparisons return a value indicating that "circle" > "chair", which is what you expect. However, the result using the Czech culture is the opposite of that obtained from the other comparisons. This is because the Czech alphabet treats "ch" as a single letter that sorts after "h", and therefore after the "c" that begins "circle".

Core Recommendation

When writing an application that takes culture into account, it is good practice to include an explicit CultureInfo parameter in those methods that accept such a parameter. This provides a measure of self-documentation that clarifies whether the specific method is subject to culture variation.
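As one way to follow this recommendation, a sort can be given an explicit culture so that its ordering does not depend on the machine it runs on. The sketch below is a minimal illustration; the class CultureSortDemo and the helper SortWith are hypothetical names, and StringComparer.Create is used to build a comparer for the chosen culture.

```csharp
using System;
using System.Globalization;

public static class CultureSortDemo
{
    // Hypothetical helper: sort a copy of the input using an
    // explicitly chosen culture rather than the thread's default.
    public static string[] SortWith(CultureInfo culture, params string[] words)
    {
        string[] copy = (string[])words.Clone();
        // Second argument is ignoreCase; false requests a case-sensitive comparer
        Array.Sort(copy, StringComparer.Create(culture, false));
        return copy;
    }

    public static void Main()
    {
        string[] sorted = SortWith(CultureInfo.InvariantCulture,
                                   "circle", "chair", "apple");
        Console.WriteLine(string.Join(", ", sorted));   // apple, chair, circle
    }
}
```

Passing a culture such as new CultureInfo("cs-CZ") instead may change the order of the results, depending on the culture data available on the host system.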

Using String.CompareOrdinal

To perform a comparison that is based strictly on the ordinal value of characters, use String.CompareOrdinal. Its simple algorithm compares the Unicode value of two strings and returns a value less than zero if the first string is less than the second; a value of zero if the strings are equal; and a value greater than zero if the first string is greater than the second. This code shows the difference between it and the Compare method:

string stringUpper = "AUTUMN";
string stringLower = "autumn";
//
result = string.Compare(stringUpper,stringLower,
         false, CultureInfo.InvariantCulture);            // 1
result = string.CompareOrdinal(stringUpper,stringLower);  // -32

Compare performs a lexical comparison that regards the uppercase string as greater than the lowercase. CompareOrdinal examines the underlying Unicode values. Because A (U+0041) is less than a (U+0061), the first string is less than the second.
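The idea behind an ordinal comparison can be sketched by comparing code unit values directly. This is an illustration under stated assumptions, not the framework's implementation: OrdinalSketch and CompareByCodeUnit are hypothetical names, the inputs are assumed to be non-null, and only the sign of the result should be relied upon.

```csharp
using System;

public static class OrdinalSketch
{
    // Hypothetical helper: compare two non-null strings by the raw
    // 16-bit values of their characters.
    public static int CompareByCodeUnit(string a, string b)
    {
        int shared = Math.Min(a.Length, b.Length);
        for (int i = 0; i < shared; i++)
        {
            if (a[i] != b[i])
                return a[i] - b[i];     // first differing position decides
        }
        return a.Length - b.Length;     // a shorter prefix sorts first
    }

    public static void Main()
    {
        Console.WriteLine(CompareByCodeUnit("AUTUMN", "autumn"));                 // -32
        Console.WriteLine(Math.Sign(string.CompareOrdinal("AUTUMN", "autumn")));  // -1
        Console.WriteLine(CompareByCodeUnit("Kubla", "Kubla Khan") < 0);          // True
    }
}
```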

Searching, Modifying, and Encoding a String's Content

This section describes string methods that are used to perform diverse but familiar tasks such as locating a substring within a string, changing the case of a string, replacing or removing text, splitting a string into delimited substrings, and trimming leading and trailing spaces.

Searching the Contents of a String

A string is an implicit zero-based array of chars that can be searched using the array syntax string[n], where n is a character position within the string. For locating a substring of one or more characters in a string, the string class offers the IndexOf and IndexOfAny methods. Table 5-2 summarizes these.

Table 5-2. Ways to Examine Characters Within a String

String Member

Description

[ n ]

Indexes a 16-bit character located at position n within a string.

int ndx= 0;
while (ndx < poem.Length)
{
   Console.Write(poem[ndx]); //Kubla Khan
   ndx += 1;
}

IndexOf/LastIndexOf (string, [int start], [int count])

Returns the index of the first/last occurrence of a specified string within an instance. Returns -1 if no match. The optional start is the position at which the search begins; count is the number of chars to examine.

string poem = "Kubla Khan";
int n = poem.IndexOf("la");  // 3
n = poem.IndexOf('K');       // 0
n = poem.IndexOf('K',4);     // 6

IndexOfAny/LastIndexOfAny

Returns the index of the first/last occurrence in the string of any character contained in an array of Unicode characters. Returns -1 if no match.

string poem = "Kubla Khan";
char[] vowels = new char[5]
      {'a', 'e', 'i', 'o', 'u'};
n = poem.IndexOfAny(vowels);     // 1
n = poem.LastIndexOfAny(vowels); // 8
n = poem.IndexOfAny(vowels,2);   // 4
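The start parameter of IndexOf makes it straightforward to locate every occurrence of a character rather than only the first. The following sketch shows one way to do this; SearchDemo and FindAll are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SearchDemo
{
    // Hypothetical helper: return the index of every occurrence
    // of a character within a string.
    public static List<int> FindAll(string text, char target)
    {
        var hits = new List<int>();
        int ndx = text.IndexOf(target);
        while (ndx != -1)
        {
            hits.Add(ndx);
            ndx = text.IndexOf(target, ndx + 1);   // resume just past the last hit
        }
        return hits;
    }

    public static void Main()
    {
        string poem = "Kubla Khan";
        Console.WriteLine(string.Join(", ", FindAll(poem, 'K')));   // 0, 6
        Console.WriteLine(string.Join(", ", FindAll(poem, 'a')));   // 4, 8
        Console.WriteLine(FindAll(poem, 'z').Count);                // 0
    }
}
```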

Searching a String That Contains Surrogates

All of these techniques assume that a string consists of a sequence of 16-bit characters. Suppose, however, that your application must work with a Far Eastern character set of 32-bit characters. These are represented in storage as a surrogate pair consisting of a high and low 16-bit value. Clearly, this presents a problem for an expression such as poem[ndx], which would return only half of a surrogate pair.

For applications that must work with surrogates, .NET provides the StringInfo class that treats all characters as text elements and can automatically detect whether a character is 16 bits or a surrogate. Its most important member is the GetTextElementEnumerator method, which returns an enumerator that can be used to iterate through text elements in a string.

TextElementEnumerator tEnum =
       StringInfo.GetTextElementEnumerator(poem) ;
while (tEnum.MoveNext())  // Step through the string
{
   Console.WriteLine(tEnum.Current);  // Print current char
}

Recall from the discussion of enumerators in Chapter 4, “Working with Objects in C#,” that MoveNext() and Current are members implemented by all enumerators.
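To see the difference between indexing by 16-bit position and iterating by text element, consider a string that actually contains a surrogate pair. The sketch below is illustrative; TextElementDemo, TextElements, and the sample code point U+1D11E are assumptions chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper: collect the text elements of a string,
    // keeping each surrogate pair together as one element.
    public static List<string> TextElements(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator tEnum = StringInfo.GetTextElementEnumerator(text);
        while (tEnum.MoveNext())
            elements.Add(tEnum.GetTextElement());
        return elements;
    }

    public static void Main()
    {
        string s = "a\U0001D11Eb";    // 'a', one character stored as a surrogate pair, 'b'
        Console.WriteLine(s.Length);                 // 4 (16-bit code units)
        Console.WriteLine(char.IsSurrogate(s, 1));   // True: s[1] is half of a pair
        Console.WriteLine(TextElements(s).Count);    // 3 (text elements)
        Console.WriteLine(TextElements(s)[1].Length);// 2 (both halves kept together)
    }
}
```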

String Transformations

Table 5-3 summarizes the most important string class methods for modifying a string. Because the original string is immutable, any string constructed by these methods is actually a new string with its own allocated memory.

Table 5-3. Methods for Manipulating and Transforming Strings

Method

Description

Insert (int, string)

Inserts a string at the specified position.

string mariner = "and he stoppeth three";
string verse = mariner.Insert(
      mariner.IndexOf(" three")," one of");
// verse --> "and he stoppeth one of three"

PadRight/PadLeft

Pads a string with a given character until it is a specified width. If no character is specified, whitespace is used.

string rem = "and so on";
rem = rem.PadRight(rem.Length+3,'.');
// rem --> "and so on..."

Remove(p , n)

Removes n characters beginning at position p.

string verse = "It is an Ancient Mariner";
string newverse = (verse.Remove(0,9));
// newverse --> "Ancient Mariner"

Replace (A , B)

Replaces all occurrences of A with B, where A and B are chars or strings.

string aString = "nap ace sap path";
string iString = aString.Replace('a','i');
// iString --> "nip ice sip pith"

Split( char[])

The char array contains delimiters that are used to break a string into substrings that are returned as elements in a string array.

string words = "red,blue orange ";
string [] split = words.Split(new Char []
                  {' ', ','});
Console.WriteLine(split[2]); // orange

ToUpper()
ToUpper(CultureInfo)
ToLower()
ToLower(CultureInfo)

Returns an upper- or lowercase copy of the string.

string poem2="Kubla Khan";
poem2= poem2.ToUpper(
      CultureInfo.InvariantCulture);

Trim()
Trim(params char[])

Removes all leading and trailing whitespaces. If a char array is provided, all leading and trailing characters in the array are removed.

string name = "  Samuel Coleridge";
name = name.Trim(); // "Samuel Coleridge"

TrimEnd (params char[])
TrimStart(params char[])

Removes all trailing (TrimEnd) or leading (TrimStart) characters specified in a char array. If null is specified, whitespaces are removed.

string name = "  Samuel Coleridge";
string trimName  = name.TrimStart(null);   // "Samuel Coleridge"
string shortName = trimName.TrimEnd('e','g','i');
// shortName --> "Samuel Colerid"

Substring(n)
Substring(n, l)

Extracts the string beginning at a specified position (n) and of length l, if specified.

string title="Kubla Khan";
Console.WriteLine(title.Substring(2,3));
//bla

ToCharArray()
ToCharArray(n, l)

Extracts characters from a string and places in an array of Unicode characters.

string myVowels = "aeiou";
char[] vowelArr;
vowelArr = myVowels.ToCharArray();
Console.WriteLine(vowelArr[1]);  // "e"

Most of these methods have analogues in other languages and behave as you would expect. Somewhat surprisingly, as we see in the next section, most of these methods are not available in the StringBuilder class. Only Replace, Remove, and Insert are included.
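Because each transformation returns a new string, the methods in Table 5-3 can be chained. The following sketch combines several of them to clean up a delimited line; TransformDemo, ParseFields, and the sample input are assumptions made for illustration.

```csharp
using System;
using System.Globalization;

public static class TransformDemo
{
    // Hypothetical helper: split a delimited line, trim each field,
    // and convert it to uppercase using the invariant culture.
    public static string[] ParseFields(string line)
    {
        string[] fields = line.Split(new char[] { ',', ';' });
        for (int i = 0; i < fields.Length; i++)
            fields[i] = fields[i].Trim().ToUpper(CultureInfo.InvariantCulture);
        return fields;
    }

    public static void Main()
    {
        string[] fields = ParseFields(" red, blue ;orange");
        Console.WriteLine(string.Join("|", fields));      // RED|BLUE|ORANGE
        Console.WriteLine(fields[0].PadLeft(6, '.'));     // ...RED
        Console.WriteLine(fields[2].Substring(0, 3));     // ORA
        Console.WriteLine(fields[1].Replace('U', 'A'));   // BLAE
    }
}
```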

String Encoding

Encoding comes into play when you need to convert between strings and bytes for operations such as writing a string to a file or streaming it across a network. Character encoding and decoding offer two major benefits: efficiency and interoperability. Most strings read in English consist of characters that can be represented by 8 bits. Encoding can be used to strip an extra byte (from the 16-bit Unicode memory representation) for transmission and storage. The flexibility of encoding is also important in allowing an application to interoperate with legacy data or third-party data encoded in different formats.

The .NET Framework supports many forms of character encoding and decoding. The most frequently used include the following:

  • UTF-8. Each character is encoded as a sequence of 1 to 4 bytes, based on its underlying value. ASCII-compatible characters are stored in 1 byte; characters between 0x0080 and 0x07ff are stored in 2 bytes; and characters having a value greater than or equal to 0x0800 are converted to 3 bytes. Surrogates are written as 4 bytes. UTF-8 (which stands for UCS Transformation Format, 8-bit form) is usually the default for .NET classes when no encoding is specified.

  • UTF-16. Each character is encoded as 2 bytes (except surrogates), which is how characters are represented internally in .NET. This is also referred to as Unicode encoding.

  • ASCII. Encodes each character as a single byte holding a 7-bit ASCII value. This should be used when all characters are in the ASCII range (0x00 to 0x7F). By default, a character outside of the ASCII range is replaced with a question mark (?).

Encoding and decoding are performed using the Encoding class found in the System.Text namespace. This abstract class has several static properties that return an object used to implement a specific encoding technique. These properties include ASCII, UTF8, and Unicode. The latter is used for UTF-16 encoding.

An encoding object offers several methods—each having several overloads—for converting between characters and bytes. Here is an example that illustrates two of the most useful methods: GetBytes, which converts a text string to bytes, and GetString, which reverses the process and converts a byte array to a string.

string text= "In Xanadu did Kubla Khan";
Encoding UTF8Encoder = Encoding.UTF8;
byte[] textChars = UTF8Encoder.GetBytes(text);
Console.WriteLine(textChars.Length);         // 24
// Store using UTF-16
textChars = Encoding.Unicode.GetBytes(text);
Console.WriteLine(textChars.Length);         // 48
// Treat characters as two bytes
string decodedText = Encoding.Unicode.GetString(textChars);
Console.WriteLine(decodedText); // "In Xanadu did ...  "

You can also instantiate the encoding objects directly. In this example, the UTF-8 object could be created with

UTF8Encoding UTF8Encoder = new UTF8Encoding();

With the exception of ASCIIEncoding, the constructor for these classes defines parameters that allow more control over the encoding process. For example, you can specify whether an exception is thrown when invalid encoding is detected.
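As a rough illustration of the constructor options just described, the following sketch builds a strict UTF-8 decoder that throws on invalid input. The class and method names (EncodingOptionsDemo, IsValidUtf8) are invented for this example; the two-argument UTF8Encoding constructor takes a byte-order-mark flag followed by a throw-on-invalid-bytes flag.

```csharp
using System;
using System.Text;

public static class EncodingOptionsDemo
{
    // Hypothetical helper: true if the bytes decode cleanly as UTF-8.
    public static bool IsValidUtf8(byte[] data)
    {
        // First argument: do not emit a byte order mark.
        // Second argument: throw instead of substituting when bytes are invalid.
        UTF8Encoding strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (ArgumentException)   // DecoderFallbackException derives from this
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] good = Encoding.UTF8.GetBytes("Kubla Khan");
        byte[] bad  = { 0x4B, 0xFF, 0x4B };   // 0xFF never occurs in valid UTF-8
        Console.WriteLine(IsValidUtf8(good));  // True
        Console.WriteLine(IsValidUtf8(bad));   // False
    }
}
```

A decoder created through the static Encoding.UTF8 property does not throw in this situation; it substitutes a replacement character instead, so the strict form is worth considering when corrupted input must be detected.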

StringBuilder

The primary drawback of strings is that memory must be allocated each time the contents of a string variable are changed. Suppose we create a loop that iterates 100 times and concatenates one character to a string during each iteration. We could end up with a hundred strings in memory, each differing from its preceding one by a single character.

The StringBuilder class addresses this problem by allocating a work area (buffer) where its methods can be applied to the string. These methods include ways to append, insert, remove, and replace characters. After the operations are complete, the ToString method is called to convert the buffer to a string that can be assigned to a string variable. Listing 5-1 introduces some of the StringBuilder methods in an example that creates a comma delimited list.

Example 5-1. Introduction to StringBuilder

using System;
using System.Text;
public class MyApp
{
   static void Main()
   {
      // Create comma delimited string with quotes around names
      string namesF = "Jan Donna Kim ";
      string namesM = "Rob James";
      StringBuilder sbCSV = new StringBuilder();
      sbCSV.Append(namesF).Append(namesM);
      sbCSV.Replace(" ","','");
      // Insert quote at beginning and end of string
      sbCSV.Insert(0,"'").Append("'");
      string csv = sbCSV.ToString();
      // csv = 'Jan','Donna','Kim','Rob','James'
   }
}

All operations occur in a single buffer and create no intermediate strings; a string is allocated only when ToString is called for the final assignment to csv. Let's take a formal look at the class and its members.

StringBuilder Class Overview

Constructors for the StringBuilder class accept an initial string value as well as integer values that specify the initial space allocated to the buffer (in characters) and the maximum space allowed.

// StringBuilder(initial value)
StringBuilder sb1 = new StringBuilder("abc");
// StringBuilder(initial value, initial capacity)
StringBuilder sb2 = new StringBuilder("abc", 16);
// StringBuilder(initial capacity, maximum capacity)
StringBuilder sb3 = new StringBuilder(32,128);
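The effect of the maximum capacity argument can be seen in a small sketch. The class and method names here (CapacityDemo, OverflowsMax) are made up for illustration; the point is simply that a builder refuses to grow beyond its MaxCapacity.

```csharp
using System;
using System.Text;

public static class CapacityDemo
{
    // Hypothetical helper: true if appending 'count' characters would
    // push the builder past its maximum capacity.
    public static bool OverflowsMax(int initial, int max, int count)
    {
        StringBuilder sb = new StringBuilder(initial, max);
        try
        {
            sb.Append(new string('x', count));
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            // Thrown when the buffer would have to exceed MaxCapacity
            return true;
        }
    }

    public static void Main()
    {
        StringBuilder sb = new StringBuilder(32, 128);
        Console.WriteLine(sb.MaxCapacity);              // 128
        Console.WriteLine(OverflowsMax(32, 128, 128));  // False
        Console.WriteLine(OverflowsMax(32, 128, 129));  // True
    }
}
```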

The idea behind StringBuilder is to use it as a buffer in which string operations are performed. Here is a sample of how its Append, Insert, Replace, and Remove methods work:

int i = 4;
char[] ch = {'w','h','i','t','e'};
string myColor = " orange";
StringBuilder sb = new StringBuilder("red blue green");
sb.Insert(0, ch);              // whitered blue green
sb.Insert(5," ");              // white red blue green
sb.Insert(0,i);                // 4white red blue green
sb.Remove(1,5);                // 4 red blue green
sb.Append(myColor);            // 4 red blue green orange
sb.Replace("blue","violet");   // 4 red violet green orange
string colors = sb.ToString();

StringBuilder Versus String Concatenation

Listing 5-2 tests the performance of StringBuilder versus the concatenation operator. The first part of this program uses the + operator to concatenate the letter a to a string in each of a loop's 50,000 iterations. The second half does the same, but uses the StringBuilder.Append method. The Environment.TickCount provides the beginning and ending time in milliseconds.

Example 5-2. Comparison of StringBuilder and Regular Concatenation

using System;
using System.Text;
public class MyApp
{
   static void Main()
   {
      Console.WriteLine("String routine");
      string a = "a";
      string str = string.Empty;
      int istart, istop;
      istart = Environment.TickCount;
      Console.WriteLine("Start: "+istart);
      // Use regular C# concatenation operator
      for(int i=0; i<50000; i++)
      {
         str += a;
      }
      istop = Environment.TickCount;
      Console.WriteLine("Stop: "+istop);
      Console.WriteLine("Difference: " + (istop-istart));
      // Perform concatenation with StringBuilder
      Console.WriteLine("StringBuilder routine");
      StringBuilder builder = new StringBuilder();
      istart = Environment.TickCount;
      Console.WriteLine("Start: "+istart);
      for(int i=0; i<50000; i++)
      {
         builder.Append(a);
      }
      istop = Environment.TickCount;
      str = builder.ToString();
      Console.WriteLine("Stop: "+istop);
      Console.WriteLine("Difference: "+ (istop-istart));
   }
}

Executing this program results in the following output:

String routine
Start: 1422091687
Stop: 1422100046
Difference: 9359
StringBuilder routine
Start: 1422100046
Stop: 1422100062
Difference: 16

The results clearly indicate the improved performance StringBuilder provides: The standard concatenation requires 9,359 milliseconds versus 16 milliseconds for StringBuilder. When tested with loops of 1,000 iterations, StringBuilder shows no significant advantage. Unless your application involves extensive text manipulation, the standard concatenation operator should be used.
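For readers who want to repeat the measurement, here is a compact variation that uses System.Diagnostics.Stopwatch in place of Environment.TickCount. It is only a sketch: the helper names (WithOperator, WithBuilder) and the iteration count are arbitrary choices, and the timings will differ from machine to machine, so no expected output is shown.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ConcatTiming
{
    // Builds a string of n copies of "a" with the + operator.
    public static string WithOperator(int n)
    {
        string s = string.Empty;
        for (int i = 0; i < n; i++)
            s += "a";
        return s;
    }

    // Builds the same string with StringBuilder.Append.
    public static string WithBuilder(int n)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++)
            sb.Append("a");
        return sb.ToString();
    }

    public static void Main()
    {
        const int n = 20000;   // arbitrary; raise it to widen the gap

        Stopwatch sw = Stopwatch.StartNew();
        string s1 = WithOperator(n);
        sw.Stop();
        Console.WriteLine("Operator:      " + sw.ElapsedMilliseconds + " ms");

        sw = Stopwatch.StartNew();
        string s2 = WithBuilder(n);
        sw.Stop();
        Console.WriteLine("StringBuilder: " + sw.ElapsedMilliseconds + " ms");

        Console.WriteLine(s1 == s2);   // True: both routines build the same text
    }
}
```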

Formatting Numeric and DateTime Values

The String.Format method is the primary means of formatting date and numeric data for display. It accepts a string composed of text and embedded format items followed by one or more data arguments. Each format item references a data argument and specifies how it is to be formatted. The CLR creates the output string by converting each data value to a string (using ToString), formatting it according to its corresponding format item, and then replacing the format item with the formatted data value. Here is a simple example:

String s= String.Format("The square root of {0} is {1}.",64,8);
// output: The square root of 64 is 8.

The method has several overloads, but this is the most common and illustrates two features common to all: a format string and a list of data arguments. Note that Console.WriteLine accepts the same parameters and can be used in place of String.Format for console output.

Constructing a Format Item

Figure 5-3 breaks down a String.Format example into its basic elements. The most interesting of these is the format item, which defines the way data is displayed.

Figure 5-3. String.Format example

As we can see, each format item consists of an index and an optional alignment and format string. All are enclosed in brace characters:

  1. The index is a zero-based integer that indicates the argument to which it is to be applied. The index can be repeated to refer to the same argument more than once.

  2. The optional alignment is an integer that indicates the minimum width of the area that contains the formatted value. If alignment value is positive, the argument value is right justified; if the value is negative, it is left justified.

  3. The optional format string contains the formatting codes to be applied to the argument value. If it is not specified, the output of the argument's ToString method is used. .NET provides several standard format codes to be used with numbers and dates as well as codes that are used to create custom format strings.
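The three parts of a format item can be seen working together in a short sketch. The class and method names (FormatItemDemo, Row, Powers) are invented for the example, and the invariant culture is passed explicitly so that the output does not depend on the machine's settings.

```csharp
using System;
using System.Globalization;

public static class FormatItemDemo
{
    // {0,-8}    : argument 0, left justified in a field at least 8 wide
    // {1,10:F2} : argument 1, right justified in 10, fixed point, 2 decimals
    public static string Row(string label, double amount)
    {
        return String.Format(CultureInfo.InvariantCulture,
                             "{0,-8}|{1,10:F2}|", label, amount);
    }

    // The same index may appear more than once in a format string.
    public static string Powers(int n)
    {
        return String.Format(CultureInfo.InvariantCulture,
                             "{0} squared is {1}; {0} cubed is {2}",
                             n, n * n, n * n * n);
    }

    public static void Main()
    {
        Console.WriteLine(Row("Total", 12.5));  // Total   |     12.50|
        Console.WriteLine(Powers(3));           // 3 squared is 9; 3 cubed is 27
    }
}
```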

Formatting Numeric Values

Nine characters, or format specifiers, are available to format numbers into currency, scientific, hexadecimal, and other representations. Each character can have an integer appended to it that specifies a precision particular to that format—usually this indicates the number of decimal places. C# recognizes the standard format specifiers[2] shown in Table 5-4.

Table 5-4. Formatting Numeric Values with Standard Numeric Format Strings

| Format Specifier | Description | Pattern | Output |
|---|---|---|---|
| C or c | Currency. Number is represented as a currency amount. The precision specifies the number of decimal places. | {0:C2}, 1458.75 | $1,458.75 |
| D or d | Decimal. Applies to integral values. The precision indicates the minimum number of digits; the number is padded with zeros on the left if necessary. | {0:D5}, 455 | 00455 |
| | | {0:D5}, -455 | -00455 |
| E or e | Scientific. The number is converted to scientific notation: d.dddE+nnn. The precision specifies the number of digits after the decimal point. | {0,10:E2}, 3298.78 | 3.30E+003 |
| | | {0,10:E4}, -54783.4 | -5.4783E+004 |
| F or f | Fixed Point. The number is converted to the format ddd.ddd. The precision indicates the number of decimal places. | {0,10:F0}, 162.57 | 163 |
| | | {0,10:F2}, 8162.57 | 8162.57 |
| G or g | General. The number is converted to fixed point or scientific notation based on the precision and type of number. Scientific is used if the exponent is greater than or equal to the specified precision or less than –4. | {0,10:G}, .0000099 | 9.9E-06 |
| | | {0,10:G2}, 455.89 | 4.6E+02 |
| | | {0,10:G3}, 455.89 | 456 |
| | | {0,10:G}, 783229.34 | 783229.34 |
| N or n | Number. Converts to a string that uses commas as thousands separators. The precision specifies the number of decimal places. | {0,10:N}, 1045.78 | 1,045.78 |
| | | {0,10:N1}, 45.98 | 46.0 |
| P or p | Percent. Number is multiplied by 100 and presented as a percent with the number of decimal places specified by precision. | {0,10:P}, 0.78 | 78.00 % |
| | | {0,10:P3}, 0.7865 | 78.650 % |
| R or r | Round-trip. Converts to a string that retains all decimal place accuracy. The number to be converted must be floating point. | {0,10:R}, 1.62736 | 1.62736 |
| X or x | Hexadecimal. Converts the number to its hex representation. The precision indicates the minimum number of digits shown. Number is padded with zeros if needed. | {0,10:X}, 25 | 19 |
| | | {0,10:X4}, 25 | 0019 |
| | | {0,10:x4}, 31 | 001f |

The patterns in this table can also be used directly with Console.Write and Console.WriteLine:

Console.WriteLine("The Hex value of {0} is {0:X} ",31); //1F

The format specifiers can be used alone to enhance output from the ToString method:

decimal pct = .758M;
Console.Write("The percent is "+pct.ToString("P2")); // 75.80 %

.NET also provides special formatting characters that can be used to create custom numeric formats. The most useful characters are pound sign (#), zero (0), comma (,), period (.), percent sign (%), and semi-colon (;). The following code demonstrates their use:

decimal dVal = 2145.88M;   // decimal values require M suffix
string myFormat;
myFormat = dVal.ToString("#####");           //  2146
myFormat = dVal.ToString("#,###.00");        //  2,145.88
myFormat = String.Format("Value is {0:#,###.00;(#,###.00)}", -4567);
// semicolon specifies alternate formats. (4,567.00)
myFormat = String.Format("Value is {0:$#,###.00}", 4567);
                                             //  $4,567.00
Console.WriteLine("{0:##.00%}",.18);         //  18.00%

The role of these characters should be self-explanatory except for the semicolon (;), which deserves further explanation. It separates the format into two groups: the first is applied to positive values and the second to negative. Two semicolons can be used to create three groups for positive, negative, and zero values, respectively.
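The three-group form can be sketched as follows. The class and method names (SectionFormatDemo, Accounting) and the choice of the word Zero for the third group are illustrative only; the invariant culture is used so that separators are predictable.

```csharp
using System;
using System.Globalization;

public static class SectionFormatDemo
{
    // Three groups separated by semicolons:
    //   positive ; negative ; zero
    public static string Accounting(decimal value)
    {
        return value.ToString("#,##0.00;(#,##0.00);Zero",
                              CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        Console.WriteLine(Accounting(1234.5M));  // 1,234.50
        Console.WriteLine(Accounting(-4567M));   // (4,567.00)
        Console.WriteLine(Accounting(0M));       // Zero
    }
}
```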

Formatting Dates and Time

Date formatting requires a DateTime object. As with numbers, this object has its own set of standard format specifiers. Table 5-5 summarizes these.

Table 5-5. Formatting Dates with Standard Characters

| Format Specifier | Description | Example (English) | Example (German) |
|---|---|---|---|
| d | Short date pattern | 1/19/2004 | 19.01.2004 |
| D | Long date pattern | Monday, January 19, 2004 | Montag, 19. Januar 2004 |
| f | Full date/time pattern (short time) | Monday, January 19, 2004 4:05 PM | Montag, 19. Januar 2004 16:05 |
| F | Full date/time pattern (full time) | Monday, January 19, 2004 4:05:20 PM | Montag, 19. Januar 2004 16:05:20 |
| g | General date/time pattern (short time) | 1/19/2004 4:05 PM | 19.01.2004 16:05 |
| G | General date/time pattern (long time) | 1/19/2004 4:05:20 PM | 19.01.2004 16:05:20 |
| M, m | Month day pattern | January 19 | 19 Januar |
| Y, y | Year month pattern | January, 2004 | Januar 2004 |
| t | Short time pattern | 4:05 PM | 16:05 |
| T | Long time pattern | 4:05:20 PM | 16:05:20 |
| s | Sortable date/time pattern. Conforms to ISO 8601. Uses local time. | 2004-01-19T16:05:20 | 2004-01-19T16:05:20 |
| u | Universal sortable date/time pattern | 2004-01-19 16:05:20Z | 2004-01-19 16:05:20Z |
| U | Universal full date/time pattern. Uses universal time. | Monday, January 19, 2004 9:05:20 PM | Montag, 19. Januar 2004 21:05:20 |

Here are some concrete examples that demonstrate date formatting. In each case, an instance of a DateTime object is passed as an argument to a format string.

DateTime curDate = DateTime.Now;  // Get Current Date
Console.WriteLine("Date: {0:d} ", curDate);   // 1/19/2004
// f: --> Monday, January 19, 2004 5:05 PM
Console.WriteLine("Date: {0:f} ", curDate);
// g: --> 1/19/2004 5:05 PM
Console.WriteLine("Date: {0:g} ", curDate);

If none of the standard format specifiers meet your need, you can construct a custom format from a set of character sequences designed for that purpose. Table 5-6 lists some of the more useful ones for formatting dates.

Table 5-6. Character Patterns for Custom Date Formatting

| Format | Description | Example |
|---|---|---|
| d | Day of month. No leading zero. | 5 |
| dd | Day of month. Always has two digits. | 05 |
| ddd | Day of week with three-character abbreviation. | Mon |
| dddd | Day of week full name. | Monday |
| M | Month number. No leading zero. | 1 |
| MM | Month number with leading zero if needed. | 01 |
| MMM | Month name with three-character abbreviation. | Jan |
| MMMM | Full name of month. | January |
| y | Year. Last one or two digits. | 5 |
| yy | Year. Last two digits, with leading zero if needed. | 05 |
| yyyy | Four-digit year. | 2004 |
| HH | Hour in 24-hour format. | 15 |
| mm | Minutes with leading zero if needed. | 20 |

Here are some examples of custom date formats:

DateTime curDate = DateTime.Now;
string f = String.Format("{0:dddd} {0:MMM} {0:dd}", curDate);
// output: Monday Jan 19

f = curDate.ToString("dd MMM yyyy");
// output: 19 Jan 2004

// The standard short date format (d) is equivalent to this:
Console.WriteLine(curDate.ToString("M/d/yyyy"));  // 1/19/2004
Console.WriteLine(curDate.ToString("d"));         // 1/19/2004

CultureInfo ci = new CultureInfo("de-DE");         // German
f = curDate.ToString("dd-MMMM-yyyy HH:mm", ci);
// output: 19-Januar-2004 23:07

ToString is recommended over String.Format for custom date formatting. It has a more convenient syntax for embedding blanks and special separators between the date elements; in addition, its second parameter is a culture indicator that makes it easy to test different cultures.

Dates and Culture

Dates are represented differently throughout the world, and the ability to add culture as a determinant in formatting dates shifts the burden to .NET from the developer. For example, if the culture on your system is German, dates are automatically formatted to reflect a European format: the day precedes the month; the day, month, and year are separated by periods (.) rather than slashes (/); and the phrase Monday, January 19 becomes Montag, 19. Januar. Here is an example that uses ToString with a German CultureInfo parameter:

CultureInfo ci = new CultureInfo("de-DE");       // German
Console.WriteLine(curDate.ToString("D",ci));
// output ---> Montag, 19. Januar 2004
Console.WriteLine(curDate.ToString("dddd",ci));  // -->Montag

The last statement uses the special custom format "dddd" to print the day of the week. This is favored over the DateTime.DayOfWeek enum property, which returns only an English value.
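The difference between a culture-aware custom format and the DayOfWeek property can be sketched with a fixed date. The class and method names (DateFormatDemo, Stamp, DayName) are hypothetical, and the invariant culture stands in for a specific culture so that the result is the same on any machine.

```csharp
using System;
using System.Globalization;

public static class DateFormatDemo
{
    // Custom pattern applied with an explicit culture argument.
    public static string Stamp(DateTime d, CultureInfo culture)
    {
        return d.ToString("dd MMM yyyy HH:mm", culture);
    }

    // "dddd" yields the day name in the language of the supplied culture.
    public static string DayName(DateTime d, CultureInfo culture)
    {
        return d.ToString("dddd", culture);
    }

    public static void Main()
    {
        DateTime d = new DateTime(2004, 1, 19, 16, 5, 20);
        CultureInfo inv = CultureInfo.InvariantCulture;

        Console.WriteLine(Stamp(d, inv));    // 19 Jan 2004 16:05
        Console.WriteLine(DayName(d, inv));  // Monday
        Console.WriteLine(d.DayOfWeek);      // Monday (enum name, never localized)
    }
}
```

Passing a CultureInfo created for "de-DE" to DayName would return the German day name, provided that culture data is available on the host system.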

NumberFormatInfo and DateTimeFormatInfo Classes

These two classes govern how the previously described format patterns are applied to dates and numbers. For example, the NumberFormatInfo class includes properties that specify the character to be used as a currency symbol, the character to be used as a decimal separator, and the number of decimal digits to use when displaying a currency value. Similarly, DateTimeFormatInfo defines properties that correspond to virtually all of the standard format specifiers for dates. One example is the FullDateTimePattern property that defines the pattern returned when the character F is used to format a date.

NumberFormatInfo and DateTimeFormatInfo are associated with specific cultures, and their properties are the means for creating the unique formats required by different cultures. .NET provides a predefined set of property values for each culture, but they can be overridden.

Their properties are accessed in different ways depending on whether the current or non-current culture is being referenced (current culture is the culture associated with the current thread). The following statements reference the current culture:

NumberFormatInfo.CurrentInfo.<property>
CultureInfo.CurrentCulture.NumberFormat.<property>

The first statement uses the static property CurrentInfo and implicitly uses the current culture. The second statement specifies a culture explicitly (CurrentCulture) and is suited for accessing properties associated with a non-current CultureInfo instance.

CultureInfo ci = new CultureInfo("de-DE");
string f = ci.NumberFormat.CurrencySymbol;

NumberFormatInfo and DateTimeFormatInfo properties associated with a non-current culture can be changed; those associated with the current thread are read-only. Listing 5-3 offers a sampling of how to work with these classes.

Example 5-3. Using NumberFormatInfo and DateTimeFormatInfo

using System;
using System.Globalization;
class MyApp
{
   static void Main()
   {
      // NumberFormatInfo
      string curSym = NumberFormatInfo.CurrentInfo.CurrencySymbol;
      int dd  = NumberFormatInfo.CurrentInfo.CurrencyDecimalDigits;
      int pdd = NumberFormatInfo.CurrentInfo.PercentDecimalDigits;
      // --> curSym = "$"   dd = 2  pdd = 2
      // DateTimeFormatInfo
      string ldp= DateTimeFormatInfo.CurrentInfo.LongDatePattern;
      // --> ldp = "dddd, MMMM dd, yyyy"
      string enDay = DateTimeFormatInfo.CurrentInfo.DayNames[1];
      string month = DateTimeFormatInfo.CurrentInfo.MonthNames[1];
      CultureInfo ci = new CultureInfo("de-DE");
      string deDay = ci.DateTimeFormat.DayNames[1];
      // --> enDay = "Monday"  month = February  deDay = "Montag"
      // Change the default number of decimal places
      // in a percentage
      decimal passRate = .840M;
      Console.Write(passRate.ToString("p",ci));  // 84,00%
      ci.NumberFormat.PercentDecimalDigits = 1;
      Console.Write(passRate.ToString("p",ci));  // 84,0%
   }
}

In summary, .NET offers a variety of standard patterns that satisfy most needs to format dates and numbers. Behind the scenes, there are two classes, NumberFormatInfo and DateTimeFormatInfo, that define the symbols and rules used for formatting. .NET provides each culture with its own set of properties associated with an instance of these classes.
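One way to override the predefined values without touching any culture is to work on a copy. The sketch below clones the invariant culture's NumberFormatInfo, which is read-only, into a writable instance. The class name, the method name, and the "EUR " symbol are arbitrary choices made for this illustration.

```csharp
using System;
using System.Globalization;

public static class CustomNumberFormat
{
    // Returns a writable copy of the invariant number format
    // with two properties overridden.
    public static NumberFormatInfo Make()
    {
        NumberFormatInfo nfi =
            (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        nfi.CurrencySymbol = "EUR ";    // arbitrary symbol for the example
        nfi.PercentDecimalDigits = 1;   // one decimal place for "P"
        return nfi;
    }

    public static void Main()
    {
        NumberFormatInfo nfi = Make();
        Console.WriteLine(1234.5M.ToString("C", nfi));  // EUR 1,234.50
        Console.WriteLine(0.84M.ToString("P", nfi));    // 84.0 %
    }
}
```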

Regular Expressions

The use of strings and expressions to perform pattern matching dates from the earliest programming languages. In the mid-1960s SNOBOL was designed for the express purpose of text and string manipulation. It influenced the subsequent development of the grep tool in the Unix environment that makes extensive use of regular expressions. Those who have worked with grep or Perl or other scripting languages will recognize the similarity in the .NET implementation of regular expressions.

Pattern matching is based on the simple concept of applying a special pattern string to some text source in order to match an instance or instances of that pattern within the text. The pattern applied against the text is referred to as a regular expression, or regex, for short.

Entire books have been devoted to the topic of regular expressions. This section is intended to provide the essential knowledge required to get you started using regular expressions in the .NET world. The focus is on using the Regex class, and creating regular expressions from the set of characters and symbols available for that purpose.

The Regex Class

You can think of the Regex class as the engine that evaluates regular expressions and applies them to target strings. It provides both static and instance methods that use regexes for text searching, extraction, and replacement. The Regex class and all related classes are found in the System.Text.RegularExpressions namespace.

Syntax:

Regex( string pattern )
Regex( string pattern, RegexOptions)

Parameters:

Pattern

Regular expression used for pattern matching.

RegexOptions

An enum whose values control how the regex is applied. Values include:

CultureInvariant. Ignore culture.

IgnoreCase. Ignore upper- or lowercase.

RightToLeft. Process string right to left.

Example:

Regex r1 = new Regex(" ");    // Regular expression is a blank
string[] words = r1.Split("red blue orange yellow");
// Regular expression matches upper- or lowercase "at"
Regex r2 = new Regex("at", RegexOptions.IgnoreCase);

As the example shows, creating a Regex object is quite simple. The first parameter to its constructor is a regular expression. The optional second parameter is one or more (separated by |) RegexOptions enum values that control how the regex is applied.
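Combining options with the | operator might look like the following sketch, which searches without regard to case and starts from the right end of the string. The class and method names (RegexOptionsDemo, LastIndexOfAt) are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexOptionsDemo
{
    // Returns the position of the rightmost "at" (any case), or -1.
    public static int LastIndexOfAt(string input)
    {
        Regex r = new Regex("at",
            RegexOptions.IgnoreCase | RegexOptions.RightToLeft);
        Match m = r.Match(input);   // with RightToLeft, the search starts at the end
        return m.Success ? m.Index : -1;
    }

    public static void Main()
    {
        Console.WriteLine(LastIndexOfAt("That CAT"));  // 6
        Console.WriteLine(LastIndexOfAt("dog"));       // -1
    }
}
```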

Regex Methods

The Regex class contains a number of methods for pattern matching and text manipulation. These include IsMatch, Replace, Split, Match, and Matches. All have instance and static overloads that are similar, but not identical.

Core Recommendation

If you plan to use a regular expression repeatedly, it is more efficient to create a Regex object. When the object is created, it compiles the expression into a form that can be used as long as the object exists. In contrast, static methods must process the expression again on each call unless it is found in the small internal cache that .NET keeps for recently used static patterns.

Let's now examine some of the more important Regex methods. We'll keep the regular expressions simple for now because the emphasis at this stage is on understanding the methods—not regular expressions.

IsMatch()

This method matches the regular expression against an input string and returns a boolean value indicating whether a match is found.

string searchStr = "He went that a way";
Regex myRegex = new Regex("at");
// instance methods
bool match = myRegex.IsMatch(searchStr);         // true
// Begin search at position 12 in the string
match = myRegex.IsMatch(searchStr,12);           // false
// Static Methods – both return true
match = Regex.IsMatch(searchStr,"at");
match = Regex.IsMatch(searchStr,"AT",RegexOptions.IgnoreCase);

Replace()

This method returns a string that replaces occurrences of a matched pattern with a specified replacement string. This method has several overloads that permit you to specify a start position for the search or control how many replacements are made.

Syntax:

static Replace (string input, string pattern, string replacement
                [,RegexOptions])

Replace(string input, string replacement)
Replace(string input, string replacement, int count)
Replace(string input, string replacement, int count, int startat)

The count parameter denotes the maximum number of replacements; startat indicates where in the string to begin the matching process. There are also versions of this method (which you may want to explore further) that accept a MatchEvaluator delegate parameter. This delegate is called each time a match is found and can be used to customize the replacement process.

Here is a code segment that illustrates the static and instance forms of the method:

string newStr;
newStr = Regex.Replace("soft rose","o","i");   // sift rise
// instance method
Regex myRegex = new Regex("o");                // regex = "o"
// Now specify that only one replacement may occur
newStr = myRegex.Replace("soft rose","i",1);   // sift rose
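The MatchEvaluator form of Replace mentioned earlier can be sketched briefly. In this made-up example, each matched "o" is replaced by its position in the input string; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EvaluatorDemo
{
    // Called once per match; the returned string replaces the match.
    private static string PositionOf(Match m)
    {
        return m.Index.ToString();
    }

    public static string MarkPositions(string input)
    {
        return Regex.Replace(input, "o", new MatchEvaluator(PositionOf));
    }

    public static void Main()
    {
        Console.WriteLine(MarkPositions("soft rose"));  // s1ft r6se
    }
}
```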

Split()

This method splits a string at each point a match occurs and places the substrings between the matches in an array. It is similar to the String.Split method, except that the match is based on a regular expression rather than a character or character string.

Syntax:

String[] Split(string input)
String[] Split(string input, int count)
String[] Split(string input, int count, int startat)
static String[] Split(string input, string pattern)

Parameters:

input

The string to split.

count

The maximum number of array elements to return. A count value of 0 results in as many matches as possible. If the number of matches is greater than count, the last match consists of the remainder of the string.

startat

The character position in input where the search begins.

pattern

The regex pattern to be matched against the input string.

This short example parses a string consisting of a list of artists' last names and places them in an array. A comma followed by zero or more blanks separates the names. The regular expression to match this delimiter string is: ",[ ]*". You will see how to construct this later in the section.

string impressionists = "Manet,Monet, Degas, Pissarro,Sisley";
// Regex to match a comma followed by 0 or more spaces
string patt = @",[ ]*";
// Static method
string[] artists = Regex.Split(impressionists, patt);
// Instance method is used to return a maximum of four elements
Regex myRegex = new Regex(patt);
string[] artists4 = myRegex.Split(impressionists, 4);
foreach (string master in artists4)
   Console.Write(master);
// artists4 --> "Manet" "Monet" "Degas" "Pissarro,Sisley"

Match() and Matches()

These related methods search an input string for a match to the regular expression. Match() returns a single Match object and Matches() returns the object MatchCollection, a collection of all matches.

Syntax:

Match Match(string input)
Match Match(string input, int startat)
Match Match(string input, int startat, int numchars)
static Match Match(string input, string pattern [, RegexOptions])

The Matches method has similar overloads but returns a MatchCollection object.

Match and Matches are the most useful Regex methods. The Match object they return is rich in properties that expose the matched string, its length, and its location within the target string. It also includes a Groups property that allows the matched string to be further broken down into matching substrings. Table 5-7 shows selected members of the Match class.

Table 5-7. Selected Members of the Match Class

| Member | Description |
|---|---|
| Index | Property returning the position in the string where the first character of the match is found. |
| Groups | A collection of groups within the match. Groups are created by enclosing sections of the regex in parentheses. The text that matches the pattern in parentheses is placed in the Groups collection. |
| Length | Length of the matched string. |
| Success | True or False depending on whether a match was found. |
| Value | Returns the matching substring. |
| NextMatch() | Returns a new Match with the results from the next match operation, beginning with the character after the previous match, if any. |

The following code demonstrates the use of these class members. Note that the dot (.) in the regular expression functions as a wildcard character that matches any single character.

string verse = "In Xanadu did Kubla Khan";
string patt = ".an...";       // "." matches any character
Match verseMatch = Regex.Match(verse, patt);
Console.WriteLine(verseMatch.Value);  // Xanadu
Console.WriteLine(verseMatch.Index);  // 3
//
string newPatt = "K(..)";             //contains group(..)
Match kMatch = Regex.Match(verse, newPatt);
while (kMatch.Success) {
   Console.Write(kMatch.Value);       // -->Kub -->Kha
   Console.Write(kMatch.Groups[1]);   // -->ub  -->ha
   kMatch = kMatch.NextMatch();
}

This example uses NextMatch to iterate through the target string and assign each match to kMatch (if NextMatch is left out, an infinite loop results). The parentheses surrounding the two dots in newPatt break the pattern into groups without affecting the actual pattern matching. In this example, the two characters after K are assigned to group objects that are accessed in the Groups collection.

Sometimes, an application may need to collect all of the matches before processing them, which is the purpose of the MatchCollection class. This class is just a container for holding Match objects and is created using the Regex.Matches method discussed earlier. Its most useful properties are Count, which returns the number of matches, and Item, which returns an individual member of the collection. Here is how the NextMatch loop in the previous example could be rewritten:

string verse = "In Xanadu did Kubla Khan";
String newpatt = "K(..)";
foreach (Match kMatch in Regex.Matches(verse, newpatt))
   Console.Write(kMatch.Value);  // -->Kub  -->Kha
// Could also create explicit collection and work with it.
MatchCollection mc = Regex.Matches(verse, newpatt);
Console.WriteLine(mc.Count);     // 2
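In C#, the Item property is reached through indexer syntax. The following sketch walks a MatchCollection by position and records each match together with its index; the class and method names (MatchCollectionDemo, Describe) are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchCollectionDemo
{
    // Returns "value@index" for every match of pattern in input.
    public static string[] Describe(string input, string pattern)
    {
        MatchCollection mc = Regex.Matches(input, pattern);
        string[] result = new string[mc.Count];
        for (int i = 0; i < mc.Count; i++)
        {
            Match m = mc[i];                     // Item property via the indexer
            result[i] = m.Value + "@" + m.Index;
        }
        return result;
    }

    public static void Main()
    {
        foreach (string s in Describe("In Xanadu did Kubla Khan", "K(..)"))
            Console.WriteLine(s);
        // Kub@14
        // Kha@20
    }
}
```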

Creating Regular Expressions

The examples used to illustrate the Regex methods have employed only rudimentary regular expressions. Now, let's explore how to create regular expressions that are genuinely useful. If you are new to the subject, you will discover that designing Regex patterns tends to be a trial-and-error process; and the endeavor can yield a solution of simple elegance—or maddening complexity. Fortunately, almost all of the commonly used patterns can be found on one of the Web sites that maintain a searchable library of Regex patterns (www.regexlib.com is one such site).

A regular expression can be broken down into four different types of metacharacters that have their own role in the matching process:

  • Matching characters. These match a specific type of character; for example, \d matches any digit from 0 to 9.

  • Repetition characters. Used to prevent having to repeat a matching character or item; for example, \d{3} can be used instead of \d\d\d to match three digits.

  • Positional characters. Designate the location in the target string where a match must occur; for example, ^\d{3} requires that the match occur at the beginning of the string.

  • Escape sequences. Use the backslash (\) in front of characters that otherwise have special meaning; for example, \} permits the right brace to be matched.
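The four kinds of metacharacters can be tried out with a few one-line checks. This is a minimal sketch; the class and method names are invented, and verbatim strings (@"...") are used so that the backslashes reach the regex engine unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetacharDemo
{
    // Positional (^), matching (\d), and repetition ({3}) combined
    public static bool StartsWithThreeDigits(string s)
    {
        return Regex.IsMatch(s, @"^\d{3}");
    }

    // Matching and repetition only: first run of three digits anywhere
    public static string FirstThreeDigits(string s)
    {
        return Regex.Match(s, @"\d{3}").Value;
    }

    // Escape sequence: \} matches a literal right brace
    public static bool HasRightBrace(string s)
    {
        return Regex.IsMatch(s, @"\}");
    }

    public static void Main()
    {
        Console.WriteLine(StartsWithThreeDigits("555-1234"));  // True
        Console.WriteLine(StartsWithThreeDigits("call 555"));  // False
        Console.WriteLine(FirstThreeDigits("ab1234"));         // 123
        Console.WriteLine(HasRightBrace("a}b"));               // True
    }
}
```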

Table 5-8 summarizes the most frequently used patterns.

Table 5-8. Regular Expression Patterns

Pattern

Matching Criterion

Example

+

Match one or more occurrences of the previous item.

to+ matches too and tooo. It does not match t.

*

Match zero or more occurrences of the previous item.

to* matches t or too or tooo.

?

Match zero or one occurrence of the previous item. When it follows another quantifier (as in +? or *?), it makes that quantifier perform “non-greedy” matching.

te?n matches ten or tn. It does not match teen.

{n}

Match exactly n occurrences of the previous character.

te{2}n matches teen. It does not match ten or teeen.

{n,}

Match at least n occurrences of the previous character.

te{1,}n matches ten and teen. It does not match tn.

{n,m}

Match at least n and no more than m occurrences of the previous character.

te{1,2}n matches ten and teen.

\

Treat the next character literally. Used to match characters that have special meaning such as the patterns +, *, and ?.

A\+B matches A+B. The backslash (\) is required because + has special meaning.

\d \D

Match any digit (\d) or non-digit (\D). This is equivalent to [0-9] or [^0-9], respectively.

\d\d matches 55.

\D\D matches xx.

\w \W

Match any word character, including underscore (\w), or non-word character (\W). \w is equivalent to [a-zA-Z0-9_]. \W is equivalent to [^a-zA-Z0-9_].

\w\w\w\w matches A_19.

\W\W\W matches ($).

\n \r \t \v \f

Match newline, carriage return, tab, vertical tab, or form feed, respectively.

N/A

\s \S

Match any whitespace (\s) or non-whitespace (\S). A whitespace is usually a space or tab character.

\w\s\w\s\w matches A B C.

. (dot)

Matches any single character. Does not match a newline.

a.c matches abc.

It does not match abcc.

|

Logical OR.

"in|en" matches enquiry.

[. . . ]

Match any single character between the brackets. Hyphens may be used to indicate a range.

[aeiou] matches u. [\d\D] matches a single digit or non-digit.

[^. . .]

All characters except those in the brackets.

[^aeiou] matches x.

A Pattern Matching Example

Let's apply these character patterns to create a regular expression that matches a Social Security Number (SSN):

bool iMatch = Regex.IsMatch("245-09-8444",
                            @"\d\d\d-\d\d-\d\d\d\d");

This is the most straightforward approach: Each character in the Social Security Number matches a corresponding pattern in the regular expression. It's easy to see, however, that simply repeating symbols can become unwieldy if a long string is to be matched. Repetition characters improve this:

bool iMatch = Regex.IsMatch("245-09-8444",
                            @"\d{3}-\d{2}-\d{4}");

Another consideration in matching the Social Security Number may be to restrict where it exists in the text. You may want to ensure it is on a line by itself, or at the beginning or end of a line. This requires using position characters at the beginning or end of the matching sequence.

Let's alter the pattern so that it matches only if the Social Security Number exists by itself on the line. To do this, we need two characters: one to ensure the match is at the beginning of the line, and one to ensure that it is also at the end. According to Table 5-9, ^ and $ can be placed around the expression to meet these criteria. The new string is

@"^\d{3}-\d{2}-\d{4}$"

Table 5-9. Characters That Specify Where a Match Must Occur

Position Character

Description

^

Following pattern must be at the start of a string or line.

$

Preceding pattern must be at end of a string or line.

\A

Following pattern must be at the start of a string.

\b \B

Move to a word boundary (\b), where a word character and non-word character meet, or a non-word boundary (\B).

\z \Z

Pattern must be at the end of a string (\z) or at the end of a string before a newline (\Z).

These positional characters do not take up any space in the expression—that is, they indicate where matching may occur but are not involved in the actual matching process.
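Whether ^ and $ anchor to the whole string or to each line depends on the options passed to the Regex methods. The following sketch illustrates the difference using RegexOptions.Multiline; the class and method names are hypothetical, and the input assumes lines separated by a single \n character:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    private const string SsnPattern = @"^\d{3}-\d{2}-\d{4}$";

    // Counts lines consisting only of an SSN; options decides what ^ and $ mean
    public static int CountSsnLines(string text, RegexOptions options)
    {
        return Regex.Matches(text, SsnPattern, options).Count;
    }

    public static void Main()
    {
        string text = "245-09-8444\n111-22-3333";   // Two lines
        // Default: ^ and $ anchor to the entire string, so nothing matches
        Console.WriteLine(CountSsnLines(text, RegexOptions.None));       // 0
        // Multiline: ^ and $ anchor to the start and end of each line
        Console.WriteLine(CountSsnLines(text, RegexOptions.Multiline));  // 2
    }
}
```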

As a final refinement to the SSN pattern, let's break it into groups so that the three sets of numbers separated by dashes can be easily examined. To create a group, place parentheses around the parts of the expression that you want to examine independently. Here is a simple code example that uses the revised pattern:

string ssn = "245-09-8444";
string ssnPatt = @"^(\d{3})-(\d{2})-(\d{4})$";
Match ssnMatch = Regex.Match(ssn, ssnPatt);
if (ssnMatch.Success){
   Console.WriteLine(ssnMatch.Value);         // 245-09-8444
   Console.WriteLine(ssnMatch.Groups.Count);  // 4
   // Count is 4 since Groups[0] is set to entire SSN
   Console.Write(ssnMatch.Groups[1]);         // 245
   Console.Write(ssnMatch.Groups[2]);         // 09
   Console.Write(ssnMatch.Groups[3]);         // 8444
}

We now have a useful pattern that incorporates position, repetition, and group characters. The approach used to create this pattern, starting with an obvious pattern and refining it through multiple stages, is a useful way to create complex regular expressions (see Figure 5-4).

Figure 5-4. Regular expression

Working with Groups

As we saw in the preceding example, the text resulting from a match can be automatically partitioned into substrings or groups by enclosing sections of the regular expression in parentheses. The text that matches the enclosed pattern becomes a member of the Match.Groups[] collection. This collection can be indexed as a zero-based array: the 0 element is the entire match, element 1 is the first group, element 2 the second, and so on.

Groups can be named to make them easier to work with. The name designator is placed adjacent to the opening parenthesis using the syntax ?<name>. To demonstrate the use of groups, let's suppose we need to parse a string containing the forecasted temperatures for the week (for brevity, only two days are included):

string txt ="Monday Hi:88 Lo:56 Tuesday Hi:91 Lo:61";

The regex to match this includes two groups: day and temps. The following code creates a collection of matches and then iterates through the collection, printing the content of each group:

string rgPatt = @"(?<day>[a-zA-Z]+)\s*(?<temps>Hi:\d+\s*Lo:\d+)";
MatchCollection mc = Regex.Matches(txt, rgPatt); //Get matches
foreach(Match m in mc)
{
   Console.WriteLine("{0} {1}",
                     m.Groups["day"],m.Groups["temps"]);
}
//Output:   Monday Hi:88 Lo:56
//          Tuesday Hi:91 Lo:61

Core Note

There are times when you do not want the presence of parentheses to designate a group that captures a match. A common example is the use of parentheses to create an OR expression—for example, (an|in|on). To make this a non-capturing group, place ?: inside the parentheses—for example, (?:an|in|on).
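The effect of ?: is easy to observe by comparing the size of the Groups collection for the two forms of the pattern. This is a small sketch; the class name, method name, and sample text are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupCountDemo
{
    // Returns the number of entries in Match.Groups for the given pattern
    public static int GroupCount(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups.Count;
    }

    public static void Main()
    {
        // Capturing: Groups[0] is the entire match, Groups[1] the group
        Console.WriteLine(GroupCount("sit on it", @"(an|in|on)"));    // 2
        // Non-capturing: only Groups[0] exists
        Console.WriteLine(GroupCount("sit on it", @"(?:an|in|on)"));  // 1
    }
}
```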

Backreferencing a Group

It is often useful to create a regular expression that includes matching logic based on the results of previous matches within the expression. For example, during a grammatical check, word processors flag any word that is a repeat of the preceding word(s). We can create a regular expression to perform the same operation. The secret is to define a group that matches a word and then uses the matched value as part of the pattern. To illustrate, consider the following code:

string speech = "Four score and and seven years";
string patt = @"(\b[a-zA-Z]+\b)\s\1";    // Match repeated words
MatchCollection mc = Regex.Matches(speech, patt);
foreach(Match m in mc) {
      Console.WriteLine(m.Groups[1]);   // --> and
}

This code matches only the repeated words. Let's examine the regular expression:

Text/Pattern

Description

and and
@"(\b[a-zA-Z]+\b)\s

Matches a word bounded on each side by a word boundary (\b) and followed by a whitespace.

and and
\1

The backreference indicator. Any group can be referenced with a backslash (\) followed by the group number. The effect is to insert the group's matched value into the expression.

A group can also be referenced by name rather than number. The syntax for this backreference is \k followed by the group name enclosed in <>:

patt = @"(?<word>\b[a-zA-Z]+\b)\s\k<word>";
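A named backreference pairs naturally with Regex.Replace, where the group can be referenced in the replacement string as ${name}. The following sketch goes one step beyond flagging repeated words and removes them. The class name, method name, and the variation on the pattern (a non-capturing group that allows the word to repeat more than once) are assumptions made for this illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DuplicateWordDemo
{
    // Collapses a run of repeated words ("and and") into a single word
    public static string RemoveRepeats(string text)
    {
        // (?<word>...)   names the group that captures a word
        // \k<word>       requires the same text to appear again
        // (?: ... )+     lets the repetition occur one or more times
        string patt = @"\b(?<word>[a-zA-Z]+)(?:\s+\k<word>\b)+";
        return Regex.Replace(text, patt, "${word}");
    }

    public static void Main()
    {
        Console.WriteLine(RemoveRepeats("Four score and and seven years"));
        // Four score and seven years
    }
}
```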

Examples of Using Regular Expressions

This section closes with a quick look at some patterns that can be used to handle common pattern matching challenges. Two things should be clear from these examples: There are virtually unlimited ways to create expressions to solve a single problem, and many pattern matching problems involve nuances that are not immediately obvious.

Using Replace to Reverse Words

string userName = "Claudel, Camille";
userName = Regex.Replace( userName, @"(\w+),\s*(\w+)", "$2 $1" );
Console.WriteLine(userName);   // Camille Claudel

The regular expression assigns the last and first name to groups 1 and 2. The third parameter in the Replace method allows these groups to be referenced by placing $ in front of the group number. In this case, the effect is to replace the entire matched name with the match from group 2 (first name) followed by the match from group 1 (last name).

Parsing Numbers

String myText = "98, 98.0, +98.0, +98";
string numPatt = @"\d+";                           // Integer
numPatt = @"(\d+\.?\d*)|(\.\d+)";                  // Allow decimal
numPatt = @"([+-]?\d+\.?\d*)|([+-]?\.\d+)";        // Allow + or -

Note the use of the OR (|) symbol in the third line of code to offer alternate patterns. In this case, it permits an optional number before the decimal.
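To check what the final pattern actually extracts, it can be applied with Regex.Matches. In this sketch the class and method names are hypothetical, and a fifth value (-.5) has been added to the sample text to exercise the second alternative:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberParseDemo
{
    public const string NumPatt = @"([+-]?\d+\.?\d*)|([+-]?\.\d+)";

    // Returns the text of every number found in the input
    public static string[] FindNumbers(string text)
    {
        MatchCollection mc = Regex.Matches(text, NumPatt);
        string[] found = new string[mc.Count];
        for (int i = 0; i < mc.Count; i++)
            found[i] = mc[i].Value;
        return found;
    }

    public static void Main()
    {
        string[] nums = FindNumbers("98, 98.0, +98.0, +98, -.5");
        Console.WriteLine(string.Join(" | ", nums));
        // 98 | 98.0 | +98.0 | +98 | -.5
    }
}
```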

The following code uses the ^ character to anchor the pattern to the beginning of the line. The regular expression contains a group that matches four hex characters at a time. The * character causes the group to be repeated until there is nothing left to match. Each time the group is applied, it captures a 4-digit hex number that is placed in the CaptureCollection object.

string hex = "00AA001CFF0C";
string hexPatt =  @"^(?<hex4>[a-fA-F\d]{4})*";
Match hexMatch = Regex.Match(hex,hexPatt);
Console.WriteLine(hexMatch.Value); // --> 00AA001CFF0C
CaptureCollection cc = hexMatch.Groups["hex4"].Captures;
foreach (Capture c in cc)
   Console.Write(c.Value + " "); // --> 00AA 001C FF0C

Figure 5-5 shows the hierarchical relationship among the Match, GroupCollection, and CaptureCollection classes.

Figure 5-5. Hex numbers captured by regular expression

System.IO: Classes to Read and Write Streams of Data

The System.IO namespace contains the primary classes used to move and process streams of data. The data source may be in the form of text strings, as discussed in this chapter, or raw bytes of data coming from a network or device on an I/O port. Classes derived from the Stream class work with raw bytes; those derived from the TextReader and TextWriter classes operate with characters and text strings (see Figure 5-6). We'll begin the discussion with the Stream class and look at how its derived classes are used to manipulate byte streams of data. Then, we'll examine how data in a more structured text format is handled using the TextReader and TextWriter classes.

Figure 5-6. Selected System.IO classes

The Stream Class

This class defines the generic members for working with raw byte streams. Its purpose is to abstract data into a stream of bytes independent of any underlying data devices. This frees the programmer to focus on the data stream rather than device characteristics. The class members support three fundamental areas of operation: reading, writing, and seeking (identifying the current byte position within a stream). Table 5-10 summarizes some of its important members. Not included are methods for asynchronous I/O, a topic covered in Chapter 13, “Asynchronous Programming and Multithreading.”

Table 5-10. Selected Stream Members

Member

Description

CanRead
CanSeek
CanWrite

Indicates whether the stream supports reading, seeking, or writing.

Length

Length of stream in bytes; returns long type.

Position

Gets or sets the position within the current stream; has long type.

Close()

Closes the current stream and releases resources associated with it.

Flush()

Flushes data in buffers to the underlying device—for example, a file.

Read(byte array, offset, count)
ReadByte()

Reads a sequence of bytes from the stream and advances the position within the stream by the number of bytes read. ReadByte reads one byte. Read returns the number of bytes read (0 at the end of the stream); ReadByte returns –1 if at end of the stream.

SetLength()

Sets the length of the current stream. It can be used to extend or truncate a stream.

Seek()

Sets the position within the current stream.

Write(byte array, offset, count)
WriteByte()

Writes a sequence of bytes (Write) or one byte (WriteByte) to the current stream. Neither has a return value.

These methods and properties provide the bulk of the functionality for the FileStream, MemoryStream, and BufferedStream classes, which we examine next.

FileStreams

A FileStream object is created to process a stream of bytes associated with a backing store—a term used to refer to any storage medium such as disk or memory. The following code segment demonstrates how it is used for reading and writing bytes:

try
{
   // Create FileStream object
   FileStream fs = new FileStream(@"c:\artists\log.txt",
         FileMode.OpenOrCreate, FileAccess.ReadWrite);
   byte[] alpha = new byte[6] {65,66,67,68,69,70}; //ABCDEF
   // Write array of bytes to a file
   // Equivalent to: fs.Write(alpha,0, alpha.Length);
   foreach (byte b in alpha) {
      fs.WriteByte(b);}
   // Read bytes from file
   fs.Position = 0;         // Move to beginning of file
   for (int i = 0; i< fs.Length; i++)
      Console.Write((char) fs.ReadByte()); //ABCDEF
   fs.Close();
}
catch(Exception ex)
{
   Console.Write(ex.Message);
}

As this example illustrates, a stream is essentially a byte array with an internal pointer that marks a current location in the stream. The ReadByte and WriteByte methods process stream bytes in sequence. The Position property moves the internal pointer to any position in the stream. By opening the FileStream for ReadWrite, the program can intermix reading and writing without closing the file.
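The "internal pointer" idea can be made concrete with a short sketch that writes a block of bytes, moves back into the middle of the stream, and overwrites a single byte. The class and method names are hypothetical, and the example assumes a scratch file obtained from Path.GetTempFileName rather than the log file used above. The using statement closes the stream even if an exception occurs:

```csharp
using System;
using System.IO;
using System.Text;

public static class FileStreamDemo
{
    // Writes ABCDEF, overwrites the third byte, and returns the file content
    public static string WriteSeekRead(string path)
    {
        using (FileStream fs = new FileStream(path, FileMode.Create,
                                              FileAccess.ReadWrite))
        {
            byte[] alpha = { 65, 66, 67, 68, 69, 70 };   // ABCDEF
            fs.Write(alpha, 0, alpha.Length);
            fs.Seek(2, SeekOrigin.Begin);   // Same effect as fs.Position = 2
            fs.WriteByte((byte)'x');        // Overwrites 'C'; pointer is now 3
            fs.Position = 0;                // Back to the beginning
            StringBuilder sb = new StringBuilder();
            int b;
            while ((b = fs.ReadByte()) != -1)
                sb.Append((char)b);
            return sb.ToString();
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();     // Scratch file (assumption)
        Console.WriteLine(WriteSeekRead(path));   // ABxDEF
        File.Delete(path);
    }
}
```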

Creating a FileStream

The FileStream class has several constructors. The most useful ones accept the path of the file being associated with the object and optional parameters that define file mode, access rights, and sharing rights. The possible values for these parameters are shown in Figure 5-7.

Figure 5-7. Options for FileStream constructors

The FileMode enumeration designates how the operating system is to open the file and where to position the file pointer for subsequent reading or writing. Table 5-11 is worth noting because you will see the enumeration used by several classes in the System.IO namespace.

Table 5-11. FileMode Enumeration Values

Value

Description

Append

Opens an existing file or creates a new one. Writing begins at the end of the file.

Create

Creates a new file. An existing file is overwritten.

CreateNew

Creates a new file. An exception is thrown if the file already exists.

Open

Opens an existing file.

OpenOrCreate

Opens a file if it exists; otherwise, creates a new one.

Truncate

Opens an existing file, removes its contents, and positions the file pointer to the beginning of the file.

The FileAccess enumeration defines how the current FileStream may access the file; FileShare defines how file streams in other processes may access it. For example, FileShare.Read permits multiple file streams to be created that can simultaneously read the same file.
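The following sketch exercises three of these options. It is only an illustration under stated assumptions: the class and method names are hypothetical, the file is a scratch file, and File.WriteAllText is assumed to store the three characters "abc" as three bytes (its default UTF-8 encoding without a byte order mark):

```csharp
using System;
using System.IO;

public static class FileModeDemo
{
    // FileMode.CreateNew throws an IOException when the file already exists
    public static bool CreateNewFailsOnExisting(string path)
    {
        File.WriteAllText(path, "abc");
        try
        {
            using (new FileStream(path, FileMode.CreateNew)) { }
            return false;
        }
        catch (IOException)
        {
            return true;
        }
    }

    // FileMode.Append positions the file pointer at the end of the file
    public static long AppendStartPosition(string path)
    {
        File.WriteAllText(path, "abc");   // 3 bytes
        using (FileStream fs = new FileStream(path, FileMode.Append,
                                              FileAccess.Write))
            return fs.Position;
    }

    // FileShare.Read lets a second stream read the same file simultaneously
    public static bool TwoReaders(string path)
    {
        using (FileStream a = new FileStream(path, FileMode.Open,
                                 FileAccess.Read, FileShare.Read))
        using (FileStream b = new FileStream(path, FileMode.Open,
                                 FileAccess.Read, FileShare.Read))
            return a.CanRead && b.CanRead;
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();   // Scratch file (assumption)
        Console.WriteLine(CreateNewFailsOnExisting(path));  // True
        Console.WriteLine(AppendStartPosition(path));       // 3
        Console.WriteLine(TwoReaders(path));                // True
        File.Delete(path);
    }
}
```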

MemoryStreams

As the name suggests, this class is used to stream bytes to and from memory as a substitute for a temporary external physical store. To demonstrate, here is an example that copies a file. It reads the original file into a memory stream and then writes this to a FileStream using the WriteTo method:

FileStream fsIn = new FileStream(@"c:\manet.bmp",
              FileMode.Open, FileAccess.Read);
FileStream fsOut = new FileStream(@"c:\manetcopy.bmp",
              FileMode.OpenOrCreate, FileAccess.Write);
MemoryStream ms = new MemoryStream();
// Input image byte-by-byte and store in memory stream
int imgByte;
while ((imgByte = fsIn.ReadByte())!=-1){
   ms.WriteByte((byte)imgByte);
}
ms.WriteTo(fsOut);              // Copy image from memory to disk
byte[] imgArray = ms.ToArray(); // Convert to array of bytes
fsIn.Close();
fsOut.Close();
ms.Close();
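Reading one byte at a time keeps the example simple, but the Read(byte array, offset, count) method listed in Table 5-10 moves data in larger pieces. Here is a minimal sketch of a chunked copy between any two streams. The class name, method name, and 4096-byte buffer size are arbitrary choices for this illustration, and two MemoryStream objects stand in for the files:

```csharp
using System;
using System.IO;

public static class ChunkCopyDemo
{
    // Copies input to output in blocks; returns the number of bytes copied
    public static long CopyInChunks(Stream input, Stream output)
    {
        byte[] buffer = new byte[4096];
        long total = 0;
        int n;
        // Read returns the number of bytes actually read; 0 means end of stream
        while ((n = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            output.Write(buffer, 0, n);   // Write only the bytes that were read
            total += n;
        }
        return total;
    }

    public static void Main()
    {
        MemoryStream source = new MemoryStream(new byte[10000]);
        MemoryStream target = new MemoryStream();
        Console.WriteLine(CopyInChunks(source, target));  // 10000
        Console.WriteLine(target.Length);                 // 10000
    }
}
```

Later versions of the framework (.NET 4 and up) also provide a Stream.CopyTo method that performs this kind of copy in a single call.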

BufferedStreams

One way to improve I/O performance is to limit the number of reads and writes to an external device—particularly when small amounts of data are involved. Buffers have long offered a solution for collecting small amounts of data into larger amounts that could then be sent more efficiently to a device. The BufferedStream object contains a buffer that performs this role for an underlying stream. You create the object by passing an existing stream object to its constructor. The BufferedStream then performs the I/O operations, and when the buffer is full or closed, its contents are flushed to the underlying stream. By default, the BufferedStream maintains a buffer size of 4096 bytes, but passing a size parameter to the constructor can change this.

Buffers are commonly used to improve performance when reading bytes from an I/O port or network. Here is an example that associates a BufferedStream with an underlying FileStream. The heart of the code consists of a loop in which FillBytes (simulating an I/O device) is called to return an array of bytes. These bytes are written to a buffer rather than directly to the file. When fileBuffer is closed, any remaining bytes are flushed to the FileStream fsOut1. A write operation to the physical device then occurs.

private void SaveStream() {
   Stream fsOut1 = new FileStream(@"c:\captured.txt",
      FileMode.OpenOrCreate, FileAccess.Write);
   BufferedStream fileBuffer = new BufferedStream(fsOut1);
   byte[] buff;         // Array to hold bytes written to buffer
   bool readMore=true;
   while(readMore) {
      buff = FillBytes();         // Get array of bytes
      for (int j = 0;j<buff[16];j++){
        fileBuffer.WriteByte(buff[j]);   // Store bytes in buffer
      }
      if(buff[16]< 16) readMore=false;   // Indicates no more data
   }
   fileBuffer.Close();  // Flushes all remaining buffer content
   fsOut1.Close();      // Must close after bufferedstream
}
// Method to simulate I/O device receiving data
private static byte[] FillBytes() {
   Random rand = new Random();
   byte[] r = new Byte[17];
   // Store random numbers to return in array
   for (int j=0;j<16;j++) {
      r[j]= (byte) rand.Next();
      if(r[j]==171)        // Arbitrary end of stream value
      {
         r[16]=(byte)(j);  // Number of bytes in array
         return r;
      }
   }
   r[16] = 16;                          // Array is full: 16 bytes of data
   System.Threading.Thread.Sleep(500);  // Delay 500ms
   return r;
}

Using StreamReader and StreamWriter to Read and Write Lines of Text

Unlike the Stream derived classes, StreamWriter and StreamReader are designed to work with text rather than raw bytes. The abstract TextWriter and TextReader classes from which they derive define methods for reading and writing text as lines of characters. Keep in mind that these methods rely on a FileStream object underneath to perform the actual data transfer.

Writing to a Text File

StreamWriter writes text using its Write and WriteLine methods. Note their differences:

  • WriteLine writes a string, or the textual representation of a basic data type, and automatically appends a newline (carriage return/linefeed).

  • Write does not append a newline character. Like WriteLine, it can write strings as well as the textual representation of any basic data type (int32, single, and so on) to the text stream.

The StreamWriter object is created using one of several constructors:

Syntax (partial list):

public StreamWriter(string path)
public StreamWriter(Stream s)
public StreamWriter(string path, bool append)
public StreamWriter(string path, bool append, Encoding encoding)

Parameters:

path

Path and name of file to be opened.

s

Previously created Stream object—typically a FileStream.

append

Set to true to append data to file; false overwrites.

encoding

Specifies how characters are encoded as they are written to a file. The default is UTF-8 (UCS Transformation Format) that stores characters in the minimum number of bytes required.

This example creates a StreamWriter object from a FileStream and writes two lines of text to the associated file:

string filePath = @"c:\cup.txt";
// Could use: StreamWriter sw = new StreamWriter(filePath);
// Use FileStream to create StreamWriter
FileStream fs = new FileStream(filePath, FileMode.OpenOrCreate,
                FileAccess.ReadWrite);
StreamWriter sw2 = new StreamWriter(fs);
// Now that it is created, write to the file
sw2.WriteLine("The world is a cup");
sw2.WriteLine("brimming\nwith water.");
sw2.Close();  // Free resources
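The constructor that takes an append flag and an encoding can be sketched as follows. The class name, method name, and sample text are hypothetical, and the file is assumed to be a scratch file. The first writer overwrites the file; the second appends to it, mixing Write and WriteLine:

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamWriterDemo
{
    // Overwrites the file with one line, then appends a second line
    public static string[] WriteThenAppend(string path)
    {
        // append = false: existing content is overwritten
        using (StreamWriter sw = new StreamWriter(path, false, Encoding.UTF8))
            sw.WriteLine("The world is a cup");

        // append = true: writing begins at the end of the file
        using (StreamWriter sw = new StreamWriter(path, true, Encoding.UTF8))
        {
            sw.Write(42);             // Textual representation of an int
            sw.WriteLine(" drops");   // Completes the line
        }
        return File.ReadAllLines(path);
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();   // Scratch file (assumption)
        foreach (string line in WriteThenAppend(path))
            Console.WriteLine(line);
        // The world is a cup
        // 42 drops
        File.Delete(path);
    }
}
```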

Reading from a Text File

A StreamReader object is used to read text from a file. Much like StreamWriter, an instance of it can be created from an underlying Stream object, and it can include an encoding specification parameter. When it is created, it has several methods for reading and viewing character data (see Table 5-12).

Table 5-12. Selected StreamReader Methods

Member

Description

Peek()

Returns the next available character without moving the position of the reader. Returns an int value of the character or –1 if none exists.

Read()
Read(char[] buff, int ndx,
    int count)

Reads next character (Read()) from a stream or reads next count characters into a character array beginning at ndx.

ReadLine()

Returns a string comprising one line of text.

ReadToEnd()

Reads all characters from the current position to the end of the TextReader. Useful for downloading a small text file over a network stream.

This code creates a StreamReader object by passing an explicit FileStream object to the constructor. The FileStream is used later to reposition the reader to the beginning of the file.

String path= @"c:\cup.txt";
if(File.Exists(path))
{
   FileStream fs = new FileStream(path,
         FileMode.OpenOrCreate, FileAccess.ReadWrite);
   StreamReader reader = new StreamReader(fs);
   // or StreamReader reader = new StreamReader(path);
   // (1) Read first line
   string line = reader.ReadLine();
   // (2) Read four characters on next line
   char[] buff  = new char[4];
   int count = reader.Read(buff,0,buff.Length);
   // (3) Read to end of file
   string cup = reader.ReadToEnd();
   // (4) Reposition to beginning of file
   //     Could also use reader.BaseStream.Position = 0;
   fs.Position = 0;
   reader.DiscardBufferedData();  // Clear any text the reader has buffered
   // (5) Read from first line to end of file
   line = null;
   while ((line = reader.ReadLine()) != null){
      Console.WriteLine(line);
   }
   reader.Close();
}

Core Note

A StreamReader has an underlying FileStream even if it is not created with an explicit one. It is accessed by the BaseStream property and can be used to reposition the reader within the stream using its Seek method. This example moves the reader to the beginning of a file:

reader.BaseStream.Seek(0, SeekOrigin.Begin);
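One detail to keep in mind when repositioning: a StreamReader reads ahead and keeps text in its own buffer, so moving the base stream alone does not affect characters that have already been buffered. Calling the reader's DiscardBufferedData method after the seek clears that buffer. The following sketch shows the combination; the class and method names are hypothetical, and a MemoryStream stands in for the file:

```csharp
using System;
using System.IO;
using System.Text;

public static class RewindDemo
{
    // Reads the first line, rewinds, and reads the first line again
    public static string ReadFirstLineTwice(Stream s)
    {
        StreamReader reader = new StreamReader(s);
        string first = reader.ReadLine();
        reader.BaseStream.Seek(0, SeekOrigin.Begin);  // Move the base stream
        reader.DiscardBufferedData();                 // Empty the reader's buffer
        string again = reader.ReadLine();
        return first + "|" + again;
    }

    public static void Main()
    {
        byte[] data = Encoding.UTF8.GetBytes("line one\nline two\n");
        Console.WriteLine(ReadFirstLineTwice(new MemoryStream(data)));
        // line one|line one
    }
}
```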

StringWriter and StringReader

These two classes do not require a lot of discussion, because they are so similar in practice to the StreamWriter and StreamReader. The main difference is that these streams are stored in memory, rather than in a file. The following example should be self-explanatory:

StringWriter writer = new StringWriter();
writer.WriteLine("Today I have returned,");
writer.WriteLine("after long months ");
writer.Write("that seemed like centuries");
writer.Write(writer.NewLine);
writer.Close();
// Read String just written from memory
string myString = writer.ToString();
StringReader reader = new StringReader(myString);
string line = null;
while ((line = reader.ReadLine()) !=null) {
   Console.WriteLine(line);
}
reader.Close();

The most interesting aspect of the StringWriter is that it is implemented underneath as a StringBuilder object. In fact, StringWriter has a GetStringBuilder method that can be used to retrieve it:

StringWriter writer = new StringWriter();
writer.WriteLine("Today I have returned,");
// Get underlying StringBuilder
StringBuilder sb = writer.GetStringBuilder();
sb.Append("after long months ");
Console.WriteLine(sb.ToString());
writer.Close();

Core Recommendation

Use the StringWriter and StringBuilder classes to work with large strings in memory. A typical approach is to use the StreamReader.ReadToEnd method to load a text file into memory where it can be written to the StringWriter and manipulated by the StringBuilder.
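A rough sketch of this approach follows. The class name, method name, and the particular edits are hypothetical, and a MemoryStream stands in for the text file:

```csharp
using System;
using System.IO;
using System.Text;

public static class LoadAndEditDemo
{
    // Loads all text from a stream, then edits it in memory
    public static string LoadAndEdit(Stream source)
    {
        string text;
        using (StreamReader reader = new StreamReader(source))
            text = reader.ReadToEnd();             // Load the text into memory

        StringWriter writer = new StringWriter();
        writer.Write(text);                        // Store it in the StringWriter
        StringBuilder sb = writer.GetStringBuilder();
        sb.Replace("months", "years");             // Manipulate via StringBuilder
        sb.Insert(0, "> ");
        return writer.ToString();                  // Reflects the changes to sb
    }

    public static void Main()
    {
        byte[] data = Encoding.UTF8.GetBytes("after long months");
        Console.WriteLine(LoadAndEdit(new MemoryStream(data)));
        // > after long years
    }
}
```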

Encryption with the CryptoStream Class

An advantage of using streams is the ability to layer them to add functionality. We saw earlier how the BufferedStream class performs I/O on top of an underlying FileStream. Another class that can be layered on a base stream is the CryptoStream class that enables data in the underlying stream to be encrypted and decrypted. This section describes how to use this class in conjunction with the StreamWriter and StreamReader classes to read and write encrypted text in a FileStream. Figure 5-8 shows how each class is composed from the underlying class.

Figure 5-8. Layering streams for encryption/decryption

CryptoStream is located in the System.Security.Cryptography namespace. It is quite simple to use, requiring only a couple of lines of code to apply it to a stream. The .NET Framework provides multiple cryptography algorithms that can be used with this class. Later, you may want to investigate the merits of these algorithms, but for now, our interest is in how to use them with the CryptoStream class.

Two techniques are used to encrypt data: asymmetric (or public key) and symmetric (or private key). Public key encryption is referred to as asymmetric because a public key is used to encrypt data, while a different private key is used to decrypt it. Symmetric encryption uses the same private key for both purposes. In our example, we are going to use a private key algorithm. The .NET Framework Class Library contains four classes that implement symmetric algorithms:

  • DESCryptoServiceProvider. Data Encryption Standard (DES) algorithm

  • RC2CryptoServiceProvider. RC2 algorithm

  • RijndaelManaged. Rijndael algorithm

  • TripleDESCryptoServiceProvider. TripleDES algorithm

We use the DES algorithm in our example, but we could have chosen any of the others because implementation details are identical. First, an instance of the class is created. Then, its key and IV (Initialization Vector) properties are set to the same key value. DES requires these to be 8 bytes; other algorithms require different lengths. Of course, the key is used to encrypt and decrypt data. The IV ensures that repeated text is not encrypted identically. After the DES object is created, it is passed as an argument to the constructor of the CryptoStream class. The CryptoStream object simply treats the object encapsulating the algorithm as a black box.

The example shown here includes two methods: one to encrypt and write data to a file stream, and the other to decrypt the same data while reading it back. The encryption is performed by WriteEncrypt, which receives a FileStream object parameter encapsulating the output file and a second parameter containing the message to be encrypted; ReadEncrypt receives a FileStream representing the file to be read.

FileStream fs = new FileStream(@"C:\test.txt", FileMode.Create,
                               FileAccess.Write);
MyApp.WriteEncrypt(fs, "Selected site is in Italy.");
fs = new FileStream(@"C:\test.txt", FileMode.Open,
                    FileAccess.Read);
string msg = MyApp.ReadEncrypt(fs);
Console.WriteLine(msg);
fs.Close();

WriteEncrypt encrypts the message and writes it to the file stream using a StreamWriter object that serves as a wrapper for a CryptoStream object. CryptoStream has a lone constructor that accepts the file stream, an object encapsulating the DES algorithm logic, and an enumeration specifying its mode.

// Encrypt FileStream
private static void WriteEncrypt(FileStream fs, string msg) {
   // (1) Create Data Encryption Standard (DES) object
   DESCryptoServiceProvider crypt = new
         DESCryptoServiceProvider();
   // (2) Create a key and Initialization Vector –
   // requires 8 bytes
   crypt.Key = new byte[] {71,72,83,84,85,96,97,78};
   crypt.IV  = new byte[] {71,72,83,84,85,96,97,78};
   // (3) Create CryptoStream stream object
   CryptoStream cs = new CryptoStream(fs,
      crypt.CreateEncryptor(),CryptoStreamMode.Write);
   // (4) Create StreamWriter using CryptoStream
   StreamWriter sw = new StreamWriter(cs);
   sw.Write(msg);
   sw.Close();
   cs.Close();
}

ReadEncrypt reverses the actions of WriteEncrypt. It decodes the data in the file stream and returns the data as a string object. To do this, it layers a CryptoStream stream on top of the FileStream to perform decryption. It then creates a StreamReader from the CryptoStream stream that actually reads the data from the stream.

// Read and decrypt a file stream.
private static string ReadEncrypt(FileStream fs) {
   // (1) Create Data Encryption Standard (DES) object
   DESCryptoServiceProvider crypt =
         new DESCryptoServiceProvider();
   // (2) Create a key and Initialization Vector
   crypt.Key = new byte[] {71,72,83,84,85,96,97,78};
   crypt.IV  = new byte[] {71,72,83,84,85,96,97,78};
   // (3) Create CryptoStream stream object
   CryptoStream cs = new CryptoStream(fs,
         crypt.CreateDecryptor(),CryptoStreamMode.Read);
   // (4) Create StreamReader using CryptoStream
   StreamReader sr = new StreamReader(cs);
   string msg = sr.ReadToEnd();
   sr.Close();
   cs.Close();
   return msg;
}
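Because CryptoStream treats the algorithm as a black box, the same layering works with other symmetric algorithms and other base streams. The following sketch is a variation on the two methods above, not a restatement of them: it swaps in the Aes class (available in later versions of the framework) for DES, uses a MemoryStream instead of a FileStream, and lets the algorithm object generate a random key and IV. The class and method names are hypothetical:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class CryptoRoundTrip
{
    // Layers StreamWriter -> CryptoStream -> MemoryStream to encrypt text
    public static byte[] Encrypt(string msg, byte[] key, byte[] iv)
    {
        using (Aes aes = Aes.Create())
        using (MemoryStream ms = new MemoryStream())
        {
            using (CryptoStream cs = new CryptoStream(ms,
                       aes.CreateEncryptor(key, iv), CryptoStreamMode.Write))
            using (StreamWriter sw = new StreamWriter(cs))
                sw.Write(msg);
            // Closing the writer flushed the final block to ms
            return ms.ToArray();
        }
    }

    // Layers StreamReader -> CryptoStream -> MemoryStream to decrypt text
    public static string Decrypt(byte[] data, byte[] key, byte[] iv)
    {
        using (Aes aes = Aes.Create())
        using (MemoryStream ms = new MemoryStream(data))
        using (CryptoStream cs = new CryptoStream(ms,
                   aes.CreateDecryptor(key, iv), CryptoStreamMode.Read))
        using (StreamReader sr = new StreamReader(cs))
            return sr.ReadToEnd();
    }

    public static void Main()
    {
        byte[] key, iv;
        using (Aes aes = Aes.Create())   // Generates a random key and IV
        {
            key = aes.Key;
            iv = aes.IV;
        }
        byte[] secret = Encrypt("Selected site is in Italy.", key, iv);
        Console.WriteLine(Decrypt(secret, key, iv));
        // Selected site is in Italy.
    }
}
```

The same key and IV must be supplied for decryption, so a real application needs a way to store or exchange them; that concern is outside the scope of this sketch.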

System.IO: Directories and Files

The System.IO namespace includes a set of system-related classes that are used to manage files and directories. Figure 5-9 shows a hierarchy of the most useful classes. Directory and DirectoryInfo contain members to create, delete, and query directories. The only significant difference in the two is that you use Directory with static methods, whereas a DirectoryInfo object must be created to use instance methods. In a parallel manner, File and FileInfo provide static and instance methods for working with files.

Figure 5-9. Directory and File classes in the System.IO namespace

FileSystemInfo

The FileSystemInfo class is a base class for DirectoryInfo and FileInfo. It defines a range of members that are used primarily to provide information about a file or directory. The abstract FileSystemInfo class takes advantage of the fact that files and directories share common features. Its properties include CreationTime, LastAccessTime, LastWriteTime, Name, and FullName. It also includes two important methods: Delete to delete a file or directory and Refresh that updates the latest file and directory information.

Here is a quick look at some of the FileSystemInfo members using DirectoryInfo and FileInfo objects. Note the use of the Refresh method before checking the directory and file attributes.

// DirectoryInfo
string dir  = @"c:\artists";
DirectoryInfo di = new DirectoryInfo(dir);
di.Refresh();
DateTime IODate = di.CreationTime;
Console.WriteLine("{0:d}",IODate);          // 10/9/2001
// FileInfo
string file = @"C:\artists\manet.jpg";
FileInfo fi = new FileInfo(file);
if (fi.Exists) {
   fi.Refresh();
   IODate = fi.CreationTime;
   Console.WriteLine("{0:d}",IODate);       // 5/15/2004
   Console.WriteLine(fi.Name);              // manet.jpg
   Console.WriteLine(fi.Extension);         // .jpg
   FileAttributes attrib = fi.Attributes;
   Console.WriteLine((int) attrib);         // 32
   Console.WriteLine(attrib);               // Archive
}
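Because DirectoryInfo and FileInfo share the FileSystemInfo base class, a single loop can process files and directories together. This sketch uses the DirectoryInfo.GetFileSystemInfos method, which returns both kinds of entries. The class name, method name, and the scratch directory created under the temp path are assumptions for illustration only:

```csharp
using System;
using System.IO;

public static class FileSystemInfoDemo
{
    // Lists every entry in dir as "dir  name" or "file name", sorted
    public static string[] Describe(string dir)
    {
        DirectoryInfo di = new DirectoryInfo(dir);
        FileSystemInfo[] entries = di.GetFileSystemInfos();
        string[] lines = new string[entries.Length];
        for (int i = 0; i < entries.Length; i++)
        {
            FileSystemInfo e = entries[i];       // Base class reference
            string kind = (e is DirectoryInfo) ? "dir " : "file";
            lines[i] = kind + " " + e.Name;      // Name is a base class member
        }
        Array.Sort(lines, StringComparer.Ordinal);
        return lines;
    }

    public static void Main()
    {
        // Scratch directory (assumption), removed at the end
        string root = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(Path.Combine(root, "cubists"));
        File.WriteAllText(Path.Combine(root, "manet.txt"), "");
        foreach (string line in Describe(root))
            Console.WriteLine(line);
        // dir  cubists
        // file manet.txt
        Directory.Delete(root, true);
    }
}
```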

Working with Directories Using the DirectoryInfo, Directory, and Path Classes

When working with directories, you usually have a choice between using the instance methods of DirectoryInfo or the corresponding static methods of Directory. As a rule, if you are going to refer to a directory in several operations, use an instance of DirectoryInfo. Table 5-13 provides a comparison summary of the available methods.

Table 5-13. Comparison of Selected DirectoryInfo and Directory Members

DirectoryInfo Member | Description | Directory Member | Description
--- | --- | --- | ---
Create(), CreateSubdirectory() | Create a directory or subdirectory. | CreateDirectory() | Pass the string path to the method. Failure results in an exception.
Delete() | Delete a directory. | Delete(string), Delete(string, bool) | First version deletes an empty directory. Second version deletes a directory and all subdirectories if boolean value is true.
GetDirectories() | Returns an array of DirectoryInfo type containing all subdirectories in the current directory. | GetDirectories(string), GetDirectories(string, string filter) | Returns a string array containing the names of directories in the path. A filter may be used to specify directory names.
GetFiles() | Returns an array of FileInfo types containing all files in the directory. | GetFiles(string), GetFiles(string, filter) | Returns string array of files in directory. A filter may be used to match against file names. The filter may contain wildcard characters ? or * to match a single character or zero or more characters.
Parent | Retrieves parent directory of current path. | GetParent() | Retrieves parent directory of specified path.
N/A | | GetLogicalDrives() | Returns a string array containing the logical drives on the system. Format: <drive>:\

Let's look at some examples using both static and instance methods to manipulate and list directory members. The sample code assumes the directory structure shown in Figure 5-10.

Figure 5-10. Directory structure used in Directory examples

Create a Subdirectory

This code adds a subdirectory named cubists below expressionists:

// Directory static method to create directory
string newPath =
      @"c:\artists\expressionists\cubists";
if (!Directory.Exists(newPath))
   Directory.CreateDirectory(newPath);

// DirectoryInfo: CreateSubdirectory takes a path relative to di
string curPath= @"c:\artists\expressionists";
DirectoryInfo di = new DirectoryInfo(curPath);
if (di.Exists) di.CreateSubdirectory("cubists");

Delete a Subdirectory

This code deletes the cubists subdirectory just created:

string newPath = @"c:\artists\expressionists\cubists";
// Directory
if (Directory.Exists(newPath)) Directory.Delete(newPath);
// The following throws an IOException because the directory
// still contains files
try { Directory.Delete(@"c:\artists\expressionists"); }
catch (IOException ex) { Console.WriteLine(ex.Message); }
// The following succeeds because true is passed to the method
Directory.Delete(@"c:\artists\expressionists",true);

// DirectoryInfo
DirectoryInfo di = new DirectoryInfo(newPath);
if (di.Exists) di.Delete();
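
The difference between the two Directory.Delete overloads can be tried safely in a scratch directory. This is a hedged sketch, not part of the chapter's sample code: DeleteDemo, TryDeleteEmpty, and the delete_demo directory are hypothetical names, and the sketch assumes write access to the system temp path.

```csharp
using System;
using System.IO;

static class DeleteDemo
{
    // Attempts a non-recursive delete. Returns false when the delete fails
    // with an IOException, for example because the directory is not empty
    // (a missing directory also raises an IOException subclass).
    public static bool TryDeleteEmpty(string path)
    {
        try
        {
            Directory.Delete(path);      // Succeeds only for an empty directory
            return true;
        }
        catch (IOException)
        {
            return false;
        }
    }

    static void Main()
    {
        // Hypothetical scratch directory holding one file
        string root = Path.Combine(Path.GetTempPath(), "delete_demo");
        Directory.CreateDirectory(root);
        File.WriteAllText(Path.Combine(root, "a.txt"), "x");

        Console.WriteLine(TryDeleteEmpty(root));   // False: directory holds a file
        Directory.Delete(root, true);              // Recursive delete removes the file too
    }
}
```

Catching IOException specifically, rather than Exception, lets unrelated failures such as UnauthorizedAccessException surface to the caller.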

List Directories and Files

This code defines a method that recursively loops through and lists the subdirectories and selected files on the c:\artists path. It uses the static Directory methods GetDirectories and GetFiles. Both of these return arrays of strings.

static readonly int Depth=4; // Specify directory level to search
ShowDir (@"c:\artists", 0); // Call method to list files
// Display directories and files using recursion
public static void ShowDir(string sourceDir, int recursionLvl)
{
   if (recursionLvl<= Depth) // Limit subdirectory search depth
   {
      // Process the list of files found in the directory
      Console.WriteLine(sourceDir);
      foreach( string fileName in
               Directory.GetFiles(sourceDir,"s*.*"))
      {
         Console.WriteLine("  "+Path.GetFileName(fileName));
      }
      // Use recursion to process subdirectories
      foreach(string subDir in
              Directory.GetDirectories(sourceDir))
         ShowDir(subDir,recursionLvl+1);  // Recursive call
   }
}

GetFiles returns full path names. The static Path.GetFileName method is used to extract the file name and extension from each path. For demonstration purposes, a filter has been added to the GetFiles call so that it returns only the paths of files whose names begin with s.

Here is the same operation using the DirectoryInfo class. Its GetDirectories and GetFiles methods behave differently than the Directory versions: They return DirectoryInfo and FileInfo objects rather than strings, and each object's Name property holds the immediate directory or file name, while FullName holds the entire path.

// DirectoryInfo
public static void ShowDir(DirectoryInfo sourceDir,
                           int recursionLvl)
{
   if (recursionLvl<= Depth)  // Limit subdirectory search depth
   {
      // Process the list of files found in the directory
      Console.WriteLine(sourceDir.FullName);
      foreach( FileInfo fileName in
               sourceDir.GetFiles("s*.*"))
         Console.WriteLine("  "+fileName.Name);
      // Use recursion to process subdirectories
      foreach(DirectoryInfo subDir in sourceDir.GetDirectories())
         ShowDir(subDir,recursionLvl+1);  // Recursive call
   }
}

The method is called with two parameters: a DirectoryInfo object that encapsulates the path and an initial depth of 0.

DirectoryInfo dirInfo = new DirectoryInfo(@"c:\artists");
ShowDir(dirInfo, 0);
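
When no depth limit is needed, the recursion can be delegated to the framework. The sketch below uses the Directory.GetFiles overload that takes a SearchOption value, which is not covered in this chapter's examples; ListDemo, FindNames, and the list_demo directory with its file names are hypothetical, and the sketch assumes write access to the system temp path.

```csharp
using System;
using System.IO;

static class ListDemo
{
    // Returns the sorted file names (not full paths) under root that match
    // pattern, searching every subdirectory.
    public static string[] FindNames(string root, string pattern)
    {
        string[] paths = Directory.GetFiles(root, pattern,
                                            SearchOption.AllDirectories);
        string[] names = new string[paths.Length];
        for (int i = 0; i < paths.Length; i++)
            names[i] = Path.GetFileName(paths[i]);
        Array.Sort(names, StringComparer.Ordinal);
        return names;
    }

    static void Main()
    {
        // Hypothetical scratch tree: one file at the top, two in a subdirectory
        string root = Path.Combine(Path.GetTempPath(), "list_demo");
        string sub  = Path.Combine(root, "impressionists");
        Directory.CreateDirectory(sub);
        File.WriteAllText(Path.Combine(root, "seurat.txt"), "");
        File.WriteAllText(Path.Combine(sub, "sisley.txt"), "");
        File.WriteAllText(Path.Combine(sub, "monet.txt"), "");

        foreach (string name in FindNames(root, "s*.*"))
            Console.WriteLine(name);     // seurat.txt, then sisley.txt

        Directory.Delete(root, true);    // Remove the scratch tree
    }
}
```

Unlike ShowDir, this approach offers no way to cap the search depth, and a single inaccessible subdirectory causes the whole call to throw, so the hand-written recursion remains the more controllable choice.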

Using the Path Class to Operate on Path Names

To eliminate the need for developers to create code that manipulates a path string, .NET provides a Path class that consists of static methods designed to operate on a path string. These methods, which spare you from writing Regex patterns for the same job, extract selected parts of a path or return a boolean value indicating whether the path satisfies some criterion. Note that because the format of a path is platform dependent (a Linux path is specified differently than a Windows path), the .NET implementation of this class is tailored to the platform on which it runs.

To illustrate the static Path methods, let's look at the results of applying selected methods to this path:

string fullPath = @"c:\artists\impressionists\monet.htm";

Method | Returns
--- | ---
Path.GetDirectoryName(fullPath) | c:\artists\impressionists
Path.GetExtension(fullPath) | .htm
Path.GetFileName(fullPath) | monet.htm
Path.GetFullPath(fullPath) | c:\artists\impressionists\monet.htm
Path.GetPathRoot(fullPath) | c:\
Path.HasExtension(fullPath) | true
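
The following sketch exercises several Path methods in a way that does not depend on the host platform's separator character. Path.Combine, Path.ChangeExtension, and Path.GetFileNameWithoutExtension do not appear in the list above and are included here as additions; PathDemo and WithExtension are hypothetical names, and the relative path is built only for illustration.

```csharp
using System;
using System.IO;

static class PathDemo
{
    // Returns the same path with its extension replaced by newExt.
    public static string WithExtension(string path, string newExt)
    {
        return Path.ChangeExtension(path, newExt);
    }

    static void Main()
    {
        // Path.Combine inserts the separator used by the current platform
        string fullPath = Path.Combine("artists", "impressionists", "monet.htm");

        Console.WriteLine(Path.GetFileName(fullPath));                 // monet.htm
        Console.WriteLine(Path.GetFileNameWithoutExtension(fullPath)); // monet
        Console.WriteLine(Path.GetExtension(fullPath));                // .htm
        Console.WriteLine(Path.HasExtension(fullPath));                // True
        Console.WriteLine(
            Path.GetFileName(WithExtension(fullPath, ".txt")));        // monet.txt
    }
}
```

None of these calls touches the file system; they operate purely on the string, so the path does not need to exist.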

Working with Files Using the FileInfo and File Classes

The FileInfo and File classes are used for two purposes: to provide descriptive information about a file and to perform basic file operations. The classes include methods to copy, move, and delete files, as well as open files for reading and writing. This short segment uses a FileInfo object to display a file's properties, and the static File.Copy method to copy a file:

string fname= @"c:\artists\impressionists\degas.txt";
// Using the FileInfo class to print file information
FileInfo fi = new FileInfo(fname);  // Create FileInfo object
if (fi.Exists)
{
   Console.Write("Length: {0}\nName: {1}\nDirectory: {2}",
   fi.Length, fi.Name, fi.DirectoryName);
   // output: --> 488  degas.txt  c:\artists\impressionists
}
// Use File class to copy a file to another directory
if (File.Exists(fname))
{
   try
   {
      // Exception is thrown if file exists in target directory
      // (source, destination, overwrite=false)
      File.Copy(fname,@"c:\artists\19thcentury\degas.txt",false);
   }
   catch(Exception ex)
   {
      Console.Write(ex.Message);
   }
}

Using FileInfo and File to Open Files

The File and FileInfo classes offer an alternative to creating FileStream, StreamWriter, and StreamReader objects directly. Table 5-14 summarizes the FileInfo methods used to open a file. The static File methods are identical except that their first parameter is always a string containing the name or path of the file to open.

Table 5-14. Selected FileInfo Methods for Opening a File

Member | Returns | Description
--- | --- | ---
Open(mode), Open(mode,access), Open(mode,access,share) | FileStream | Opens a file with access and sharing privileges. The three overloads take FileMode, FileAccess, and FileShare enumerations.
Create() | FileStream | Creates a file and returns a FileStream object. If the file exists, it is overwritten.
OpenRead() | FileStream | Opens the file in read mode.
OpenWrite() | FileStream | Opens a file in write mode.
AppendText() | StreamWriter | Opens the file in append mode. If file does not exist, it is created. Equivalent to StreamWriter(string, true).
CreateText() | StreamWriter | Opens a file for writing. If the file exists, its contents are overwritten. Equivalent to StreamWriter(string, false).
OpenText() | StreamReader | Opens the file in read mode. Equivalent to StreamReader(string).

The FileInfo.Open method is the generic and most flexible way to open a file:

public FileStream Open(FileMode mode, FileAccess access,
                       FileShare share)

Create, OpenRead, and OpenWrite are specific cases of Open that offer an easy-to-use method that returns a FileStream object and requires no parameters. Similarly, the OpenText, AppendText, and CreateText methods return a StreamReader or StreamWriter object.
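
To make the three-parameter Open overload concrete, here is a minimal sketch that uses one FileInfo object to append to a file and then read it back. OpenDemo, AppendAndRead, and the open_demo.txt file name are hypothetical, the using statements are one idiomatic way to close the streams, and the sketch assumes write access to the system temp path.

```csharp
using System;
using System.IO;

static class OpenDemo
{
    // Appends one line through a FileStream obtained from FileInfo.Open,
    // then reopens the file for reading and returns its full text.
    public static string AppendAndRead(FileInfo fi, string line)
    {
        // FileMode.Append creates the file if needed and requires write access
        using (FileStream fs = fi.Open(FileMode.Append, FileAccess.Write,
                                       FileShare.None))
        using (StreamWriter sw = new StreamWriter(fs))
        {
            sw.WriteLine(line);
        }

        // FileShare.Read lets other readers open the file at the same time
        using (FileStream fs = fi.Open(FileMode.Open, FileAccess.Read,
                                       FileShare.Read))
        using (StreamReader sr = new StreamReader(fs))
        {
            return sr.ReadToEnd();
        }
    }

    static void Main()
    {
        // Hypothetical scratch file under the system temp path
        FileInfo fi = new FileInfo(
            Path.Combine(Path.GetTempPath(), "open_demo.txt"));
        if (fi.Exists) fi.Delete();

        AppendAndRead(fi, "first");
        Console.Write(AppendAndRead(fi, "second"));  // first, then second
        fi.Delete();
    }
}
```

Disposing the StreamWriter or StreamReader also closes the underlying FileStream, so no separate Close calls are needed.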

The decision to create a FileStream (or StreamReader/StreamWriter) using FileInfo or the FileStream constructor should be based on how the underlying file is used in the application. If the file is being opened for a single purpose, such as for input by a StreamReader, creating a FileStream directly is the best approach. If multiple operations are required on the file, a FileInfo object is better. This example illustrates the advantages of using FileInfo. First, it creates a FileStream that is used for writing to a file; then, another FileStream is created to read the file's contents; finally, FileInfo.Delete is used to delete the file.

FileInfo     fi = new FileInfo(@"c:\temp.txt");
FileStream   fs = fi.Create();          // Create file
StreamWriter sw= new StreamWriter(fs);  // Create StreamWriter
sw.Write("With my crossbow\nI shot the Albatross. ");
sw.Close();                   // Close StreamWriter
// Now use fi to create a StreamReader
fs = fi.OpenRead();           // Open for reading
StreamReader sr = new StreamReader(fs);
string l;
while ((l = sr.ReadLine()) != null)
{
   Console.WriteLine(l);      // --> With my crossbow
}                             // --> I shot the Albatross.
sr.Close();
fs.Close();
fi.Delete();                  // Delete temporary file

Summary

The demands of working with text have increased considerably from the days when it meant dealing with 7-bit ASCII or ANSI characters. Today, the Unicode standard defines the representation of more than 90,000 characters comprising the world's alphabets. We've seen that .NET fully embraces this standard with its 16-bit characters. In addition, it supports the concept of localization, which ensures that a machine's local culture information is taken into account when manipulating and representing data strings.

String handling is facilitated by a rich set of methods available through the String and StringBuilder classes. A variety of string comparison methods are available with options to include case and culture in the comparisons. The String.Format method is of particular note with its capacity to display dates and numbers in a wide range of standard and custom formats. String manipulation and concatenation can result in an inefficient use of memory. We saw how the StringBuilder class is an efficient alternative for basic string operations. Applications that require sophisticated pattern matching, parsing, and string extraction can use regular expressions in conjunction with the Regex class.

The System.IO namespace provides a number of classes for reading and writing data: The FileStream class is used to process raw bytes of data; MemoryStream and BufferedStream allow bytes to be written to memory or buffered; the StreamReader and StreamWriter classes support the more traditional line-oriented I/O. Operations related to managing files and directories are available as methods on the File, FileInfo, Directory, and DirectoryInfo classes. These are used to create, copy, delete, and list files and directories.

Test Your Understanding

1:

Which class encapsulates information about a specific culture?

2:

Name two ways to access the default culture within an application.

3:

Match each regular expression:

  1. @"([^\Wa-z0-9_][^\WA-Z0-9_]*)"

  2. @"([^\Wa-z0-9_]+)"

  3. @"([^\WA-Z0-9_]+)"
    

with the function it performs:

  1. Find all capitalized words.

  2. Find all lowercase words.

  3. Find all words with the initial letter capitalized.

4:

When is it more advantageous to use the instance methods of Regex rather than the static ones?

5:

Which string comparison method(s) is (are) used to implement this statement:

if (myString == "Monday") sw = true;

6:

Match each statement:

  1. curdt.ToString("D")
    
  2. curdt.ToString("M")
    
  3. curdt.ToString("ddd MMM dd")
    

with its output:

  1. March 9

  2. Tuesday, March 9, 2004

  3. Tue Mar 09

7:

Which of these objects is not created from an existing FileStream?

  1. FileInfo
    
  2. StreamReader
    
  3. BufferedStream
    

8:

You can create a FileStream object with this statement:

FileStream fs = new FileStream(fname,
                   FileMode.OpenOrCreate,
                   FileAccess.Write,FileShare.None);

Which one of the following statements creates an identical FileStream using an existing FileInfo object, fi?

  1. fs = fi.OpenWrite();
    
  2. fs = fi.Create();
    
  3. fs = fi.CreateText();
    

9:

Indicate whether the following comparisons are true or false:

  1. (string.Compare("Alpha","alpha") >0)
    
  2. (string.Compare("Alpha","alpha",true) ==0)
    
  3. (string.CompareOrdinal("Alpha","alpha")>0)
    
  4. (string.Equals("alpha","Alpha"))
    


[1] Unicode Consortium—www.unicode.org.

[2] Microsoft Windows users can set formats using the Control Panel – Regional Options settings.
