Chapter 7. Strings and Regular Expressions

Java and C# both recognize that string handling is required by almost every application. Each provides language-level support that makes working with strings simple, along with a rich API for advanced string manipulation. The basic capabilities of strings and the C# syntax used to manipulate them are covered in Chapter 4. This chapter focuses on the Microsoft .NET System.String class and the functionality available in the Microsoft .NET class library for working with strings.

Java developers will find support in .NET for all the string-handling mechanisms they are accustomed to in Java. This includes the more advanced features such as byte encoding, text formatting, and regular expressions. While there are many implementation differences, a Java developer will have little trouble adapting to the .NET methodology.

Strings

The .NET System.String class is equivalent to the java.lang.String class. Both classes offer predominantly the same functionality, although the implementation specifics present significant differences in how common tasks are accomplished. This section discusses and contrasts the functionality of both String classes in the context of describing everyday string manipulations.

As pointed out in Chapter 4, the C# keyword string is an alias for the .NET System.String class; we’ll use both conventions in the examples throughout this chapter.

Creating Strings

Both Java and C# support the use of assignment operators to create new instances of string objects from literals and other string instances, as illustrated in this fragment of C# code:

String someString = "This is a string.";
String someOtherString = someString;

Notice that the .NET String class doesn’t provide constructor equivalents of these assignment statements.

The only other difference of note is the support .NET provides for creating a string given a pointer to a char or byte array. Because they use pointers, these constructors are unsafe. See the Unsafe Code section in Chapter 6 for details on unsafe code.

Table 7-1 summarizes the Java and .NET string constructors.

Table 7-1. Creating Strings in Java and .NET

Java | .NET | Comments
--- | --- | ---
new String() | String.Copy("") | The .NET String class doesn’t provide empty constructor support. The static String.Copy method can be used, although it’s easier to use direct assignment of an empty string.
new String(String) | String.Copy(String) | The .NET String class doesn’t provide a constructor that takes a String as an argument. The static String.Copy method can be used, although it’s easier to use direct assignment.
new String(byte[]); new String(byte[], int, int) | new String(sbyte*); new String(sbyte*, int, int) | Takes a pointer to a byte array. These constructors are unsafe because they use pointers.
new String(byte[], charset); new String(byte[], int, int, charset) | new String(sbyte*, int, int, Encoding) | Reads from byte array using the specified encoding scheme. This constructor is unsafe because it uses pointers.
new String(char[]); new String(char[], int, int) | new String(char[]); new String(char[], int, int); new String(char*); new String(char*, int, int) | Supports both a char array and a pointer to a char array. The two constructors that take pointers to char arrays are unsafe.
new String(StringBuffer) | StringBuilder.ToString() |
N/A | new String(char, int) | Creates a string that contains a char repeated int times.
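
As a minimal sketch of the safe constructors listed in Table 7-1, the following C# code builds strings from a char array, from part of a char array, and from a repeated character. The class name CreatingStringsSketch and its helper methods are hypothetical names introduced only for this illustration.

```csharp
using System;

// Hypothetical helper class; only the String constructors come from the .NET class library.
public static class CreatingStringsSketch
{
    // Builds a string from a whole char array.
    public static string FromChars(char[] chars)
    {
        return new String(chars);
    }

    // Builds a string from 'length' characters, starting at index 'start'.
    public static string FromChars(char[] chars, int start, int length)
    {
        return new String(chars, start, length);
    }

    // Builds a string made of one character repeated 'count' times.
    public static string Repeat(char c, int count)
    {
        return new String(c, count);
    }

    public static void Main()
    {
        char[] letters = { 'd', 'o', 't', 'n', 'e', 't' };
        Console.WriteLine(FromChars(letters));        // dotnet
        Console.WriteLine(FromChars(letters, 3, 3));  // net
        Console.WriteLine(Repeat('-', 5));            // -----
    }
}
```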

Comparing Strings

Java does not support comparing string contents using operators: the == and != operators in Java compare object references, not the contents of the strings. Because C# supports operator overloading, the == and != operators have been implemented to compare string content. Both operators are backed by the String.Equals method. The following C# statements demonstrate the use of the overloaded == operator to compare a string variable with a string literal:

if (someString == "This is a string.") {
    // some statements
}
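
A slightly fuller sketch, under the assumption that two distinct string objects with the same content are being compared, contrasts content comparison with reference comparison. It also uses the static String.Compare overload that takes an ignoreCase flag and the ordinal String.CompareOrdinal method. ComparingStringsSketch and its helpers are hypothetical names used only here.

```csharp
using System;

// Hypothetical helper class illustrating content versus reference comparison.
public static class ComparingStringsSketch
{
    // True when both strings hold the same characters (overloaded == operator).
    public static bool SameContent(string a, string b)
    {
        return a == b;
    }

    // True when the strings match once case is ignored.
    public static bool SameIgnoringCase(string a, string b)
    {
        return String.Compare(a, b, true) == 0;
    }

    // True only when both variables refer to the very same object.
    public static bool SameInstance(string a, string b)
    {
        return Object.ReferenceEquals(a, b);
    }

    public static void Main()
    {
        string first = "This is a string.";
        // Built at run time, so it is a distinct object with identical content.
        string second = new String(first.ToCharArray());

        Console.WriteLine(SameContent(first, second));           // True
        Console.WriteLine(String.Equals(first, second));         // True
        Console.WriteLine(SameInstance(first, second));          // False
        Console.WriteLine(SameIgnoringCase("ABC", "abc"));       // True
        Console.WriteLine(String.CompareOrdinal("a", "A") > 0);  // True
    }
}
```

The last line shows that an ordinal comparison distinguishes case, because it compares the numeric values of the characters.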

Java and .NET both provide methods for comparing whole or partial strings. These are summarized in Table 7-2.

Table 7-2. String Comparison in Java and .NET

Java | .NET | Comments
--- | --- | ---
N/A | == | String equality operator.
N/A | != | String inequality operator.
equals() | Equals() | .NET also provides an overloaded static Equals method for comparing two strings.
equalsIgnoreCase() | Compare() | The .NET String class provides static, overloaded Compare methods; the overloads that take an ignoreCase argument support the case-insensitive comparison of strings and substrings.
N/A | CompareOrdinal() | The .NET String class provides static, overloaded CompareOrdinal methods for comparing strings and substrings by the numeric values of their characters. The comparison is case-sensitive and ignores localization.
compareTo() | CompareTo() |
compareToIgnoreCase() | Compare() |
contentEquals() | N/A |
endsWith() | EndsWith() |
regionMatches() | Compare() |
startsWith() | StartsWith() |

Copying Strings

Java and C# both provide operator support for copying strings. The = operator is used to assign a string literal or variable to another string variable; .NET also provides the static String.Copy method, which creates a new string instance with the same value as an existing string. For example, the following two C# statements both yield a variable that holds the same string value:

string SomeString = "Some string value.";
string SomeOtherString = string.Copy("Some string value.");

String Length

The length of a Java String is obtained by calling the String.length method; .NET provides the read-only property String.Length.

String Concatenation

Like Java, C# provides both operator and method support for concatenating strings. The support in C# is the same as in Java and allows the use of both the + and += operators. Method-based concatenation support in .NET is more extensive than that provided in Java. Java provides the String.concat method, which is equivalent to the operator syntax described earlier. .NET provides a set of static, overloaded String.Concat methods that concatenate a variable number of strings and string representations of objects into a single string.

The static String.Join methods provide a mechanism to concatenate an array of strings into a single string separating each component with a configurable string token. This is useful for creating strings that represent hierarchical information such as IP addresses, URLs, or file names. For example:

string[] names = {"www", "microsoft", "com"};
string nameString = System.String.Join(".", names);

The value of nameString will be www.microsoft.com.
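
The following sketch illustrates the method-based concatenation described above: the String.Concat overloads that accept objects call ToString on each argument, and String.Join inserts a separator between array elements. ConcatSketch, Describe, and DottedName are hypothetical names chosen for this example.

```csharp
using System;

// Hypothetical helper class; String.Concat and String.Join are the library calls being shown.
public static class ConcatSketch
{
    // Concatenates a label and a number; the int is converted through ToString.
    public static string Describe(string label, int count)
    {
        return String.Concat(label, ": ", count);
    }

    // Joins the parts with a dot between each pair of neighbors.
    public static string DottedName(params string[] parts)
    {
        return String.Join(".", parts);
    }

    public static void Main()
    {
        Console.WriteLine(Describe("Total", 42));                    // Total: 42
        Console.WriteLine(DottedName("www", "microsoft", "com"));    // www.microsoft.com
        Console.WriteLine(String.Concat("a", "b", "c", "d", "e"));   // abcde
    }
}
```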

Changing Case

Both Java and .NET provide capabilities to convert the case of a string. The equivalent methods are shown in Table 7-3.

Table 7-3. String Case Conversion in Java and .NET

java.lang.String | System.String
--- | ---
toUpperCase() | ToUpper()
toLowerCase() | ToLower()

Working with Characters

Both Java and .NET provide mechanisms to retrieve, remove, and replace the individual characters contained within a string. .NET provides a superset of the Java functionality and provides two particularly useful features:

  • Implementation of an indexer to provide read access to the characters that constitute the string.

  • The String.GetEnumerator method returns an IEnumerator that can be used in a foreach statement to iterate through the characters in a string. This is similar to the java.text.StringCharacterIterator class.

Because strings are immutable, these methods do not modify the string instance, instead returning a new string containing any modifications. If performing many edit operations, it’s more efficient to use the StringBuilder class, discussed later in this chapter. When performing complex or repeated matching operations, regular expressions (discussed later in this chapter) provide a more flexible solution. The methods for character access are summarized in Table 7-4.

Table 7-4. String Character Manipulation in Java and .NET

Java | .NET | Description
--- | --- | ---
charAt() | <string>[key] | Indexer that returns the character at a specific index within the string.
N/A | GetEnumerator() | Returns an IEnumerator that supports iteration across the characters of a string.
getChars() | CopyTo() | Extracts a specified number of characters from a string and places them in a character array.
toCharArray() | ToCharArray() | Copies the characters making up the string to a character array.
indexOf() | IndexOf() | Returns the index of the first occurrence of a specified character.
N/A | IndexOfAny() | Returns the index of the first occurrence of any of the characters contained in a character array.
lastIndexOf() | LastIndexOf() | Returns the index of the last occurrence of a specified character.
N/A | LastIndexOfAny() | Returns the index of the last occurrence of any of the characters contained in a character array.
replace() | Replace() | Replaces all instances of a specified character with another character.
N/A | Remove() | Removes a specified range of characters from a string.
trim() | Trim() | Removes white space or specified characters from the beginning and end of a string.
N/A | TrimEnd() | Removes white space or specified characters from the end of a string.
N/A | TrimStart() | Removes white space or specified characters from the beginning of a string.
N/A | PadLeft() | Right-justifies a string and pads the left side with white space or a specified character to make it the required length.
N/A | PadRight() | Left-justifies a string and pads the right side with white space or a specified character to make it the required length.
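
A short sketch of several members from Table 7-4 follows. It reads characters through the indexer, iterates with foreach, and shows that each editing method returns a new string rather than changing the original. CharacterAccessSketch and CountVowels are hypothetical names introduced for the illustration.

```csharp
using System;

// Hypothetical helper class exercising the character-level members of System.String.
public static class CharacterAccessSketch
{
    // Counts vowels by iterating over the characters of the string with foreach.
    public static int CountVowels(string text)
    {
        int count = 0;
        foreach (char c in text)
        {
            if ("aeiouAEIOU".IndexOf(c) >= 0)
            {
                count++;
            }
        }
        return count;
    }

    public static void Main()
    {
        string padded = "  Hello  ";
        string trimmed = padded.Trim();                                  // "Hello"

        Console.WriteLine(trimmed[0]);                                   // H
        Console.WriteLine(CountVowels(trimmed));                         // 2
        Console.WriteLine(trimmed.IndexOfAny(new char[] { 'l', 'o' }));  // 2
        Console.WriteLine(trimmed.PadLeft(8, '*'));                      // ***Hello
        Console.WriteLine(trimmed.Replace('l', 'L'));                    // HeLLo
        Console.WriteLine(trimmed.Remove(1, 3));                         // Ho
        Console.WriteLine(padded.Length);                                // 9 (the original is unchanged)
    }
}
```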

Working with Substrings

Java and .NET both provide mechanisms for locating and modifying substrings within an existing string. As with the character access just discussed, the use of StringBuilder or regular expressions can provide a more efficient alternative. The substring manipulation methods are summarized in Table 7-5.

Table 7-5. Substring Manipulation in Java and .NET

Java | .NET | Description
--- | --- | ---
replaceAll(); replaceFirst() | Replace() | Replaces all instances of a specified substring with another string. Java methods are actually exposing regular expression functionality. See later in this chapter for details.
N/A | Insert() | Inserts a substring into the string at the specified index.
indexOf() | IndexOf() | Returns the index of the first occurrence of a specified substring within the string.
lastIndexOf() | LastIndexOf() | Returns the index of the last occurrence of a specified substring within the string.
substring() | Substring() | Returns a substring of the string. The Java method takes begin and end indexes, whereas the .NET method takes a start index and an optional length.
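
The sketch below exercises the .NET members from Table 7-5. Because Substring takes a start index and a length, the hypothetical helper JavaStyleSubstring shows one way a Java developer might translate a begin/end call; it is an illustration only, not part of the .NET class library.

```csharp
using System;

// Hypothetical helper class; JavaStyleSubstring is an illustrative adapter, not a library method.
public static class SubstringSketch
{
    // Mimics Java's substring(begin, end) using .NET's Substring(start, length).
    public static string JavaStyleSubstring(string text, int begin, int end)
    {
        return text.Substring(begin, end - begin);
    }

    public static void Main()
    {
        string s = "Hello, world";

        Console.WriteLine(s.Substring(7, 5));                           // world
        Console.WriteLine(JavaStyleSubstring(s, 0, 5));                 // Hello
        Console.WriteLine(s.IndexOf("world", StringComparison.Ordinal)); // 7
        Console.WriteLine(s.Insert(5, "!"));                            // Hello!, world
        Console.WriteLine(s.Replace("world", "there"));                 // Hello, there
    }
}
```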

Splitting Strings

Both the Java and .NET String classes provide methods to split a string based on the occurrence of tokens within a string. The Java String.split method exposes regular expression functionality, discussed later in this chapter. Java also provides the java.util.StringTokenizer as a flexible mechanism to split strings without using regular expressions; .NET has no direct equivalent.

The .NET String.Split method splits a string into component parts based on the location of a definable set of characters, returning an array of components from the source string.
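
As a minimal sketch of String.Split, the following code splits a line on either of two separator characters. SplitSketch and SplitFields are hypothetical names, and the choice of comma and semicolon as separators is an assumption made for the example.

```csharp
using System;

// Hypothetical helper class demonstrating String.Split with a set of separator characters.
public static class SplitSketch
{
    // Splits the line wherever a comma or a semicolon occurs.
    public static string[] SplitFields(string line)
    {
        return line.Split(new char[] { ',', ';' });
    }

    public static void Main()
    {
        string[] fields = SplitFields("red,green;blue");
        Console.WriteLine(fields.Length);   // 3
        foreach (string field in fields)
        {
            Console.WriteLine(field);       // red, then green, then blue
        }
    }
}
```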

Strings as Keys

It’s common to use strings as keys in collections such as dictionaries. The GetHashCode and Equals methods inherited from System.Object are the basis for key comparison. The String class has overridden these two methods to provide appropriate behavior for key comparison. The GetHashCode method returns a hash code based on the entire contents of the string, not the object reference. The Equals method is discussed earlier in this chapter.
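
The following sketch shows the consequence for collections: a key built at run time finds an entry stored under a different string object with the same content. It assumes a runtime that offers the generic Dictionary class; StringKeySketch and LookupByEqualContent are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class; assumes System.Collections.Generic is available.
public static class StringKeySketch
{
    // Stores a value under a literal key, then looks it up with a separately built key.
    public static bool LookupByEqualContent()
    {
        Dictionary<string, int> ages = new Dictionary<string, int>();
        ages["alice"] = 30;

        // A distinct object whose content equals the literal key.
        string key = new String(new char[] { 'a', 'l', 'i', 'c', 'e' });

        bool sameHash = key.GetHashCode() == "alice".GetHashCode();
        return sameHash && ages.ContainsKey(key) && ages[key] == 30;
    }

    public static void Main()
    {
        Console.WriteLine(LookupByEqualContent());   // True
    }
}
```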

Parsing Strings

Parsing is the process used to create date and numeric data types from strings, for example, creating an int from the string "600". The .NET Framework provides a parsing mechanism that is similar to the Java static parseXXX methods in the primitive wrapper classes. Because all .NET value types are objects, there is no need for wrapper classes. In .NET, all of the simple data types (Byte, Int32, Double, and so forth) have static methods named Parse that take a String and return an instance of the appropriate value type. An exception will be thrown if

  • The string to be parsed is null.

  • The string contents cannot be parsed into the target type.

  • The parsed string would exceed the minimum or maximum value of the data type.

The System.DateTime, System.TimeSpan, and System.Enum types also implement Parse methods for parsing strings.

.NET also defines the System.Convert class, which provides an impressive selection of static methods for converting between data types; however, all the methods that convert strings simply call the Parse method of the target type.
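
The three failure cases listed above can be sketched as follows. ParsingSketch and TryDescribe are hypothetical names, the returned labels are invented for the illustration, and the invariant culture is passed explicitly so that the result does not depend on regional settings.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class mapping each Parse outcome to a short label.
public static class ParsingSketch
{
    // Parses 'text' as an Int32 and reports which of the documented outcomes occurred.
    public static string TryDescribe(string text)
    {
        try
        {
            int value = Int32.Parse(text, CultureInfo.InvariantCulture);
            return "ok: " + value;
        }
        catch (ArgumentNullException)
        {
            return "null input";       // the string to be parsed is null
        }
        catch (FormatException)
        {
            return "bad format";       // the contents cannot be parsed as an Int32
        }
        catch (OverflowException)
        {
            return "out of range";     // the value exceeds Int32.MinValue or Int32.MaxValue
        }
    }

    public static void Main()
    {
        Console.WriteLine(TryDescribe("600"));           // ok: 600
        Console.WriteLine(TryDescribe(null));            // null input
        Console.WriteLine(TryDescribe("six hundred"));   // bad format
        Console.WriteLine(TryDescribe("99999999999"));   // out of range
    }
}
```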

Formatting Strings

.NET provides functionality for creating strings that contain formatted representations of other data types. This includes the use of formatting conventions for different regions, cultures, and languages. Java provides similar capabilities through the java.text.MessageFormat class, among others; however, the implementations are sufficiently different that it is easier, and more instructive, just to discuss the .NET implementation on its own.

Formatting functionality is exposed through overloaded methods in a number of classes in which string output is generated. The most commonly used methods are

  • System.String.Format

  • System.Text.StringBuilder.AppendFormat

  • System.Console.WriteLine and System.Console.Write

  • System.IO.TextWriter.WriteLine and System.IO.TextWriter.Write

More Info

Details of the System.Console and System.IO.TextWriter classes are provided in Chapter 10.

Most of these classes provide overloaded methods that take different numbers of parameters. For our discussion, we’ll consider the most general case of the System.Console.WriteLine method, which takes a variable number of arguments through the use of a params parameter. The signature is as follows:

public static void WriteLine(string format, params object[] args);

The format parameter is a string that contains both the standard text to output and embedded format specifiers at locations where formatted data will be inserted.

The args array contains the objects to be formatted and inserted into the resulting string. The method will accept a variable number of arguments through the use of the params keyword. The order of these arguments determines the index they occupy in the object array. The array index of each object is used in the format string to identify where each formatted object should be inserted. The best way to convey how this works is by example. This code

double a = 345678.5678;
uint b = 12000;
byte c = 254;
Console.WriteLine("a = {0}, b = {1}, and c = {2}", a, b, c);
Console.WriteLine("a = {0:c0}, b = {1:n4}, and c = {2,10:x5}",
    a, b, c);

will result in the following output:

a = 345678.5678, b = 12000, and c = 254
a = £345,679, b = 12,000.0000, and c =      000fe

The codes in the braces are called format specifications. Changing the contents of the format specification changes the format of the output. The current cultural and regional settings can affect the formatted output. As the example shows, the currency symbol, thousands separator, and decimal separator are all affected by localization settings.

Format specifications have the general form

{N[,M][:F]}

where:

  • N is the zero-based index, which indicates the formatted argument to insert.

  • M is an optional integer that specifies the minimum width of the inserted element. If M is positive, the formatted value is right-justified; if M is negative, it is left-justified. The value is padded with white space so that it fills the minimum specified width.

  • F is an optional set of formatting codes called a format string. Format strings determine how the data item is to be formatted when rendered to a string value. Different data types support different format strings.

Any object can be passed as a parameter for formatting, although only types that implement the IFormattable interface support format strings. Any type that doesn’t implement IFormattable will always be rendered to a string using the inherited Object.ToString method, irrespective of any format strings provided. Attempts to use invalid format strings on IFormattable objects will cause a FormatException to be thrown.

All of the simple types, as well as System.DateTime, System.TimeSpan, and System.Enum, implement IFormattable. We’ll discuss some of the more common format strings in Chapter 8, but the .NET documentation provides complete coverage.
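
To make the index, width, and format string parts of a format specification concrete, here is a minimal sketch that uses String.Format. FormatSpecSketch and Row are hypothetical names, and the invariant culture is passed explicitly so that the output does not vary with regional settings.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class combining index, alignment, and format string in one call.
public static class FormatSpecSketch
{
    // {0,-6}    left-justifies the name in a field of 6 characters.
    // {1,10:n2} right-justifies the amount in 10 characters with 2 decimal places.
    // {2:x4}    renders the code as at least 4 lowercase hexadecimal digits.
    public static string Row(string name, double amount, int code)
    {
        return String.Format(CultureInfo.InvariantCulture,
            "[{0,-6}|{1,10:n2}|{2:x4}]", name, amount, code);
    }

    public static void Main()
    {
        Console.WriteLine(Row("ab", 1234.5, 254));   // [ab    |  1,234.50|00fe]
    }
}
```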

The IFormattable Interface

Any class or struct that needs to support formatting must implement the IFormattable interface. This interface contains a single member with the following signature:

string ToString(string format, IFormatProvider formatProvider);

The format argument is a string containing the format string. This is the portion of the format specifier that follows the colon and contains instructions on how the object should create a string representation of itself; this will be null if no format string was specified.

The formatProvider argument provides a reference to an IFormatProvider instance. An IFormatProvider contains information about the current system settings, culture, region, and preferences. The IFormattable object being formatted can refer to the IFormatProvider to decide how best to render itself to a string given the current environment. The decision can take into consideration such elements as the appropriate currency symbol to use and how many decimal places are required by a given locale.

By default, formatProvider will be null, which means that the default settings of the current system are to be used. However, it is possible to specify a different value in some of the overloaded string formatting methods.
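
As a sketch of supplying a non-default provider, the code below passes a customized NumberFormatInfo, which implements IFormatProvider, to Double.ToString and to String.Format. FormatProviderSketch, ContinentalStyle, and Render are hypothetical names, and the separator settings are assumptions chosen for the example rather than the settings of any particular culture.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class showing how a format provider changes the rendered output.
public static class FormatProviderSketch
{
    // Builds number-format settings that use '.' for grouping and ',' for decimals.
    public static NumberFormatInfo ContinentalStyle()
    {
        NumberFormatInfo settings = new NumberFormatInfo();
        settings.NumberDecimalSeparator = ",";
        settings.NumberGroupSeparator = ".";
        return settings;
    }

    // Double implements IFormattable, so this calls ToString(format, formatProvider).
    public static string Render(double value, IFormatProvider provider)
    {
        return value.ToString("n2", provider);
    }

    public static void Main()
    {
        Console.WriteLine(Render(1234.5, CultureInfo.InvariantCulture));      // 1,234.50
        Console.WriteLine(Render(1234.5, ContinentalStyle()));                // 1.234,50
        Console.WriteLine(String.Format(ContinentalStyle(), "{0:n2}", 0.5));  // 0,50
    }
}
```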

The following example demonstrates a class named MyNumber that implements the IFormattable interface. The class holds an integer. If used in a formatted string, MyNumber can render itself either to digits or to words depending on the format specifier used. Either w or W signals that words should be output.

using System;
using System.Text;

public class MyNumber : IFormattable {
    private int val;

    public MyNumber(int v) { val = v; }

    public string ToString(string format, IFormatProvider provider) {
        if (format != null && format.ToLower() == "w") {
            switch (val) {
                case 1 : return "one";
                case 2 : return "two";
                case 3 : return "three";
                default: return "unknown";
            }
        } else {
            return val.ToString();
        }
    }

    public static void Main() {

        MyNumber numberOne = new MyNumber(3);
        MyNumber numberTwo = new MyNumber(1);

        Console.WriteLine("The first number is {0} and the second is {1}",
            numberOne,  numberTwo);
        Console.WriteLine("The first number is {0:w} and the second is " +
            "{1:w}", numberOne,  numberTwo);
    }
}

When this example is executed, the following output is produced:

The first number is 3 and the second is 1
The first number is three and the second is one

Encoding Strings

Java and .NET both support Unicode and encode characters internally using the UTF-16 scheme. However, Unicode is not the only character-encoding scheme, nor is UTF-16 the only scheme for encoding Unicode. Support for different character encodings is essential for providing cross-platform support and integration with legacy systems.

Java has provided support for character encoding since version 1.1; however, these capabilities were accessible only through other classes such as java.lang.String and java.io.InputStreamReader. The model introduced in Java version 1.4 offers greater flexibility, providing direct access to the underlying encoding mechanisms. The character encoding functionality in Java version 1.4 and .NET is predominantly the same. The most significant difference is how encoding classes are instantiated.

The java.nio.charset.Charset class represents a mapping between 16-bit Unicode characters and encoded bytes. The static factory method Charset.forName(String charsetName) returns a Charset instance for a specific encoding scheme based on a name. The following example demonstrates instantiation of a Charset for performing UTF-8 encoding:

Charset x = Charset.forName("UTF-8");

The .NET equivalent of the Charset class is the abstract class System.Text.Encoding. The Encoding class provides base functionality and factory methods for a number of encoding-specific classes contained in the System.Text namespace. Concrete implementations are provided for UTF-7, UTF-8, Big-Endian and Little-Endian UTF-16, and ASCII. The .NET Framework relies on the underlying operating system to provide support for other encoding schemes, adopting the Java model of returning an Encoding instance based on a scheme name to expose this functionality.

Table 7-6 maps commonly used Java scheme names against their .NET equivalents. The table also shows how to instantiate an Encoding object using the static methods and properties of Encoding. The concrete Encoding subclasses also offer instantiation via constructors.

Table 7-6. Java and .NET Character Encoders

Java Encoding Name | .NET Class | Obtain By
--- | --- | ---
US-ASCII | ASCIIEncoding | Encoding.ASCII
Cp1252 | Encoding | Encoding.GetEncoding(1252)
UTF-8 | UTF8Encoding | Encoding.UTF8
UTF-16BE | UnicodeEncoding | Encoding.BigEndianUnicode
UTF-16LE | UnicodeEncoding | Encoding.Unicode
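
The following sketch obtains encodings both by name and through the static properties, then inspects the bytes they produce. EncodingLookupSketch and ByName are hypothetical names; the example deliberately sticks to UTF-8, ASCII, and UTF-16, which are implemented by the class library itself.

```csharp
using System;
using System.Text;

// Hypothetical helper class contrasting name-based lookup with the static properties.
public static class EncodingLookupSketch
{
    // Returns an Encoding for a scheme name, similar in spirit to Java's Charset.forName.
    public static Encoding ByName(string name)
    {
        return Encoding.GetEncoding(name);
    }

    public static void Main()
    {
        Encoding utf8 = ByName("utf-8");
        Console.WriteLine(utf8.WebName);                                 // utf-8
        Console.WriteLine(utf8.CodePage);                                // 65001

        Console.WriteLine(Encoding.ASCII.GetBytes("Hi").Length);         // 2
        Console.WriteLine(Encoding.Unicode.GetBytes("Hi").Length);       // 4

        // Byte order: little-endian puts the low byte first, big-endian the high byte.
        Console.WriteLine(Encoding.Unicode.GetBytes("A")[0]);            // 65
        Console.WriteLine(Encoding.BigEndianUnicode.GetBytes("A")[0]);   // 0
    }
}
```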

ASCIIEncoding, UTF8Encoding, and UnicodeEncoding derive from Encoding but do not introduce any new functionality. Table 7-7 provides a summary of the Encoding functionality.

Table 7-7. Member Summary for System.Text.Encoding

Member | Description
--- | ---
Properties |
ASCII | Static read-only property used to get an ASCIIEncoding instance.
BigEndianUnicode | Static read-only property used to get a UnicodeEncoding instance configured for big-endian byte ordering.
BodyName | Gets the name for the encoding that can be used in mail agent body tags.
CodePage | Gets the code page identifier for the encoding.
Default | Static read-only property used to get an encoding for the system’s current ANSI code page.
EncodingName | Gets a human-readable name for the encoding.
HeaderName | Gets the name for the encoding for use in mail agent header tags.
IsBrowserDisplay | Gets an indication of whether the encoding can be used for display in browser clients.
IsBrowserSave | Gets an indication of whether the encoding can be used for saving by browser clients.
IsMailNewsDisplay | Gets an indication of whether the encoding can be used for display in mail and news clients.
IsMailNewsSave | Gets an indication of whether the encoding can be used for saving by mail and news clients.
Unicode | Static read-only property used to get a UnicodeEncoding instance configured for little-endian byte ordering.
UTF7 | Static read-only property used to get a UTF7Encoding instance.
UTF8 | Static read-only property used to get a UTF8Encoding instance.
WebName | Gets the IANA registered name for the encoding.
WindowsCodePage | Gets the Windows code page that most closely corresponds to the encoding.
Methods |
Convert() | A static method that converts a byte array between two encoding schemes.
GetByteCount() | Calculates the exact number of bytes required to encode a specified char[] or string.
GetBytes() | Encodes a string or char[] and returns a byte[].
GetCharCount() | Calculates the exact number of characters produced from decoding a specified byte[].
GetChars() | Decodes a byte[] into a char[].
GetDecoder() | Returns a System.Text.Decoder instance for the encoding. See Table 7-10 for details of the methods provided by the Decoder class.
GetEncoder() | Returns a System.Text.Encoder instance for the encoding. See Table 7-9 for details of the methods provided by the Encoder class.
GetEncoding() | Static factory method that returns an encoding instance for a named encoding scheme.
GetMaxByteCount() | Calculates the maximum number of bytes that will be required to encode a specified number of characters.
GetMaxCharCount() | Calculates the maximum number of characters that will be produced by decoding a specified number of bytes.

Both the Charset and Encoding classes provide convenience methods for performing single-call transformations and provide factory methods to create stateful encoding and decoding objects for processing longer char and byte arrays. These methods are contrasted in Table 7-8.

Table 7-8. Charset and Encoding Mechanisms for Encoding/Decoding

Java Charset | .NET Encoding | Description
--- | --- | ---
encode() | GetBytes() | Encodes a string or a sequence of characters into a byte array.
decode() | GetChars() | Decodes a byte array into a sequence of characters.
newEncoder() | GetEncoder() | Java returns a java.nio.charset.CharsetEncoder instance; .NET returns System.Text.Encoder.
newDecoder() | GetDecoder() | Java returns a java.nio.charset.CharsetDecoder instance; .NET returns System.Text.Decoder.
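
A minimal sketch of the single-call convenience methods follows: a string is encoded to UTF-8 with GetBytes and decoded again with GetChars. RoundTripSketch, ToUtf8, and FromUtf8 are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper class performing a UTF-8 round trip with the single-call methods.
public static class RoundTripSketch
{
    // Encodes the text as UTF-8 bytes.
    public static byte[] ToUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // Decodes UTF-8 bytes back into a string.
    public static string FromUtf8(byte[] bytes)
    {
        return new String(Encoding.UTF8.GetChars(bytes));
    }

    public static void Main()
    {
        string original = "caf\u00e9";              // four characters; the last is e-acute
        byte[] encoded = ToUtf8(original);

        Console.WriteLine(original.Length);                 // 4
        Console.WriteLine(encoded.Length);                  // 5 (e-acute needs two bytes)
        Console.WriteLine(FromUtf8(encoded) == original);   // True
    }
}
```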

The .NET Encoder and Decoder classes are simpler than their Java counterparts and are summarized in Table 7-9 and Table 7-10.

Table 7-9. Members of System.Text.Encoder

Member | Description
--- | ---
GetByteCount() | Calculates the number of bytes GetBytes would return if passed a specified char array.
GetBytes() | Takes an array of Unicode characters as an argument and encodes the characters as bytes. Any partial-character sequences at the end of the char array are stored and prepended to the char array provided in the next call to the GetBytes method.

Table 7-10. Members of System.Text.Decoder

Member | Description
--- | ---
GetCharCount() | Calculates the number of characters GetChars would return if passed a specified byte array.
GetChars() | Takes an array of bytes as an argument and decodes the bytes as Unicode characters. Any partial-byte sequences at the end of the byte array are stored and prepended to the byte array provided in the next call to the GetChars method.
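
The stateful behavior described in Table 7-10 can be sketched by splitting a UTF-8 byte sequence in the middle of a two-byte character. A single Decoder carries the partial sequence over to the next call, whereas two independent single-call decodes cannot. StatefulDecoderSketch and its methods are hypothetical names introduced for the example.

```csharp
using System;
using System.Text;

// Hypothetical helper class contrasting a stateful Decoder with stateless decoding.
public static class StatefulDecoderSketch
{
    // Decodes the bytes in two chunks with one Decoder, which remembers partial sequences.
    public static string DecodeInTwoChunks(byte[] bytes, int splitAt)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bytes.Length)];

        int written = decoder.GetChars(bytes, 0, splitAt, chars, 0);
        written += decoder.GetChars(bytes, splitAt, bytes.Length - splitAt, chars, written);
        return new String(chars, 0, written);
    }

    // Decodes each chunk independently, so a split character is lost.
    public static string DecodeStateless(byte[] bytes, int splitAt)
    {
        return Encoding.UTF8.GetString(bytes, 0, splitAt)
             + Encoding.UTF8.GetString(bytes, splitAt, bytes.Length - splitAt);
    }

    public static void Main()
    {
        string original = "caf\u00e9";
        byte[] bytes = Encoding.UTF8.GetBytes(original);   // the last character occupies bytes 3 and 4

        Console.WriteLine(DecodeInTwoChunks(bytes, 4) == original);   // True
        Console.WriteLine(DecodeStateless(bytes, 4) == original);     // False
    }
}
```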

The following code demonstrates the use of the encoding and decoding functionality we’ve discussed in this section:

// Obtain an Encoding that supports UTF-8
Encoding myEncoding = Encoding.UTF8;

// Obtain a stateful UTF-8 Encoder from the Encoding
Encoder myEncoder = myEncoding.GetEncoder();

// Determine the byte array size required to encode a string
char[] myChars = "Hello world !".ToCharArray();
int size = myEncoder.GetByteCount(myChars, 0, myChars.Length, true);

// Convert the character array to a UTF-8 byte array
byte[] myBytes = new byte[size];
myEncoder.GetBytes(myChars, 0, myChars.Length, myBytes, 0, true);

// Re-encode the UTF-8 bytes as UTF-7 using Encoding.Convert()
myBytes = Encoding.Convert(Encoding.UTF8, Encoding.UTF7, myBytes);

// Decode the UTF-7 bytes back into a Unicode string
String myString = new String(Encoding.UTF7.GetChars(myBytes));

// Display the resulting string
System.Console.WriteLine(myString);

Like Java, .NET also provides access to encoding mechanisms through other relevant classes. The String class provides a constructor that takes an Encoding to create a Unicode string from an encoded byte array. Also, the StreamReader and StreamWriter classes in the System.IO namespace provide constructors that allow an Encoding to be specified, forcing all strings written and read through the stream to be encoded and decoded using the specified scheme.

More Info

The StreamReader and StreamWriter classes, as well as examples of using string encoding in the context of stream and file operations, can be found in Chapter 10.

Dynamically Building Strings

The immutability of strings means that manipulation and concatenation result in new string instances being created instead of the contents of existing strings being modified. The overhead of creating many new string instances can dramatically affect the performance of an application. Java and .NET both provide similar solutions to this problem. Java provides the java.lang.StringBuffer class, and .NET provides the System.Text.StringBuilder class. Both classes represent a sequence of Unicode characters that can be modified in situ, extended if necessary, and rendered to a string as required.

The .NET StringBuilder class provides better functionality for formatting and manipulating contents of the character data but lacks some of the StringBuffer capability of inspecting the contents of, and extracting substrings from, the underlying character sequence. Table 7-11 compares the functionality of the StringBuffer and StringBuilder classes.

Table 7-11. Comparison of java.lang.StringBuffer and System.Text.StringBuilder

Java | .NET | Comments
--- | --- | ---
append() | Append() | Numerous overloaded methods that append the string representation of different values to the end of a StringBuilder.
N/A | AppendFormat() | .NET includes numerous overloaded methods that append formatted strings to the end of a StringBuilder.
capacity() | Capacity | In a difference from Java, Capacity is a property and supports both the getting and setting of the StringBuilder capacity. However, an ArgumentOutOfRangeException will be thrown if an attempt is made to set the capacity lower than the length of the current contents.
N/A | MaxCapacity | A read-only property used to get the maximum capacity for a StringBuilder class. This is platform dependent.
charAt(); setCharAt() | <StringBuilder>[key] | An indexer that provides a zero-based integer index into the characters of a StringBuilder.
delete(); deleteCharAt() | Remove() | Deletes a range of characters from a StringBuilder.
ensureCapacity() | EnsureCapacity() | Ensures that the capacity of a StringBuilder is at least the specified value.
getChars() | N/A |
indexOf() | N/A |
insert() | Insert() | Numerous overloaded methods insert the string representation of different values at a specified location within a StringBuilder.
lastIndexOf() | N/A | Unsupported.
length(); setLength() | Length | A property used to get and set the current length of the StringBuilder contents. As with Java, if the new length is shorter than the current contents, the contents will be truncated. If the new length is greater, the end of the string will be padded with spaces.
replace() | Replace() | The Java StringBuffer method replaces characters at the specified location range. Overloaded versions of the StringBuilder method replace matching characters or substrings.
reverse() | N/A |
subSequence() | N/A |
substring() | N/A |
toString() | ToString() | Returns the current contents of a StringBuilder as a string.
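
As a brief sketch of several StringBuilder members from Table 7-11, the following code appends formatted items, removes a trailing separator, inserts a prefix, and replaces a character, all in place. StringBuilderSketch and BuildReport are hypothetical names, and the output layout is an assumption made for the example.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper class that edits one StringBuilder instead of creating many strings.
public static class StringBuilderSketch
{
    // Produces a line such as "items=1:a,2:b" from the supplied names.
    public static string BuildReport(string[] names)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < names.Length; i++)
        {
            // AppendFormat accepts the same format specifications as String.Format.
            sb.AppendFormat(CultureInfo.InvariantCulture, "{0}:{1};", i + 1, names[i]);
        }
        if (sb.Length > 0)
        {
            sb.Remove(sb.Length - 1, 1);   // drop the trailing ';'
        }
        sb.Insert(0, "items=");            // add a prefix at index 0
        sb.Replace(';', ',');              // replace every remaining ';' with ','
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(BuildReport(new string[] { "a", "b" }));   // items=1:a,2:b
    }
}
```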
