Chapter 7. Strings

Strings are so fundamental a data type that their usage is often covered elsewhere, as it relates to other types. Nevertheless, some topics are specific to strings, and those are covered in this chapter.

Convert a String to Bytes (and Vice Versa)

Solution: The System.Text namespace defines a number of classes that help you work with textual data. To convert to and from bytes, use the Encoding class. Although it is possible to create your own text-encoding scheme, usually you will use one of the pre-created static members of the Encoding class, which represent the most common encoding standards, such as ASCII, Unicode, UTF-8, and so on.

The GetBytes() method will get the appropriate byte representation:

string myString = "C# Rocks!";
byte[] bytes = Encoding.ASCII.GetBytes(myString);

If you were to print out the string and the bytes, it would look like this:

Original string: C# Rocks!
ASCII bytes: 43-23-20-52-6F-63-6B-73-21

Using Encoding.Unicode (UTF-16, little-endian), where each of these characters takes two bytes, it would look like this:

Unicode bytes: 43-00-23-00-20-00-52-00-6F-00-63-00-6B-00-73-00-21-00

To convert back, use Encoding.GetString() and pass in the bytes:

string result = Encoding.ASCII.GetString(bytes);
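Putting these pieces together, a minimal sketch might look like the following. The class and method names are hypothetical, and BitConverter.ToString() is assumed here as one convenient way to produce the hyphenated hex dump shown above.

```csharp
using System;
using System.Text;

public static class StringBytesDemo
{
    // Formats the bytes of a string, in the given encoding, as hyphenated hex.
    public static string ToHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string myString = "C# Rocks!";
        Console.WriteLine("Original string: " + myString);
        Console.WriteLine("ASCII bytes: " + ToHex(myString, Encoding.ASCII));
        Console.WriteLine("Unicode bytes: " + ToHex(myString, Encoding.Unicode));

        // Convert to bytes and back again with the same encoding.
        byte[] bytes = Encoding.ASCII.GetBytes(myString);
        string result = Encoding.ASCII.GetString(bytes);
        Console.WriteLine("Decoded: " + result);
    }
}
```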

This is all simple enough, but consider the string “C# Rocks!♪”. The music note has no ASCII equivalent, so it is converted to a question mark (?) when you encode the string with ASCII or any other encoding that cannot represent it, as in this example:

image

This has the following output:

Round trip: C# Rocks!♪->43-23-20-52-6F-63-6B-73-21-3F->C# Rocks!?

However, if we use Unicode encoding, it will all work out both ways:

image

Here’s the output:

image
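The round-trip listings are not reproduced here, so the following is only a sketch of what such a test could look like. The names are hypothetical, and the eighth-note character (U+266A) is an assumption standing in for whatever music note the original used.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Encodes text to bytes and decodes it again with the same encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        // \u266A is an eighth note; assumed here for illustration.
        string original = "C# Rocks!\u266A";

        foreach (Encoding encoding in new[] { Encoding.ASCII, Encoding.Unicode })
        {
            byte[] bytes = encoding.GetBytes(original);
            Console.WriteLine("Round trip ({0}): {1}->{2}->{3}",
                encoding.EncodingName,
                original,
                BitConverter.ToString(bytes),
                encoding.GetString(bytes));
        }
    }
}
```

With ASCII, the note should come back as a question mark; with Encoding.Unicode, the string should survive unchanged. Whether the console itself can display the note depends on its font and output encoding.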

Note

In memory, all .NET strings are Unicode (stored as UTF-16). It’s only when you’re writing strings to files, networks, databases, interop buffers, and other external locations that you need to worry about byte representation.

Create a Custom Encoding Scheme

Solution: Derive a class from System.Text.Encoding and implement the abstract methods. The Encoding class defines many overloads for both encoding and decoding strings, but they all eventually call a few fundamental methods and these are the only methods you need to define.

Much of the code in this example is argument validation, as stipulated by the MSDN documentation for this class.

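The encoding algorithm is the famous substitution cipher, ROT13, where each letter is shifted 13 places along the alphabet, wrapping around at the end. Because the alphabet has 26 letters, applying the rotation twice restores the original text.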

image

A simple test demonstrates this (non)encryption encoder:

image

Here’s the output:

image
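Because the original listing is not reproduced here, the following is a reduced sketch of how such a class could be written, not the author's code. The class names are hypothetical, the argument validation is abbreviated compared with what the documentation asks for, and characters outside ASCII are simply assumed to encode as a question mark.

```csharp
using System;
using System.Text;

// Hypothetical sketch of a ROT13 encoding: one byte per character.
public sealed class Rot13Encoding : Encoding
{
    public override int GetByteCount(char[] chars, int index, int count)
    {
        CheckRange(chars, "chars", index, count);
        return count;
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount,
                                 byte[] bytes, int byteIndex)
    {
        CheckRange(chars, "chars", charIndex, charCount);
        CheckRange(bytes, "bytes", byteIndex, charCount);
        for (int i = 0; i < charCount; i++)
        {
            char c = chars[charIndex + i];
            // Assumption: non-ASCII characters become '?'.
            bytes[byteIndex + i] = c < 128 ? (byte)Rotate(c) : (byte)'?';
        }
        return charCount;
    }

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        CheckRange(bytes, "bytes", index, count);
        return count;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount,
                                 char[] chars, int charIndex)
    {
        CheckRange(bytes, "bytes", byteIndex, byteCount);
        CheckRange(chars, "chars", charIndex, byteCount);
        for (int i = 0; i < byteCount; i++)
        {
            chars[charIndex + i] = Rotate((char)bytes[byteIndex + i]);
        }
        return byteCount;
    }

    public override int GetMaxByteCount(int charCount) { return charCount; }
    public override int GetMaxCharCount(int byteCount) { return byteCount; }

    // Rotates ASCII letters by 13 places; everything else passes through.
    private static char Rotate(char c)
    {
        if (c >= 'a' && c <= 'z') return (char)('a' + (c - 'a' + 13) % 26);
        if (c >= 'A' && c <= 'Z') return (char)('A' + (c - 'A' + 13) % 26);
        return c;
    }

    // Abbreviated validation shared by all four conversion methods.
    private static void CheckRange(Array array, string name, int index, int count)
    {
        if (array == null) throw new ArgumentNullException(name);
        if (index < 0 || count < 0 || array.Length - index < count)
            throw new ArgumentOutOfRangeException(name);
    }
}

public static class Rot13Demo
{
    public static void Main()
    {
        Encoding rot13 = new Rot13Encoding();
        byte[] encoded = rot13.GetBytes("Hello, World!");
        Console.WriteLine(Encoding.ASCII.GetString(encoded)); // Uryyb, Jbeyq!
        Console.WriteLine(rot13.GetString(encoded));          // Hello, World!
    }
}
```

The string-based overloads such as GetBytes(string) and GetString(byte[]) are inherited from Encoding and end up calling the array-based methods overridden above.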

Note

The need for custom string encoding can show up in unusual places. I once worked on a system that needed to communicate with buoys via satellites. Because of space constraints, the limited alphabet had to be encoded in 6 bits per character. It was easy to wrap this functionality into a custom Encoding-derived class for use in the application.

Compare Strings Correctly

Solution: Always take culture into account and use the functionality provided in the .NET Framework. There are many ways to compare strings in C#. Some are better suited for localized text because strings can seem equivalent while having different byte values. This is especially true in non-Latin alphabets and when capitalization rules are not what you assume.

Here is a demonstration of the issue:

image

What is the output of this program?

image

As you can see, the culture can drastically change how strings are interpreted.
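Since the demonstration itself is not reproduced here, this is a small sketch of the kind of comparison involved. The class and method names are hypothetical, and "file" versus "FILE" is assumed as the test pair because Turkish treats dotted and dotless i as different letters.

```csharp
using System;
using System.Globalization;

public static class CompareDemo
{
    // Case-insensitive equality under the rules of a specific culture.
    public static bool EqualsIgnoreCase(string a, string b, CultureInfo culture)
    {
        return String.Compare(a, b, true, culture) == 0;
    }

    public static void Main()
    {
        // Requires Turkish culture data to be available on the machine.
        CultureInfo turkish = new CultureInfo("tr-TR");

        Console.WriteLine("Invariant: " +
            EqualsIgnoreCase("file", "FILE", CultureInfo.InvariantCulture));
        Console.WriteLine("Turkish:   " +
            EqualsIgnoreCase("file", "FILE", turkish));
    }
}
```

Under the invariant culture the two strings compare as equal. Under Turkish rules they are expected not to, although the exact behavior depends on the culture data installed with your runtime.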

In general, you should use String.Compare() and supply the culture you want to use. For nonlocalized strings, such as internal program strings or settings names, using the invariant culture is acceptable.

Change Case Correctly

Solution: This is nearly the same issue as comparing two strings. You must specify the culture to do it correctly. Again, use the built-in functionality and remember the culture.

Using our Turkish example again, here is the sample code:

image

What is the difference in output?

Original: file
Uppercase (invariant): FILE
Uppercase (Turkish): FILE

Seemingly nothing, but let’s look at the bytes of the uppercase strings:

Bytes (invariant): 46-00-49-00-4C-00-45-00
Bytes   (Turkish): 46-00-30-01-4C-00-45-00

You can now see the difference, even though the visual representation is the same. The lesson here is that just because a character looks the same, it doesn’t mean it is the same.
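A sketch along these lines should reproduce the byte dumps above. The names are hypothetical, and the Turkish result depends on Turkish culture data being available to the runtime.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CaseDemo
{
    // Uppercases text for a culture and returns its UTF-16 bytes as hex.
    public static string UpperHex(string text, CultureInfo culture)
    {
        string upper = text.ToUpper(culture);
        return BitConverter.ToString(Encoding.Unicode.GetBytes(upper));
    }

    public static void Main()
    {
        string original = "file";
        CultureInfo turkish = new CultureInfo("tr-TR");

        Console.WriteLine("Original: " + original);
        Console.WriteLine("Bytes (invariant): " +
            UpperHex(original, CultureInfo.InvariantCulture));
        Console.WriteLine("Bytes   (Turkish): " +
            UpperHex(original, turkish));
    }
}
```

The 30-01 pair in the Turkish result is U+0130, the capital I with a dot above, which many fonts render almost identically to a plain I.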

Detect Empty Strings

Solution: When handling string input, you generally need to be aware of four possible states:

1. String is null.

2. String is empty.

3. String contains nothing but whitespace.

4. String contains content.

The first two conditions are handled by the static method String.IsNullOrEmpty(), which has been available since .NET 2.0.

.NET 4 adds String.IsNullOrWhiteSpace(), which also handles condition 3.

bool containsContent = !String.IsNullOrWhiteSpace(myString);
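If you need to tell the four states apart rather than just detect content, one possible sketch is shown here. The StringState enum and the Classify method are hypothetical names introduced only for this illustration.

```csharp
using System;

public enum StringState { Null, Empty, WhiteSpace, Content }

public static class StringStateDemo
{
    // Maps a string to one of the four states listed above.
    public static StringState Classify(string s)
    {
        if (s == null) return StringState.Null;
        if (s.Length == 0) return StringState.Empty;
        if (String.IsNullOrWhiteSpace(s)) return StringState.WhiteSpace;
        return StringState.Content;
    }

    public static void Main()
    {
        string[] samples = { null, "", "   ", "text" };
        foreach (string sample in samples)
        {
            Console.WriteLine(Classify(sample));
        }
    }
}
```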

Concatenate Strings: Should You Use StringBuilder?

Solution: Not necessarily. Use of StringBuilder has become somewhat of a dogmatic issue in the C# community, and some explanation is useful.

In C#, strings are immutable objects, meaning they cannot be changed once created. This has ramifications for string manipulation, but they are probably not as drastic as you might be led to believe. There are essentially two options for string concatenation: using plain String objects and using StringBuilder.

Here’s an example using plain string objects:

image

Here’s an example using StringBuilder:

image
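The two listings are not reproduced here, so this sketch shows what the two approaches might look like side by side, along with a rough timing loop. The names and the iteration counts are assumptions, and Stopwatch timings like these are only a crude indication.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ConcatDemo
{
    // Plain string concatenation: every += allocates a new string.
    public static string WithString(int count)
    {
        string result = "";
        for (int i = 0; i < count; i++)
        {
            result += i.ToString();
        }
        return result;
    }

    // StringBuilder: appends into a buffer that grows only when needed.
    public static string WithBuilder(int count)
    {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            builder.Append(i);
        }
        return builder.ToString();
    }

    public static void Main()
    {
        foreach (int count in new[] { 10, 1000, 20000 })
        {
            Stopwatch watch = Stopwatch.StartNew();
            WithString(count);
            long stringMs = watch.ElapsedMilliseconds;

            watch.Restart();
            WithBuilder(count);
            long builderMs = watch.ElapsedMilliseconds;

            Console.WriteLine("{0} items: string {1} ms, StringBuilder {2} ms",
                count, stringMs, builderMs);
        }
    }
}
```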

Conventional wisdom tells you to use StringBuilder for concatenating strings, but that answer is too simplistic. In fact, the official guidelines give as good guidance as we’re likely to find. The following is from the MSDN documentation for System.StringBuilder (http://msdn.microsoft.com/en-us/library/system.text.stringbuilder.aspx):

The performance of a concatenation operation for a String or StringBuilder object depends on how often a memory allocation occurs. A String concatenation operation always allocates memory, whereas a StringBuilder concatenation operation only allocates memory if the StringBuilder object buffer is too small to accommodate the new data. Consequently, the String class is preferable for a concatenation operation if a fixed number of String objects are concatenated. In that case, the individual concatenation operations might even be combined into a single operation by the compiler. A StringBuilder object is preferable for a concatenation operation if an arbitrary number of strings are concatenated; for example, if a loop concatenates a random number of strings of user input.

Running my own tests basically confirms this strategy. StringBuilder is only faster once we’re dealing with a lot of strings, as Figure 7.1 shows. Despite string append being much slower, we’re still talking about milliseconds either way.

Figure 7.1 Using StringBuilder doesn’t matter (for performance) when there are relatively few strings.

image

However, a caveat: All of this greatly depends on the size and number of your strings. Whenever performance is involved, there is only one rule: measure it.

You also need to keep in mind the number of objects created—when standard string concatenation is used, a new string is created at each concatenation. This can lead to an explosion of objects for the garbage collector to deal with.

You can perform your own timings with the StringBuilderTime project located in the code samples for this chapter.

In the end, you must measure and profile your own code to determine what’s best in your scenario.

Concatenate Collection Items into a String

Solution: There are a few options for this, depending on what you need. For simple concatenation, use the String.Concat() overload added in .NET 4 that takes a collection of any type, converts each item to a string, and concatenates the results.

int[] vals = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 };

Console.WriteLine(String.Concat(vals));

This gives the following output:

1234567890

If you want to separate the values with delimiters, use String.Join:

int[] vals = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 };
Console.WriteLine(string.Join(", ", vals));

This produces the output:

1, 2, 3, 4, 5, 6, 7, 8, 9, 0

However, if you want to do this with objects, it gets trickier. Suppose you have a Person class:

image

You can try to use the same code (with a required type parameter now):

image

This produces this undesirable output:

image

Instead, you can use an extension method on IEnumerable<T> that was introduced for LINQ (see Chapter 21, “LINQ”) to accumulate just the properties you need. This example combines anonymous delegates with LINQ to pack a lot of functionality into very little code:

image

This gives the following output:

Gates, Ballmer, Jobs
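The listings for this example are not reproduced here. The sketch below assumes a Person class with FirstName and LastName properties (the property names and first names are guesses) and uses the LINQ Aggregate() extension method, which is one way to get the output shown. A String.Join() variant is included for comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Person class; only LastName is needed for the output.
public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public static class JoinDemo
{
    // Accumulates last names with Aggregate.
    // Note: Aggregate without a seed throws on an empty sequence.
    public static string LastNames(IEnumerable<Person> people)
    {
        return people.Select(p => p.LastName)
                     .Aggregate((current, next) => current + ", " + next);
    }

    // Alternative: project the property, then let String.Join do the work.
    public static string LastNamesJoined(IEnumerable<Person> people)
    {
        return String.Join(", ", people.Select(p => p.LastName));
    }

    public static void Main()
    {
        Person[] people =
        {
            new Person { FirstName = "Bill", LastName = "Gates" },
            new Person { FirstName = "Steve", LastName = "Ballmer" },
            new Person { FirstName = "Steve", LastName = "Jobs" }
        };

        Console.WriteLine(LastNames(people));
        Console.WriteLine(LastNamesJoined(people));
    }
}
```

Passing the Person objects directly to String.Join() instead calls ToString() on each one, which by default returns the type name rather than anything useful.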

Append a Newline Character

Solution: Consider that .NET works on many operating systems, which do not all share the same newline convention. For example, Windows uses a carriage-return/line-feed pair, whereas Linux and Mac OS X use just line-feed. Here is the correct way to format a string with a newline:

string a = "My String Here" + Environment.NewLine;

Environment.NewLine will always return the correct string for the current environment.

Split a String

Solution: The String class contains a handy Split method. You can tell it what characters (or strings) to consider as delimiters and also give it some options (as demonstrated next).

image

The following output is produced:

image

Notice the gaps? Those are the points where two delimiters appear in a row. In this case, the Split() method emits an empty string. This can be useful if, say, you are parsing a data file consisting of comma-separated values and you need to know when a value was missing. Sometimes, though, you’ll want to remove the empty strings. In that case, pass in the StringSplitOptions.RemoveEmptyEntries flag, as shown here:

image

Here’s the output:

image
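The listings and their output are not reproduced here, so this sketch uses a made-up input string and delimiter set to show the difference between the two calls. All names and sample values are assumptions.

```csharp
using System;

public static class SplitDemo
{
    private static readonly char[] Delimiters = { ',', ';' };

    // Keeps empty entries where two delimiters appear in a row.
    public static string[] SplitAll(string input)
    {
        return input.Split(Delimiters);
    }

    // Drops the empty entries.
    public static string[] SplitNonEmpty(string input)
    {
        return input.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string input = "apple,,banana;cherry,;date";

        Console.WriteLine("With empty entries:");
        foreach (string part in SplitAll(input))
        {
            Console.WriteLine("[" + part + "]");
        }

        Console.WriteLine("Without empty entries:");
        foreach (string part in SplitNonEmpty(input))
        {
            Console.WriteLine("[" + part + "]");
        }
    }
}
```

For this input, the first call yields six entries (two of them empty) and the second yields four.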

Convert Binary Data to a String (Base-64 Encoding)

Solution: Use the Convert.ToBase64String method as Listing 7.1 shows.

Listing 7.1 EncodeBase64Bad

image

To use this program, run it with a filename on the command line. It will dump the base-64 equivalent to the console (or you can redirect it to a text file, if desired).

Note

Notice I called this a “bad” implementation. That’s because it reads the entire file into memory in one go and then writes it out as a single string. Try this on an enormous file, and you’ll see why it’s not such a good idea for a general file conversion utility. It would be better to perform the conversion in smaller chunks.

To convert back to binary from base-64, use the following:

byte[] bytes = Convert.FromBase64String(myBase64String);
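As a sketch of the chunked approach suggested in the note above, you could encode a stream one buffer at a time. This is not the book's listing; the names and the chunk size are assumptions. The one real constraint is that every chunk except the last must be a multiple of 3 bytes, so that no padding characters land in the middle of the output.

```csharp
using System;
using System.IO;

public static class Base64Chunked
{
    // A multiple of 3, so full chunks encode without '=' padding.
    private const int ChunkSize = 3 * 1024;

    // Encodes the input stream to base-64 text one chunk at a time.
    public static void Encode(Stream input, TextWriter output)
    {
        byte[] buffer = new byte[ChunkSize];
        int filled;
        while ((filled = FillBuffer(input, buffer)) > 0)
        {
            output.Write(Convert.ToBase64String(buffer, 0, filled));
        }
    }

    // Stream.Read may return fewer bytes than requested, so keep reading
    // until the buffer is full or the stream ends.
    private static int FillBuffer(Stream input, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = input.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("Usage: pass a single filename");
            return;
        }
        using (FileStream input = File.OpenRead(args[0]))
        {
            Encode(input, Console.Out);
        }
    }
}
```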

Reverse Words

Solution: Perform a reversal of the entire string, character by character. Then find the parts containing words and reverse each part individually. Listing 7.2 shows an example.

Listing 7.2 Reverse Words in a String

image
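Listing 7.2 is not reproduced here. The following sketch applies the approach described in the solution, assuming that words are separated by whitespace; the names are hypothetical.

```csharp
using System;

public static class WordReverser
{
    // Reverses the order of words while keeping each word readable.
    public static string ReverseWords(string input)
    {
        if (String.IsNullOrEmpty(input)) return input;

        // Step 1: reverse the entire string, character by character.
        char[] chars = input.ToCharArray();
        Array.Reverse(chars);

        // Step 2: reverse each whitespace-delimited run back into a word.
        int start = 0;
        for (int i = 0; i <= chars.Length; i++)
        {
            if (i == chars.Length || Char.IsWhiteSpace(chars[i]))
            {
                Array.Reverse(chars, start, i - start);
                start = i + 1;
            }
        }
        return new string(chars);
    }

    public static void Main()
    {
        Console.WriteLine(ReverseWords("the quick brown fox"));
        // prints "fox brown quick the"
    }
}
```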

Sort Number Strings Naturally

Solution: You must write your own string comparer to do the comparison correctly. You can either write a class that implements IComparer<T> or a method that conforms to the delegate Comparison<T> (see Chapter 15, “Delegates, Events, and Anonymous Methods,” which discusses delegates). Listing 7.3 provides an example of the first option.

Listing 7.3 Natural Sorting

image

Here is the output of the program:

image

Note that although it’s easier to split a string using regular expressions, this method is faster (at least in my case—you should always measure performance yourself when it’s important), which has advantages when this method is called often during a sorting session. However, it is not thread-safe because the buffer is shared between all calls to Compare().
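Listing 7.3 is not reproduced here, so the following is a different, simplified sketch of a natural comparer rather than the buffer-based version just described. It walks both strings in place, compares runs of ASCII digits by numeric value, and compares everything else character by character. It keeps no shared state, so the thread-safety caveat does not apply to it. The class name and sample data are assumptions.

```csharp
using System;
using System.Collections.Generic;

public sealed class NaturalComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        if (ReferenceEquals(x, y)) return 0;
        if (x == null) return -1;
        if (y == null) return 1;

        int i = 0, j = 0;
        while (i < x.Length && j < y.Length)
        {
            if (IsDigit(x[i]) && IsDigit(y[j]))
            {
                // Skip leading zeros so "007" and "7" compare as equal numbers.
                while (i < x.Length && x[i] == '0') i++;
                while (j < y.Length && y[j] == '0') j++;

                int startX = i, startY = j;
                while (i < x.Length && IsDigit(x[i])) i++;
                while (j < y.Length && IsDigit(y[j])) j++;

                // A longer digit run (without leading zeros) is a larger number.
                int lengthX = i - startX, lengthY = j - startY;
                if (lengthX != lengthY) return lengthX.CompareTo(lengthY);

                // Same length: compare digit by digit.
                for (int k = 0; k < lengthX; k++)
                {
                    int digitResult = x[startX + k].CompareTo(y[startY + k]);
                    if (digitResult != 0) return digitResult;
                }
            }
            else
            {
                int charResult = x[i].CompareTo(y[j]);
                if (charResult != 0) return charResult;
                i++;
                j++;
            }
        }

        // Whichever string has characters left over sorts last.
        return (x.Length - i).CompareTo(y.Length - j);
    }

    private static bool IsDigit(char c)
    {
        return c >= '0' && c <= '9';
    }
}

public static class NaturalSortDemo
{
    public static void Main()
    {
        List<string> files = new List<string> { "file10.txt", "file2.txt", "file1.txt" };
        files.Sort(new NaturalComparer());
        Console.WriteLine(String.Join(", ", files));
        // prints "file1.txt, file2.txt, file10.txt"
    }
}
```

Non-digit characters are compared by their ordinal values here, which ignores culture. That is a simplification you may want to revisit for localized text.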

Note

The List<T> sort method also allows you to pass in a delegate to perform the sort. Because the example in this section requires a helper method and a buffer, using a delegate may not be the best solution in this particular case. However, in general, for one-off sorting algorithms, it’s perfectly fine and can be implemented something like this (using a lambda expression for the delegate; see Chapter 15, “Delegates, Events, and Anonymous Methods”):

image
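The listing for this note is not reproduced here. As an illustration of the general idea only, this sketch sorts strings by length with a lambda passed as the Comparison<T> delegate. The sort criterion and the names are assumptions, not the original example.

```csharp
using System;
using System.Collections.Generic;

public static class LambdaSortDemo
{
    // Returns a copy of the items sorted by length, shortest first.
    public static List<string> SortByLength(IEnumerable<string> items)
    {
        List<string> list = new List<string>(items);
        // The lambda is converted to a Comparison<string> delegate.
        list.Sort((a, b) => a.Length.CompareTo(b.Length));
        return list;
    }

    public static void Main()
    {
        List<string> sorted = SortByLength(new[] { "ccc", "a", "bb" });
        Console.WriteLine(String.Join(", ", sorted)); // prints "a, bb, ccc"
    }
}
```

Keep in mind that List<T>.Sort() is not a stable sort, so items that compare as equal may not keep their original relative order.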
