Strings are so fundamental a data type that their usage is often covered elsewhere, as it relates to other types. Nevertheless, some topics are specific to strings, and those are covered in this chapter.
Solution: The System.Text namespace defines a number of classes that help you work with textual data. To convert to and from bytes, use the Encoding class. Although it is possible to create your own text-encoding scheme, usually you will use one of the pre-created static members of the Encoding class, which represent the most common encoding standards, such as ASCII, Unicode, UTF-8, and so on.
The GetBytes() method will get the appropriate byte representation:
string myString = "C# Rocks!";
byte[] bytes = Encoding.ASCII.GetBytes(myString);
If you were to print out the string and the bytes, it would look like this:
Original string: C# Rocks!
ASCII bytes: 43-23-20-52-6F-63-6B-73-21
Using Encoding.Unicode, it would look like this:
Unicode bytes: 43-00-23-00-20-00-52-00-6F-00-63-00-6B-00-73-00-21-00
To convert back, use Encoding.GetString() and pass in the bytes:
string result = Encoding.ASCII.GetString(bytes);
This is all simple enough, but consider the string “C# Rocks!♫”. The music note has no ASCII equivalent. This character will be converted to a question mark (?) if you convert to an encoding that cannot represent it, as in this example:
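The original example is not reproduced here. As a minimal sketch of such a round trip, assume the music note is U+266B (written as an escape so the source file's own encoding does not matter). The class and method names are hypothetical:

```csharp
using System;
using System.Text;

static class AsciiRoundTrip
{
    // Encodes text with the given encoding, then decodes the bytes again.
    public static string RoundTrip(Encoding encoding, string text, out byte[] bytes)
    {
        bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    static void Main()
    {
        string myString = "C# Rocks!\u266B";   // assumed music note character
        byte[] bytes;
        string result = RoundTrip(Encoding.ASCII, myString, out bytes);
        Console.WriteLine("Round trip: {0}->{1}->{2}",
            myString, BitConverter.ToString(bytes), result);
    }
}
```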
This has the following output:
Round trip: C# Rocks!♫->43-23-20-52-6F-63-6B-73-21-3F->C# Rocks!?
However, if we use Unicode encoding, it will all work out both ways:
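A minimal sketch of the same round trip with Encoding.Unicode (UTF-16, little-endian) might look like the following. The music note is again assumed to be U+266B, and the class and method names are hypothetical:

```csharp
using System;
using System.Text;

static class UnicodeRoundTrip
{
    // Encodes to UTF-16 bytes and decodes them again; nothing is lost.
    public static string RoundTrip(string text, out byte[] bytes)
    {
        bytes = Encoding.Unicode.GetBytes(text);
        return Encoding.Unicode.GetString(bytes);
    }

    static void Main()
    {
        string myString = "C# Rocks!\u266B";   // assumed music note character
        byte[] bytes;
        string result = RoundTrip(myString, out bytes);
        Console.WriteLine("Round trip: {0}->{1}->{2}",
            myString, BitConverter.ToString(bytes), result);
    }
}
```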
Here’s the output:
In memory, all strings are Unicode in .NET. It’s only when you’re writing strings to files, networks, databases, interop buffers, and other external locations that you need to worry about byte representation.
Solution: Derive a class from System.Text.Encoding and implement the abstract methods. The Encoding class defines many overloads for both encoding and decoding strings, but they all eventually call a few fundamental methods and these are the only methods you need to define.
Much of the code in this example is argument validation, as stipulated by the MSDN documentation for this class.
The encoding algorithm is the famous substitution cipher, ROT13, where each letter is rotated 13 places to the right.
A simple test demonstrates this (non)encryption encoder:
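The original listing and test are not reproduced here. The following is a rough sketch of what a ROT13 encoding could look like, not the author's code. It assumes one byte per character and substitutes '?' for anything outside ASCII. The names Rot13Encoding, CheckRange, and Rotate are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical ROT13 "encoding": one byte per char, letters rotated by 13.
class Rot13Encoding : Encoding
{
    public override int GetByteCount(char[] chars, int index, int count)
    {
        CheckRange(chars, "chars", index, count);
        return count;
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount,
                                 byte[] bytes, int byteIndex)
    {
        CheckRange(chars, "chars", charIndex, charCount);
        CheckRange(bytes, "bytes", byteIndex, charCount);
        for (int i = 0; i < charCount; i++)
        {
            char c = chars[charIndex + i];
            // Assumption: non-ASCII characters are replaced with '?'.
            bytes[byteIndex + i] = c < 128 ? (byte)Rotate(c) : (byte)'?';
        }
        return charCount;
    }

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        CheckRange(bytes, "bytes", index, count);
        return count;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount,
                                 char[] chars, int charIndex)
    {
        CheckRange(bytes, "bytes", byteIndex, byteCount);
        CheckRange(chars, "chars", charIndex, byteCount);
        for (int i = 0; i < byteCount; i++)
        {
            // Rotating by 13 twice restores the original letter.
            chars[charIndex + i] = Rotate((char)bytes[byteIndex + i]);
        }
        return byteCount;
    }

    public override int GetMaxByteCount(int charCount) { return charCount; }

    public override int GetMaxCharCount(int byteCount) { return byteCount; }

    // Argument validation shared by all four array-based methods.
    static void CheckRange(Array array, string name, int index, int count)
    {
        if (array == null) throw new ArgumentNullException(name);
        if (index < 0 || count < 0 || index > array.Length - count)
            throw new ArgumentOutOfRangeException(name);
    }

    static char Rotate(char c)
    {
        if (c >= 'a' && c <= 'z') return (char)('a' + (c - 'a' + 13) % 26);
        if (c >= 'A' && c <= 'Z') return (char)('A' + (c - 'A' + 13) % 26);
        return c;
    }
}

static class Rot13Demo
{
    static void Main()
    {
        Encoding rot13 = new Rot13Encoding();
        byte[] bytes = rot13.GetBytes("Hello, World!");
        Console.WriteLine(Encoding.ASCII.GetString(bytes));  // the rotated text
        Console.WriteLine(rot13.GetString(bytes));           // the original text
    }
}
```

Because the string-based overloads of GetBytes() and GetString() funnel into the array-based methods, overriding these six members is enough for the demo calls above.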
The need for custom string encoding can show up in unusual places. I once worked on a system that needed to communicate with buoys via satellites. Because space was at a premium, the limited alphabet had to be encoded in 6 bits per character. It was easy to wrap this functionality into a custom Encoding-derived class for use in the application.
Solution: Always take culture into account and use the functionality provided in the .NET Framework. There are many ways to compare strings in C#. Some are better suited for localized text because strings can seem equivalent while having different byte values. This is especially true in non-Latin alphabets and when capitalization rules are not what you assume.
Here is a demonstration of the issue:
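The original demonstration is not reproduced here. A sketch along the same lines, assuming the strings "file" and "FILE" and a case-insensitive comparison, could look like this. In Turkish, the uppercase of i is the dotted İ rather than I, so the two strings are not expected to match under tr-TR. The class and method names are hypothetical, and the tr-TR and en-US cultures must be available on the machine.

```csharp
using System;
using System.Globalization;

static class CultureCompareDemo
{
    // True when a and b compare as equal, ignoring case, under the given culture.
    public static bool EqualIgnoringCase(string a, string b, CultureInfo culture)
    {
        return string.Compare(a, b, true, culture) == 0;
    }

    static void Main()
    {
        string lower = "file";
        string upper = "FILE";
        CultureInfo[] cultures =
        {
            CultureInfo.InvariantCulture,
            new CultureInfo("en-US"),
            new CultureInfo("tr-TR")
        };
        foreach (CultureInfo culture in cultures)
        {
            // The invariant culture reports an empty name.
            string name = culture.Name.Length > 0 ? culture.Name : "invariant";
            Console.WriteLine("{0}: {1}", name, EqualIgnoringCase(lower, upper, culture));
        }
    }
}
```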
What is the output of this program?
As you can see, the culture can drastically change how strings are interpreted.
In general, you should use String.Compare() and supply the culture you want to use. For nonlocalized strings, such as internal program strings or settings names, using the invariant culture is acceptable.
Solution: This is nearly the same issue as comparing two strings. You must specify the culture to do it correctly. Again, use the built-in functionality and remember the culture.
Using our Turkish example again, here is the sample code:
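The sample code itself is missing from this text. A sketch that would print the kind of output discussed here might look like this, assuming the tr-TR culture is installed. The class and method names are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

static class CultureUpperDemo
{
    // Uppercases text under the given culture and returns its UTF-16 bytes as hex.
    public static string UpperBytes(string text, CultureInfo culture)
    {
        string upper = text.ToUpper(culture);
        return BitConverter.ToString(Encoding.Unicode.GetBytes(upper));
    }

    static void Main()
    {
        string original = "file";
        CultureInfo turkish = new CultureInfo("tr-TR");

        Console.WriteLine("Original: " + original);
        Console.WriteLine("Uppercase (invariant): " + original.ToUpperInvariant());
        Console.WriteLine("Uppercase (Turkish): " + original.ToUpper(turkish));
        Console.WriteLine("Bytes (invariant): "
            + UpperBytes(original, CultureInfo.InvariantCulture));
        Console.WriteLine("Bytes (Turkish): " + UpperBytes(original, turkish));
    }
}
```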
What is the difference in output?
Original: file
Uppercase (invariant): FILE
Uppercase (Turkish): FILE
Seemingly nothing, but let’s look at the bytes of the uppercase strings:
Bytes (invariant): 46-00-49-00-4C-00-45-00
Bytes (Turkish): 46-00-30-01-4C-00-45-00
You can now see the difference, even though the visual representation is the same. The lesson here is that just because a character looks the same, it doesn’t mean it is the same.
Solution: When handling string input, you generally need to be aware of four possible states:
1. String is null.
2. String is empty.
3. String contains nothing but whitespace.
4. String contains content.
The first two conditions are handled by the static method String.IsNullOrEmpty(), which has existed in .NET for a while.
.NET 4 adds String.IsNullOrWhiteSpace(), which also handles condition 3.
bool containsContent = !String.IsNullOrWhiteSpace(myString);
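To make the four states concrete, here is a small sketch that classifies an input string. The helper name Describe and its return values are hypothetical, chosen only for illustration.

```csharp
using System;

static class StringStateDemo
{
    // Maps a string to one of the four states listed above.
    public static string Describe(string s)
    {
        if (s == null) return "null";
        if (s.Length == 0) return "empty";
        if (string.IsNullOrWhiteSpace(s)) return "whitespace";
        return "content";
    }

    static void Main()
    {
        string[] inputs = { null, "", " \t\r\n", "hello" };
        foreach (string input in inputs)
        {
            Console.WriteLine(Describe(input));
        }
    }
}
```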
StringBuilder?
Solution: Not necessarily. Use of StringBuilder has become something of a dogmatic issue in the C# community, and some explanation is useful.
In C#, strings are immutable objects, meaning they cannot be changed once created. This has ramifications for string manipulation, but they are probably not as drastic as you might be led to believe. There are essentially two options for string concatenation: using plain String objects and using StringBuilder.
Here’s an example using plain string objects:
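The original example is missing from this text. A minimal sketch of plain concatenation in a loop, with hypothetical names, might be:

```csharp
using System;

static class StringConcatDemo
{
    // Appends the numbers 0..count-1 using plain string concatenation.
    public static string Build(int count)
    {
        string result = "";
        for (int i = 0; i < count; i++)
        {
            // Strings are immutable, so each += creates a new string object.
            result += i.ToString();
        }
        return result;
    }

    static void Main()
    {
        Console.WriteLine(Build(10));
    }
}
```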
Here’s an example using StringBuilder:
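The original example is missing here as well. A comparable sketch using StringBuilder, again with hypothetical names, could look like this:

```csharp
using System;
using System.Text;

static class StringBuilderDemo
{
    // Appends the numbers 0..count-1 into a single growable buffer.
    public static string Build(int count)
    {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            // Append reuses the internal buffer and only reallocates when it is full.
            builder.Append(i);
        }
        return builder.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Build(10));
    }
}
```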
Conventional wisdom tells you to use StringBuilder for concatenating strings, but that answer is too simplistic. In fact, the official guidelines give as good guidance as we’re likely to find. The following is from the MSDN documentation for System.StringBuilder (http://msdn.microsoft.com/en-us/library/system.text.stringbuilder.aspx):
The performance of a concatenation operation for a String or StringBuilder object depends on how often a memory allocation occurs. A String concatenation operation always allocates memory, whereas a StringBuilder concatenation operation only allocates memory if the StringBuilder object buffer is too small to accommodate the new data. Consequently, the String class is preferable for a concatenation operation if a fixed number of String objects are concatenated. In that case, the individual concatenation operations might even be combined into a single operation by the compiler. A StringBuilder object is preferable for a concatenation operation if an arbitrary number of strings are concatenated; for example, if a loop concatenates a random number of strings of user input.
Running my own tests basically confirms this strategy. StringBuilder is only faster once we’re dealing with a lot of strings, as Figure 7.1 shows. Despite string append being much slower, we’re still talking about milliseconds either way.
However, a caveat: All of this greatly depends on the size and number of your strings. Whenever performance is involved, there is only one rule: measure it.
You also need to keep in mind the number of objects created—when standard string concatenation is used, a new string is created at each concatenation. This can lead to an explosion of objects for the garbage collector to deal with.
You can perform your own timings with the StringBuilderTime project located in the code samples for this chapter.
In the end, you must measure and profile your own code to determine what’s best in your scenario.
Solution: There are a few options for this, depending on what you need. For simple concatenation, use the new String.Concat() method that takes a collection of any type and converts it to a string.
int[] vals = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 };
Console.WriteLine(String.Concat(vals));
This gives the following output:
1234567890
If you want to separate the values with delimiters, use String.Join:
int[] vals = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 };
Console.WriteLine(string.Join(", ", vals));
This produces the output:
1, 2, 3, 4, 5, 6, 7, 8, 9, 0
However, if you want to do this with objects, it gets trickier. Suppose the existence of a Person class:
You can try to use the same code (with a required type parameter now):
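The Person class and the Join call are not reproduced in this text. A sketch under the assumption that Person has FirstName and LastName properties and does not override ToString() might be the following. The sample names are taken from the output shown later in this section, and the helper name JoinPeople is hypothetical.

```csharp
using System;

// Assumed shape of the Person class; it does not override ToString().
class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

static class JoinPeopleDemo
{
    // String.Join<T> calls ToString() on each element.
    public static string JoinPeople(Person[] people)
    {
        return string.Join<Person>(", ", people);
    }

    static void Main()
    {
        Person[] people =
        {
            new Person { FirstName = "Bill", LastName = "Gates" },
            new Person { FirstName = "Steve", LastName = "Ballmer" },
            new Person { FirstName = "Steve", LastName = "Jobs" }
        };
        // Prints the type name once per element, not the people's names.
        Console.WriteLine(JoinPeople(people));
    }
}
```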
This produces this undesirable output:
Instead, you can use an extension method on IEnumerable<T> that was introduced for LINQ (see Chapter 21, “LINQ”) to accumulate just the properties you need. This example combines anonymous delegates with LINQ to pack a lot of functionality into very little code:
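One way to write this, sketched with the LINQ Aggregate extension method and a lambda, is shown below. The Person class shape and the helper name JoinLastNames are assumptions.

```csharp
using System;
using System.Linq;

// Assumed shape of the Person class.
class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

static class AggregateDemo
{
    // Accumulates the last names into one comma-separated string.
    public static string JoinLastNames(Person[] people)
    {
        return people.Aggregate("",
            (output, next) => output.Length > 0
                ? output + ", " + next.LastName
                : next.LastName);
    }

    static void Main()
    {
        Person[] people =
        {
            new Person { FirstName = "Bill", LastName = "Gates" },
            new Person { FirstName = "Steve", LastName = "Ballmer" },
            new Person { FirstName = "Steve", LastName = "Jobs" }
        };
        Console.WriteLine(JoinLastNames(people));
    }
}
```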
This gives the following output:
Gates, Ballmer, Jobs
Solution: Consider that .NET works on many operating systems, all of which have different newline conventions. For example, Windows uses a carriage-return/line-feed pair, whereas Linux and Mac OS X use just line-feed. Here is the correct way to format a string with a newline:
string a = "My String Here" + Environment.NewLine;
Environment.NewLine
will always return the correct string for the current environment.
Solution: The String class contains a handy Split method. You can tell it what characters (or strings) to consider as delimiters and also give it some options (as demonstrated next).
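The original sample is not included in this text. A small stand-in, using a made-up input string with comma and semicolon delimiters, could look like this:

```csharp
using System;

static class SplitDemo
{
    // Splits on commas and semicolons, keeping empty entries.
    public static string[] SplitKeepingEmpty(string input)
    {
        char[] delims = { ',', ';' };
        return input.Split(delims);
    }

    static void Main()
    {
        // Hypothetical input with delimiters that appear back to back.
        string original = "red,green,,blue;;yellow";
        foreach (string part in SplitKeepingEmpty(original))
        {
            Console.WriteLine("[" + part + "]");
        }
    }
}
```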
The following output is produced:
Notice the gaps? Those are the points where two delimiters appear in a row. In this case, the Split() method emits an empty string. This can be useful if, say, you are parsing a data file consisting of comma-separated values and you need to know when a value was missing. Sometimes, though, you’ll want to remove the empty strings. In that case, pass in the StringSplitOptions.RemoveEmptyEntries flag, as shown here:
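As a sketch with a made-up input string (the original sample is not reproduced here), the flag is passed as the second argument:

```csharp
using System;

static class SplitNoEmptyDemo
{
    // Splits on commas and semicolons, dropping empty entries.
    public static string[] SplitWithoutEmpty(string input)
    {
        char[] delims = { ',', ';' };
        return input.Split(delims, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        // Hypothetical input with delimiters that appear back to back.
        string original = "red,green,,blue;;yellow";
        foreach (string part in SplitWithoutEmpty(original))
        {
            Console.WriteLine("[" + part + "]");
        }
    }
}
```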
Here’s the output:
Solution: Use the Convert.ToBase64String method as Listing 7.1 shows.
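Listing 7.1 is not reproduced in this text. A sketch of a straightforward converter that reads the whole file at once might look like the following. The class and method names are hypothetical.

```csharp
using System;
using System.IO;

static class Base64FileDump
{
    // Reads the whole file into memory and encodes it in one call.
    public static string EncodeFile(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        return Convert.ToBase64String(bytes);
    }

    static void Main(string[] args)
    {
        if (args.Length < 1)
        {
            Console.WriteLine("Usage: supply a filename on the command line");
            return;
        }
        Console.WriteLine(EncodeFile(args[0]));
    }
}
```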
To use this program, run it with a filename on the command line. It will dump the base-64 equivalent to the console (or you can redirect it to a text file, if desired).
Notice I called this a “bad” implementation. That’s because it reads the entire file into memory in one go and then writes it out as a single string. Try this on an enormous file, and you’ll see why it’s not such a good idea for a general file conversion utility. It would be better to perform the conversion in smaller chunks.
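One possible way to do the conversion in chunks is sketched below; it is not from the original listing. The key assumption is that every chunk except the last must be a multiple of 3 bytes, so that padding only appears at the very end of the output. The names Base64Chunked, Encode, and Fill are hypothetical.

```csharp
using System;
using System.IO;

static class Base64Chunked
{
    // Chunk size is a multiple of 3 so no padding appears mid-stream.
    const int ChunkSize = 3 * 1024;

    // Encodes input to base-64 text, one chunk at a time.
    public static void Encode(Stream input, TextWriter output)
    {
        byte[] buffer = new byte[ChunkSize];
        int filled;
        while ((filled = Fill(input, buffer)) > 0)
        {
            output.Write(Convert.ToBase64String(buffer, 0, filled));
        }
    }

    // Stream.Read may return fewer bytes than requested, so keep reading
    // until the buffer is full or the stream ends.
    static int Fill(Stream input, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = input.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }

    static void Main(string[] args)
    {
        if (args.Length < 1)
        {
            Console.WriteLine("Usage: supply a filename on the command line");
            return;
        }
        using (FileStream file = File.OpenRead(args[0]))
        {
            Encode(file, Console.Out);
        }
    }
}
```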
To convert back to binary from base-64, use the following:
byte[] bytes = Convert.FromBase64String(myBase64String);
Solution: Perform a reversal of the entire string, character by character. Then find the parts containing words and reverse each part individually. Listing 7.2 shows an example.
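Listing 7.2 is not reproduced in this text. The two-step approach can be sketched as follows, assuming words are runs of characters separated by whitespace (so punctuation stays attached to its word). The class and method names are hypothetical.

```csharp
using System;

static class WordReverser
{
    // Reverses the order of the words while keeping each word readable.
    public static string ReverseWords(string text)
    {
        char[] chars = text.ToCharArray();

        // Step 1: reverse the entire string, character by character.
        Array.Reverse(chars);

        // Step 2: reverse each whitespace-delimited word back in place.
        int start = 0;
        for (int i = 0; i <= chars.Length; i++)
        {
            if (i == chars.Length || char.IsWhiteSpace(chars[i]))
            {
                Array.Reverse(chars, start, i - start);
                start = i + 1;
            }
        }
        return new string(chars);
    }

    static void Main()
    {
        Console.WriteLine(ReverseWords("The quick brown fox"));
    }
}
```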
Solution: You must write your own string comparer to do the comparison correctly. You can either write a class that implements IComparer<T> or a method that conforms to the delegate Comparison<T> (see Chapter 15, “Delegates, Events, and Anonymous Methods,” which discusses delegates). Listing 7.3 provides an example of the first option.
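Listing 7.3 is not reproduced in this text. Purely to illustrate the shape of an IComparer<T> implementation, the hypothetical comparer below orders strings by length and then ordinally. It does not implement the comparison from the listing.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: shorter strings first, ties broken ordinally.
class LengthThenOrdinalComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        if (ReferenceEquals(x, y)) return 0;
        if (x == null) return -1;   // null sorts before any string
        if (y == null) return 1;

        int byLength = x.Length.CompareTo(y.Length);
        return byLength != 0 ? byLength : string.CompareOrdinal(x, y);
    }
}

static class ComparerDemo
{
    static void Main()
    {
        List<string> items = new List<string> { "pear", "fig", "banana", "kiwi" };
        items.Sort(new LengthThenOrdinalComparer());
        Console.WriteLine(string.Join(", ", items));
    }
}
```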
Here is the output of the program:
Note that although it’s easier to split a string using regular expressions, this method is faster (at least in my case; you should always measure performance yourself when it’s important), which has advantages when this method is called often during a sorting session. However, it is not thread-safe because the buffer is shared between all calls to Compare().
The List<T> sort method also allows you to pass in a delegate to perform the sort. Because the example in this section requires a helper method and a buffer, using a delegate may not be the best solution in this particular case. However, in general, for one-off sorting algorithms, it’s perfectly fine and can be implemented something like this (using a lambda expression for the delegate; see Chapter 15, “Delegates, Events, and Anonymous Methods”):
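The original snippet is missing from this text. As a generic illustration of passing a Comparison<T> lambda to List<T>.Sort, the sketch below sorts strings case-insensitively. The comparison itself is an arbitrary stand-in, not the one from this section, and the names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class LambdaSortDemo
{
    // Sorts in place using a lambda that matches Comparison<string>.
    public static void SortIgnoringCase(List<string> items)
    {
        items.Sort((x, y) => string.Compare(x, y, StringComparison.OrdinalIgnoreCase));
    }

    static void Main()
    {
        List<string> items = new List<string> { "pear", "Apple", "banana" };
        SortIgnoringCase(items);
        Console.WriteLine(string.Join(", ", items));
    }
}
```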