Ruby is a programmer-friendly language. If you are already familiar with object-oriented programming, Ruby should quickly become second nature. If you’ve struggled with learning object-oriented programming or are not familiar with it, Ruby should make more sense to you than other object-oriented languages because Ruby’s methods are consistently named, concise, and generally act the way you expect.
Throughout this book, we demonstrate concepts through interactive Ruby sessions. Strings are a good place to start because not only are they a useful data type, they’re easy to create and use. They provide a simple introduction to Ruby, a point of comparison between Ruby and other languages you might know, and an approachable way to introduce important Ruby concepts like duck typing (see Recipe 1.12), open classes (demonstrated in Recipe 1.10), symbols (Recipe 1.7), and even Ruby gems (Recipe 1.20).
If you use Mac OS X or a Unix environment with Ruby installed, go to your command line right now and type irb. If you’re using Windows, you can download and install the One-Click Installer from http://rubyforge.org/projects/rubyinstaller/, and do the same from a command prompt (you can also run the fxri program, if that’s more comfortable for you). You’ve now entered an interactive Ruby shell, and you can follow along with the code samples in most of this book’s recipes.
Strings in Ruby are much like strings in other dynamic languages like Perl, Python and PHP. They’re not too much different from strings in Java and C. Ruby strings are dynamic, mutable, and flexible. Get started with strings by typing this line into your interactive Ruby session:
string = "My first string"
You should see some output that looks like this:
=> "My first string"
You typed in a Ruby expression that created a string “My first string”, and assigned it to the variable string. The value of that expression is just the new value of string, which is what your interactive Ruby session printed out on the right side of the arrow. Throughout this book, we’ll represent this kind of interaction in the following form:[1]
string = "My first string" # => "My first string"
In Ruby, everything that can be assigned to a variable is an object. Here, the variable string points to an object of class String. That class defines over a hundred built-in methods: named pieces of code that examine and manipulate the string. We’ll explore some of these throughout the chapter, and indeed the entire book. Let’s try out one now: String#length, which returns the number of bytes in a string. Here’s a Ruby method call:
string.length # => 15
Many programming languages make you put parentheses after a method call:
string.length() # => 15
In Ruby, parentheses are almost always optional. They’re especially optional in this case, since we’re not passing any arguments into String#length. If you’re passing arguments into a method, it’s often more readable to enclose the argument list in parentheses:
string.count 'i' # => 2 # "i" occurs twice.
string.count('i') # => 2
The return value of a method call is itself an object. In the case of String#length, the return value is the number 15, an instance of the Fixnum class. We can call a method on this object as well:
string.length.next # => 16
Let’s take a more complicated case: a string that contains non-ASCII characters. This string contains the French phrase “il était une fois,” encoded as UTF-8:[2]
french_string = "il \xc3\xa9tait une fois" # => "il \303\251tait une fois"
Many programming languages (notably Java) treat a string as a series of characters. Ruby treats a string as a series of bytes. The French string contains 14 letters and 3 spaces, so you might think Ruby would say the length of the string is 17. But one of the letters (the e with acute accent) is represented as two bytes, and that’s what Ruby counts:
french_string.length # => 18
For more on handling different encodings, see Recipe 1.14 and Recipe 11.12. For more on this specific problem, see Recipe 1.8.
You can represent special characters in strings (like the binary data in the French string) with string escaping. Ruby does different types of string escaping depending on how you create the string. When you enclose a string in double quotes, you can encode binary data into the string (as in the French example above), and you can encode newlines with the code “\n”, as in other programming languages:
puts "This string\ncontains a newline"
# This string
# contains a newline
When you enclose a string in single quotes, the only special codes you can use are “\'” to get a literal single quote, and “\\” to get a literal backslash:
puts 'it may look like this string contains a newline\nbut it doesn\'t'
# it may look like this string contains a newline\nbut it doesn't
puts 'Here is a backslash: \\'
# Here is a backslash: \
This is covered in more detail in Recipe 1.5. Also see Recipes 1.2 and 1.3 for more examples of the more spectacular substitutions double-quoted strings can do.
Another useful way to initialize strings is with the “here documents” style:
long_string = <<EOF
Here is a long string
With many paragraphs
EOF
# => "Here is a long string\nWith many paragraphs\n"
puts long_string
# Here is a long string
# With many paragraphs
Like most of Ruby’s built-in classes, Ruby’s strings define the same functionality in several different ways, so that you can use the idiom you prefer. Say you want to get a substring of a larger string (as in Recipe 1.13). If you’re an object-oriented programming purist, you can use the String#slice method:
string # => "My first string"
string.slice(3, 5) # => "first"
But if you’re coming from C, and you think of a string as an array of bytes, Ruby can accommodate you. Selecting a single byte from a string returns that byte as a number.
string[3].chr + string[4].chr + string[5].chr + string[6].chr + string[7].chr
# => "first"
And if you come from Python, and you like that language’s slice notation, you can just as easily chop up the string that way:
string[3, 5] # => "first"
Unlike in most programming languages, Ruby strings are mutable: you can change them after they are declared. Below we see the difference between the methods String#upcase and String#upcase!:
string.upcase # => "MY FIRST STRING"
string # => "My first string"
string.upcase! # => "MY FIRST STRING"
string # => "MY FIRST STRING"
This is one of Ruby’s syntactical conventions. “Dangerous” methods (generally those that modify their object in place) usually have an exclamation mark at the end of their name. Another syntactical convention is that predicates, methods that return a true/false value, have a question mark at the end of their name (as in some varieties of Lisp):
string.empty? # => false
string.include? 'MY' # => true
This use of English punctuation to provide the programmer with information is an example of Matz’s design philosophy: that Ruby is a language primarily for humans to read and write, and secondarily for computers to interpret.
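As a small illustration of the bang convention, here is a sketch using made-up variable names (greeting, shout, and so on). One detail worth knowing is that String#upcase! returns nil when there is nothing to change, so its return value is not a safe thing to chain on.

```ruby
greeting = "My first string".dup   # dup gives us a string we are free to modify

shout = greeting.upcase            # non-bang: returns a new string
same_object = greeting.upcase!     # bang: changes greeting in place and returns it
second_call = greeting.upcase!     # already uppercase, so this returns nil

is_empty = greeting.empty?         # predicate: answers with true or false
```

The predicate and bang suffixes are only naming conventions; Ruby does not enforce them, so it is up to each method's author to follow them.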
An interactive Ruby session is an indispensable tool for learning and experimenting with these methods. Again, we encourage you to type the sample code shown in these recipes into an irb or fxri session, and try to build upon the examples as your knowledge of Ruby grows.
Here are some extra resources for using strings in Ruby:
You can get information about any built-in Ruby method with the ri command; for instance, to see more about the String#upcase! method, issue the command ri "String#upcase!" from the command line.
“why the lucky stiff” has written an excellent introduction to installing Ruby, and using irb and ri: http://poignantguide.net/ruby/expansion-pak-1.html
For more information about the design philosophy behind Ruby, read an interview with Yukihiro “Matz” Matsumoto, creator of Ruby: http://www.artima.com/intv/ruby.html
There are two efficient solutions. The simplest solution is to start with an empty string, and repeatedly append substrings onto it with the << operator:
hash = { "key1" => "val1", "key2" => "val2" }
string = ""
hash.each { |k,v| string << "#{k} is #{v}\n" }
puts string
# key1 is val1
# key2 is val2
This variant of the simple solution is slightly more efficient, but harder to read:
string = ""
hash.each { |k,v| string << k << " is " << v << "\n" }
If your data structure is an array, or easily transformed into an array, it’s usually more efficient to use Array#join:
puts hash.keys.join("\n") + "\n"
# key1
# key2
In languages like Python and Java, it’s very inefficient to build a string by starting with an empty string and adding each substring onto the end. In those languages, strings are immutable, so adding one string to another builds an entirely new string. Doing this multiple times creates a huge number of intermediary strings, each of which is only used as a stepping stone to the next string. This wastes time and memory.
In those languages, the most efficient way to build a string is always to put the substrings into an array or another mutable data structure, one that expands dynamically rather than by implicitly creating entirely new objects. Once you’re done processing the substrings, you get a single string with the equivalent of Ruby’s Array#join. In Java, this is the purpose of the StringBuffer class.
In Ruby, though, strings are just as mutable as arrays. Just like arrays, they can expand as needed, without using much time or memory. The fastest solution to this problem in Ruby is usually to forgo a holding array and tack the substrings directly onto a base string. Sometimes using Array#join is faster, but it’s usually pretty close, and the << construction is generally easier to understand.
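The difference between appending and rebuilding can be seen by watching object identity. The following is a minimal sketch with illustrative variable names: << grows the receiver in place, while += creates a new String object on every step.

```ruby
parts = %w[key1 key2 key3]

appended = "".dup                       # a mutable empty string
original_id = appended.object_id
parts.each { |part| appended << part << "\n" }
same_object = (appended.object_id == original_id)       # still the same object

concatenated = "".dup
original_id = concatenated.object_id
parts.each { |part| concatenated += part + "\n" }
new_object = (concatenated.object_id != original_id)    # a different object now

joined = parts.join("\n") + "\n"        # the Array#join route, for comparison
```

All three approaches produce the same text; they differ only in how many intermediate strings get created along the way.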
If efficiency is important to you, don’t build a new string when you can append items onto an existing string. Constructs like str << 'a' + 'b' or str << "#{var1} #{var2}" create new strings that are immediately subsumed into the larger string. This is exactly what you’re trying to avoid. Use str << var1 << ' ' << var2 instead.
On the other hand, you shouldn’t modify strings that aren’t yours. Sometimes safety requires that you create a new string. When you define a method that takes a string as an argument, you shouldn’t modify that string by appending other strings onto it, unless that’s really the point of the method (and unless the method’s name ends in an exclamation point, so that callers know it modifies objects in place).
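A sketch of that rule, using two hypothetical helper methods invented for this example: the first copies its argument before appending, and the second modifies it in place and says so with its name.

```ruby
# Hypothetical helpers, named for illustration only.

# Returns a new string; the caller's string is left untouched.
def with_signature(message)
  message.dup << "\n-- sent from irb"
end

# Modifies the caller's string in place; the bang warns callers about that.
def add_signature!(message)
  message << "\n-- sent from irb"
end

note = "Lunch at noon?".dup
signed = with_signature(note)   # note itself is unchanged after this call
```

Calling add_signature!(note) afterwards would change note itself, which is acceptable only because the method name advertises it.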
Another caveat: Array#join does not work precisely the same way as repeated appends to a string. Array#join accepts a separator string that it inserts between every two elements of the array. Unlike a simple string-building iteration over an array, it will not insert the separator string after the last element in the array. This example illustrates the difference:
data = ['1', '2', '3']
s = ''
data.each { |x| s << x << ' and a '}
s # => "1 and a 2 and a 3 and a "
data.join(' and a ') # => "1 and a 2 and a 3"
To simulate the behavior of Array#join across an iteration, you can use Enumerable#each_with_index and omit the separator on the last index. This only works if you know how long the Enumerable is going to be:
s = ""
data.each_with_index { |x, i| s << x; s << "|" if i < data.length-1 }
s # => "1|2|3"
You want to create a string that contains a representation of a Ruby variable or expression.
Within the string, enclose the variable or expression in curly brackets and prefix it with a hash character.
number = 5
"The number is #{number}." # => "The number is 5."
"The number is #{5}." # => "The number is 5."
"The number after #{number} is #{number.next}." # => "The number after 5 is 6."
"The number prior to #{number} is #{number-1}." # => "The number prior to 5 is 4."
"We're ##{number}!" # => "We're #5!"
When you define a string by putting it in double quotes, Ruby scans it for special substitution codes. The most common case, so common that you might not even think about it, is that Ruby substitutes a single newline character every time a string contains a backslash followed by the letter n (“\n”).
Ruby supports more complex string substitutions as well. Any text kept within the brackets of the special marker #{} (that is, #{text in here}) is interpreted as a Ruby expression. The result of that expression is substituted into the string that gets created. If the result of the expression is not a string, Ruby calls its to_s method and uses that instead.
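To see the to_s step in action, here is a minimal sketch. Temperature is a made-up class for this example; the only thing it needs in order to be interpolated is a to_s method.

```ruby
# Temperature is a hypothetical class used only for illustration.
class Temperature
  def initialize(degrees)
    @degrees = degrees
  end

  # Interpolation calls this method to get the text to substitute.
  def to_s
    "#{@degrees} degrees"
  end
end

reading = Temperature.new(21)
report = "It is #{reading} outside."        # uses Temperature#to_s
count_report = "Readings taken: #{3}"       # uses the integer's to_s
```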
Once such a string is created, it is indistinguishable from a string created without using the string interpolation feature:
"#{number}" == '5' # => true
You can use string interpolation to run even large chunks of Ruby code inside a string. This extreme example defines a class within a string; its result is the return value of a method defined in the class. You should never have any reason to do this, but it shows the power of this feature.
%{Here is #{class InstantClass
  def bar
    "some text"
  end
end
InstantClass.new.bar
}.}
# => "Here is some text."
The code run in string interpolations runs in the same context as any other Ruby code in the same location. To take the example above, the InstantClass class has now been defined like any other class, and can be used outside the string that defines it.
If a string interpolation calls a method that has side effects, the side effects are triggered. If a string definition sets a variable, that variable is accessible afterwards. It’s bad form to rely on this behavior, but you should be aware of it:
"I've set x to #{x = 5; x += 1}." # => "I've set x to 6."
x # => 6
To avoid triggering string interpolation, escape the hash characters or put the string in single quotes.
"\#{foo}" # => "\#{foo}"
'#{foo}' # => "\#{foo}"
The “here document” construct is an alternative to the %{} construct, which is sometimes more readable. It lets you define a multiline string that only ends when the Ruby parser encounters a certain string on a line by itself:
name = "Mr. Lorum"
email = <<END
Dear #{name},
Unfortunately we cannot process your insurance claim at this
time. This is because we are a bakery, not an insurance company.
Signed,
Nil, Null, and None
Bakers to Her Majesty the Singleton
END
Ruby is pretty flexible about the string you can use to end the “here document”:
<<end_of_poem
There once was a man from Peru
Whose limericks stopped on line two
end_of_poem
# => "There once was a man from Peru\nWhose limericks stopped on line two\n"
You can use the technique described in Recipe 1.3, “Substituting Variables into an Existing String,” to define a template string or object, and substitute in variables later.
You want to create a string that contains Ruby expressions or variable substitutions, without actually performing the substitutions. You plan to substitute values into the string later, possibly multiple times with different values each time.
There are two good solutions: printf-style strings, and ERB templates.
Ruby supports a printf-style string format like C’s and Python’s. Put printf directives into a string and it becomes a template. You can interpolate values into it later using the modulus operator:
template = 'Oceania has always been at war with %s.'
template % 'Eurasia' # => "Oceania has always been at war with Eurasia."
template % 'Eastasia' # => "Oceania has always been at war with Eastasia."
'To 2 decimal places: %.2f' % Math::PI # => "To 2 decimal places: 3.14"
'Zero-padded: %.5d' % Math::PI # => "Zero-padded: 00003"
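When a template contains more than one directive, the values can be supplied as an array on the right-hand side of the % operator. The sketch below uses illustrative names (row, line, pair) and a few common width and padding flags.

```ruby
# One value per directive, supplied in order as an array.
pair = '%s has always been at war with %s.' % ['Oceania', 'Eurasia']

# %-8s  : left-justify in 8 columns
# %5.1f : 5 columns wide, 1 digit after the decimal point
# %03d  : zero-pad to 3 digits
row = '%-8s|%5.1f|%03d'
line = row % ['apples', 2.456, 7]
```

Because the template is an ordinary string, the same row format can be reused for every line of a report.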
An ERB template looks something like JSP or PHP code. Most of it is treated as a normal string, but certain control sequences are executed as Ruby code. The control sequence is replaced with either the output of the Ruby code, or the value of its last expression:
require 'erb'
template = ERB.new %q{Chunky <%= food %>!}
food = "bacon"
template.result(binding) # => "Chunky bacon!"
food = "peanut butter"
template.result(binding) # => "Chunky peanut butter!"
You can omit the call to Kernel#binding if you’re not in an irb session:
puts template.result
# Chunky peanut butter!
You may recognize this format from the .rhtml files used by Rails views: they use ERB behind the scenes.
An ERB template can reference variables like food before they’re defined. When you call ERB#result, or ERB#run, the template is executed according to the current values of those variables.
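Here is one way this late lookup might be put to use. In this sketch, render is a hypothetical helper: the template is created before any food variable exists, and each call supplies its own value through the binding captured inside the method.

```ruby
require 'erb'

# The template mentions `food` before any such variable exists.
template = ERB.new('Chunky <%= food %>!')

# Hypothetical helper: `food` is a method argument, so it is visible
# through the binding captured inside the method body.
def render(template, food)
  template.result(binding)
end

first = render(template, 'bacon')
second = render(template, 'peanut butter')
```

Wrapping the call in a method keeps the template's variables separate from whatever else happens to be defined at the call site.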
Like JSP and PHP code, ERB templates can contain loops and conditionals. Here’s a more sophisticated template:
template = %q{
<% if problems.empty? %>
  Looks like your code is clean!
<% else %>
  I found the following possible problems with your code:
  <% problems.each do |problem, line| %>
    * <%= problem %> on line <%= line %>
  <% end %>
<% end %>}.gsub(/^\s+/, '')
template = ERB.new(template, nil, '<>')
problems = [["Use of is_a? instead of duck typing", 23],
            ["eval() is usually dangerous", 44]]
template.run(binding)
# I found the following possible problems with your code:
# * Use of is_a? instead of duck typing on line 23
# * eval() is usually dangerous on line 44
problems = []
template.run(binding)
# Looks like your code is clean!
ERB is sophisticated, but neither it nor the printf-style strings look like the simple Ruby string substitutions described in Recipe 1.2.
There’s an alternative. If you use single quotes instead of double quotes to define a string with substitutions, the substitutions won’t be activated. You can then use this string as a template with eval:
class String
  def substitute(binding=TOPLEVEL_BINDING)
    eval(%{"#{self}"}, binding)
  end
end
template = %q{Chunky #{food}!} # => "Chunky \#{food}!"
food = 'bacon'
template.substitute(binding) # => "Chunky bacon!"
food = 'peanut butter'
template.substitute(binding) # => "Chunky peanut butter!"
You must be very careful when using eval: if you use a variable in the wrong way, you could give an attacker the ability to run arbitrary Ruby code in your eval statement. That won’t happen in this example since any possible value of food gets stuck into a string definition before it’s interpolated:
food = '#{system("dir")}'
puts template.substitute(binding)
# Chunky #{system("dir")}!
This recipe gives basic examples of ERB templates; for more complex examples, see the documentation of the ERB class (http://www.ruby-doc.org/stdlib/libdoc/erb/rdoc/classes/ERB.html).
Recipe 1.2, “Substituting Variables into Strings”
Recipe 10.12, “Evaluating Code in an Earlier Context,” has more about Binding objects
The letters (or words) of your string are in the wrong order.
To create a new string that contains a reversed version of your original string, use the reverse method. To reverse a string in place, use the reverse! method.
s = ".sdrawkcab si gnirts sihT"
s.reverse # => "This string is backwards."
s # => ".sdrawkcab si gnirts sihT"
s.reverse! # => "This string is backwards."
s # => "This string is backwards."
To reverse the order of the words in a string, split the string into a list of whitespace-separated words, then join the list back into a string.
s = "order. wrong the in are words These"
s.split(/(\s+)/).reverse!.join('') # => "These words are in the wrong order."
s.split(/\b/).reverse!.join('') # => "These words are in the wrong. order"
The String#split method takes a regular expression to use as a separator. Each time the separator matches part of the string, the portion of the string before the separator goes into a list. split then resumes scanning the rest of the string. The result is a list of strings found between instances of the separator. The regular expression /(\s+)/ matches one or more whitespace characters; this splits the string on word boundaries, which works for us because we want to reverse the order of the words.
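The whitespace-preserving approach can be wrapped up as a small method. This is only a sketch, and reverse_words is a hypothetical name chosen for the example.

```ruby
# Hypothetical helper: reverses word order while keeping each run of
# whitespace intact, because the capturing group makes split return
# the separators as well as the words.
def reverse_words(text)
  text.split(/(\s+)/).reverse.join
end

pieces = "Three little  words".split(/(\s+)/)
reversed = reverse_words("order. wrong the in are words These")
```

Using the non-bang reverse here avoids modifying the intermediate array in place, which makes no practical difference in this case but keeps the method free of surprises.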
The regular expression /\b/ matches a word boundary. This is not the same as matching whitespace, because it also matches punctuation. Note the difference in punctuation between the two final examples in the Solution.
Because the regular expression /(\s+)/ includes a set of parentheses, the separator strings themselves are included in the returned list. Therefore, when we join the strings back together, we’ve preserved whitespace. This example shows the difference between including the parentheses and omitting them:
"Three little words".split(/\s+/) # => ["Three", "little", "words"]
"Three little words".split(/(\s+)/)
# => ["Three", " ", "little", " ", "words"]
Recipe 1.9, “Processing a String One Word at a Time,” has some regular expressions for alternative definitions of “word”
Recipe 1.11, “Managing Whitespace”
Recipe 1.17, “Matching Strings with Regular Expressions”
You need to make reference to a control character, a strange UTF-8 character, or some other character that’s not on your keyboard.
Ruby gives you a number of escaping mechanisms to refer to unprintable characters. By using one of these mechanisms within a double-quoted string, you can put any binary character into the string.
You can reference any binary character by encoding its octal representation into the format “\000”, or its hexadecimal representation into the format “\x00”.
octal = "\000\001\010\020"
octal.each_byte { |x| puts x }
# 0
# 1
# 8
# 16
hexadecimal = "\x00\x01\x10\x20"
hexadecimal.each_byte { |x| puts x }
# 0
# 1
# 16
# 32
This makes it possible to represent UTF-8 characters even when you can’t type them or display them in your terminal. Try running this program, and then opening the generated file smiley.html in your web browser:
open('smiley.html', 'wb') do |f|
  f << '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'
  f << "\xe2\x98\xBA"
end
The most common unprintable characters (such as newline) have special mnemonic aliases consisting of a backslash and a letter.
"\a" == "\x07" # => true # ASCII 0x07 = BEL (Sound system bell)
"\b" == "\x08" # => true # ASCII 0x08 = BS (Backspace)
"\e" == "\x1b" # => true # ASCII 0x1B = ESC (Escape)
"\f" == "\x0c" # => true # ASCII 0x0C = FF (Form feed)
"\n" == "\x0a" # => true # ASCII 0x0A = LF (Newline/line feed)
"\r" == "\x0d" # => true # ASCII 0x0D = CR (Carriage return)
"\t" == "\x09" # => true # ASCII 0x09 = HT (Tab/horizontal tab)
"\v" == "\x0b" # => true # ASCII 0x0B = VT (Vertical tab)
Ruby stores a string as a sequence of bytes. It makes no difference whether those bytes are printable ASCII characters, binary characters, or a mix of the two.
When Ruby prints out a human-readable string representation of a binary character, it uses the character’s \xxx octal representation. Characters with special \x mnemonics are printed as the mnemonic. Printable characters are output as their printable representation, even if another representation was used to create the string.
"\x10\x11\xfe\xff" # => "\020\021\376\377"
"\x48\145\x6c\x6c\157\x0a" # => "Hello\n"
To avoid confusion with the mnemonic characters, a literal backslash in a string is represented by two backslashes. For instance, the two-character string consisting of a backslash and the 14th letter of the alphabet is represented as “\\n”.
"\\".size # => 1
"\\" == "\x5c" # => true
"\\n"[0] == ?\\ # => true
"\\n"[1] == ?n # => true
"\\n" =~ /\n/ # => nil
Ruby also provides special shortcuts for representing keyboard sequences like Control-C. "\C-_x_" represents the sequence you get by holding down the control key and hitting the x key, and "\M-_x_" represents the sequence you get by holding down the Alt (or Meta) key and hitting the x key:
"\C-a\C-b\C-c" # => "\001\002\003"
"\M-a\M-b\M-c" # => "\341\342\343"
Shorthand representations of binary characters can be used whenever Ruby expects a character. For instance, you can get the decimal byte number of a special character by prefixing it with ?, and you can use shorthand representations in regular expression character ranges.
?\C-a # => 1
?\M-z # => 250
contains_control_chars = /[\C-a-\C-^]/
'Foobar' =~ contains_control_chars # => nil
"Foo\C-zbar" =~ contains_control_chars # => 3
contains_upper_chars = /[\x80-\xff]/
'Foobar' =~ contains_upper_chars # => nil
"Foo\212bar" =~ contains_upper_chars # => 3
Here’s a sinister application that scans logged keystrokes for special characters:
def snoop_on_keylog(input)
  input.each_byte do |b|
    case b
    when ?\C-c; puts 'Control-C: stopped a process?'
    when ?\C-z; puts 'Control-Z: suspended a process?'
    when ?\n; puts 'Newline.'
    when ?\M-x; puts 'Meta-x: using Emacs?'
    end
  end
end
snoop_on_keylog("ls -ltR\003emacsHello\012\370rot13-other-window\012\032")
# Control-C: stopped a process?
# Newline.
# Meta-x: using Emacs?
# Newline.
# Control-Z: suspended a process?
Special characters are only interpreted in strings delimited by double quotes, or strings created with %{} or %Q{}. They are not interpreted in strings delimited by single quotes, or strings created with %q{}. You can take advantage of this feature when you need to display special characters to the end-user, or create a string containing a lot of backslashes.
puts "foo\tbar"
# foo     bar
puts %{foo\tbar}
# foo     bar
puts %Q{foo\tbar}
# foo     bar
puts 'foo\tbar'
# foo\tbar
puts %q{foo\tbar}
# foo\tbar
If you come to Ruby from Python, this feature can take advantage of you, making you wonder why the special characters in your single-quoted strings aren’t treated as special. If you need to create a string with special characters and a lot of embedded double quotes, use the %{} construct.
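A quick way to confirm which quoting style interprets an escape is to compare string sizes. In this sketch (variable names are illustrative), the double-quoted version holds one tab character where the single-quoted versions hold a backslash and a letter.

```ruby
double = "foo\tbar"        # \t becomes a single tab character
single = 'foo\tbar'        # backslash and t are kept as two characters
percent_q = %q{foo\tbar}   # %q{} behaves like single quotes
percent_Q = %Q{foo\tbar}   # %Q{} behaves like double quotes

double_size = double.size  # 7
single_size = single.size  # 8
```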
You want to see the ASCII code for a character, or transform an ASCII code into a string.
To see the ASCII code for a specific character as an integer, use the ? operator:
?a # => 97
?! # => 33
?\n # => 10
To see the integer value of a particular character in a string, access it as though it were an element of an array:
'a'[0] # => 97
'bad sound'[1] # => 97
To see the ASCII character corresponding to a given number, call its #chr method. This returns a string containing only one character:
97.chr # => "a"
33.chr # => "!"
10.chr # => "\n"
0.chr # => "\000"
256.chr # RangeError: 256 out of char range
Though not technically an array, a string acts a lot like an array of Fixnum objects: one Fixnum for each byte in the string. Accessing a single element of the “array” yields a Fixnum for the corresponding byte: for textual strings, this is an ASCII code. Calling String#each_byte lets you iterate over the Fixnum objects that make up a string.
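The round trip between characters and codes can be sketched with String#each_byte and the #chr method on integers. The variable names here are illustrative.

```ruby
codes = []
'bad'.each_byte { |byte| codes << byte }   # collect each byte as an integer

# Turn every code back into a one-character string and glue them together.
rebuilt = codes.map { |byte| byte.chr }.join
```

For an ASCII string like this one, codes holds one number per character, and rebuilt is equal to the original string.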
Recipe 1.8, “Processing a String One Character at a Time”
You want to get a string containing the label of a Ruby symbol, or get the Ruby symbol that corresponds to a given string.
To turn a symbol into a string, use Symbol#to_s, or Symbol#id2name, for which to_s is an alias.
:a_symbol.to_s # => "a_symbol"
:AnotherSymbol.id2name # => "AnotherSymbol"
:"Yet another symbol!".to_s # => "Yet another symbol!"
You usually reference a symbol by just typing its name. If you’re given a string in code and need to get the corresponding symbol, you can use String#intern:
:dodecahedron.object_id # => 4565262
symbol_name = "dodecahedron"
symbol_name.intern # => :dodecahedron
symbol_name.intern.object_id # => 4565262
A Symbol is about the most basic Ruby object you can create. It’s just a name and an internal ID. Symbols are useful because a given symbol name refers to the same object throughout a Ruby program.
Symbols are often more efficient than strings. Two strings with the same contents are two different objects (one of the strings might be modified later on, and become different), but for any given name there is only one Symbol object. This can save both time and memory.
"string".object_id # => 1503030
"string".object_id # => 1500330
:symbol.object_id # => 4569358
:symbol.object_id # => 4569358
If you have n references to a name, you can keep all those references with only one symbol, using only one object’s worth of memory. With strings, the same code would use n different objects, all containing the same data. It’s also faster to compare two symbols than to compare two strings, because Ruby only has to check the object IDs.
"string1" == "string2" # => false
:symbol1 == :symbol2 # => false
Finally, to quote Ruby hacker Jim Weirich on when to use a string versus a symbol:
If the contents (the sequence of characters) of the object are important, use a string.
If the identity of the object is important, use a symbol.
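The contents-versus-identity distinction can be checked directly with == (same contents) and equal? (same object). This is a minimal sketch; the variable names are made up for the example.

```ruby
first_string = String.new("dodecahedron")
second_string = String.new("dodecahedron")

equal_contents = (first_string == second_string)          # same characters
same_string_object = first_string.equal?(second_string)   # two separate objects

first_symbol = first_string.intern
second_symbol = second_string.intern
same_symbol_object = first_symbol.equal?(second_symbol)   # one shared Symbol
```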
See Recipe 5.1, “Using Symbols as Hash Keys” for one use of symbols
Recipe 8.12, “Simulating Keyword Arguments,” has another
Chapter 10, especially Recipe 10.4, “Getting a Reference to a Method” and Recipe 10.10, “Avoiding Boilerplate Code with Metaprogramming”
See http://glu.ttono.us/articles/2005/08/19/understanding-ruby-symbols for a symbol primer
You want to process each character of a string individually.
If you’re processing an ASCII document, then each byte corresponds to one character. Use String#each_byte to yield each byte of a string as a number, which you can turn into a one-character string:
'foobar'.each_byte { |x| puts "#{x} = #{x.chr}" }
# 102 = f
# 111 = o
# 111 = o
# 98 = b
# 97 = a
# 114 = r
Use String#scan to yield each character of a string as a new one-character string:
'foobar'.scan( /./ ) { |c| puts c }
# f
# o
# o
# b
# a
# r
Since a string is a sequence of bytes, you might think that the String#each method would iterate over the sequence, the way Array#each does. But String#each is actually used to split a string on a given record separator (by default, the newline):
"foo\nbar".each { |x| puts x }
# foo
# bar
The string equivalent of the Array#each method is actually each_byte. A string stores its characters as a sequence of Fixnum objects, and each_byte yields that sequence.
String#each_byte is faster than String#scan, so if you’re processing an ASCII file, you might want to use String#each_byte and convert to a string every number passed into the code block (as seen in the Solution).
String#scan works by applying a given regular expression to a string, and yielding each match to the code block you provide. The regular expression /./ matches every character in the string, in turn.
If you have the $KCODE variable set correctly, then the scan technique will work on UTF-8 strings as well. This is the simplest way to sneak a notion of “character” into Ruby’s byte-based strings.
Here’s a Ruby string containing the UTF-8 encoding of the French phrase “ça va”:
french = "\xc3\xa7a va"
Even if your terminal can’t properly display the character “ç”, you can see how the behavior of String#scan changes when you make the regular expression Unicode-aware, or set $KCODE so that Ruby handles all strings as UTF-8:
french.scan(/./) { |c| puts c }
#
#
# a
#
# v
# a
french.scan(/./u) { |c| puts c }
# ç
# a
#
# v
# a
$KCODE = 'u'
french.scan(/./) { |c| puts c }
# ç
# a
#
# v
# a
Once Ruby knows to treat strings as UTF-8 instead of ASCII, it starts treating the two bytes representing the “ç” as a single character. Even if you can’t see UTF-8, you can write programs that handle it correctly.
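The gap between bytes and characters can also be measured by counting. The sketch below relies on the u flag on the regular expression and assumes the string literal is treated as valid UTF-8, which depends on your Ruby version and source encoding.

```ruby
french = "\xc3\xa7a va"   # UTF-8 bytes for "ça va"

byte_count = 0
french.each_byte { |byte| byte_count += 1 }   # counts raw bytes

# The u flag asks the regular expression engine to read the string as UTF-8,
# so the two bytes of the first character come back as a single match.
characters = french.scan(/./u)
character_count = characters.length
```

The phrase has five characters but six bytes, because “ç” occupies two bytes in UTF-8.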
Recipe 11.12, “Converting from One Encoding to Another”
First decide what you mean by “word.” What separates one word from another? Only whitespace? Whitespace or punctuation? Is “johnny-come-lately” one word or three? Build a regular expression that matches a single word according to whatever definition you need (there are some samples in the Discussion).
Then pass that regular expression into String#scan. Every word it finds, it will yield to a code block. The word_count method defined below takes a piece of text and creates a histogram of word frequencies. Its regular expression considers a “word” to be a string of Ruby identifier characters: letters, numbers, and underscores.
class String
  def word_count
    frequencies = Hash.new(0)
    downcase.scan(/\w+/) { |word| frequencies[word] += 1 }
    return frequencies
  end
end
%{Dogs dogs dog dog dogs.}.word_count
# => {"dogs"=>3, "dog"=>2}
%{"I have no shame," I said.}.word_count
# => {"no"=>1, "shame"=>1, "have"=>1, "said"=>1, "i"=>2}
The regular expression /w+/
is nice and simple, but you can probably do better for your
application’s definition of “word.” You probably don’t consider two
words separated by an underscore to be a single word.
Some English words, like “pan-fried” and “fo’c’sle”, contain
embedded punctuation. Here are a few more definitions of “word” in
regular expression form:
# Just like /\w+/, but doesn't consider underscore part of a word.
/[0-9A-Za-z]+/

# Anything that's not whitespace is a word.
/\S+/

# Accept dashes and apostrophes as parts of words.
/[-'\w]+/

# A pretty good heuristic for matching English words.
/(\w+([-'.]\w+)*)/
The last one deserves some explanation. It matches embedded punctuation within a word, but not at the edges. “Work-in-progress” is recognized as a single word, and “--never--” is recognized as the word “never” surrounded by punctuation. This regular expression can even pick out abbreviations and acronyms such as “Ph.D” and “U.N.C.L.E.”, though it can’t distinguish between the final period of an acronym and the period that ends a sentence. This means that “E.F.F.” will be recognized as the word “E.F.F” and then a nonword period.
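To check these claims for yourself, here is a small sketch. It uses String#[] with a regular expression, which returns the first matching substring. The variable names are ours, and the expression is the heuristic just described, written with balanced parentheses:

```ruby
# The English-word heuristic: word characters, optionally joined by a
# single dash, apostrophe, or period that is followed by more word characters.
english_word = /(\w+([-'.]\w+)*)/

samples = ["Work-in-progress", "--never--", "Ph.D", "E.F.F."]

# For each sample, pull out the first thing the heuristic calls a word.
first_words = samples.map { |text| text[english_word] }

first_words.each { |word| puts word }
# Work-in-progress
# never
# Ph.D
# E.F.F
```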
Let’s rewrite our word_count method to use that regular expression. We can’t use the original implementation, because its code block takes only one argument. String#scan passes its code block one argument for each match group in the regular expression, and our improved regular expression has two match groups. The first match group is the one that actually contains the word. So we must rewrite word_count so that its code block takes two arguments, and ignores the second one:
class String
  def word_count
    frequencies = Hash.new(0)
    downcase.scan(/(\w+([-'.]\w+)*)/) { |word, ignore| frequencies[word] += 1 }
    return frequencies
  end
end

%{"That F.B.I. fella--he's quite the man-about-town."}.word_count
# => {"quite"=>1, "f.b.i"=>1, "the"=>1, "fella"=>1, "that"=>1,
#     "man-about-town"=>1, "he's"=>1}
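If the ignored block argument bothers you, one possible alternative is to drop the match groups altogether. The sketch below is our own variation, not the recipe’s implementation: word_frequencies is a hypothetical standalone method, and it uses a non-capturing group, (?:...), so that String#scan yields each whole match as a single string.

```ruby
# A standalone variant of word_count (the method name is ours).
# With no capturing groups in the pattern, scan yields one string per match.
def word_frequencies(text)
  frequencies = Hash.new(0)
  text.downcase.scan(/\w+(?:[-'.]\w+)*/) { |word| frequencies[word] += 1 }
  frequencies
end

counts = word_frequencies(%{"That F.B.I. fella--he's quite the man-about-town."})
puts counts["f.b.i"]          # 1
puts counts["man-about-town"] # 1
```

Defining a plain method instead of reopening String is purely a design choice here; the counting logic is the same either way.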
Note that the “\w” character set matches different things depending on the value of $KCODE. By default, “\w” matches only characters that are part of ASCII words:
french = "il \xc3\xa9tait une fois"
french.word_count
# => {"fois"=>1, "une"=>1, "tait"=>1, "il"=>1}
If you turn on Ruby’s UTF-8 support, the “\w” character set matches more characters:
$KCODE='u'
french.word_count
# => {"fois"=>1, "une"=>1, "était"=>1, "il"=>1}
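On Ruby 1.9 and later, where $KCODE has no effect, the “\w” character set matches only ASCII word characters, even when the string itself is UTF-8. One way to match accented letters there is the POSIX bracket expression [[:word:]]. A brief sketch, assuming such a Ruby version (the variable names are ours):

```ruby
# "\u00e9" is the character "é"; the \u escape makes this a UTF-8 string.
french = "il \u00e9tait une fois"

# \w stops at the accented letter, so the word loses its first character.
ascii_words = french.scan(/\w+/)

# [[:word:]] also matches non-ASCII letters, so the word stays whole.
unicode_words = french.scan(/[[:word:]]+/)

p ascii_words
p unicode_words
```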
The regular expression group \b matches a word boundary: that is, the last part of a word before a piece of whitespace or punctuation. This is useful for String#split (see Recipe 1.4), but not so useful for String#scan.
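A short sketch shows why a word boundary suits split better than scan. A boundary is a position rather than a character, so split can cut the string there, while scan can only report a series of empty matches:

```ruby
text = "Three little words"

# split cuts the string at every word boundary, keeping the spaces as pieces.
pieces = text.split(/\b/)
p pieces # => ["Three", " ", "little", " ", "words"]

# scan finds the same boundaries, but every match is an empty string.
boundaries = text.scan(/\b/)
p boundaries.all? { |match| match.empty? } # => true
```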
Recipe 1.4, “Reversing a String by Words or Characters”
The Facets core library defines a String#each_word method, using the regular expression /([-'\w]+)/
Your string is in the wrong case, or no particular case at all.
The String class provides a variety of case-shifting methods:
s = 'HELLO, I am not here. I WENT to tHe MaRKEt.'
s.upcase     # => "HELLO, I AM NOT HERE. I WENT TO THE MARKET."
s.downcase   # => "hello, i am not here. i went to the market."
s.swapcase   # => "hello, i AM NOT HERE. i went TO ThE mArkeT."
s.capitalize # => "Hello, i am not here. i went to the market."
The upcase and downcase methods force all letters in the string to upper- or lowercase, respectively. The swapcase method transforms uppercase letters into lowercase letters and vice versa. The capitalize method makes the first character of the string uppercase, if it’s a letter, and makes all other letters in the string lowercase.
All four methods have corresponding methods that modify a string in place rather than creating a new one: upcase!, downcase!, swapcase!, and capitalize!. Assuming you don’t need the original string, these methods will save memory, especially if the string is large.
un_banged = 'Hello world.'
un_banged.upcase # => "HELLO WORLD."
un_banged        # => "Hello world."

banged = 'Hello world.'
banged.upcase!   # => "HELLO WORLD."
banged           # => "HELLO WORLD."
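One caveat about the in-place methods: they return nil when the string needed no change, so it’s risky to chain other calls onto their return value. A minimal sketch (the variable names are ours; dup simply guarantees a string we are allowed to modify):

```ruby
mixed = 'Hello world.'.dup
changed = mixed.upcase!      # the string was modified, so it is returned

shouting = 'HELLO WORLD.'.dup
unchanged = shouting.upcase! # nothing to change, so the result is nil

p changed   # => "HELLO WORLD."
p unchanged # => nil
p shouting  # => "HELLO WORLD."
```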
To capitalize a string without lowercasing the rest of the string (for instance, because the string contains proper nouns), you can modify the first character of the string in place. This corresponds to the capitalize! method. If you want something more like capitalize, you can create a new string out of the old one.
class String
  def capitalize_first_letter
    self[0].chr.capitalize + self[1, size]
  end

  def capitalize_first_letter!
    unless self[0] == (c = self[0,1].upcase[0])
      self[0] = c
      self
    end
    # Return nil if no change was made, like upcase! et al.
  end
end

s = 'i told Alice. She remembers now.'
s.capitalize_first_letter  # => "I told Alice. She remembers now."
s                          # => "i told Alice. She remembers now."
s.capitalize_first_letter!
s                          # => "I told Alice. She remembers now."
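Note that capitalize_first_letter raises an error on an empty string, because self[0] is nil there. If that matters to you, one possible alternative is sketched below. It is our own variation rather than part of the recipe: capitalize_first is a hypothetical helper that leaves String unopened and relies on String#sub, which returns an empty string unchanged.

```ruby
# Hypothetical helper: returns a copy of text with its first character
# uppercased and every other character left exactly as it was.
def capitalize_first(text)
  # \A anchors the match to the very start of the string.
  text.sub(/\A./) { |first| first.upcase }
end

original = 'i told Alice. She remembers now.'
p capitalize_first(original) # => "I told Alice. She remembers now."
p original                   # => "i told Alice. She remembers now."
p capitalize_first('')       # => ""
```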
To change the case of specific letters while leaving the rest alone, you can use the tr or tr! methods, which translate one character into another:
'LOWERCASE ALL VOWELS'.tr('AEIOU', 'aeiou')
# => "LoWeRCaSe aLL VoWeLS"

'Swap case of ALL VOWELS'.tr('AEIOUaeiou', 'aeiouAEIOU')
# => "SwAp cAsE Of aLL VoWeLS"
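When the sets of characters get long, tr also accepts ranges such as a-z. The sketch below, with variable names of our own choosing, swaps the case of every ASCII letter and also shows that tr!, like the other in-place methods, returns nil when it changes nothing:

```ruby
# Character ranges save typing when the sets are long.
swapped = 'Hello World'.tr('a-zA-Z', 'A-Za-z')
p swapped # => "hELLO wORLD"

# tr! modifies the receiver, and returns nil if no character was replaced.
title = 'lowercase all vowels'.dup
result = title.tr!('AEIOU', 'aeiou')
p result # => nil
p title  # => "lowercase all vowels"
```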
Recipe 1.18, “Replacing Multiple Patterns in a Single Pass”
The Facets Core library adds a String#camelcase method; it also defines the case predicates String#lowercase? and String#uppercase?
Your string contains too much whitespace, not enough whitespace, or the wrong kind of whitespace.
Use strip to remove whitespace from the beginning and end of a string:
" Whitespace at beginning and end. ".strip
Add whitespace to one or both ends of a string with ljust, rjust, and center:
s = "Some text."
s.center(15)
s.ljust(15)
s.rjust(15)
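To make the padding visible, here is a sketch that captures each result. Each method pads the string out to the requested total width, and each accepts an optional second argument naming the padding string to use instead of a space (the variable names are ours):

```ruby
s = "Some text." # 10 characters long

centered = s.center(15)
left     = s.ljust(15)
right    = s.rjust(15)
dotted   = s.rjust(15, ".") # pad with periods instead of spaces

p centered # => "  Some text.   "
p left     # => "Some text.     "
p right    # => "     Some text."
p dotted   # => ".....Some text."
```

When the padding can’t be split evenly, center puts the extra character on the right.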
Use the gsub method with a string or regular expression to make more complex changes, such as to replace one type of whitespace with another.
#Normalize Ruby source code by replacing tabs with spaces
rubyCode.gsub("\t", " ")

#Transform Windows-style newlines to Unix-style newlines
"Line one\r\nLine two\r\n".gsub("\r\n", "\n")
# => "Line one\nLine two\n"

#Transform all runs of whitespace into a single space character
"\n\rThis string\t\t\tuses\n all\tsorts\nof whitespace.".gsub(/\s+/, " ")
# => " This string uses all sorts of whitespace."
What counts as whitespace? Any of these five characters: space, tab (\t), newline (\n), carriage return (\r), and form feed (\f). The regular expression /\s/ matches any one character from that set. The strip method strips any combination of those characters from the beginning or end of a string.
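A quick sketch confirms both claims for the five characters just listed. The variable names are ours; lstrip and rstrip are the one-sided relatives of strip:

```ruby
# space, tab, newline, carriage return, form feed
whitespace = [" ", "\t", "\n", "\r", "\f"]

# Each of the five characters matches /\s/ on its own.
all_match = whitespace.all? { |char| char =~ /\s/ }
p all_match # => true

# strip removes any mixture of them from both ends of the string.
padded = " \t\n\r\fsome text\f\r\n\t "
p padded.strip  # => "some text"
p padded.lstrip # => "some text\f\r\n\t "
p padded.rstrip # => " \t\n\r\fsome text"
```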
In rare cases you may need to handle oddball “space” characters
like backspace ( or