CHAPTER 3

Scalars: Integers, Floating-Point Numbers, and Strings

A scalar is a single atomic value, and the most basic of Perl's data types. Scalars in turn are divided into four distinct kinds: integers, floating-point numbers, strings, and references, which are pointers to values stored elsewhere. Of these, the first three are what we might call "normal" scalar values.

In Chapter 2, we introduced integers, floating-point numbers, and strings. In this chapter, we will focus on a detailed look at how Perl handles scalar numbers and strings, and the most important of the built-in operators and functions that Perl provides to manipulate them. If Perl has a specialty, it is probably string manipulation, and accordingly the end of the chapter is dedicated to a roundup of string functions, including split, substr, printf, sprintf, pack, and unpack.

There are many different ways to convert numbers to strings, and strings to numbers, and in this chapter we cover many of them. But although integers, floating-point numbers, and strings are fundamentally different kinds of values, Perl does not require the programmer to make a strong distinction between them. Instead, whenever a value is used in a context that requires a different kind of value—for example, attempting to add a string to a number, or taking the square root of an integer—Perl carries out an automatic conversion for us. Before we examine each of these types of scalar value individually, it is worth looking at how Perl manages them and handles this internal conversion.

Automatic Conversion: One Scalar Fits All

When a scalar is first created, an integer for example, only that representation is stored. When the scalar is used with an operator that expects a different representation, say a string, Perl automatically performs the conversion behind the scenes, and then caches the result. The original value is not altered, so the scalar now holds two representations. If the same conversion is needed again, Perl can retrieve the previously cached conversion instead of doing it all over again.

Both scalar variables and literal values have this ability; a variable merely allows its value to be changed. To illustrate how Perl handles conversions, consider the assignment of a literal integer value to a scalar variable:

$number = 3141;

This assigns an integer value to the scalar stored by the variable. The scalar is now combined into a string by concatenating it, using the dot operator, with another string.

$text = $number.' is a thousand PIs';

This statement causes Perl to convert the scalar into a string, since it currently only knows the integer value and concatenation only works between strings, not numbers. The converted string representation is now cached inside the scalar alongside the original integer. So, if the scalar is requested as a string a second time, Perl does not need to redo the conversion; it just retrieves the string representation.

The same principle works for floating-point numbers too. Here we divide the same scalar by 1.414, which requires that it be converted to a floating-point number:

$new_number = $number/1.414;

Our scalar variable now has three different representations of its value stored internally. It will continue to supply one of the three in any expression in which it is used, until it is assigned a new value. At this point, one of the representations is updated and the other two are marked as invalid.

$number = "Number Six";

In this case, the previous integer and floating-point number values of $number are now invalid; if we ask for the integer value of this variable now, Perl will recalculate it from the string value (getting a default result of zero since this string starts with a letter and not a digit).

All this behind-the-scenes shuffling may seem convoluted, but it allows Perl to optimize the retrieval and processing of scalar values in our programs. One benefit of this is that evaluation of scalars happens faster after the first time. However, the real point is that it allows us to write simpler and more legible programs, because we do not have to worry about converting the type of a scalar in order to meet the expectations of the language. This does not come without risk as Perl will happily let us use a string variable in a numeric context even if we had not intended it, but the advantage of transparent conversion within the scalar variable is that it allows us to avoid creating additional variables simply to store alternate representations and, in general, simplifies our code.
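As a rough illustration of the conversions described above, the following sketch reuses the $number example as a complete script. The strict and warnings pragmas, the my declarations, and the names $ratio and $as_int are additions made for this sketch rather than part of the original example.

```perl
use strict;
use warnings;

my $number = 3141;                             # created as an integer
my $text   = $number . ' is a thousand PIs';   # string context: converted for us
my $ratio  = $number / 1.414;                  # numeric context: used as a float

print "$text\n";
printf "%.2f\n", $ratio;

# Assigning a new value replaces whatever representations were held before.
$number = "Number Six";
my $as_int;
{
    no warnings 'numeric';    # the conversion below would otherwise warn
    $as_int = int($number);   # no leading digits, so the numeric value is 0
}
print "$as_int\n";
```

Note that with warnings enabled, Perl reports the questionable string-to-number conversion, which is one way to catch the unintended conversions mentioned above.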

Numbers

As we stated earlier, numbers fall into one of two categories, integer or floating point. In addition, both types of number can be represented as a numerical string, which is converted to the appropriate format when used.

As well as handling integers in base 10 (decimal), Perl accepts numbers written in base 2 (binary), base 8 (octal), and base 16 (hexadecimal), all of which it handles transparently. Displaying numbers in a different base, on the other hand, requires converting them into a string with the correct representation, so this is actually a special case of integer-to-string conversion. Internally, integers are stored as 32-bit binary values (unless Perl has been built with native 64-bit integer support), so number bases are only relevant for input or output.

In many cases, calculations will produce floating-point results, even if the numbers are integers: for example, division by another integer that does not divide evenly. Additionally, the range of floating-point numbers exceeds that of integers, so Perl sometimes returns a floating-point number if the result of a calculation exceeds the range in which Perl can store integers internally, as determined by the underlying platform.

Integers

Integers are one of the two types of numerical value that Perl supports. In Perl code, integers are usually written in decimal (base 10), but other number bases and formats are also possible, as shown in Table 3-1.

Table 3-1. Written Integer Formats

Number          Format
123             Regular decimal integer
0b1101          Binary integer
0127            Octal integer
0xabcd          Hexadecimal integer
12_345          Underscore annotated integer
0xca_fe_ba_be   Underscore annotated hexadecimal integer

It is also possible to specify an integer value as a string. When used in an integer context, the string value is translated into an integer before it is used.

"123"   # regular decimal integer expressed as a string

The underscore notation is permitted in integers in order to allow them to be written more legibly. Ordinarily we would use commas for this.

10,023

However, in Perl, the comma is used as a list separator, so this would represent the value 10 followed by the value 23. In order to make up for this, Perl allows underscores to be used instead.

10_023

Underscores are not required to occur at regular intervals, nor are they restricted to decimal numbers. As the preceding hexadecimal example illustrates, they can also be used to separate out the individual bytes in a 32-bit hexadecimal integer. It is legal to put an underscore anywhere in an integer except at the start or, since Perl 5.8, the end. (It was found that the trailing underscore made the parser's job harder and was not actually very useful, so it was dropped.)

1_2_3  # ok
_123   # leading underscore makes '_123' an identifier - not a number!
123_   # trailing underscore ok prior to Perl 5.8, illegal from Perl 5.8 onwards
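To see that these written formats all end up as ordinary integers, here is a minimal sketch that prints each literal from Table 3-1 in decimal. The @values and $sum names are chosen only for this illustration.

```perl
use strict;
use warnings;

# Each literal is converted to the same internal integer form at compile time.
my @values = (
    123,              # decimal
    0b1101,           # binary
    0127,             # octal
    0xabcd,           # hexadecimal
    12_345,           # decimal with an underscore separator
    0xca_fe_ba_be,    # hexadecimal, one underscore per byte
);

print join(' ', @values), "\n";   # prints "123 13 87 43981 12345 3405691582"

# A numeric string is converted when it is used as a number.
my $sum = "123" + 1;
print "$sum\n";                   # prints "124"
```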

Integer Range and Big Integers

When Perl stores an integer value as an integer (as opposed to a numerical string), the maximum size of the integer value is limited by the maximum size of integers supported by the Perl interpreter, which in turn depends on the underlying platform. Prior to Perl 5.8, on a 32-bit architecture this means integers can range from

    0 to 4294967295 (unsigned)
    −2147483648 to 2147483647 (signed)

Perl can take advantage of 64-bit integers on platforms that support them—and from Perl 5.8 on 32-bit platforms too—but in either case only if the interpreter has been built with support for them.

If an integer calculation falls outside the range that integers can support, then Perl automatically converts the result into a floating-point value. An example of such a calculation is

print "2 to the power of 100:1 against and falling: ", 2**100;

This results in a number larger than integers can handle, so Perl produces a floating-point result and prints this message:

2 to the power of 100:1 against and falling: 1.26765060022823e+30

Because the accuracy of the floating-point number is limited by the number of significant digits it can hold, this result is actually only an approximation of the value 2**100. While it is perfectly possible to store a large integer in a floating-point number or a string (which has no limitations on the number of digits and so can store an integer of any length accurately), we cannot necessarily use that number in a numerical calculation without losing some of the precision. Perl must convert the string into an integer or, if that is not possible due to range limitations, a floating-point number in order to perform the calculation. Because floating-point numbers cannot always represent integers perfectly, this results in calculations that can be imprecise.

For applications in which handling extremely large integers is essential and the built-in integer support is not adequate, the Math::BigInt package is provided as part of the standard Perl library. This can be used explicitly, or invoked automatically for all integer calculations via the use bigint or use bignum pragmas (the latter of which also pulls in Math::BigFloat).

use bignum; # or use bigint
print "2 to the power of 100:1 against and falling: ", 2**100;

This will print out the correct integer result of 1267650600228229401496703205376. However, unless the underlying platform actually allows native integer calculations on numbers of these sizes without a helping hand, it is also much slower.
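For comparison, Math::BigInt can also be used explicitly rather than through a pragma. The following is a small sketch of that approach; the variable names $approx and $exact are illustrative only.

```perl
use strict;
use warnings;
use Math::BigInt;

# Native arithmetic: the result overflows the integer range and becomes a float.
my $approx = 2**100;

# Explicit big-integer arithmetic: every digit is preserved.
my $exact = Math::BigInt->new(2)->bpow(100);

print "native: $approx\n";
print "bigint: $exact\n";   # prints "bigint: 1267650600228229401496703205376"
```

Using the class explicitly confines the slower arbitrary-precision arithmetic to the values that need it, whereas the pragmas affect every integer calculation in their scope.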

Converting Integers into Floating-Point Numbers

In general we almost never need to explicitly convert integers to floating-point representations, since the conversion will be done for us automatically when needed. However, should we need to do so, the process is straightforward. No special function or operation is required; we can just multiply (or divide) the integer by 1.0.

$float = 1.0 * $integer;

This will store a floating-point value into $float, although if we print it out we will still see an apparent integer value. This is because an integer is the most efficient way to display the result even if it is a floating-point value.

Converting Integers into Strings

Just as with floating-point numbers, we do not usually have to convert integers into strings since Perl will do it for us when the situation calls for it. However, at times you may genuinely want to impose a manual conversion. In this section, I'll discuss various issues surrounding this process.

For instance, one reason for manually converting a number into a string is to pad it with leading spaces or zeros (to align columns in a table, or print out a uniform date format). This is a task for printf and sprintf, generic string formatters inspired by and broadly compatible with (though not actually based on) the C functions of the same names. They work using special tokens or "placeholders" to describe how various values, including numbers, should be rendered as text.

Formatting of integer values is carried out by the %..d and %0..d placeholders, where .. stands for a numeric value that describes the desired width of the resulting text, for example, %4d for a width of 4 characters. The 0, if present, tells printf to pad the number with leading zeros rather than spaces. Let's consider a few examples:

printf '%d/%d/%d', 2000, 7, 4;       # displays "2000/7/4"
printf '%d/%2d/%2d', 2000, 7, 4;     # displays "2000/ 7/ 4"
printf '%d/%02d/%02d', 2000, 7, 4;   # displays "2000/07/04"

Other characters can be added to the placeholder to handle other cases. For instance, if the number is negative, a minus sign is automatically prefixed, which will cause a column of mixed signs to be misaligned. However, if a space or + is added to the start of the placeholder definition, positive numbers will be padded, with a space or + respectively:

printf '% 2d', $number;   # pad with leading space if positive
printf '%+2d', $number;   # prefix a '+' sign if positive

The functions printf and sprintf are introduced in detail in the section "String Formatting with printf and sprintf" later in the chapter.
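The width, zero-padding, and sign flags can be tried out together with sprintf, which returns the formatted text instead of printing it. This is a minimal sketch; the sample values and the names $date and @column are made up for the illustration.

```perl
use strict;
use warnings;

my $date = sprintf '%04d-%02d-%02d', 2000, 7, 4;   # zero-padded fields
print "$date\n";                                    # prints "2000-07-04"

# A space flag reserves the sign column, so mixed signs line up.
my @column = map { sprintf '% 4d', $_ } (7, -7, 42, -42);
print "$_|\n" for @column;
```

Each entry in @column is exactly four characters wide, so the digits stay aligned whether or not a minus sign is present.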

Converting Between Number Bases

As we briefly saw at the beginning of the chapter, Perl allows numbers to be expressed in octal, hexadecimal, binary, and decimal formats. To express a number in octal, prefix it with a leading zero. For example:

0123   # 123 octal (83 decimal)

Similarly, we can express numbers in hexadecimal using a prefix of 0x.

0x123   # 123 hexadecimal (291 decimal)

Finally, from Perl 5.6 onwards, we can express numbers in binary with a prefix of 0b.

0b1010011   # 1010011 binary (83 decimal)

Converting a number into a string that contains the binary, octal, or hexadecimal representation of that number can be achieved with Perl's sprintf function. As mentioned earlier, sprintf takes a format string containing a list of one or more placeholders, and a list of scalars to fill those placeholders. The type of placeholder defines the conversion that each scalar must undergo to be converted into a string. To convert into a different base, we just use an appropriate placeholder, %b, %o, %x, or %X.

$bintext = sprintf '%b', $number;   # convert to binary string (5.6.0+ only)
$octtext = sprintf '%o', $number;   # convert into octal string
$hextext = sprintf '%x', $number;   # convert into lowercase hexadecimal
$hextext = sprintf '%X', $number;   # convert into uppercase hexadecimal

The textual conversions are not created with the appropriate number base prefix (0, 0x, or 0b) in place, and so do not produce strings that convert back using the same base as they were created with. In order to fix this problem, we have to add the base prefix ourselves.

$bintext = sprintf '0b%b', 83;    # produces '0b1010011'
$octtext = sprintf '0%o', 83;     # produces '0123'
$hextext = sprintf '0x%x', 83;    # produces '0x53'
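As an alternative to writing the prefix by hand, sprintf also accepts a # flag that adds it, and the built-in oct function understands all three prefixes when converting back. Neither is discussed in the text above, so treat the following as a supplementary sketch rather than the chapter's own method.

```perl
use strict;
use warnings;

my $number = 83;

# The # flag asks sprintf to add the base prefix itself.
my $bintext = sprintf '%#b', $number;   # '0b1010011'
my $octtext = sprintf '%#o', $number;   # '0123'
my $hextext = sprintf '%#x', $number;   # '0x53'

# oct() honours the 0b, 0 and 0x prefixes, so each string converts back.
for my $text ($bintext, $octtext, $hextext) {
    printf "%-10s => %d\n", $text, oct($text);
}
```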

The %b placeholder is only available from Perl version 5.6.0 onwards. Versions of Perl prior to this do not have a simple way to generate binary numbers and have to resort to somewhat unwieldy expressions using the pack and unpack functions. This is covered in more detail in the "Pack and Unpack" section later.

$bintext = unpack("B32", pack("N", $number));

This handles 32-bit values. If we know that the number is small enough to fit into a short (i.e., 16-bit) integer, we can get away with fewer bits.

$smallbintext = unpack("B16", pack("n", $number));

Unfortunately for small values, this is still likely to leave a lot of leading zeros, since unpack has no idea what the most significant bit actually is, and so it just ploughs through all 16 or 32 bits regardless. The number 3 would be converted into

'0000000000000011'   # '3' as a 16-bit binary string

We can remove those leading zeros using the string functions substr and index.

#hunt for and return string from the first '1' onwards
$smallbintext = substr($smallbintext, index($smallbintext, '1'));

A substitution works too.

$smallbintext =~ s/^0+//;

Though this works, it is certainly neither as elegant (nor as fast) as using sprintf; upgrading an older Perl may be a better idea than using this work-around.
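For completeness, the pack and unpack work-around can be wrapped in a small subroutine. This is only a sketch: bits_of is a hypothetical helper name, and the lookahead in the substitution is an addition that keeps a single 0 when the number itself is zero.

```perl
use strict;
use warnings;

# bits_of is a hypothetical helper name used only for this sketch.
sub bits_of {
    my ($number) = @_;
    my $bits = unpack 'B32', pack('N', $number);   # 32 binary digits, zero padded
    $bits =~ s/^0+(?=\d)//;                         # strip leading zeros, keep the last digit
    return $bits;
}

print bits_of(3),  "\n";   # prints "11"
print bits_of(83), "\n";   # prints "1010011"
print bits_of(0),  "\n";   # prints "0"
```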

Floating-Point Numbers

Perl allows real numbers to be written in one of two forms, fixed point and scientific. In the fixed-point representation, the decimal point is fixed in place, with a constant and unchanging number of available fractional digits. Prices are a common example of real-life fixed-point numbers. In the scientific representation, the number value is called the mantissa, and it is combined with an exponent representing a power of 10 that the mantissa is multiplied by. The exponent allows the decimal point to be shifted, and gives the scientific representation the ability to express both very small and very large numbers.

Either representation can be used to express a floating-point value within Perl, but depending on the application it is usually more convenient to use one over the other. Consider these examples:

123.45      # fixed point
-1.2345e2   # scientific, lowercase, negative
+1.2345E2   # scientific, uppercase, explicitly positive

Likewise, fractions can be expressed either in fixed-point notation or as a negative exponent.

0.00034     # fixed point
-3.4e-4     # scientific, lowercase, negative
+3.4E-4     # scientific, uppercase, explicitly positive

Floating-point numbers can be expressed over a very large range.

1e100       # a 'googol' or 1 × 10^100
3.141       # 3.141 × 10^0
1.6e-22     # 0.000000000000000000000016

Tip A googol is actually the correct mathematical term for this number (1e100), believe it or not. Likewise, a googolplex is 10 to the power of a googol, or 10^(10^100), which is hard for both humans and programming languages to handle.


The distinction between "floating-point numbers" and "scientific representation" is these days one of implementation versus expression. On older hardware there was a speed advantage to using fixed-point numbers—indeed, integers could be pressed into duty as fixed-point reals when floating-point support was not available. So it made a difference whether or not a number was expressed in fixed-point or scientific form.

These days floating-point calculations are intrinsic to the CPU and there is no significant speed advantage to making fixed-point calculations with integers. As a result, even fixed-point numbers are now internally stored as (binary) floating-point values. There is still an advantage to writing in fixed-point notation for human consumption. We would rarely choose to write 15,000 as 1.5e4 or 15e3, even if they are more natural floating-point representations.

Floating-point numbers are used to store what in mathematical terms are called real numbers, which include fractional values as well as whole ones. The accuracy of floating-point numbers is limited, so they cannot represent all real numbers. However, they cover a far wider range of values than integers, from the very small to the very large. The preceding example of 1e100 is mathematically an integer, but it is one that Perl's internal integer representation is unable to handle, since one hundred consecutive zeros is considerably beyond the maximum value of 4,294,967,295 that 32-bit integers can manage. For a floating-point number, however, it is trivial: the exponent carries the scale of the value, although the binary value actually stored is a very close approximation to 1e100 rather than the exact integer.

The standard C library of the underlying platform on which Perl is running determines the range of floating-point numbers. On most platforms floating-point numbers are handled and stored as doubles, double-precision 8-byte (64-bit) values, though the actual calculations performed by the hardware may use more bits. Of these 64 bits, 1 holds the sign and 11 are reserved for the exponent, which for normal values ranges from 2 to the power of −1022 to +1023, equating to a range of around 10 to the power of −308 to +308. The remaining 52 bits, together with an implied leading 1 bit, form a 53-bit mantissa, so floating-point numbers can represent values up to 53 binary places long. That equates to 15 or 16 decimal places, depending on the exact value.

However, just because a value is within the range of floating-point numbers does not mean that it can be represented accurately. Unforeseen complications can arise when using floating-point numbers to perform calculations. While Perl understands and displays floating-point numbers in decimal format, it stores them internally in binary format. Fractional numbers expressed in one base cannot always be accurately expressed in another, even if their representation seems very simple. This can lead to slight differences between the answer that Perl calculates and the one we might expect.

To give an example, consider the floating-point number 0.9. This is easy to express in decimal, but in binary this works out to

     0.11100110011001100110011001100110011001100 . . .

that is, a recurring number. But floating-point numbers may only hold a finite number of binary digits, so this number cannot be represented exactly in the mantissa. Writing it as 9e-1 does not help, because the exponent that is stored is a power of 2 rather than a power of 10; Perl simply stores the nearest value it can. Consequently, calculations involving floating-point values, and especially comparisons to integers and other floating-point numbers, do not always behave as we expect.
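A common way to cope with this imprecision is to compare floating-point values within a small tolerance instead of testing for exact equality. The sketch below assumes Perl was built with ordinary 64-bit doubles; nearly_equal and its default tolerance of 1e-9 are illustrative choices, not a Perl built-in.

```perl
use strict;
use warnings;

printf "%.20f\n", 0.9;   # shows the stored approximation in more detail

my $sum = 0.1 + 0.2;
print "not exactly equal\n" if $sum != 0.3;

# nearly_equal and its default tolerance are illustrative, not a Perl built-in.
sub nearly_equal {
    my ($x, $y, $tolerance) = @_;
    $tolerance = 1e-9 unless defined $tolerance;
    return abs($x - $y) < $tolerance;
}

print "equal within tolerance\n" if nearly_equal($sum, 0.3);
```

The appropriate tolerance depends on the magnitude of the values being compared, so a fixed default such as this one is only suitable for numbers of modest size.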

Converting Floating Points into Integers

The quickest and simplest way to convert floating-point numbers into integers is to use the int function. This strips off the fractional part of the floating-point number and returns the integer part.

$int = int($float);

However, int is not very intelligent about how it calculates the integer part. First, it simply truncates the fractional part, so it always rounds toward zero (down for positive numbers, up for negative ones), which may not be what we want. Second, it does not take into account the problems of precision, which can affect the result of floating-point calculations. For example, the following calculation ought to produce a round number, yet it gives a different answer when the result is converted to an integer:

$number = 4.05/0.05;
print "$number ";   # prints 81, correct
print int($number);   # prints 80, incorrect!

Similarly, a comparison will tell us that $number is not really equal to 81.

$number = 4.05/0.05;
# if $number is not equal to 81 then execute the print statement
# in the block
if ($number != 81) {
   print "$number is not equal to 81 ";
}

The reason for this is that $number does not actually have the value 81 but a floating-point value that is very slightly less than 81, due to the fact that the calculation is performed with binary floating-point numbers. When we display it, the conversion to string format handles the slight discrepancy for us and we see the result we expect.

To round the preceding result to the nearest integer rather than always toward zero, we can add 0.5 to our value and then truncate it (this simple form works for positive values only).

print int($number+0.5);   # prints 81, correct

C programmers may be familiar with the floor and ceil functions of the standard C library that have a similar purpose (and the same rounding problems). In fact, using the POSIX module, we can gain direct access to these and other C library functions defined by IEEE 1003.1, if we like. The difference is that the values returned from these functions are floating-point values, not integers. That is, though they appear the same to us, the internal representation is different.
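The differences between these rounding approaches are easiest to see side by side. In the sketch below, floor and ceil come from the POSIX module, while round_nearest is a hypothetical helper that extends the add-0.5 technique to negative values by rounding halves away from zero.

```perl
use strict;
use warnings;
use POSIX qw(floor ceil);

# round_nearest is a hypothetical helper; it rounds halves away from zero.
sub round_nearest {
    my ($value) = @_;
    return $value < 0 ? int($value - 0.5) : int($value + 0.5);
}

for my $value (3.7, -3.7) {
    printf "%5.1f: int=%d floor=%d ceil=%d nearest=%d\n",
        $value, int($value), floor($value), ceil($value), round_nearest($value);
}
```

For 3.7 this reports 3, 3, 4, and 4; for -3.7 it reports -3, -4, -3, and -4, which shows that int truncates toward zero while floor always rounds down.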

Converting Floating Points into Strings

Perl automatically converts floating-point numbers into strings when they are used in a string context, for example:

print "The answer is $floatnum";

If the number can be represented as an integer, Perl converts it before printing.

$floatnum = 4.3e12;   # The answer is 4300000000000

Alternatively, if the number is a fraction that can be expressed as a fixed decimal (that is, purely in terms of a mantissa without an exponent), Perl converts it into that format.

$floatnum = 4.3e-3;   # The answer is 0.0043

Otherwise it is converted into the standard mantissa+exponent form.

$floatnum = 4.3e99;   # The answer is 4.3e+99

Sometimes we might want to alter the format of the generated text, to force consistency across a range of values or to present a floating-point value in a different format. The sprintf and printf functions can do this for us, and provide several placeholder formats designed for floating-point output.

printf '%e', $floatnum;   #force conversion to mantissa/exponent format
printf '%f', $floatnum;   #force conversion to fixed decimal format
printf '%g', $floatnum;   #use fixed (as %f), if accurately possible, otherwise as %e

Perl's default conversion of floating-point numbers is therefore equivalent to

$floatstr = sprintf '%g', $floatnum;
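To compare the three floating-point placeholders directly, here is a brief sketch that formats the same value with each. The sample values are arbitrary, and the exponent is shown as it appears on most current platforms (some older C libraries print three exponent digits).

```perl
use strict;
use warnings;

my $value = 1234.5;

my $scientific = sprintf '%e', $value;      # 1.234500e+03 on most platforms
my $fixed      = sprintf '%f', $value;      # 1234.500000
my $general    = sprintf '%g', $value;      # 1234.5
my $tiny       = sprintf '%g', 0.0000123;   # switches to exponent form

print "$scientific\n$fixed\n$general\n$tiny\n";
```

Without an explicit precision, %e and %f both show six digits after the decimal point, while %g shows up to six significant digits and drops trailing zeros.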

NOTE To be strictly accurate, the default conversion is %.ng, where n is computed at the time Perl is built to be the highest accurate precision available. In most cases this will be the same as %g. See also the discussion of $# at the end of this section.


A field width can be inserted into the format string to indicate the desired width of the resulting number text. Additionally, a decimal point and second number can be used to indicate the width of the fractional part. We can use this to force a consistent display of all our numbers.

    printf '%6.3f', 3.14159;   # display a minimum width of 6 with 3
                               # decimal places; produces ' 3.142'

The width, 6, in the preceding example, includes the decimal point and the leading minus sign if the number happens to be negative. This is fine if we only expect our numbers to range from 99.999 to −9.999, but printf will exceed this width if the whole part of the number exceeds the width remaining (two characters) after the decimal point and three fractional digits have taken their share. Allowing a sufficient width for all possible values is therefore important if sprintf and printf are to work as we want them to.
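The effect of an insufficient field width can be checked with a short loop. This sketch formats a few arbitrary sample values with the same %6.3f placeholder and reports how wide each result actually is.

```perl
use strict;
use warnings;

for my $value (3.14159, -9.999, 12345.678, -9.9999) {
    my $text = sprintf '%6.3f', $value;
    printf "[%s] is %d characters wide\n", $text, length $text;
}
```

The first two values fit in six characters, but 12345.678 needs nine, and -9.9999 rounds to -10.000 and so needs seven. The width is a minimum, never a limit.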

Just as with integers, we can prefix the format with a space or + to have positive numbers format to the same width as negative ones.

printf '% 7.3f', 3.14159;    # pad with leading space if positive,
                             # produces '  3.142'
printf '%+7.3f', 3.14159;    # prefix a '+' sign if positive,
                             # produces ' +3.142'
printf '%07.3f', 3.14159;    # prefix with leading zeros,
                             # produces '003.142'
printf '% 07.3f', 3.14159;   # pad with a leading space, then leading zeros,
                             # produces ' 03.142'
printf '%+07.3f', 3.14159;   # pad with a leading +, then leading zeros,
                             # produces '+03.142'
printf '%.3f', 3.14159;      # leading digits with no padding,
                             # produces '3.142'

Interestingly, Perl's default conversion of floating-point numbers into strings can be overridden in a couple of ways. First, the special variable $# can be used to set the internal format for conversion (but note that use of $# has always been technically deprecated, and the variable was removed altogether in Perl 5.10).

$# = "%.3f";
print 1.6666666;   # produces "1.667"

Second, for those who require exact and precise control over the conversion process, Perl can be built to use a different conversion function by overriding the d_Gconvert configuration option. See Chapter 1 and perldoc Configure for more information.

The use integer Pragma

We discussed earlier how Perl automatically converts between integers, strings, and floating-point numbers when required. However, we might specifically want an integer, either because we want a result in round numbers or simply for speed. One way to restrict the result of a calculation to integers is to use the int function, as shown in the earlier section, "Converting Floating Points into Integers."

This still causes Perl to do some calculations with floating-point numbers, since it does not always know to do otherwise. If the underlying hardware does not support floating-point operations (rare, but still possible in embedded systems), this can result in unnecessarily slow calculations when much faster integer calculations could be used. To remedy this situation and allow Perl to intelligently use integers where possible, we can encourage integer calculations with the use integer pragma.

use integer;
$integer_result = $numerator / $divisor;

While use integer is in effect, operations that would normally produce floating-point results, but that are capable of working with integers, interpret their operands as integers.

use integer;
$PI = 3.1415926;
print $PI;         # prints '3.1415926'
print $PI + 5;     # prints '8'

We can disable integer-only arithmetic by stating no integer, which cancels out the effect of a previous use integer. This allows us to write sections (or blocks) of code that are integer only, but leave the rest of the program using floating-point numbers as usual, or vice versa. For example:

sub integersub {
   use integer;
   #integer-only code...
   {
       no integer;
       #floating point allowed in this block
   }
   #more integer-only code...
}
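Because the pragma is lexically scoped, it can be confined to a single subroutine. The following minimal sketch shows this; integer_divide is a hypothetical helper name used only for the illustration.

```perl
use strict;
use warnings;

# integer_divide is a hypothetical helper; use integer applies only inside its block.
sub integer_divide {
    my ($numerator, $divisor) = @_;
    use integer;
    return $numerator / $divisor;
}

print integer_divide(10, 3), "\n";   # prints "3"
print 10 / 3, "\n";                  # floating-point division as usual
```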

Using use integer can have some unexpected side effects. While enabled, Perl passes integer calculations to the underlying system (which means the standard C library for the platform on which Perl was built) rather than doing them itself. That might not always produce exactly the same result as Perl would, for example:

print -13 % 7;   # produces '1'

use integer;
print -13 % 7;   # produces '-6'

The reason for this behavior is that Perl and the standard C library have slightly different perspectives on how the modulus of a negative number is calculated. The fact that, by its nature, a modulus calculation cannot produce a floating-point result does not alter the fact that use integer affects its operation.
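If a non-negative remainder is needed regardless of whether use integer is in effect, one option is to normalize the result explicitly. This is a sketch of that idea, not something the pragma provides; the variable names are illustrative.

```perl
use strict;
use warnings;

my $plain = -13 % 7;    # Perl's own rule: result takes the sign of the right operand

my ($native, $normalized);
{
    use integer;
    $native     = -13 % 7;             # follows the platform's C semantics
    $normalized = (-13 % 7 + 7) % 7;   # folded back into the range 0..6
}

print "$plain $native $normalized\n";
```

Adding the divisor and taking the modulus again yields the same answer whichever convention the platform uses for negative operands.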

Even with use integer enabled, Perl will still produce a floating-point number if an integer result makes no sense or if the result would otherwise be out of range. For example:

use integer;
print sqrt(2);   # produces 1.4142135623731
print 2 ** 100;   # produces 1.26765060022823e+30

use integer has one final effect that is not immediately obvious: it disables Perl's automatic interpretation of bitwise operations as unsigned, so results that set the highest bit (that is, the 32nd bit in a 32-bit architecture) will be interpreted by Perl as signed rather than unsigned values.

print ~0, ' ',-1 << 0;   # produces '4294967295 4294967295'

use integer;
print ~0, ' ',-1 << 0;   # produces '-1 -1'

This can be useful behavior if we want it, but it is a potential trap for the unwary.
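The signed and unsigned interpretations can be captured in variables for comparison. In this short sketch, the unsigned result depends on whether Perl was built with 32-bit or 64-bit integers, so only the signed value is shown in a comment.

```perl
use strict;
use warnings;

my $unsigned = ~0;        # all bits set, read as an unsigned value
my $signed;
{
    use integer;
    $signed = ~0;         # same bit pattern, read as a signed value: -1
}

print "$unsigned $signed\n";   # the first number depends on the integer width
```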

Mathematical Functions

Perl provides a number of built-in mathematical functions for managing numbers. Here is a quick summary of some of the more commonly encountered.

The abs function returns the absolute (unsigned) value of a number.

print abs(-6.3);   # absolute value, produces 6.3

Perl provides three functions for computing powers and logarithms, in addition to the ** exponentiation operator. These functions are sqrt, exp, and log.

print sqrt(6.3);   # square root of 6.3, produces 2.50998007960223
print exp(6.3);    # raise 'e' to the power of 6.3, produces 544.571910125929
print log(6.3);    # natural (base 'e') logarithm, produces 1.84054963339749

Perl's support for logarithms only extends to base e, also called natural, logarithms. This is not a problem, though, since to work in other bases we just divide the natural log of the number by the natural log of the base, that is

$n=2;
print log($n) / log(10);   # calculate and print the base 10 logarithm of 2

For the specific case of base 10 logarithms, the standard C library defines a base 10 logarithm function that we can use via the POSIX module as log10.

use POSIX qw(log10);
print log10($n);   # calculate and print the base 10 logarithm of 2
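The change-of-base rule generalizes to any base, so it can be wrapped in a small subroutine. Here is a minimal sketch; log_base is a hypothetical helper, not a Perl built-in.

```perl
use strict;
use warnings;

# log_base is a hypothetical helper, not a Perl built-in.
sub log_base {
    my ($value, $base) = @_;
    return log($value) / log($base);
}

printf "%.6f\n", log_base(2, 10);      # prints "0.301030"
printf "%.6f\n", log_base(1024, 2);    # prints "10.000000"
```

Because the result is a quotient of two floating-point values, it may differ from the exact answer in the last few digits, which is why the example formats it to a fixed precision.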

Perl provides three built-in trigonometric functions, sin, cos, and atan2, for the sine and cosine of an angle (in radians) and the arctangent of the quotient of two values, respectively. Perl does not provide built-in inverse functions for sine and cosine, nor does it provide the standard tan function, because these can all be worked out easily (ignoring issues of ranges, domains, and result quadrants).

atan2($n, sqrt(1 - $n ** 2))   # asin (inverse sine)
atan2(sqrt(1 - $n ** 2), $n)   # acos (inverse cosine)
sin($n) / cos($n)              # tan

We can easily define subroutines to provide these calculations, but to save us the trouble of writing our own trigonometric functions Perl provides a full set of basic trigonometric functions in the Math::Trig module, as well as utility subroutines for converting between degrees and radians and between radial and Cartesian coordinates. See the Math::Trig manual page for more information.
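As a sketch of what such hand-written subroutines might look like, the following defines the three calculations using only Perl's built-in functions. The my_ prefix is an arbitrary naming choice for this illustration, and the domain checks that a production version would need are omitted.

```perl
use strict;
use warnings;

# The my_ prefix avoids clashing with the functions that Math::Trig exports.
sub my_asin { my ($n) = @_; return atan2($n, sqrt(1 - $n ** 2)) }
sub my_acos { my ($n) = @_; return atan2(sqrt(1 - $n ** 2), $n) }
sub my_tan  { my ($n) = @_; return sin($n) / cos($n) }

my $pi = 4 * atan2(1, 1);

printf "asin(0.5) = %.6f\n", my_asin(0.5);      # pi/6, about 0.523599
printf "acos(0.5) = %.6f\n", my_acos(0.5);      # pi/3, about 1.047198
printf "tan(pi/4) = %.6f\n", my_tan($pi / 4);   # about 1.000000
```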

Strings

Perl provides comprehensive support for strings, including interpolation, regular expression processing, and even a selection of ways to specify them. Arguably, string processing is Perl's single biggest strength, and accordingly much of this book is concerned with it in one way or another. In this section, we will look at the basic features of strings and built-in functions Perl provides to handle them.

Quotes and Quoting

Literal strings—that is, strings typed explicitly into source code—can be written in a variety of different quoting styles, each of which treats the text of the string in a different way. These styles are listed in Table 3-2.

Table 3-2. Quote Types and Operators

Quote Type       Operator   Result
Single quotes    q          Literal string
Double quotes    qq         Interpolated string
N/A              qr         Regular expression
Backticks (``)   qx         Executes external program
N/A              qw         List of words

As the table shows, Perl provides two syntaxes for most kinds of strings. The ordinary punctuated kind uses quotes, but we can also use a quoting operator to perform the same function.

'Literal Text'          # is equivalent to q(Literal Text)
"$interpolated @text"   # is equivalent to qq($interpolated @text)
`./external -command`   # is equivalent to qx(./external -command)

One of the advantages of the quoting operators is that they allow us to place quote marks inside a string that would otherwise cause syntax difficulties. Accordingly, the delimiters of the quoting operators can be almost anything, so we have the greatest chance of being able to pick a pair of characters that do not occur in the string.

$text = q/a string with 'quotes' is 'ok' inside q/;

Perl accepts both paired and single delimiters. If a delimiter has a logical opposite, such as ( and ), < and >, [ and ], or { and }, the opposing delimiter is used for the other end; otherwise the same delimiter is expected again.

$text = qq{ "$interpolated" ($text) 'and quotes too' };

Other than their more flexible syntax, the quoting operators have the same results as their quote counterparts. Two quote operators, qw and qr, do not have quote equivalents, but these have more specialized purposes. More to the point, they do not produce strings as their output.

The single quoted string treats its text as literal characters; no interpolation is performed, and the only escapes recognized are \\ for a backslash and \' for the quote (or delimiter) character itself. The quoting operator for literal text is q, so the following are equivalent:

'This is literal text'

q/This is literal text/

Or with any other delimiter we choose, as noted previously.

The double quoted string interpolates its text, expanding scalar and array variables, and backslash-prefixed special characters like \n. The quoting operator for interpolated text is qq, so the following are equivalent:

"There are $count matches in a $thing.\n";
qq/There are $count matches in a $thing.\n/;

Interpolation is a more advanced subject than it at first appears, and the first part of Chapter 11 is dedicated to exploring it in detail. Fortunately, if we do not have any variables in a double-quoted string, Perl will realize this at compile time and compile the string as a constant (as if it had been single quoted), so there is no substantive performance cost to using double quotes even for strings that do not need to be interpolated.
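The difference between the literal and interpolated styles is easiest to see side by side. The variable values in this sketch are made up for illustration.

```perl
use strict;
use warnings;

my $count = 3;
my $thing = 'haystack';

# single quotes: the text is kept exactly as typed, including $count and \n
my $literal = 'There are $count matches in a $thing.\n';

# double quotes: variables and the \n escape are expanded
my $interpolated = "There are $count matches in a $thing.\n";

# qq follows the same rules, and lets both kinds of quote appear unescaped
my $with_quotes = qq{There are "$count" matches in a '$thing'.\n};

print $literal, "\n", $interpolated, $with_quotes;
```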

The qr operator, introduced in Perl 5.005, prepares regular expressions for use ahead of time. It accepts a regular expression pattern, and produces a ready-to-go regular expression that can be used anywhere a regular expression operator can.

# directly
$text =~ /pattern/;

# via 'qr':
$re = qr/pattern/;
$text =~ $re;

The qr operator also interpolates its argument in exactly the same way that double quoted strings and the qq operator do. Do not despair if this seems rather abstract at the moment; we will learn more about this in Chapter 11, where qr is covered in more detail.
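To make this a little more concrete, here is a minimal sketch in which a variable is interpolated into a qr pattern, and the compiled result is then used both on its own and inside a larger pattern. The sample phrases are invented for the example.

```perl
use strict;
use warnings;

my $word = 'cat';
my $re   = qr/\b$word\b/i;    # $word is interpolated when the pattern is compiled

my @phrases = ('Cat nap', 'concatenate', 'the cat sat');

my @hits   = grep { $_ =~ $re } @phrases;   # use the compiled pattern directly
my @starts = grep { /^$re/ } @phrases;      # or embed it in a larger pattern

print "matched: @hits\n";
print "at start: @starts\n";
```

Note that the /i modifier given to qr travels with the compiled pattern, so it still applies when the pattern is embedded in another one.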

Quoting a string with backticks, `, causes Perl to treat the enclosed text as a command to be run externally. The output of the command (if any) is captured by Perl and returned to us. For example:

#!/usr/bin/perl
# external.pl
use strict;
use warnings;
my $dir = "/home";
my $files = `ls -1 $dir`;   # or something like `dir c:` for DOS/Windows
print $files;

When run, this program produces the following output:

> perl external.pl


beeblebz
denta
prefectf
marvin

Interpolation is carried out on the string before it is executed, and the result is then passed to a temporary shell if any shell-significant characters like spaces or quotes are present in it. The equivalent quoting operator for backticks is qx.

my $files = qx(ls -1 $dir);

There are serious security issues regarding the use of backticks, however. This is partly because they rely on environment variables like $PATH, which is represented in Perl as $ENV{PATH}, that we may not be able to trust. Additionally, the temporary shell can interpret characters in a potentially damaging way. For this reason, backticks and the qx operator are often considered deprecated in anything more complex than simple private-use scripts. See Chapter 21 for more on the issues as well as ways to avoid them.
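One commonly used way to avoid the temporary shell, sketched here as an assumption rather than as the approach Chapter 21 takes, is the list form of a piped open, which passes each argument to the program directly. The helper name capture_output is hypothetical, and the sketch runs the Perl interpreter itself (via $^X) only so that it does not depend on any particular external command. The list form is assumed to be available, as it is on Unix-like platforms.

```perl
use strict;
use warnings;

# capture_output is a hypothetical helper: it runs a command given as a list,
# so no shell is involved and no argument is reinterpreted
sub capture_output {
    my @command = @_;
    open(my $pipe, '-|', @command) or die "Cannot run $command[0]: $!";
    my @lines = <$pipe>;
    close $pipe;
    return @lines;
}

# $^X is the path of the running perl, used here as a stand-in external program
my @listing = capture_output($^X, '-e', 'print "one\ntwo\n"');
print @listing;
```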

The qw operator takes a whitespace-separated string and turns it into a list of values. In this respect it is unlike q, qq, and qx, all of which return string values. Its purpose is to allow us to specify long lists of words without the need for multiple quotes and commas that defining a list of strings normally requires.

# Using standard single quotes
@array = ('a', 'lot', 'of', 'quotes', 'is', 'not', 'very', 'legible');

# much more legible using 'qw'
@array = qw(a lot of quotes is not very legible);

Both these statements produce the same list of single-word string values as their result, but the second is by far the more legible. The drawback to qw is that it will not interpolate variables or handle quotes, so we cannot include spaces within words. In its favor, though, qw also accepts tabs and newlines, so we can also say

@array = qw(
a lot
of quotes
is not
very legible
);

Note that with qw we need to avoid commas, which can be a hard habit to break. Commas are just another character to qw, so comma-separated words keep their commas: each word in the resulting list ends with one (and words separated by commas alone, with no whitespace, stay joined as a single string, commas and all).

@oops = qw(a, comma, separated, list, of, words);

If we try to do this, Perl will warn us (assuming we have warnings enabled) with


Possible attempt to separate words with commas at ...

If we actually want to use commas, we can, but in order to silence Perl, we will need to turn off warnings temporarily with no warnings and turn them back on again afterward (from Perl 5.6 onwards we can also turn off specific warnings, so we could do that too).
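As a sketch of the more selective approach, the warning category involved is named qw, and because the warnings pragma is lexically scoped, switching it off inside a block leaves it active everywhere else.

```perl
use strict;
use warnings;

my @words;
{
    no warnings 'qw';    # silence only the qw-related warnings, only in this block
    @words = qw(a, comma, separated, list);
}

# the commas are part of the words themselves
print join('|', @words), "\n";
```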

"Here" Documents

As we have just learned, the usual way to define literal text in source code is with quotes, or the equivalent quoting operators q and qq. Here documents are an additional and alternative way that is particularly well suited to multiple line blocks of text like document templates. Here documents are interpolated, and so make a convenient alternative to both concatenating multiple lines together and Perl formats, which also provide a document template processing feature, but in an entirely different way.

To create a here document, we use a << followed immediately by an end token, either a bareword or a quoted string, that is used to mark the end of the block. The block itself starts from the next line and absorbs all text, including the newlines, until Perl sees the end token, which must appear alone at the start of a new line. Normal Perl syntax parsing is disabled while the document is defined (although interpolation of variables still takes place), and parsing only continues after the end token is located.

$string = <<_END_OF_TEXT_;
Some text
Split onto multiple lines
Is clearly defined
_END_OF_TEXT_

This is equivalent to, but easier on the eye than

$string = "Some text\n".
"Split onto multiple lines\n".
"Is clearly defined\n";

The << and token define where the document is used, and tell Perl that it is about to start on the next line. There must be no space between the << and the token, otherwise Perl will complain. The token may be an unquoted bareword, like the preceding example, or a quoted string, in which case it can contain spaces, as in the following example:

# the end token may contain spaces if it is quoted
print <<"print to here";
This is
some text
print to here

If used, the type of quote also determines whether or not the body of the document is interpolated. Double quotes or no quotes cause interpolation. If single quotes are used, no interpolation takes place.

# this does not interpolate
print <<'_END_OF_TEXT_';
This %is @not %interpolated
_END_OF_TEXT_
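The effect of the quoting style on the end token can be checked with a short sketch. The variable name and text are invented for the example.

```perl
use strict;
use warnings;

my $name = 'Arthur';

# double-quoted token: the body is interpolated
my $interpolated = <<"END_TEXT";
Dear $name,
END_TEXT

# single-quoted token: the body is taken literally
my $literal = <<'END_TEXT';
Dear $name,
END_TEXT

print $interpolated, $literal;
```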

Note that in all examples the here document is used within a statement, which is terminated by a semicolon, as normal. It is not true to say that the <<TOKEN absorbs all text following it; it is a perfectly ordinary string value from the point of view of the statement it appears in. Only from the next line does Perl start to absorb text into the document.

# a foreach loop on one line
foreach (split "\n", <<LINES) { print "Got $_\n"; }
Line 1
Line 2
Line 3
LINES

Alternatively, we can define a here document within a statement if the statement spans more than one line; the rest of the lines fall after the end token:

#!/usr/bin/perl
# heredoc.pl
use warnings;
use strict;

# a foreach loop split across the 'here' document
foreach (split "\n", <<LINES) {
Line 1
Line 2
Line 3
LINES
   print "Got: $_\n";
}

Since here documents are interpolated (unless we use single quotes to define the end token at the top, as noted earlier) they make a very convenient way to create templates for documents. Here is an example being used to generate an e-mail message with a standard set of headers:

#!/usr/bin/perl
# formate.pl
use warnings;
use strict;

print format_email('[email protected]', '[email protected]', "Wishing you were here",
                   "...instead of me!", "Regards, Me");

# subroutines will be explained fully in Chapter 7
sub format_email {
    my ($me, $to_addr, $subject, $body_of_message, $signature) = @_;

    return <<_EMAIL_;
To: $to_addr
From: $me
Subject: $subject

$body_of_message
--
$signature
_EMAIL_
}

The choice of end token is arbitrary; it can be anything, including a Perl keyword. For clarity's sake, however, use an appropriately named token, preferably in capitals and possibly with surrounding underscores. The end token must also appear at the start of the line—if it is indented, Perl will not recognize it. Likewise, any indentation within the here document will remain in the document. This can present a stylistic problem since it breaks with the indentation of code in things like subroutines. For instance:

sub return_a_here_document {
    return <<DOCUMENT;
This document definition cannot be indented
if we want to avoid indenting
the resulting document too
DOCUMENT
}

If we do not mind indenting the document, then we can indent the end token by defining it with the indent to start with.

sub return_a_here_document {
    return <<'    DOCUMENT';
    This document is indented, but the
    end token is also indented, so it parses OK
    DOCUMENT
}
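If we want indented source but an unindented result, one long-standing idiom (offered here as a sketch, not as the only option) is to assign the here document and strip the leading whitespace from every line with a substitution. Note that this also removes any indentation we intended to keep. The subroutine name is hypothetical.

```perl
use strict;
use warnings;

sub return_a_tidy_document {
    # assign the here document, then strip leading spaces and tabs from every line
    (my $document = <<'DOCUMENT') =~ s/^[ \t]+//mg;
        This text is indented in the source,
        but not in the resulting string
DOCUMENT
    return $document;
}

print return_a_tidy_document();
```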

Although it uses the same symbol, the here document << has nothing whatsoever to do with the shift left operator. Rather, it is a unary operator with an unusual operand.

For large blocks of text, it is also worth considering the DATA filehandle, which reads data placed after the __END__ or __DATA__ tokens in source files, discussed in Chapter 12. This approach is less convenient in that it cannot be used in place like a here document, but it does avoid the indentation problem mentioned previously and can often improve legibility.

Bareword Strings and Version Number Strings

The use strict pragma provides us with three separate additional checks on our Perl code. Two of them, to declare variables properly and forbid symbolic references, we already mentioned in Chapter 2, and will cover in detail in Chapter 8. The third restriction, strict subs, does not allow us to write a string without quotes, known as a bareword string, because of the potential for confusion with subroutines. Without strict subs we can in fact quite legally say

# use strict 'subs' won't allow us to do this
$word = unquoted;

instead of the more correct

$word = "unquoted";

The problem with the first example, apart from the fact we cannot include spaces in a bareword string, is that if at a future point we write a subroutine called unquoted, then our string assignment suddenly becomes a subroutine call. This is a fairly obvious example where we really ought to be using quotes, but it is easy (and perfectly legal) to use bareword strings in things like hash keys, because in these cases a string is required and expected.

$value = $hash{animal}{small}{furry}{cat};

The bareword is allowed because it is the most common case. What if we actually do want to call a subroutine called animal, however? We need to tell Perl by adding some parentheses, turning the bareword into an explicit subroutine call. For example:

$value = $hash{animal()}{small()}{furry()}{cat()};

There are a few other places where bareword strings are allowed. One is the qw operator, which we covered earlier.

qw(startword bareword anotherbareword endword);

A second place where barewords are allowed is on the left-hand side of the relationship (or digraph) operator, =>. This is just a clever form of comma that knows it is being used to define paired keys and values, typically for hashes. Since a hash key can only be a string, the left-hand side of a => is assumed to be a string if written as a bareword.

%hash = ( key => "value still needs quotes" );

(We can force a subroutine call instead by adding parentheses to key as before.)
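The following sketch shows the difference between a bareword key and an explicit subroutine call, both inside hash braces and to the left of =>. The subroutine and hash contents are invented for illustration.

```perl
use strict;
use warnings;

sub animal { return 'cat' }

my %hash = (
    animal => 'found under the literal key animal',
    cat    => 'found under the value that animal() returned',
);

my $by_bareword = $hash{animal};      # bareword: the key is the string 'animal'
my $by_call     = $hash{animal()};    # parentheses: call the sub, use its result 'cat'

my %called = (animal() => 'purrs');   # parentheses also force a call before =>

print "$by_bareword\n$by_call\n", join(',', keys %called), "\n";
```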

Finally, Perl supports a very special kind of string called a version string, which in fact must be unquoted to be treated as such. The format of a version string must resemble a version number, and can only be made of digits and points.

$VERSION = 1.2.34;

Perl will see this as a version number, because it contains more than one decimal point and has no quotes. If the version number has no or only one decimal point, then we can currently use a v prefix to ensure that Perl does not interpret it as a regular integer or floating-point number.

$float   = 5.6;    # oops, that is a floating-point number
$VERSION = v5.6;   # force a version string (but see later)

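Places where a version number is expected, such as the require and use keywords, do not need the prefix, provided the number contains at least two points. (A number with only one point is still read as an ordinary floating-point number here, so require 5.6 would actually demand Perl 5.600.) Take the following line that states that Perl must be at least version 5.6 for the program to run:

require 5.6.0;    # always a version string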

Version strings are likely to be replaced with version objects in Perl 5.10. While these will be semantically similar, the v prefix syntax will be retired and there may no longer be a direct equivalence between a version object and the string representation described previously. Instead, version objects will be created when the syntax expects one.

The special variable $^V ($PERL_VERSION) returns a version string. It contrasts with the older $], which returns a floating-point number for compatibility with older Perl versions (e.g., 5.003). $^V will return a version object in the future, though the usage will be the same.

The purpose of $^V is to allow version numbers to be easily compared without straying into the dangerous world of floating-point comparisons. Take this example that tests the version of Perl itself, and aborts with an error message if it is too old:

# 'require 5.6.0' is another way to do this:
die "Your Perl is too old! Get a new one!\n" if $^V lt 5.6.0;

The characters in a version string are constructed from the digits, so a 1 becomes the character Control-A, or ASCII 1. 5.6.0 is therefore equivalent to the interpolated string "\5\6\0" or the expression chr(5).chr(6).chr(0). This is not a printable string, but it is still a string, so we must be sure to use the string comparison operators like lt (less than) and ge (greater or equal to) rather than their numeric equivalents < and >=.
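A short sketch makes the character-by-character nature of version strings visible. It uses the %vd format of sprintf, which prints the character codes of a string joined by points, to turn the unprintable string back into something readable.

```perl
use strict;
use warnings;

my $version = 5.6.0;                         # a version string literal
my $by_hand = chr(5) . chr(6) . chr(0);      # the same three characters, built explicitly
my $same    = ($version eq $by_hand) ? 'yes' : 'no';

my $readable = sprintf("%vd", $version);     # character codes joined with points
my $is_newer = (5.10.0 gt 5.6.0) ? 'yes' : 'no';   # string comparison, character by character

print "same: $same, readable: $readable, 5.10.0 newer: $is_newer\n";
```

The last comparison is the reason for using version strings at all: as floating-point numbers, 5.10 would compare as smaller than 5.6.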

If we use a version string as the leading argument to the => operator, it is evaluated as a regular string and not a version number string, at least from Perl 5.8 onwards (prior to this it would be evaluated as a version number string). So this expression produces a hash with a key of 'v64' for Perl 5.8 and chr(64) for Perl 5.6:

my %hash=( v64 => "version 64");

Sometimes it is handy to know whether or not a scalar value is a version string. To find out, we can make use of the isvstring routine from the Scalar::Util module.

use Scalar::Util qw(isvstring);
...
print "Is a version string" if isvstring($value);

As version strings will ultimately be replaced with version objects, code that relies on the string representation is risky and likely to break in future versions of Perl. Comparisons and isvstring will continue to work as they do now, but with the additional and more intuitive ability to make numeric rather than string comparisons.

Converting Strings into Numbers

As we have seen, Perl automatically converts strings into an integer or floating-point form when we perform a numeric operation on a string value. This provides us with a simple way to convert a string into a number when we actually want a number, for example, in a print statement, which is happy with any kind of scalar. All we have to do is perform a numeric operation on the string that doesn't change its value, multiplying or dividing by 1 for example. Adding zero is probably the simplest (and in fact is the traditional idiom in Perl).

# define a numeric string
$scalar = '123.4e5';

# evaluate it in original string context
print $scalar;   # produces '123.4e5'

# evaluate it in numeric context
print 0 + $scalar;   # produces '12340000'

If the string does not look like a number, then Perl will do the best job it can, while warning us (if warnings are enabled) that the string is not completely convertible and that some information is being lost, with a message like Argument "123.4e5abc" isn't numeric in addition (+).

print "123.4e5abc" + 0;   # produces '12340000' and a warning

If we actually want to know in advance whether or not a string can be converted into some kind of numeric value, then there are a couple of ways we can do it. If we only need to know whether the string starts numerically, we could extract the numeric part with a regular expression—what pattern we use depends on what kind of numbers we are expecting to parse. For example, this checks for a string starting with an optional sign followed by at least one digit, and extracts it if present:

my ($number) = $string =~ /^([+-]?\d+)/;   # list context returns the captured text

This only handles integers, of course. The more diverse the range of number representations we want to match, the more complicated this becomes. If we want to determine whether the string is fully convertible (in the sense that no information is lost and no warning would be generated if we tried), then we can instead use the looks_like_number routine provided by Perl in the Scalar::Util module:

#!/usr/bin/perl
# lookslikenumber.pl
use Scalar::Util 'looks_like_number';

foreach (@ARGV) {
    print "$_ ";
    print looks_like_number($_)
        ? "looks" : "does not look";
    print " like a number\n";
}

looks_like_number will return a true value if the string can be completely converted to a numeric value, and a false value otherwise. It works by asking Perl's underlying conversion functionality what it thinks of the string, which is much simpler (and to the point) than constructing a regular expression to attempt to match all possible valid numeric strings.
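To see how the routine judges a few typical inputs without needing command-line arguments, we can run it over a fixed list. The candidate strings below are arbitrary samples.

```perl
use strict;
use warnings;
use Scalar::Util 'looks_like_number';

my @candidates = ('42', '-3.5', '123.4e5', '123.4e5abc', 'abc');

# record a yes/no verdict for each candidate string
my %verdict = map { $_ => (looks_like_number($_) ? 'yes' : 'no') } @candidates;

printf "%-12s %s\n", $_, $verdict{$_} foreach @candidates;
```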

Converting Strings into Lists and Hashes

Transforming a string into a list requires dividing it up into pieces. For this purpose, Perl provides the split function, which takes up to three arguments: a pattern to match on, the string that is to be carved up, and the maximum number of pieces to return. With only two arguments, the string is split as many times as the pattern matches. For example, this splits a comma-separated sequence of values into a list of those values:

#!/usr/bin/perl
# splitup.pl
use strict;
use warnings;

my $csv = "one, two, three, four, five, six";
my @list = split ', ' , $csv;
print "@list";

Although it is commonly used to split up strings by simple delimiters like commas, the first argument to split is in fact a regular expression, so it can also be written using the usual regular expression delimiters.

@list = split /, /, $csv;

This also means that we can be more creative about how we define the delimiter. Without delving too deeply into regular expression syntax, to divide up a string with commas and arbitrary quantities of whitespace we can replace the comma with a pattern that absorbs whitespace—spaces, tabs, or newlines—on either side.

@list = split /\s*,\s*/, $csv;

This does not deal with any leading or trailing whitespace on the first and last items (and in particular any trailing newline), but it is effective nonetheless. However, if we want to split on a character that is significant in regular expressions, we have to escape it. The // style syntax helps remind us of this, but it is easy to forget that a pipe symbol, (|), will not split up a pipe-separated string.

$pipesv = "one | two | three | four | five | six";
print split('|', $pipesv);   # prints one | two | three | four | five | six

This will actually return the string as a list of single characters, including the pipes, because | defines alternatives in regular expressions. There is nothing on either side of the pipe, so we are actually asking to match on nothing or nothing, both of which are zero-width patterns (they successfully match no characters). split treats zero-width matches (a pattern that can legally match nothing at all) as a special case, splitting out a single character and moving on each time it occurs. As a result, we get a stream of single characters. This is better than an infinite loop, which is what would occur if Perl didn't treat zero-width matches specially, but it's not what we intended either. Here is how we should have done it:

print split('\|', $pipesv);   # prints one two three four five six
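When the delimiter is held in a variable, escaping it by hand is not possible. A sketch of two alternatives follows: the \Q...\E escape inside a pattern, and the quotemeta function, both of which backslash any characters that are significant in regular expressions. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $pipesv    = "one|two|three";
my $delimiter = '|';    # imagine this value arrived from a configuration file

my @escaped_by_hand = split /\|/, $pipesv;                   # literal pipe, escaped explicitly
my @escaped_inline  = split /\Q$delimiter\E/, $pipesv;       # \Q...\E quotes the interpolated text
my @escaped_by_func = split quotemeta($delimiter), $pipesv;  # same effect via quotemeta

print "@escaped_by_hand\n@escaped_inline\n@escaped_by_func\n";
```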

Having warned of the dangers, there are good uses for alternation too. Consider this split statement, which parses hash definitions in a string into a real hash:

$hashdef = "Mouse=>Jerry, Cat=>Tom, Dog=>Spike";
%hash = split /, |=>/, $hashdef;

Because it uses a regular expression, split is capable of lots of other interesting tricks, including returning the delimiter if we use parentheses. If we do not actually want to include the delimiters in the returned list, we need to suppress it with the extended (?:...) pattern.

# return (part of) delimiters
@list = split /\s*(, |=>)\s*/, $hashdef;
# @list contains 'Mouse', '=>', 'Jerry', ', ', 'Cat', ...
# suppress return of delimiters, handle whitespace, assign resulting
# list to hash
%hash = split /\s*(?:, |=>)\s*/, $hashdef;

Tip Both examples use more complex forms of regular expression such as \s to match whitespace, which we have not fully covered yet; see Chapter 11 for more details on how they work. The last example also illustrates how to define the contents of a hash variable from a list of values, which we will cover in detail when we come to hashes later.


If split is passed a third numeric parameter, then it returns at most that number of pieces, preserving any remaining text, unsplit, as the last returned value.

my $configline = "equation=y = x ** 2 + c";
# split on first = only
my ($key, $value) = split (/=/, $configline, 2);
print "$key is '$value'";   # produces "equation is 'y = x ** 2 + c'"
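The effect of the third argument is easiest to see by varying it on the same string. In this sketch the record and its colon delimiter are made up.

```perl
use strict;
use warnings;

my $record = "alpha:beta:gamma:delta";

my @all   = split /:/, $record;       # no limit: four pieces
my @two   = split /:/, $record, 2;    # 'alpha' and 'beta:gamma:delta'
my @three = split /:/, $record, 3;    # 'alpha', 'beta', and 'gamma:delta'

print scalar(@all), " pieces without a limit\n";
print "limit 2: ", join(' | ', @two), "\n";
print "limit 3: ", join(' | ', @three), "\n";
```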

split also has special one-argument and no-argument modes. With only one argument, it splits the default argument $_, which makes it useful in loops like the while loop that read lines from standard input (or files supplied on the command line), like the short program that follows:

#!/usr/bin/perl
# readconfig.pl
use warnings;
use strict;

my %config;

# read lines from files specified on command line or (if none)
# standard input
while (<>) {
   chomp;                                 # remove the line terminator
   my ($key, $value) = split /\s*=\s*/;   # split $_ on '=', dropping surrounding whitespace
   $config{$key} = $value if $key and $value;
}

print "Configured: ", join(', ', sort keys %config), "\n";

We can invoke this program with the following command:

> readconfig.pl configfile

Let's consider the following configfile:

first = one
second = two

Executing readconfig.pl using the supplied configfile, the following output is returned:


Configured: first, second

If no arguments are supplied at all, split splits the default argument on whitespace characters, after skipping any leading whitespace. The following short program counts words in the specified files or what is passed on standard input using the <> readline operator covered in Chapter 12:

#!/usr/bin/perl
# split.pl
use warnings;
use strict;

my @words;

# read lines from supplied filenames or (if none)
# standard input
while (<>) {
   # read each line into $_ in turn
   push @words, split; # split $_ into words and store them
}

print "Found ", scalar(@words), " words in input\n";
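The difference between the no-argument form and a split on a literal single space shows up as soon as the input has leading or repeated whitespace. The sample line below is invented for the comparison.

```perl
use strict;
use warnings;

my $line = "  the quick  brown\tfox \n";

my @words;
for ($line) {            # alias $_ to the line for the no-argument form
    @words = split;      # skips leading whitespace, splits on runs of whitespace
}

# a pattern of one literal space leaves empty fields, the tab, and the newline behind
my @naive = split / /, $line;

print scalar(@words), " words: @words\n";
```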

Functions for Manipulating Strings

Perl provides many built-in functions for manipulating strings, including split, which we discussed previously. Here is a short list and description of some of the most important of them:

Printing: print

Although perhaps the most obvious entry in the list, the ubiquitous print statement is not technically a string function, since it takes a list of arbitrary scalar values and sends them to a specified filehandle or otherwise standard output (technically, the currently selected filehandle). The general form is one of the following:

print @list;        # print to standard output
print TOFILE @list; # print to filehandle TOFILE

The details of using filehandles with print are covered in Chapter 12, but fortunately we can print to standard output without needing to know anything about them. Indeed, we have already seen print in many of the examples so far.

The output of print is affected by several of Perl's special variables, listed in Table 3-3.

Table 3-3. Special Variables That Control Output

Variable   Action
$,         The output field separator determines what print places between values, by default '' (nothing). Set to ',' to print comma-separated values.
$\         The output record separator determines what print places at the end of its output after the last value, by default '' (nothing). Set to "\n" to print out automatic linefeeds.
$#         The output format for all printed numbers (integer and floating point), in terms of a sprintf style placeholder. The default value is something similar to %.6g. To print everything to two fixed decimal places (handy for currency, for example) change to %.2f, but note that use of $# is now deprecated in Perl.
$|         The autoflush flag determines if line or block buffering should be used. If 0, it is block, if 1, it is line.

Although not directly related to print, it is worth noting that interpolated arrays and slices use the special variable $" (a space by default) as a separator, rather than $,.
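The following sketch shows these separators at work. So that the result can be inspected, it prints to an in-memory filehandle (a scalar reference passed to open) rather than to standard output; that detail is an assumption of the example and not something print requires.

```perl
use strict;
use warnings;

my $buffer = '';
open(my $out, '>', \$buffer) or die "Cannot open in-memory file: $!";

my @values = (1, 2, 3);
{
    local $, = ',';      # placed between the values of a list
    local $\ = "\n";     # appended after each print
    local $" = '-';      # placed between elements of an interpolated array

    print $out @values;       # list: joined with $,
    print $out "@values";     # interpolated array: joined with $"
}
close $out;

print $buffer;
```

Using local inside a block restores the previous values on exit, so the rest of the program is unaffected.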

Beware of leaving off the parentheses of print if the first argument (after the filehandle, if present) is enclosed in parentheses, since this will cause print to use the first argument as an argument list, ignoring the rest of the statement.
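A sketch of this trap follows. Warnings are switched off inside the block only so that the example runs quietly; with them enabled, Perl flags the first statement. Output is redirected to an in-memory filehandle with select so the two results can be compared.

```perl
use strict;
use warnings;

my $buffer = '';
open(my $out, '>', \$buffer) or die "Cannot open in-memory file: $!";
my $previous = select($out);     # make $out the currently selected filehandle

{
    no warnings;                 # only to keep this demonstration quiet
    print (1 + 2) * 3;           # parsed as (print(1 + 2)) * 3, so prints 3
    print "\n";
    print((1 + 2) * 3);          # outer parentheses enclose the whole argument list
    print "\n";
}

select($previous);               # restore the original filehandle
close $out;

print $buffer;
```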

Line Terminator Removal: chop and chomp

The chop and chomp functions both remove the last character from a string. This apparently esoteric feature is actually very handy for removing line terminators. chop is not selective; it will chop off the last character irrespective of what it is or whether it looks like a line terminator or not, returning it to us in case we want to use it for something:

chop $input_string;

The string passed to chop must be an assignable one, such as a scalar variable (or more bizarrely, the return value of substr, shown in a bit, used on a scalar variable), since chop does not return the truncated string but the character that was removed. If no string is supplied, chop uses the default argument $_.

while (<>) {
   chop;
   print "$_\n";
}

Note that if we want to get the string without the terminator but also leave the original intact, we can use substr instead of chop. This is less efficient because it makes a copy of the line, but it preserves the original.

while (<>) {
   my $string = substr $_, 0, -1;
   print $string;
}

chomp is the user-friendly version of chop; it only removes the last character if it is the line terminator, as defined by the input record separator special variable $/, which defaults to \n.

chomp $might_end_in_a_linefeed_but_might_not;

Both chop and chomp will work on lists of strings as well as single ones.

# remove all trailing newlines from input, if present
@lines = <>;
chomp(@lines);

Giving either chop or chomp a nonstring variable will convert it into a string. In the case of chomp, this will do nothing else; chop will return the last digit of a number and turn it into a string missing the last digit.
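The return values of the two functions differ, which the following sketch illustrates: chop returns the character it removed, while chomp returns the number of characters it removed. The sample values are arbitrary.

```perl
use strict;
use warnings;

my $line    = "hello\n";
my $removed = chomp($line);     # one character (the newline) removed
my $again   = chomp($line);     # no terminator left, so nothing removed
my $last    = chop($line);      # removes and returns the final character

my $number  = 12345;
my $digit   = chop($number);    # $number is now the string '1234'

print "chomp: $removed then $again; chop: '$last' leaving '$line'\n";
print "digit: $digit leaving $number\n";
```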

Although the line terminator can be more than one character wide, such as on Windows, chomp will still remove the line feed "character" correctly. However, if we happen to be reading text that was generated on a platform with a different concept of line ending, then we may need to alter Perl's idea of what a line terminator is.

@lines = <>;
{
    local $/ = "\015\012";
    chomp @lines;
}

Here we temporarily give $/ a different value, which expires at the end of the block in which it is placed. This code will strip DOS or Windows linefeeds from a file on a Unix platform within the block, but conversely it will not strip Unix linefeeds, since $/ does not match them within the block. A second chomp after the end of the block will deal with that.
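The two-stage approach can be tried on a pair of in-line strings rather than a real file. This sketch assumes the default value of $/ outside the block is a plain linefeed.

```perl
use strict;
use warnings;

my @lines = ("dos line\015\012", "unix line\012");

{
    local $/ = "\015\012";    # treat carriage return plus linefeed as the terminator
    chomp @lines;             # strips only the DOS-style ending
}
chomp @lines;                 # $/ is back to its default, so the Unix ending goes too

print join('|', @lines), "\n";
```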

Characters and Character Codes: ord and chr

The ord function produces the integer character code for the specified letter. If passed a string of more than one character, it will return the code for the first one. ord will also handle multibyte characters and return a Unicode character code.

print ord('A');   # returns 65

The inverse of ord is chr, which converts a character code into a character.

print chr(65);   # returns 'A'

Tip Note that these examples will only produce this output if ASCII is the default character set. In Japan, for example, the output would be different.


The chr and ord functions will happily handle multibyte Unicode character codes as well as single-byte codes. For example:

my $capital_cyrillic_psi=chr(0x470);

See Chapter 23 for more information on Unicode.
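A brief sketch of ord and chr working as a pair, assuming an ASCII-based platform as the tip above notes. The wide character is not printed, only measured, to avoid depending on the output encoding.

```perl
use strict;
use warnings;

my $code  = ord('A');          # character code of a single character
my $first = ord('ABC');        # only the first character is considered
my $next  = chr($code + 1);    # the character following 'A'

my $wide   = chr(0x470);       # a character beyond the single-byte range
my $length = length($wide);    # still one character long
my $back   = ord($wide);       # round trip back to the code

printf "%d %d %s %d 0x%X\n", $code, $first, $next, $length, $back;
```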

Length and Position: length, index, and rindex

The length function simply returns the length of the supplied string.

$length = length($string);

If the argument to length is not a string, it is converted into one, so we can find out how wide a number will be if printed before actually doing so.

$pi = atan2(1, 0) * 2;   # a numeric value
$length_as_string = length($pi);   # length as string

The index and rindex functions look for a specified substring within the body of a string. They do not have any of the power or flexibility of a regular expression, but by the same token, they are considerably quicker. They return the position of the first character of the matched substring if found, or -1 otherwise (adjusted by the index start number $[, if it was changed from its default value 0).

$string = "This is a string in which looked for text may be found";
print index $string, "looked for";   # produces '26'

We may also supply an additional position, in which case index and rindex will start from that position.

print index $string, "looked for", 30;   # not found, produces -1

index looks forward and rindex looks backward, but otherwise they are identical. Note that unlike arrays, we cannot specify a negative number to specify a starting point relative to the end of the string, nice though that would be.
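The optional start position makes it straightforward to find every occurrence of a substring by resuming the search just past the previous match. The following is a sketch with an invented sample sentence.

```perl
use strict;
use warnings;

my $string = "the cat sat on the mat with the hat";

my @positions;
my $position = 0;
# keep searching from just after the previous match until index reports failure
while (($position = index($string, "the", $position)) != -1) {
    push @positions, $position;
    $position++;
}

my $last = rindex($string, "the");    # search backward from the end

print "found at: @positions; last at: $last\n";
```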

Substrings: substr

The versatile substr extracts a substring from a supplied string in very much the same way that splice (covered in Chapter 5) returns parts of arrays, and indeed the two functions are modeled to resemble each other. substr takes between two and four arguments—a string to work on, an offset to start from, a length, and an optional replacement.

# return four characters starting at position 3 of the string
print substr "1234567890", 3, 4;   # produces 4567

String positions start from 0, like arrays (unless we change the start position number by assigning a new value to the special variable $[). If the length is omitted, substr returns characters up to the end of the string.

print substr "1234567890", 3;   # produces 4567890

Both the offset and the length can be negative, in which case they are both taken relative to the end of the string.

print substr "1234567890", -7, 2;   # produces 45
print substr "1234567890", -7;      # produces 4567890
print substr "1234567890", -7, -2;  # produces 45678

We can also supply a replacement string, either by specifying the new string as the fourth argument, or more interestingly assigning to the substr. In both cases, the new string may be longer or shorter (including empty, if we just want to remove text), and the string will adjust to fit. However, for either to work we must supply an assignable value like a variable or a subroutine or function that returns an assignable value (like substr itself, in fact). Consider the following two examples:

$string = "1234567890";
print substr($string, 3, 4, "abc");
# produces '4567'
# $string becomes '123abc890'

$string = "1234567890";
print substr($string, 3, 4) = "abc";
# produces 'abc8'
# $string becomes '123abc890'

The difference between the two variants is in the value they return. The replacement string version causes substr to return the original substring before it was modified. The assignment on the other hand returns the substring after the substitution has taken place. This will only be the same as the replacement text if it is the same length as the text it is replacing; in the preceding example the replacement text is one character shorter, so the return value includes the next unreplaced character in the string, which happens to be 8.

Attempting to return a substring that extends past the end of the string will result in substr returning as many characters as it can. If the start is past the end of the string, then substr returns the undefined value, with a warning if warnings are enabled. Note also that we cannot extend a string by assigning to a substr beyond the string end (this might be expected since we can do something similar to arrays, but it is not the case).
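To tie these forms together, here is a short sketch that picks fields out of a fixed-layout line and then replaces one of them using the four-argument form. The record layout is invented for the example.

```perl
use strict;
use warnings;

my $record = "2024-01-15 ERROR disk full";

my $date  = substr $record, 0, 10;   # first ten characters
my $level = substr $record, 11, 5;   # five characters from offset 11
my $tail  = substr $record, -9;      # last nine characters

# four-argument form: replace in place and get the old text back
my $old = substr $record, 11, 5, "WARN";

print "$date|$level|$tail\n";   # 2024-01-15|ERROR|disk full
print "$old -> $record\n";      # ERROR -> 2024-01-15 WARN disk full
```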

Upper- and Lowercase: uc, lc, ucfirst, and lcfirst

Perl provides no fewer than four different functions just for manipulating the case of strings. uc and lc convert all characters in a string into upper- and lowercase (all characters that have a case, that is) and return the result.

print uc('upper');   # produces 'UPPER'
print lc('LOWER');   # produces 'lower'

ucfirst and lcfirst are the limited edition equivalents; they only operate on the first letter.

print ucfirst('daniel');   # produces 'Daniel'
print lcfirst('Polish');   # produces 'polish'

If we are interpolating a string, we can also use the special sequences \U...\E and \L...\E within the string to produce the same effect as uc and lc for the characters placed between them. See the section "Interpolation" in Chapter 11 for more details. And speaking of interpolation . . .

Interpolation: quotemeta

The quotemeta function processes a string to make it safe in interpolative contexts. That is, it inserts backslash characters before any nonalphanumeric characters, including $, @, %, existing backslashes, commas, spaces, and all punctuation except the underscore (which is considered an honorary numeric because it can be used as a separator in numbers).
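A small sketch makes the effect visible. The sample strings here are our own; the point is only that the quoted form can be dropped into a pattern and will match the original text literally.

```perl
use strict;
use warnings;

my $literal = 'cost: $5.00 (approx)';
my $quoted  = quotemeta $literal;

print "$quoted\n";   # cost\:\ \$5\.00\ \(approx\)

# the quoted form matches the original text literally inside a pattern
my $text = 'Total cost: $5.00 (approx) per unit';
print "found\n" if $text =~ /$quoted/;
```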

Pattern Matching, Substitution, and Transliteration: m//, s//, and tr//

Perl's regular expression engine is one of its most powerful features, allowing almost any kind of text matching and substitution on strings of any size. It supplies two main functions, the m// match and s/// substitution functions, plus the pos function and a large handful of special variables. For example, to determine if one string appears inside another ignoring case:

$matched = $matchtext =~ /some text/i;

Alternatively, to replace all instances of the word green with yellow:

$text = "red green blue";
$text =~ s/green/yellow/g;
print $text;   # produces 'red yellow blue'

Closely associated but not actually related to the match and substitution functions is the transliteration operator tr///, also known as y///. It transforms strings by replacing characters from a search list with the characters in the corresponding position in the replacement list. For example, to uppercase the letters a to f (perhaps for hexadecimal strings) we could write

$hexstring =~ tr/a-f/A-F/;

Entire books have been written on pattern matching and regular expressions, and accordingly we devote a large part of Chapter 11 to it.
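As a small taste before then, both s///g and tr/// return a count of the changes they made, which is often handy in its own right. The sketch below uses sample strings of our own.

```perl
use strict;
use warnings;

# s///g returns the number of substitutions made
my $text  = "green eggs and green ham";
my $count = ($text =~ s/green/yellow/g);
print "$count: $text\n";          # 2: yellow eggs and yellow ham

# tr/// returns the number of characters it matched
my $hexstring = "deadbeef 0x1f";
my $changed   = ($hexstring =~ tr/a-f/A-F/);
print "$changed: $hexstring\n";   # 9: DEADBEEF 0x1F
```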

Password Encryption: crypt

The crypt function performs a one-way transform of the string passed; it is identical to (and implemented using) the C library crypt on Unix systems, which implements a variation of the Data Encryption Standard (DES) algorithm. crypt is not always available, in which case attempting to use it will provoke a fatal error from Perl. Otherwise, it takes two arguments: the text to be encrypted and a salt, which is a two-character string made of random characters drawn from 0..9, a..z, A..Z, the slash (/), and the period (.). Here is how we can generate a suitable encrypted password in Perl:

@characters = (0..9, 'a'..'z', 'A'..'Z', '.', '/');
# join the two random characters; a slice in scalar context yields only its last element
$encrypted = crypt($password, join('', @characters[rand 64, rand 64]));

Since we do not generally want to use the salt for anything other than creating the password, we instead supply the encrypted text itself as the salt for testing an entered password (which works because the first two characters are in fact the salt).

# check password
die "Wrong!" unless crypt($entered, $encrypted) eq $encrypted;

Note that when actually entering passwords it is generally a good idea not to echo them to the screen. See Chapter 15 for some ways to achieve this.

crypt is not suitable for encrypting large blocks of text; it is a one-way function that cannot be reversed, and so is strictly useful for generating passwords. Use one of the cryptography modules from CPAN like Crypt::TripleDES or Crypt::IDEA for more heavyweight and reversible cryptography.

Low-Level String Conversions: pack, unpack, and vec

Perl provides three functions that perform string conversions at a low level. The pack and unpack functions convert between strings and arbitrary machine-level values, like a low-level version of sprintf. The vec function allows us to treat strings as if they were long binary values, obviating the need to convert to or from actual integer values.

Pack and Unpack

The pack and unpack functions convert between strings and lists of values: the pack function takes a format or template string and a list of values, returning a string that contains a compact form of those values. unpack takes the same format string and undoes the pack, extracting the original values.

pack is reminiscent of sprintf. They both take a format string and a list of values, generating an output string as a result. The difference is that sprintf is concerned with converting values into legible strings, whereas pack is concerned with producing a byte-by-byte string representation of its input values. Whereas sprintf would turn an integer into a textual version of that same integer, for example, pack will turn it into a series of characters whose binary values in combination make up the integer.

$string = pack 'i', $integer;

The format string looks very different from the one used by sprintf. Here, i is just one of several template characters that handle integer values—we will cover the whole list of template characters in a moment. Combining several of these characters together into a format string allows us to convert multiple values and lists of mixed data types into one packed string. Using the same format we can later unpack the string to retrieve the original values.

By default pack and unpack work with values in the machine's native format, so the order in which the bytes come out depends on the processor architecture. V is a template like i, but which always packs integers "little-endian" irrespective of the underlying platform's native order. Here we use V and an explicit integer whose value we want to make sure comes out with the least significant byte first:

print pack 'V', 1819436368;   # produces the string 'Perl' (not 'lreP')

Why does this work? Well, the decimal number is really just an obscure way to represent a 32-bit value that is really 4 bytes, each representing a character code. Here is the same number in hexadecimal, where 50, the last byte, is the ASCII code for P:

print pack 'V', 0x6c726550;

To pack multiple integers we can repeat the i character, or use a repeat count.

$string = pack 'i4', @integers[0..3];

If we supply too many items, the excess values are ignored. If we supply too few, Perl will invent additional zero or empty string values to complete the packing operation.

To pack as many items as the list can supply, we use a repeat count of *.

$string = pack 'i*', @integers;

We can combine multiple template characters into one template. The following packs four integers, a null character (given by x), and a string truncated to 10 bytes with a null character as a terminator (given by Z):

$string = pack 'i4xZ10', @integers[0..3], "abcdefghijklmnop";

We can also add spaces for clarity.

$string = pack 'i4 x Z10', @integers[0..3], "abcdefghijklmnop";

To unpack this list again, we would use something like the following:

($int1, $int2, $int3, $int4, $str) = unpack 'i4xZ10', $string;

Repeat counts can optionally be put in square brackets. Using square brackets also allows a repeat count to be expressed in terms of the size of another template character:

$intstring = pack 'i[4]', @integers; # same as before
$nullstring = pack 'x[d]'; # pack a double's-worth of null bytes

Note that there is a subtle difference between a repeated template character and a repetition count when the character encodes string data. This applies to the a/A and Z template characters. Each of these will absorb as many characters as the repeat count specifies from each input value. So 'a4' will get four characters from the first string value, whereas 'aaaa' will get one each from the first, second, third, and fourth.

print pack 'aaaa', 'First','Second','Third','Fourth';   # produces 'FSTF'
print pack 'a4', 'First','Second','Third','Fourth';     # produces 'Firs'
print pack 'a4a4a4a4', 'First','Second','Third','Fourth';
    # produces 'FirsSecoThirFour'

From Perl 5.8 pack and unpack can also use parentheses to group repeating sequences of template characters. To represent four integers, each prefixed by a single character, we could therefore use '(ci)4' with the equivalent meaning to 'cicicici'.
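A quick way to convince ourselves of this equivalence is to pack the same list both ways and compare the results. The values below are arbitrary sample data.

```perl
use strict;
use warnings;

# four (character, integer) pairs
my @values = (65, 1000, 66, 2000, 67, 3000, 68, 4000);

my $grouped = pack '(ci)4',    @values;
my $spelled = pack 'cicicici', @values;

print "identical\n" if $grouped eq $spelled;

# the grouped template unpacks the pairs again
my @back = unpack '(ci)4', $grouped;
print "@back\n";   # 65 1000 66 2000 67 3000 68 4000
```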

We can also encode and decode a repeat count from the format string itself, using the format count/sequence. Usually the count will be an integer and packed with n or N, while the sequence will be a string packed with a/A or padded with nulls using x. The following pair of statements will encode a sequence of strings provided by @inputstrings and then decode the resultant stream of data back into the original strings again in @outputstrings:

$stream = pack '(n/a*)*', @inputstrings;
@outputstrings = unpack '(n/a*)*', $stream;

NOTE Technically the first * after the a is redundant for unpack, but harmless. Retaining it allows the same format string to be used for both packing and unpacking.
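The following sketch runs that round trip on three short strings of our own choosing and checks the size of the packed stream, which makes the cost of the 2-byte n count visible.

```perl
use strict;
use warnings;

my @inputstrings = ('one', 'two', 'three');
my $stream = pack '(n/a*)*', @inputstrings;

# each string costs a 2-byte count plus its own bytes
print length($stream), "\n";   # 17, that is (2+3) + (2+3) + (2+5)

my @outputstrings = unpack '(n/a*)*', $stream;
print "@outputstrings\n";      # one two three
```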


To encode the count value in readable digit characters instead of integers, simply pack the count using a string template instead.

$stream = pack '(a*/a*)*', 'one','two','three'; #creates '3one3two5three'

Unfortunately, pack is not smart enough to pad the numeric value, so '(a3/a*)*' will not produce '003one003two005three' as we would like and unpack isn't smart enough to decode a variable-length number prefix. So we would have to write a smarter decoder to handle the strings packed by the preceding example (and in addition no string can start with a digit or the decoder will not be able to tell where the count ends and the data begins).

pack and unpack can simulate several of Perl's other functions. For example, the c template character packs and unpacks a single character to and from a character code, in exactly the same way that ord and chr do.

$chr = pack 'c', $ord;
$ord = unpack 'c', $chr;

Just as with ord and chr, the c and C template characters will complain if given a value that is out of range. In the case of c, the range is −128 to 127. For C the range is 0 to 255. The advantage is, of course, that with pack and unpack we can process a whole string at once.

@ords = unpack 'c*', $string;

Similarly, here is how we can use x (which skips over or ignores, for unpack) and a (read as-is) to extract a substring somewhat in the manner of substr:

$substr = unpack "x$position a$length", $string;
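For comparison, this sketch extracts the same four characters with both unpack and substr. The string and offsets are sample values.

```perl
use strict;
use warnings;

my $string = "1234567890";
my ($position, $length) = (3, 4);

# skip $position bytes, then read $length bytes as-is
my $via_unpack = unpack "x$position a$length", $string;
my $via_substr = substr $string, $position, $length;

print "$via_unpack $via_substr\n";   # 4567 4567
```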

pack and unpack support a bewildering number of template characters, each with its own properties, and a number of modifiers that alter the size or order in which they work. Table 3-4 provides a brief list; note that several only make sense with an additional count supplied along with the template character.

Table 3-4. pack and unpack Template Characters

Character Properties
a Arbitrary (presumed string) data, null padded if too short
A Arbitrary (presumed string) data, space padded if too short
b Bit string, ascending order (as used by vec)
B Bit string, descending order
c Signed character (8-bit) value
C Unsigned character (8-bit) value
d Double precision (64-bit) floating-point number
D Long double precision (96-bit) floating-point number (Perl 5.8 onwards)
f Single precision floating-point number
F Perl internal floating-point number type, NV (Perl 5.8 onwards)
h Hex string, byte order low-high
H Hex string, byte order high-low
i Signed integer value (length dependent on C)
I Unsigned integer value (length dependent on C)
j Perl internal integer number type, IV (Perl 5.8 onwards)
J Perl internal unsigned number type, UV (Perl 5.8 onwards)
l Signed long (32-bit) value
L Unsigned long (32-bit) value
n Unsigned short, big-endian (network) order
N Unsigned long, big-endian (network) order
p Pointer to null terminated string
P Pointer to fixed-length string
q Signed quad/long long (64-bit) value
Q Unsigned quad/long long (64-bit) value
s Signed short (16-bit) value
S Unsigned short (16-bit) value
u Uuencoded string
U Unicode character
v Unsigned short, little-endian (VAX) order
V Unsigned long, little-endian (VAX) order
w BER (base 128) compressed integer
x Null/ignore byte
X Backup a byte
Z Null terminated string
@ Fill with nulls to absolute position (must be followed by a number)

If the format starts with 'U', then the input will be packed or unpacked in Unicode. This is true even if the repeat count is zero (so that the first item need not be a Unicode character). Similarly, if the first template character is 'c' or 'C', Unicode does not become the default.

pack 'U8 N C20', $str1, $int, $str2;    # packs both strings as Unicode
pack 'U0 N C20', $int, $str;            # switches on Unicode, no initial string
pack 'C0 U8 N C20', $str1, $int, $str2; # packs first string as Unicode, not second

Neither pack nor unpack has strong opinions about the size, alignment, or endianness of the values that they work with. If it is important to ensure that a string packed on one machine can be interpreted correctly on another, some steps to ensure portability are required.

First, Perl's idea of what size quantities like short integers are can be different from that of the underlying platform itself. This affects the s/S and l/L templates. To force Perl to use the native size rather than its own, suffix the template with an exclamation mark.

s Signed short, Perl format
s! Signed short, native format

To test whether or not Perl and the platform agree on sizes, compare the length of the strings produced by each of the preceding formats. If they are the same, there is agreement.
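That comparison takes only a couple of lines. The sketch below is one way to write it; the variable names are ours, and what it prints depends on the platform it runs on.

```perl
use strict;
use warnings;

# pack a dummy value with each template and measure the result
my $perl_short   = length(pack 's',  0);
my $native_short = length(pack 's!', 0);

if ($perl_short == $native_short) {
    print "Perl and the platform agree: $perl_short bytes\n";
} else {
    print "Perl uses $perl_short bytes, the platform $native_short\n";
}
```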

Second, as pack and unpack deal with strings, they obviously do not know or care about alignment of values. We usually care about this when the intent is to process argument lists and structures for C library routines, which Chapter 20 discusses. Adding an exclamation mark to x or X allows a format string to align forward or backward by the specified size (and therefore only makes sense with a repeat count). Typically, we would use a template character as the repeat count to indicate alignment according to the size of that value. To pack a character and a long in native format, we might use

$struct = pack 'c x![l!] l!', $char, $int;

This pads out the initial character to align to the boundary determined by the size of a long integer in the native format. Different platforms have different requirements for alignment, sometimes depending on what is being aligned, so it is important to remember that the preceding is stating an assumption that long integers align on boundaries derived from the size of a long integer, which is not necessarily so.

Finally, pack and unpack will produce different results as a result of the endianness of the platform when asked to pack integers with s/S, i/I, or l/L. This is because the order of bytes can be different between processors, as we noted at the start when we used V in an example to print "Perl". The v/V (VAX) and n/N (network) template characters always pack short and long integers in the same order, irrespective of the processor architecture.

print pack 'V', 1819436368;   # produces the string 'Perl'
print pack 'N', 1819436368;   # produces the string 'lreP'
print pack 'i', 1819436368;   # warning! platform dependent!

If we want to pack structures for use with communication with C subroutines, we need to use the native formats i or I. If we want to be portable across machines and networks, we need to use a portable template character. As its name implies, the network-order n/N template character is the more logical choice. Unfortunately, floating-point numbers are notoriously difficult to pass between different platforms, and if we need to do this we may be better off serializing the data in some other way (for example, with Data::Dumper, though more efficient modules also exist).
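As a sketch of the portable approach, the following packs a long and a short in network order and dumps the bytes in hexadecimal, so that we can see that the layout does not depend on the machine. The two values are arbitrary.

```perl
use strict;
use warnings;

# a 32-bit and a 16-bit value, both big-endian
my $packet = pack 'N n', 1819436368, 80;

# H* shows every byte as two hex digits, high nybble first
print unpack('H*', $packet), "\n";   # 6c7265500050

my ($long, $short) = unpack 'N n', $packet;
print "$long $short\n";              # 1819436368 80
```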

Vector Strings

Perl also provides the vec function, which allows us to treat a string as if it were a long binary value rather than a sequence of characters. vec treats the whole string as a sequence of bits, with each character holding eight each. It therefore allows us to handle arbitrarily long bit masks and binary values without the constraints of integer size or assumption of byte order.

In operation, vec is somewhat like the substr function, only at the bit level. substr addresses a string by character position and length and returns substrings, optionally allowing us to replace the substring through assignment to a string. vec addresses a string by element position and length and returns bits as an integer, optionally allowing us to replace the bits through assignment to an integer. It takes three arguments: a string to work with, an offset, and a length, exactly as with substr, only now the length is in terms of bits and the offset is in multiples of the length. For example, to extract bits 10 and 11 of a bitstring (counting from 0) with vec, we would write

$twobitflag = vec($bitstr, 5, 2);   # element 5 of width 2 covers bits 10 and 11

The use of the word string with vec is a little stretched; in reality we are working with a stream of bytes in a consistent and platform-independent order (unlike an integer whose bytes may vary in order according to the processor architecture). Each byte contains 8 bits, with the first character being bits 0 to 7, the second being 8 to 15, and so on, so this extracts bits 2 and 3 of the second byte in the string (again counting from 0). Of course, the point of vec is that we do not care about the characters, only the bits inside them.

vec provides a very efficient way to store values with constrained limits. For example, to store one thousand values that may range between 0 and 9 using a conventional array of integers would take up 4 × 1000 bytes (assuming a 4-byte integer), and 1000 characters if printed out to a string for storage. With vec we can fit the values 0 to 9 into 4 bits, fitting 2 to a character and taking up 500 bytes in memory, and saved as a string. Unfortunately, the length must be a power of 2, so we cannot pack values into 3 bits if we only had to store values from 0 to 7.

# a function to extract 4-bit values from a 'vec' string
sub get_value {
   # return flag at offset, 4 bits
   return vec $_[0], $_[1], 4;
}

# get flag 20 from the bitstring
$value = get_value ($bitstr, 20);

It does not matter if we access an undefined part of the string; vec will simply return 0, so we need not worry if we access a value that the string does not extend to. Indeed, we can start with a completely empty string and fill it up using vec. Perl will automatically extend the string as and when we need it.

Assigning to a vec sets the bits from the integer value, rather like a supercharged version of chr. For example, here is how we can define the string Perl from a 32-bit integer value:

# assign a string by character code
$str = chr(0x50). chr(0x65). chr(0x72). chr(0x6c);   # $str = "Perl";

# the same thing more efficiently with a 32-bit value and 'vec'
vec ($str, 0, 32) = 0x50_65_72_6c;

# extract a character as 8 bits:
print vec ($str, 2, 8);  # produces 114 which is the ASCII value of 'r'.

Using this, here is the counterpart to the get_value subroutine for setting flags:

# a function to set 4-bit values into a 'vec' string
sub set_value {
   # set flag at offset, 4 bits
   vec($_[0], $_[1], 4) = $_[2];
}

# set flag 43 in the bitstring
$value = set_value ($bitstr, 43, $value);
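A quick round trip shows the getter and setter working together, starting from an empty string. This sketch defines its own copies of the two subroutines so that it runs on its own, with set_value assigning its third argument to the selected 4-bit element.

```perl
use strict;
use warnings;

# read the 4-bit element at the given offset
sub get_value {
    return vec $_[0], $_[1], 4;
}

# write the 4-bit element; $_[0] is an alias, so the caller's string changes
sub set_value {
    vec($_[0], $_[1], 4) = $_[2];
    return $_[2];
}

my $bitstr = '';
set_value($bitstr, 43, 9);

print get_value($bitstr, 43), "\n";   # 9
print get_value($bitstr, 20), "\n";   # 0, never set
print length($bitstr), "\n";          # 22 bytes hold elements 0 to 43
```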

String Formatting with printf and sprintf

We have already seen some examples of the printf and sprintf functions when we discussed converting numbers into different string representations. However, these functions are far more versatile than this, so here we will run through all the possibilities that these two functions afford us.

The two functions are identical except that sprintf returns a string while printf combines sprintf with print and takes an optional first argument of a filehandle. It returns the result of the print, so for generating strings we want sprintf.

sprintf takes a format string, which can contain any mixture of value placeholders and literal text. Technically this means that they are not string functions per se, since they operate on lists. We cover them here because their job is fundamentally one of string generation and manipulation rather than list processing.

For each placeholder of the form %... in the format, one value is taken from the following list and converted to conform to the textual requirement defined by the placeholder. For example:

# use the 'localtime' function to read the year, month and day
($year, $month, $day) = (localtime)[5, 4, 3];
$year += 1900;   # localtime counts years from 1900
$month += 1;     # localtime counts months from 0
$date = sprintf '%4u/%02u/%02u',  $year, $month, $day;

This defines a format string with three unsigned decimal integers (specified by the %u placeholder). All other characters are literal characters, and may also be interpolated if the string is double quoted. The first is a minimum of four characters wide, padded with spaces. The other two have a minimum width of two characters, padded with leading zeros.
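To see the padding at work without depending on today's date, here is the same format applied to fixed sample values.

```perl
use strict;
use warnings;

my ($year, $month, $day) = (2024, 3, 7);

# month and day are zero padded to two digits
my $date = sprintf '%4u/%02u/%02u', $year, $month, $day;
print "$date\n";        # 2024/03/07

# %4u pads a short value with spaces
my $small = sprintf '%4u', 33;
print "[$small]\n";     # [  33]
```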

printf and sprintf Placeholders

The printf and sprintf functions understand many placeholders for different types of value. Here they are loosely categorized and explained in Tables 3-5 to 3-11. To start with, Table 3-5 shows the placeholders for handling character and string values.

Table 3-5. Character and String Placeholders

Placeholder Description
%c A character (from an integer character code value)
%s A string
%% A percent sign

The placeholders for integer values, shown in Table 3-6, allow us to render numbers signed or unsigned, and in decimal, octal, or hexadecimal.

Table 3-6. Integer and Number Base Placeholders

Placeholder Description
%d Signed decimal integer
%i (Archaic) alias for %d
%u Unsigned decimal integer
%o Unsigned octal integer
%x Unsigned hexadecimal integer, lowercase a..f
%X Unsigned hexadecimal integer, uppercase A..F
%b Unsigned binary integer

In addition, all these characters can be prefixed with l to denote a long (32-bit) value, or h to denote a short (16-bit) value, for example:

%ld Long signed decimal
%hb Short binary

If neither is specified, Perl defaults to whatever size it was built to use (so if 64-bit integers are supported, %d will denote 64-bit integers, for example). The %D, %U, and %O are archaic aliases for %ld, %lu, and %lo. sprintf supports them, but their use is not encouraged.

Extra long (64-bit) integers may be handled by prefixing the placeholder letter with either ll (long long), L (big long), or q (quad).

%lld 64-bit signed integer
%qo 64-bit octal number

This is dependent on Perl supporting 64-bit integers, as we covered in Chapter 1.

Floating-point numbers can be represented in either scientific (exponential) notation or fixed decimal notation. We can either force one or the other representation, or request the best choice for the value being rendered, as shown in Table 3-7.

Table 3-7. Floating-Point Placeholders

Placeholder Description
%e Scientific notation floating-point number, lowercase e
%E Scientific notation floating-point number, uppercase E
%f Fixed decimal floating-point number
%F (Archaic) alias for %f
%g "Best" choice between %e and %f
%G "Best" choice between %E and %f

By their nature, floating-point values are always double precision in Perl, so there is no l prefix. However, quadruple-precision (long double) floating-point values can be handled with the ll or L prefixes.

%llE Long double scientific notation, uppercase E
%Lf Long double fixed decimal

As with 64-bit integers, this is dependent on Perl actually supporting long double values.

As the preceding tables show, much of the functionality of sprintf is related to expressing integers and floating-point numbers in string format. We covered many of its uses in this respect earlier in the chapter. In brief, however, placeholders may have additional constraints to determine the representation of a number or string placed by adding modifiers between the % and the type character, as shown in Table 3-8.

Table 3-8. Width and Precision

Modifier Action
n A number, the minimum field width.
* Take the width for this placeholder from the next value in the list.
.m Precision. This has a different meaning depending on whether the value is string, integer or floating point:
String—The maximum width
Integer—The minimum width
Floating point—Digits after the decimal place
.* Take the precision for this placeholder from the next value in the list.
n.m Combined width and precision.
*.* Take the width and precision for this placeholder from the next two values in the list.

If a string or character placeholder has a width, then strings shorter than the width are padded to the left with spaces (zeroes if 0 is used). Conversely, if a precision is specified and the string is longer, then it is truncated on the right. Specifying both as the same number gives a string of a guaranteed width irrespective of the value, for example, %8.8s.

A floating-point number uses the width and precision in the normal numerical sense. The width defines the width of the field as a whole, and the precision defines how much of it is used for decimal places. Note that the width includes the decimal point, the exponent, and the e or E—for example, %+13.3e.

The precision and width are the same for integers except that a leading . will pad the number with leading zeros in the same way in which 0 (shown later) does. If the integer is wider than the placeholder, then it is not truncated—for example, %.4d.
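The three meanings of precision are easiest to compare side by side. The following sketch uses sample values of our own, with brackets added so that the padding is visible.

```perl
use strict;
use warnings;

my $short  = sprintf '[%8.8s]',  'abc';            # padded:    [     abc]
my $long   = sprintf '[%8.8s]',  'abcdefghijkl';   # truncated: [abcdefgh]
my $float  = sprintf '[%10.3f]', 3.14159;          # [     3.142]
my $padded = sprintf '[%.4d]',   42;               # [0042]
my $wide   = sprintf '[%.4d]',   123456;           # not truncated: [123456]

print "$_\n" for $short, $long, $float, $padded, $wide;
```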

If asterisks are used either for width or precision, then the next value in the list is used to define it, removing it from the list for consideration as a placeholder value.

$a = sprintf "%*.*f", $width, $precision, $float;

Note that a negative width supplied in this way acts as an implicit - (see Table 3-9); a negative precision is treated as if no precision had been given.

Table 3-9. Justification

Character Action
Space Pad values to the left with spaces (right-justify)
0 Pad values to the left with zeros
- Pad values to the right with spaces (left-justify)

Justification is used with a placeholder width to determine how unfilled places are handled when the value is too short. A space, which is the default, pads to the left with spaces, while 0 pads to the left with zeroes, shifting sign or base prefixes to the extreme left if specified. - pads with spaces to the right (even if 0 is also specified). For example:

%04d Pad to four digits with 0
% 8s Pad to eight characters with spaces
%8s The same
%-8s Pad to the right to eight characters with spaces
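Run through sprintf, these justification forms behave as in the following sketch, where brackets mark the edges of each field and the values are samples.

```perl
use strict;
use warnings;

my $zeros = sprintf '[%04d]', 42;       # [0042]
my $right = sprintf '[%8s]',  'perl';   # [    perl]
my $left  = sprintf '[%-8s]', 'perl';   # [perl    ]

print "$_\n" for $zeros, $right, $left;
```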

Number prefixes can be added to numbers to qualify positive numbers with a plus sign, and to indicate the base of nondecimal values, as shown in Table 3-10.

Table 3-10. Number Prefixes

Prefix Action
+ Represent positive numbers with a leading +
# Prefix nondecimal-based integers with 0, 0x, or 0b if they have a nonzero value

Either of these prefixes can be enabled (even on strings, though there is not much point in doing that) by placing them after the % and before anything else. Note that + is for signed integers and that # is for other number bases, all of which treat signed numbers as if they were very large unsigned values with their top bit set. This means that they are exclusive to each other, in theory at least. Both of them are counted in the width of the field, so for a 16-bit binary number plus prefix, allow for 18 characters. For example:

%+4d Give number a sign even if positive
%+04d Signed and padded with zeros
%#018hb 16-bit padded and prefixed short binary integer
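The next sketch shows the prefixes in action on a few sample values. Note how the sign or base prefix is counted in the field width, and how zero padding is placed after it.

```perl
use strict;
use warnings;

my $signed = sprintf '%+4d',   42;    # ' +42'
my $zeroed = sprintf '%+04d',  42;    # '+042'
my $hex    = sprintf '%#x',    255;   # '0xff'
my $octal  = sprintf '%#o',    8;     # '010'
my $binary = sprintf '%#018b', 5;     # '0b0000000000000101'

print "$_\n" for $signed, $zeroed, $hex, $octal, $binary;
```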

Three placeholders do not easily fit into any of the categories so far. Table 3-11 lists them.

Table 3-11. String Length, Pointers, and Version Numbers

Placeholder Description
%n Write length of current output string into next variable
%p Pointer value (memory address of value)
%v Version number string

The %n placeholder is unique among all the placeholders in that it does not write a value into the string. Instead, it works in the opposite direction, and assigns the length of the string generated so far to the next item in the list (which must therefore be a variable).

The %p placeholder is not often used in Perl, since looking at memory addresses is not something Perl encourages, though it can occasionally be useful for debugging references.

Finally, the %v placeholder specifies that the supplied value is converted into a version number string of character codes separated by points, in the format defined by the placeholder (d for decimal, b for binary, and so on). A different separator may be used if a * is used before the v to import it. Note that specifying the separator directly will not work, as v does not conform to the usual rules for placeholders. For example:

printf "v%vd", $^V;            # print Perl's version

printf "%v08b", 'aeiou';       # print letters as 8-bit binary digits
                               # separated by points

printf "%*v8o", '-', 'aeiou';  # print letters as octal numbers
                               # separated by minus signs

The version string placeholder is currently required to print a version string in a recognizable form, but Perl 5.10 is expected to provide a more convenient way with the introduction of version objects. See "Bareword Strings and Version Number Strings," earlier in this chapter, for more details.

Reordering the Input Parameters

From Perl 5.8 onwards, it is possible to reorder the input values so that they appear in the output string in a different order than they were supplied. To specify a particular input value, an order number is inserted in front of the width and separated from it by a dollar sign. Unordered placeholders are evaluated as normal, starting from the front of the list, so the third %s in this example produces 'one', not 'three':

printf '%2$5s,%1$5s,%s','one','two','three'; # produces '  two,  one,one'

Be careful with the dollar signs; in a double-quoted string Perl will try to interpolate them, which is unlikely to produce the intended result. Remember to escape them with backslashes inside double quotes, or stick to single quotes to avoid accidents.
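
The two safe ways of writing an ordered format can be compared side by side. This is only a sketch, with variable names and sample words invented for the purpose:

```perl
use strict;
use warnings;

my @words = ('world', 'hello');

# Single quotes: the dollar signs reach sprintf untouched
my $single = sprintf '%2$s %1$s', @words;        # 'hello world'

# Double quotes: each dollar sign must be escaped
my $double = sprintf "%2\$s %1\$s", @words;      # 'hello world'

# The order number combines with a width and flags as usual
my $aligned = sprintf '%2$5s|%1$-5s|', 'one', 'two';   # '  two|one  |'

print "$single\n$double\n$aligned\n";
```

Without the backslashes, the double-quoted version would be interpolated before sprintf ever saw it, and under use strict it would typically fail to compile at all.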

Schizophrenic Scalars

As we mentioned briefly in the introduction, it is possible to have a scalar variable with numeric and string values that are not direct conversions of each other. One of the more famous examples of this is the system error number or Errno variable $!, which Perl uses to store the result of built-in functions that call into the operating system—open and close are perhaps the most frequently seen cases. In numeric context, $! evaluates to the system error number itself. In string context, $! evaluates to a string describing the error number. This is an appropriate and useful alternate form for the error value to take, but not at all a direct conversion of any kind. The following example shows $! being used in both string and numeric contexts:

$opened_ok=open FILE,"doesnotexist.txt";
unless ($opened_ok) {
    print "Failed to open: $!\n"; # $! in string context: the error message
    exit $!;                      # $! in numeric context: the error number
}
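
We can also observe the two faces of $! without provoking a real failure, by assigning an error number to it ourselves. The sketch below assumes a platform on which the Errno module provides the ENOENT constant (this is true of Unix-like systems and most others); the variable names are ours:

```perl
use strict;
use warnings;
use Errno qw(ENOENT);

# Assigning a number to $! sets the error code directly
$! = ENOENT;

my $number = $! + 0;   # numeric context: the error number
my $text   = "$!";     # string context: the system's message for that number

print "$number: $text\n";
```

The exact number and the wording of the message both depend on the operating system, which is why the example prints them rather than assuming particular values.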

We can create our own scalars with different alternate forms using the dualvar subroutine from the Scalar::Util package. It takes two arguments, the first of which is a number (either integer or floating point) and the second of which is a string. It returns a scalar with both values preset.

use Scalar::Util qw(dualvar);
$PI=dualvar(3.1415926,"PI");
print "The value of $PI is ", $PI+0; #produces "The value of PI is 3.1415926"

The string and numeric values will be used in their respective contexts with neither triggering a new conversion of the other. Perl will still convert the integer or floating-point number (whichever we set initially) to the other numeric type as required. It is important to realize that the dual-valued nature of such a variable is easily destroyed, however. If we assign a new integer value, for example, this will invalidate the string and force a conversion the next time the variable is used in string context. The advantage of dualvar is that it is just an interface to Perl's internal value-handling mechanism, and as such is just as fast as any normal scalar.
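
The fragility of a dual-valued scalar is easy to demonstrate. In this sketch the variable names and the status values are hypothetical, chosen only to make the two forms obviously unrelated:

```perl
use strict;
use warnings;
use Scalar::Util qw(dualvar);

# Numeric and string values that are not conversions of each other
my $status = dualvar(404, 'Not Found');

my $number_before = $status + 0;   # 404
my $string_before = "$status";     # 'Not Found'

# A plain assignment replaces both values at once
$status = 200;

my $number_after = $status + 0;    # 200
my $string_after = "$status";      # '200', an ordinary conversion again

print "$number_before $string_before\n";
print "$number_after $string_after\n";
```

After the assignment, the scalar is an ordinary integer once more, and its string form is simply the converted number.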

For more advanced forms of "clever" scalars we can resort to tie or the overload module, both of which allow us to create an object-oriented class that looks like a scalar but abstracts more complex behavior beneath the surface. We cover both subjects in Chapter 19. Be aware, however, that both of these solutions are intrinsically more heavyweight and much slower than a dual-valued variable.
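
As a brief taste of the overload approach, the following sketch defines a hypothetical Temperature class (the class name, its value field, and the output format are all our own inventions) whose objects supply separate string and numeric forms:

```perl
use strict;
use warnings;

package Temperature;   # hypothetical class, for illustration only

use overload
    '""'     => sub { sprintf '%.1f degrees', $_[0]{value} },  # string form
    '0+'     => sub { $_[0]{value} },                          # numeric form
    fallback => 1;     # derive other operators from these conversions

sub new {
    my ($class, $value) = @_;
    return bless { value => $value }, $class;
}

package main;

my $temp = Temperature->new(21.5);
my $text = "$temp";    # '21.5 degrees'
my $sum  = $temp + 1;  # 22.5

print "$text, $sum\n";
```

Unlike a dualvar, the object computes its forms by calling a subroutine each time, which is where both the extra flexibility and the extra cost come from.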

Summary

In this chapter, we have talked about scalar numbers and strings, and their relationships to functions and modifiers. We were introduced to integers and floating-point numbers, and took a brief look at the use integer pragma. The different types of quotes and quoting operators were discussed, and we learned how to use a here document. We also saw what mathematical functions Perl provides. After seeing how to manipulate strings, we went on to look at low-level string conversions like pack and unpack. All of this has given us the foundation for number and string manipulation that we will need throughout the rest of the book.

In this chapter, we have looked at scalar numbers and strings, how Perl handles them internally, and how they are automatically converted into different forms on demand. We also looked at how to convert scalar values into different forms and the various built-in functions that Perl provides to help us. For integers, we additionally looked at number bases, the use integer pragma, and handling big integers. For floating-point numbers we also examined rounding issues and controlling the format of printed floating-point numbers. For strings, we covered quoting, here documents, bareword strings, version number strings, and the range of built-in functions that Perl provides for string manipulation, ranging from print and printf through to pack and unpack. Finally, we looked at a way to create scalars that carry different numeric and string values simultaneously using the dualvar subroutine.
