How characters are stored in Rust strings, and why they do not allow direct access
How to read string characters or string bytes using iterators
How to extract items from vectors and arrays using iterators
How to get references to items from vectors, arrays, and slices using nonmutating iterators
How to modify items from vectors, arrays, and slices using mutating iterators
A shorthand notation for using iterators in for loops
How to use some iterator adapters: filter, map, and enumerate
How to use some iterator consumers: any, all, count, sum, min, max, and collect
The concepts of iterator chains and lazy processing
String Characters
We already saw that Rust has both static strings and dynamic strings, and that both types share the same character encoding, which is UTF-8. This encoding uses sequences of one to four bytes to represent each Unicode character, so a string is not simply an array of characters; it is an array of bytes that represents a sequence of characters.
But given that s is a string, what’s the meaning of the expression s[0]? Is it the first character of s or the first byte of s?
The function as_bytes converts the string to which it is applied into a slice of immutable u8 numbers. Such conversion has zero runtime cost, because the representation of a string buffer is already that sequence of bytes.
The UTF-8 representation of any ASCII character is just the ASCII code of that character. And so, for the characters a, b, c, 0, 1, and 2, their ASCII value is printed.
The è character is represented by a pair of bytes, having values 195 and 168. And the € character is represented by a sequence of three bytes, having values 226, 130, and 172. Therefore, to get to a character in a given position in a string, it is necessary to scan all the previous characters.
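The byte values quoted above can be checked with a minimal sketch like the following. The helper name byte_values is hypothetical, introduced only for this illustration; as_bytes is the standard library function already described, and to_vec simply copies the resulting slice.

```rust
// Hypothetical helper: copies the UTF-8 bytes of a string into a vector.
fn byte_values(s: &str) -> Vec<u8> {
    // as_bytes only reinterprets the string buffer; to_vec makes the copy.
    s.as_bytes().to_vec()
}

fn main() {
    // ASCII characters take one byte each: their ASCII code.
    println!("{:?}", byte_values("abc012"));
    // The è character takes two bytes and the € character takes three.
    println!("{:?}", byte_values("è€"));
}
```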
This situation is similar to that of text files compared with fixed-record-length files. Using a fixed-record-length file, it is possible to read the record at any position n by seeking to that position, without first reading all the preceding records. But using a variable-line-length file, reading the nth line requires you to read all the preceding lines.
Scanning a String
Therefore, to process the characters of a string, it is necessary to scan them.
Let’s assume that, given the string “€èe,” we want to print the third character. First we must scan three bytes to get the first character, because the € character is represented by a sequence of three bytes; then we must scan two further bytes to get the second character, because the è character is represented by a sequence of two bytes; then we must scan one further byte to get the third character, because the e character is represented by a sequence of just one byte.
So, we need a way to get the next character of a string, given the current position, and to advance the current position past the character just read.
In computer science, the objects that extract an item at the current position in a sequence, and then advance that position, are named iterators (or sometimes cursors). Therefore, we need a string iterator.
This program first defines a function whose purpose is to receive a string s and a number n, and then to print the character of s at position n (counting from 0), if there is a character at such position, or else to do nothing. The last two lines of the program invoke such a function to print the first and the third character of €èe, and so the program prints €e.
The Rust standard library provides a string iterator type named Chars. Given a string s, you get an iterator over s by evaluating s.chars(), as is done in the second line of the preceding program.
Any iterator has the next function. Such a function returns the next item of the underlying sequence at the current position, and advances the current position.
Some sequences like the range 1.. do not have an end, but most sequences do have an end, like vectors, arrays, and slices. An iterator cannot return the next value when the end of the sequence has been reached. So when an iterator has reached such an end, it must be capable of communicating that there are no more items to return.
To consider the possibility of having finished the sequence, the next function of Rust iterators returns a value of Option<T> type. That value is None if the sequence has no more items.
Using the match statement, the Some case causes the processing of the next character of the string, and the None case causes the exit from the otherwise infinite loop.
If the function argument n was 0, the first character of the string must be printed, and so, at the first iteration of the loop, the value of the c variable would be printed, and the loop would be quit. For any other value of n, nothing is done with that character.
After the match statement, the n counter, which was mutable, is decremented so that, when it reaches 0, the required character to print is also reached.
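The loop just described can be sketched as follows. This is a reconstruction under stated assumptions, not necessarily the exact listing being discussed: the names nth_char and print_nth_char are hypothetical, and the search returns an Option<char> instead of printing directly, so that the result can be checked.

```rust
// Hypothetical helper: returns the character of s at position n
// (counting from 0), or None if s has fewer than n + 1 characters.
fn nth_char(s: &str, mut n: u32) -> Option<char> {
    let mut iter = s.chars();
    loop {
        match iter.next() {
            Some(c) => {
                if n == 0 {
                    return Some(c);
                }
            }
            // The string is finished: leave the otherwise infinite loop.
            None => return None,
        }
        // Reached only when n is not 0, so the counter cannot underflow.
        n -= 1;
    }
}

// Hypothetical helper: prints the character at position n, if there is one.
fn print_nth_char(s: &str, n: u32) {
    match nth_char(s, n) {
        Some(c) => print!("{}", c),
        None => {}
    }
}

fn main() {
    print_nth_char("€èe", 0);
    print_nth_char("€èe", 2);
    println!();
}
```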
For every character, the character itself is printed, along with its numeric code.
Using String Iterators in for Loops
This program behaves exactly like the previous one, and the compiler typically generates equivalent machine code for it, but it is much clearer for a human reader.
It appears that the expression after the in keyword in a for loop can be an iterator.
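As a minimal sketch of a for loop driven by a string iterator, the following function pairs every character with its numeric code. The name char_codes is hypothetical, and collecting the pairs into a vector (rather than printing them inside the loop) is a choice made here only to keep the result easy to inspect.

```rust
// Hypothetical helper: pairs each character of s with its numeric code.
fn char_codes(s: &str) -> Vec<(char, u32)> {
    let mut codes = Vec::new();
    // The expression after the in keyword is an iterator over the characters.
    for c in s.chars() {
        codes.push((c, c as u32));
    }
    codes
}

fn main() {
    for (c, code) in char_codes("€èe") {
        println!("{}: {}", c, code);
    }
}
```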
But what exactly is an iterator? It is not a type, but, in a way, it is a type specification. An iterator is considered to be any expression that has a next method, with no arguments, returning an Option<T> value.
The first line is valid, as std::ops::Range<u32> is an iterator.
The second line is also valid, as std::ops::RangeFrom<u32> is an iterator.
The third and fourth lines wouldn’t be legal, as std::ops::RangeTo<u32> and std::ops::RangeFull are not iterators.
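The difference between these range types can be illustrated by calling next explicitly. This is only a sketch: the helper names drain_range and first_two are hypothetical, and the bounds used are arbitrary.

```rust
use std::ops::{Range, RangeFrom};

// Hypothetical helper: calls next on a bounded range until it returns None.
fn drain_range(mut range: Range<u32>) -> Vec<u32> {
    let mut items = Vec::new();
    loop {
        match range.next() {
            Some(n) => items.push(n),
            None => break,
        }
    }
    items
}

// Hypothetical helper: takes only the first two items of an endless range.
fn first_two(mut range: RangeFrom<u32>) -> (Option<u32>, Option<u32>) {
    let first = range.next();
    let second = range.next();
    (first, second)
}

fn main() {
    println!("{:?}", drain_range(0..3));
    println!("{:?}", first_two(5..));
    // By contrast, (..3).next() and (..).next() do not compile:
    // RangeTo and RangeFull have no starting point, so they have no next.
}
```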
It will print: 226 130 172 195 168 101 . The first three numbers represent the € character; the next two numbers represent the è character; and the last byte, 101, is the ASCII code of the e character.
While the chars function, seen previously, returns a value whose type is std::str::Chars, the bytes function, used here, returns a value whose type is std::str::Bytes.
Both Chars and Bytes are string iterator types, but while the next function of Chars returns the next character of the string, the next function of Bytes returns the next byte of the string.
These string functions are both different from the as_bytes function, which returns a slice reference on the bytes of the string.
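A minimal sketch of scanning a string byte by byte follows. The helper name byte_list is hypothetical; building a String instead of printing inside the loop is a choice made here so that the result can be checked.

```rust
// Hypothetical helper: lists the bytes of s, each followed by a space.
fn byte_list(s: &str) -> String {
    let mut out = String::new();
    // bytes returns an iterator; each call to next yields one u8.
    for byte in s.bytes() {
        out.push_str(&format!("{} ", byte));
    }
    out
}

fn main() {
    println!("{}", byte_list("€èe"));
    // as_bytes, by contrast, returns a slice reference, not an iterator.
    let slice: &[u8] = "€èe".as_bytes();
    println!("{:?}", slice);
}
```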
The Rust Editions
The Rust language evolves in major versions named “editions”. So far three editions of Rust have been released: the 2015 Edition, the 2018 Edition, and the 2021 Edition. The latest of these was released on October 21, 2021.
By default, the Rust compiler accepts code written according to the 2015 Edition. So far, all the code we wrote was accepted by all three editions of Rust. However, starting from the next section and continuing in the ensuing chapters, some features supported only by the 2021 Edition will be used. Therefore, the compiler must be instructed to support that edition of the language.
Iterators over Vectors, Arrays, and Slices
It will print: 11 21 31 .
The into_iter function, when applied to a vector of i32 items, returns an iterator of type IntoIter<i32>. This iterator generates values taken from that vector.
Previously, we said that any type that implements the function next is said to be an iterator. Vectors do not implement the function next, so they are not iterators, although they implement the function into_iter, which returns an iterator. Any type that implements the function into_iter is said to be iterable.
It will print: 11 21 31 .
The into_iter function, when applied to an array, generates an iterator over that array.
It will print: 41 51 . Only the two items contained in the slice are passed to the loop.
Notice that, when it is applied to a vector or to an array, the into_iter function returns an IntoIter iterator, which generates values taken from such sequences, as expected. Instead, when the into_iter function is applied to a slice, it returns an Iter iterator, which returns references to the items contained in the slice. The reason for this is that, in general, it may not be allowed to extract items from a slice.
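The contrast between values and references can be sketched as follows. The helper names are hypothetical, and the starting numbers are assumptions chosen to be consistent with the outputs quoted above.

```rust
// Hypothetical helper: into_iter on a vector yields the items themselves.
fn plus_one_from_vec(v: Vec<i32>) -> Vec<i32> {
    let mut out = Vec::new();
    for item in v.into_iter() {
        // item has type i32.
        out.push(item + 1);
    }
    out
}

// Hypothetical helper: into_iter on a slice yields references to the items.
fn plus_one_from_slice(s: &[i32]) -> Vec<i32> {
    let mut out = Vec::new();
    for item_ref in s.into_iter() {
        // item_ref has type &i32, so it is dereferenced explicitly.
        out.push(*item_ref + 1);
    }
    out
}

fn main() {
    let a = [30, 40, 50, 60];
    println!("{:?}", plus_one_from_vec(vec![10, 20, 30]));
    // Only the two items contained in the slice are visited.
    println!("{:?}", plus_one_from_slice(&a[1..3]));
}
```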
It generates the compilation error cannot assign twice to immutable variable `item`.
It will print: 11 21 31 .
This code does not change the original vector, because the increment in the third line acts on a value that has been extracted from the vector.
Iterators Generating References
The examples in the previous sections used iterators that extracted values from the iterated objects, with those being strings, string slices, vectors, or arrays.
It will print: 11 21 31 .
The iter function, applied to an object of type Vec<i32>, returns an iterator that generates items whose type is &i32. More generally, the iter function returns an iterator that generates items whose type is a reference to the items contained in the iterable to which it is applied.
It will print: 11 21 31 ; 11 21 .
Actually, for slices, the iter function is equivalent to the into_iter function, as both return an iterator that generates references.
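A minimal sketch of an iterator that generates references follows. The helper name plus_one_via_iter is hypothetical, and the numbers are assumptions consistent with the outputs quoted above.

```rust
// Hypothetical helper: visits the items through references and
// lists each item incremented by one, followed by a space.
fn plus_one_via_iter(items: &[i32]) -> String {
    let mut out = String::new();
    for item_ref in items.iter() {
        // item_ref has type &i32; the items stay where they are.
        out.push_str(&format!("{} ", *item_ref + 1));
    }
    out
}

fn main() {
    let v = vec![10, 20, 30];
    // v is only borrowed, so it is still usable after the first call.
    print!("{}", plus_one_via_iter(&v));
    print!("; ");
    println!("{}", plus_one_via_iter(&v[0..2]));
}
```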
Iterations without Mutation
So far we used iterators over sequences only to read the items contained in such sequences, and this is quite typical.
When iterating over the characters of a string, it is unreasonable to try to change such characters, as the new characters may be represented by a different number of bytes than the existing characters. For example, if an è character is replaced by an e character, two bytes would be replaced by just one byte. Therefore, the Rust standard library has no way to change a string character by character using a character string iterator.
In addition, when iterating over the bytes of a string, it is unsafe to try to change such bytes, as the new bytes may result in a sequence that is not a valid UTF-8 string. Therefore, the Rust standard library has no way to change a string byte by byte using a byte string iterator.
As we have seen, with the iterator obtained using the into_iter or the iter functions on a vector, an array or a slice, you are not allowed to change the items inside such sequences, even if such sequences are mutable.
However, in a mutable vector, array, or slice, you can change a single item by accessing it using its index. An iterator is just another tool to access the items of a sequence. So it is reasonable to desire changing the items of a sequence using an iterator. Such kinds of iterators are shown in the next section.
Iterations with Mutation
So, there is the need to change the values of a sequence using an iterator, though, to such end, a mutable iterator is of no help. In fact, a mutable iterator is an object that can be made to iterate over another sequence, not an object that can be used to mutate the sequence that it iterates over.
It will print: 3 4 5 ; 7 8 .
The mutable variable iterator first refers to the sequence slice1 and then to the sequence slice2. If you remove the mut clause in the third line, you will get the compilation error cannot assign twice to immutable variable `iterator`.
An iterator of type Iter is similar to a reference, in that a mutable reference is not the same concept as a reference to a mutable object.
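The meaning of a mutable iterator can be sketched as follows. The function name scan_two_slices is hypothetical, and the output is built as a String so that it can be checked; what matters is that the mut keyword lets the variable be reassigned, not that it lets the slices be changed.

```rust
// Hypothetical helper: scans two slices using the same iterator variable.
fn scan_two_slices(slice1: &[i32], slice2: &[i32]) -> String {
    let mut out = String::new();
    // mut allows the variable to be reassigned later;
    // it does not allow changing the items of the slices.
    let mut iterator = slice1.iter();
    for item_ref in iterator {
        out.push_str(&format!("{} ", *item_ref));
    }
    out.push_str("; ");
    // The same variable now iterates over another sequence.
    iterator = slice2.iter();
    for item_ref in iterator {
        out.push_str(&format!("{} ", *item_ref));
    }
    out
}

fn main() {
    println!("{}", scan_two_slices(&[3, 4, 5], &[7, 8]));
}
```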
So if you want to change the values in a sequence through an iterator over such a sequence, you cannot use a normal (mutable or immutable) iterator.
It will print: [4, 5, 6].
The iter_mut function returns an object of type IterMut<i32>.
Think of the purpose of the iter function as “get an iterator that generates references to items to read,” and the purpose of the iter_mut function as “get an iterator that generates references to items to read or to write.”
Notice that the v variable has been declared as mutable. As the purpose of the iter_mut function is to allow changes to the iterated object, such object must be mutable. If you remove the mut clause in the first line, you get the compilation error cannot borrow `v` as mutable, as it is not declared as mutable.
We have seen the use of the iter_mut function applied to a vector. Similar functions exist for arrays and slices.
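A minimal sketch of mutation through an iterator follows. The helper name increment_all is hypothetical; it takes a mutable slice so that the same code can serve both a vector and an array.

```rust
// Hypothetical helper: adds one to every item, in place.
fn increment_all(items: &mut [i32]) {
    for item_ref in items.iter_mut() {
        // item_ref has type &mut i32, so the item can be written through it.
        *item_ref += 1;
    }
}

fn main() {
    // The variables must be declared as mutable to be borrowed mutably.
    let mut v = vec![3, 4, 5];
    increment_all(&mut v);
    println!("{:?}", v);

    let mut a = [3, 4, 5];
    increment_all(&mut a);
    println!("{:?}", a);
}
```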
Shorthand for Using Iterators in Loops
When using for loops, there is a more compact syntax for using iterators.
In this code, the call to into_iter has been removed. If a value that implements the into_iter function is passed to a for statement, such function is implicitly invoked.
In this code, the call to iter has been removed, and an “&” sign has been added. If a reference to a value that implements the iter function is passed to a for statement, such function is implicitly invoked.
In this code, the call to iter_mut has been removed, and an “&mut” clause has been added. If a reference to a mutable value that implements the iter_mut function is passed to a for statement, such function is implicitly invoked.
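The three shorthand forms can be seen side by side in the following sketch. The function name shorthand_demo and the numbers are hypothetical; each comment names the explicit call that the for statement performs implicitly.

```rust
// Hypothetical demo: uses the three shorthand forms on the same vector.
fn shorthand_demo() -> (Vec<i32>, i32) {
    let mut v = vec![10, 20, 30];

    // Same as: for item_ref in v.iter_mut()
    for item_ref in &mut v {
        *item_ref += 1;
    }

    // Same as: for item_ref in v.iter()
    let mut total = 0;
    for item_ref in &v {
        total += *item_ref;
    }

    // Same as: for item in v.into_iter()
    let mut moved = Vec::new();
    for item in v {
        moved.push(item);
    }

    (moved, total)
}

fn main() {
    println!("{:?}", shorthand_demo());
}
```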
Iterator Generators
So far we have encountered five functions that get a sequence and return an iterator: chars, bytes, into_iter, iter, and iter_mut. Functions that can be applied to a value that is not an iterator, but that return an iterator are named iterator generators, because they transform a noniterator into an iterator.
An Iterator Adapter: filter
Let’s see some other uses of iterators.
Here is a problem that can be solved using iterators: Given an array of numbers, how can I print all the negative numbers of such an array?
It will print: -8 -31 .
The filter function is in the standard library. It is to be applied to an iterator, and it takes a closure as argument. As its name suggests, the purpose of this function is filtering the iterated sequence, that is, to discard the items that do not satisfy the criterion implemented by the closure, and let pass only the items that satisfy such criterion.
The filter function gets an item at a time from the iterator, and invokes the closure once for every item, passing to the closure a reference to the current item. In our example, the reference to the current item, which is an integer number, is assigned to the x_ref closure argument.
The closure must return a Boolean that indicates whether the item is accepted (true) or rejected (false) by the filtering. The rejected items are destroyed, while the accepted ones are passed to the surrounding expression.
In fact, the filter function returns an iterator that (when its next function is invoked) produces just the items for which the closure returned true.
As we were interested in accepting only the negative numbers, the condition inside the closure is *x_ref < 0, because we want to compare with zero the item, not its reference.
We said that the filter function returns an iterator. Therefore we can use it inside a for loop, where we used to use iterators.
Because the filter function gets an iterator and returns an iterator, it can be seen that it transforms an iterator into another iterator. Such iterator transformers are usually named iterator adapters. The term adapter recalls that of electrical connectors: if a plug does not fit a socket, you use an adapter.
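A minimal sketch of the filter adapter used inside a for loop follows. The function name negatives is hypothetical, and a vector stands in for the array so that the sketch behaves identically in every Rust edition; the closure argument is named x_ref as in the description above.

```rust
// Hypothetical helper: keeps only the negative numbers.
fn negatives(numbers: Vec<i32>) -> Vec<i32> {
    let mut out = Vec::new();
    // The closure receives a reference to each item, hence the dereference.
    for n in numbers.into_iter().filter(|x_ref| *x_ref < 0) {
        out.push(n);
    }
    out
}

fn main() {
    for n in negatives(vec![66, -8, 43, 19, 0, -31]) {
        print!("{} ", n);
    }
    println!();
}
```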
The map Iterator Adapter
Here is another problem that can be solved using iterators: Given an array of numbers, how can you print the double of each number of that array?
It will print: 132 -16 86 38 0 -62 .
The map function is another iterator adapter in the standard library. Its purpose is to transform the values produced by an iterator into other values. Unlike with the filter function, the value returned by the closure can be of any type. That value is the transformed item.
Actually, the map function returns a newly created iterator that produces all the items returned by the closure received as an argument.
While the filter adapter removes some items of the iterated sequence, and it keeps the others unchanged, the map adapter does not remove any items, but it transforms them.
Another difference between them is that while filter passes a reference as the argument of its closure, map passes a value.
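A minimal sketch of the map adapter follows. The function name doubled is hypothetical, and a vector stands in for the array so that the sketch behaves identically in every Rust edition.

```rust
// Hypothetical helper: produces the double of every number.
fn doubled(numbers: Vec<i32>) -> Vec<i32> {
    let mut out = Vec::new();
    // The closure receives each item by value and returns the new value.
    for n in numbers.into_iter().map(|x| x * 2) {
        out.push(n);
    }
    out
}

fn main() {
    for n in doubled(vec![66, -8, 43, 19, 0, -31]) {
        print!("{} ", n);
    }
    println!();
}
```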
The enumerate Iterator Adapter
It will print: 0 a, 1 b, 2 c, .
It will print: a, b, c, .
In the second line, the loop variable is actually a tuple of two variables: the index variable, having type usize; and the ch variable, having type char. At the first iteration, the index variable gets the value 0, while the ch variable gets the first character of the array. At every iteration, both index and ch receive new values.
This works because the enumerate function takes an iterator and returns another iterator. At each iteration, this returned iterator returns a value of type (usize, char). This tuple has a counter as its first field, and as its second field the same item received from the first iterator.
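A minimal sketch of the enumerate adapter follows. The function name numbered is hypothetical, and a vector of characters stands in for the array so that the item type is (usize, char) in every Rust edition.

```rust
// Hypothetical helper: lists every character preceded by its position.
fn numbered(chars: Vec<char>) -> String {
    let mut out = String::new();
    // Each item is a tuple: a counter starting from 0, and the original item.
    for (index, ch) in chars.into_iter().enumerate() {
        out.push_str(&format!("{} {}, ", index, ch));
    }
    out
}

fn main() {
    println!("{}", numbered(vec!['a', 'b', 'c']));
}
```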
An Iterator Consumer: any
Given a string, how can you determine if it contains a given character?
It will print: "Hello, world!" does not contain 'R'.
It does so because character comparison is case sensitive. But if you replace the uppercase R in the second line with a lowercase r, it will print: "Hello, world!" contains 'r'.
Here, the contains variable and the loop that possibly sets it to true have been removed; and the only other use of such a variable has been replaced by the expression s.chars().any(|c| c == ch).
As the only purpose of the contains variable was to indicate if the s string contained the ch character, the expression that replaces it must also have the same value.
We know that the s.chars() expression is evaluated to an iterator over the characters of the s string. Then the any function, which is in the standard library, is applied to such iterator. Its purpose is to determine if a Boolean function (a.k.a. predicate) is true for any item produced by the iterator.
The any function receives a closure as an argument. It applies that closure to every item received from the iterator, and it returns true as soon as the closure returns true on an item, or returns false if the closure returns false for all the items.
Therefore, such a function tells us if any item satisfies the condition specified by the closure.
It will print: false true .
Notice that while the iterator adapters seen previously returned iterators, the any function is applied to an iterator, but it returns a Boolean, not an iterator.
Every function that is applied to an iterator but does not return an iterator is called an iterator consumer, because it gets data from an iterator but does not put it into another iterator, so it consumes data instead of adapting it.
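A minimal sketch of the any consumer follows. The helper names contains_char and any_negative are hypothetical, and the numbers are arbitrary examples.

```rust
// Hypothetical helper: tells whether s contains the character ch.
fn contains_char(s: &str, ch: char) -> bool {
    s.chars().any(|c| c == ch)
}

// Hypothetical helper: tells whether at least one number is negative.
fn any_negative(numbers: &[i32]) -> bool {
    // iter generates references, so each item is dereferenced.
    numbers.iter().any(|n| *n < 0)
}

fn main() {
    // Character comparison is case sensitive.
    println!("{}", contains_char("Hello, world!", 'R'));
    println!("{}", contains_char("Hello, world!", 'r'));
    print!("{} ", any_negative(&[45, 8, 2, 6]));
    println!("{}", any_negative(&[45, 8, -2, 6]));
}
```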
The all Iterator Consumer
With the any function, you can determine if at least one iterated item satisfies a condition. And how can you determine if all iterated items satisfy a condition?
It will print: true false .
Notice that while the any function means a repeated application of the OR logical operator, the all function means a repeated application of the AND logical operator.
Notice also that, following the rules of logic, if the any function is applied to an iterator that does not return any item, the any function returns false, whatever its closure is. Similarly, the all function returns true when applied to an empty iterator, whatever its closure is.
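A minimal sketch of the all consumer, including the behavior on an empty sequence, follows. The helper name all_positive is hypothetical, and the numbers are arbitrary examples.

```rust
// Hypothetical helper: tells whether every number is positive.
fn all_positive(numbers: &[i32]) -> bool {
    numbers.iter().all(|n| *n > 0)
}

fn main() {
    print!("{} ", all_positive(&[45, 8, 2, 6]));
    println!("{}", all_positive(&[45, 8, -2, 6]));

    // On an empty sequence the closures are never called:
    // any returns false and all returns true, whatever the closure is.
    let empty: [i32; 0] = [];
    println!("{}", empty.iter().any(|_| true));
    println!("{}", empty.iter().all(|_| false));
}
```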
The count Iterator Consumer
Given an iterator, how do you know how many items it will produce?
Well, if you have a vector, an array, or a slice, you would best use the len function of such objects, as it is the simplest and fastest way to get their lengths. But if you want to know how many characters there are in a string, you must scan it all, because the number of characters comprising a string is not stored anywhere, unless you stored it yourself.
It will print: 3 6, meaning that this string contains three characters represented by six bytes.
The count iterator consumer does not get any arguments, and it always returns a usize value.
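A minimal sketch of the count consumer follows. The helper name char_and_byte_counts is hypothetical.

```rust
// Hypothetical helper: counts the characters and the bytes of s.
fn char_and_byte_counts(s: &str) -> (usize, usize) {
    // Both counts scan the whole string through an iterator.
    (s.chars().count(), s.bytes().count())
}

fn main() {
    let (chars, bytes) = char_and_byte_counts("€èe");
    println!("{} {}", chars, bytes);
}
```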
The sum Iterator Consumer
Here, the value returned by sum is assigned to a variable having type i32, so it must return a value of such type.
It will print: 0.
Notice that while the count function was applicable to any iterator, the sum function is applicable only to iterators that produce addable items. The statement [3.4].into_iter().sum::<f64>(); is valid, while the statement [true].into_iter().sum::<bool>(); is illegal, because it is not allowed to sum Booleans.
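A minimal sketch of the sum consumer follows. The helper name total and the numbers are hypothetical; iter is used here, and the standard library also allows summing references to numbers.

```rust
// Hypothetical helper: adds up all the numbers of a slice.
fn total(numbers: &[i32]) -> i32 {
    // The return type i32 tells sum which type of result to produce.
    numbers.iter().sum()
}

fn main() {
    println!("{}", total(&[45, 8, -2, 6]));
    // The sum of an empty sequence is zero.
    println!("{}", total(&[]));
    // The result type can also be given explicitly.
    println!("{}", [3.4].iter().sum::<f64>());
}
```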
The min and max Iterator Consumers
It will print: -2 45 ---.
It will print: brave world . Those two words are, respectively, the first and last of the array in alphabetical order.
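A minimal sketch of the min and max consumers follows. The helper names and the sample data are assumptions chosen to be consistent with the outputs quoted above; both consumers return an Option, because an empty sequence has no minimum and no maximum.

```rust
// Hypothetical helper: shows the smallest and largest number,
// or three dashes if the slice is empty.
fn smallest_and_largest(numbers: &[i32]) -> String {
    match (numbers.iter().min(), numbers.iter().max()) {
        (Some(low), Some(high)) => format!("{} {}", low, high),
        _ => String::from("---"),
    }
}

// Hypothetical helper: shows the first and last word in alphabetical order.
fn alphabetical_extremes(words: &[&str]) -> String {
    match (words.iter().min(), words.iter().max()) {
        (Some(first), Some(last)) => format!("{} {}", first, last),
        _ => String::new(),
    }
}

fn main() {
    println!("{}", smallest_and_largest(&[45, 8, -2, 6]));
    println!("{}", smallest_and_largest(&[]));
    println!("{}", alphabetical_extremes(&["hello", "brave", "new", "world"]));
}
```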
The collect Consumer
The any, all, count, sum, min, and max iterator consumers return simple information regarding a possibly long sequence of items.
It will print: [36, 1, 15, 9, 4].
The collect function has created a new Vec<i32> object, and it has pushed into it all the items received from the iterator.
The collect function can be used to put items into various kinds of variable size collections, like vectors, linked lists, hashtables (but excluding arrays). Therefore, it is a generic function, parameterized by the type of destination collection. The expression it.collect::<Vec<i32>>() means that the numbers generated by the it iterator will be collected into a vector.
In this code, the type i32 has been replaced by the don’t-care symbol _.
The second and third statements apply the chars function to a string, obtaining an iterator producing characters. But the second statement collects those characters into a String object, while the third statement collects them into a vector of characters.
The second statement uses the bytes function to obtain an iterator producing bytes. Then those bytes, which are the representation of the characters, are collected into a vector.
The third statement uses the as_bytes function to see the string as a slice of bytes. Next, the iter function is used to obtain an iterator over such slice, producing references to bytes. Then, such references to bytes are collected into a vector. When a vector of references is printed for debugging, the referenced objects are actually printed.
Notice that the collect function cannot be used to put the iterated items into a static string, an array, or a slice, because it needs to allocate the needed space at runtime, and such sequences cannot allocate heap memory.
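The destinations mentioned above can be tried with the following sketch. The helper names are hypothetical, and a vector is used as the source of numbers so that the sketch behaves identically in every Rust edition.

```rust
// Hypothetical helper: collects the characters of s into a vector.
fn to_char_vec(s: &str) -> Vec<char> {
    s.chars().collect::<Vec<char>>()
}

// Hypothetical helper: collects the bytes of s into a vector.
fn to_byte_vec(s: &str) -> Vec<u8> {
    // The item type can be left to inference with the _ symbol.
    s.bytes().collect::<Vec<_>>()
}

// Hypothetical helper: collects the characters of s into a new String.
fn rebuild(s: &str) -> String {
    s.chars().collect::<String>()
}

fn main() {
    let numbers = vec![36, 1, 15, 9, 4].into_iter().collect::<Vec<i32>>();
    println!("{:?}", numbers);
    println!("{:?}", to_char_vec("€èe"));
    println!("{:?}", to_byte_vec("€èe"));
    println!("{}", rebuild("€èe"));
    // Collecting references: the referenced bytes are what gets printed.
    let refs = "€èe".as_bytes().iter().collect::<Vec<&u8>>();
    println!("{:?}", refs);
}
```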
Iterator Chains
Assume you have an array of numbers, and you want to create a vector containing only the positive numbers of such an array, multiplied by two.
It will print: [132, 86, 38].
This last version shows a programming pattern that is typical of functional languages: the iterator chain.
Such a chain begins with an iterator, or with an iterator generator that creates an iterator from a sequence; it proceeds with zero or more iterator adapters; and it ends with an iterator consumer that closes the chain.
We saw several iterator generators: chars, bytes, into_iter, iter, and iter_mut; and we saw ranges, which are iterators with no need to be created by a generator.
We saw several iterator adapters: filter, map, and enumerate.
And we saw several iterator consumers: any, all, count, sum, min, max, and collect.
The standard library contains many more such functions.
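A minimal sketch of an iterator chain for the problem stated at the start of this section follows. The function name positive_doubled is hypothetical, and a vector stands in for the array so that the sketch behaves identically in every Rust edition.

```rust
// Hypothetical helper: keeps the positive numbers and doubles them.
fn positive_doubled(numbers: Vec<i32>) -> Vec<i32> {
    numbers
        .into_iter() // iterator generator
        .filter(|x| *x > 0) // iterator adapter
        .map(|x| x * 2) // iterator adapter
        .collect::<Vec<_>>() // iterator consumer
}

fn main() {
    println!("{:?}", positive_doubled(vec![66, -8, 43, 19, 0, -31]));
}
```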
Iterators Are “Lazy”
It will print: F66 M66 F-8 F43 M43 F19 M19 F0 F-31 [132, 86, 38].
The runtime operations are the following ones.
The invocation of into_iter prepares a temporary iterator, but it does not access the array. Let’s name “I” the iterator returned by into_iter.
The invocation of filter on “I” prepares another temporary iterator, but it does not manage data. Let’s name “F” the iterator returned by filter.
The invocation of map on “F” prepares another temporary iterator, but it does not manage data. Let’s name “M” the iterator returned by map.
The invocation of collect on “M” asks “M” for an item; “M” asks “F” for an item; “F” asks “I” for an item. Then “I” takes the number 66 from the array and passes it to “F”, which prints it, checks whether it is positive, and passes it to “M”, which prints it, doubles it, and passes it to collect, which then pushes it into the vector.
Then collect, because it has just received Some item and not None, asks “M” for another item, and the trip is repeated until the number -8 arrives at “F”, which rejects it as nonpositive. Indeed, -8 is not printed by “M”. At this point “F”, because it has just received Some item and has rejected it, asks “I” for another item.
The algorithm proceeds in this way until the array is finished. When “I” cannot find other items in the array, it sends a None to “F” to indicate there are no more items. When “F” receives a None, it sends it to “M”, which sends it to collect, which stops asking for items, and the whole statement is finished.
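The order of operations traced above can be reproduced with the following sketch. It is a reconstruction under assumptions: the function name traced_chain is hypothetical, a vector stands in for the array, and the trace is recorded in a RefCell<String> (instead of being printed by the closures) so that both closures can write to it and the result can be checked.

```rust
use std::cell::RefCell;

// Hypothetical helper: runs the chain and records when each closure is called.
fn traced_chain(numbers: Vec<i32>) -> (String, Vec<i32>) {
    // RefCell lets both closures append to the same trace.
    let trace = RefCell::new(String::new());
    let doubled = numbers
        .into_iter()
        .filter(|x| {
            trace.borrow_mut().push_str(&format!("F{} ", x));
            *x > 0
        })
        .map(|x| {
            trace.borrow_mut().push_str(&format!("M{} ", x));
            x * 2
        })
        .collect::<Vec<_>>();
    (trace.into_inner(), doubled)
}

fn main() {
    let (trace, doubled) = traced_chain(vec![66, -8, 43, 19, 0, -31]);
    // The F and M entries are interleaved: each item travels through
    // the whole chain before the next item is requested.
    println!("{}{:?}", trace, doubled);
}
```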
This code is equivalent: the same mechanism is activated, and it prints the same output.
This program prints nothing, because it does nothing. Even the compiler reports the warning: unused `Map` that must be used, and then the note: iterators are lazy and do nothing unless consumed.
In computer science, to be lazy means trying to do some processing as late as possible. Iterator adapters are lazy, as they process data only when another function asks them for an item: it can be another iterator adapter, or an iterator consumer, or a for loop, which acts as a consumer. If there is no data sink, there is no data access.
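Laziness can be observed by counting how many times a closure is actually called. In this sketch the function names are hypothetical, and a Cell<u32> counter is an assumption introduced only to make the calls visible.

```rust
use std::cell::Cell;

// Hypothetical helper: builds a chain with no consumer
// and reports how many times the closure was called.
fn calls_without_consumer() -> u32 {
    let calls = Cell::new(0);
    // No data sink: the adapter is created but never asked for an item.
    let _chain = vec![66, -8, 43].into_iter().map(|x| {
        calls.set(calls.get() + 1);
        x * 2
    });
    calls.get()
}

// Hypothetical helper: the same chain, consumed by a for loop.
fn calls_with_for_loop() -> u32 {
    let calls = Cell::new(0);
    let chain = vec![66, -8, 43].into_iter().map(|x| {
        calls.set(calls.get() + 1);
        x * 2
    });
    // The for loop acts as a consumer, asking for one item at a time.
    for _item in chain {}
    calls.get()
}

fn main() {
    println!("{}", calls_without_consumer());
    println!("{}", calls_with_for_loop());
}
```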