Don’t you just hate getting an email that’s been for
matted
for the wrong number of columns? It’s an unprovoked ass
ault
on your poor visual cortex. And it’s a thoughtless insult, to
o.
It screams: “Hey, you aren’t even worth the eight keystr
okes
it would take me to correctly set my editor’s autowrap!”
> And, of course, it only gets worse when quoted email is
involved. > Even when someone tries to do the right
thing, they just end > up frying more of your neurons as
you attempt to untangle > the mess that most text formatters
make of the standard > quoting conventions. It’s no fun trying
to separate the meaning > from the massage.
What the world needs is a text reformatter that looks at the contents—and context—of the ASCII it’s munging, and then Does The Right Thing automagically.
And that’s exactly what the Text::Autoformat module gives you.
Specifically, it provides a subroutine named autoformat that wraps text to fixed margins. However, unlike other text wrapping modules (such as Text::Wrap, Text::Correct, or Text::Reflow), autoformat reformats its input by analyzing the text’s structure: identifying and rearranging independent paragraphs by looking for visual gaps, list bullets, changes in quoting, centering, and underlining.
If you’re happy to live with autoformat’s reasonable defaults, then reformatting a single paragraph (taking it from STDIN and printing it to STDOUT) is no more complicated than this:
use Text::Autoformat; autoformat;
The default width of the reformatted text is from column 1 to column 72, but it’s very easy to change that (and a plethora of other defaults) by giving autoformat the appropriate options:
autoformat {left=>8, right=>64};
Or the equivalent, but often more convenient, alternative:
autoformat {left=>8, width=>57};
If autoformat’s first argument isn’t a hash reference, that argument is stringified and used as the text to be formatted. For example:
autoformat $msg_text;
Likewise, if it’s called in a non-void (scalar or list) context, autoformat returns the formatted text, rather than printing it to STDOUT.
Normally, autoformat only reformats the first paragraph it encounters, and leaves the remainder of the text unaltered. This behavior seems odd initially, until you realize that the single most common use of autoformat is in the following one-liner:
perl -MText::Autoformat -e'autoformat'
and that the obvious thing to do with this one-liner is to map it onto a convenient keystroke in your text editor, thereby providing intelligent, single-key, paragraph-at-a-time reformatting. For example, if you’re a vi user, you might add this to your .exrc file:
map f !G perl -MText::Autoformat -eautoformat
That is: map the f key to grab every line from the current editing position to the end of the file and filter it through Perl. Then, to provide that filter, the Text::Autoformat module is loaded and autoformat is called.
If autoformat’s default were to reformat everything it was sent, then you’d have to write:
map f !} perl -MText::Autoformat -eautoformat
and you’d be stuck with vi’s much less sophisticated understanding of what constitutes a paragraph. More on that shortly.
Of course, the real power of the module is best seen when it operates on multiple paragraphs simultaneously. To convince autoformat to do that (to reflow every paragraph you send it), you need to ask explicitly, with another option:
autoformat { all=>1 };
Which leads to the obvious “just-fix-it-all-up-for-me-would-ya” editor macro:
map F !Gperl -MText::Autoformat -eautoformat{all=>1}
The autoformat subroutine gives the illusion of understanding the structure of an input text because it has a series of very good heuristics (i.e., guesses) for locating and separating paragraphs.
Most text formatters (and many text editors) define a paragraph to be a sequence of characters terminated by two or more consecutive newlines. Indeed, this is Perl’s notion of a paragraph (which you can grab with a single readline by setting the $/ variable to an empty string, as described in the perlvar documentation).
That’s very annoying, because it doesn’t cope with how real people write paragraphed text. Real people leave spaces and tabs on “empty” lines. Real people (and many web browsers) bunch up lists of bulleted and numbered points with no whitespace at all between them. Real people quote email messages, which transforms formerly empty lines into non-empty > sequences.
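The whitespace problem is easy to reproduce. The sketch below reads a string in Perl’s paragraph mode, then reads it again after blanking out whitespace-only lines. The sample text and the helper name paragraphs_in are invented for illustration, and the tidying step is just one simple workaround, not a description of how Text::Autoformat works internally.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into paragraphs using Perl's
# paragraph mode (an empty $/), reading from an in-memory filehandle.
sub paragraphs_in {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot read from string: $!";
    local $/ = '';                  # paragraph mode
    my @paragraphs = <$fh>;
    close $fh;
    return @paragraphs;
}

# The "empty" line between the second and third paragraphs holds a space.
my $text = "First paragraph.\n\nSecond paragraph.\n \nThird paragraph.\n";

my @strict = paragraphs_in($text);

# Blank out whitespace-only lines first, then split again.
(my $tidied = $text) =~ s/^[ \t]+$//mg;
my @lenient = paragraphs_in($tidied);

printf "paragraph mode: %d, after tidying: %d\n",
    scalar @strict, scalar @lenient;
```

Paragraph mode alone glues the second and third paragraphs together, because the line between them is not truly empty.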
Because real people do such things, autoformat understands all these notions of a paragraph. Even when they’re all used at once. Even when they’re used inside one another (for example, quoting a list of bulleted points).
One of Text::Autoformat’s most useful paragraphing heuristics is that any sequence of lines beginning with standard “quoter” characters is a single piece of quoted text, in which the quoters should be preserved and only the text to the right of them reflowed.
The standard quoters that autoformat recognizes are nested combinations of the characters:
! # % = | : >
Angle brackets can also be preceded by alphabetic characters. So, for example, autoformat would take a series of paragraphs like this:
> ! > calling map in a void context is the sign
> ! > of a sick mind
> !
> ! I don't see why.
> Me either, I regularly do it and I'm still
> quite sane. I often split in a void context
> too, but there's a bug in Perl that seems to
> cause that to mess up $_[0], $_[1], etc.
> ! > Sigh. Have you bothered to read the man
> ! > page on split??? Yes, I know I wrote this
> ! > before that reply: it's a miracle.
and reformat them like so:
> ! > calling map in a void context is
> ! > the sign of a sick mind
> !
> ! I don't see why.
> Me either, I regularly do it and I'm
> still quite sane. I often split in a
> void context too, but there's a bug
> in Perl that seems to cause that to
> mess up $_[0], $_[1], etc.
> ! > Sigh. Have you bothered to read
> ! > the man page on split??? Yes, I
> ! > know I wrote this before that
> ! > reply: it's a miracle.
That’s the whole point. By understanding the structural conventions of typical plaintext, autoformat can reflow it logically, rather than physically.
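To make the quoter rule concrete, here is a minimal sketch that separates a line’s quoter prefix from the text to be reflowed. The pattern is an assumption built only from the description above (the listed punctuation characters, plus angle brackets optionally preceded by letters), and split_quoted_line is a hypothetical helper; the module’s real pattern is likely more careful about false positives.

```perl
use strict;
use warnings;

# Assumed pattern for one quoter: an angle bracket optionally preceded by
# letters (e.g. "Fred>"), or one of the characters ! # % = | :
my $quoter = qr/(?:[A-Za-z]*>|[!#%=|:])/;

# Hypothetical helper: split a line into (quoter prefix, remaining text).
sub split_quoted_line {
    my ($line) = @_;
    if ($line =~ /^((?:\s*$quoter)+)\s*(.*)$/) {
        return ($1, $2);
    }
    return ('', $line);             # no quoters: the whole line is text
}

for my $line ('> ! > calling map in a void context',
              'Fred> I disagree',
              'No quoting here') {
    my ($prefix, $body) = split_quoted_line($line);
    print "[$prefix] [$body]\n";
}
```

Lines that share the same prefix can then be grouped, their bodies reflowed as one paragraph, and the prefix reattached to every output line.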
Often plaintext will include lists that are either bulleted with punctuation characters, simply numbered (i.e., 1., 2., 3., etc.), or hierarchically numbered (1, 1.1, 1.2, 1.3, 2, 2.1., etc.) Whether or not it is physically separated from each of its neighbors, each bulleted item is implicitly a separate paragraph and needs to be formatted individually, with the appropriate indentation.
autoformat takes care of that, and can also detect unordered bullets (the characters *, ., +, and -), special markers that ought to be outdented (such as NB: and p.s.), Arabic and Roman numerals, single alphabetic letters, and hierarchical combinations of these (for example, 2.a(ix)).
Besides adjusting the left margin so that the marker is outdented from the paragraph text, autoformat renumbers each numbered point sequentially (using the first number as its starting point). For example, given the following text:
You're wrong for the following reasons:
1. I'm right.
1.a. I'm *always* right
1. Even if you were right, you have the order wrong.
1.x. You suggested:
> D. Analyze the problem carefully
> C. Design the algorithm appropriately
> A. Code solution systematically
> E. Test thoroughly
> B. Ship eventually
1.n. The proper sequence is:
A. Code solution expediently
B. Ship immediately
E. Test sporadically (charge user for maintenance)
F. Release "upgrade" periodically (charge user again)
autoformat {all => 1} produces:
You're wrong for the following reasons:
1. I'm right.
1.a. I'm *always* right
2. Even if you were right, you have the order wrong.
2.a. You suggested:
> D. Analyze the problem carefully
> C. Design the algorithm
>    appropriately
> A. Code solution systematically
> E. Test thoroughly
> B. Ship eventually
2.b. The proper sequence is:
A. Code solution expediently
B. Ship immediately
C. Test sporadically (charge user for maintenance)
D. Release "upgrade" periodically (charge user again)
Notice that autoformat got the hierarchical ordering correct, and that it didn’t renumber the quoted list, even though it reflowed the text within the quoted section. That makes sense, since renumbering the quoted list might change its meaning in a way that reformatting wouldn’t.
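The core of the renumbering idea can be sketched for the simplest case: a flat list of "N." items, numbered from the first number found, with quoted lines left alone. The helper renumber_simple_list is hypothetical and ignores the hierarchical markers, letters, and Roman numerals that autoformat also handles.

```perl
use strict;
use warnings;

# Hypothetical helper: renumber the top-level "N." items of a list
# sequentially, starting from the first number found. Quoted lines
# (here: lines starting with ">") are deliberately left untouched.
sub renumber_simple_list {
    my @lines = @_;
    my $next;
    for my $line (@lines) {
        next if $line =~ /^\s*>/;                 # never renumber quoted text
        next unless $line =~ /^(\s*)(\d+)\.(\s)/; # not a numbered item
        $next = $2 unless defined $next;          # first number sets the start
        $line =~ s/^(\s*)\d+\./$1$next./;
        $next++;
    }
    return @lines;
}

print "$_\n" for renumber_simple_list(
    "3. I'm right.",
    "3. Even if you were right, you have the order wrong.",
    "> 9. A quoted item",
    "7. The proper sequence follows.",
);
```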
The autoformat subroutine also handles renumbering of lists marked with Roman numerals. For example, the list:
Examples of the five declensions are:
i. terra, terra, terram, terrae, terrae, terra
v. modus, mode, modum, modi, modo, modo
x. nomen, nomen, nomen, nominis, nomini, nomine
ix. portus, portus, portum, portus, portui, portu
mmmclxiv. dies, dies, diem, diei, diei, die
becomes:
Examples of the five declensions are:
  i. terra, terra, terram, terrae, terrae, terra
 ii. modus, mode, modum, modi, modo, modo
iii. nomen, nomen, nomen, nominis, nomini, nomine
 iv. portus, portus, portum, portus, portui, portu
  v. dies, dies, diem, diei, diei, die
autoformat is even smart enough to right-justify the numbers, so as to align the paragraph bodies cleanly.
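As a rough illustration of that last step, the following sketch generates sequential lowercase Roman markers and pads them on the left to a common width. Both to_roman and roman_markers are hypothetical helpers written for this example, not functions exported by the module.

```perl
use strict;
use warnings;

# Convert a positive integer (1..3999) to a lowercase Roman numeral.
sub to_roman {
    my ($n) = @_;
    my @table = (
        [1000, 'm'], [900, 'cm'], [500, 'd'], [400, 'cd'],
        [100,  'c'], [90,  'xc'], [50,  'l'], [40,  'xl'],
        [10,   'x'], [9,   'ix'], [5,   'v'], [4,   'iv'], [1, 'i'],
    );
    my $roman = '';
    for my $pair (@table) {
        my ($value, $glyphs) = @$pair;
        while ($n >= $value) {
            $roman .= $glyphs;
            $n     -= $value;
        }
    }
    return $roman;
}

# Hypothetical helper: markers "i." onwards for $count items,
# right-justified to the width of the widest marker.
sub roman_markers {
    my ($count) = @_;
    my @markers = map { to_roman($_) . '.' } 1 .. $count;
    my $width = 0;
    for my $marker (@markers) {
        $width = length($marker) if length($marker) > $width;
    }
    return map { sprintf '%*s', $width, $_ } @markers;
}

print "$_ item\n" for roman_markers(5);
```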
Of course, automatically handling lists of letters and lists of Roman numerals presents an interesting challenge. A list such as:
I. Put cat in box.
M. Close lid.
P. Activate Geiger counter.
should obviously be reordered as I…J…K, whereas:
I. Put cat in box.
M. Close lid.
XLI. Activate Geiger counter.
should clearly become I…II…III.
But what about:
I. Put cat in box.
M. Close lid.
L. Activate Geiger counter.
The autoformat subroutine resolves this ambiguity by always interpreting a list with alphabetic bullets as being English letters, unless the full list contains only valid Roman numerals, and at least one of those numerals is two or more characters long. So the final example above would become I…J…K, as you might have expected.
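That rule is compact enough to sketch directly. The classifier below is an illustration of the stated rule only: the Roman-numeral pattern is a conventional one (numerals up to 3999), and classify_bullets is a hypothetical name, so the module’s own validity test may differ in detail.

```perl
use strict;
use warnings;

# A conventional pattern for well-formed Roman numerals up to 3999.
my $roman = qr/^(?=[MDCLXVI])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})$/i;

# Hypothetical helper: classify a list's alphabetic bullets as 'roman' or
# 'letters'. Roman wins only if every bullet is a valid numeral and at
# least one bullet is two or more characters long.
sub classify_bullets {
    my @bullets = @_;
    my $all_roman  = !grep { $_ !~ $roman } @bullets;
    my $has_longer = grep { length($_) >= 2 } @bullets;
    return ($all_roman && $has_longer) ? 'roman' : 'letters';
}

print join(' ', @$_), ' => ', classify_bullets(@$_), "\n"
    for [qw(I M P)], [qw(I M XLI)], [qw(I M L)];
```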
Literary quotations present a different challenge from quoted email. A typical formatter would re-render the following quotation:
"We are all of us in the gutter, but some of us are looking at the stars" -- Oscar Wilde English playwright
like so:
"We are all of us in the gutter, but some of us are looking at the stars" -- Oscar Wilde English playwright
But autoformat recognizes the quotation structure and preserves both indentation and attribution:
"We are all of us in the gutter, but some of us are looking at the stars" -- Oscar Wilde English playwright
It even outdents the leading quotation mark nicely.
Did you notice that in the previous example, autoformat broke the second line earlier than it needed to? It did that because, if the full margin width had been used, the formatting would have left the last line oddly short:
"We are all of us in the gutter, but some of us are looking at the stars" -- Oscar Wilde English playwright
Typographical misdemeanors of this type (known as widows) are heavily frowned upon in typesetting circles. They look ugly in plaintext too, so autoformat avoids them with a kind of Dickensian artful dodge: stealing extra words from earlier lines in a paragraph, to provide the widowed word with adequate company.
The heuristic used is that final lines must be at least ten characters long. If the last line is too short, the paragraph’s right margin is reduced by one column, and the paragraph is reformatted. This process iterates until either the last line exceeds nine characters or the margins have been narrowed by 10% of their original separation. In the latter case, the reformatter gives up and just uses its original formatting.
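The loop just described might look something like the sketch below. It pairs a deliberately naive greedy wrapper (wrap_greedy) with the narrowing loop (wrap_without_widow); both names are hypothetical, the treatment of single-line paragraphs is an assumption, and the real module wraps text far more carefully.

```perl
use strict;
use warnings;

# Minimal greedy word-wrapper: returns the lines of $text wrapped to $width.
sub wrap_greedy {
    my ($text, $width) = @_;
    my @lines;
    for my $word (split ' ', $text) {
        if (@lines && length($lines[-1]) + 1 + length($word) <= $width) {
            $lines[-1] .= " $word";     # word fits on the current line
        }
        else {
            push @lines, $word;         # start a new line
        }
    }
    return @lines;
}

# Narrow the right margin one column at a time until the last line is at
# least ten characters long, giving up after narrowing by 10% of the width.
sub wrap_without_widow {
    my ($text, $width) = @_;
    my $max_narrowing = int($width / 10);
    for my $narrowing (0 .. $max_narrowing) {
        my @lines = wrap_greedy($text, $width - $narrowing);
        # Assumption: a one-line paragraph cannot have a widow.
        return @lines if @lines < 2 || length($lines[-1]) >= 10;
    }
    return wrap_greedy($text, $width);  # give up: use the original formatting
}

print "$_\n" for wrap_without_widow('alpha beta gamma delta epsilon zeta', 30);
```

At a width of 30 the plain wrapper strands "zeta" on its own line; narrowing the margin by a single column pushes "epsilon" down to keep it company.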
The autoformat subroutine can also take an option that tells it how the reformatted text should be justified. For example:
autoformat {justify => 'right'};
The alternative values for this option are: 'left' (the default), 'right', 'centre' (or 'center'), and 'full'.
Full justification is interesting in a fixed-width medium like plaintext because it usually results in uneven spacing between words. Typically, text formatters provide for this by distributing the extra spaces into the first available gaps of each line:
R3> Now is the Winter of our discontent made
R3> glorious Summer by this son of York. And all
R3> the clouds that lour'd upon our house In
R3> the deep bosom of the ocean buried.
This produces an odd visual effect, so autoformat reverses the strategy and inserts extra spaces at the end of lines (which most readers find less disconcerting):
R3> Now is the Winter of our discontent made
R3> glorious Summer by this son of York. And all
R3> the clouds that lour'd upon our house In
R3> the deep bosom of the ocean buried.
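The difference between the two strategies is easiest to see in code. This sketch pads one line of words to a fixed width, putting any surplus spaces either into the rightmost gaps or into the leftmost ones. The helper justify_line and its argument order are invented for illustration and are not part of the module’s interface.

```perl
use strict;
use warnings;

# Hypothetical helper: pad a line of words out to $width characters.
# $from_end true  => surplus spaces go into the rightmost gaps;
# $from_end false => surplus spaces go into the leftmost gaps.
sub justify_line {
    my ($width, $from_end, @words) = @_;
    return join('', @words) if @words < 2;       # nothing to stretch

    my $gaps    = @words - 1;
    my $letters = 0;
    $letters += length($_) for @words;
    my $spaces  = $width - $letters;             # total spaces to distribute
    return join(' ', @words) if $spaces < $gaps; # already too long: leave it

    my $base    = int($spaces / $gaps);          # spaces every gap receives
    my $surplus = $spaces % $gaps;               # gaps that get one extra

    my @pad      = ($base) x $gaps;
    my @favoured = $from_end ? reverse(0 .. $gaps - 1) : (0 .. $gaps - 1);
    $pad[$_]++ for @favoured[0 .. $surplus - 1];

    my $line = $words[0];
    $line .= (' ' x $pad[$_ - 1]) . $words[$_] for 1 .. $#words;
    return $line;
}

my @words = qw(Now is the Winter of our discontent made);
print justify_line(44, 0, @words), "\n";   # surplus spaces at the start
print justify_line(44, 1, @words), "\n";   # surplus spaces at the end
```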
Even if explicit centering is not specified via the {justify => 'centre'} option, autoformat will automatically detect centered paragraphs and preserve their justification. It does this by examining each line of the paragraph and asking itself: “If this line were part of a centered paragraph, where would the midpoint have been?”
By making the same estimate for every line in the paragraph, and then comparing the estimates, autoformat can deduce whether all the lines are centered with respect to the same axis of symmetry (with an allowance of plus or minus 1 to cater for the inevitable integer rounding). If a common axis of symmetry is detected, autoformat assumes that the lines are supposed to remain centered, and automatically switches on center-justification for that paragraph.
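A minimal sketch of that midpoint comparison follows. The helper looks_centered is hypothetical, it ignores tabs, and it assumes a paragraph needs at least two lines before centering can be inferred; a production version would also need to guard against flush-left paragraphs whose lines happen to be nearly the same length.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: true if every line's estimated midpoint lies within
# one column of a common axis of symmetry.
sub looks_centered {
    my @lines = grep { /\S/ } @_;       # ignore blank lines
    return 0 if @lines < 2;             # assumption: one line proves nothing

    my @midpoints;
    for my $line (@lines) {
        (my $text = $line) =~ s/\s+$//;          # drop trailing whitespace
        my ($indent) = $text =~ /^(\s*)/;
        my $first = length $indent;              # column of first visible char
        my $last  = length($text) - 1;           # column of last visible char
        push @midpoints, ($first + $last) / 2;
    }
    # All estimates within +/-1 of a shared axis means a spread of at most 2.
    return (max(@midpoints) - min(@midpoints) <= 2) ? 1 : 0;
}

print looks_centered('    The Title', '  A longer line', '      by me')
    ? "centered\n" : "not centered\n";
```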
You can also optionally perform case conversions on the text being processed, using the case => option. The alternatives are 'upper', 'lower', 'title', and 'highlight'. Title casing capitalizes the first letter of each word:
The Strange And Gruesome Case Of The Tab-indented Python.
and highlight casing does the same, except that it ignores trivial words:
The Strange and Gruesome Case of the Tab-indented Python.
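The two styles differ only in how they treat trivial words, as this sketch shows. The word list here is a tiny assumed sample (the module’s actual list is not reproduced), and title_case and highlight_case are hypothetical helpers rather than the module’s own code.

```perl
use strict;
use warnings;

# Assumed, deliberately tiny list of "trivial" words.
my %trivial = map { $_ => 1 } qw(a an and of or the to);

# Capitalize the first letter of every word.
sub title_case {
    my ($text) = @_;
    return join ' ', map { ucfirst(lc($_)) } split ' ', $text;
}

# As title_case, but leave trivial words in lowercase
# (except at the very start of the text).
sub highlight_case {
    my ($text) = @_;
    my @words = split ' ', lc $text;
    my $first = 1;
    for my $word (@words) {
        $word = ucfirst $word if $first || !$trivial{$word};
        $first = 0;
    }
    return join ' ', @words;
}

my $headline = 'the strange and gruesome case of the tab-indented python.';
print title_case($headline),     "\n";
print highlight_case($headline), "\n";
```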
A fifth alternative is {case => 'sentence'}. This mode attempts to produce correctly-cased sentences: first letter in uppercase, subsequent words in lowercase (unless a word is originally in mixed case). For example, the paragraph:
POVERTY, MISERY, FRIENDLESSNESS, ETC. are ever the lot of the VisualBasic hacker. 'tis an immutable law of Nature! Whom the GODS would DESTROY, they FIRST force to code Word MACROS.
under {case => 'sentence'} becomes:
Poverty, misery, friendlessness, etc. are ever the lot of the VisualBasic hacker. 'Tis an immutable law of Nature! Whom the gods would destroy, they first force to code Word macros.
Note that autoformat is clever enough to recognize that the period in abbreviations such as “etc.” is not a sentence terminator, and that the first capitalizable letter of “’tis” is the “t,” and that words like “VisualBasic” and “Nature” should retain their existing capitalizations.
There is an endless list of other smart things Text::Autoformat could be extended to do. Here’s a short preview of some coming attractions:
A future release of Text::Autoformat will recognize columns within a paragraph and allow the user to independently control their layout and justification, even under margin adjustments. For example, given:
Name    Mark  Comment
====    ====  =======
Pat     99    Unusually high score. Suspect?
Kim     72    Solid performance
Leslie  51    Just scraped through this time
you’ll be able to call:
autoformat {justify => ['left', 'centre', 'left'], width => [undef, undef, 20]};
and produce:
Name    Mark  Comment
====    ====  =======
Pat      99   Unusually high
              score. Suspect?
Kim      72   Solid performance
Leslie   51   Just scraped through
              this time
autoformat will eventually provide smart 8-to-7 bit transliteration (the way the Text::StripHigh module does now), so that text like:
• This example’s © Erwin Schrödinger, Nº42(±1) Unçertainté Straße, Østland.
could be transformed into this:
* This example's (c) Erwin Schroedinger, No42(+/-1) Uncertainte' Strasse, Ostland.
autoformat was originally developed as a lazy way to clean up incoming and outgoing email. It does that exceptionally well, so long as you keep it away from the headers. Sendmail doesn’t take kindly to autoformat’s misguided efforts with them:
To: Jon Orwant <[email protected]>
From: damian@conway.org
Subject: Re: When's the next meeting of the Secret Perl Cabal?
References: <200011100411.PAA17166@indy05-
.csse.monash.edu.au>
A future version of the module will detect mail headers and either leave them alone or wrap them intelligently.
Another irritation is that autoformat blindly attempts to reformat HTML, pod, Perl code, and many other things it should just ignore. The very next release of Text::Autoformat will have a “leave-it-the-hell-alone” option that causes autoformat to disregard any (non-bulleted) text that is indented. Later versions may also be able to automatically diagnose marked-up sections of text (and perhaps code examples) and just magically skip them.
Currently, the list of abbreviations and “stop words” that autoformat knows about is fixed, as are the sets of quoter characters and list bullets. These should obviously be user-configurable, and will be in a forthcoming release.
Meanwhile, despite these niggles, Text::Autoformat does a remarkably good job at what it was designed for: making ASCII text reformatting as easy as (in)humanly possible.
So you no longer have any excuse for sending email that slops over the margin.