#38 EOL Type Detector

One of the problems with standards is that there are so many of them. Even something as simple as the format of a text file can be subject to many different standards. For example, Microsoft, Apple, and Unix/Linux all use a different end-of-line (EOL) indicator.

The root of this problem can be traced back to the early days, in the 1920s B.C. (before computers). A device called a Teletype was invented to send text over the phone lines at the amazingly fast speed of 10 characters a second (fast for 1920s technology).

The unit consisted of a keyboard, printer, paper tape reader, and punch. It contained a character encoder made out of levers and a character decoder built around a shift register that looked a lot like a car's distributor. The thing was loud and difficult to maintain, but it still managed to do its job.

One of the problems with the Teletype was that although it took 1/10 of a second to print a character, it took 2/10 of a second to move the printhead from the right side of the page to the left. If you sent the machine a printable character while the printhead was moving, it would print a smudge in the middle of the page.

The solution to this problem was to use two characters for the end of line. The first, a carriage return, sent the printhead or carriage to the left side; the second, a line feed, moved the paper up.

The early computers frequently used Teletypes as their main console. After all, the Teletype had a keyboard and printer for typing and a paper tape reader/punch for storage. But back then storage cost a lot more per byte than it does now. Storing two characters for an end of line was expensive.

So some people decided to take the two-character end-of-line sequence (carriage return, line feed) and store only one of the characters. The Unix people decided to use the line feed. DEC, and later Apple, decided to standardize on carriage return. Microsoft decided to use both carriage return and line feed. The result is the tower of babble we must deal with now.

Moving files from one machine to another can cause problems because of EOL incompatibilities. For that reason, it's a good idea to know what type of EOL is being used in a file. So you need a good way of telling what type of file you are dealing with.

The Code

    use strict;
    use warnings;
    use English;

    ############################################
    # do_file($name) -- Tell what type of file
    #       the given file is
    ############################################
    sub do_file($)
    {
        my $file = shift;
        if (not open IN_FILE, "<$file") {
            print "Could not open $file\n";
            return;
        }
        # Read raw bytes, with no EOL translation
        binmode(IN_FILE);
        my $old_file = select IN_FILE;
        local $/;       # No record separator: read the whole file
        select $old_file;
        my $buffer = <IN_FILE>;

        # Count all carriage returns, all line feeds, and all pairs
        my $cr = $buffer =~ tr/\r/\r/;
        my $lf = $buffer =~ tr/\n/\n/;
        my $crlf = $buffer =~ s/\r\n/\r\n/g;

        close (IN_FILE);

        # Leave only the solo carriage returns and line feeds
        $cr -= $crlf;
        $lf -= $crlf;
        if (($cr == 0) && ($lf == 0) && ($crlf != 0)) {
            print "$file:\tMicrosoft (<cr><lf>)\n";
        } elsif (($cr == 0) && ($lf != 0) && ($crlf == 0)) {
            print "$file:\tLinux/UNIX (<lf>)\n";
        } elsif (($cr != 0) && ($lf == 0) && ($crlf == 0)) {
            print "$file:\tApple (<cr>)\n";
        } else {
            print "$file:\tBinary (<cr>=$cr <lf>=$lf <cr><lf>=$crlf)\n";
        }
    }

    foreach my $cur_file (@ARGV) {
        do_file($cur_file);
    }

Running the Script

To run the script, just specify the files to be processed on the command line:

$ eol-type.pl test.dos test.unix test.mac test.mixed
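If you have no such files handy, something like the following sketch can create them before you run the command above. The file names come from the command line shown; the contents are made up for illustration and are only an assumption about what such test files might hold.

```perl
use strict;
use warnings;

# Hypothetical contents for the four sample files named on the command line.
my %sample = (
    'test.dos'   => "line one\r\nline two\r\n",   # <cr><lf> pairs only
    'test.unix'  => "line one\nline two\n",       # <lf> only
    'test.mac'   => "line one\rline two\r",       # <cr> only
    'test.mixed' => "one\rtwo\nthree\r\n",        # one of each kind
);

foreach my $name (sort keys %sample) {
    # :raw writes the bytes exactly as given, with no EOL translation
    open(my $out, '>:raw', $name) or die "Could not create $name: $!\n";
    print {$out} $sample{$name};
    close($out) or die "Could not close $name: $!\n";
}
```

The files are written in raw mode so that Perl does not translate the EOL characters on the way out.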

The Results

test.dos:    Microsoft (<cr><lf>)
test.unix:   Linux/UNIX (<lf>)
test.mac:    Apple (<cr>)
test.mixed:  Binary (<cr>=1 <lf>=1 <cr><lf>=1)

How It Works

The script starts by opening the file and then setting binmode on it. This prevents Perl from internally performing any EOL editing on the input file. (On Windows, for example, a carriage return/line feed combination would otherwise be translated to just a line feed as the file was being read.)

    if (not open IN_FILE, "<$file") {
        print "Could not open $file\n";
        return;
    }
    binmode(IN_FILE);

Next the file is read in using one read statement. To do this, you use a little trick. First you use the select call to make IN_FILE the current file (saving the old current file in the process). Next, declare a local version of the record separator $/. This is assigned no value, so it gets the value undef. That means that the file is not divided into records. The old current file specification is restored. (Strictly speaking, $/ is a global variable rather than a per-file setting, so the read would work without the two select calls; they are harmless.) The file is then read. Because there is no record separator, the entire file is read and deposited into the variable $buffer. There's one final step, but that one is invisible. When the local $/ goes out of scope (at the end of the function), the old value of $/ is restored. Although the result is only a few lines of Perl, there's a lot going on here:

    my $old_file = select IN_FILE;
    local $/;
    select $old_file;
    my $buffer = <IN_FILE>;
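The same whole-file read can be sketched with a lexical filehandle and a three-argument open. This is a common alternative idiom, not the script's own method, and the helper name slurp_raw is hypothetical. Confining local $/ to a small block keeps the change from affecting any other reads.

```perl
use strict;
use warnings;

# slurp_raw($path) -- hypothetical helper: return the whole file as one
# string of raw bytes, or undef if the file cannot be opened.
sub slurp_raw {
    my ($path) = @_;

    # The :raw layer has the same effect as calling binmode on the handle
    open(my $in, '<:raw', $path) or return;

    my $buffer = do {
        local $/;       # undef record separator, restored when the block ends
        <$in>;          # with no separator, one read returns the entire file
    };
    close($in);
    return $buffer;
}
```

A caller would check the result with defined, because an empty file yields an empty string rather than undef.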

Next you count the number of carriage returns, line feeds, and carriage return/line feed combinations. The tr operator is used to count single characters (carriage returns, line feeds). The substitution operator is used to count the carriage return/line feed combinations:

    my $cr = $buffer =~ tr/\r/\r/;
    my $lf = $buffer =~ tr/\n/\n/;
    my $crlf = $buffer =~ s/\r\n/\r\n/g;
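To see what these three counts look like on concrete data, here is a small sketch that applies the same operators to a made-up string. The variable $sample and its contents are assumptions for illustration only. The tr operator returns the number of characters it matched, and s///g returns the number of substitutions it made.

```perl
use strict;
use warnings;

# Made-up sample: one Microsoft line, one Unix line, one Apple line.
my $sample = "one\r\ntwo\nthree\r";

my $cr   = ($sample =~ tr/\r/\r/);        # every carriage return, paired or not
my $lf   = ($sample =~ tr/\n/\n/);        # every line feed, paired or not
my $crlf = ($sample =~ s/\r\n/\r\n/g);    # carriage return/line feed pairs

print "cr=$cr lf=$lf crlf=$crlf\n";       # prints "cr=2 lf=2 crlf=1"
```

The carriage return and line feed inside the pair are included in the first two counts, which is why a later adjustment is needed.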

Next you adjust the carriage return and line feed counts so that they reflect only the solo carriage returns and line feeds and do not include any contained in the carriage return/line feed pairs:

    $cr -= $crlf;
    $lf -= $crlf;

At this point, if you have a text file, only one of the variables $cr, $lf, and $crlf will be nonzero. All you have to do is figure out which one and print out the results. If more than one of these variables is nonzero, then multiple types of EOLs are present in the file. This indicates a binary or confused file:

    if (($cr == 0) && ($lf == 0) && ($crlf != 0)) {
        print "$file:\tMicrosoft (<cr><lf>)\n";
    } elsif (($cr == 0) && ($lf != 0) && ($crlf == 0)) {
        print "$file:\tLinux/UNIX (<lf>)\n";
    } elsif (($cr != 0) && ($lf == 0) && ($crlf == 0)) {
        print "$file:\tApple (<cr>)\n";
    } else {
        print "$file:\tBinary (<cr>=$cr <lf>=$lf <cr><lf>=$crlf)\n";
    }
}
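The decision logic can also be pulled out into a function that returns a label instead of printing, which makes it easy to try on in-memory strings. This is a sketch, not the author's code: the name eol_type and the labels it returns are hypothetical, and it counts pairs with a match in list context rather than with a substitution.

```perl
use strict;
use warnings;

# eol_type($text) -- hypothetical helper: classify the EOL style of a string.
sub eol_type {
    my ($text) = @_;

    my $crlf = () = $text =~ /\r\n/g;         # pairs, counted without editing $text
    my $cr   = ($text =~ tr/\r//) - $crlf;    # carriage returns outside pairs
    my $lf   = ($text =~ tr/\n//) - $crlf;    # line feeds outside pairs

    return 'Microsoft'  if $crlf && !$cr && !$lf;
    return 'Linux/UNIX' if $lf && !$cr && !$crlf;
    return 'Apple'      if $cr && !$lf && !$crlf;
    return 'Binary';    # mixed EOLs, or no EOL characters at all
}

print eol_type("a\r\nb\r\n"), "\n";      # Microsoft
print eol_type("a\nb\n"), "\n";          # Linux/UNIX
print eol_type("a\rb\r"), "\n";          # Apple
print eol_type("a\rb\nc\r\n"), "\n";     # Binary
```

As in the script, a string with no EOL characters at all falls through to the final case.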

Hacking the Script

The script is fairly simple, but it still can be hacked. I'm sure that there are a number of ways to use Perl tricks to improve the speed and efficiency of this program.
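One possible direction, offered as a sketch rather than a measured improvement, is to tally all three EOL styles in a single pass over the buffer so that no later subtraction is needed. The function name count_eols is hypothetical.

```perl
use strict;
use warnings;

# count_eols($text) -- hypothetical helper: tally each EOL style in one pass.
# Returns (solo carriage returns, solo line feeds, carriage return/line feed pairs).
sub count_eols {
    my ($text) = @_;
    my %count = ("\r\n" => 0, "\r" => 0, "\n" => 0);

    # \r\n is listed first so a pair is never counted as two solo characters.
    $count{$1}++ while $text =~ /(\r\n|\r|\n)/g;

    return ($count{"\r"}, $count{"\n"}, $count{"\r\n"});
}

my ($cr, $lf, $crlf) = count_eols("a\rb\nc\r\n");
print "<cr>=$cr <lf>=$lf <cr><lf>=$crlf\n";   # prints "<cr>=1 <lf>=1 <cr><lf>=1"
```

Whether this is actually faster than the tr-based counts would need to be checked with a benchmark on real files.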
