It isn't dreadfully idiomatic Perl, but it isn't completely dreadful Perl either (though it could be much more compact).
Two warning bells: the shebang line doesn't include '-w', and there is neither 'use strict;' nor 'use warnings;'. This is very old-style Perl; good Perl code uses both warnings and strict.
The use of old-style file handles is no longer recommended, but it isn't automatically bad (it could be code written more than 10 years ago, perhaps).
The non-use of regular expressions is a bit more surprising. For example:
    # Process every field in line.
    while ($line ne "") {
        # Skip spaces and start with empty field.
        if (substr ($line,0,1) eq " ") {
            $line = substr ($line,1);
            next;
        }
That could be written:
    while ($line ne "") {
        $line =~ s/^\s+//;
This chops off all leading spaces using a regex, without making the code iterate around the loop. A good deal of the rest of the code would benefit from carefully written regular expressions too. These are a characteristically Perl idiom; it is surprising to see that they are not being used.
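As a quick self-contained check of that substitution, here is a minimal sketch. The helper name strip_leading_space is hypothetical (it is not in the question's code) and exists only so the behaviour can be tried in isolation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the string without leading whitespace.
sub strip_leading_space {
    my ($line) = @_;
    $line =~ s/^\s+//;    # one substitution instead of one loop pass per space
    return $line;
}

print "[", strip_leading_space("   abc, def"), "]\n";    # prints "[abc, def]"
```

Note that \s also matches tabs, whereas the original loop only skipped literal spaces; s/^ +// would be the exact equivalent if that difference matters.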
If efficiency was the stated reason for not using regexes, then the questions to ask are "did you measure it?" and "what sort of efficiency are you discussing: machine or programmer?"
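If someone does want to measure it, the core Benchmark module makes that easy. The following is only a sketch: the two subroutines (strip_substr and strip_regex) and the test input are my own stand-ins for the two approaches, not code from the question, and the numbers will vary by machine and Perl version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up input: 50 leading spaces followed by some data.
my $input = (" " x 50) . "field one, field two";

# Strip leading spaces one character at a time, in the style of the question's loop.
sub strip_substr {
    my ($line) = @_;
    $line = substr($line, 1) while $line ne "" && substr($line, 0, 1) eq " ";
    return $line;
}

# Strip leading spaces with a single substitution.
sub strip_regex {
    my ($line) = @_;
    $line =~ s/^ +//;
    return $line;
}

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    substr_loop => sub { strip_substr($input) },
    regex       => sub { strip_regex($input) },
});
```

Whatever the table says, it only answers the machine-efficiency half of the question; the programmer-efficiency half is answered by comparing how much code each version needs.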
Working code counts. More or less idiomatic code is better.
Also, of course, there are modules Text::CSV and Text::CSV_XS that could be used to handle CSV parsing. It would be interesting to enquire whether they are aware of Perl modules.
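For illustration, a minimal sketch of what using Text::CSV might look like. It assumes the module is installed from CPAN (it is not part of core Perl), and the sample line and the choice of backslash as the escape character are my assumptions about the data, based on the escape mechanism the question's code appears to use.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, not part of core Perl

# Assumption: quotes inside quoted fields are escaped with a backslash.
my $csv = Text::CSV->new({
    binary           => 1,
    escape_char      => "\\",
    allow_whitespace => 1,      # ignore spaces around the separators
}) or die "Cannot create Text::CSV object: " . Text::CSV->error_diag();

# Made-up sample line.
my $line = q{one, "two, with comma", "say \"hi\""};

if ($csv->parse($line)) {
    print " [$_]\n" for $csv->fields();
}
else {
    die "Parse failed: " . $csv->error_diag();
}
```

Leaving escape_char at its default would instead handle the doubled-quote convention.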
There are also multiple notations for handling quotes within quoted fields. The code appears to assume that backslash-quote is appropriate; I believe Excel uses doubled-up quotes:
    "He said, ""Don't do it"", but they didn't listen"
This could be matched by:
    $line =~ /^"([^"]|"")*"/;
With a bit of care, you could capture just the text between the enclosing quotes. You'd still have to post-process the captured text to remove the embedded doubled up quotes.
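A sketch of that capture-then-post-process step, assuming the doubled-quote convention. The helper name unquote_field is hypothetical, and the regex is the one above with a capturing group around the repeated part.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: if the line starts with a quoted field, return its
# contents with doubled quotes collapsed; otherwise return undef.
sub unquote_field {
    my ($line) = @_;
    return undef unless $line =~ /^"((?:[^"]|"")*)"/;
    my $field = $1;          # text between the enclosing quotes
    $field =~ s/""/"/g;      # post-process: "" becomes "
    return $field;
}

my $line = q{"He said, ""Don't do it"", but they didn't listen",next};
print unquote_field($line), "\n";
# prints: He said, "Don't do it", but they didn't listen
```

The (?:...) group keeps the alternation from capturing on its own, so $1 holds the whole field body rather than just its last character.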
A non-quoted field would be matched by:
    $line =~ /^([^,]*)(?:,|$)/;
This is enormously shorter than the looping and substringing shown.
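To show the unquoted-field regex doing the work of the loop, here is a small sketch that splits a line containing only unquoted fields. The helper name split_unquoted and the sample input are made up for the example; $+[0] is Perl's built-in end offset of the last successful match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a line of unquoted, comma-separated fields.
sub split_unquoted {
    my ($line) = @_;
    my @fields;
    while ($line ne "") {
        last unless $line =~ /^([^,]*)(?:,|$)/;
        push @fields, $1;
        $line = substr($line, $+[0]);    # drop the field and its trailing comma
    }
    return @fields;
}

print "[$_]\n" for split_unquoted("alpha,beta,,gamma");
# prints [alpha], [beta], [] and [gamma], one per line
```

Like the loop it replaces, this stops as soon as the line is empty, so a trailing comma does not produce a final empty field.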
Here's a version of the code, using the backslash-double quote escape mechanism used in the code in the question, that does the same job.
    #!/usr/bin/perl -w
    use strict;
    open (IN, "qq.in") || die "Cannot open qq.in: $!";
    while (my $line = <IN>) {
        chomp $line;
        print "$line\n";
        while ($line ne "") {
            $line =~ s/^\s+//;
            my $field = "";
            if ($line =~ m/^"((?:[^"\\]|\\.)*)"([^,]*)(?:,|$)/) {
                # Quoted field; a backslash is only consumed together with the character it escapes
                $field = "$1$2";
                $line = substr($line, length($field)+2);
                # Drop the backslash from escaped characters such as \"
                $field =~ s/\\(.)/$1/g;
            }
            elsif ($line =~ m/^([^,]*)(?:,|$)/) {
                # Unquoted field
                $field = "$1";
                $line = substr($line, length($field));
            }
            else {
                print "WTF?? ($line)\n";
                last;
            }
            $line =~ s/^,//;
            print " [$field]\n";
        }
    }
    close (IN);
It's under 30 non-blank, non-comment lines, compared with about 70 in the original. The original version is bigger than it needs to be by some margin. And I've not gone out of my way to reduce this code to the minimum possible.