On Perl and data manipulation

It is a common belief that Perl is the language of choice for data processing and manipulation, and I couldn't agree more: I have turned to it every time I needed some serious data processing, and got the best results at each undertaking.

Back in '99, however, when I was looking for a way to automate the calculation of my monthly online time from my dialup provider's access logs (a task for which assembly just wouldn't cut it), I had no choice but to take this claim about Perl and data manipulation on faith and have a go at the language without knowing for certain.

Today, quite a while and many data manipulation scripts later, and quite coincidentally while calculating my ISP's total downtime from my server's access logs, I cannot help but feel pleased with the power Perl gives you for data manipulation, so I felt like celebrating this wonderful side of Perl myself.

Now, so as not to make this article a dry read, and because probably nobody likes staring at a program running with no progress display for hours (or even minutes) on end, I am going to explain two simple progress display tricks for your Perl scripts.

Roughly, the trick consists in using "\r" (carriage return) to write over the same line again and again, while also disabling output buffering where necessary.
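Stripped to its bare essentials, the trick looks like this (a minimal sketch with a made-up five-step loop, just to show the "\r" plus `$| = 1` combination in isolation):

```perl
use strict;
use warnings;

# Disable output buffering so each print reaches the terminal
# immediately instead of waiting for a newline.
$| = 1;

for my $i (1 .. 5) {
    # "\r" moves the cursor back to the start of the line,
    # so the next print overwrites the previous one.
    print "step $i of 5\r";
    select undef, undef, undef, 0.2;    # sub-second pause, core Perl
}
print "\n";    # move past the progress line when we are done
```

Run it in a terminal and you will see a single line counting up in place rather than five lines scrolling by.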

The first example below is the simpler one, but it sacrifices memory for that simplicity: it slurps the whole file into an array, so do not use it for files bigger than a few dozen megabytes or loading will be slow and memory usage intensive. That aside, it should not introduce any other slowdowns, provided you write your script in a speed-conscious manner.

use POSIX;

# disable output buffering
$| = 1;

open(INFILE, "< $infile") or die "$infile file not found";
@data = <INFILE>;
close INFILE;

$total = scalar @data;
$left  = $total;
foreach $line (@data) {
    # ... process $line here ...
    $proc = floor((($total - $left) / $total) * 100);
    print "$left more lines to process ($proc% processed)\r";
    $left--;
}
print "\n";

In this second script we do not buffer the whole file into memory, so loading speed will be fine even though we first have to count the file's total number of lines with clines() before starting to process it.

use POSIX;

# disable output buffering
$| = 1;

# counting the lines in the file
print "Reading $infile...";
$lines = clines($infile);
print "done ($lines lines)\n";

# processing data
open(INFILE, "< $infile") or die "$infile file not found";
$out = 0;
while (<INFILE>) {
    # ... process $_ here ...
    $proc = floor(($out / $lines) * 100);
    print $lines - $out, " more lines to process ($proc% processed)\r";
    $out++;
}
close INFILE;
print "\n";

sub clines {
    my ($filename) = @_;
    my $lines = 0;
    my $buffer;
    open(FILE, $filename) or die "Can't open `$filename': $!";
    while (sysread FILE, $buffer, 4096) {
        $lines += ($buffer =~ tr/\n//);
    }
    close FILE;
    return $lines;
}
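One refinement worth mentioning: with output buffering disabled, printing to the terminal on every single line can itself slow down a tight loop over millions of lines. A common fix is to refresh the display only when the percentage actually changes. Here is a sketch of that idea, using a hypothetical line count purely for illustration:

```perl
use strict;
use warnings;
use POSIX qw(floor);

$| = 1;

# Hypothetical total, standing in for what clines() would return.
my $lines = 1_000_000;

# Roughly one screen update per percent of progress.
my $step = $lines > 100 ? floor($lines / 100) : 1;

for my $out (0 .. $lines - 1) {
    # ... process the current line here ...
    if ($out % $step == 0) {
        my $proc = floor(($out / $lines) * 100);
        printf "%d more lines to process (%d%% processed)\r",
               $lines - $out, $proc;
    }
}
print "\n";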
