Perl substr Function

The substr function is used to extract and return a substring from a string.

The substr function is one of the most important string functions in Perl. Its basic job is to retrieve a substring of a given string, but it does considerably more than that: it can also replace or modify part of a string in place.

You can use it to manipulate strings, either on its own or together with other string functions such as index or length.

A lot of string manipulation can be done with regular expressions, but in many cases the built-in string functions are more straightforward and take less time to execute.

The syntax forms of this function are as follows:

substr EXPR, OFFSET, LENGTH, REPLACEMENT
substr EXPR, OFFSET, LENGTH
substr EXPR, OFFSET
 
where:
  • EXPR is a string expression from which the substring will be extracted
  • OFFSET is an index from where the substring to be extracted starts
  • LENGTH is the length of the substring to extract
  • REPLACEMENT is a string that will replace the substring

As with other built-in functions, the parentheses around the arguments are optional.

As you can see above, some arguments are mandatory and others are optional.

You must mention at least the string expression (EXPR) and the position (OFFSET) from where the substring to be extracted starts.

Before reviewing the Perl substr function parameters, I want to remind you that in Perl the first character of a string has the index 0, the second 1, and so on.

In older versions of Perl you could change this base index through the special variable $[, which holds the index of the first character of a string and defaults to 0. Don’t rely on it: the variable has long been deprecated, and since Perl 5.30 assigning any value other than 0 to it is a fatal error.

And now let’s go back to our parameters.

OFFSET could be:

  • positive – the substring starts that far from the beginning of the string
  • negative – the substring starts that far from the end of the string
  • 0 – the substring starts at the first character of the string

LENGTH could be:

  • omitted – the function returns all the characters from the OFFSET position up to the end of the string
  • positive – the function returns at most LENGTH characters, beginning with the OFFSET position
  • negative – the function returns the substring starting at the OFFSET position but leaves that many characters off the end of the string
  • 0 – the returned substring is empty; no error or warning is raised
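The OFFSET and LENGTH rules above can be checked with a short sketch. The sample string is the one used in the examples of this tutorial; the variable names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = "John Peterson Anne Mike";

my $to_end    = substr($s, 14);      # LENGTH omitted: everything from index 14
my $from_end  = substr($s, -4);      # negative OFFSET: counted from the end
my $fixed     = substr($s, 5, 8);    # positive LENGTH: at most 8 characters
my $trim_tail = substr($s, 5, -5);   # negative LENGTH: leave 5 characters off the end
my $empty     = substr($s, 5, 0);    # zero LENGTH: empty string

print "[$to_end]\n";     # [Anne Mike]
print "[$from_end]\n";   # [Mike]
print "[$fixed]\n";      # [Peterson]
print "[$trim_tail]\n";  # [Peterson Anne]
print "[$empty]\n";      # []
```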

You can use the substr function to extract a substring starting from an index and having a given length. See the following code snippet:

#!/usr/local/bin/perl
 
use strict;
use warnings;
 
my $names = "John Peterson Anne Mike";
my $oneName = substr($names, 5, 8);
 
print "$names\n";
#it prints John Peterson Anne Mike
 
print "$oneName\n"; 
#it prints Peterson
In the above code the Perl substr function extracts a substring from the $names string variable, starting at index 5 and having a length of 8 characters. The returned substring is assigned to the $oneName scalar variable.

Please note that the value of $names didn’t change after the substr call.

You can use the substr function as an ordinary value, for example in comparisons, or as an lvalue, for example on the left side of an assignment. In the latter case, the value of the initial string will be modified.

See the next block of code for this:

#!/usr/local/bin/perl
 
use strict;
use warnings;
 
my $names = "Alin Fred John";
substr($names, 5, 4) = "Mary";
 
print "$names\n";
# it prints Alin Mary John
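
The four-argument form listed in the syntax section does the same replacement without the lvalue assignment, and it returns the text that was replaced. The sketch below reuses the string from the example above; the name "Elizabeth" is only an illustration that the replacement does not have to be the same length as the text it replaces.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $names = "Alin Fred John";

# replace 4 characters starting at index 5; the old text is returned
my $old = substr($names, 5, 4, "Mary");

print "$old\n";    # Fred
print "$names\n";  # Alin Mary John

# the replacement may be longer or shorter than the replaced part
substr($names, 5, 4, "Elizabeth");
print "$names\n";  # Alin Elizabeth John
```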

The following example shows you how to replace every occurrence of a substring (if it exists) with a different substring, using the index, length and substr built-in functions.

See the following code snippet:

#!/usr/local/bin/perl
 
use strict;
use warnings;
 
my $str  = '121 122 123 124 125 126 127';
my $find    = '12';
my $replace = '13';
 
my $pos = index $str, $find;
 
while ($pos != -1) {
  # using substr like a lvalue
  substr ($str, $pos, length $find) = $replace;
  $pos = index $str, $find, $pos + length $replace;
}
 
print "$str\n";
# it prints: 131 132 133 134 135 136 137
A while loop steps through the occurrences of $find in $str, resuming each search just past the replacement that was inserted. To help you understand this code, remember that:
 
  • the index function returns the position of the first occurrence of the substring at or after the optional start position, or -1 if the substring is not found
  • the length function returns the number of characters/bytes of an expression
  • the Perl substr function is used as an lvalue and has the syntax:
 
substr (EXPR, OFFSET, LENGTH) = REPLACEMENT
 
     where:
       o EXPR is a string expression from which the substring will be extracted
       o OFFSET is an index from where the substring to be extracted starts
       o LENGTH is the length of the substring to extract
       o REPLACEMENT is a string that will replace the substring

Another approach is to use the s/// substitution operator with a regular expression. You can rewrite the above example as follows:

#!/usr/local/bin/perl
 
use strict;
use warnings;
 
my $str  = '121 122 123 124 125 126 127';
 
$str =~ s/12/13/g ;
 
print "$str\n";
# it prints: 131 132 133 134 135 136 137
The /g global modifier is used to match all the occurrences of the substring that you want to replace. 
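
If the search and replacement texts are held in variables, as in the index/substr version, one possible sketch is the following. It assumes you want $find treated as a literal string, the way index treats it, so the pattern is wrapped in \Q...\E to quote any regex metacharacters it might contain.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str     = '121 122 123 124 125 126 127';
my $find    = '12';
my $replace = '13';

# \Q...\E quotes regex metacharacters in $find,
# so it is matched literally, like index does
$str =~ s/\Q$find\E/$replace/g;

print "$str\n";
# it prints: 131 132 133 134 135 136 137
```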

A flat file database consists of a number of records delimited by a separator, which in most cases is the newline ("\n") character, so that each record occupies a single line. Each record consists of one or more fields, either of fixed width or delimited by some special character such as whitespace or a comma.

For instance, let’s suppose that each record of the file customers.txt includes the fields Name, Phone and ZipCode, and that the entire file has only three records, as shown below:

Name              Phone          ZipCode
John Abbot        872-321-1212   55416
Clark Eliot       205-321-1200   20037
Johnny Randolph   345-767-3476   33702

 

Fixed-width columns

First, we’ll examine the case when the fields have fixed widths: Name takes 20 characters, Phone 12 and ZipCode 5. If we print the file, we’ll get something like this:

John Abbot          872-321-121255416
Clark Eliot         205-321-120020037
Johnny Randolph     345-767-347633702

The following block of code reads the file line by line using the while loop:

#!/usr/local/bin/perl
 
use strict;
use warnings;
 
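# three-argument open with a lexical filehandle
open my $fh, '<', 'customers.txt' or die "Cannot open customers.txt: $!";
while (<$fh>) {
  # chomp off the possible ending newline from $_
  chomp;
 
  # the first 20 characters hold the name
  my $name = substr($_, 0, 20);
  # trim the trailing spaces
  $name =~ s/ +$//;
 
  # the next 12 characters hold the phone number
  my $phone = substr($_, 20, 12);
  # delete all '-' characters
  $phone =~ s/-//g;
 
  # the last 5 characters hold the zip code
  my $zipCode = substr($_, 20 + 12, 5);
 
  print $name, ",", $phone, ",", $zipCode, "\n";
}
close $fh;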
Running the code snippet will produce the following output (printing each record on a line, with the fields separated by comma):
 
John Abbot,8723211212,55416
Clark Eliot,2053211200,20037
Johnny Randolph,3457673476,33702
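
As a point of comparison, the same fixed-width record can be cut up in one step with unpack instead of three substr calls. This is a different technique from the one shown above, offered only as a sketch: the record is built in memory with sprintf rather than read from customers.txt, and the template 'A20 A12 A5' mirrors the field widths assumed in this section. The A format also strips trailing spaces, so no extra trimming is needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# build one fixed-width record: 20 + 12 + 5 characters
my $line = sprintf("%-20s%-12s%-5s", 'John Abbot', '872-321-1212', '55416');

# A20 A12 A5: three space-padded text fields of the given widths
my ($name, $phone, $zipCode) = unpack('A20 A12 A5', $line);

# delete all '-' characters
$phone =~ s/-//g;

my $record = join(',', $name, $phone, $zipCode);
print "$record\n";
# it prints: John Abbot,8723211212,55416
```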
 
Columns delimited by a separator

The next example illustrates the case when the fields are delimited by a separator character such as a comma. In this case the content of our file will be:

John Abbot,872-321-1212,55416
Clark Eliot,205-321-1200,20037
Johnny Randolph,345-767-3476,33702
 
Because I want to show you how you can use the Perl substr function to access the fields of the record, I’ll not use the split function to do this (although it looks easier).

See the next sample code to see how you could implement it:

#!/usr/local/bin/perl
 
use strict;
use warnings;
 
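# three-argument open with a lexical filehandle
open my $fh, '<', 'customers.txt' or die "Cannot open customers.txt: $!";
while (<$fh>)
{
  # chomp off the possible ending newline
  chomp;
 
  # the name runs from the start up to the first comma
  my $pos1 = index($_, ",");
  my $name = substr($_, 0, $pos1);
 
  # the phone number sits between the first and second commas
  my $pos2 = index $_, ",", $pos1 + 1;
  my $phone = substr($_, $pos1 + 1, $pos2 - $pos1 - 1);
  # delete all - characters
  $phone =~ s/-//g;
 
  # with LENGTH omitted, substr returns everything after the second comma
  my $zipCode = substr($_, $pos2 + 1);
  print $name, ",", $phone, ",", $zipCode, "\n";
}
close $fh;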
The output is the same as in the previous example.
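
For comparison, here is roughly what the split version mentioned earlier might look like. It is only a sketch: the records are held in an in-memory list instead of being read from customers.txt, and it assumes that no field contains a comma.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# sample records, normally read from the file line by line
my @lines = (
  'John Abbot,872-321-1212,55416',
  'Clark Eliot,205-321-1200,20037',
);

my @records;
for my $line (@lines) {
  # split the record on commas into its three fields
  my ($name, $phone, $zipCode) = split /,/, $line;
  # delete all - characters
  $phone =~ s/-//g;
  push @records, join(',', $name, $phone, $zipCode);
}

print "$_\n" for @records;
```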

You can use the Perl sprintf function to pad a string with blanks on the left or right, or with zeros on the left. If you need to pad with a character other than blank or zero, you can use the substr and length functions.

Have a look at the following code snippet:

#!/usr/local/bin/perl
 
use strict;
use warnings;
 
# left padding with _
my $text = 'This is about Perl language';
my ($padLen, $padChar) = (35, '_');
substr( $text, 0, 0 ) =
        $padChar x ( $padLen - length( $text ) );
print "$text\n";
# it prints: ________This is about Perl language
 
# right padding with _
$text = 'This is about Perl language';
substr( $text, length( $text ), 0 ) =
        $padChar x ( $padLen - length( $text ) );
print "$text\n";
# it prints: This is about Perl language________

Here $padLen is the length to which you wish to pad the string, $text contains the string to be padded and $padChar contains the padding character.

The substr function is used here as an lvalue, modifying $text directly. The x operator repeats the padding character as many times as there are positions left to fill. This method doesn’t truncate $text.

The Perl substr function can be used together with other functions, such as pack and unpack, to make common conversions between number representations. In the examples below, substr helps to left-pad a string of digits with zeros.
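
For the blank and zero padding mentioned at the start of this section, a minimal sprintf sketch could look like this. The field widths (10 and 6) and the sample values are arbitrary choices made for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'Perl';

my $left  = sprintf('%10s',  $text);  # pad with blanks on the left
my $right = sprintf('%-10s', $text);  # pad with blanks on the right
my $zeros = sprintf('%06d',  42);     # pad a number with zeros on the left

print "[$left]\n";   # [      Perl]
print "[$right]\n";  # [Perl      ]
print "[$zeros]\n";  # [000042]
```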

The two examples below show how to convert from hexadecimal or binary format to decimal.

The first example is about the conversion from hexadecimal to decimal. See the following code:

#!/usr/local/bin/perl
 
use strict;
use warnings;
 
my $hex = '12ef3a';
my $dec = unpack ("N", (pack "H8", substr("0" x 8 . $hex, -8)));
print "$dec\n";
# it prints: 1240890
The expression "0" x 8 . $hex prepends 8 zeros to the $hex string, and substr, called with an offset of -8, returns the last 8 characters of the result, so the hex string ends up left-padded with zeros to exactly 8 digits. The pack function with the H8 template packs these 8 hex digits into 4 bytes, high nibble first. The unpack function with the N template then reads those 4 bytes as an unsigned 32-bit integer in big-endian (network) order.

The next example shows you how to convert from binary to decimal. It works for bit strings of up to 32 characters:

#!/usr/local/bin/perl
 
use strict;
use warnings;
 
my $bin = '100101110111100111010';
my $dec = unpack("N", pack("B32", substr("0" x 32 . $bin, -32)));
print "$dec\n";
# it prints: 1240890
As in the previous example, the Perl substr function is used to left-pad the string with zeros, this time to 32 characters.
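
For values that fit in a native integer, Perl’s built-in hex and oct functions give the same results without pack and unpack. This is a different technique from the one above, shown only for comparison; note that oct accepts a binary string when it carries the 0b prefix.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $hex = '12ef3a';
my $bin = '100101110111100111010';

my $dec_from_hex = hex($hex);         # hexadecimal string to number
my $dec_from_bin = oct("0b$bin");     # the 0b prefix makes oct read binary

print "$dec_from_hex\n";  # 1240890
print "$dec_from_bin\n";  # 1240890
```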

The following example shows you how to use the Perl substr function to split a string into an array of strings, where each string of the array has a given length.

See the following code snippet:

#!/usr/bin/perl
 
use warnings;
use strict;
 
my $str = 'abcdefghijklmnopqrstuvwxyzabcdefgh';
my ($chunkSize, $strLength) = (10, length $str);
my @array;
 
for(my $i = 0; $i < $strLength; $i += $chunkSize) {
  push @array, substr ($str, $i, $chunkSize);
}
 
print "@array\n";
A for loop splits the initial string into an array. Inside the loop, the substr function extracts consecutive portions of $chunkSize characters and the push function appends each one to @array. The last string of the array can be shorter than $chunkSize.

Finally, the array is printed, with its elements separated by spaces.

If you run this code, you’ll get the following output:

abcdefghij klmnopqrst uvwxyzabcd efgh
 
An alternative to the Perl substr function is to use the regex engine with the match operator, as you can see in the example below:
 
#!/usr/bin/perl
 
use warnings;
use strict;
 
my $str = 'abcdefghijklmnopqrstuvwxyzabcdefgh';
my @array = $str =~ /(.{1,10})/g;
 
print "@array\n";
The /g global modifier makes the pattern match as many times as possible. Because of the =~ binding operator, the match is performed against the $str variable. The parentheses define a single capturing group, so each match captures up to 10 characters.

Because the result is assigned to @array, the match runs in list context, where a /g match returns the list of all the substrings captured by the group; these become the elements of @array. The . (dot) matches any single character except a newline, and the quantifier {1,10} means that it must match at least once, but no more than 10 times.

The output is the same as before.
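
A third variant stays with substr but uses its four-argument form to consume the string from the front: each call returns the first chunk and replaces it with the empty string. The sketch below works on a copy (the hypothetical $rest variable) so that the original $str is left untouched; the chunk size of 10 matches the examples above.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $str  = 'abcdefghijklmnopqrstuvwxyzabcdefgh';
my $rest = $str;   # working copy that gets consumed
my @array;

while (length $rest) {
  # return the first 10 characters (or fewer at the end)
  # and remove them from $rest in the same call
  push @array, substr($rest, 0, 10, '');
}

print "@array\n";
# it prints: abcdefghij klmnopqrst uvwxyzabcd efgh
```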

If you need to extract a substring delimited by two other substrings from a string, you can use both the index and substr built-in functions, as you can see in the following example:

#!/usr/local/bin/perl
 
use strict;
use warnings;
 
my $url =
  'http://www.misc-perl-info.com/perl-howto-tutorials.html';
my ($str1, $str2, $pos) = ('http://', '/');
 
 
if(($pos = index(lc $url, lc $str1)) > -1) {
  $url = substr ($url, $pos + length $str1);
  if(($pos = index(lc $url, lc $str2)) > -1) {
    $url = substr ($url, 0, $pos)
  }
}
 
print "$url\n";
# it prints "www.misc-perl-info.com"
From the above url, we need to extract the text between 'http://' and the first  '/' in that string.
 
  • the first index function stores in the $pos variable the position of $str1 in the $url string
  • the first substr function removes from $url everything up to and including $str1
  • the second index function stores in the $pos variable the position of $str2 in the new $url string
  • the second substr function keeps only the part of $url before position $pos, dropping everything from $str2 to the end of the string
  • finally, $url is printed

To make the index function search case-insensitively, both of its arguments are lowercased with the lc function.
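
For comparison, the same extraction can be sketched with a single regular expression. This is an alternative technique, not the index/substr method described above: the /i modifier takes the place of lc, and the character class [^/]+ stops at the first '/' after the prefix. The $host variable is a name introduced only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $url = 'http://www.misc-perl-info.com/perl-howto-tutorials.html';

# capture everything after 'http://' up to (not including) the next '/'
my ($host) = $url =~ m{http://([^/]+)}i;

print "$host\n" if defined $host;
# it prints: www.misc-perl-info.com
```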

String functions are often faster than regular expressions because they have no metacharacters to interpret and they don’t set any of the capture variables.

But sometimes you can combine regular expressions with the string functions to add functionality to your code.

The following example shows you a way to use the Perl substr function with the =~ binding operator:

#!/usr/local/bin/perl
 
use strict;
use warnings;
 
# initialize the original string
my $str = "one or two or three or four";
 
substr($str, -17) =~ s/or/and/ig;
 
print "$str\n";
Here the substr call is an lvalue, which means that the selected portion of the $str variable (its last 17 characters) can be modified in place by the s/// substitution bound to it with the =~ operator.

The effect is to replace the text or with the text and wherever it occurs within the last 17 characters of $str, leaving the rest of the string untouched.

The output is as follows:

one or two and three and four
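
To see what the one-liner does, it can be spelled out in three separate steps: copy the tail of the string, run the substitution on the copy, and assign the result back through substr. This is only an explanatory sketch; the $tail variable is introduced for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "one or two or three or four";

# 1. copy the last 17 characters: " or three or four"
my $tail = substr($str, -17);

# 2. run the substitution on the copy only
$tail =~ s/or/and/ig;

# 3. write the modified copy back over the last 17 characters
substr($str, -17) = $tail;

print "$str\n";
# it prints: one or two and three and four
```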