Friday 23 May 2014

Awk: Patterns

The whole line is stored in $0, the first field in $1, the second field in $2, and so on. The general form of an awk call is:
awk 'Pattern {Action}' InputFile

Patterns include relational expressions, ranges and pattern-matching expressions.
The action is always enclosed in braces. On the command line, the whole awk statement, consisting of pattern and action, must be enclosed in single quotes so that the shell does not interpret it. You can process more than one file at once by giving several filenames.

An awk statement must consist of a pattern, an action, or both. If the pattern is omitted, the default pattern is used, which matches every line. If the action is omitted, the default action is used, which prints the whole line.
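The two defaults can be seen side by side in a small sketch. The three-line input is made up for illustration and is not the enzyme.txt file used elsewhere in this post.

```shell
# Hypothetical sample data (not the author's enzyme.txt).
data='alpha 1
beta 2
gamma 3'

# Pattern only: the default action prints the whole matching line.
pattern_only=$(printf '%s\n' "$data" | awk '/beta/')

# Action only: the default pattern matches every line.
action_only=$(printf '%s\n' "$data" | awk '{print $1}')

printf '%s\n' "$pattern_only"
printf '%s\n' "$action_only"
```

The first call prints only the line containing "beta"; the second prints the first field of all three lines.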

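You can change the field separator by supplying it immediately after the -F option. Enclosing it in quotes is only required when the separator is a character the shell would otherwise interpret, but it never hurts.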
$awk -F":" '/Freddy/ {print $1 " uses " $7}' /etc/passwd
Freddy uses /bin/bash
### Here the contents of fields 1 and 7 (the variables $1 and $7) are printed, separated by literal text. Literal text must be enclosed in double quotes.

With awk, you can write very complex programs. Usually, programs are saved in a file that can be called by the option -f.
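As a minimal sketch of the -f option, the following writes a one-rule program to a temporary file and runs it. The temporary file and the sample input are assumptions made for illustration only.

```shell
# Write a small awk program to a temporary file (the file name is arbitrary).
prog=$(mktemp)
cat > "$prog" <<'EOF'
# Print the first field of every line whose second field exceeds 1.
$2 > 1 { print $1 }
EOF

# Run the saved program with -f; the input here is hypothetical.
result=$(printf 'alpha 1\nbeta 2\ngamma 3\n' | awk -f "$prog")
rm -f "$prog"

printf '%s\n' "$result"
```

Note that the program inside the file is not wrapped in single quotes, because the shell never sees it.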

Except for the patterns BEGIN and END, patterns can be combined with the Boolean operators || (or), && (and) and ! (not).
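A short sketch of the three operators, using a made-up four-line table of names and numbers (hypothetical data, not taken from enzyme.txt):

```shell
# Hypothetical sample data: a name and a number per line.
data='amylase 12
lipase 7
trypsin 30
pepsin 4'

# && : both conditions must hold.
both=$(printf '%s\n' "$data" | awk '$1 ~ /ase$/ && $2 > 10 {print $1}')

# || : one true condition is enough.
either=$(printf '%s\n' "$data" | awk '$1 == "pepsin" || $2 > 20 {print $1}')

# !  : selects the records that do NOT match the pattern.
negated=$(printf '%s\n' "$data" | awk '!/sin/ {print $1}')

printf '%s\n' "$both" "$either" "$negated"
```

Here `both` holds amylase, `either` holds trypsin and pepsin, and `negated` holds amylase and lipase.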

Regular Expressions
Regular expressions must be enclosed in slashes (/.../). If you place an exclamation mark (!) before the first slash, all records not matching the regular expression will be chosen.

$awk '!/xxx/ {print $1}' enzyme.txt

Pattern-Matching Expressions
Sometimes you need to know whether a regular expression matches a particular field rather than anywhere in the record. In this case, you would use a pattern-matching expression. $n stands for any field variable such as $1 or $2.

$n~/re/ ### Is true if the field $n matches the regular expression re.
$n!~/re/ ### Is true if the field $n does not match the regular expression re.
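The difference between matching the record and matching a single field can be sketched as follows. The two-line input is invented for this example.

```shell
# Hypothetical sample data: name, class, number.
data='lipase hydrolase 7
hydrogenase oxidoreductase 30'

# /hydro/ alone is tested against the whole record, so both lines match.
record_match=$(printf '%s\n' "$data" | awk '/hydro/ {print $1}')

# $2 ~ /hydro/ is tested against field 2 only.
field_match=$(printf '%s\n' "$data" | awk '$2 ~ /hydro/ {print $1}')

# $2 !~ /hydro/ selects the records whose field 2 does not match.
field_nomatch=$(printf '%s\n' "$data" | awk '$2 !~ /hydro/ {print $1}')

printf '%s\n' "$record_match" "$field_match" "$field_nomatch"
```

The second line is selected by the record-level pattern only because "hydro" occurs in its first field.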

Relational Character Expressions
It is often useful to check whether a character string (a sequence of characters) in a certain field fulfills specified conditions. You might, for example, want to check whether a field is identical to a given string or comes before it in sorting order. In these cases, you would use relational character expressions. The character string s must always be enclosed in double quotes.

$n=="s"
$n!="s"
$n<"s" ### Character-by-character comparison of $n and s. The first characters of the two strings are compared, then the second characters, and so on. The result is true if $n is lexicographically smaller than s, for example "flag" < "fly" and "abc" < "abcd".
$n<="s"
$n>"s"
$n>="s"

In ASCII order, uppercase characters are lexicographically less than lowercase characters, and digits are less than alphabetic characters. Depending on the awk implementation and the locale, the order may differ.

digits < uppercase < lowercase
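The following sketch contrasts an exact string comparison with a regular expression match and with a "less than" comparison. The input is hypothetical, and LC_ALL=C is set for the ordering test as an assumption, so that plain ASCII order applies.

```shell
# Hypothetical sample data: one string per line.
data='abc
abcd
Abc'

# == is true only if the field is identical to the string.
exact=$(printf '%s\n' "$data" | awk '$1 == "abc"')

# A regular expression also matches when "abc" is only part of the field.
partial=$(printf '%s\n' "$data" | awk '$1 ~ /abc/')

# < compares character by character; in ASCII order "A" sorts before "a".
smaller=$(printf '%s\n' "$data" | LC_ALL=C awk '$1 < "abcd"')

printf '%s\n' "$exact" "$partial" "$smaller"
```

With this input, `exact` selects only abc, `partial` selects abc and abcd, and `smaller` selects abc and Abc.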

"==" requires a perfect match of strings and not a partial match.

Relational Number Expressions
Relational number expressions are similar to relational character expressions, except that they compare numeric values. The value v is written without quotes.

$n==v
$n!=v
$n<v
$n<=v
$n>v
$n>=v

Any expression that yields a numerical value is allowed on the left side of the relation, for example the result of a function:
$awk 'length($1)>6' enzyme.txt

Conversion of Numbers and Characters
If you need to force a number to be converted to a string, concatenate that number with the empty string "".

A string (that contains numerical characters) can be converted to a number by adding 0 to that string.
$awk '($2+0)>2' enzyme.txt
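Why the conversion matters can be shown with a small sketch: the same two fields compare differently as numbers and as strings. The input values are made up for illustration.

```shell
# Forced to numbers: 10 < 9 is false, so awk prints 0.
as_number=$(echo '10 9' | awk '{ print (($1 + 0) < ($2 + 0)) }')

# Forced to strings: "10" sorts before "9" because "1" < "9", so awk prints 1.
as_string=$(echo '10 9' | awk '{ print (($1 "") < ($2 "")) }')

# Adding 0 keeps only the leading numeric part of a string.
leading=$(echo '3kDa' | awk '{ print $1 + 0 }')

printf '%s\n' "$as_number" "$as_string" "$leading"
```

Concatenating with "" and adding 0 are the two idioms described above; the parentheses around each comparison keep print from treating > or < as a redirection.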

Ranges
A range of records can be specified using two patterns separated by a comma.
$awk '/En/,/Hy/' enzyme.txt
### In this example, all records between and including the first line matching the regular expression "En" and the first line matching the regular expression "Hy" are printed.

The range works like a switch. The first record matching the first pattern turns it on, and all records are passed to the action up to and including the record for which the second pattern matches or becomes true. Once switched off, no record matches until the first pattern turns the switch on again. If the "off switch", which is the second pattern, is not found, all records down to the end of the file match.
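The switch behaviour can be traced with a small sketch. The marker words START and STOP and the input lines are hypothetical.

```shell
# Hypothetical input: one closed range and one range that is never closed.
data='header
START
a
STOP
between
START
b'

# Prints from each START up to and including the next STOP;
# the second range has no STOP, so it runs to the end of the input.
ranges=$(printf '%s\n' "$data" | awk '/START/,/STOP/')

printf '%s\n' "$ranges"
```

The lines "header" and "between" lie outside any range and are not printed, while "b" is printed because the second range is still switched on when the input ends.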

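If both range patterns are the same, the switch is turned on and off by the same record, so every range is exactly one line long. Thus, the statement
awk '/AUTHOR/,/AUTHOR/' structure.pdb
prints only the lines that match AUTHOR, one line per range, just as the single pattern /AUTHOR/ would.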
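A range whose two patterns are identical opens and closes on the same record. A minimal sketch with made-up input (not a real PDB file) shows that each AUTHOR record forms its own one-line range:

```shell
# Hypothetical input with two AUTHOR records.
data='AUTHOR one
REMARK x
AUTHOR two'

# Each AUTHOR line starts a range and ends it immediately.
same=$(printf '%s\n' "$data" | awk '/AUTHOR/,/AUTHOR/')

printf '%s\n' "$same"
```

The REMARK line between the two AUTHOR records is not printed, because the switch is already off again when it is read.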

BEGIN and END
BEGIN and END do not deal with the input file at all. These patterns allow for initialization and cleanup actions, respectively. Both BEGIN and END must have actions and these must be enclosed in braces. There is no default action.

The BEGIN block is executed before the first line (record) of the input file is read. Likewise, the END block is executed after the last record has been read. Both blocks are executed only once.

We can assign the field separator either with the option -F or by assigning the variable FS in the BEGIN block.
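Both ways of setting the field separator can be compared in a short sketch. The passwd-style line is invented for illustration.

```shell
# Hypothetical passwd-style record with seven colon-separated fields.
line='freddy:x:1001:1001::/home/freddy:/bin/bash'

# Field separator given with the -F option.
with_option=$(echo "$line" | awk -F: '{print $1 " uses " $7}')

# Field separator assigned to FS in the BEGIN block, before any input is read.
with_begin=$(echo "$line" | awk 'BEGIN {FS = ":"} {print $1 " uses " $7}')

printf '%s\n' "$with_option" "$with_begin"
```

Both calls print the same line, "freddy uses /bin/bash".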

The special patterns BEGIN and END cannot be used in ranges or with any operators. However, an awk program can have multiple BEGIN and END blocks. They will be executed in the order in which they appear.
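The execution order can be checked with a small sketch that uses two BEGIN blocks, two END blocks and a hypothetical two-line input. The counter n is a variable introduced only for this example.

```shell
# BEGIN blocks run before any input, END blocks after the last record,
# each group in the order in which the blocks appear.
order=$(printf 'a\nb\n' | awk '
BEGIN { print "first BEGIN" }
BEGIN { print "second BEGIN" }
      { n++ }
END   { print "records: " n }
END   { print "done" }')

printf '%s\n' "$order"
```

The middle rule has no pattern, so it runs for every record and counts the two input lines.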















