AWK is a programming language for processing and manipulating text on Unix-based systems. You can use it to search, filter, and format the contents of an input file or standard input.
AWK is an abbreviation taken from the initials of its developers Alfred Aho, Peter Weinberger, and Brian Kernighan.
AWK reads each line in a text file or standard input and splits it into fields or columns separated by a delimiter, like a whitespace or a comma, and allows you to perform operations on these fields.
An AWK program consists of two main parts: patterns and actions. An action is the operation we perform, and a pattern selects the lines the action is applied to, using a regular expression or another condition such as a comparison.
AWK Basic Syntax
The basic syntax for AWK is:
awk [options] 'pattern {action}' filename
What are Actions in AWK
Actions are operations that you want to perform on the lines that match the specified pattern.
Example
We have a file named data.txt with the following content that you can use to practice along with me:
Alice 28
Bob 35
Charlie 42
David 29
Emily 31
Williams 46
Mark 42
Luca 30
The first column contains all the names. How can we print the first column in this file?
awk '{ print $1 }' data.txt
In the above example, no pattern is specified, which means the action is performed on every line. Here the action is the print statement, which prints the first column (represented by $1) of each line.
Output:
Alice
Bob
Charlie
David
Emily
Williams
Mark
Luca
In AWK, columns (fields) can be separated by any character. By default, AWK takes whitespace (spaces or tabs) as the field separator.
If the columns are separated by a comma, we can use awk with the -F option to specify our delimiter:
awk -F ',' '{print $1}' data.txt
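To try the -F option without a separate file, here is a minimal sketch that pipes two comma-separated lines into awk. The sample lines and the shell variable name ages are made up for illustration.

```shell
# Two made-up CSV lines are piped in; -F',' makes the comma the field separator.
ages=$(printf 'Alice,28\nBob,35\n' | awk -F',' '{ print $2 }')
printf '%s\n' "$ages"
```

With the comma as the separator, $2 refers to the text after the comma, so this prints 28 and 35 on separate lines.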
What are Patterns in AWK
A pattern is a condition, most often a regular expression, that selects the lines on which the action is performed. Without a pattern, the action is performed on every line.
Example
As mentioned, patterns are going to allow us to perform the action on the lines filtered by the pattern.
In this example, let's look for the line that contains the name ‘Bob’ and print it out:
awk '/Bob/ {print}' data.txt
In the above example, the pattern /Bob/ is a regular expression, so the print action is applied only to the lines that match it.
Output:
Bob 35
To learn more about Regular Expressions, check out this article Regular Expressions in Linux.
Example
Print everything between the lines containing the words ‘Charlie’ and ‘Emily’.
awk '/Charlie/, /Emily/ {print}' data.txt
Output:
Charlie 42
David 29
Emily 31
Example
Print the names of people whose age is above 30:
awk '$2 > 30 {print $1}' data.txt
The second column, which is the age, can be represented by $2. In the above example, we specified our pattern as $2 > 30, which filters on the second column. Then we applied the action, print, to print the first column, which is the name. In other words, for each line where the 2nd field is greater than 30, we print the first field.
Output:
Bob
Charlie
Emily
Williams
Mark
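Two more pattern forms are worth a quick sketch: a pattern with no action at all, and a regular expression tested against a single field with the ~ operator. The three sample lines below are copied from data.txt and held in a shell variable named sample, which is just a name chosen for this example.

```shell
# Three lines from data.txt, kept inline so the example is self-contained.
sample='Bob 35
Mark 42
Luca 30'

# Pattern only, no action: awk prints the whole matching line by default.
over_40=$(printf '%s\n' "$sample" | awk '$2 > 40')

# Regular expression tested against the first field only, using the ~ operator.
starts_with_m=$(printf '%s\n' "$sample" | awk '$1 ~ /^M/ { print $1 }')

printf '%s\n%s\n' "$over_40" "$starts_with_m"
```

The first command prints Mark 42 because only that line has a second field above 40; the second prints Mark because only that name starts with M.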
How to use variables in AWK
As in other programming languages, variables are used to store values that can be used throughout the script.
As in Python and Perl, variables in AWK are not declared with a type and can hold either numbers or strings.
variable_name = value;
Example
awk '{name=$1; age=$2; print name, "is", age, "years old"}' data.txt
The first part of the AWK command, {name=$1; age=$2;, assigns the values of the first and second columns to the variables name and age, respectively.
The second part of the command, print name, "is", age, "years old"}, prints the values of the name and age variables, along with some text to make the output more readable.
Output:
Alice is 28 years old
Bob is 35 years old
Charlie is 42 years old
David is 29 years old
Emily is 31 years old
Williams is 46 years old
Mark is 42 years old
Luca is 30 years old
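AWK variables also do not need to be initialized: an unset variable behaves as 0 in a numeric context and as an empty string in a string context. A minimal sketch of a counter that relies on this is shown here; the variable names n and count are arbitrary, and the END block (which runs after the last line) is used only to print the result.

```shell
# n is never initialized; it starts out as 0 the first time n++ runs.
# "n + 0" forces a numeric result, so 0 is printed even if no line matches.
count=$(printf 'Alice 28\nBob 35\nCharlie 42\n' | awk '$2 > 30 { n++ } END { print n + 0 }')
printf '%s\n' "$count"
```

Two of the three sample lines have an age above 30, so this prints 2.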
AWK also has a number of built-in variables that you can use to reference specific information about the text you’re working with. Some of those variables are:
- NR - The number of the current record being processed. This variable is incremented automatically by AWK as it reads each record in the input file.
- NF - The number of fields in the current record. A field is a section of the record that is separated by a delimiter (such as a space or a comma).
- FS - The field separator used to divide each record into fields. By default, fields are split on whitespace (spaces and tabs), but you can change it to another character or a regular expression.
- RS - The record separator used to divide the input into records. By default, this is set to a newline character, but you can change it to another character.
- FILENAME - The name of the current input file being processed.
- OFS - The output field separator used to separate fields in the output. By default, this is set to a space character, but you can change it to any string you like.
- ORS - The output record separator used to separate records in the output. By default, this is set to a newline character, but you can change it to any string you like.
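As a small illustration of these variables, the sketch below prints the record number, the field count, and the first field of each line, joined by a comma set through OFS. The two input lines come from data.txt; the shell variable name report is made up.

```shell
# BEGIN runs before any input is read, so OFS is set before the first print.
# print joins its comma-separated arguments with OFS.
report=$(printf 'Alice 28\nBob 35\n' | awk 'BEGIN { OFS = "," } { print NR, NF, $1 }')
printf '%s\n' "$report"
```

This prints 1,2,Alice and then 2,2,Bob: each line is one record (NR) with two fields (NF).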
How to use Control Flow in AWK
AWK supports control flow statements such as if statements, while loops, and for loops, which let you handle more complex text processing.
How to use If-else statements
if statements check whether a condition is true before executing some code. If the condition is true, the code inside the if block is executed. Otherwise, the code in the else block is executed. It’s possible to have an if statement without an else.
if-else syntax:
if (condition)
{
    statements;
}
else
{
    statements;
}
Example
If a person’s age is over 30, print their name:
awk '{ if ($2 > 30) {print $1}}' data.txt
Here is a cleaner way of writing it, using indentation:
awk '{
    if ($2 > 30) {
        print $1
    }
}' data.txt
Output:
Bob
Charlie
Emily
Williams
Mark
That is an if statement without an else. Let’s add an else to it.
If a person’s age is over 30, state that they’re over 30; otherwise, state that they’re under 30:
awk '{ if ($2 > 30) {print $1, "is over 30"} else {print $1, "is under 30"} }' data.txt
Here is a cleaner way of writing it with indentation:
awk '{
    if ($2 > 30) {
        print $1, "is over 30"
    }
    else {
        print $1, "is under 30"
    }
}' data.txt
Output:
Alice is under 30
Bob is over 30
Charlie is over 30
David is under 30
Emily is over 30
Williams is over 30
Mark is over 30
Luca is under 30
In that output, Luca, who is exactly 30, is reported as under 30. We can handle cases like this by chaining conditions with else if, like the following:
awk '{
    if ($2 > 30) {
        print $1, "is over 30 years old"
    }
    else if ($2 == 30) {
        print $1, "is 30 years old"
    }
    else {
        print $1, "is under 30"
    }
}' data.txt
Output:
Alice is under 30
Bob is over 30 years old
Charlie is over 30 years old
David is under 30
Emily is over 30 years old
Williams is over 30 years old
Mark is over 30 years old
Luca is 30 years old
How to use While Loops
While loops execute a set of statements repeatedly as long as a condition is true.
Here is the basic syntax:
while (condition)
{
    statements;
}
Example:
Print numbers 1 to 10 using a while loop
awk 'BEGIN { i=1; while (i<=10) { print i; i++; } }'
Here, we initialized a variable i and set it to 1; the condition checks if the value of i is less than or equal to 10. As long as the condition is true, the statements inside the loop are executed.
Output:
1
2
3
4
5
6
7
8
9
10
This example also introduces the keyword BEGIN. A BEGIN block runs before any input is read, so it lets you perform initial processing that doesn't depend on the input file. That is also why the command above needs no filename.
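BEGIN is also handy in front of ordinary rules, for example to print a header line before the data. The sketch below assumes two lines from data.txt as input; the header text "Name Age" and the shell variable name table are made up for illustration.

```shell
# The BEGIN block runs once before the first line; the second block runs per line.
table=$(printf 'Alice 28\nBob 35\n' | awk 'BEGIN { print "Name Age" } { print $1, $2 }')
printf '%s\n' "$table"
```

The header appears once at the top, followed by one output line per input line.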
How to use For loops
The for loop is one of the most commonly used loops in programming. In AWK, the for loop syntax is as follows:
for (initialization; condition; increment/decrement)
{
    statements;
}
Example
Print numbers 1 to 10 using a for loop:
awk 'BEGIN {
    for (i=1; i<=10; i++) {
        print i
    }
}'
Output:
1
2
3
4
5
6
7
8
9
10
You can also use break and continue statements inside these loops, but I’m not going to dive deep into them here.
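For a quick taste of both statements, here is a minimal sketch: continue skips the rest of the current iteration, and break leaves the loop entirely. The loop bounds and the shell variable name odds are arbitrary choices for this example.

```shell
odds=$(awk 'BEGIN {
    for (i = 1; i <= 10; i++) {
        if (i % 2 == 0) continue   # skip even numbers
        if (i > 7) break           # leave the loop once i exceeds 7
        print i
    }
}')
printf '%s\n' "$odds"
```

This prints 1, 3, 5, and 7: even values are skipped by continue, and the loop stops at 9 because of break.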
Useful AWK options you must know
There are a number of options that you should know when using AWK.
-F option: This is one of the most useful options, and you will need it quite often to specify the field separator of your input file. By default, AWK takes whitespace (spaces or tabs) as the delimiter.
If we have comma-separated values like this:
firstname,lastname,age
You can use the following command to specify the comma as the field separator:
awk -F',' '{print $1}' filename
-v option: This option lets you define a variable on the command line and pass it to your awk script.
For example:
awk -v num=10 'BEGIN {print num}'
We defined a variable num and used it inside our awk script.
Output:
10
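A common use of -v is passing a shell variable into awk. The sketch below assumes a shell variable min_age and an awk variable limit, both hypothetical names, and filters three lines from data.txt.

```shell
min_age=40   # hypothetical threshold chosen for this sketch
# -v copies the shell value into the awk variable "limit" before the script runs.
names=$(printf 'Alice 28\nCharlie 42\nWilliams 46\n' | awk -v limit="$min_age" '$2 > limit { print $1 }')
printf '%s\n' "$names"
```

Only Charlie and Williams are printed, since their ages are above the threshold of 40.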
-f option: You can write your awk script in a separate file and execute it with the -f option.
For example, let’s make a file called script.awk and add the following script, which is the for loop we saw earlier:
BEGIN {
    for (i=1; i<=10; i++) {
        print i
    }
}
We can now execute this script using the -f option like this:
awk -f script.awk
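A script file can also contain ordinary pattern-action rules and be applied to input. The sketch below writes a one-rule script to a temporary file and runs it against two sample lines; it assumes the mktemp utility is available, and the variable names script and result are made up.

```shell
# Write a one-rule awk script to a temporary file (assumes mktemp is available).
script=$(mktemp)
cat > "$script" <<'EOF'
$2 > 30 { print $1 }
EOF

# Run the script file against two sample lines, then clean up.
result=$(printf 'Alice 28\nBob 35\n' | awk -f "$script")
rm -f "$script"
printf '%s\n' "$result"
```

This prints Bob, the only sample line with an age above 30.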
A Few Useful Use Cases of AWK
How to replace text using awk
You can use AWK to replace a string with another using the gsub function.
For example, replacing “Bob” with “Jack” in the data.txt file:
awk '{gsub(/Bob/,"Jack"); print}' data.txt
Output:
Alice 28
Jack 35
Charlie 42
David 29
Emily 31
Williams 46
Mark 42
Luca 30
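Two details are worth a small sketch: gsub replaces every match in the line and returns the number of replacements, while the related sub function replaces only the first match. Neither changes the input file; they only change what awk prints. The input line below is made up so that it contains two matches.

```shell
# gsub replaces all matches and returns how many replacements it made.
all=$(printf 'Bob met Bob 35\n' | awk '{ n = gsub(/Bob/, "Jack"); print n, $0 }')

# sub replaces only the first match on the line.
first=$(printf 'Bob met Bob 35\n' | awk '{ sub(/Bob/, "Jack"); print }')

printf '%s\n%s\n' "$all" "$first"
```

The first line of output is 2 Jack met Jack 35, and the second is Jack met Bob 35.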
How to filter text based on multiple conditions
For example, we can look for the name “Bob” in data.txt whose age (second column) is greater than 30:
awk '/Bob/ && $2 > 30 {print}' data.txt
Output:
Bob 35
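Conditions can also be combined with || (or), in the same way as &&. As a minimal sketch, the command below selects people younger than 30 or older than 45 from three data.txt lines; the thresholds and the variable name outliers are arbitrary.

```shell
# A line is selected when either comparison is true.
outliers=$(printf 'Alice 28\nBob 35\nWilliams 46\n' | awk '$2 < 30 || $2 > 45 { print $1 }')
printf '%s\n' "$outliers"
```

Alice (28) and Williams (46) are printed; Bob (35) satisfies neither condition.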
How to calculate the sum of a column
We can use an AWK variable to accumulate the value of a specific field across all rows.
For example, to sum the second column in data.txt, we can use:
awk '{sum = sum + $2} END {print sum}' data.txt
The END keyword specifies a block of code that is executed once all the lines of input have been processed. It is the counterpart of the BEGIN keyword that we discussed earlier.
Also, instead of using sum = sum + $2, we can use the shorter form sum += $2:
awk '{sum += $2} END {print sum}' data.txt
Output:
283
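The same accumulate-then-report idea works for other aggregates. As a sketch, the command below tracks the largest age and the matching name; the variable names max, name, and oldest are made up, and "+ 0" is used to make sure the value is treated as a number.

```shell
# On the first line, or whenever a larger age is seen, remember the age and name.
oldest=$(printf 'Alice 28\nWilliams 46\nMark 42\n' |
    awk 'NR == 1 || $2 > max { max = $2 + 0; name = $1 } END { print name, max }')
printf '%s\n' "$oldest"
```

For these three lines the result is Williams 46.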
How to format data
We can use AWK’s printf statement to specify a format for the output. Let’s say we want to re-format data.txt to use commas instead of whitespace as the field separator.
awk '{printf "%s,%s\n", $1, $2}' data.txt
Explanation:
The "%s,%s\n" is our format string. Each %s is a placeholder for a string value; the two placeholders are filled with the first and second columns ($1 and $2) respectively, and the \n adds a newline. That is how we get the following output:
Alice,28
Bob,35
Charlie,42
David,29
Emily,31
Williams,46
Mark,42
Luca,30
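printf can also pad values to a fixed width, which helps line up columns. In the sketch below, %-10s left-aligns the name in 10 characters and %3d right-aligns the age in 3 characters; the widths, the | separator, and the variable name aligned are arbitrary choices for illustration.

```shell
# %-10s pads the name on the right; %3d pads the number on the left.
aligned=$(printf 'Alice 28\nCharlie 42\n' | awk '{ printf "%-10s|%3d\n", $1, $2 }')
printf '%s\n' "$aligned"
```

Both output lines have the | in the same column, regardless of the length of the name.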
How to calculate the average of a column
Similar to how we calculated the sum of a column, we can calculate the average of a column by dividing the sum by the number of lines.
awk '{sum+=$2} END {print "Average:", sum/NR}' data.txt
As discussed earlier, NR is a built-in variable that holds the number of the current record. Inside the END block, all input has been read, so NR equals the total number of records (rows) in the input.
Output:
Average: 35.375
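One caveat: if the input is empty, NR is 0 and the division fails. A minimal sketch of a guarded version is shown below; the "No records" message and the variable names avg and empty are made up for this example.

```shell
# Only divide when at least one record was read.
guarded='{ sum += $2 } END { if (NR > 0) { print "Average:", sum / NR } else { print "No records" } }'

avg=$(printf 'Alice 28\nBob 35\n' | awk "$guarded")
empty=$(printf '' | awk "$guarded")

printf '%s\n%s\n' "$avg" "$empty"
```

With the two sample lines the result is Average: 31.5, and with empty input the script prints No records instead of attempting a division by zero.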
Conclusion
AWK is a programming language that can be used to filter, search, manipulate, and process text files or standard input. As an IT engineer, you will find this tool useful for filtering log files, formatting text, and manipulating large text data. In this blog post, we covered the basics you need to start using AWK, from its syntax to common use cases.