How to use AWK in Linux

AWK is a programming language available on Unix-like systems for processing and manipulating text. You can use it to search, filter, and format the contents of an input file or standard input.

AWK is an abbreviation taken from the initials of its developers Alfred Aho, Peter Weinberger, and Brian Kernighan.

AWK reads a text file or standard input one line at a time, splits each line into fields (columns) separated by a delimiter such as whitespace or a comma, and lets you perform operations on those fields.

An AWK program consists of two main parts: patterns and actions. An action is the operation to perform, and a pattern selects which lines the action applies to. A pattern is often a regular expression, but it can also be a comparison such as $2 > 30.

AWK Basic Syntax

The basic syntax for AWK is:

awk [options] 'pattern {action}' filename

What are Actions in AWK

Actions are operations that you want to perform on the lines that match the specified pattern.

Example

We have a file named data.txt with the following content that you can use to practice along with me:

Alice 28
Bob 35
Charlie 42
David 29
Emily 31
Williams 46
Mark 42
Luca 30

The first column contains all the names. How can we print the first column in this file?

awk '{ print $1 }' data.txt

In the above example, no pattern is specified, so the action is performed on every line. The action here is the print statement, which prints the first column (represented by $1) of each line.

Output:

Alice
Bob
Charlie
David
Emily
Williams
Mark
Luca

In AWK, columns or fields can be separated by any character. By default, AWK uses whitespace (spaces or tabs) as the field separator.

If the columns are separated by a comma, we can use awk with the -F option to specify our delimiter:

awk -F ',' '{print $1}' data.txt
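As a quick sketch of the -F option in action, the snippet below pipes two made-up comma-separated rows into awk on standard input. The rows and the shell variable name names are illustrative assumptions and are not part of data.txt:

```shell
# Two hypothetical comma-separated rows, piped in instead of read from a file.
names=$(printf '%s\n' 'Alice,28' 'Bob,35' | awk -F ',' '{ print $1 }')

# Show the first field of each row.
printf '%s\n' "$names"
```

With the comma as the separator, $1 is the text before the first comma on each line.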

What are Patterns in AWK

A pattern is a condition that selects the lines on which the action is performed. It is most often a regular expression, but it can also be a comparison or a range, as the examples below show. Without a pattern, the action is performed on every line.

Example

As mentioned, a pattern restricts the action to the lines it matches.

In this example, let's look for the line that contains the name ‘Bob’ and print it out:

awk '/Bob/ {print}' data.txt

In the above example, the pattern /Bob/ is a regular expression, so the print action is applied only to the lines that match it.

Output:

Bob 35
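A pattern like /Bob/ matches anywhere on the line. As a hedged sketch, the ~ operator can limit a regular expression to a single field instead. The snippet below recreates the rows of data.txt in a shell variable (the variable names data and out are assumptions made for this example) and keeps only the lines whose first field starts with A, B, or C:

```shell
# Same rows as data.txt, held in a variable so the snippet is self-contained.
data=$(printf '%s\n' 'Alice 28' 'Bob 35' 'Charlie 42' 'David 29' \
                     'Emily 31' 'Williams 46' 'Mark 42' 'Luca 30')

# $1 ~ /regex/ tests only the first field; ^ anchors the match to its start.
out=$(printf '%s\n' "$data" | awk '$1 ~ /^[A-C]/ { print }')
printf '%s\n' "$out"
```

This prints the lines for Alice, Bob, and Charlie.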

To learn more about regular expressions, check out the article “Regular Expressions in Linux”.

Example

Print every line from the first line containing ‘Charlie’ through the next line containing ‘Emily’, inclusive:

awk '/Charlie/, /Emily/ {print}' data.txt

Output:

Charlie 42
David 29
Emily 31

Example

Print the names of people whose age is above 30:

awk '$2 > 30 {print $1}' data.txt

The second column, the age, is represented by $2. In the above example, the pattern $2 > 30 selects lines by the value of the second column, and the print action prints the first column, the name. In other words, for each line where the second field is greater than 30, we print the first field.

Output:

Bob
Charlie
Emily
Williams
Mark

How to use variables in AWK

As in other programming languages, variables store values that can be used throughout the script.

As in Python and Perl, variables in AWK are not declared with a type, and the same variable can hold a number or a string.

variable_name = value;

Example

awk '{name=$1; age=$2; print name, "is", age, "years old"}' data.txt

The first part of the AWK command, '{name=$1; age=$2;', assigns the values of the first and second columns to the variables name and age, respectively.

The second part of the command, 'print name, "is", age, "years old"}', prints the values of the name and age variables along with some text to make the output more readable. The commas in the print statement separate the items with a space in the output.

Output:

Alice is 28 years old
Bob is 35 years old
Charlie is 42 years old
David is 29 years old
Emily is 31 years old
Williams is 46 years old
Mark is 42 years old
Luca is 30 years old

AWK also has a number of built-in variables that give you information about the text you’re working with. Some of them are:

  • NR - The number of the current record being processed. This variable is incremented automatically by AWK as it reads each record in the input file.
  • NF - The number of fields in the current record. A field is a section of the record that is separated by a delimiter (such as a space or a comma).
  • FS - The field separator used to divide each record into fields. By default it is a single space, which AWK treats as any run of spaces and tabs, but you can set it to another character or to a regular expression.
  • RS - The record separator is used to divide records in the input file. By default, this is set to a newline character, but you can change it to any character you like.
  • FILENAME - The name of the current input file being processed.
  • OFS - The output field separator is used to separate fields in the output. By default, this is set to a space character, but you can change it to any character you like.
  • ORS - The output record separator is used to separate records in the output. By default, this is set to a newline character, but you can change it to any character you like.
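A small sketch of a few of these variables working together: the snippet sets OFS to a comma in a BEGIN block (BEGIN, covered later in this post, runs before any input is read) and prints NR and NF next to each field. The two input rows and the shell variable out are assumptions for the example:

```shell
# NR is the record number, NF the number of fields; OFS joins the printed items.
out=$(printf '%s\n' 'Alice 28' 'Bob 35' |
      awk 'BEGIN { OFS = "," } { print NR, NF, $1, $2 }')
printf '%s\n' "$out"
```

Each output line has the form record number, field count, name, age, for example 1,2,Alice,28.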

How to use Control Flow in AWK

AWK supports control flow statements such as if statements, while loops, and for loops, which let a single action make decisions and repeat work.

How to use If-else statements

An if statement checks whether a condition is true before executing some code. If the condition is true, the code inside the if block is executed; otherwise, the code in the else block is executed. An if statement does not need an else.

if-else syntax:

if (condition)
{
   statements;
}
else
{
   statements;
}

Example

If a person’s age is over 30, print their name:

awk '{ if ($2 > 30) {print $1}}' data.txt

Here is a cleaner way of writing it, using indentation:

awk '{
	if ($2 > 30) {
		print $1
	}
}' data.txt

Output:

Bob
Charlie
Emily
Williams
Mark

That is an if statement without an else. Let’s add an else to it.

If a person’s age is over 30, state that they are over 30; otherwise, state that they are under 30:

awk '{ if ($2 > 30) {print $1, "is over 30"} else {print $1, "is under 30"} }' data.txt

Here is a cleaner way of writing it, using indentation:

awk '{
	if ($2 > 30) {
		print $1, "is over 30"
	}
	else {
		print $1, "is under 30"
	}
}' data.txt

Output:

Alice is under 30
Bob is over 30
Charlie is over 30
David is under 30
Emily is over 30
Williams is over 30
Mark is over 30
Luca is under 30

Notice that Luca, who is exactly 30, is reported as under 30. We can chain several conditions with else if to handle that case:

awk '{
	if ($2 > 30) {
		print $1, "is over 30 years old"
	}
	else if ($2 == 30) {
		print $1, "is 30 years old"
	}
	else {
		print $1, "is under 30"
	}
}' data.txt

Output:

Alice is under 30
Bob is over 30 years old
Charlie is over 30 years old
David is under 30
Emily is over 30 years old
Williams is over 30 years old
Mark is over 30 years old
Luca is 30 years old

How to use While Loops

While loops execute a set of statements repeatedly as long as a condition is true.

Here is the basic syntax:

while (condition)
{
   statements;
}

Example

Print the numbers 1 to 10 using a while loop:

awk 'BEGIN { i=1; while (i<=10) { print i; i++; } }'

Here, we initialize a variable i to 1. The condition checks whether i is less than or equal to 10. As long as it is true, the statements inside the loop are executed, and i++ increases i by 1 on each pass so that the loop eventually ends.

Output:

1
2
3
4
5
6
7
8
9
10

This example also introduces the keyword BEGIN. A BEGIN block runs before any input is read, so it is the place for initial processing that doesn't depend on the input. Because this program contains only a BEGIN block, awk does not need an input file at all.
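BEGIN can also be combined with a regular action. As a rough sketch, the snippet below prints a header line once and then prints each input row. The header text "Name Age" and the two sample rows are assumptions for this example:

```shell
# The BEGIN block runs once before the first line; the second block runs per line.
out=$(printf '%s\n' 'Alice 28' 'Bob 35' |
      awk 'BEGIN { print "Name Age" } { print $1, $2 }')
printf '%s\n' "$out"
```

The header appears once at the top, followed by one line per input row.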

How to use For loops

The for loop is one of the most widely used loops across programming languages. In AWK, the for loop syntax is as follows:

for (initialization; condition; increment/decrement)
{
   statements;
}

Example

Print the numbers 1 to 10 using a for loop:

awk 'BEGIN {
	for (i=1; i<=10; i++) {
		print i
	}
}'

Output:

1
2
3
4
5
6
7
8
9
10

You can also use the break and continue statements inside these loops, but I’m not going to dive deep into them here.
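For a first impression of break and continue, here is a minimal sketch built on the same 1-to-10 loop. The choice of skipping even numbers and stopping after 7 is arbitrary and only for illustration:

```shell
out=$(awk 'BEGIN {
    for (i = 1; i <= 10; i++) {
        if (i % 2 == 0) continue   # skip the rest of this pass for even numbers
        if (i > 7) break           # leave the loop once i is greater than 7
        print i
    }
}')
printf '%s\n' "$out"
```

This prints 1, 3, 5, and 7: continue jumps to the next iteration, while break ends the loop altogether.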

Useful AWK options you must know

There are a few options that you should know when using AWK.

-F option: This is probably the option you will use most often. It specifies the field separator for your input. By default, AWK uses whitespace (spaces or tabs) as the delimiter.

If we have comma-separated values like this:

firstname,lastname,age

You can use the following command to specify the comma as the field separator:

awk -F',' '{print $1}' filename

-v option: This option defines a variable on the command line and passes it to your awk script.

For example:

awk -v num=10 'BEGIN {print num}'

We defined a variable num on the command line and used it inside our awk script.

Output:

10
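A common use of -v is handing a shell variable to awk. The sketch below assumes a shell variable named limit and an awk variable named max_age (both invented for this example) and prints the names of people older than that limit, using the same rows as data.txt:

```shell
# Same rows as data.txt, held in a variable so the snippet is self-contained.
data=$(printf '%s\n' 'Alice 28' 'Bob 35' 'Charlie 42' 'David 29' \
                     'Emily 31' 'Williams 46' 'Mark 42' 'Luca 30')

limit=40   # hypothetical threshold kept in a shell variable

# -v copies the shell value into the awk variable max_age before the program runs.
out=$(printf '%s\n' "$data" | awk -v max_age="$limit" '$2 > max_age { print $1 }')
printf '%s\n' "$out"
```

With a limit of 40 this prints Charlie, Williams, and Mark.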

-f option: You can write your awk script in a separate file and execute it with the -f option.

For example, let’s create a file called script.awk containing the for loop we saw earlier:

BEGIN {
	for (i=1; i<=10; i++) {
		print i
	}
}

We can now execute this script using the -f option like this:

awk -f script.awk
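A script file can hold pattern-action rules as well, and it can still read input. As a sketch under assumptions (a temporary file from mktemp stands in for the script file, and two sample rows arrive on standard input):

```shell
# Write a one-rule awk script to a temporary file (a stand-in for a real script file).
script=$(mktemp)
printf '%s\n' '$2 > 30 { print $1 }' > "$script"

# Run the script file against two rows piped in on standard input.
out=$(printf '%s\n' 'Alice 28' 'Bob 35' | awk -f "$script")
rm -f "$script"

printf '%s\n' "$out"
```

Only Bob is printed, because his is the only row whose second field is greater than 30.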

A Few Useful Use Cases of AWK

How to replace text using awk

You can use AWK to replace one string with another using the gsub function.

For example, to replace “Bob” with “Jack” in the contents of data.txt:

awk '{gsub(/Bob/,"Jack"); print}' data.txt

Output:

Alice 28
Jack 35
Charlie 42
David 29
Emily 31
Williams 46
Mark 42
Luca 30
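Note that awk writes the changed text to standard output and leaves data.txt itself untouched. To show how gsub differs from the related sub function, here is a small sketch on a made-up input word (the word banana and the variable names first, all, and n are assumptions for the example):

```shell
# sub replaces only the first match on the line.
first=$(echo 'banana' | awk '{ sub(/a/, "A"); print }')

# gsub replaces every match and returns how many replacements it made.
all=$(echo 'banana' | awk '{ n = gsub(/a/, "A"); print $0, n }')

printf '%s\n' "$first" "$all"
```

The first command prints bAnana, and the second prints bAnAnA followed by the replacement count 3.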

How to filter text based on multiple conditions

For example, we can look for lines in data.txt that contain the name “Bob” and whose age (second column) is greater than 30:

awk '/Bob/ && $2 > 30 {print}' data.txt

Output:

Bob 35
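Conditions can also be combined with || (or). As a sketch, the snippet below prints the names of people who are either younger than 30 or older than 45. The thresholds are arbitrary, and the rows mirror data.txt:

```shell
# Same rows as data.txt, held in a variable so the snippet is self-contained.
data=$(printf '%s\n' 'Alice 28' 'Bob 35' 'Charlie 42' 'David 29' \
                     'Emily 31' 'Williams 46' 'Mark 42' 'Luca 30')

# A line is selected when at least one of the two comparisons is true.
out=$(printf '%s\n' "$data" | awk '$2 < 30 || $2 > 45 { print $1 }')
printf '%s\n' "$out"
```

This prints Alice, David, and Williams.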

How to calculate the sum of a column

We can use an AWK variable to accumulate the values of a specific field across all rows.

For example, to sum the second column in data.txt, we can use:

awk '{sum = sum + $2} END {print sum}' data.txt

The END keyword specifies a block of code that is executed once all the lines of input have been processed. It is the counterpart of the BEGIN keyword that we discussed earlier.

Also, instead of sum = sum + $2, we can use the shorter form sum += $2:

awk '{sum += $2} END {print sum}' data.txt

Output:

283
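The same accumulate-then-report idea works for other summaries. As a sketch, the snippet below tracks the largest value in the second column and the name that goes with it. The variable names max and name are assumptions, and the rows mirror data.txt:

```shell
# Same rows as data.txt, held in a variable so the snippet is self-contained.
data=$(printf '%s\n' 'Alice 28' 'Bob 35' 'Charlie 42' 'David 29' \
                     'Emily 31' 'Williams 46' 'Mark 42' 'Luca 30')

# Take the first row as the starting maximum, then replace it whenever a larger age appears.
out=$(printf '%s\n' "$data" |
      awk 'NR == 1 || $2 > max { max = $2; name = $1 } END { print name, max }')
printf '%s\n' "$out"
```

This prints Williams 46, the oldest person in the sample data.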

How to format data

We can use AWK’s printf statement to specify a format for the output. Let’s say we want to reformat data.txt to use commas instead of whitespace as the field separator:

awk '{printf "%s,%s\n", $1, $2}' data.txt

Explanation:

“%s,%s\n” is our format string. Each %s is a placeholder for a string value: the first is filled with $1 and the second with $2, and \n ends the line. Unlike print, printf does not add a newline on its own. That is how we get the following output:

Alice,28
Bob,35
Charlie,42
David,29
Emily,31
Williams,46
Mark,42
Luca,30
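printf format specifiers can also set column widths. The following sketch left-aligns the name in a 10-character column and right-aligns the age in a 3-character column. The widths and the | divider are arbitrary choices for this example:

```shell
# %-10s pads the name on the right to 10 characters; %3d pads the number on the left to 3.
out=$(printf '%s\n' 'Alice 28' 'Bob 35' |
      awk '{ printf "%-10s|%3d\n", $1, $2 }')
printf '%s\n' "$out"
```

Both output lines have the | divider in the same position, which keeps the columns aligned.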

How to calculate the average of a column

Similar to how we calculated the sum of a column, we can calculate the average of a column by dividing the sum by the number of lines:

awk '{sum+=$2} END {print "Average:", sum/NR}' data.txt

As discussed earlier, NR is a built-in variable that holds the number of the current record. Inside the END block, after all input has been read, it equals the total number of records (rows).

Output:

Average: 35.375
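If the input is empty, NR is 0 and the division fails. One possible guard, sketched below, checks NR first and uses printf to round the result to one decimal place. The rounding precision and the shell variable names are assumptions for this example:

```shell
# Same rows as data.txt, held in a variable so the snippet is self-contained.
data=$(printf '%s\n' 'Alice 28' 'Bob 35' 'Charlie 42' 'David 29' \
                     'Emily 31' 'Williams 46' 'Mark 42' 'Luca 30')

# Only divide when at least one record was read.
program='{ sum += $2 } END { if (NR > 0) printf "Average: %.1f\n", sum / NR }'

avg=$(printf '%s\n' "$data" | awk "$program")
empty=$(printf '' | awk "$program")   # no input at all: prints nothing

printf '%s\n' "$avg"
```

With the sample data this prints Average: 35.4, and with empty input it prints nothing instead of raising a division error.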

Conclusion

AWK is a programming language for filtering, searching, manipulating, and processing text files or standard input. As an IT engineer, you can use it to filter log files, format text, and manipulate large amounts of text data. In this blog post, we covered what you need to start using AWK, from its syntax to common use cases.
