03. Records and Fields RS, RT, ORS, FS, OFS, $n

The backbone of AWK's programming model consists of two pieces: 1) records & fields, along with 2) patterns & actions. Let's look at the first core component here, then move on to patterns & actions in the next lesson.

What are records and fields?

AWK views each input stream as a collection of records. Records can be thought of as individual lines, which are then divided into fields (the individual data cells). Take a look at the figure below, which displays the grades.txt file.

[Figure: Records and fields in AWK. Our example grades.txt file. Each row is a record, and each data cell is a field.]

In the above example, the records are separated by a newline character, and the fields are delimited by whitespace. But perhaps we are working with CSV (comma-separated values) files, which use commas to separate fields. Or maybe we have just one long line of data, with each record separated by a semicolon (;). How do we let gawk know this?

Record separators (RS & RT)

To specify the character that separates records, we use the built-in RS variable. In the original AWK implementation, RS had to be either a single literal character, such as the newline, or an empty string (in which case blank lines separate the records). In other implementations such as gawk, RS may also be a regular expression.

When RS is a regular expression, RS itself still holds the regex, while the gawk-specific RT variable holds the input text that actually matched it at the end of the current record.

$ echo firstRecord 111111 secondRecord 222222 thirdRecord 333333 lastRecord |
>    gawk 'BEGIN { RS = "([[:digit:]]+)" }
>                { print "RS = " RS " and RT = " RT }'
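RS = ([[:digit:]]+) and RT = 111111
RS = ([[:digit:]]+) and RT = 222222
RS = ([[:digit:]]+) and RT = 333333
RS = ([[:digit:]]+) and RT = 

This code snippet sets the RS variable to a regex matching one or more digits. Notice how RS always displays the regex itself, while RT displays the text that matched it for the current record. The final record (lastRecord) is ended by the end of the input rather than by a match, so its RT is empty.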
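
When the separator is a single character, as in the semicolon-delimited case raised earlier, no regex is needed. The sketch below is only an illustration: it assumes plain awk (a single-character RS is portable, and gawk behaves the same way), uses made-up input, and relies on NR, AWK's built-in record counter, which this lesson has not introduced.

```shell
# printf, unlike echo, adds no trailing newline, so the last record is exactly "c".
# RS = ";" makes every semicolon end a record; NR numbers the records as they are read.
records=$(printf 'a;b;c' | awk 'BEGIN { RS = ";" } { print NR ": " $0 }')
echo "$records"
# 1: a
# 2: b
# 3: c
```

If echo were used instead of printf, the last record would carry the trailing newline that echo appends.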

Output Record Separator (ORS)

The Output Record Separator (ORS) specifies what is written after each record is printed. The default is a newline character.

In this example, we read each record and print it (the current record is denoted by $0), followed by a space and a plus (+) symbol.

$ echo 'hello; nihao; hola; anyonghasaeyo' | 
>    gawk 'BEGIN { RS = ";"; ORS = " +"}
>                { print $0 }'
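hello + nihao + hola + anyonghasaeyo
 +

The last record keeps the newline that echo appends, and ORS is written after it as well, which is why a final " +" ends up on a line of its own.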
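
ORS does not have to be a single character. As a small sketch (plain awk and made-up input are assumed), setting it to two newlines double-spaces the output:

```shell
# Each print writes the record followed by ORS, so every line gets a blank line after it.
doubled=$(printf 'one\ntwo\n' | awk 'BEGIN { ORS = "\n\n" } { print $0 }')
echo "$doubled"
# one
#
# two
```

The shell's $( ) strips the trailing newlines, so the blank line after the final record is not visible in this capture.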

Field separators (FS)

The FS variable defines how fields are separated. Its default value is a single space, which AWK treats specially: fields are split on runs of one or more whitespace characters, and leading and trailing whitespace on the line is ignored. Thus, the following two lines look the same to AWK.

Joe John Johanna
Joe   John   Johanna

To split on a literal single space instead, enclose the space in brackets: FS = "[ ]".

The field separator may be set with the -F option on the command line, or by assigning FS in the BEGIN block.

$ echo 'Joe John Johanna' | 
>    gawk -F' ' '{ print NF ":" $0 }'
3:Joe John Johanna 
# Same command as above but using the BEGIN block
$ echo 'Joe John Johanna' | 
>    gawk 'BEGIN { FS=" " } 
>                { print NF ":" $0 }'
3:Joe John Johanna
# Changing the FS character
$ echo '   Joe   John   Johanna   ' | 
>    gawk -F'[ ]' '{ print NF ":" $0 }'
13:   Joe   John   Johanna                     

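Here we can see that the -F option sets the FS variable straight from the command line. With FS set to a literal single space, every space separates two fields, so the twelve spaces in the last example produce thirteen fields, most of them empty. We'll formally learn how to use AWK via the command line in a future lesson.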
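
The CSV case mentioned at the start of this lesson works the same way. The following is a minimal sketch with made-up sample data, assuming plain awk and simple CSV with no quoted fields or embedded commas (a bare comma separator does not handle those):

```shell
# -F',' sets FS to a comma, so $1 is the name and $2 is the first score.
scores=$(printf 'Joe,85,92\nJohn,78,88\n' | awk -F',' '{ print $1 " scored " $2 }')
echo "$scores"
# Joe scored 85
# John scored 78
```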

Output Field Separator (OFS)

The Output Field Separator, or OFS, stores the string that is placed between fields on output, for example between the comma-separated arguments of print. By default, it is a single space.

$ echo 'John Mary; Jacob Teresa; Bob Claire' |
>    gawk 'BEGIN { OFS=" loves "; RS=";" }
>                { print $1, $2 }'
John loves Mary
Jacob loves Teresa
Bob loves Claire                    
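
OFS only appears where AWK itself joins fields. The sketch below (plain awk assumed, variable names are just for illustration) contrasts three cases: printing $0 untouched, printing fields without a comma, and forcing the record to be rebuilt by assigning a field to itself.

```shell
# $0 is printed exactly as it was read, so OFS is not used.
untouched=$(echo 'Joe John Johanna' | awk 'BEGIN { OFS = "," } { print $0 }')
# Without a comma, print concatenates its arguments; OFS is not used either.
glued=$(echo 'Joe John Johanna' | awk 'BEGIN { OFS = "," } { print $1 $2 }')
# Assigning to any field makes AWK rebuild $0, joining the fields with OFS.
rebuilt=$(echo 'Joe John Johanna' | awk 'BEGIN { OFS = "," } { $1 = $1; print $0 }')
echo "$untouched"   # Joe John Johanna
echo "$glued"       # JoeJohn
echo "$rebuilt"     # Joe,John,Johanna
```

The $1 = $1 idiom is a common way to convert a whole record from one separator to another.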

Field accession ($n)

You may have noticed the use of the $0 variable in the earlier examples. This variable stores the entire current record. To access an individual field, we simply use a $ followed by the field number (e.g. $1 for the first field, $2 for the second, and so on).

$ echo 'uno dos tres' |
>    gawk -F' ' '{ print "The second field is: " $2;
>                  print "The entire record is: " $0 }'
The second field is: dos
The entire record is: uno dos tres

Note that field numbers start at 1 and not 0, unlike indexes in most programming languages, because $0 is reserved for the entire record.

Field to integer conversion

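The expression after the $ is evaluated as a number and converted to an integer. Thus, $(2*2), $(8/2), $"4.41" and $4 all refer to the fourth field. Note that negative values have no meaning; gawk stops with a fatal error if a program references a negative field number.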
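
Because the field number is just an expression, it can also come from a variable. A short sketch, assuming plain awk and made-up input; the variable n is hypothetical, and NF (the field count printed in the FS examples) is used to reach the last field:

```shell
# Every expression here evaluates to 4, so each one selects the fourth field.
fourth=$(echo 'a b c d e' | awk '{ n = 4; print $4, $(2*2), $(8/2), $n }')
# NF holds the number of fields, so $NF is the last field of the record.
last=$(echo 'a b c d e' | awk '{ print $NF }')
echo "$fourth"   # d d d d
echo "$last"     # e
```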
