03. Records and Fields RS, RT, ORS, FS, OFS, $n

The backbone of AWK's programming model consists of two pieces: 1) records and fields, and 2) patterns and actions. Let's look at the first core component here, then move on to patterns and actions in the next lesson.

What are records and fields?

AWK views each input stream as a sequence of records. By default, each record is one line, and each record is divided into fields (the individual data cells). Take a look at the figure below, which displays the grades.txt file.

[Figure: records and fields in AWK, using our example grades.txt file. Each row is a record, and each data cell is a field.]

In the above example, the records are separated by a newline character, and the fields are delimited by whitespace. But perhaps we are working with CSV (comma-separated values) files, which use commas to separate fields. Or maybe we have just one long line of data, with each record separated by a semicolon (;). How do we tell gawk about this?

Record separators (RS & RT)

To specify what separates records, we use the built-in RS variable. In the original AWK implementation, RS had to be a single literal character, such as the newline, or an empty string. In other implementations, such as gawk, RS may be a regular expression.

When RS is a regular expression, RS holds the regex itself, while the gawk variable RT holds the input text that actually matched it at the end of the current record.

$ echo firstRecord 111111 secondRecord 222222 thirdRecord 333333 lastRecord |
>    gawk 'BEGIN { RS = "([[:digit:]]+)" }
>                { print "RS = " RS " and RT = " RT }'
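RS = ([[:digit:]]+) and RT = 111111
RS = ([[:digit:]]+) and RT = 222222
RS = ([[:digit:]]+) and RT = 333333
RS = ([[:digit:]]+) and RT = 

This code snippet sets RS to a regex that matches one or more digits. Notice how RS displays the literal regex, while RT displays the text that matched it. The final record (lastRecord) is ended by the end of the input rather than by a match, so RT is empty on the last line.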
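The empty-string case mentioned above deserves a quick illustration. With RS set to "", AWK switches to what is usually called paragraph mode, where blank lines separate records and a newline also separates fields. The following is a minimal sketch with made-up sample data; it uses plain awk and should behave the same under gawk.

```shell
# Paragraph mode: an empty RS makes blank lines the record separator.
# The names and scores are hypothetical sample data.
printf 'Joe\nMath 90\n\nJohn\nMath 75\n' |
    awk 'BEGIN { RS = "" }
               { print NR ": " $1 " has " NF " fields" }'
# 1: Joe has 3 fields
# 2: John has 3 fields
```

Each two-line block is read as a single record, so NR counts blocks rather than lines.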

Output Record Separator (ORS)

The Output Record Separator (ORS) specifies what is printed after each record. The default is a newline character.

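In this example, we read each record and print it (the current record is denoted by $0), followed by a space and a plus (+) symbol.

$ echo 'hello; nihao; hola; anyonghasaeyo' |
>    gawk 'BEGIN { RS = ";"; ORS = " +" }
>                { print $0 }'
hello + nihao + hola + anyonghasaeyo
 +

The last record still contains the newline that echo appends, and print writes ORS after every record, including the last one. That is why a lone " +" shows up on a second line.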
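If the separator should only appear between records, one option is to skip ORS and write the separator yourself with printf. This is a minimal sketch of that idea, not the only way to do it; it reuses the sample input from above.

```shell
# Print " +" before every record except the first, so nothing trails the last one.
echo 'hello; nihao; hola; anyonghasaeyo' |
    awk 'BEGIN { RS = ";" }
               { sub(/\n$/, "")      # drop the newline that echo adds to the last record
                 printf "%s%s", (NR > 1 ? " +" : ""), $0 }
         END   { print "" }'
# hello + nihao + hola + anyonghasaeyo
```

The END block prints a single newline so the shell prompt starts on a fresh line.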

Field separators (FS)

Fields are separated according to the FS variable. The default value is a single space, which AWK treats specially: fields are separated by runs of one or more whitespace characters, and leading and trailing whitespace on the line is ignored. Thus, the following two lines look the same to AWK.

Joe John Johanna
Joe   John   Johanna

To split on exactly one literal space, enclose the space in brackets: FS = "[ ]"

The field separator may be set with the -F option on the command line, or by assigning FS in the BEGIN block.

$ echo 'Joe John Johanna' |
>    gawk -F' ' '{ print NF ":" $0 }'
3:Joe John Johanna
# Same command as above but using the BEGIN block
$ echo 'Joe John Johanna' |
>    gawk 'BEGIN { FS=" " }
>                { print NF ":" $0 }'
3:Joe John Johanna
# Changing FS to a literal single space
$ echo '   Joe   John   Johanna   ' |
>    gawk -F'[ ]' '{ print NF ":" $0 }'
13:   Joe   John   Johanna   

Here we can see that the -F option sets the FS variable straight from the command line. In the last command, each of the 12 spaces is its own separator, which yields 13 fields, most of them empty. We'll formally learn how to use AWK via the command line in a future lesson.
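The CSV case raised at the start of this lesson works the same way. Below is a minimal sketch with hypothetical rows (a name followed by two scores); it only handles simple CSV, since a comma inside a quoted value would also be treated as a separator.

```shell
# Split each line on commas; the rows are made-up sample data.
printf 'Joe,85,90\nJohn,70,65\n' |
    awk -F',' '{ print $1 " has " NF " fields; second score: " $3 }'
# Joe has 3 fields; second score: 90
# John has 3 fields; second score: 65
```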

Output Field Separator (OFS)

The Output Field Separator, or OFS, stores the string that is placed between fields on output, for example between the comma-separated arguments of print. By default, it is a single space.

$ echo 'John Mary; Jacob Teresa; Bob Claire' |
>    gawk 'BEGIN { OFS=" loves "; RS=";" }
>                { print $1, $2 }'
John loves Mary
Jacob loves Teresa
Bob loves Claire                    
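
One detail worth a quick sketch: OFS is used when print receives comma-separated arguments, and also when AWK rebuilds $0 after a field has been assigned. Printing an untouched $0 does not use OFS at all. The $1 = $1 assignment below is a common idiom for forcing that rebuild; it is shown here as an illustration, not as something this lesson relies on.

```shell
# OFS takes effect on $0 only after the record has been rebuilt.
echo 'Joe John Johanna' |
    awk 'BEGIN { OFS = "-" }
               { print $0          # record is unchanged, so OFS is not used
                 $1 = $1           # assigning to any field rebuilds $0 with OFS
                 print $0 }'
# Joe John Johanna
# Joe-John-Johanna
```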

Field access ($n)

You may have noticed the $0 variable in earlier examples. This variable stores the entire current record. To access an individual field, use a $ followed by the field number (e.g. $1 for the first field, $2 for the second, and so on).

$ echo 'uno dos tres' | gawk -F' ' '{ print "The second field is: " $2; print "The entire record is: " $0 }'
The second field is: dos
The entire record is: uno dos tres

Note that field numbers start at 1 and not 0, unlike indexing in most programming languages. $0 is reserved for the whole record.
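To see the one-based numbering in action, here is a small sketch that walks over every field of a record. It relies on NF, the number of fields in the current record, which appeared in the FS examples earlier.

```shell
# Loop from field 1 up to NF and print each field with its number.
echo 'uno dos tres' |
    awk '{ for (i = 1; i <= NF; i++) print i ": " $i }'
# 1: uno
# 2: dos
# 3: tres
```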

Field to integer conversion

A field number can be any expression. Its value is converted to a number and truncated to an integer. Thus, $(2*2), $(8/2), $"4.41" and $4 all refer to the fourth field. Note that negative field numbers have no meaning, and gawk reports an error if one is used.
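The four spellings above can be checked side by side. This is a minimal sketch with a made-up five-field record; the expected output is what gawk prints, and other AWK implementations are assumed to truncate the same way.

```shell
# Every expression below evaluates to field number 4.
echo 'a b c d e' |
    awk '{ print $(2*2), $(8/2), $"4.41", $4 }'
# d d d d
```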
