Why AWK and CSV beat XLS
I’m betting that the vast majority of people out there use spreadsheets to crunch numbers. Before I learned to code, I used them too. Now I do most of my number crunching with Awk and other programming languages. Why? One word: Formulas. Spreadsheets basically force people to write tiny one-liners all day long. Have you ever tried to write an entire shell script all on one line? With loops, conditionals, variables, etc? It’s utter madness! And depending on what program you’re using, you might not be able to create new columns or rename sheets without breaking formulas. And don’t even get me started on changing fonts and colors!
Columns of plaintext are so much easier work with. Awk is incredibly good at working with such files, and there’s even a universal data format called CSV (comma separated values). Of course Awk can be told to use just about any character as a field separator, so you’re not limited to using the comma. For the purposes of this article, I chose the pipe symbol (|).
What is Awk
Awk is a programming language disquised as a command line utility on a good chunk of computers running Unix-like operating systems. If you’re currently running a modern Linux distribution or a varient of BSD, you probably already have some version of it installed. The GNU version, gawk, is common on Linux. But there’s also nawk, mawk, BusyBox Awk, and a few other versions scattered across the Internet. The important to thing to remember is that they all share a common syntax. They are also designed to work well with other command line tools.
Getting started
Here is a simple dataset with information about the competitors in a combat robotics league. If you don’t know what combat robotics is, it’s not important. If you want to follow along, you should probably save this block of text as robots.csv
name|builder|weight|weapon type|wins|losses
Chomper|Bill Example|190|vertical saw|25|10
Go To Space|Kyle Spacer|200|lifter|45|8
Mauler|Sherry Doe|185|drum spinner|18|11
Let It Rip|Joe Riker|190|shell spinner|68|17
Spicy Taco|Kevin Tux|200|flame thrower|88|12
Just Fork It|Billy Joe|180|lifter|58|13
Ares|Bob Maloc|185|beater bar|98|9
Thumper|Mark Awk|185|spiked hammer|61|25
This data is arranged into rows and columns, and the first row contains the names of those columns. Look familiar? It’s basically a spreadsheet in it rawest form, a plaintext table! Here’s the same table with extra spaces to line everything up.
name |builder |weight|weapon type |wins |losses
Chomper |Bill Example |190 |vertical saw |25 |10
Go To Space |Kyle Spacer |200 |lifter |45 |8
Mauler |Sherry Doe |185 |drum spinner |18 |11
Let It Rip |Joe Riker |190 |shell spinner |68 |17
Spicy Taco |Kevin Tux |200 |flame thrower |88 |12
Just Fork It |Billy Joe |180 |lifter |58 |13
Ares |Bob Maloc |185 |beater bar |98 |88
Thumper |Mark Awk |185 |spiked hammer |61 |25
Answering questions
When you really think about it, what is the purpose of formulas in a spreadsheet? To answer questions! Let’s start by answering the question “Which bot has the most wins?” Of course, we can clearly see that the answer is Ares just by looking at the data above. But for the sake of argument, let’s say there are 300 robots in this league. Keeping the answer to this question up to date manually is going to be a real pain, so let’s just write a short Awk program to do it for us.
Awk works a bit differently than most programming languages. Some would call Awk a data driven language because it automatically loops over the lines of a file looking for matching patterns or true conditions (the what). When a matching pattern or true condition is found, it feeds the current line into a block of code to do something (the action). Blocks look like this: <what> { <action> }.
There are 3 exceptions to this looping rule. The first is a BEGIN block, which as the name implies, gets executed before Awk starts looping. In this block, we can tell it what character to use for a field separator by setting FS, one of Awk’s magic global variables. We can also declare our own variables.
# file: most_wins.awk
BEGIN {
# Set the field separator
FS = "|"
wins = 0
}
With that done, we have to tell Awk what we want to do with each line of the data file.
# file: most_wins.awk
# ...
NR != 1 {
# Field 5 holds the number of wins
if ($5 > wins) {
wins = $5
record = $0
}
}
In the code above, another magic variable is introduced: NR. For the purposes of this example, it’s the line number. The expression NR != 1 basically tells Awk to skip the first line. Why? Because the first line just holds the column names. It’s not actually a row of data we want to process.
Another thing you’ll notice is that Awk starts counting lines and fields from 1. We can get the individual fields by using the $<field number> syntax. So in the code above, Awks runs the block for every line except the first line.
Inside the block, Awk checks if field 5 is greater than the current value of wins. If it is, it updates the value of wins and sets a new variable called record. In Awk, nearly all variables are global, and variables do not have to be initialized before first use. The reason why wins gets the initial value of 0 is because we want Awk to treat it as a number for the purpose of comparison.
The block above also introduces the special variable $0, which refers to the entire line of text. So on each line of the file, Awk basically executes these instructions:
- If line number NOT EQUAL 1, continue. Else, skip.
- If field 5 is greater than
wins, continue. Else, skip. - Set
winsto the value of field 5. - Set
recordto the value of CURRENT LINE.
But what happens when it gets done looping over the file? Well, nothing. To get the final answer to our question, we need to tell Awk to print it out. We can do this using the second exception to Awk’s looping rule: the END block.
# file: most_wins.awk
# ...
END {
print record
}
As the name implies, the END block is the last chunk of code Awk runs before it exits.
Putting it together
Here’s all the code we have so far.
# file: most_wins.awk
BEGIN {
# Set the field separator
FS = "|"
wins = 0
}
NR != 1 {
# Field 5 holds the number of wins
if ($5 > wins) {
wins = $5
record = $0
}
}
END {
print record
}
But how do we run it? Awk can take a whole program as a command line argument, but we can just use the -f flag to tell it to run a file on our data file. Here’s what the output should look like if you’re following along.
$ awk -f most_wins.awk robots.csv
MOST WINS:
Ares|Bob Maloc|185|beater bar|98|9
Answering multiple questions
The example in the previous section only answers a single question, but spreadsheets are typically set up to answer multiple questions. We can easily combine answers into a single report by using Awk’s third exception to the looping rule: functions! Yes, Awk supports writing functions. So let’s put the main chunk of code from the above program into it’s own function.
# report1.awk
BEGIN {
FS = "|"
wins = 0
}
function updateWins() {
# Column 5 holds the number of wins
if ($5 > wins) {
wins = $5
record = $0
}
}
NR != 1 {
updateWins()
}
END {
print "MOST WINS:"
print record
}
Remember what I said about variables? They’re global! There is one exception to this rule but we’ll get to that later. For now, just know that functions can manipulate global variables. If you’re following along, try running the new code. You should get the same output as before.
With functions in play, we can answer all sorts of questions. Let’s add another function to the script and rename some variables to improve readability.
# report2.awk
BEGIN {
FS = "|"
wins = 0
losses = 0
biggestWinner = ""
biggestLoser = ""
}
function updateBiggestWinner() {
# Column 5 holds the number of wins
if ($5 > wins) {
wins = $5
biggestWinner = $0
}
}
function updateBiggestLoser() {
# Column 5 holds the number of wins
if ($5 > losses) {
losses = $5
biggestLoser = $0
}
}
NR != 1 {
updateBiggestWinner()
updateBiggestLoser()
}
END {
print "MOST WINS:"
print biggestWinner
print ""
print "---"
print ""
print "MOST LOSSES:"
print biggestLoser
}
Here’s the output:
MOST WINS:
Ares|Bob Maloc|185|beater bar|98|88
---
MOST LOSSES:
Ares|Bob Maloc|185|beater bar|98|88
So the bot with the most wins also has the most losses. From here, we could continue adding more functions and variables to get the heaviest and lightest bots, or count the number of bots with a particular weapon. I don’t know about you, but I want to know which bot has the highest win rate. This is easy to calculate. For each bot, add the wins and losses together to find the total number of fights. Then divide the number of wins by the number of fights like this:
function getWinRate() {
totalFights = $5 + $6
return $5 / totalFights
}
It seems simple enough, but there’s a problem with this function. What if we don’t want totalFights to be a global variable? We can declare it as a function parameter. That way, it will only exist between the start and end of the function. Let’s add some more code to the script.
# report3.awk
BEGIN {
FS = "|"
wins = 0
losses = 0
winRate = 0
biggestWinner = ""
biggestLoser = ""
highestWinRate = ""
}
function getHighestWinRate( totalFights, currentRate) {
totalFights = $5 + $6
currentRate = $5 / totalFights
if (currentRate > winRate) {
winRate = currentRate
highestWinRate = $0
}
}
function updateBiggestWinner() {
# Column 5 holds the number of wins
if ($5 > wins) {
wins = $5
biggestWinner = $0
}
}
function updateBiggestLoser() {
# Column 5 holds the number of wins
if ($5 > losses) {
losses = $5
biggestLoser = $0
}
}
NR != 1 {
updateBiggestWinner()
updateBiggestLoser()
getHighestWinRate()
}
END {
print "MOST WINS:"
print biggestWinner
print ""
print "---"
print ""
print "MOST LOSSES:"
print biggestLoser
print ""
print "---"
print ""
print "HIGHEST WIN RATE: (" winRate "%)"
print highestWinRate
}
When you run that, you should get the following output:
MOST WINS:
Ares|Bob Maloc|185|beater bar|98|88
---
MOST LOSSES:
Ares|Bob Maloc|185|beater bar|98|88
---
HIGHEST WIN RATE: (0.88%)
Spicy Taco|Kevin Tux|200|flame thrower|88|12
As we can seek, even though Ares had the highest wins overall, Spicy Taco won about 88% of its fights. But let’s pause and take a look at the function getHighestWinRate().
function getHighestWinRate( totalFights, currentRate) {
totalFights = $5 + $6
currentRate = $5 / totalFights
if (currentRate > winRate) {
winRate = currentRate
highestWinRate = $0
}
}
What’s going on with the function parameters? In the code above, the extra whitespace (usually 2 or 4 spaces) tells you that totalFights and currentRate are meant to be variables within the function instead of actual parameters. This works because Awk lacks arity checking. That means it doesn’t check if the number of arguments a function is called with matches the number of parameters. In other programming languages, this would cause an error, but Awk is different.
Scratching the surface
This article only covers the basics of Awk. It can actually do quite a lot more when combined with other tools, so I encourage you to experiment. Try sorting tables by an arbitrary column, generating charts with Gnuplot, or converting tables into SQL databases.