Unix commands are great for manipulating data and files, and they get even better when combined in shell pipelines. The following are a few of my go-tos – I’ll list each command with an example or two. While many of these commands can be used standalone, the examples assume the input is piped in, because that’s how you’d use them in a pipeline. Lastly, most of these commands are pretty simple, and that is by design – the Unix philosophy focuses on simple, modular programs that can be composed to perform more complex operations.
Note: if you’re using a Mac, the built-in tools shipped with macOS might behave a little differently from the most recent GNU versions. You can get more recent builds of these tools by running brew install coreutils. The typical usage is then ghead instead of head, gtail instead of tail, gpaste instead of paste, etc.
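For example, assuming a file somefile.txt, the GNU version of head is invoked as:
cat somefile.txt | ghead -n 5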
Here are the commands, each with a one-sentence description:
Please forgive the “useless use of cat”. I’m using cat to show the pipe-able versions of the piped-to commands.
Print the first 5 lines:
cat somefile.txt | head -n 5
Print the last 5 lines:
cat somefile.json | tail -n 5
Print all but the first line:
cat somefile.csv | tail -n +2
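The +2 means “start at line 2”. For example, with a hypothetical three-line input:
$ printf 'header\nrow1\nrow2\n' | tail -n +2
row1
row2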
Send all lines of output as arguments to echo:
cat somefile.xml | xargs echo
Send each line individually as an argument to echo:
cat somefile.yaml | xargs -n 1 echo
Send two arguments at a time to echo:
cat somefile | xargs -n 2 echo
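To make the difference between these xargs forms concrete, here’s a hypothetical three-line input:
$ printf 'a\nb\nc\n' | xargs echo
a b c
$ printf 'a\nb\nc\n' | xargs -n 1 echo
a
b
c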
Print only column two of each line (assumes whitespace between columns):
cat somefile | awk '{print $2}'
Print only column two of each line, using , as the separator:
cat somefile | awk -F',' '{print $2}'
or
cat somefile | cut -d',' -f 2
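Both forms produce the same result on simple comma-separated input, for example:
$ echo 'a,b,c' | awk -F',' '{print $2}'
b
$ echo 'a,b,c' | cut -d',' -f 2
b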
Replace the first occurrence of 1, 2, or 3 on each line with the letter ‘x’, using a regex:
cat somefile.txt | sed 's/[1-3]/x/'
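Note that s/[1-3]/x/ only replaces the first match on each line; add the g flag to replace every match:
cat somefile.txt | sed 's/[1-3]/x/g'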
Count the number of times each line appears in the file:
cat somefile.txt | sort | uniq -c
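A common follow-up is a reverse numeric sort, so the most frequent lines appear first:
cat somefile.txt | sort | uniq -c | sort -rn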
Write out an intermediate result to a file in the middle of a pipeline using tee:
cat somefile.txt | grep "mystring" | tee newfile.txt | wc -l
The above are a bunch of “tools” to file away for when you need them. I like to store them along with a few keywords or a sentence describing what each does for easy searching and recall.
Now, let’s consider the following example file myfile.txt:
uuid,name
72e925a8-58fb-11e9-87f4-fbe50933ad95,name1
237fd8a4-58fb-11e9-8bb7-bf5388556288,name2
91834624-58fb-11e9-8c01-2393beecfc80,name3
223c7438-58fc-11e9-a629-7707a76a95c5,name4
a0362ab0-58fb-11e9-8aed-1f7ca90ae318,name5
f3f153aa-58fb-11e9-a37c-3ffed25b518b,name6
09f6cdb0-58fc-11e9-9663-93ede25c7774,name7
21f4c2ec-58fc-11e9-baf6-7792114ee968,name8
We want to create files in the current folder using the names in column two, where the uuids in column one start with a “2”.
Break down the transformation into parts:
- remove the first line (the header)
- filter for lines starting with 2
- grab the second column using a , separator
- use the result to create the files with touch
tail -n +2 myfile.txt | grep "^2" | awk -F',' '{print $2}' | xargs touch
Confirm it worked:
$ ls name*
name2 name4 name8
As you add new tools to your toolbox, you can plug them into your shell one-liners to manipulate data streams.
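For instance, reusing myfile.txt from above, the same building blocks can count how many uuids start with each digit (cut -c1 grabs the first character of each line; the exact padding of the counts from uniq -c varies by platform):
$ tail -n +2 myfile.txt | cut -c1 | sort | uniq -c
   1 0
   3 2
   1 7
   1 9
   1 a
   1 f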
Some other useful commands for text processing worth looking into: