Python and R have emerged as the languages of choice for Data Science. This is not surprising, given their ability to rapidly prototype analyses, their support for tools like Jupyter Notebook, and the help afforded by a massive community that provides cheat sheets, resources, reliable packages, and an array of Linux commands. These languages serve as microscopes into data: once you know where to look, you can look as closely as you want. The microscope approach works well when you know what you are dealing with. However, most exploratory work a Data Scientist does starts with getting a "feel" of the data, and that preliminary task is often better suited to a magnifying glass, which offers a quick, closer look at the data without the overhead that comes with analysis under a microscope. In the context of computing, Bash serves as that magnifying glass: a small set of simple, composable commands that require almost no setup.
A reasonable command of Bash scripting allows one to prepare the ground for a more detailed analysis. Here is a walk-through of some useful Bash commands.
Kaggle's introductory Data Science/ML challenge is built on data from the Titanic. We are going to use that data set to demonstrate most of the commands here. You can find the data set and the tutorial here: https://www.kaggle.com/c/titanic/data. The data set contains three files; for now, we can ignore the one named "gender_submission.csv".
Often, faced with a new data set, the first thing one has to do is see what the data looks like. The tutorial at Kaggle recommends using Pandas to load the CSV file into a Jupyter Notebook and then calling the head() function on the training data. Issuing the head Bash command achieves the exact same effect:

head train.csv

To control how many lines are shown, pass a count:

head -n 2 train.csv

Lastly, to look at the last few lines, use tail in the exact same way:

tail -n 10 train.csv

In both cases, the -n flag takes the number of lines to be displayed.
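For orientation, the first two lines of train.csv should look like this (shown here because the second line matters later: the Name field itself contains a comma):

head -n 2 train.csv
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S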
Now that we know what to expect from the data, we may want to glance through the whole data set, perhaps to see if there are anomalous values in certain columns. Using more train.csv, the whole file is printed to the shell, and you can scroll down by pressing space or return. The more command goes through the data in only one direction: you can't scroll up. At the end of the file, more relinquishes control back to the shell. If you want to move both up and down the file, use less train.csv. Inside less, you can scroll up and down using the arrow keys. You can search the text by typing "/<search string>" and hitting enter; press "n" to jump to the next match. Try searching for ",,". This highlights every place where two commas appear together, so we now know this CSV contains some blank fields. When you're done, press "q" to exit.
The wc Bash command counts things: wc -c train.csv shows the number of characters in the file, and, more usefully, wc -l train.csv shows the number of lines. The latter reports 892, and once we discount the header line, we know the training data has 891 records!
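A more direct way to count only the records is to skip the header before counting; here, tail -n +2 prints everything from the second line onward (a flag we will meet again below):

tail -n +2 train.csv | wc -l    # 891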
We now know that the data has some blank fields, so let us find all of them to get a sense of the scale of the issue. Here, the grep Bash command is useful:

grep ",," train.csv

This prints every line that matches the given pattern. But that isn't too useful in its own right.
Now, say:

grep ",," train.csv | wc -l

The vertical bar character is called a "pipe". It pipes the STDOUT of the command on its left into the STDIN of the command on its right, so the output here is the number of lines with at least one empty field. But this, too, is only somewhat useful. Let us go one step further and ask how many rows have a missing age. A head -n 1 train.csv shows that Age is the 6th column, and since this is a CSV file, the fields in each row are separated by commas. So let us use the cut shell command to "cut" the contents of the file on a delimiter:

cut -d"," -f6 train.csv | more
But there seems to be an issue: while the first entry says "Age", the following lines all contain the sex of the passengers. A head -n 3 train.csv reveals why: the name of each passenger is written as "Last Name, First Name", so there is an extra comma to account for in every data row (though not in the header, whose 7th field is SibSp). Now type:

cut -d"," -f7 train.csv | grep "^$" | wc -l

This prints all the ages, "greps" the blank ones using the regex anchors for the beginning and end of a line, and then counts the matching lines. We now know there are 177 records with the Age missing.
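As a cross-check, the blank and non-blank counts should add up to the 891 records (a sketch; grep -v inverts the match):

cut -d"," -f7 train.csv | tail -n +2 | grep -v "^$" | wc -l    # 714 records with an age
cut -d"," -f7 train.csv | tail -n +2 | grep "^$" | wc -l       # 177 records without one

And 714 + 177 = 891, so every record is accounted for.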
Now that we have a sense of where the blanks are, we may want to replace such occurrences so that we can do some sort of processing. Let us say we decided to change all the blank Age values to 30. We can use the sed Bash command with a substitution expression:

sed -i -e "s/male,,/male,30,/" train.csv

We anchor on the string "male" because another field (Cabin) can also be blank, whereas the Sex field sits immediately before Age; and since "male" is also the tail end of "female", this single rule covers both sexes.
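If you are nervous about editing the file in place, sed can keep a backup of the original first; the .bak suffix below is an arbitrary choice:

sed -i.bak -e "s/male,,/male,30,/" train.csv    # original preserved as train.csv.bak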
So let's see what the distribution of Age looks like. Enter the following Bash command:

cut -d, -f7 train.csv | tail -n +2 | sort -n > age

Here we print all the Age values, exclude the first line, which is the header (that is what the plus sign on tail does), pipe the values into the sort command, ask it to sort numerically (using -n), and write the result to a file called age. Now use the shell command xmgrace age. [Note: this requires you to install the Grace package.] Below is a plot showing the value of age in each line:
Since the file is sorted, the plot rises monotonically from the youngest passenger to the oldest, and flat stretches mark ages shared by many passengers, including the long plateau at 30 created by our substitution above.
Now try going to Data -> Transformation -> Histograms. Set the min, max, and number of bins, and you'll get a histogram. Use the magnification button to zoom into the histogram.
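If Grace is not available, a quick numeric summary of the same age file gives a rough feel for the distribution; here is a sketch using awk, which we meet properly in the next section (the file is sorted, so the first line is the minimum and the last is the maximum):

awk 'NR==1{min=$1} {sum+=$1; max=$1} END{print "min:", min, "max:", max, "mean:", sum/NR}' age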
The cut Bash command is great at what it does, but sometimes we need more power. Let's try the awk Bash command. Awk is a programming language in itself, but we leave it to the reader to explore it in depth. For now, try the following:

awk -F"," '{if($2==1&&$3==3){print $3}}' train.csv | wc -l

Here we tell awk to use the delimiter "," and print the Pclass value whenever Survived is 1 and Pclass is 3; we then count the number of such records. What we get is the number of people who belonged to class 3 and survived: 119. Now try changing the if condition to $2==0; we get 372. Repeat this with the other combinations (the sketch below automates this). We quickly get a hint of the relation between passenger class and the chance of survival, with class 1 passengers more likely to survive. When we build a prediction model, we should be sure to use the Pclass column.
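To tabulate every combination without retyping the condition each time, wrap the same filter in a small loop (a sketch; field 2 is Survived and field 3 is Pclass, as before):

for s in 0 1; do
  for c in 1 2 3; do
    # awk prints matching lines by default; wc -l counts them
    n=$(awk -F"," -v s="$s" -v c="$c" '$2==s && $3==c' train.csv | wc -l)
    echo "Survived=$s Pclass=$c: $n"
  done
done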
Bash has a bit of a learning curve, but the returns on that investment are large in terms of convenience. The portability of the environment also means you can work on most systems you will encounter as a Data Scientist.
vim: A powerful and customisable editor for shell scripts (and much else) on the command line.
Emacs: A (potentially) complete operating system that lets you edit files, take notes, work on your email, and do whatever else you can program it to do in ELisp. Disclaimer: we don't know much about Emacs, other than that the people who use it would have been severely offended had we not included it in an article that mentions vim. They would then have proceeded to explain how Emacs is superior. They're nice folk though! Not a cult! Definitely not a cult!
Useful for downloading your data, especially on remote machines.
bash_threads: A simple job queue for running and managing multiple commands so as to exploit a multi-core machine well. Say you want to run 10,000 matrix multiplications, each of which takes 5 seconds. You wouldn't want to run them one by one, as that would take close to 14 hours (10,000 × 5 s = 50,000 s), but you couldn't fire them all off together either, as that would crash your machine. If you run them 10 at a time, however, you get a speed boost without overwhelming the machine. Just hand bash_threads the 10,000 commands and tell it that it can run up to 10 at a time; it will work through them as a queue. Wondering how you'd type out 10,000 commands for the command list? Think for loop in Bash!
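Producing the 10,000 command lines is the easy part; a for loop can write the list for you (a sketch, where matmul and the file names are hypothetical placeholders for your actual program and inputs):

for i in $(seq 1 10000); do
  # one command per line, ready for the job queue to consume
  echo "./matmul input_$i.dat output_$i.dat"
done > commands.txt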
pandoc and mermaid: Take notes in Markdown and convert them into pretty PDFs using pandoc. Express flowcharts, sequence diagrams, and Gantt charts in mermaid and convert them into nice pictures. Integrate mermaid with pandoc and you can keep notes and diagrams in a single file, producing notes with nice diagrams in a single PDF.
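The basic conversion is a one-liner (notes.md is a hypothetical file name; pandoc infers the output format from the .pdf extension and needs a LaTeX engine installed for PDF output):

pandoc notes.md -o notes.pdf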