STAT 39000: Project 3 — Fall 2020

Motivation: The ability to navigate a shell, like bash, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful bash tools, help you navigate a filesystem, and even run bash tools from within an RMarkdown file in RStudio.

Context: At this point in time, you will each have varying levels of familiarity with Scholar. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within RStudio in an RMarkdown file.

Scope: bash, RStudio

Learning objectives
  • Distinguish differences in /home, /scratch, and /class.

  • Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.

  • Analyzing file in a UNIX filesystem: wc, du, cat, head, tail, etc.

  • Creating and destroying files and folder in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.

  • Utilize other Scholar resources: rstudio.scholar.rcac.purdue.edu, notebook.scholar.rcac.purdue.edu, desktop.scholar.rcac.purdue.edu, etc.

  • Use man to read and learn about UNIX utilities.

  • Run bash commands from within and RMarkdown file in RStudio.

There are a variety of ways to connect to Scholar. In this class, we will primarily connect to RStudio Server by opening a browser and navigating to rstudio.scholar.rcac.purdue.edu/, entering credentials, and using the excellent RStudio interface.

Here is a video to remind you about some of the basic tools you can use in UNIX/Linux:

This is the easiest book for learning this stuff; it is short and gets right to the point:

You just log in and you can see it all; we suggest Chapters 1, 3, 4, 5, 7 (you can basically skip chapters 2 and 6 the first time through).

It is a very short read (maybe, say, 2 or 3 hours altogether?), just a thin book that gets right to the details.

Questions

Question 1

Navigate to rstudio.scholar.rcac.purdue.edu/ and login. Take some time to click around and explore this tool. We will be writing and running Python, R, SQL, and bash all from within this interface. Navigate to Tools > Global Options …​. Explore this interface and make at least 2 modifications. List what you changed.

Here are some changes Kevin likes:

  • Uncheck "Restore .Rdata into workspace at startup".

  • Change tab width 4.

  • Check "Soft-wrap R source files".

  • Check "Highlight selected line".

  • Check "Strip trailing horizontal whitespace when saving".

  • Uncheck "Show margin".

(Dr. Ward does not like to customize his own environment, but he does use the emacs key bindings: Tools > Global Options > Code > Keybindings, but this is only recommended if you already know emacs.)

Items to submit
  • List of modifications you made to your Global Options.

Question 2

There are four primary panes, each with various tabs. In one of the panes there will be a tab labeled "Terminal". Click on that tab. This terminal by default will run a bash shell right within Scholar, the same as if you connected to Scholar using ThinLinc, and opened a terminal. Very convenient!

What is the default directory of your bash shell?

Start by reading the section on man. man stands for manual, and you can find the "official" documentation for the command by typing man <command_of_interest>. For example:

# read the manual for the `man` command
# use "k" or the up arrow to scroll up, "j" or the down arrow to scroll down
man man
Items to submit
  • The full filepath of default directory (home directory). Ex: Kevin’s is: /home/kamstut

  • The bash code used to show your home directory or current directory (also known as the working directory) when the bash shell is first launched.

Question 3

Learning to navigate away from our home directory to other folders, and back again, is vital. Perform the following actions, in order:

  • Write a single command to navigate to the folder containing our full datasets: /class/datamine/data.

  • Write a command to confirm you are in the correct folder.

  • Write a command to list the files and directories within the data directory. (You do not need to recursively list subdirectories and files contained therein.) What are the names of the files and directories?

  • Write another command to return back to your home directory.

  • Write a command to confirm you are in the correct folder.

Note: / is commonly referred to as the root directory in a linux/unix filesystem. Think of it as a folder that contains every other folder in the computer. /home is a folder within the root directory. /home/kamstut is the full filepath of Kevin’s home directory. There is a folder home inside the root directory. Inside home is another folder named kamstut which is Kevin’s home directory.

Items to submit
  • Command used to navigate to the data directory.

  • Command used to confirm you are in the data directory.

  • Command used to list files and folders.

  • List of files and folders in the data directory.

  • Command used to navigate back to the home directory.

  • Commnad used to confirm you are in the home directory.

Question 4

Let’s learn about two more important concepts. . refers to the current working directory, or the directory displayed when you run pwd. Unlike pwd you can use this when navigating the filesystem! So, for example, if you wanted to see the contents of a file called my_file.txt that lives in /home/kamstut (so, a full path of /home/kamstut/my_file.txt), and you are currently in /home/kamstut, you could run: cat ./my_file.txt.

.. represents the parent folder or the folder in which your current folder is contained. So let’s say I was in /home/kamstut/projects/ and I wanted to get the contents of the file /home/kamstut/my_file.txt. You could do: cat ../my_file.txt.

When you navigate a directory tree using . and .. you create paths that are called relative paths because they are relative to your current directory. Alternatively, a full path or (absolute path) is the path starting from the root directory. So /home/kamstut/my_file.txt is the absolute path for my_file.txt and ../my_file.txt is a relative path. Perform the following actions, in order:

  • Write a single command to navigate to the data directory.

  • Write a single command to navigate back to your home directory using a relative path. Do not use ~ or the cd command without a path argument.

Items to submit
  • Command used to navigate to the data directory.

  • Command used to navigate back to your home directory that uses a relative path.

Question 5

In Scholar, when you want to deal with really large amounts of data, you want to access scratch (you can read more here). Your scratch directory on Scholar is located here: /scratch/scholar/$USER. $USER is an environment variable containing your username. Test it out: echo /scratch/scholar/$USER. Perform the following actions:

  • Navigate to your scratch directory.

  • Confirm you are in the correct location.

  • Execute myquota.

  • Find the location of the myquota bash script.

  • Output the first 5 and last 5 lines of the bash script.

  • Count the number of lines in the bash script.

  • How many kilobytes is the script?

You could use each of the commands in the relevant topics once.

When you type myquota on Scholar there are sometimes two warnings about xauth but sometimes there are no warnings. If you get a warning that says Warning: untrusted X11 forwarding setup failed: xauth key data not generated it is safe to ignore this error.

Commands often have options. Options are features of the program that you can trigger specifically. You can see the options of a command in the DESCRIPTION section of the man pages. For example: man wc. You can see -m, -l, and -w are all options for wc. To test this out:

# using the default wc command. "/class/datamine/data/flights/1987.csv" is the first "argument" given to the command.
wc /class/datamine/data/flights/1987.csv
# to count the lines, use the -l option
wc -l /class/datamine/data/flights/1987.csv
# to count the words, use the -w option
wc -w /class/datamine/data/flights/1987.csv
# you can combine options as well
wc -w -l /class/datamine/data/flights/1987.csv
# some people like to use a single tack `-`
wc -wl /class/datamine/data/flights/1987.csv
# order doesn't matter
wc -lw /class/datamine/data/flights/1987.csv

The -h option for the du command is useful.

Items to submit
  • Command used to navigate to your scratch directory.

  • Command used to confirm your location.

  • Output of myquota.

  • Command used to find the location of the myquota script.

  • Absolute path of the myquota script.

  • Command used to output the first 5 lines of the myquota script.

  • Command used to output the last 5 lines of the myquota script.

  • Command used to find the number of lines in the myquota script.

  • Number of lines in the script.

  • Command used to find out how many kilobytes the script is.

  • Number of kilobytes that the script takes up.

Question 6

Perform the following operations:

  • Navigate to your scratch directory.

  • Copy and paste the file: /class/datamine/data/flights/1987.csv to your current directory (scratch).

  • Create a new directory called my_test_dir in your scratch folder.

  • Move the file you copied to your scratch directory, into your new folder.

  • Use touch to create an empty file named im_empty.txt in your scratch folder.

  • Remove the directory my_test_dir and the contents of the directory.

  • Remove the im_empty.txt file.

rmdir may not be able to do what you think, instead, check out the options for rm using man rm.

Items to submit
  • Command used to navigate to your scratch directory.

  • Command used to copy the file, /class/datamine/data/flights/1987.csv to your current directory (scratch).

  • Command used to create a new directory called my_test_dir in your scratch folder.

  • Command used to move the file you copied earlier 1987.csv into your new my_test_dir folder.

  • Command used to create an empty file named im_empty.txt in your scratch folder.

  • Command used to remove the directory and the contents of the directory my_test_dir.

  • Command used to remove the im_empty.txt file.

Question 7

Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan.