TDM 10200: Project 9 — Spring 2023
Motivation: Working in pandas can be fun! Knowing how to wrangle and clean data in pandas is a helpful skill to have in your tool belt!
Context: Now that we are feeling more comfortable with building functions and using pandas, we want to continue to build skills and use pandas to solve data-driven problems.
Scope: python, pandas, numpy
Dataset(s)
When launching Jupyter Notebook on Anvil you will need to use 2 cores.
The following questions will use the following dataset(s):
/anvil/projects/tdm/data/disney/total.parquet
Helpful Hints
import pandas as pd
disney = pd.read_parquet('/anvil/projects/tdm/data/disney/total.parquet')
Insider Knowledge
It is helpful to use a Parquet file when we need efficient storage. If we tried to read in all the .csv files in the disney folder, the kernel would crash. In short, a Parquet file allows for high-performance data compression and encoding schemes to deal with large amounts of complex data. The format is a column-oriented file format, while .csv files tend to be row-oriented.
You can read more about what row vs column oriented databases are here.
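To make the row-oriented versus column-oriented distinction concrete, here is a minimal sketch in plain Python. The ride names and wait times are made-up values for illustration, not taken from the dataset, and the two layouts only mimic how .csv and Parquet organize data; they are not the actual file formats.

```python
# Hypothetical records, for illustration only (not from the disney dataset).
# Row-oriented layout: each record keeps all of its fields together,
# similar to the lines of a .csv file.
rows = [
    {"ride_name": "ride_a", "wait": 10},
    {"ride_name": "ride_b", "wait": 25},
    {"ride_name": "ride_a", "wait": 15},
]

# Column-oriented layout: all values of one field are stored together,
# similar to how Parquet organizes data.
columns = {
    "ride_name": [r["ride_name"] for r in rows],
    "wait": [r["wait"] for r in rows],
}

# Averaging one field in the row layout has to visit every full record...
avg_from_rows = sum(r["wait"] for r in rows) / len(rows)

# ...while the column layout only needs the single list it cares about.
avg_from_columns = sum(columns["wait"]) / len(columns["wait"])

print(avg_from_rows, avg_from_columns)
```

Both layouts hold the same information; the column layout simply makes it cheaper to work with one field at a time, which is part of why it suits large analytical datasets.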
ONE
Luckily this data is being read in as already cleaned data. It also has been recently updated and has a lot more information, i.e., it has more data from more rides.
-
Since there is a lot of new ride data, let’s print the name of each ride.
-
How many rows of data are there for each ride?
-
What is different about the information that you receive from groupby() versus value_counts()? Which one yields the information asked for in question 1b? Why?
-
Go ahead and import the numpy package and see if you can find the frequency of JUST the ride named hall_of_presidents from the column ride_name. Under Helpful Hint there are two different ways to do that, but can you come up with a third?
Helpful Hint
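import numpy as np
# Option 1: keep only the rows for this ride, then count them
disney[disney.ride_name == 'hall_of_presidents'].shape[0]
# OR
# Option 2: sum the boolean mask (each True counts as 1)
(disney['ride_name'] == 'hall_of_presidents').sum()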
Insider Knowledge
-
Note that when you print the unique values in the column ride_name, the output first tells you that it is an array. An array is an ordered collection of elements in which every value has the same data type.
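The ideas in this question can be tried out on a small scale first. The sketch below uses a hypothetical toy dataframe (the ride names and values are invented, not taken from the dataset) to show what value_counts() and groupby() each return, and how a NumPy array settles on a single data type. It is meant as a starting point for experimenting, not as the answer to the questions above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "ride_name": ["ride_a", "ride_b", "ride_a", "ride_a"],
    "SPOSTMIN": [10.0, np.nan, 20.0, np.nan],
})

# value_counts() returns one Series: how often each value occurs,
# sorted from most to least frequent.
counts = toy["ride_name"].value_counts()
print(counts)

# groupby().size() also counts rows per group, ordered by the group key.
sizes = toy.groupby("ride_name").size()
print(sizes)

# groupby().count() counts the non-null values in each remaining column,
# so it can differ from the number of rows.
non_null = toy.groupby("ride_name").count()
print(non_null)

# A NumPy array holds one data type: mixing ints and a float gives floats.
arr = np.array([1, 2, 3.5])
print(arr.dtype)
```

Comparing the printed results side by side is a quick way to see which call reports rows per ride and which reports something else.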
-
Code used to solve this problem.
-
Output from running the code.
-
Answer to questions a,b,c,d
TWO
Create a new function that accepts a ride name as an argument, and prints two things: (1) the first year the data for that ride was collected, and (2) the most recent year that the data for that ride was collected.
-
Code used to solve this problem.
-
Output from running the code.
THREE
Notice that the dataset has two columns, SPOSTMIN and SACTMIN. Each row has a value for either SPOSTMIN or SACTMIN, but not both.
-
How many total rows of data do we have?
-
How many non-null rows for SPOSTMIN?
-
How many non-null rows for SACTMIN?
-
Combine columns SPOSTMIN and SACTMIN to create a new variable named newcolumn.
-
What is the length of newcolumn? Is that the same as the number of rows in the disney dataframe?
Helpful Hints
It might be useful to use the combine_first function for question 3d.
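As a rough illustration of how combine_first behaves, here is a minimal sketch on a hypothetical toy dataframe. The numbers are invented; only the column names mirror the dataset, and the toy frame assumes, as described above, that each row has a value in exactly one of the two columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each row has a value in exactly one column.
toy = pd.DataFrame({
    "SPOSTMIN": [10.0, np.nan, 30.0, np.nan],
    "SACTMIN": [np.nan, 12.0, np.nan, 45.0],
})

# combine_first keeps the caller's value where it is present and
# falls back to the other Series where the caller is null.
toy["newcolumn"] = toy["SPOSTMIN"].combine_first(toy["SACTMIN"])

print(toy)

# Compare the length of the new column with its number of non-null values.
print(len(toy["newcolumn"]), toy["newcolumn"].notna().sum())
```

The same call should carry over to the full disney dataframe, where checking the length against the row count is a useful sanity test.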
-
Code used to solve this problem.
-
Output from running the code.
-
Answer to questions a,b,c,d,e
FOUR
-
Find the max and min SACTMIN time for each ride.
-
Find the max and min SPOSTMIN time for each ride.
-
Find the average SPOSTMIN time for each ride.
-
Find the average SACTMIN time for each ride.
Helpful Hint
Note that the value -999 indicates that the attraction was closed.
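Because -999 is a code rather than a real wait time, leaving it in would distort a minimum or an average. One possible way to handle it, sketched here on hypothetical toy data with invented values, is to treat -999 as missing before aggregating. This is one approach under that assumption, not the only valid one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; -999 marks a closed attraction.
toy = pd.DataFrame({
    "ride_name": ["ride_a", "ride_a", "ride_a", "ride_b", "ride_b"],
    "SPOSTMIN": [10.0, -999.0, 30.0, 5.0, 15.0],
})

# Treat -999 as missing so it cannot become the minimum or skew the mean.
cleaned = toy["SPOSTMIN"].replace(-999, np.nan)

# Aggregate the cleaned values per ride; null values are skipped.
summary = cleaned.groupby(toy["ride_name"]).agg(["min", "max", "mean"])
print(summary)
```

Without the replace step, the minimum for ride_a in this toy example would be -999 instead of a real wait time.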
-
Code used to solve this problem.
-
Output from running the code.
-
Answer to questions a-d
FIVE
-
Find the date that each ride was most frequently checked.
-
What was the most commonly closed ride? (Again, note that the value -999 indicates that the attraction was closed.)
-
Code used to solve this problem.
-
Output from running the code.
-
Answer to questions a and b
Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.