QUESTION

Data Science: Tracking COVID-19

Problem 1)
- Write a function named rows_and_columns that takes in a pandas data frame and returns the string: The data has X rows and Y columns. where X is the number of rows and Y is the number of columns. For example, if the data frame has 100 rows and 10 columns, the function should return the string: The data has 100 rows and 10 columns.
- Write a function named get_min_max that takes in a pandas data frame and a column name as a string, and returns the minimum and maximum value of that column in a tuple
- Write a function named odd_get_min_max that takes in a pandas data frame and a column name as a string, and returns the minimum and maximum values for the odd rows and that column in a tuple

[ ] # Problem 1) write your first function here

    # Problem 1) write your second function here

    # Problem 1) write your third function here
And we can test our functions!

[ ] # here we call your functions
    print(rows_and_columns(covid19_data))
    print(get_min_max(covid19_data,'latitude'))
    print(odd_get_min_max(covid19_data,'latitude'))

To get a sense of the data, let's view the column names and a sample of the data.

[ ] print(covid19_data.columns)
    print(covid19_data.head())
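As a rough illustration of what Problem 1 asks for, here is a minimal sketch run on a small made-up data frame rather than covid19_data. The toy values are invented, and reading "odd rows" as the rows at 0-based positions 1, 3, 5, ... is an assumption; the assignment may intend a different convention.

```python
import pandas as pd


def rows_and_columns(df):
    # DataFrame.shape is a (rows, columns) tuple
    n_rows, n_cols = df.shape
    return f"The data has {n_rows} rows and {n_cols} columns."


def get_min_max(df, column):
    # minimum and maximum of one column, returned as a tuple
    return (df[column].min(), df[column].max())


def odd_get_min_max(df, column):
    # Assumption: "odd rows" means 0-based positions 1, 3, 5, ...
    odd_rows = df[column].iloc[1::2]
    return (odd_rows.min(), odd_rows.max())


# Invented toy data, standing in for covid19_data
toy = pd.DataFrame({
    "latitude": [10.0, -5.0, 42.5, 30.0],
    "longitude": [1.0, 2.0, 3.0, 4.0],
})

print(rows_and_columns(toy))
print(get_min_max(toy, "latitude"))
print(odd_get_min_max(toy, "latitude"))
```

If "odd rows" is meant to be 1-based (first, third, fifth row), the slice would be iloc[0::2] instead.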
Data Cleaning and Wrangling

The data are messy. Various parties have contributed to the dataset without following a consistent format for the columns. If we are interested in questions about age, for example, we need to clean the age column. First, let's visualize the age column data by counting the unique fields.

Problem 2) Write a function named "get_uniq" that takes in a pandas data frame and a column name, and returns a numpy ndarray containing the unique values in that column. Hint: use the Series.unique() function: https://pandas.pydata.org/pandas-docs/stable/reference/series.html

[ ] # Problem 2) write your function here

Let's use your function to print out the unique elements in the age column.

[ ] print(get_uniq(covid19_data,'age'))

We can also compute the counts for each of the unique elements. Pandas gives us a handy function to do this: value_counts(). By default value_counts() ignores NaN values.

[ ] print(covid19_data['age'].value_counts())
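To make the difference between unique() and value_counts() concrete, the following sketch applies both to an invented age column. The entries are made up, and the column is forced to object dtype on the assumption that the real, mixed-format column would be stored that way.

```python
import numpy as np
import pandas as pd


def get_uniq(df, column):
    # Series.unique() returns the distinct values, NaN included
    return df[column].unique()


# Invented toy column with inconsistent age formats and one missing value
toy = pd.DataFrame({
    "age": pd.Series(["30", "30-39", np.nan, "30", "45"], dtype=object),
})

uniq = get_uniq(toy, "age")
counts = toy["age"].value_counts()

print(uniq)    # distinct entries, including the NaN
print(counts)  # per-value counts, NaN ignored by default
```

Passing dropna=False to value_counts() would include the NaN entries in the counts as well.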
Problem 3) Define a function named "unique_nonNaN_cnt" that takes a pandas data frame, a column name as a string, and returns the number of unique non-NaN values. You can think about this as either counting the non-NaN values or summing up the unique non-NaN values from the value_counts() method.

[ ] # Problem 3) write your function here

and test our function...

[ ] print("Total of " + str(unique_nonNaN_cnt(covid19_data,'age')) + " non-NaN age entries.")

It's clear that the individuals entering the data were not following the same standard or format! We will need to clean this data before we can use it. There is a large amount of missing data, and a large variety of entries. We should clean the age column. Let's convert the ages to age ranges for plotting. For the existing ranges in the data, let's consider the mean age.

[ ] # cleaning the age column
    # We observe that the age column does not follow a nice format
    # defining the age ranges
    age_ranges = []
    for age in range(0,100,10):
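The counting and the age-range conversion described above might look roughly like the sketch below. Two things here are assumptions and not taken from the notebook: the count follows the hint's reading (the total number of non-NaN entries, which is the sum of value_counts()), and to_age_range is a hypothetical helper that maps a single age or an existing "low-high" range (via its mean) to a ten-year label such as "30-40". The notebook's own cleaning code is truncated, so this is not a reconstruction of it.

```python
import numpy as np
import pandas as pd


def unique_nonNaN_cnt(df, column):
    # Assumed reading: total number of non-NaN entries,
    # i.e. the per-value counts from value_counts() summed up.
    return int(df[column].value_counts().sum())


def to_age_range(value, width=10):
    """Hypothetical helper: map a raw age entry to a label like "30-40"."""
    if pd.isna(value):
        return np.nan
    text = str(value).strip()
    try:
        if "-" in text:
            low, high = text.split("-", 1)
            age = (float(low) + float(high)) / 2  # mean of an existing range
        else:
            age = float(text)
    except ValueError:
        return np.nan  # entries that cannot be parsed are treated as missing
    start = int(age // width) * width
    return f"{start}-{start + width}"


# Invented toy column with mixed formats
toy = pd.DataFrame({
    "age": pd.Series(["30", "30-39", np.nan, "30", "45", "unknown"], dtype=object),
})

n_entries = unique_nonNaN_cnt(toy, "age")  # non-NaN entries
n_distinct = toy["age"].nunique()          # distinct non-NaN values
toy["age_range"] = toy["age"].map(to_age_range)

print(n_entries, n_distinct)
print(toy)
```

If the problem instead wants the number of distinct non-NaN values, Series.nunique() gives that directly, as shown by n_distinct.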
Problem 4) Fill in the relevant prompts below to create the bar plot of COVID-19 cases by sex. As a hint, we can select a subset of rows based on the value in a column with the syntax: dataframe[dataframe[colname] == value] where dataframe is a pandas data frame, colname is the column name, and value is some value for the colname. You can use other comparisons as well, e.g., to get all rows with latitude > 0, we can use the syntax: covid19_data[covid19_data.latitude>0]

[ ] # distribution of cases by age and sex
    # Problem 4) Complete where we have indicated below
    def create_bar_plot_by_sex(covid19_data, age_ranges):
        age_range_labels = [str(x[0])+"-"+str(x[1]) for x in age_ranges]
        # from the covid19_data, select the age_range for female rows
        female_age_ranges = # problem 4, fill this in
        counts_female = female_age_ranges.value_counts()[age_range_labels]
        # from the covid19_data, select the age_range for male rows
        male_age_ranges = # problem 4, fill this in
        counts_male = male_age_ranges.value_counts()[age_range_labels]
        # create plot
        fig, ax = plt.subplots(figsize=(20,10))
        index = np.arange(len(age_ranges))
        bar_width = 0.35
        opacity = 0.8
        # the bar function draws a bar plot, the first two arguments are the x position of the bar, and its height
        rects1 = plt.bar(, , bar_width, # problem 4, fill in first two arguments
                         alpha=opacity, color='b', label='Male')
        rects2 = plt.bar(, , bar_width, # problem 4, fill in first two arguments hint: you have to use the bar_width in the first argument
                         alpha=opacity, color='g', label='Female')
        plt.xlabel('Age Range')
        plt.ylabel('Count')
        plt.title('Corona Cases per Age Group')
        #plt.xticks(index + bar_width, age_ranges)
        plt.xticks(index, ["["+str(x[0])+","+str(x[1])+")" for x in age_ranges])
        plt.legend()
        plt.tight_layout()
        return counts_female, counts_male
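The row-selection hint and the side-by-side bar positions can be tried out without plotting anything. The sketch below uses an invented data frame; the column name sex and its values "female" and "male" are assumptions about how the real dataset encodes this field.

```python
import numpy as np
import pandas as pd

# Invented toy data; "sex" and its values are assumed names
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female"],
    "age_range": ["30-40", "30-40", "40-50", "40-50", "30-40"],
})
age_range_labels = ["30-40", "40-50"]

# Boolean indexing: keep rows where the condition holds, then take one column
female_age_ranges = toy[toy["sex"] == "female"]["age_range"]
male_age_ranges = toy[toy["sex"] == "male"]["age_range"]

# Count per label, ordered like age_range_labels
counts_female = female_age_ranges.value_counts()[age_range_labels]
counts_male = male_age_ranges.value_counts()[age_range_labels]

# x positions for two bars per group: shifting by bar_width puts them side by side
index = np.arange(len(age_range_labels))
bar_width = 0.35
male_x = index
female_x = index + bar_width

print(counts_female.tolist(), counts_male.tolist())
print(male_x, female_x)
```

Indexing value_counts() with a list of labels raises a KeyError when a label has no rows at all; value_counts().reindex(age_range_labels, fill_value=0) is one way to get zeros instead.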
Problem 5) Print the same bar plot by country, but limit the plot to countries that have >1000 cases.

[ ] # distribution of cases by country with >1000 cases
    # Problem 5) Complete where we have indicated below
    def create_bar_plot_by_country(covid19_data):
        country_cnts = covid19_data.country.value_counts()
        # get the counts for countries with >1000 cases, this should be a data series
        counts = # Problem 5, fill this in
        # get number of countries with >1000 cases, this should be an integer
        n_groups = # Problem 5, fill this in
        # create plot
        fig, ax = plt.subplots(figsize=(20,10))
        index = np.arange(n_groups)
        bar_width = 0.35
        opacity = 0.8
        rects1 = plt.bar(index, counts, bar_width, alpha=opacity, color='b')
        plt.xlabel('Country')
        plt.ylabel('Count')
        plt.title('Corona Cases per Country')
        plt.xticks(index, ) # Problem 5, fill this in
        plt.legend()
        plt.tight_layout()
        return n_groups, counts
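Filtering a value_counts() result works the same way as filtering rows of a data frame. A minimal sketch on invented data follows; the country names "A" to "D" and their case numbers are made up, and tick_labels is a hypothetical name for what could be passed to plt.xticks.

```python
import pandas as pd

# Invented toy data: one row per case
toy = pd.DataFrame({
    "country": ["A"] * 1500 + ["B"] * 1001 + ["C"] * 1000 + ["D"] * 3,
})

country_cnts = toy.country.value_counts()

# Boolean mask on the Series keeps only countries with more than 1000 cases
counts = country_cnts[country_cnts > 1000]
n_groups = len(counts)

# The Series index holds the country names, usable as tick labels
tick_labels = counts.index.tolist()

print(counts)
print(n_groups, tick_labels)
```

Country "C" has exactly 1000 cases and is excluded, since the condition is strictly greater than 1000.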
Problem 6) Professor Derek is worried about outcomes over time for his age bracket (30-40). He wants you to plot the relative frequency of positive outcomes (y-axis) over time (x-axis) while also including 1 standard deviation above and below each point. You should not compute Spearman's correlation here. Fill in the function below.

[ ] # Problem 6) Complete where we have indicated below
    def create_bar_plot_for_derek(covid19_data):
        # first we subset the data by the appropriate age bracket and do a bit of cleaning
        prof_age_data = covid19_data[covid19_data.age_range=="30-40"]
        prof_age_data = prof_age_data.replace(to_replace='25.02.2020 - 26.02.2020', value='25.02.2020')
        # and we convert the column to a date-time
        prof_age_data['date_confirmation'] = pd.to_datetime(prof_age_data['date_confirmation'], dayfirst=True)
        outcomes_over_time = # Problem 6) fill in here
        outcomes_over_time = outcomes_over_time.dropna() # we should drop the rows with missing values
        x = # Problem 6) fill in here
        y = # Problem 6) fill in here
        error = # Problem 6) fill in here
        fig, ax = plt.subplots(figsize=(20,10))
        ax.errorbar(x, y, yerr=error, fmt='-o')
        plt.ylabel('Relative Frequency', fontsize=14)
        plt.xlabel('Date', fontsize=14)
        return x, y, error
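One possible way to get a relative frequency and a standard deviation per date is to encode each case as 1 (positive outcome) or 0 and aggregate by date. The sketch below does this on invented data. The column name outcome, the label "recovered" counting as positive, and the 0/1 encoding itself are all assumptions; the assignment may intend a different definition of the error term.

```python
import pandas as pd

# Invented toy data; "outcome" and its labels are assumed names
toy = pd.DataFrame({
    "date_confirmation": ["25.02.2020", "25.02.2020", "26.02.2020",
                          "26.02.2020", "26.02.2020", "27.02.2020"],
    "outcome": ["recovered", "died", "recovered",
                "recovered", "died", "recovered"],
})
toy["date_confirmation"] = pd.to_datetime(toy["date_confirmation"], dayfirst=True)

# Rows without a recorded outcome carry no information, so drop them first
toy = toy.dropna(subset=["outcome"])

# 1.0 for a positive outcome, 0.0 otherwise (assumed set of positive labels)
positive_labels = ["recovered"]
toy["positive"] = toy["outcome"].isin(positive_labels).astype(float)

# Per date: mean of the indicator = relative frequency, std = sample std dev
outcomes_over_time = toy.groupby("date_confirmation")["positive"].agg(["mean", "std"])
outcomes_over_time = outcomes_over_time.dropna()  # dates with one case have no std

x = outcomes_over_time.index
y = outcomes_over_time["mean"]
error = outcomes_over_time["std"]

print(outcomes_over_time)
```

With x, y, and error defined this way, the ax.errorbar(x, y, yerr=error, fmt='-o') call from the notebook template would draw each point with one standard deviation above and below it.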

Public Answer

CTHLRI The First Answerer