STAT 101 Daily Questions 1-7

Fall 2024

Dr. Darien DeWolf

Daily Question 1: Introduction, the Who and the What

The personnel department keeps records on all employees in a company. Here is the information that they keep in one of their data files: employee identification number, last name, first name, middle initial, department, number of years with the company, salary, education (coded as high school, some college, or college degree), and age.

Identify the WHO and the WHAT in this scenario.

For the WHAT, do remember to indicate the type (categorical or quantitative) and units/categories of the variable.

Solution:

[Optional parts or comments in brackets.]

Who: The employees of this company. [Or: the records of the employees of this company]

What:

Employee identification number: [Nominal] Categorical (an identifier; each ID# is its own category)
Last name: [Nominal] Categorical
First name: [Nominal] Categorical
Middle initial: [Nominal] Categorical
Department: [Nominal] Categorical
Number of years with the company: Quantitative, units: years
Salary: Quantitative, units: dollars ($)
Education: [Ordinal] Categorical, categories: High School, Some College, College Degree
Age: Quantitative, units: years

Daily Question 2: Graphically Displaying Variables

Suppose that 30 voters are randomly sampled and we record several things about them:

Province of residence
Age
Income
Voting status (1 = voted last election, 0 = did not vote)
Party registration (1 = LPC, 2 = CPC, 3 = BQ, 4 = NDP, 5 = GPC, 6 = other)

For each of the variables, which graph would you use to display them and why?

Solution:

The graph type is based on the variable type:

Province is categorical: use a bar chart [or pie chart].
Age is quantitative: use a histogram.
Income is quantitative: use a histogram.
Voting status is categorical: use a bar chart [or pie chart].
Party registration is categorical: use a bar chart [or pie chart].
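
As a minimal sketch of the rule above, the following Python snippet maps each variable to a suggested graph from its type. The dictionary keys and the function name suggest_graph are hypothetical names chosen for illustration. Note that voting status and party registration are stored as numbers, but the numbers are only codes, so both variables stay categorical.

```python
# Variable types as classified in the solution above.
# Numeric codes (0/1, 1..6) label categories; they are not measurements.
VARIABLE_TYPES = {
    "province": "categorical",
    "age": "quantitative",
    "income": "quantitative",
    "voting_status": "categorical",
    "party_registration": "categorical",
}


def suggest_graph(variable_type):
    """Return the default graph for a variable type (illustrative helper)."""
    if variable_type == "categorical":
        return "bar chart"
    if variable_type == "quantitative":
        return "histogram"
    raise ValueError(f"unknown variable type: {variable_type!r}")


suggestions = {name: suggest_graph(kind) for name, kind in VARIABLE_TYPES.items()}

for name, graph in suggestions.items():
    print(f"{name}: {graph}")
```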

Daily Question 3: Describing Quantitative Variables

Briefly describe the shape (symmetry, modality, and outliers) of this distribution.

Then, to describe the centre and spread of a distribution, you have some choice:

Use the mean and standard deviation.
Use the median and IQR.

Which of these two is most appropriate for the distribution above, and why?

Solution:

This distribution is [approximately] symmetric, bimodal, no outliers.

The mean and standard deviation are appropriate, because the distribution is symmetric with no outliers.
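
To see why symmetry and outliers drive this choice, here is a small sketch using Python's statistics module. Both data sets are made up for illustration and are not the distribution from the question. The second set is the first with one extreme value added: the mean and standard deviation move a lot, while the median and IQR barely change.

```python
from statistics import mean, median, quantiles, stdev


def iqr(values):
    """Interquartile range, using the module's default quartile method."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


# Hypothetical data: a symmetric sample, then the same sample plus one outlier.
symmetric = [2, 3, 3, 4, 4, 4, 5, 5, 6]
with_outlier = symmetric + [40]

for label, data in [("symmetric", symmetric), ("with outlier", with_outlier)]:
    print(
        f"{label}: mean={mean(data):.2f}, SD={stdev(data):.2f}, "
        f"median={median(data)}, IQR={iqr(data):.2f}"
    )
```

Different software computes quartiles in slightly different ways, so the IQR may differ a little from a hand calculation.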

Daily Question 4: Box Plots and Outliers

Which boxplot has the least spread? How did you determine that?

Solution:

Lettuce, because its IQR (the width of the box) is the smallest. (We usually use the IQR when the median is shown.)

[Also OK: Tomatoes, since its range is the smallest. The range is not often used, but it is still a valid measure of spread.]
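
The two measures of spread can disagree, which is why both answers are acceptable. The sketch below uses invented five-number summaries for two hypothetical groups (group_a and group_b are not the values from the boxplots in the question). It also computes the fences from the common 1.5 x IQR rule for flagging outliers, which is assumed here as the convention behind the boxplots.

```python
from typing import NamedTuple


class FiveNumberSummary(NamedTuple):
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float


def iqr(s):
    """Width of the box: Q3 - Q1."""
    return s.q3 - s.q1


def value_range(s):
    """Distance between the extremes: max - min."""
    return s.maximum - s.minimum


def outlier_fences(s):
    """Lower and upper fences under the 1.5 x IQR convention."""
    step = 1.5 * iqr(s)
    return s.q1 - step, s.q3 + step


# Hypothetical summaries, chosen so the two measures disagree.
summaries = {
    "group_a": FiveNumberSummary(1, 4, 5, 6, 12),
    "group_b": FiveNumberSummary(2, 3, 5, 8, 9),
}

smallest_iqr = min(summaries, key=lambda name: iqr(summaries[name]))
smallest_range = min(summaries, key=lambda name: value_range(summaries[name]))

print("smallest IQR:", smallest_iqr)
print("smallest range:", smallest_range)
print("fences for group_a:", outlier_fences(summaries["group_a"]))
```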

Daily Question 5: Associations Between Quantitative Variables

Describe this scatterplot in terms of direction, form, strength and outliers.

Solution:

Positive, linear, strong relationship with no outliers.

[One may say that one or two of the points in the upper right could be outliers.]

Daily Question 6: Simple Linear Regression

How well does the population of a state predict the number of undergraduates?

The population and the number of undergraduates in each state are measured.

The least squares regression line is $\widehat{y} = -15057 + 0.05326x$.

Interpret the regression slope and intercept in full context of the problem.

Solution:

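Intercept: The predicted number of undergraduates in a state with population 0 is $-15057$ students. [This makes no practical sense: no state has population 0, and a count of students cannot be negative.]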

Slope: When the population of a state increases by 1 person, the predicted number of undergraduates increases by 0.05326 students.
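
The interpretations above can be checked numerically. This sketch plugs values into the fitted line from the question; the example population of 1,000,000 is an arbitrary illustrative value, not a figure from the data.

```python
# Coefficients of the fitted line given in the question.
INTERCEPT = -15057
SLOPE = 0.05326


def predict_undergraduates(population):
    """Predicted number of undergraduates for a state of the given population."""
    return INTERCEPT + SLOPE * population


example_population = 1_000_000  # hypothetical state, for illustration only
predicted = predict_undergraduates(example_population)

# Adding one person changes the prediction by exactly the slope.
slope_check = predict_undergraduates(example_population + 1) - predicted

# Population below which the line predicts a negative count.
zero_crossing = -INTERCEPT / SLOPE

print(f"prediction at {example_population:,}: {predicted:,.0f} students")
print(f"change per extra person: {slope_check:.5f} students")
print(f"predictions are negative below about {zero_crossing:,.0f} people")
```

The last line shows why the intercept should not be read literally: the line only describes states with populations in the range of the observed data.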

Daily Question 7: More on Regression

Suppose that 15 random adults are asked their age and number of cats they've had in their life.

The fitted line plot and regression output are produced:

Are the assumptions for regression met?

Interpret the slope and intercept of the regression model in context.

We see that $r^2 = 84.62\%$. Interpret this in full context.

Solution:

Assumptions

Quantitative Variables Condition: The number of cats owned and age (years) are both quantitative. Good!

Straight Enough Condition: The scatter plot has no apparent bends and looks straight. Good!

No Outliers Condition: We don't see any observations with unusually high or low x-values (high leverage) or large residuals. Good!

Interpretations

The intercept: The predicted number of cats for a 0-year-old adult [which, of course, makes no physical sense] is 0.29 cats.

The slope: For each 1-year increase in age of adults, the predicted number of cats they've owned increases by 0.06 cats.

r-squared (84.62%): 84.62% of the variation observed in the number of cats owned in this sample can be explained by the linear regression on age.
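
As a rough numerical companion to these interpretations, the sketch below uses the rounded coefficients quoted in the solution (0.29 and 0.06) and the stated r-squared. The age of 50 is an arbitrary example value. The correlation r is recovered as the square root of r-squared, taking the sign of the slope.

```python
import math

# Rounded values quoted in the solution above.
INTERCEPT = 0.29
SLOPE = 0.06
R_SQUARED = 0.8462


def predict_cats(age):
    """Predicted number of cats owned for an adult of the given age (years)."""
    return INTERCEPT + SLOPE * age


# The correlation has the same sign as the slope.
r = math.copysign(math.sqrt(R_SQUARED), SLOPE)

# Share of the variation in cats owned that the line does not explain.
unexplained = 1 - R_SQUARED

print(f"predicted cats at age 50: {predict_cats(50):.2f}")
print(f"correlation r: {r:.3f}")
print(f"unexplained variation: {unexplained:.2%}")
```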