
Family Name: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Other Names: . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Student ID: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Course Code: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
TRIMESTER 2 – YEAR 2024
Course Code: DATA 301/471
DATA SCIENCE IN PRACTICE
10/10/2024
Time Allowed: 45 MINUTES
******** WITH SOLUTIONS **********
OPEN BOOK
Permitted materials: Silent non-programmable calculators or silent programmable calculators
with their memories cleared are permitted.
Printed foreign-language-to-English dictionaries are permitted.
No other material is permitted.
Instructions:
There are TWO sections (A and B).
Answer ALL questions in BOTH sections in the spaces provided.
Write your Student ID number on the top of every page.
For marking use only:
A: /10
B: /30
Total: /40
DATA 301/471
Page 1 of 6

Test 2 - Part I (Communication)
1. Section A (10 marks)
Alejandro C. Frery
Consider a report that describes observations from three classes labelled as “A”, “B”, and “C”. Figure 1a
shows the first version of the observations’ boxplots, while Figure 1b shows the final version of the same
display. The first version is the default output of ggplot2.
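[Figure 1 panels: (a) First version; (b) Final version. Both show boxplots of the observations for classes A, B, and C. Text recovered from the panels: title "Boxplots of observations per class", y-axis label "Observations", legend "Class" with entries A, B, C.]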
Figure 1: The same dataset, different figure versions.
Identify the changes. Why did the authors make such alterations?
Answer:

2. Section B (30 marks)
(a) (2 marks) Which of the following can be true for selecting base learners for an ensemble?
A. Different learners can come from the same algorithm with different hyperparameters.
B. Different learners can come from different algorithms.
C. Different learners can come from different training data.
D. All of the above.
Answer:
(b) (2 marks) Suppose, in a binary classification problem using the same dataset, there are 3 models with 70% accuracy each. If we use the majority voting method to ensemble these models, what is the maximum accuracy we can get?
A. 100%
B. 78.38%
C. 70%
D. 44%
Answer:
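One way to reason about the upper bound in question (b) is to construct a case in which the models' errors never overlap. The sketch below is a made-up toy setup (10 samples, hypothetical error sets), not data from the test: each model is wrong on a different three samples, so every sample still receives at least two correct votes.

```python
# Hypothetical toy setup: 10 samples, three models, each correct on 7 of 10.
n_samples = 10

# Each model errs on a different, non-overlapping set of three samples.
wrong = [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}]

# correct[m][i] is True when model m classifies sample i correctly.
correct = [[i not in w for i in range(n_samples)] for w in wrong]

# Individual accuracies of the three models.
accuracies = [sum(c) / n_samples for c in correct]

# In a binary problem, the majority vote is right whenever at least
# two of the three models are right on that sample.
vote_correct = [sum(c[i] for c in correct) >= 2 for i in range(n_samples)]
ensemble_accuracy = sum(vote_correct) / n_samples

print(accuracies)
print(ensemble_accuracy)
```

If the error sets overlapped instead, some samples would receive two wrong votes and the ensemble accuracy would drop, so the figure printed here is a best case rather than a typical outcome.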
(c) (2 marks) How does SMOTE (Synthetic Minority Over-sampling Technique) generate synthetic data?
A. By duplicating existing data points from the minority class.
B. By randomly selecting data points from the majority class.
C. By interpolating between existing minority class data points.
D. By creating new data points based on clustering techniques.
Answer:
(d) (2 marks) Which of the following best describes the way SMOTE selects minority class samples for generating synthetic examples?
A. It randomly selects any two points from the dataset.
B. It selects samples that are closest to each other based on a distance metric.
C. It uses Principal Component Analysis (PCA) to create synthetic points.
D. It selects samples based on their feature importance scores.
Answer:
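The core idea behind questions (c) and (d) can be sketched in a few lines of plain Python. This is a simplified illustration, not the imbalanced-learn implementation: the helper name `smote_like_sample`, the value of `k`, and the sample points are all assumptions made for the example.

```python
import math
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    # Move a random fraction of the way from the base towards the neighbour.
    gap = rng.random()
    synthetic = [b + gap * (n - b) for b, n in zip(base, neighbour)]
    return base, neighbour, synthetic


# Made-up minority-class points, for illustration only.
minority_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]]
base, neighbour, synthetic = smote_like_sample(minority_points)
print(base, neighbour, synthetic)
```

The synthetic point always lies on the line segment joining the two chosen minority samples, which is what distinguishes this approach from simple duplication.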
(e) (2 marks) What is the purpose of the setup() function in PyCaret?
A. To split the dataset into training and testing sets.
B. To automatically select the best model for a given task.
C. To initialize the environment, preprocess data, and define the target variable.
D. To deploy the model in a production environment.
Answer:

(f) (2 marks) In Optuna, what is a “trial”?
A. A dataset validation step.
B. A checkpoint in training a neural network.
C. A single run of the model with a specific set of hyperparameters.
D. A predefined sequence of hyperparameters used in optimization.
Answer:
(g) (2 marks) Which of the following is true about SHAP values?
A. SHAP values indicate the importance of each feature by comparing it to the global average.
B. SHAP values only apply to regression models.
C. SHAP values can only be computed for models that use tree-based algorithms.
D. SHAP values show the direction and magnitude of a feature’s contribution to a specific prediction.
Answer:
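To make the notion of a per-prediction contribution in question (g) concrete, consider the special case of a linear model with features treated as independent. In that case the SHAP value of feature i is commonly given as w_i (x_i - mean(x_i)). The sketch below uses that closed form with made-up weights and data and does not call the shap library.

```python
# Hypothetical linear model f(x) = bias + sum(w_i * x_i); weights and data are made up.
weights = [2.0, -1.0]
bias = 0.5
background = [[1.0, 4.0], [3.0, 2.0], [2.0, 6.0]]  # reference data set


def predict(x):
    """Prediction of the toy linear model."""
    return bias + sum(w * v for w, v in zip(weights, x))


# Feature means and average prediction over the background data.
means = [sum(col) / len(col) for col in zip(*background)]
base_value = sum(predict(row) for row in background) / len(background)

# Contribution of each feature to the prediction for one specific sample.
x = [3.0, 6.0]
contributions = [w * (v - m) for w, v, m in zip(weights, x, means)]

print(contributions)                               # signed, one value per feature
print(base_value + sum(contributions), predict(x))
```

The sign of each contribution gives the direction in which the feature pushes this prediction relative to the average prediction, and the contributions add up to the difference between the two.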
(h) (2 marks) How can past and future covariates work together in a Darts forecasting model?
A. Past covariates provide historical data points, while future covariates bring in known future events or conditions to enhance prediction accuracy.
B. Both are used for classification tasks rather than time series forecasting.
C. Past covariates are ignored when future covariates are available.
D. They are used to model unrelated tasks in separate models.
Answer:
(i) (3 marks) What is the output of the code below?
    import numpy as np
    from sklearn.impute import SimpleImputer

    A = [[3, 5], [7, np.nan], [2, 3]]
    B = [[np.nan, 1], [4, 5], [2, np.nan]]

    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    imp.fit(A)
    print(imp.transform(B))
Answer:
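The behaviour tested in question (i) hinges on where the fill values come from: the column means are learned from A during fit and then applied to B during transform. The plain-Python sketch below mimics that logic without scikit-learn; the helper names `column_means` and `fill_missing` are hypothetical.

```python
import math

A = [[3, 5], [7, math.nan], [2, 3]]
B = [[math.nan, 1], [4, 5], [2, math.nan]]


def column_means(rows):
    """Mean of each column, skipping NaN entries."""
    means = []
    for col in zip(*rows):
        present = [v for v in col if not math.isnan(v)]
        means.append(sum(present) / len(present))
    return means


def fill_missing(rows, fill_values):
    """Replace NaN in each column with that column's fill value."""
    return [[f if math.isnan(v) else float(v) for v, f in zip(row, fill_values)]
            for row in rows]


means_from_A = column_means(A)   # learned from A, analogous to imp.fit(A)
imputed_B = fill_missing(B, means_from_A)   # applied to B, analogous to imp.transform(B)
print(means_from_A)
print(imputed_B)
```

Note that the means of B itself never enter the computation.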

(j) (3 marks) Examine the code below and its output.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True, as_frame=True)
    print(X.head())

    selector = SelectKBest(f_classif, k=2)
    selector.fit(X, y)

    supports = selector.get_support()
    print(selector.pvalues_)
Output:
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
    0                5.1               3.5                1.4               0.2
    1                4.9               3.0                1.4               0.2
    2                4.7               3.2                1.3               0.2
    3                4.6               3.1                1.5               0.2
    4                5.0               3.6                1.4               0.2
    [1.66966919e-31 4.49201713e-17 2.85677661e-91 4.16944584e-85]
What is the output of the code below?
    for c in X.columns[supports]:
        print(c)
Answer:
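The selection in question (j) can be traced by hand from the printed p-values. The sketch below assumes that, for these four features, the k highest F-scores correspond to the k smallest p-values, and it rebuilds a boolean mask in the original column order, analogous to what get_support() returns. It uses only the values printed above and does not call scikit-learn.

```python
columns = ["sepal length (cm)", "sepal width (cm)",
           "petal length (cm)", "petal width (cm)"]
p_values = [1.66966919e-31, 4.49201713e-17, 2.85677661e-91, 4.16944584e-85]

k = 2
# Indices of the k smallest p-values (assumed to match the k highest F-scores).
top = sorted(range(len(p_values)), key=lambda i: p_values[i])[:k]

# Boolean mask in the original column order, analogous to get_support().
mask = [i in top for i in range(len(columns))]

# Indexing by the mask keeps the original column order, not the ranking order.
selected = [c for c, keep in zip(columns, mask) if keep]
for c in selected:
    print(c)
```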
(k) (4 marks) What is the output of the code below?
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.pipeline import Pipeline

    X, y = make_classification(n_samples=10000, n_features=10, n_classes=2,
                               weights=[0.99, 0.01], flip_y=0, random_state=1)

    over = SMOTE(sampling_strategy=0.2)
    under = RandomUnderSampler(sampling_strategy=0.5)
    pipe = Pipeline([('o', over), ('u', under)])

    X, y = pipe.fit_resample(X, y)
    print(X.shape)
Answer:
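The shape in question (k) follows from class-count arithmetic alone. The sketch below assumes the generated dataset contains exactly 9900 majority and 100 minority samples, and that a float sampling_strategy is the desired minority-to-majority ratio after each resampling step. It reproduces the arithmetic only and does not call imbalanced-learn.

```python
n_samples, n_features = 10000, 10
n_majority = 9900   # assumed: 99% of 10000 with flip_y=0
n_minority = 100    # assumed: 1% of 10000
assert n_majority + n_minority == n_samples

# SMOTE(sampling_strategy=0.2): oversample the minority class until
# minority / majority == 0.2; the majority class is left unchanged.
n_minority_after = int(0.2 * n_majority)

# RandomUnderSampler(sampling_strategy=0.5): undersample the majority class
# until minority / majority == 0.5; the minority class is left unchanged.
n_majority_after = int(n_minority_after / 0.5)

shape = (n_majority_after + n_minority_after, n_features)
print(n_minority_after, n_majority_after)
print(shape)
```

The number of features is unaffected by either step, since both samplers add or remove rows only.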
(l) (4 marks) The plot below displays the variation of four metrics – sensitivity, specificity, balanced accuracy, and queue rate – of a model on a test set when the decision threshold is adjusted. Match the names of these metrics to Metrics 1 through 4.
[Figure: line plot titled "Metrics vs. Decision Threshold". x-axis: Decision Threshold (0.0 to 1.0); y-axis: Metric Value (0.0 to 1.0); legend: Metric 1, Metric 2, Metric 3, Metric 4.]
Answer:
• Metric 1:
• Metric 2:
• Metric 3:
• Metric 4:
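The four metrics in question (l) can be computed for any threshold from true labels and predicted scores. The sketch below is a minimal illustration on made-up data. It takes queue rate to mean the fraction of samples predicted positive, and the function name `threshold_metrics` is hypothetical.

```python
def threshold_metrics(y_true, scores, threshold):
    """Sensitivity, specificity, balanced accuracy and queue rate when
    samples with score >= threshold are predicted positive."""
    pred = [s >= threshold for s in scores]
    tp = sum(p and t == 1 for p, t in zip(pred, y_true))
    tn = sum((not p) and t == 0 for p, t in zip(pred, y_true))
    n_pos = sum(t == 1 for t in y_true)
    n_neg = len(y_true) - n_pos
    sensitivity = tp / n_pos
    specificity = tn / n_neg
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "queue_rate": sum(pred) / len(pred),
    }


# Made-up labels and scores, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9]
for t in (0.0, 0.5, 1.0):
    print(t, threshold_metrics(y_true, scores, t))
```

Comparing the values at the two extreme thresholds shows which metrics start high and fall, which start low and rise, and which peak in between.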
* * * * * * * * * * * * * * *