Machine Learning - Advanced MCQs
Understanding the foundations of machine learning requires a strong grasp of how different models learn from data, make predictions, and generalize. This collection of MCQs covers essential concepts such as generative vs discriminative classification, k-NN behavior, MAP vs MLE estimation, boosting dynamics, kernel methods, and decision tree depth—topics frequently asked in exams, interviews, and university courses.
These questions are designed to strengthen conceptual clarity and test real-world intuition about model assumptions, probability distributions, density estimation, and decision boundaries.
Whether you are preparing for GATE, UGC NET, university assessments, data science interviews, or machine learning certifications, this curated set will help you quickly revise key principles and identify common pitfalls in ML theory.
How does a generative model perform classification?
A. Minimizing the empirical risk
B. Evaluating P(Y | X) using Bayes’ rule
C. Maximizing the margin between classes
D. Fitting a logistic function to the data
Explanation:
Generative models estimate P(X∣Y) and P(Y). Classification is performed using Bayes’ rule:
P(Y∣X) ∝ P(X∣Y)P(Y).
Generative models are a class of machine learning models that learn the underlying data distribution and can generate new data samples similar to those seen during training.
A generative model learns the joint probability distribution P(X, Y), or just P(X). This means the model tries to understand how the data is produced, not just how to classify it.
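To make this concrete, here is a minimal NumPy sketch of generative classification on hypothetical one-dimensional data: estimate P(Y) and a Gaussian P(X|Y) for each class, then label a new point with the class that maximizes P(X|Y)P(Y).

```python
import numpy as np

# Toy 1-D training data for two classes (hypothetical values).
X = np.array([1.0, 1.2, 0.8, 3.0, 3.3, 2.9])
y = np.array([0, 0, 0, 1, 1, 1])

def gaussian_pdf(x, mean, var):
    # Density of a univariate Gaussian N(mean, var) evaluated at x.
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Generative step: estimate P(Y) and the class-conditional P(X | Y).
priors, means, variances = {}, {}, {}
for c in np.unique(y):
    Xc = X[y == c]
    priors[c] = len(Xc) / len(X)       # P(Y = c)
    means[c] = Xc.mean()               # mean of P(X | Y = c)
    variances[c] = Xc.var() + 1e-6     # variance (small floor for stability)

def classify(x_new):
    # Bayes' rule: P(Y = c | x) is proportional to P(x | Y = c) * P(Y = c).
    scores = {c: gaussian_pdf(x_new, means[c], variances[c]) * priors[c]
              for c in priors}
    return max(scores, key=scores.get)

print(classify(1.1))  # expected: 0
print(classify(3.1))  # expected: 1
```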
Under what condition do Logistic Regression and Gaussian Naive Bayes produce identical decision boundaries?
A. When the covariances of all classes are identity and equal
B. When logistic regression is regularized
C. When data is linearly separable
D. When priors are uniform only
Explanation:
GNB with shared identity covariance produces a linear discriminant identical to logistic regression’s functional form.
Understanding the Question
This question asks under which specific condition Logistic Regression (LR) and Gaussian Naive Bayes (GNB) classifiers produce identical decision boundaries. The key is understanding the mathematical relationship between these two seemingly different algorithms.
Both models, Logistic Regression (LR) and Gaussian Naïve Bayes (GNB), normally produce different decision boundaries because:
- LR is discriminative → models P(Y∣X). That is, Logistic Regression directly models the conditional probability P(Y∣X) using the logistic function.
- GNB is generative → models P(X∣Y). That is, Gaussian Naive Bayes is a generative classifier that models the joint probability P(X, Y) by estimating P(Y) and P(X∣Y).
But under a special condition, they produce identical linear decision boundaries. That special condition is: When the covariances of all classes are identity and equal.
When GNB has identity covariance (no correlation between features, each feature with variance = 1, and the same covariance for every class), Gaussian Naïve Bayes's decision boundary has the same mathematical form as Logistic Regression.
Both models produce a boundary of the form w⊤x + b = 0: the same functional form, and therefore the same separating hyperplane.
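The following NumPy sketch illustrates the equivalence under that assumption (the class means and priors below are made up for illustration): with shared identity covariance, the GNB posterior collapses to a sigmoid of a linear score w⊤x + b, the same functional form Logistic Regression fits directly.

```python
import numpy as np

# Hypothetical class means and priors; shared covariance is the identity.
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
pi0, pi1 = 0.5, 0.5

# For shared identity covariance, the GNB log-odds are linear in x:
#   log P(Y=1|x)/P(Y=0|x) = w.x + b
w = mu1 - mu0
b = 0.5 * (mu0 @ mu0 - mu1 @ mu1) + np.log(pi1 / pi0)

def gnb_posterior(x):
    # Posterior via Bayes' rule with unit-variance Gaussians.
    p1 = pi1 * np.exp(-0.5 * np.sum((x - mu1) ** 2))
    p0 = pi0 * np.exp(-0.5 * np.sum((x - mu0) ** 2))
    return p1 / (p0 + p1)

def sigmoid_of_linear(x):
    # Logistic-regression-style form: sigma(w.x + b).
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

x = np.array([1.3, 0.4])
print(gnb_posterior(x))       # both prints agree, showing the
print(sigmoid_of_linear(x))   # shared linear functional form
```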
Why is the training error of a 1-NN classifier always zero?
A. Because 1-NN memorizes the class means
B. Because each training point is its own nearest neighbor
C. Because 1-NN uses leave-one-out validation
D. Because 1-NN normalizes distances
Explanation:
Each training sample is its own closest neighbor, so 1-NN always predicts correctly on training data.
What is 1-NN?
1-NN means 1-Nearest Neighbor, which is the simplest form of the k-Nearest Neighbors (k-NN) algorithm. A 1-NN classifier assigns the class of a new point based on the single closest training point in the dataset.
Why is the training error of 1-NN always zero?
In 1-Nearest Neighbor classification, when predicting the label of a data point, the algorithm finds the closest point in the training set. But if you test 1-NN on the same training data, then every training point's nearest neighbor is itself (distance = 0). So the classifier simply returns its own label, which is always correct.
Thus: Training Error = 0, because no point is misclassified when it is compared with itself.
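A quick scikit-learn sketch of this fact (the tiny training set is hypothetical and contains no duplicate points with conflicting labels): scoring a 1-NN classifier on its own training data yields accuracy 1.0, i.e. zero training error.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data; no duplicate points with conflicting labels.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Each training point is its own nearest neighbor (distance 0),
# so training accuracy is 1.0 and training error is 0.
print(knn.score(X_train, y_train))  # 1.0
```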
Which prior prevents the MAP estimate from converging to the MLE, even with infinite training data?
A. Gaussian prior
B. Beta prior
C. A prior that assigns probability 1 to a single parameter value
D. Uniform prior
Explanation:
A degenerate (a.k.a. point-mass or delta) prior forces the parameter to a fixed value regardless of data, so MAP ≠ MLE even with infinite samples.
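A small sketch, assuming we estimate a Gaussian mean from hypothetical data: with a point-mass prior at theta0, the MAP estimate never moves from theta0 no matter how many samples arrive, while the MLE (the sample mean) converges to the true value.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 3.0
theta0 = -1.0  # the single value the degenerate (point-mass) prior allows

for n in [10, 1_000, 100_000]:
    data = rng.normal(loc=true_mean, scale=1.0, size=n)

    mle = data.mean()  # MLE of a Gaussian mean: the sample average

    # A point-mass prior puts probability 1 on theta0 and 0 elsewhere,
    # so the posterior is also a point mass at theta0: MAP == theta0 always.
    map_estimate = theta0

    print(f"n={n:>7}  MLE={mle:.3f}  MAP={map_estimate:.3f}")
# The MLE converges to 3.0, but the MAP never moves from -1.0.
```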
Why is cross-validation needed to choose the number of boosting rounds?
A. Boosting has no natural stopping point
B. Boosting inherently underfits
C. Boosting does not use loss functions
D. Boosting requires validation to update weights
Explanation:
Boosting can overfit if allowed to run indefinitely. CV selects the optimal number of rounds.
Boosting keeps improving training accuracy indefinitely and can easily overfit, so cross-validation is needed to decide how many boosting steps to perform.
What is boosting?
Boosting is a family of ensemble learning techniques that turn a collection of weak learners (models that are only slightly better than random guessing) into a single strong learner with high predictive accuracy. The core idea is simple: train models sequentially, each one focusing on the mistakes made by the previous ones, and then combine their predictions (usually by a weighted vote or sum). By doing this, the ensemble corrects its own errors over time and ends up far more powerful than any individual component.
What is cross-validation?
Cross-validation is a fundamental resampling technique used to evaluate machine learning models' ability to generalize to unseen data while preventing overfitting. It works by systematically partitioning the dataset into multiple subsets (called folds), training models on some subsets, and testing on others, with this process repeated multiple times to obtain a reliable performance estimate.
Why does boosting need cross-validation?
Boosting algorithms (like AdaBoost, Gradient Boosting, XGBoost, etc.) build models sequentially, adding weak learners (usually decision stumps/trees) one at a time.
Unlike many other models:
- There is no built-in rule that tells you when to stop adding more learners.
- If you keep boosting longer, the model can overfit heavily.
This is why libraries like XGBoost include a parameter like early_stopping_rounds, which depends on a validation set.
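Here is one way this selection could look in practice, sketched with scikit-learn's GradientBoostingClassifier on a hypothetical noisy dataset: cross-validate a few candidate numbers of boosting rounds and keep the best.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical noisy dataset.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=0)

# Cross-validate several candidate numbers of boosting rounds.
for n_rounds in [10, 50, 100, 300]:
    model = GradientBoostingClassifier(n_estimators=n_rounds, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{n_rounds:>3} rounds: mean CV accuracy = {scores.mean():.3f}")

# Pick the round count with the best cross-validated score instead of
# letting boosting run indefinitely (which can overfit the training set).
```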
What is the key difference between kernel density estimation (KDE) and kernel regression?
A. KDE estimates a probability density; kernel regression estimates a function value
B. KDE uses only Gaussian kernels
C. Kernel regression cannot use kernels
D. KDE requires class labels
Explanation:
KDE estimates P(X), while kernel regression estimates the functional relationship ŷ(x) via weighted averages.
Differences between KDE and Kernel regression
What each method estimates/answers:
- KDE answers "what is the probability density?" (it answers, 'how are the data distributed?')
- Kernel regression answers "what is the function value or conditional expectation?" (it answers, 'Given X, what is Y?')
- KDE uses kernels to smooth the estimated probability distribution.
- Kernel regression uses kernels to perform weighted local averaging to estimate a conditional relationship between variables.
- Use Kernel Density Estimation when you want to understand how the data is distributed, especially when you do NOT assume the distribution is normal. Example: Estimate the density of customer ages
- Use kernel regression when you want to predict Y from X in a non-parametric, smooth way.
- KDE is unsupervised (it needs only the inputs X).
- Kernel regression is supervised (it needs labeled (X, Y) pairs); see the sketch after this list.
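Below is a plain-NumPy sketch (Gaussian kernel, hand-picked bandwidths, hypothetical data): the KDE returns a density over x, while the Nadaraya-Watson kernel regression returns a predicted y for a given x.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)                  # unlabeled sample for KDE
x_pairs = np.linspace(0, 10, 200)         # inputs for regression
y_pairs = np.sin(x_pairs) + rng.normal(scale=0.3, size=200)  # noisy targets

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(query, data, bandwidth=0.4):
    # Kernel density estimation: average of kernels centered on the data.
    # Answers "how dense is the data near `query`?" (unsupervised, no labels).
    return gaussian_kernel((query - data) / bandwidth).mean() / bandwidth

def kernel_regression(query, xs, ys, bandwidth=0.5):
    # Nadaraya-Watson estimator: kernel-weighted average of the labels.
    # Answers "what is E[Y | X = query]?" (supervised, needs (x, y) pairs).
    weights = gaussian_kernel((query - xs) / bandwidth)
    return np.sum(weights * ys) / np.sum(weights)

print(kde(0.0, x))                                     # estimated density at 0
print(kernel_regression(np.pi / 2, x_pairs, y_pairs))  # roughly sin(pi/2) = 1
```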
How does boosting affect the complexity of the final decision boundary?
A. Identical to that of each weak learner
B. A weighted combination that can be more complex
C. Always linear, regardless of weak learner type
D. Equivalent to a decision tree of depth 1
Explanation:
Boosting aggregates many weak rules, often resulting in highly nonlinear decision boundaries.
How does boosting affect the complexity of the final decision boundary?
Boosting (e.g., AdaBoost, Gradient Boosting) works by combining many weak learners, typically simple classifiers like decision stumps (depth-1 trees). Each weak learner itself has a simple decision boundary.
But boosting does not just average them; it takes a weighted combination based on each learner’s accuracy. Adding many simple boundaries creates a final decision boundary that can be very complex, often highly nonlinear.
This happens because each new weak learner focuses on misclassified points from previous learners, gradually bending the overall decision surface.
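A short scikit-learn sketch of this effect (the two-moons dataset is just an illustration): a single depth-1 stump draws one axis-aligned boundary, while AdaBoost's weighted combination of 200 stumps fits the nonlinear class structure far better.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Nonlinearly separable toy data.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

# A single depth-1 stump: one axis-aligned split, a very simple boundary.
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print("single stump accuracy:", stump.score(X, y))

# AdaBoost: a weighted combination of many stumps.
boosted = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X, y)
print("boosted stumps accuracy:", boosted.score(X, y))
# The boosted ensemble's decision boundary is far more complex (nonlinear)
# than any individual stump's single threshold.
```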
When can a decision tree's depth exceed the number of training samples n?
A. When each feature is continuous
B. When many features repeat but labels differ
C. When the impurity measure is entropy
D. When pruning is disabled
Explanation:
If identical feature vectors map to conflicting labels, the tree keeps splitting and can exceed depth n.
Why can a decision tree have depth greater than the number of training samples?
Because depth counts the number of splits along a path, not the number of unique samples or unique feature values. Even if features repeat, the tree keeps splitting as long as it can reduce impurity—possibly creating long chains of binary splits, each separating a subset of samples, even if they have identical feature values.
Why does this happen with repeated features?
When features repeat across multiple samples:
- The tree must use the same features repeatedly to separate conflicting labels.
- Each split on a feature that has been previously split becomes less efficient at separating classes.
- The tree exhibits overfitting behavior, attempting to memorize individual samples rather than learn generalizable patterns.
- If samples are identical in their selected features but have different labels, the tree becomes unable to achieve purity through feature thresholds alone.
Decision trees try to make leaves pure. If purity is impossible, depth grows uncontrollably. This is why real systems use max_depth, min_samples_split, and min_samples_leaf: to avoid pathologically overfit trees.
Example:
When the feature values are repeated (e.g. many rows have x = 5) but the labels differ, the tree may keep trying thresholds that slice right at the repeated value. If the algorithm does not enforce a “strictly decreasing impurity” condition, it could accept a split that leaves the dataset unchanged on one side.
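As a rough scikit-learn sketch of why those limits matter (the noisy dataset is hypothetical): an unconstrained tree keeps splitting in pursuit of pure leaves and grows very deep, while max_depth and min_samples_leaf keep it shallow.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data with heavy label noise, so perfect purity is hard to reach.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, flip_y=0.3, random_state=0)

# Unconstrained tree: keeps splitting in pursuit of pure leaves.
deep = DecisionTreeClassifier(random_state=0).fit(X, y)
print("unconstrained depth:", deep.get_depth())

# Regularized tree: depth and leaf-size limits stop pathological growth.
shallow = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10,
                                 random_state=0).fit(X, y)
print("regularized depth:  ", shallow.get_depth())
```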
Why does increasing k in a k-NN classifier often improve test performance?
A. Larger k reduces sensitivity to noise by averaging over more neighbors
B. Larger k forces the classifier to become linear
C. Larger k always guarantees zero training error
D. Larger k makes the classifier equivalent to a decision tree
Explanation:
When k increases, the prediction is based on a majority vote over a larger set of neighbors, which reduces the influence of mislabeled or noisy points. This typically improves generalization by lowering variance, although extremely large k can lead to underfitting.
Larger k reduces sensitivity to noise by averaging over more neighbors.
- Averaging = majority vote – By looking at several nearby points instead of just one, the classifier “averages” their labels. If a few of those neighbours are mislabeled (or are outliers), they are unlikely to dominate the vote.
- Noise reduction – Random fluctuations in the training labels act like noise. Majority voting behaves like a low‑pass filter: it suppresses high‑frequency (noisy) variations while preserving the underlying signal.
- Result on test error – Lower variance ⇒ the learned decision surface is more stable on unseen data, so test error typically goes down (up to a point; if k becomes too large, bias dominates and performance can deteriorate).
Thus, averaging over more neighbours mitigates the effect of noisy or atypical training points, which is why test performance usually improves.
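A brief scikit-learn sketch (the dataset and the 15% of flipped training labels are hypothetical): k = 1 memorizes the noisy training set, while a larger k averages the vote over more neighbours and usually scores better on the clean test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Inject 15% label noise into the training set only.
flip = rng.random(len(y_train)) < 0.15
y_train_noisy = np.where(flip, 1 - y_train, y_train)

for k in [1, 15]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train_noisy)
    print(f"k={k:>2}  train acc={knn.score(X_train, y_train_noisy):.3f}  "
          f"test acc={knn.score(X_test, y_test):.3f}")
# Typically: k=1 has perfect training accuracy but lower test accuracy;
# k=15 averages over more neighbors and is less hurt by the noisy labels.
```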
Which statement correctly distinguishes discriminative and generative models?
A. Generative models always achieve lower error
B. Discriminative models directly model P(Y∣X)
C. Generative models require fewer assumptions
D. Discriminative models estimate P(X∣Y)
Explanation:
Discriminative models learn P(Y∣X) or direct decision boundaries. Generative models learn P(X,Y).
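A tiny scikit-learn comparison on hypothetical data: LogisticRegression fits P(Y∣X) directly, while GaussianNB estimates P(Y) and P(X∣Y) and recovers P(Y∣X) through Bayes' rule; both expose class probabilities, but they arrive at them differently.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Discriminative: models P(Y | X) directly.
lr = LogisticRegression().fit(X, y)

# Generative: models P(Y) and P(X | Y), then applies Bayes' rule.
gnb = GaussianNB().fit(X, y)

print("GNB class priors P(Y):", gnb.class_prior_)   # learned explicitly
print("LR  P(Y|X) for x0:    ", lr.predict_proba(X[:1]))
print("GNB P(Y|X) for x0:    ", gnb.predict_proba(X[:1]))
```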
