Interview Questions for Data Science

Chinmayi Sahu
Mar 9, 2021

What is the Central Limit Theorem and why is it important?

Given a sufficiently large sample size, the sampling distribution of the mean of a variable will approximate a normal distribution, regardless of that variable’s distribution in the population. Importance: this theorem lets us simplify many problems in statistics by working with a distribution that is approximately normal, for example when building confidence intervals and hypothesis tests around sample means.
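
A minimal simulation of this (assuming NumPy is available): means of samples drawn from a clearly skewed exponential population end up looking approximately normal, with spread close to sigma/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # clearly non-normal population

# Distribution of the mean over many samples of size 50
sample_means = np.array([rng.choice(population, size=50).mean() for _ in range(5_000)])

print(population.mean(), sample_means.mean())              # both close to 2.0
print(sample_means.std(), population.std() / np.sqrt(50))  # spread ~ sigma / sqrt(n)
```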

What is sampling? How many sampling methods do you know?

Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.

Probabilistic data sampling: Random Sampling, Stratified Sampling, Cluster Sampling, Multistage Sampling (an advanced clustering technique), Systematic Sampling (sampling at a fixed interval, e.g. every 10th row of data)

Non-probability data sampling: Convenience (data from an easily accessible and available group), Consecutive (every subject that meets the criteria until the predetermined sample size is met), Purposive or judgmental (based on predefined criteria), Quota (equal representation within the sample for all subgroups)
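
A rough sketch of three of the probabilistic schemes on a toy pandas DataFrame (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"value": range(100), "group": ["A"] * 60 + ["B"] * 40})

random_sample = df.sample(n=10, random_state=0)  # simple random sampling
systematic_sample = df.iloc[::10]                # systematic: every 10th row
stratified_sample = (                            # stratified: sample within each group
    df.groupby("group", group_keys=False)
      .apply(lambda g: g.sample(frac=0.1, random_state=0))
)
```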

What is the difference between Type I and Type II errors?

A type I error occurs when the null hypothesis is true but is rejected. A type II error occurs when the null hypothesis is false but fails to be rejected.

What is the p-value?

When you perform a hypothesis test in statistics, the p-value helps you judge the strength of the evidence against the null hypothesis. It is the smallest significance level (alpha) at which the observed result would still lead you to reject the null. The lower the p-value, the stronger the evidence in favor of the alternative hypothesis (and, in regression, the more confident we are that the coefficient matters for prediction). If the p-value is less than the significance level (commonly 0.05), we reject the null hypothesis.
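
For instance (assuming SciPy is available), a one-sample t-test returns a p-value that we compare against the chosen significance level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=10.5, scale=2.0, size=40)  # H0: the population mean is 10

t_stat, p_value = stats.ttest_1samp(sample, popmean=10)
alpha = 0.05
print(p_value, "reject H0" if p_value < alpha else "fail to reject H0")
```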

What is a coefficient?

The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant.

What is R squared (R2) and Adjusted R squared?

R-squared measures the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) for a linear regression model. If you keep on adding new variables, the R square value will stay the same or increase irrespective of the variable significance.

Adjusted R-squared penalizes the addition of variables that do not improve the model: it only increases when a new variable improves the fit by more than would be expected by chance.

Adjusted R² = 1 − (1 − R²)(N − 1) / (N − p − 1), where N is the number of samples and p is the number of variables/features.
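
A small sketch (assuming scikit-learn) that computes both quantities on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))

n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalizes extra features
print(r2, adj_r2)
```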

What are the assumptions required for linear regression?

  • There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data.
  • The errors or residuals are normally distributed and independent of each other.
  • There is minimal multicollinearity between the explanatory variables.
  • Homoscedasticity: the variance around the regression line is the same for all values of the predictor variables.

What is a statistical interaction?

Interaction effects occur when the effect of one variable depends on the value of another variable.

What Are the Types of Biases That Can Occur During Sampling?

  • Selection bias
  • Under coverage bias
  • Survivorship bias

What is Survivorship Bias?

It is the logical error of focusing on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. For example, during a recession you look only at the businesses that survived and note that they are performing poorly; however, they still performed better than the businesses that failed and were therefore removed from the time series.

What is under coverage bias?

Undercoverage bias occurs when some members of the population are inadequately represented in the sample.

What is Selection bias?

Selection bias occurs when data that is gathered and prepared for modeling has characteristics that are not representative of the true, future population.

The types of selection bias include:

  • Sampling bias: Due to a non-random sample of a population.
  • Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
  • Data: When specific subsets of data are chosen to support a conclusion or rejection of bad data on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
  • Attrition: Kind of selection bias caused by attrition (loss of participants) discounting trial subjects/tests that did not run to completion.

What is an example of a data set with a non-Gaussian distribution?

Binomial: multiple tosses of a coin, Bin(n, p)

Bernoulli: Bin(1, p) = Be(p)

Poisson: Pois(lambda)

Power law: e.g. the frequency of English words
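
With NumPy, samples from these (and similar heavy-tailed) distributions can be drawn directly:

```python
import numpy as np

rng = np.random.default_rng(0)
binomial = rng.binomial(n=10, p=0.5, size=1_000)   # Bin(n, p): 10 coin tosses
bernoulli = rng.binomial(n=1, p=0.3, size=1_000)   # Be(p) = Bin(1, p)
poisson = rng.poisson(lam=4.0, size=1_000)         # Pois(lambda)
power_law = rng.pareto(a=2.0, size=1_000)          # Pareto: heavy, power-law tail
```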

List the differences between supervised and unsupervised learning.

Supervised: Input data is labeled, Split in training/validation/test, Used for prediction, Classification, and Regression

Unsupervised: Input data is unlabeled, No split, Used for analysis, Clustering, dimension reduction, and density estimation

What is the bias-variance trade-off?

Bias: Bias is an error introduced in the model due to the oversimplification of the algorithm. It can lead to underfitting.

Low bias ML algorithms: Decision Trees (depth), k-NN (k), and SVM (C)

High bias ML algorithms: Linear Regression, Logistic Regression (no. of features)

Variance: Variance is the error introduced in the model due to an overly complex algorithm. A small change in the data makes the model perform poorly. It can lead to overfitting.

The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.

What is a confusion matrix?

It is a matrix with a combination of the actual and predicted values.

TPR = Recall = Sensitivity, TNR = Specificity
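
A toy example (assuming scikit-learn), with the rates above read off the matrix entries:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)       # TPR = sensitivity
specificity = tn / (tn + fp)  # TNR
print(recall, specificity)
```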

What is the difference between “long” and “wide” format data?

Wide: each subject is a single row, with repeated measurements spread across separate columns (the layout we usually see in a pandas DataFrame).

Long: each row is one time point per subject, so a subject appears in multiple rows.
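
An illustrative reshape with pandas (the column names are made up):

```python
import pandas as pd

wide = pd.DataFrame({
    "subject": ["s1", "s2"],
    "t1": [4.0, 5.1],
    "t2": [4.3, 5.4],
})

# wide -> long: one row per subject per time point
long = wide.melt(id_vars="subject", var_name="time", value_name="score")

# long -> wide again
back_to_wide = long.pivot(index="subject", columns="time", values="score")
```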

What do you understand by the term Normal Distribution?

Data is distributed around a central value without any bias to the left or right, forming a symmetrical, bell-shaped curve.

Properties:

  1. Unimodal (Only one mode)
  2. Symmetrical (left and right halves are mirror images)
  3. Bell-shaped (maximum height (mode) at the mean)
  4. Mean, Mode, and Median are all located in the center
  5. Asymptotic (the tails approach the horizontal axis but never touch it)

What is correlation?

The technique for measuring and estimating the quantitative relationship between two variables. Correlation measures how strongly two variables are related. The value lies between -1 and 1.

What is covariance?

Covariance is a measure of the extent to which two random variables change together (vary in tandem). Unlike correlation, its value is not standardized to [-1, 1] and depends on the units of the variables.
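
A quick NumPy check of the relationship between the two: correlation is just the covariance rescaled by the standard deviations, which is why it stays in [-1, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 3 * x + rng.normal(scale=1.0, size=500)

cov_xy = np.cov(x, y)[0, 1]
corr_xy = np.corrcoef(x, y)[0, 1]
print(cov_xy, corr_xy, cov_xy / (x.std(ddof=1) * y.std(ddof=1)))  # last two match
```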

What is the difference between Point Estimates and Confidence Interval?

Point Estimation gives us a particular value as an estimate of a population parameter.

A confidence interval gives us a range of values that is likely to contain the population parameter.

What is the goal of A/B Testing?

A/B testing is a statistical hypothesis test for a randomized experiment with two variants, A and B. The goal is to check whether a newly created or modified version (model, page, feature) performs better than the existing one.

How can you generate a random number between 1–7 with only a die?

Roll the die twice. Two rolls give 36 equally likely outcomes; assign 5 outcomes to each of the numbers 1 through 7 (35 outcomes in total) and re-roll both dice if the remaining 36th outcome comes up.
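
A sketch of that rejection-sampling idea in Python:

```python
import random

def rand7_from_die():
    """Return a uniform integer in 1..7 using only rolls of a fair six-sided die."""
    while True:
        # Two rolls encode a number in 0..35, all equally likely.
        outcome = (random.randint(1, 6) - 1) * 6 + (random.randint(1, 6) - 1)
        if outcome < 35:          # keep 35 outcomes (5 per target value)...
            return outcome % 7 + 1
        # ...and re-roll on the 36th outcome.
```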

Why is resampling done?

Resampling is a methodology of economically using a data sample to improve the accuracy and quantify the uncertainty of estimates of a population parameter.

Two commonly used resampling methods:

  • Bootstrap. Samples are drawn from the dataset with replacement (allowing the same observation to appear more than once in the sample); the instances not drawn into a given sample may be used as a test set.
  • k-fold Cross-Validation. The dataset is split into k groups, and each group in turn is held out for evaluation.

Resampling is done in any of these cases:

  • Estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points
  • Substituting labels on data points when performing significance tests
  • Validating models by using random subsets (bootstrapping, cross-validation)
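
A minimal bootstrap sketch with NumPy, estimating a 95% confidence interval for the sample mean by resampling with replacement:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)

# Resample the data with replacement many times and record the statistic of interest
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2_000)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(data.mean(), (ci_low, ci_high))
```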

What are the differences between over-fitting and under-fitting?

In overfitting, a statistical model describes random error or noise instead of the underlying relationship. The model is excessively complex, such as having too many parameters relative to the data, and gives poor predictive performance on unseen data.

Underfitting occurs when an algorithm cannot capture the underlying trend of the data, for example when fitting a linear model to non-linear data. It also results in poor predictive performance.

How to combat Overfitting and Underfitting?

Combat Overfitting: Add noise, Feature selection, increase training set, Regularization, Use cross-validation techniques such as k folds cross-validation, Boosting/bagging, Dropout, Early stopping, Remove inner layers

Combat Underfitting: Add features, Increase training time.

What are generalization and regularization?

Generalization refers to your model’s ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model. To make the model generalize we use regularization.

What is regularization? Why is it useful?

Regularization is the process of adding a tuning parameter (penalty term) to a model to induce smoothness in order to prevent overfitting. The model predictions should then minimize the loss function calculated on the regularized training set.
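
As an illustration (assuming scikit-learn), Ridge regression adds an L2 penalty whose strength is set by the tuning parameter alpha, shrinking the coefficients relative to plain OLS:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))                 # many features, only one carries signal
y = X[:, 0] + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)           # alpha controls the penalty strength
print(np.abs(ols.coef_).sum(), np.abs(ridge.coef_).sum())  # penalty shrinks the weights
```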

What Is the Law of Large Numbers?

The average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.

What Are Confounding Variables?

A confounder is a variable that influences both the dependent variable and independent variable.

If you are researching whether a lack of exercise leads to weight gain:

lack of exercise = independent variable

weight gain = dependent variable

A confounding variable here would be any other variable that affects both of these variables, such as the age of the subject.

What are hard voting (majority voting) and soft voting in the ensemble?

  • Hard Voting. Predict the class with the largest sum of votes (predicted classes) from the models.
  • Soft Voting. Predict the class with the largest summed predicted probability from the models (see the sketch below).
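
A toy sketch with scikit-learn's VotingClassifier on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
estimators = [("lr", LogisticRegression(max_iter=1000)),
              ("dt", DecisionTreeClassifier(max_depth=3)),
              ("nb", GaussianNB())]

hard = VotingClassifier(estimators, voting="hard").fit(X, y)  # majority of class votes
soft = VotingClassifier(estimators, voting="soft").fit(X, y)  # average of predicted probabilities
```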

Explain how a ROC curve works.

The ROC curve is a graphical representation of the contrast between the true positive rate (sensitivity) and the false positive rate (1 − specificity) at various classification thresholds.
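
For example (assuming scikit-learn), the curve's points come from sweeping the decision threshold over predicted probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)  # one (FPR, TPR) pair per threshold
print(roc_auc_score(y_te, scores))              # area under the ROC curve
```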

What is TF/IDF vectorization?

TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
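
A minimal example with scikit-learn's TfidfVectorizer on a tiny made-up corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)      # (documents x vocabulary) sparse matrix
print(vectorizer.get_feature_names_out())
```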

Why do we generally use the softmax (or sigmoid) non-linearity as the last operation in a network? Why ReLU in the inner layers?

Softmax (or sigmoid): because it takes in a vector of real numbers and returns a probability distribution. The output is a valid probability distribution: each element is non-negative and the sum over all components is 1.

ReLU: avoids the vanishing gradient issue.
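
A numerically stable softmax sketch in NumPy, showing that the outputs are non-negative and sum to 1:

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())  # non-negative values that sum to 1
```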

What is the use of the activation function?

An activation function is added to an artificial neural network to help it learn complex, non-linear patterns in the data. It determines what is passed (fired) on to the next neuron.

Why do we use an optimizer?

During the training process, we tweak and change the parameters (weights) of our model to try to minimize the loss function and make our predictions as correct as possible. The optimizer ties together the loss function and the model parameters by updating the parameters in response to the output of the loss function.

What is the difference between univariate, bivariate, and multivariate analysis?

Univariate analysis involves a single variable at a time and is mostly descriptive, as in a boxplot.

Bivariate analysis attempts to understand the relationship between two variables at a time, as in a scatterplot.

Multivariate analysis studies more than two variables to understand their combined effect on the response, as in a pair plot.

What are Eigenvectors and Eigenvalues?

Eigenvectors are used for understanding linear transformations. We usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching. Eigenvalue can be referred to as the strength of the transformation in the direction of the eigenvector or the factor by which the compression occurs.
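
A short NumPy sketch: the eigenvectors of a covariance matrix give the directions of greatest variance, which is essentially what PCA computes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])  # correlated features

cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: for symmetric matrices

print(eigenvalues)          # variance captured along each eigenvector
print(eigenvectors[:, -1])  # direction of the largest variance
```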

Can you cite some examples where a false positive is more important than a false negative?

Let’s say an e-commerce company decides to give a $1,000 gift voucher to the customers it predicts will purchase at least $10,000 worth of items, expecting to make at least a 20% profit on those purchases. The vouchers are mailed directly to 100 such customers without any minimum-purchase condition. The problem is the false positives: the company ends up sending $1,000 gift vouchers to customers who are flagged as likely to spend $10,000 but never actually purchase anything.

Can you cite some examples where a false negative is more important than a false positive?

Example 1: What if a jury or judge lets a guilty person go free?

Example 2 Fraud detection.

Example 3 Cancer detection

Can you cite some examples where both false positive and false negatives are equally important?

Banks don’t want to lose good customers and at the same point in time, they don’t want to acquire bad customers.

Medical domain

Can you explain the difference between a Validation Set and a Test Set?

Training: Fit the parameters i.e. weights

Validation: a part of the training data held out for hyperparameter and model selection, and to detect overfitting.

Test: For testing or evaluating the performance of a trained machine learning model, i.e. evaluating the predictive power and generalization.

Explain cross-validation

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into.

Steps:

  1. Shuffle the dataset randomly (optionally with stratification).
  2. Split the dataset into k groups.
  3. For each unique group:
     a. Take the group as a holdout or test data set.
     b. Take the remaining groups as a training data set.
     c. Fit a model on the training set and evaluate it on the test set.
     d. Retain the evaluation score and discard the model.
  4. Summarize the skill of the model using the sample of evaluation scores (see the sketch below).
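
These steps map directly onto scikit-learn's KFold and cross_val_score (a sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # shuffle, then split into k groups

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores, scores.mean())  # one score per fold, then the summary
```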

Why do Transformer models perform well compared to RNNs/LSTMs?

Transformers avoid recurrence entirely: they process the whole sequence at once through self-attention and learn relationships between words directly during training. This allows training to be parallelized and makes it easier to capture long-range dependencies, which is difficult for RNNs/LSTMs that process tokens sequentially.
