The Data Analyst Interview Guide

This guide was put together by Diana Cherny. You can find her on Twitter and LinkedIn.

🔎 Overview

Over the last few years, many people have asked me how to break into Data Analytics. I usually respond by suggesting many of the links below, so I wanted to create a guide to consolidate and share all of the resources that have been helpful for me in the past. Whether you're an aspiring Data, Product, or Business Analyst, I hope this guide will be useful for you.

If you have any feedback or questions, please connect with me!😀

🔤 SQL Study Guide

Learning SQL

Mode Analytics
SQL Zoo
W3 Schools
SQL Joins

Beginner Interview Questions

👉🏼 Focus on: Select statements, where clauses, joins, grouping, unions, aggregations
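
To make these beginner topics concrete, here is a small sketch that runs plain SQL through Python's built-in `sqlite3` module. The `users` and `orders` tables and all of their values are hypothetical, invented only for illustration. Interview platforms may use a different SQL dialect, so treat the syntax as a starting point.

```python
import sqlite3

# Hypothetical tables and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 'US'), (2, 'US'), (3, 'CA');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 30.0), (12, 2, 5.0), (13, 3, 50.0);
""")

# SELECT + JOIN + WHERE + GROUP BY + aggregation in a single query:
# orders of at least 10, counted and summed per country.
summary_rows = conn.execute("""
    SELECT u.country,
           COUNT(*)      AS num_orders,
           SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN users  AS u ON u.user_id = o.user_id
    WHERE o.amount >= 10
    GROUP BY u.country
    ORDER BY u.country;
""").fetchall()

# UNION stacks two result sets and removes duplicate rows.
union_rows = conn.execute("""
    SELECT country FROM users WHERE user_id = 1
    UNION
    SELECT country FROM users WHERE user_id = 3
    ORDER BY country;
""").fetchall()

print(summary_rows)  # [('CA', 1, 50.0), ('US', 2, 50.0)]
print(union_rows)    # [('CA',), ('US',)]
```
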

Intermediate/Advanced Interview Questions

👉🏼 Focus on: Cases, window functions, advanced counting, self-joins, sub-queries
- The Best Medium - Hard SQL Interview Questions
- Window Functions to help you pass a SQL Interview
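
As a rough illustration of the intermediate topics above, the sketch below runs a window function, a self-join, and a `CASE` expression with a sub-query through Python's `sqlite3` module. The `employees` table and its rows are hypothetical. Window functions need SQLite 3.25 or newer, which is an assumption about your Python build.

```python
import sqlite3

# Hypothetical employees table, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT,
        salary INTEGER, manager_id INTEGER
    );
    INSERT INTO employees VALUES
        (1, 'Ana',  'Sales',  90, NULL),
        (2, 'Ben',  'Sales',  70, 1),
        (3, 'Cara', 'Eng',   120, NULL),
        (4, 'Dev',  'Eng',   100, 3);
""")

# Window function: rank salaries within each department.
ranked = conn.execute("""
    SELECT name, dept,
           RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dept_rank
    FROM employees
    ORDER BY dept, dept_rank;
""").fetchall()

# Self-join: pair each employee with their manager (same table, two aliases).
pairs = conn.execute("""
    SELECT e.name, m.name
    FROM employees AS e
    JOIN employees AS m ON e.manager_id = m.emp_id
    ORDER BY e.name;
""").fetchall()

# CASE + sub-query: label each salary relative to the company-wide average.
bands = conn.execute("""
    SELECT name,
           CASE WHEN salary > (SELECT AVG(salary) FROM employees)
                THEN 'above' ELSE 'at_or_below' END AS band
    FROM employees
    ORDER BY emp_id;
""").fetchall()

print(ranked)  # [('Cara', 'Eng', 1), ('Dev', 'Eng', 2), ('Ana', 'Sales', 1), ('Ben', 'Sales', 2)]
print(pairs)   # [('Ben', 'Ana'), ('Dev', 'Cara')]
print(bands)   # [('Ana', 'at_or_below'), ('Ben', 'at_or_below'), ('Cara', 'above'), ('Dev', 'above')]
```
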

📊 Statistics Overview

Vocabulary

**A/B Testing:** Experimentation technique where 2 or more versions of a product are shown to a random group of users to determine which will perform better given a conversion goal.

🎯 Users are split evenly into Treatment and Control groups: half of the sample is shown a product with a slight change to it (treatment), and the other half is shown a product with no change (control). Differences in behavior between the treatment and control groups are then studied.
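
The even split described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular experimentation platform does it: the function name `assign_groups`, the user IDs, and the fixed seed are all assumptions made for the example.

```python
import random

def assign_groups(user_ids, seed=42):
    """Shuffle users and split them evenly into control and treatment."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(user_ids)
    rng.shuffle(shuffled)       # random order removes any ordering bias
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "treatment": shuffled[half:]}

# Hypothetical user IDs 1..1000.
groups = assign_groups(range(1, 1001))
print(len(groups["control"]), len(groups["treatment"]))  # 500 500
```

Random assignment matters because it makes the two groups comparable on average, so a difference in behavior can be attributed to the change being tested.
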

🧠 If this sounds familiar, that's because it is! In your high school/college stats courses, you may have come across this technique under the name of Hypothesis Testing. A/B Testing is hypothesis testing for the business world.

🎨 For example, you may want to run an A/B test to find out if changing the color of a button on your webpage will result in more signups.

<aside> 💡 A/B Testing Process

Collect Data: Gather data on the part of your webpage/app you would like to optimize. It's typically recommended to begin with high-traffic areas of your site or app, as that will allow you to gather data faster. Look for pages with low conversion rates or high drop-off rates that can be improved.

Identify Goals: Pick metrics that will help you decide whether or not the variation performs better than the original version.

Generate Hypothesis: After choosing a goal, you can begin crafting hypotheses to test. Make sure to prioritize them based on expected cost, impact, and urgency.

Create Variations: Write new copy or alter the design, and decide what to compare against the control. This may include changing the font size, altering images/colors, or moving buttons around the page.

Run Experiment: Randomly assign users to the control/treatment groups and wait for users to interact with the experiment. You typically want to let an experiment run for at least 2 weeks.

Analyze Results: Look at the p-value (see below): if your p-value is lower than the significance level alpha, we reject the null hypothesis and call the result statistically significant, meaning the observed difference would be unlikely if only random chance were at work. The p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. A low p-value means such results would rarely occur under the null hypothesis. *Source: https://www.optimizely.com/optimization-glossary/ab-testing/*

</aside>
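
To illustrate the "Analyze Results" step, here is a minimal sketch of a two-sided, two-proportion z-test using only the standard library. The z-test is one common choice for comparing conversion rates, not necessarily what a given A/B testing tool uses. The function name and the conversion counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if the null is true.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical results: control converts 200/2000, treatment 250/2000.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
alpha = 0.05
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("Reject the null" if p_value < alpha else "Fail to reject the null")
```

With these made-up numbers the p-value falls below 0.05, so the difference would be called statistically significant at that level.
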

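Statistical Significance: A result is statistically significant when the difference in conversion rates between a given variation and the baseline is too large to be plausibly explained by random chance alone, i.e. when the p-value falls below the chosen significance level.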

P-Value: The probability of observing an effect at least as large as the one in your sample, assuming the null hypothesis (no real effect) is true. A p-value of < 0.05 is the conventional threshold for declaring statistical significance.

**Linear Regression:** Mathematical technique that models the relationship between one or more explanatory variables and a distinct response variable. For example, you can use a regression to find out how/if # of hours spent studying for an exam can predict test score performance on average.

📈 To visualize a regression, create a scatterplot of your X and Y variables. The line going through the data is the line of best fit, aka your regression line. Using the example above, # of hours spent studying would go on the X-axis, while test score would go on the Y-axis.
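
The line of best fit can be computed by hand with ordinary least squares. The sketch below assumes a single explanatory variable. The hours and scores are made-up numbers for the studying example, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one explanatory variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: hours studied (X) and test score (Y).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 70, 75]

slope, intercept = fit_line(hours, scores)
print(f"score ≈ {slope:.1f} * hours + {intercept:.1f}")  # score ≈ 5.8 * hours + 46.6
```

In this toy data set, each extra hour of studying is associated with about 5.8 more points on average.
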

❌ Remember that correlation does not imply causation! While regressions show relationships between variables, they do not by themselves establish causation.

➕ Check out Towards Data Science for more info on regressions.

Standard Deviation: A measure of how far observed values typically are from the mean, calculated as the square root of the variance. Essentially, how spread out your data is.

Variance: Average of squared differences from the mean.
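
A small worked example of both definitions, using a made-up data set and Python's `statistics` module. It uses the population versions (`pvariance`, `pstdev`), which match the "average of squared differences" wording. The sample versions divide by n - 1 instead.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values

mean = sum(data) / len(data)       # 5.0
# Variance: average of squared differences from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)
# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)     # 5.0 4.0 2.0

# The standard library gives the same population figures.
print(statistics.pvariance(data), statistics.pstdev(data))
```
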

Significance Level: The significance level (or α) is the probability of rejecting the null hypothesis when it is true.

<aside> 💡 For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. Lower significance levels indicate that you require stronger evidence before you will reject the null hypothesis.

Use significance levels during hypothesis testing to help you determine which hypothesis the data support. Compare your p-value to your significance level. If the p-value is less than your significance level, you can reject the null hypothesis and conclude that the effect is statistically significant. In other words, the evidence in your sample is strong enough to be able to reject the null hypothesis at the population level. *Source: https://statisticsbyjim.com/glossary/significance-level/*

</aside>
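
One way to build intuition for the significance level is to simulate many experiments where the null hypothesis is true (both groups share the same conversion rate) and count how often a test at α = 0.05 rejects anyway. The sketch below does that with a two-proportion z-test. The conversion rate, sample size, number of runs, and seed are arbitrary assumptions.

```python
import math
import random

def p_value_two_proportions(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:                      # no variation at all: no evidence of a difference
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(alpha=0.05, true_rate=0.10, n=500, runs=2000, seed=0):
    """Share of simulated null experiments that are wrongly declared significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        # Both groups are drawn with the same true conversion rate.
        conv_a = sum(rng.random() < true_rate for _ in range(n))
        conv_b = sum(rng.random() < true_rate for _ in range(n))
        if p_value_two_proportions(conv_a, n, conv_b, n) < alpha:
            false_positives += 1
    return false_positives / runs

rate = false_positive_rate()
print(f"False positive rate at alpha=0.05: {rate:.3f}")
```

The printed rate should land close to 0.05, which is the 5% risk of concluding a difference exists when there is none.
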

🏅 Choosing and Evaluating Metrics

a16z 16 Startup Metrics

a16z 16 More Startup Metrics

Sequoia Capital - Selecting the right user metrics

Sequoia Capital - Measuring Product Health