In this article, I’m going to give you three ways to get practical data science experience on your own. By completing these projects, you’ll develop a strong understanding of SQL, Pandas, and machine learning modeling.
- First, I’m going to provide you with real-life SQL case studies in which you’re given a business problem and are required to query databases to diagnose the problem and formulate a solution.
- Second, I’m going to provide you with dozens of practice problems for Pandas, a library in Python meant for data manipulation and analysis. This will help you develop the skills that are required for data wrangling and data cleaning.
- Third, I’m going to provide you with machine learning problems where you can build a model to make predictions. By doing so, you’ll learn how to approach a machine learning problem, as well as the fundamental steps required to develop a model from start to finish.
With that said, let’s dive into it!
1. SQL Case Studies
If you want to be a data scientist, you have to have strong SQL skills. Mode provides three practical SQL case studies that simulate real-life business problems, as well as an online SQL editor where you can write and run queries.
To open Mode’s SQL editor, go to this link and click on the hyperlink where it says ‘Open another window to Mode’.
If you’re new to SQL, I would first start with Mode’s SQL tutorials, where you can learn basic, intermediate, and advanced SQL techniques. Feel free to skip this if you already have a good understanding of SQL.
Case Study 1: Investigating a Drop in User Engagement
The objective of this case is to determine the cause of a drop in user engagement on Yammer. Before diving into the data, you should read the overview of what Yammer does here. You’ll work with four tables.
The link to the case will provide you with much more detail pertaining to the problem, the data, and the questions that should be answered.
Check out how I approached this case study here if you’d like guidance.
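To give you a feel for the kind of query this case calls for, here is a minimal sketch that counts weekly active users. It runs SQL through Python’s built-in sqlite3 module so you can try it without any setup. The events table, its columns (user_id, occurred_at, event_type), and the rows are made up for illustration, so check the case itself for the real schema. Mode’s editor may also use a different SQL dialect, so the date function may need adjusting there.

```python
import sqlite3

# In-memory database with a hypothetical events table (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER, occurred_at TEXT, event_type TEXT)"
)

# Made-up rows: three engaged users in the first week, two in the second.
rows = [
    (1, "2014-07-28 09:00:00", "engagement"),
    (2, "2014-07-29 10:30:00", "engagement"),
    (3, "2014-07-30 11:00:00", "engagement"),
    (1, "2014-07-31 12:00:00", "engagement"),
    (1, "2014-08-04 09:15:00", "engagement"),
    (2, "2014-08-05 14:45:00", "engagement"),
    (3, "2014-08-06 08:00:00", "signup_flow"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Count distinct engaged users per calendar week.
# strftime('%Y-%W', ...) is SQLite-specific: year plus Monday-based week number.
query = """
SELECT strftime('%Y-%W', occurred_at) AS week,
       COUNT(DISTINCT user_id)        AS weekly_active_users
FROM events
WHERE event_type = 'engagement'
GROUP BY week
ORDER BY week
"""
weekly_active = conn.execute(query).fetchall()

for week, active_users in weekly_active:
    print(week, active_users)
```

Once you can see the drop in a weekly series like this, a natural next step is to split the same count by device, region, or signup cohort to narrow down where it comes from.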
Case Study 2: Understanding Search Functionality
This case is more focused on product analytics. Here, you’ll be required to dive into the data and determine whether the user experience is good or bad. What makes this case interesting is that it’s up to you to determine what ‘good’ and ‘bad’ mean and how the user experience will be evaluated.
Case Study 3: Validating A/B Test Results
One of the most practical data science applications is performing A/B tests. In this case study, you’ll dive into the results of an A/B test where there was a 50% difference between the control and treatment groups. Your task for this case is to validate or invalidate the results after a thorough analysis.
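One check you might run while validating a result like this is a significance test. The sketch below is a two-proportion z-test written with only the standard library. It assumes a binary metric (for example, whether a user did something at least once) and uses made-up counts. The metric in the actual case may not be binary, so treat this as one possible check rather than the full analysis.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis that both groups are equal.
    p_pool = (success_a + success_b) / (n_a + n_b)
    standard_error = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 20% of control vs. 30% of treatment (a 50% relative lift).
z, p_value = two_proportion_z_test(success_a=200, n_a=1000, success_b=300, n_b=1000)
print(f"z = {z:.2f}, p-value = {p_value:.2g}")
```

A small p-value alone does not validate the test. It is also worth checking whether users were assigned to the two groups correctly and whether the chosen metric actually reflects the behavior you care about.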
2. Pandas Practice Problems
When I first started developing machine learning models, I found that my lack of Pandas skills was a big limitation to what I could do. Unfortunately, while there are plenty of places to practice general Python and SQL, there aren’t many resources on the internet for practicing Pandas.
A few weeks ago, however, I came across this resource, a repository full of practice problems specifically for Pandas. By completing these practice problems, you’ll know how to:
- Filter and sort your data
- Group and aggregate data
- Use .apply() to manipulate data
- Merge datasets
- And much more.
If you can complete these practice problems, you should be able to confidently say that you know how to use Pandas for data science projects. It will also help you out significantly for the next section.
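To make the skills listed above concrete, here is a short sketch that touches each of them once. The two tables and their column names are made up for illustration and are not taken from the practice repository.

```python
import pandas as pd

# Hypothetical data for illustration only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": [10, 11, 10, 12, 11],
    "amount": [25.0, 40.0, 15.0, 60.0, 35.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "region": ["East", "West", "East"],
})

# Filter and sort: orders of at least 30, largest first.
large_orders = orders[orders["amount"] >= 30].sort_values("amount", ascending=False)

# Group and aggregate: total and average spend per customer.
spend_per_customer = orders.groupby("customer_id")["amount"].agg(["sum", "mean"])

# .apply(): derive a label from each amount.
orders["size_label"] = orders["amount"].apply(
    lambda amount: "large" if amount >= 30 else "small"
)

# Merge: attach each customer's region, then total the spend per region.
merged = orders.merge(customers, on="customer_id", how="left")
spend_per_region = merged.groupby("region")["amount"].sum()

print(large_orders)
print(spend_per_customer)
print(spend_per_region)
```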
3. Machine Learning Modeling
One of the best ways to get data science experience is by creating your own machine learning models. This means finding a public dataset, defining a problem, and solving the problem with machine learning.
Kaggle is one of the world’s largest data science communities with hundreds of datasets that you can choose from. Below are a couple of ideas that you can use to get started.
Predicting Wine Quality
This dataset contains data on various wines, their composition, and their quality. This can be a regression or classification problem, depending on how you frame it. See if you can predict the quality of a red wine given 11 inputs (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, and alcohol).
If you’d like some guidance creating a machine learning model for this dataset, check out my approach here.
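Here is a rough sketch of how the two framings differ, using scikit-learn. To keep it self-contained, it generates a small synthetic table instead of loading the real file, so the numbers mean nothing. The column names, the two features I picked, the cutoff of 6 for a "good" wine, and the choice of a random forest are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (NOT the real dataset).
rng = np.random.default_rng(0)
n_rows = 300
wine = pd.DataFrame({
    "alcohol": rng.uniform(8.0, 14.0, n_rows),
    "volatile acidity": rng.uniform(0.1, 1.2, n_rows),
})
raw_score = (
    5.5
    + 0.5 * (wine["alcohol"] - 11.0)
    - 2.0 * (wine["volatile acidity"] - 0.65)
    + rng.normal(0.0, 0.5, n_rows)
)
wine["quality"] = raw_score.round().clip(3, 8).astype(int)

X = wine[["alcohol", "volatile acidity"]]

# Regression framing: predict the numeric quality score directly.
y_regression = wine["quality"]

# Classification framing: predict whether a wine is "good" (assumed cutoff: 6).
y_classification = (wine["quality"] >= 6).astype(int)

# Hold out 20% of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_classification, test_size=0.2, random_state=0, stratify=y_classification
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {accuracy:.2f}")
```

For the regression framing, you would pass y_regression to a regressor instead and score it with an error metric such as mean absolute error.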
Used Car Price Estimator
Craigslist is the world’s largest collection of used vehicles for sale. This dataset is composed of scraped data from Craigslist and is updated every few months. Using this dataset, see if you can create a model that predicts whether a car listing is overpriced or underpriced.
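One way to frame this problem is to estimate a fair price from a listing’s attributes and then compare it with the asking price. The sketch below does that with a single feature and a straight-line fit. The data, the column names, and the 10% tolerance are made up for illustration. The real dataset has many more columns and far messier prices, so expect to need more features and a stronger model.

```python
import numpy as np
import pandas as pd

# Hypothetical past listings used to estimate a fair price (made-up numbers).
past_listings = pd.DataFrame({
    "age_years": [1, 2, 3, 4, 5, 6],
    "price": [19000, 18000, 17000, 16000, 15000, 14000],
})

# Fit price as a straight line in age; polyfit returns the slope first.
slope, intercept = np.polyfit(past_listings["age_years"], past_listings["price"], deg=1)

# New listings to evaluate against the fitted line.
new_listings = pd.DataFrame({
    "age_years": [3, 5, 4],
    "price": [21000, 12000, 16200],
})
new_listings["expected_price"] = intercept + slope * new_listings["age_years"]
new_listings["price_ratio"] = new_listings["price"] / new_listings["expected_price"]


def label_listing(price_ratio, tolerance=0.10):
    """Label a listing by how far its asking price is from the expected price."""
    if price_ratio > 1 + tolerance:
        return "overpriced"
    if price_ratio < 1 - tolerance:
        return "underpriced"
    return "fair"


new_listings["label"] = new_listings["price_ratio"].apply(label_listing)
print(new_listings)
```

Fitting on one set of listings and labeling a separate set keeps an unusual listing from pulling the expected price toward itself.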