I believe that these lessons are so important because they are instrumental to having a successful data science career. After reading this, you’ll realize that there’s much more to being a good data scientist than building complex models.
With that said, here are the 3 most important lessons I’ve learned in my data science career!
1. A large portion of your time is actually spent before and after the modeling phase of your projects.
One thing that I noticed is that almost all data science courses and boot camps emphasize and elaborate on the modeling phase of the lifecycle of a project, while in reality, that only makes up a small component of the entire process.
If it takes you a month to build a preliminary machine learning model at work, you can expect to spend a month understanding the business problem beforehand and documenting and socializing the project afterward.
Completing these steps before and after building your model isn’t just recommended; it’s pivotal to the success of your project.
Let’s dive into the importance of each:
- Business Understanding: Understanding the business problem at hand is critical for your success. For example, if you’re building a machine learning model, you should know what the model is supposed to predict, who’s going to use it, how it’s going to be used practically, what metrics you’ll use to assess the model, and so on. It’s essential that you take the time to understand everything about the business objective to create an applicable model.
- Documentation: While I agree that documentation is less exciting than munging through data and building models, it’s important that you have clear and concise documentation for your code, for any tables that you build, and for how the model was built. This is really important so that you OR someone else can easily refer to these resources when using your models or when fixing them.
- Socialization: Socialization rarely gets talked about, but your projects won’t be successful if they’re not used by the business. Socializing your projects entails presenting them to relevant stakeholders and explaining their value and how to use them. The more stakeholders you can sell your ideas to, the more likely they are to adopt your data products and the more successful your projects will be.
What do all three of these steps have in common? They’re all a form of COMMUNICATION. In fact, I’d argue that good communication is the difference-maker between data scientists and senior data scientists.
2. Fundamentals will get you >80% of the way.
When I started learning data science, I tried learning the most complicated concepts without learning the basics.
After years of experience, I’ve realized that the basics are enough to get you over 80% of the way in your career. Why? Simpler solutions usually win. They’re easier to understand, easier to implement, and easier to maintain. Only once a simple solution has demonstrated its value to the company should you look into more complex solutions.
So what exactly are the fundamentals?
A) SQL
After 3 years of work, I am convinced that mastering SQL is pivotal to having a successful career. SQL is not a hard skill to learn (e.g., SELECT, FROM, WHERE), but it is certainly a hard skill to perfect. SQL is essential for data wrangling, data exploration, data visualization (building dashboards), building reports, and building data pipelines.
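To make the gap between "easy to learn" and "hard to perfect" a little more concrete, here is a minimal sketch that runs SQL from Python's built-in sqlite3 module. The `orders` table and its values are made up purely for illustration; the first query is the basic SELECT/FROM/WHERE pattern, and the second is the kind of aggregation you end up writing for reports and dashboards.

```python
import sqlite3

# Hypothetical orders table, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("ana", 35.0), ("ben", 50.0), ("ben", 10.0), ("cy", 5.0)],
)

# The basics: SELECT ... FROM ... WHERE
basic = conn.execute(
    "SELECT customer, amount FROM orders WHERE amount >= 20 ORDER BY amount"
).fetchall()

# One step further: aggregate per customer and filter on the aggregate.
summary = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 10
    ORDER BY total DESC
    """
).fetchall()

print(basic)
print(summary)
conn.close()
```

The same two queries would work in most SQL consoles; Python is only used here so the example can be run without setting up a database.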
Check out my guide below if you want to master SQL: A Complete 15 Week Curriculum to Master SQL for Data Science
B) Descriptive and Inferential Statistics
Having a good understanding of fundamental descriptive and inferential statistics is also very important.
Descriptive statistics allow you to summarize and make sense of your data in an easy manner.
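As a small illustration, here is a sketch using Python's built-in statistics module. The daily order counts are hypothetical, and I've included one deliberate outlier to show why it helps to look at more than one summary number.

```python
import statistics

# Hypothetical daily order counts; the last value is a deliberate outlier.
daily_orders = [12, 15, 11, 14, 13, 15, 90]

summary = {
    "mean": statistics.mean(daily_orders),
    "median": statistics.median(daily_orders),
    "stdev": statistics.stdev(daily_orders),  # sample standard deviation
    "min": min(daily_orders),
    "max": max(daily_orders),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Here the mean is pulled well above the median by the single outlier, which is exactly the kind of thing a quick descriptive summary is meant to surface.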
Inferential statistics allow you to make conclusions based on limited amounts of data (samples). This is essential for building explanatory models and A/B testing.
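For a sense of what this looks like in an A/B test, below is a minimal sketch of a two-sided, two-proportion z-test using only the standard library. The function name `two_proportion_z_test` and the conversion numbers are my own assumptions for illustration, and it relies on the usual large-sample normal approximation, so treat it as a sketch rather than a complete testing procedure.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 200/2000 conversions (control) vs 250/2000 (variant).
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The point is that a sample of a few thousand users lets you say something about all users, along with a measure of how confident you can be in that statement.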
C) Python for EDA and Feature Engineering
Python is important mainly for performing EDA and feature engineering. That being said, these two steps can also be completed using SQL, so that’s something to keep in mind. I personally like to have Python in my tech stack because I find it easier to perform EDA in a Jupyter Notebook than in a SQL console or a dashboard. Check out: An Extensive Step by Step Guide to Exploratory Data Analysis
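Here is a minimal sketch of what a first EDA and feature engineering pass can look like. The dataset and column names (`signup_date`, `plan`, `monthly_spend`) are hypothetical, and I've stuck to the standard library so it runs anywhere; in a notebook you would more commonly reach for a dataframe library such as pandas.

```python
import csv
import io
from collections import Counter

# Hypothetical raw data, inlined so the sketch is self-contained.
raw = """signup_date,plan,monthly_spend
2023-01-05,basic,20
2023-01-17,pro,55
2023-02-02,basic,
2023-02-20,pro,60
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# EDA: count missing values per column and look at category frequencies.
missing = {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}
plan_counts = Counter(r["plan"] for r in rows)

# Feature engineering: derive the signup month from the date string.
for r in rows:
    r["signup_month"] = int(r["signup_date"].split("-")[1])

print(missing)
print(plan_counts)
print([r["signup_month"] for r in rows])
```

Even a quick pass like this tells you where the gaps in your data are before you spend any time on modeling.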
3. It’s better to iterate and build several versions of a model than to spend an enormous amount of time on building one final model.
Build, test, iterate, repeat.
Generally, it’s better to spend less time on a model to get an initial version into production and iterate from there. Why?
- Allocating less time to an initial model incentivizes you to come up with a simpler solution. And like I said earlier in this article, there are several benefits to a simpler solution.
- The faster you come up with a POC (proof of concept), the faster you can receive feedback from others to improve on it.
- Business needs constantly change, so you’re more likely to be successful if you can deploy your project sooner rather than later.
The point I’m trying to make is not to rush your projects but to quickly deploy them so that you can receive feedback, iterate, and improve your projects.