You nailed the interview and landed the hottest job in America; go you, now you’re a data scientist! As you come into work every day to incorporate data science into product development, or apply machine learning to solve business problems, it’s blue skies everywhere. There are so many possibilities it’s hard to know where to begin and how to spend your time.
Well, here are a few suggestions, broken down into the stages of a typical data science project, that in my experience are practically guaranteed to work every time.
Disclaimer: No data scientists were intentionally harmed. Any resemblance to your coworkers, past or present, or the project you’re working on is purely coincidental, although expected.
Picking your problem
Start working on whatever problem you want.
— Try to pick the least defined problem, hopefully one you vaguely understand. That way you’ll have plenty of room to explore, and you don’t have to worry about running out of things to try.
— Don’t bother wasting your time asking anyone what would be useful or valuable for them, you know better anyway! They’re probably not going to have the answers. And really, the fewer people that know about your work the better, you don’t want them interrupting you.
— You shouldn’t have to explain yourself either, the value of data science is self-evident; everybody wants it! But if someone does ask you what you’re working on, a good rule of thumb is that the more you have to convince them that it’s useful, the more you’re on the right track and can rub it in their face later.
— The business will eventually find value in whatever you do, don’t worry about when or how yet.
Start working on any and every problem someone comes to you with.
— Best not to ask too many questions, they’re the expert after all. They should know exactly what they need, and anytime someone comes to you it means there’s a data science solution that’s worth building. You don’t want them thinking you’re not up to it. They may start doubting your commitment.
— You have to show them that with data science you can do anything. They’ve likely read the news recently, so reassure them AI can do whatever they want, probably more and better and faster and stronger and 24/7/365 with a smile.
Getting the data
Use only the data you have readily available, trust it implicitly.
— Data’s data, there’s not much you can do about it. Whatever you have to work with will have to do. It’s best to leave everything in, as-is, since it’s more representative of reality.
— You also don’t want to waste your time here munging data when you can be doing the fun stuff! No one’s impressed with how many null fields you filled in, let the algorithms figure out what’s useful, isn’t that what machine learning is for?
— If it’s not readily machine accessible, it’s not worth your time. If people want you to help them, they should consolidate those spreadsheets, clean those logs, join the 15 databases, and OCR that pile of documents on their desk first.
Don’t start work until you have 597 Petabytes of data
— Data is gold. The more of it you have, the better your eventual solution will be. Best not to prematurely build something when more data will do the trick.
— You know that unless you’re dealing with “big data” problems, the problem’s probably not worthy of your time, right? If it could theoretically be done in Excel without coding, that’s probably not going to impact the business.
Throw away any data that’s weird or imperfect
— Got those pesky data points throwing everything off? I know what you mean. Don’t try to figure out what’s going on, just sweep them under the rug, it’s our secret.
Solving the problem
Spend most of your time building the most technically complex or interesting solution you can think of
— Anyone can make the simple solution, it’s a waste of your skills. Impress everyone with how many different features and algorithms you can put together. Remember complexity=value. It runs for days through a combination of scripts and code snippets on your laptop? Respect.
— Implement everything yourself, don’t look for existing or previous solutions, each problem is unique after all. You get all the credit, and you’re also the only one who can keep maintaining it, so job security.
— Don’t stop until you’ve squeezed every ounce of accuracy (insert favorite evaluation metric here) possible. It’s quite embarrassing to know you could do better. Everyone will appreciate how much extra time you’ve spent on it.
— Throw away any experiment that didn’t work out, no one wants to see that.
— Most importantly, don’t give up. You’ve got the right problem, you’ve put in this much time already, no sense reframing it now. You only have one shot to get this right.
Build a robust scalable beast
— Who says data scientists can’t engineer for production scale? Prove them all wrong. Start from day 1 rigorously testing every piece of code you write, you never know what you’re going to end up using; you’ll be grateful it’s ready.
Presenting your solution
— Or at least delay showing the solution to end users until it’s definitely ready. No mockups, no janky prototypes, it’ll only confuse them and they may poke holes in it. You don’t want them to think less of you, and you already know it’s not perfect, you don’t need anyone else thinking that.
Don’t explain how it works or where it fails
— Any time spent outside of coding is wasted, and frankly, not your job. Minimize attendance at meetings where you may have to explain anything. It’s better when everyone is in awe of you as the mysterious stranger with the magic black box.
— If you have to make a presentation, try to keep it as technical as possible, focus on how and what you’re building. No questions or discussion means you nailed it.
— Knowing upfront what you’ve built has limitations will make it less likely to be used, people want complete solutions.
Redefine how people do things
— Most people hate what they’re doing now and how they’re doing it. They want to do things differently, and they’ll thank you for forcing them to change their ways.
Make your output completely transparent
— Let’s see all those numbers. How can people trust your output without understanding the accompanying probabilities and confidence intervals? People demand transparency!
— Graphs and visuals are pretty self-explanatory ways of presenting information. You can pack a lot in there, 1000s of words’ worth. People love clicking around interactive visualizations to discover the insights for themselves.
— If the output is simple, how would people know how much work went into the solution?
Deploying your solution
— Since you’re done with prototyping, it’s done, now you can finally start showing it off. The hard part’s over.
— Everything you assumed is available during development will be in production, same data source, same ETL. Just pass the code along, it should be pretty straightforward for someone to engineer it to run in production, it’s not your concern how it’s going to get deployed.
Deploy it once. Never update it. Forget about it.
— It works now, don’t pick at it until someone complains. As long as it can run once you’re good.
Don’t worry, not all steps have to be executed perfectly; usually coming close in even one is sufficient to achieve the desired result.
All of the above is meant to be tongue-in-cheek, but everyone I know (including myself) has done at least one of those things at least once, and usually more. Some of them are obviously misguided, while others are more subtle, and took me a few failed projects and a long while to understand.
Work on something important
— How do you know what’s important? You talk to the end users, decision makers, product owners, a lot, early and often. Identify problems where you are most likely to provide information that changes behavior, and be comfortable questioning and even passing on projects.
— Make sure everyone understands and agrees on what problem is being solved and how success is being measured, and keep them updated throughout. It’ll likely be slightly the wrong problem and solution at first. That’s fine, you can course correct together.
Get a solution out quickly
— Build a functional, simple solution using a reasonable set of data. Enough data that it’s not obviously missing key pieces, but not so much that it becomes a hindrance. Find the simplest algorithm that will function at scale. You can increase the complexity of the algorithm later as necessary, but the longer it can stay simple, the better.
— Go out of your way to create opportunities for feedback. These are a great forcing function for getting to intermediate completion, thinking about design and UX, and creating excitement and a sense of progress.
— Get to a point where you can fail sooner rather than later. Don’t prematurely optimize one piece before you have things working end-to-end.
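To make "simple first" concrete, here is a minimal sketch of what a first end-to-end pass might look like: a majority-class baseline and a one-line threshold rule, scored on the same held-out rows. Everything in it is an assumption for illustration only: the toy churn data, the single "days since last login" feature, the 14-day cutoff, and the function names are all hypothetical.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Build a predictor that always returns the most common training label."""
    most_common_label, _count = Counter(train_labels).most_common(1)[0]

    def predict(_row):
        return most_common_label

    return predict


def threshold_rule(row, cutoff=14):
    """Hypothetical hand-written rule: flag churn after `cutoff` inactive days."""
    return 1 if row[0] > cutoff else 0


def accuracy(predict, rows, labels):
    """Fraction of rows where the prediction matches the label."""
    correct = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return correct / len(labels)


# Hypothetical toy data: one feature per row (days since last login),
# label 1 means the customer churned.
train_rows = [[1], [2], [3], [30], [45], [60], [2], [50], [5]]
train_labels = [0, 0, 0, 1, 1, 1, 0, 1, 0]
test_rows = [[4], [40], [1], [55]]
test_labels = [0, 1, 0, 1]

baseline = majority_baseline(train_labels)
baseline_acc = accuracy(baseline, test_rows, test_labels)
rule_acc = accuracy(threshold_rule, test_rows, test_labels)

print(f"majority baseline accuracy: {baseline_acc:.2f}")
print(f"threshold rule accuracy:    {rule_acc:.2f}")
```

Any fancier model you build later has to beat these two numbers to justify its extra complexity, and having them early gives you something working to show people.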
Learn from it and iterate
— Having something out there and working, even when you know it can be improved, means you have something completed. Don’t underestimate what that means for trust, morale, and improved understanding of what it takes to go through this process for another problem.
— Don’t stop here though, things keep moving: data shifts, models make mistakes, perspectives change. Collect the feedback and update as necessary.
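As one hedged illustration of keeping an eye on shifting data, the sketch below compares the mean of a feature in recent data against a reference sample and flags it when the gap exceeds a chosen number of reference standard deviations. This is a crude check rather than a proper statistical test, and the function names, the sample values, and the threshold of 2.0 are all assumptions made up for the example.

```python
from statistics import mean, stdev


def mean_shift_in_std_units(reference, recent):
    """Distance between the two means, in reference standard deviations."""
    return abs(mean(recent) - mean(reference)) / stdev(reference)


def has_drifted(reference, recent, threshold=2.0):
    """Flag the feature when the shift exceeds an (arbitrary) threshold."""
    return mean_shift_in_std_units(reference, recent) > threshold


# Hypothetical feature values: what the model saw during development...
reference = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0]
# ...and two made-up batches arriving after deployment.
stable_batch = [10.0, 9.0, 11.0, 10.0]
shifted_batch = [15.0, 16.0, 14.0, 17.0]

for name, batch in [("stable", stable_batch), ("shifted", shifted_batch)]:
    status = "investigate" if has_drifted(reference, batch) else "ok"
    print(f"{name} batch: {status}")
```

A flag like this is a prompt to go look and talk to the people using the output, not a verdict that the model needs retraining.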
Inspired by How to Have a Bad Career in Research.