How to become a data scientist

Michael Ke Zhang
Apr 17, 2015

We at Wolfe Career help people discover the best careers for them (we just launched our beta — so check us out!).

In this post, we would like to share what we learned at our event, Careers in Data Science for Stanford Students. We invited data scientists from Intuit, StitchFix, Coursera, Jetlore, Facebook, Ayasdi, and Pinterest, along with the founder of the Insight Data Science Fellows program, to come to Stanford and spend an evening with students. Thank you, StartX, for hosting it.

Not long after we announced the event, we were overwhelmed by the response from the Stanford community. Within just 3 days, over 450 people had signed up, 9 times the number we had planned for. We asked those who signed up what they study and what degrees they are pursuing: 35% are pursuing a Ph.D., 38% a Master’s, and 25% a Bachelor’s degree. We also compared the fields of study of the 300 data scientists in our alumni network to those of the people who signed up (the signups were heavily biased towards physicists). Physics, CS, Statistics, Math, and EE are consistently ranked the top five majors for data scientists. (Thank you to Joy Rimchala from Intuit for helping with the data mining and analysis.)

Comparison of the fields of study of Stanford data science alumni vs. our attendees

So if you are a physicist, you should definitely consider data science as a career option, and you can still become a data scientist if you are studying biology. The backgrounds of data scientists are really diverse.

For the same 300+ Stanford alumni working in data science, we also parsed their skills. If you are already studying science, this is good news: of the top 10 skills, you already have most. You perhaps only need to learn or polish a good programming language (Python), some knowledge of data structures, machine learning (Andrew Ng’s ML Coursera class is a good start), structured data (SQL, which you can learn in half a day), and algorithms. If you start some side projects or contribute to open source libraries, you will be very well prepared.

We then started the panel discussion. A big thank you to our moderator, Patricia Burchat, who did an amazing job moderating the panel. We also need to thank Stephanie Tai for her incredibly detailed notes.

Moderator: What were the most valuable subjects you studied or skills or attitudes you acquired? How does your academic background influence your work in data science?

Max Song (Ayasdi): You have to have a good attitude: confidence. Because you are always confronted with things you don’t know. My colleagues do not know everything, but they are good at learning on the fly.

Zhenghao Chen (Coursera): Be curious about things and want to understand them. Results often emerge from “why does this happen?” more than “get this thing done”.

Eldar Sadikov (Jetlore): The core technical skills are important (statistics, machine learning, computer science, etc.). Clarity of thinking is very important: pose the question you’re trying to solve, be able to clearly define assumptions, see what kind of data you can actually observe, identify possible biases present, isolate those biases, and be clear about what you’re trying to do. This is the same way you would solve a scientific problem.

Eldar Sadikov on core skills of data scientists.

Results often emerge from “why does this happen?” more than “get this thing done”.
- Zhenghao from Coursera

Moderator: Are there ‘missed opportunities’ you wish you had taken advantage of when you were an undergrad, grad student or postdoc?

Jake Klamka (Insight): (in line with Zhenghao’s remark about curiosity) Have you gone out and explored? Have you done everything you can with the data? Do more fun side projects (he specifically used the word “fun”). You need to learn fundamentals, but also think about interesting and fun little nugget problems: how can I do this, how can I analyze this, what findings can I post on a blog? These are different from what your usual homework assignments ask of you.

(Prompted by our moderator) Jake encouraged hack days, online challenges, etc. as a way to get into these fun projects.

Work on interesting and fun little nugget problems.
- Jake Klamka from Insight Data Science Fellowship

Moderator: Data science means something different at every company — e.g., it can mean machine learning, statistics, data infrastructure, business analytics, etc. How do you think about exploring each area and picking one that fits you?

Anuranjita Tewary (Intuit): “A data scientist is just a business analyst who lives in California” (laughter). In an ideal world, all the skills described would exist in one person (a unicorn). A data scientist is different from a data engineer or data analyst in that a data scientist behaves like a scientist in many ways. The scientific method, curiosity, etc. come from the “scientist” part, but data scientists then have to take it a step further and be able to do the “product” part (referring to the earlier joke).

Eldar Sadikov (Jetlore): Data analyst: queries database to answer questions. Data engineer: engineers infrastructure to take data from one place and then save it to a table in some form so a data analyst can query and use it. Data scientist: s/he doesn’t just query databases and answer easy questions, but tries to get insights out of data, gets features and builds predictive models.

Editor’s note: by analyzing over 250 data scientist profiles, this study found that there were four categories of data scientists: data businessperson, data engineer, data researcher and data creative. They all come from different backgrounds and have very different skill profiles. It is essential to understand the differences among them if you are thinking about a career in data science.

Anu Tewary on different types of data scientists.

Moderator: What were the most significant challenges you faced in finding career opportunities in data science or transitioning to a career in data science?

Think more about where you want to go; what companies you want to work for, instead of thinking about who would hire me.
- Max Song from Ayasdi

Max Song (Ayasdi): DJ Patil offered great advice: when you want to do something new, you’re stuck with no prior work experience (cycle of no relevant experience, so no job offers, so no relevant experience, etc.). When you want to make this transition (especially from academia to industry), you need to essentially run and jump across the cliff, but you also need someone to catch you. Be persistent; if there’s a ten percent chance you’ll be hired, you need to apply to ten places before you can expect to be hired. My personal story: I actually applied to Ayasdi three or four times before they found a good, fitting role for me.
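Editor’s note: Max’s rule of thumb can be made concrete with a little arithmetic. The sketch below is ours, not Max’s, and it assumes every application is an independent trial with the same chance of success, which real job searches rarely satisfy. Under that assumption, ten applications at a 10% chance each yield one offer on average, but the chance of landing at least one offer is only about 65%. The function names are hypothetical and chosen purely for illustration.

```python
import math


def expected_offers(p, n):
    """Average number of offers from n independent applications, each with success chance p."""
    return p * n


def prob_at_least_one_offer(p, n):
    """Chance of one or more offers: the complement of being rejected n times in a row."""
    return 1 - (1 - p) ** n


def applications_needed(p, target):
    """Smallest n for which the chance of at least one offer reaches `target`."""
    if not (0 < p < 1 and 0 < target < 1):
        raise ValueError("p and target must be strictly between 0 and 1")
    # Solve 1 - (1 - p)**n >= target for n, then round up to a whole application.
    return math.ceil(math.log(1 - target) / math.log(1 - p))


# Assumed inputs: a 10% chance per application, as in Max's example.
print(expected_offers(0.1, 10))                    # one offer on average
print(round(prob_at_least_one_offer(0.1, 10), 3))  # 0.651
print(applications_needed(0.1, 0.9))               # 22
```

In other words, under these simplifying assumptions ten applications make an offer more likely than not, but it takes roughly twice as many before an offer becomes close to a sure thing, which is one more argument for persistence.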

Moderator: People who go to where they want to go are the happiest.

Jake Klamka (Insight): Right. It is easy to fixate on a single thing when making this transition (e.g. stats, machine learning), but it’s helpful to have a good toolset. For example, lots of science students use MATLAB, but a lot of companies might not use MATLAB; rather than hoping for them to hire you so you can learn Python on the job, it’s better to try and learn Python yourself.

First you can hit the ground running and try to learn the language: try to talk to people and go to panels and events (like this) to pick up the language. A lot of times two people are talking about the exact same thing but in different languages. For example, talking about A/B testing: “You guys run a lot of studies here.” “No, we don’t really do studies, but we do a lot of experiments.” Both people are talking about the same thing (A/B testing), but there is a language barrier.

So knowing the right language and keywords is important.

Jake Klamka on learning the language of data scientists.

You should learn the language of data scientists.
- Jake Klamka from Insight Data Science Fellowship.

Moderator: At Coursera, Intuit and Jetlore, what strategies do you use to recruit talent? What are you looking for in a technical hire?

Zhenghao Chen (Coursera): Coursera looks for technical people from any background; they value diversity, so this widens the pipeline quite a bit. We value people who can demonstrate that they can pick things up quickly or can be motivated to learn new things (a reference to Jake’s earlier remark regarding fun side projects, and Max’s comment about confidence).

Anuranjita Tewary (Intuit): We look for three things in people. First, communication skills: we want to have data scientists who will be part of a team, who will communicate with very senior people at the company, who can tell stories with data, who can make the case for why this decision should be made or why that product should be built.

Second, business savvy: data scientists are heavily involved in product development and all aspects of business; they need to understand what they’re working with (the whole system). For example, interviewees are asked “how could you monetize Mint?” (how can Mint make money) to see if people did their homework, if they tried the product out, if they know how the product works, how they would suggest improving it.

Third, we are looking for creativity with data: what does the data actually represent? Data scientists have to be creative with how they use data; they need that creativity and curiosity.

Eldar Sadikov (Jetlore): We value clarity of thinking: suppose you have a hypothesis that products people click on are indicative of what they will buy, and you have some type of data; how do you validate that hypothesis? We want to see the person’s thought process and clarity of thinking (see Eldar’s earlier remarks regarding clarity of thinking).
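Editor’s note: one simple way to approach a question like Eldar’s is to compare the purchase rate of users who clicked on a product with the rate of users who did not, and then ask whether the gap is larger than chance alone would explain. The sketch below is our own illustration, not Jetlore’s method. It uses a two-proportion z-test with only the Python standard library, and the counts are made up for the example. A significant gap shows that clicks and purchases are associated, not that clicks cause purchases, so the assumptions and possible biases still need to be spelled out.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Compare two conversion rates; returns (rate_a, rate_b, z, two-sided p-value)."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Hypothetical counts, invented for illustration only:
# 120 of 1000 users who clicked a product bought it,
# 60 of 1000 users who did not click bought it.
rate_clicked, rate_not_clicked, z, p_value = two_proportion_z_test(120, 1000, 60, 1000)
print(f"clicked: {rate_clicked:.1%}, not clicked: {rate_not_clicked:.1%}")
print(f"z = {z:.2f}, p-value = {p_value:.2g}")
```

With real data, the harder part is usually what Eldar describes: deciding which users belong in each group, what counts as a click or a purchase, and which biases (for example, only logged-in users being observed) could produce the same gap.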

We also want someone who will hit the ground running; MATLAB is great, but you need to be comfortable programming (C++, Python, Java, etc.), and you need to know the basic stats models and distributions. We are a startup and thus we expect people to contribute to the team immediately.

We also think prior projects are good indicators; how did you do them, did you do anything special? Can you explain it?

We are looking for people who can communicate, and are business savvy and creative with data.
- Anu Tewary from Intuit

Moderator: What are some strategies for preparing for interviews for data science positions?

Max Song (Ayasdi): First you have to understand the business and the interviewer and what they are looking for. You have to understand the other side of the table (are they looking for a data researcher or a data engineer?).

If presented with a dataset, can you interact with the data in a meaningful way? They might require you to solve the problems on site, or they might give you a set amount of time and ask you to hand in the solutions later.

Editor’s note: you should really check out Max’s book on interview questions. We offered 35 free copies to our attendees.

Max Song on strategies for preparing for interviews for data scientist positions.

Moderator: Please describe how programs like the Insight Data Science Fellows program might address some of the challenges in making the transition to a career in data science?

We value clarity of thinking and fundamental skills such as programming and statistics.
- Eldar Sadikov from Jetlore

Jake Klamka (Insight): In starting Insight, we asked ourselves, what do people value in data scientists? It turned out that employers want to hire someone who has actually done data science, not just read about it. So we created the six-week program, the bulk of which is doing data science and creating projects, and also meeting data scientists who come in from industry and talk about what they do and what they want on their teams.

We very much encourage both doing data science (side projects) and talking to other data scientists to learn from them. Even a 20-minute meeting can benefit you greatly.

Anu Tewary (Intuit): We love people from Insight because they demonstrate what they can do in six weeks. Graduates from Insight have had a great impact on our business. We want to see: given a short period of time, can you take something and build something? Can you prioritize? It doesn’t have to be perfect, but you can make something and then iterate on it.

We like people who can whittle away and figure out what’s just noise and what needs to be focused on, what to work on and get out the door first. This is very different from academia.

You can get started by just working on open source projects. There is so much information out there that you should pick up these tools and projects. There are no excuses not to.

Moderator: Do you have any advice for students who are particularly interested in applying data science for “social good” — e.g., non-profits, public policy, resource management and allocation, and education?

Max Song (Ayasdi): It is sad to see that the best minds of our generation spend so much time getting people to click ads. In order to be effective in applying data science to nonprofits, take time to be humble and understand the nonprofit. For example: if you want to address education, there isn’t a lot of data, but traditional machine learning assumes an abundance of data; you need to understand that this is not always feasible with nonprofits.

Working with nonprofits is a good way to show you have real-world relevance (not just specialized skills in which you’ve been trained); they have real and not very clean data, they maybe didn’t know what they wanted, but you were able to solve a problem for them.

Zhenghao Chen (Coursera): It’s easy to focus on what you can measure and it’s easy to lose sight of what you can’t easily measure. It’s hard to measure social good, and it takes more effort to keep track of, but it’s still important. As quantitative people, we have a predisposition to think that data is just numbers and matrices, but it’s much more than that.

Zhenghao Chen on applying data science to the “social good”.

Moderator: Now we open the floor to questions from the audience.

Attendee: Where do you think data science and big data are heading?

Eldar Sadikov (Jetlore): In my opinion, the next trend in big data is domain-specific data companies (financial, healthcare, e-commerce).

Jake Klamka (Insight): It used to be very internet-focused; now it is spreading out, both in startups and big corporations. Memorial Sloan Kettering and corporations such as Bloomberg are starting to build data science teams.

Moderator: So more and more companies are recognizing that they need data science.

Max Song (Ayasdi): We all know that software is eating the world. But anywhere that software is eating the world, it is also pooping data. Everywhere software is eating the world, in 2 or 3 years there is a lot of data to be analyzed.

A lot of current algorithms were invented in the 1980s; we have to come up with new math to do new things with the data we’re collecting. At Ayasdi we have to invent new math to do what we are trying to do. That’s another direction the industry is headed.

Attendee: A lot of job descriptions seem to require a PhD. How can I overcome biases against applicants who don’t have graduate degrees?

Anuranjita Tewary (Intuit): Intuit has a rotational development program for undergraduates coming straight out of college. They rotate through different teams to learn how things work and get more context, then they start narrowing in on something specific. We do hire undergraduates, but it does take more training.

Eldar Sadikov (Jetlore): If someone has done a great side project or addressed a problem and applied something really cool they haven’t seen before, that makes the applicant stand out. It’s about what you’ve done, not just the fact that you’re an undergraduate.

Attendee: Do you have recommendations for someone interested in building mathematical models? Do we need to know things like social behavior (not taught in math/physics)? Should we take a social science class or other classes outside our field?

Anuranjita Tewary (Intuit): I highly recommend it, as it is a good way to learn about things that make a great impact.

Max Song (Ayasdi): We should do lots of math and physics (mental machinery), then do something completely enigmatic (acting, poetry). All of us are trying to model human behavior, so it’s important to understand what is signal vs noise. We need good priors, and we need to understand people. Yes, be divergent; you’re more interesting and more diverse and will bring more to your projects/companies.

Attendee: Do you recommend working at a big company or at a startup? If startup, where do you get big data? (big data comes from big companies)

Eldar Sadikov (Jetlore): When starting a startup, you always start from the problem. You can’t build a data startup thinking “I want to build a data startup”. Instead, you need to think of a domain and a problem to solve, then hopefully the solution is data-intensive.

Moderator: Thanks, everyone. Now we will go to the next session: small group discussions.

Jake Klamka answering attendees’ questions.
Anu is awesome. She was busy, and we were lucky to have her.
Will Chen is a data scientist at Quora. He co-authored the Handbook of Data Science with Max. Many of the data science Q&A answers on Quora (you have perhaps read some of them) are from him. Smart guy.
Max is incredibly resourceful and intelligent. Here he’s helping!
Zhenghao helping an attendee.
Eldar not only served as one of our panelists, he also brought in his co-founder and directors to help in this event. StartX alumni are always awesome.
Yihua is a manager in data science at Facebook. He is one of our data science small discussion leaders.
Joy helped us put together the fields-of-study graphic you saw. She also shared her personal journey of transitioning from biology to data science. Here she’s helping by answering questions.
Michael Ke Zhang (black T-shirt) is the organizer of the event and the founder of Wolfe Career. Justin White (green grid shirt) is Michael’s advisor. Here we have a small Caltech reunion.
Thank you again, StartX, for hosting the event.

Originally published at blog.wolfecareer.com. Check out our beta site at www.wolfecareer.com. Wolfe Career — discover the best career for you.
