In my previous blog post (which you can read here) I linked to a 2012 HBR article that describes the role of the Data Scientist as: ‘…: The Sexiest Job of the 21st Century’. Leaving the rhetoric of this statement aside, if there is some truth in the title, how does one go about becoming a Data Scientist?
Before we can answer this, I’d like to state emphatically that there is no single route that works for becoming a good Data Scientist. I myself am still learning the vast area of Data Science and will be for many years to come. I come from a more unusual background (database design, administration and development) than many other ‘true’ Data Scientists. What I mean is that everybody is different: people bring different skills, meet different needs in the job market and start learning data science at different times. This means you may have good skills at a particularly suboptimal time, or the opposite. I’ve met many students studying a combined Comp Sci & Statistics degree, and I have to say this is a very powerful combination.
Anyway, everyone knows or has heard the phrase that a picture paints a thousand words, and I feel the below Venn diagram does the job of explaining the varying areas of a Data Scientist pretty well:
It’s not new but still summarises the key areas for a successful|useful|not winging it Data Scientist.
Specifically, you ideally need experience in several areas to carry out solid data science work. One of my favourite sayings regarding the role of a Data Scientist is:
“Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.” Josh Wills – Twitter.
Why solid? Well, how do you deal with data drift or unforeseen changes in a Production environment once you’ve deployed a model? What if the distribution of your data changes over time so that the algorithm you’re applying is no longer relevant or accurate (e.g. your data is no longer normally distributed)? How do you even check this? How can you be sure that the model you’ve deployed has been robustly tested and doesn’t owe its 95% accuracy/F1 score to a very lucky sample of data? I could go on. The point of the above diagram is that, ideally, an expert Data Scientist should have solid software engineering skills and solid math & statistical knowledge, backed up by in-depth substantive expertise in a given area of interest.
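To make the "how do you even check this?" question a little more concrete, here is a minimal sketch in plain Python of one common approach: comparing the data a model was trained on with the data it now sees, using a two-sample Kolmogorov-Smirnov test. The function names (`ks_statistic`, `looks_drifted`), the 0.05 threshold and the toy data are my own assumptions for illustration; in real work you would more likely reach for a library routine such as `scipy.stats.ks_2samp`.

```python
import math


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    n, m = len(ref), len(cur)
    i = j = 0
    largest_gap = 0.0
    while i < n and j < m:
        value = min(ref[i], cur[j])
        # Step past every observation equal to `value` in both samples (handles ties).
        while i < n and ref[i] == value:
            i += 1
        while j < m and cur[j] == value:
            j += 1
        largest_gap = max(largest_gap, abs(i / n - j / m))
    return largest_gap


def looks_drifted(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the large-sample critical value.

    The threshold alpha=0.05 is an illustrative assumption, not a recommendation.
    """
    n, m = len(reference), len(current)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


# Made-up data: a feature as seen at training time and a shifted version in Production.
training_values = list(range(100))
production_values = [value + 50 for value in training_values]

print(looks_drifted(training_values, training_values))
print(looks_drifted(training_values, production_values))
```

A check like this only tells you that a single feature’s distribution has moved; whether the model is still fit for purpose is a separate question that needs domain knowledge to answer.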
I make this point because, throughout the years I’ve been learning data science, I’ve found it is all too easy to use off-the-shelf packages from Python or R and essentially go nuts with machine learning models to get a result that looks good (see danger zone! above). I’ve seen it in colleagues’ work, and I’ve been guilty of it myself.
Anyway, before I step further onto my mini Data Science soapbox and continue this rather lame diatribe, how do you get started? Well, it’s simple really.
First off, ask yourself what you enjoy doing with data and its analysis, because Data Science takes a while to learn and you need to enjoy it! There is little point picking the area up without thinking about what you would like to achieve with it or what you already enjoy in it. It is, however, only natural not to know where you want to start, and that’s absolutely ok!
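One quick sanity check before trusting a result that looks good is to ask how much uncertainty the size of your test set implies. The sketch below uses the Wilson score interval for a proportion; the function name `wilson_interval` and the example counts are assumptions of mine for illustration, and it treats accuracy as a simple proportion of independent predictions.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    denom = 1 + z**2 / total
    centre = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Both hypothetical models score 95% accuracy, but on very different test-set sizes.
small_low, small_high = wilson_interval(correct=19, total=20)
large_low, large_high = wilson_interval(correct=1900, total=2000)

print(f"19/20 correct:     {small_low:.3f} to {small_high:.3f}")
print(f"1900/2000 correct: {large_low:.3f} to {large_high:.3f}")
```

The interval for 19 out of 20 is far wider than the one for 1,900 out of 2,000, which is the "very lucky sample" problem in numbers: the same headline score can mean very different things.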
But all the job descriptions say you need a Masters?
Before we look at the courses and resources you can read, study and try, I’d like to make one other point: it is not absolutely essential to do a Masters degree in Data Science. Many (if not all) jobs on the market these days ask for a Masters or PhD in Data Science, and that’s fair enough. However, if you can show your learnings and achievements through GitHub or Kaggle, that can go some way. I’m not suggesting that this alone will replace 1-4 years of postgraduate study, but if someone presented me with a resume for a Data Science role and the candidate had placed well on Kaggle, I’d definitely take a further look!
As of writing this post I’m a third of the way through my MSc in Data Science, and whilst I thoroughly enjoy it, there are some strong reasons why some people may not take this option and I think it’s only fair to share them. Specifically:
- The cost – In Australia at the University of Sydney each subject is around $4k AUD, and you essentially need 6 subjects plus a capstone (2 units), taken over 1 year full time or 2-4 years part time.
- The relative rigidity/irrelevance of teaching – This is a tough one to stay on top of for universities. Data Science is in a golden age at the moment and new deep learning/machine learning algorithms are being created or updated daily.
- The bureaucracy/poor organisation of University – Trust me, this can be somewhat of an issue and a potential waste of your time.
My choice for taking the MSc was pretty clear. I currently lead a team of engineers and analysts at PwC that focuses on Data Analytics; however, I really wanted to know the low-level nuts and bolts of Data Science for when I lead teams of other Data Scientists in the future. I wouldn’t personally feel comfortable leading others without knowing something of the subject area myself. This won’t be the same for others, but this was my decision.
This blog post is pretty lengthy now and part 2 will be coming soon. In the meantime, what options do you have immediately available to get you started with Data Science?
- Massive Open Online Courses (MOOCs) – MOOCs can get a bad rep from some hardcore Data Scientists (sometimes with good reason, as the content can vary); however, if you are brand new to Data Science they can be a good start. I’ve completed or been through many of the ones below:
- https://www.coursera.org/courses?languages=en&query=Data+Science – This is a 10-course specialisation rather than one course, but it is a very good start for a new Data Scientist and has some pretty good content in it.
- Kaggle – Kaggle.com not only hosts some pretty cool Data Science competitions that you can earn REAL MONEY IN (typically the more advanced projects, mind you) but also has a host of data sets you can use to learn how to apply a variety of Data Science techniques.
- Siraj Raval – YouTube – Siraj Raval definitely has a ‘unique’ sense of humour (which I love!) and his videos are not only entertaining but interesting and loaded with ideas. Worth looking at.
- Mathematical Monk – YouTube – We’ve been quite light on Math so far in this blog post (more will come in Pt.2); however, this channel is great for looking more at the Math behind Data Science.
So, to summarise: No, you don’t need a Masters in Data Science to be a Data Scientist (although it will help), but you do need bags of enthusiasm, patience and a general love of data, math/stats and coding to keep you moving along the Data Science pathway 🙂
In Part 2 I’ll be picking up where we left off here with some AWESOME podcasts for those who are audio champions and like listening to data, as well as next steps for intermediate Data Scientists out there, how useful Meetups are for networking and what you need to start doing Data Science on your own machine!
Until next time, thanks for reading 🙂