Blog

Visualising and exploiting Network Structures with Python and Networkx

How would you visualise a complex social system or the evolution of the influenza virus? Possibly a histogram, or a very large number of pie charts? Unlikely, at least not with any great elegance or efficiency…

Network theory and the analysis of network data structures, powered by mature statistical approaches, allow us to better understand measures of influence, popularity and centrality in a network.
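The slide deck attached below goes into much more depth, but as a minimal sketch of the idea (the people and friendships here are entirely made up for illustration), NetworkX makes it easy to quantify influence and centrality in a small social network:

import networkx as nx

# Build a toy social network (nodes and edges are purely illustrative)
G = nx.Graph()
G.add_edges_from([
    ("Alice", "Bob"), ("Alice", "Carol"), ("Alice", "Dave"),
    ("Bob", "Carol"), ("Dave", "Eve"), ("Eve", "Frank"),
])

# Degree centrality: how connected is each person?
print(nx.degree_centrality(G))

# Betweenness centrality: who sits on the most shortest paths (the brokers)?
print(nx.betweenness_centrality(G))

# A PageRank-style measure of influence also works out of the box
print(nx.pagerank(G))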

Sharing for the win

This morning I was fortunate enough to be invited by Geoff Pidcock to talk at the Data Science Breakfast Sydney Meetup: Dan Bridgman on Visualising and Exploiting Network Structures with Python.

It's always great to network and talk with others, and I've attached my slide deck from the talk to this blog post. I think the session was also recorded, so as soon as that is available I'll try to upload it here too!

Have a read and leave a comment; I'm always keen to hear what others think!

Visualising and Exploiting Network Structures with Python and Networkx

Applying Data Science to Social Good causes

Before I begin, I need to liberally link the two social good groups that are closest to my heart. I can only comment on the Sydney community of meetups, however, the two that stand out for me are:

  • The Minerva Collective – The Minerva Collective is a fantastic bunch of people from a variety of professions in Sydney that meet to discuss, brainstorm and tackle social problems at large by applying Data Science in Australia. Typically The Minerva Collective partners with organisations who will share data on the Minerva Collective – Data Republic platform to achieve a number of objectives. Some great work happening here.
  • Data for Democracy – Sydney – An amazing, talented and open minded group of data scientists, analysts, engineers & data nerds that all relate to: D4D – Origin Story. If you have an evening spare you should certainly go along!
  • Both Minerva & D4D – Sydney work because they have people like you willing to spend time and energy on the problems the community faces. Give them a try!

Gratuitous plug over

For more years than I can count I've enjoyed the satisfaction of helping others, solving problems and coming up with new ideas, whether to improve a technical process or rethink how business processes work.

It’s only in the recent years that I’ve actually found like-minded individuals and groups in the Sydney scene for applying Data Science/Analytics to social good causes.

It’s not all totally selfless altruism and that’s OK

Now I'd be lying if I said the work I carry out for not-for-profits, NGOs or charities is driven purely by altruism:

For me, it's a combination of altruism and an inherent feel-good factor from trying to help others who are in need.

And that's OK as far as I see it. I'm still relatively new to applying my time and passion to solving these kinds of problems, and what I now recognise is that you have to find a balance that keeps you happy.

As far as I see it, if you want to give back to society with your skills (whether in Data Science, Accounting or anything else) it takes time. But, unequivocally the time is absolutely worth it, regardless of the effort.

Success is all about persistence and doing the right thing for the long term

So, yep, it's a rather clichéd title, and I use it because working in your own time to apply your skills for the good of others is TOUGH and unrelenting, and not generally because of technical difficulties. In my experience the difficulty has been due to a lack of maturity in the use of data/technology, or a lack of time, resources or substantive expertise in the area you're focusing on. By the time you actually get to the classifier, visualisation or clustering algorithm, you've hopefully got a lot of the hard work done for you. Hopefully.

So, if it's hard and you're not doing it purely for altruistic reasons, why should you devote your skills and time to helping others?

Penny drop


As mentioned earlier, I'm relatively new to the journey of attempting to help others by applying my skills in Data Science & Analytics; however, I can still vividly recall the feeling I had after my first meetup with The Minerva Collective.

I left the meetup buzzing with excitement and passion for what was possible by bringing like-minded, talented and curious people together in a group setting to discuss approaches to the problems a particular NFP, NGO or organisation may face. It was like the penny dropped in front of my eyes, and even better than that is seeing the penny drop in others' eyes… here we were talking about problems like childhood obesity, mental health or domestic violence in a setting that simply set the neurons firing and got prototypes and hypotheses going.

Needless to say, this doesn’t even include the combination of new contacts and knowledge on Data Science you’ll certainly pick up by attending these events. Whether you’re mentoring someone or listening to a guru, you’ll learn and by the process of osmosis will improve your skills.

As always: 01110100 01101000 01100001 01101110 01101011 01110011 00100000 01100110 01101111 01110010 00100000 01110010 01100101 01100001 01100100 01101001 01101110 01100111

 

Markov chain models are so { random | hilarious | odd }

There's something about Markov models that I find cool yet very weird. They were one of the first examples of probability theory I stumbled on, around 2013/2014, before I even knew what probability theory or a stochastic model meant or actually did. It was only when I dived a bit further into their possible applications that I found them so much fun.

Yes, that's right, Markov chain models are a guilty confession of mine, and as shown later in this post I sometimes tinker with them to create Frankenstein-esque applications in Python. Maybe it's their – albeit very limited – capability to generate text and predict the future that keeps me entertained, who knows.

This blog post covers the basics of the Markov chain process. I'll readily state upfront that I certainly won't be able to cover over one hundred years of its history and application in this one post. Instead, my intention is to explain what Markov chains are, how they can be used, and how you might have some fun with them 🙂

So, what the Markov?

A Markov chain is “a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.”[1]

Markov Chains, Wikipedia

The Markov process itself is named after the Russian mathematician Andrey Markov, who published his first paper on the topic in 1906.

What I find particularly interesting about Markov chains is that they essentially do not hold any kind of ‘attention’ or memory. The next step is based only on the state attained previously.

As an example, consider the below table:

Markov Chains Explained, Tech Effigy Tutorials.

If it is cloudy, there is a 10% probability of the next state being cloudy, 50% of the next state being rainy and 40% of the next state being sunny. Therefore it is most likely that the next day will be rainy and we move to the rainy state. The day after that is also likely to be rainy, based on a 60% probability of a rainy day following a rainy day.

Because this is a stochastic model, and simply because of how probability theory works, there is no way to say that this is what will happen. The table only tells us that rainy days are likely to follow cloudy days and that, in turn, rainy days are more likely to follow rainy days. The key thing to remember here is that the prediction of a day's state depends only on the previous day; there are no other dependencies and no other form of memory or attention.

A good way to represent this table of data is also in a state diagram or a transition matrix that Joseph Armstrong has explained so well on the Tech Effigy pages:

Markov Chains Explained, Tech Effigy Tutorials.
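To make the transition matrix idea concrete, here's a minimal sketch in Python 3. The cloudy row uses the probabilities quoted above, along with the 60% rainy-to-rainy figure; the remaining probabilities are assumptions made up purely for illustration:

import random

# Transition probabilities: P(next state | current state)
# The cloudy row and the rainy-to-rainy figure match the table above;
# the other values are illustrative guesses.
transitions = {
    "cloudy": {"cloudy": 0.1, "rainy": 0.5, "sunny": 0.4},
    "rainy":  {"cloudy": 0.3, "rainy": 0.6, "sunny": 0.1},
    "sunny":  {"cloudy": 0.4, "rainy": 0.1, "sunny": 0.5},
}

def next_state(current):
    # Sample tomorrow's weather given only today's state (no other memory)
    states, probs = zip(*transitions[current].items())
    return random.choices(states, weights=probs, k=1)[0]

# Simulate a week of weather starting from a cloudy day
state = "cloudy"
forecast = [state]
for _ in range(6):
    state = next_state(state)
    forecast.append(state)
print(forecast)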

Applications of a Markov Chain Model

The applications of a Markov chain model are varied and there are also a number of derivatives of them. For example, did you know that Google's PageRank algorithm uses Markov chains to model what a random surfer of the web will do? Other examples of applications include:

  • Chemical reactions and engineering – Out of scope of this post but you only have to Google this to see a great deal of papers on the subject
  • Text prediction – Yes, your iPhone or Android (other phones are available) most likely uses Markov chains in its text prediction, based on what you have typed before.
  • Financial applications – Used for credit risk and predicting market trends
  • Text generation – Lots of applications here, however, some include:
    • Subreddit simulation – An automated process that uses Markov chains to generate random submissions, titles and comments.
    • Quote generation – You’ll see how this can be possibly achieved below
  • Google page rank

Admittedly, some of the above examples have limited scope (i.e. text generation) but some are also quite powerful and well used.

In a later post I'll possibly also look to cover derivatives such as Markov chain Monte Carlo (MCMC) methods & Hidden Markov Models (HMMs).

MCMC methods are widely used to calculate numerical approximations of multi-dimensional integrals and come from a large class of algorithms that focus on sampling from a probability distribution, whereas HMMs are a derivative of the standard Markov chain in which some states are hidden or unobserved.

Having fun with Markov chains

So, the fun bit. At least in my eyes!

It's relatively easy (< 50-60 lines of code) to generate sentences using Markov chains in Python from a given corpus of text. Shout out to Digitators.com, whose code I have used and modified for my purposes.

In the example I show below I'm looking to generate some new inspirational quotes based on around 350 existing classic quotes. Admittedly some of the quotes I generated are a bit of a train crash, but many others are hilarious and some are even quite profound! I have selected and included a prized few quotes below the code. I DID NOT WRITE THESE MYSELF!

The code to generate new quotes

The overall process to create quotes revolves around:

  1. Get the data (in this case existing quotes) from a source file.
  2. Ensure the data is in the correct format for the pseudo-Markov model, in this case a Python dictionary mapping each word to a list of the words that can follow it.
  3. Generate a list of words up to a maximum size: start with a random START word, append it to the list, then look up the dictionary for the potential next words from the last word chosen and pick one at random.
  4. The loop finishes when either the max length is reached or an END word is selected.
 
# The only Python library we need for this work!
import random

Now to create the dictionary of START, END and associated words

 
# Create a dict mapping each word to the list of words that can follow it,
# plus special START and END keys marking sentence boundaries
model = {}
for line in open('markovdata.txt', 'r', encoding="utf8"):
    line = line.lower().split()
    for i, word in enumerate(line):
        if i == len(line) - 1:
            # Last word of the line: record it as a possible END word
            model['END'] = model.get('END', []) + [word]
        else:
            if i == 0:
                # First word of the line: record it as a possible START word
                model['START'] = model.get('START', []) + [word]
            # Record the word that follows the current word
            model[word] = model.get(word, []) + [line[i + 1]]

And finally to create a function that will create a quote of max length or finishing with an END word.

def quotegen():
    generated = []

    # Keep choosing words until we hit the max length or an END word
    while len(generated) < 30:
        if not generated:
            # Start the quote from the list of possible START words
            words = model['START']
        elif generated[-1] in model['END']:
            # The last word chosen can end a sentence, so stop here
            break
        else:
            # Otherwise look up the words that can follow the last word chosen
            words = model[generated[-1]]
        generated.append(random.choice(words))

    print(' '.join(generated))
 

That's it! I'm sure the Python could be made more Pythonic, but it shows how you can easily use ANY text data to generate new sentences.
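To actually generate some quotes you just call the function a few times, something like:

# Generate five (hopefully profound) new quotes
for _ in range(5):
    quotegen()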

Example quotes:

“hope is not something ready made. it cannot be silenced.”

“fall seven times and that you go of my success is showing up.”

“challenges are empty.”

“each monday is during our possibilities become limitless.”

“to love what you feel inferior without your heart.”

“dream big and methods. third, adjust all your children, but people give up creativity. the dark, the light.”

“dream big and stand up someone else will forget what we must do.”

“change your whole life”

“a voice within yourself–the invisible battles inside all success.”

“the mind is afraid of success should be happy monday!”

I could go on but hopefully you get an impression of some of the zany and sometimes profound quotes that can be generated with less than 60 lines of code and approximately 350 quotes!

Until next time, thanks for reading 🙂

 

 

AI will take all of our jobs and that’s OK. Or is it?

Artificial Intelligence (AI) will take all of our jobs, and that’s ok

This was the rather damning opening sentence, and also the title, of a Data Science breakfast meetup I attended at General Assembly in Sydney this morning. Organised by Geoff Pidcock and Anthony Tockar, the roughly 50-100 people present were given a fantastic talk by Tomer Garzberg on a range of topics centred on AI, including the current use of AI to replace jobs in Chinese factories, illustrated simply by showing black screens of nothingness: AI doesn't and won't need the same amenities as us humans, such as running water, light and heat. It can run in darkness, 24/7, for long periods with little to no supervision. This is already happening!

So, unless you've been living under a rock (or are still busy destroying textile machines), you'll be aware that the uses of AI in employment are becoming more and more present and ubiquitous in our everyday lives and workplaces. Examples include the Chinese factories mentioned above, AI lawyers and driverless cars, as well as scores of other blue-collar work that follows a structured, more easily automated approach.

The questions you are probably asking yourself are: Is my job safe? Which jobs are likely to be automated first? Well, as Tomer and the panel this morning pointed out, no job is ultimately safe. This doesn't mean that this kind of change will happen overnight; we can expect it to take generations, but it is going to happen.

Tomer also spoke to a number of variables that estimate, in percentage terms, how likely a job area is to be automated when you look at the variance, predictability and structure of the job in question. A good comparison is a stock taker versus an animal specialist: one has a low-variance, highly predictable and structured job, the other doesn't, and applying AI & machine learning to the latter is much more difficult. This again doesn't mean it can't or won't happen, but it is less likely.

Related but not the same is a recent project by a data scientist who used a neural network (later posts will explain these) to code a basic HTML and CSS website from a picture of a design mockup. This shows that even web development can't escape automation 😉 Turning Design Mockups Into Code With Deep Learning

Is anybody actually ready for this change?

After the enlightening talk from Tomer this morning a panel of 4 experts in the field of Data Science & AI answered questions regarding:

  • The deployment of AI in a business setting – e.g. an AI junior lawyer being developed for a legal firm!
  • Political & economic issues – What will governments do when people’s jobs change – where will those people go? What is the government doing? With the current capitalist market, what regulation is required to protect workers whose jobs are lost and who can’t be re-skilled?
  • Psychological & philosophical issues – Should we really be targeting higher areas of Maslow’s Hierarchy of Needs?
  • Educational & governmental changes – Similar to above but more impactful, how will Educational systems need to change to prepare tomorrow’s students for a world changed by AI?

Whilst I still have cautious optimism about the use of AI in the world and the workplace, I feel that no one is yet ready to fully embrace the change it will bring. More importantly, I do not think the Australian government or educational system is ready for the economic changes that the rise of automation will bring, and is already starting to bring. I say this not for impact or sensationalism, but because it's only a matter of time until workplaces see the effects!

Further links for review about this topic:

A slightly different blog post today; I hope you've enjoyed reading it as much as I enjoyed attending the session and learning more about the advance of AI in the workplace and the world. If you're interested in the Data Science Breakfast meetup Sydney, it can be found here: Data Science Breakfast Meetup

 

 

 

Jupyter Notebooks – What? How? Why?

I’ve been using Jupyter notebooks for a few years now and whilst I don’t consider myself to be a ‘god like’ user of them, I do feel pretty comfortable with the innards of them.

First, there was a why

I find that asking why something is worth using is more powerful than just blathering on about how you can use it and what it does.

So, with that in mind, why should you look at using Jupyter notebooks? And what for?

  1. They are fantastic for exploratory and prototype development
  2. They are easy to share and collaborate on with others to show research, findings, analysis etc
  3. They can be a relatively common medium for data science work in teams that carry out data analysis and science

And as the reverse of this, what shouldn’t you really look to use them for?

  1. Developing production ready Python or R application systems – Bad idea!
  2. Front end applications or web facing applications
  3. As a replacement IDE to something like Visual Studio/PyCharm etc

Before Jupyter we have to deal with snakes..

So, before we start talking exclusively about what Jupyter notebooks are and how you can use them, I want to introduce a common Data Science software bundle: Anaconda. Anaconda essentially simplifies the use of Python and R for Data Science by bundling the common scientific packages you will most likely use whilst carrying out Data Science work.

It's available on Windows, macOS and Linux, and I would strongly recommend you check it out rather than installing a Python engine, R engine, Jupyter server and DB connectors separately. In addition, with the Anaconda distribution you get the Conda package & environment manager, which makes downloading more esoteric packages a piece of cake!
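As a quick illustration of Conda in action (the environment name and package list below are just examples), creating a clean environment and installing a few typical Data Science packages looks something like this:

# Create a new environment called "ds-sandbox" with Python 3
conda create -n ds-sandbox python=3

# Activate it and install some typical Data Science packages
conda activate ds-sandbox
conda install numpy pandas matplotlib jupyter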

That’s awesome, but I came here for Jupyter notebooks

Ok! Rather than paraphrase what Jupyter notebooks are, I've taken the liberty of using Jupyter.org's description of them:

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

http://jupyter.org/

Essentially, think of a Jupyter notebook as an interactive place to write Python, R or Julia code (Scala too, via additional kernels), write instructions in markdown, and display visualisations and results from machine learning, exploratory data analysis and everything else mentioned above. If you've used interactive Python (IPython) notebooks you'll definitely see the similarities.

As an example they look something like this:


Jupyter notebooks work around a number of ‘cells’ that you can run individually, in tandem or as a selection. This is generally great, but as you develop bigger notebooks you'll also need to consider things like effective pipeline design and the right order of these cells. As is mentioned in Reproducible Analysis Through Automated Jupyter Notebook Pipelines, they are also not great for repeated analysis runs. In other words, you can't easily automate the re-running of your analysis, which becomes more of a problem as the notebook begins to bloat.

So, how do I go about setting up & using Jupyter?

To use Jupyter you've got a few options:

  1. https://try.jupyter.org/ – A place to try out Jupyter in your browser
  2. Download and install your own Jupyter notebook server/machine – A bit more involved but not that difficult, and it gives greater flexibility.
  3. https://mybinder.org/ – I saw Binder for the first time the other day (shout out to data4democracy – Sydney and the guys!) and I have to say the concept is just awesome. The idea is that instead of running Jupyter locally on your own machine or using a remote notebook server, you let Binder build a Docker image of your GitHub repo and serve its notebooks as live Jupyter notebooks! One point to note about this offering is that it’s still in beta; however, if you’re not after a bullet-proof, production-ready way of hosting Jupyter notebooks, I’d check this one out.

The trial and Binder options need little extra explanation; however, I'll now show you how easy it is to install and set up Jupyter on your own machine to carry out your own data analysis, cleaning, visualisation, whatever you want Data Science-wise.

Also, from this point onwards I'll be assuming you're using Python 2 or 3, are on Windows (only because it's what I am using right now, I actually prefer macOS) and have it already installed via Anaconda or some other means. Ok, first:

Python 2:

python -m pip install --upgrade pip
python -m pip install jupyter

Python 3:

python3 -m pip install --upgrade pip
python3 -m pip install jupyter

Assuming no error messages you can now start the Jupyter notebook with:

jupyter notebook

Easy, huh? As an addition, I'd recommend you create some form of batch file/executable you can easily run or schedule, so that it starts the notebook server without you having to run the above command manually each time. Here is an example:

cd "C:\Locationtostorenotebooks"
jupyter notebook
start chrome "http://localhost:8888"

Gotchas

Now, Jupyter is pretty great; however, like any product it does have its gotchas. One is that sometimes you'll find the server won't let you ‘in’ without you supplying a token. To get around this you can list the running servers and their tokens:

$ jupyter notebook list
Currently running servers:
http://localhost:8888/?token=abc... :: /home/you/notebooks
https://0.0.0.0:9999/?token=123... :: /tmp/public
http://localhost:8889/ :: /tmp/has-password

Security in the Jupyter notebook server

You can then copy the token into the box that it prompts you with.

Another more trivial issue that I've heard of (but not yet seen) is that checking Jupyter notebooks into source control systems like Git can screw with the formatting of the notebook. This is something I think is easily avoided by using something like Binder, mentioned above.

Other funky Jupyter stuff you can do

So, you've got your Jupyter on and you've started writing your own notebooks to harness the power of Data Science and all the cool stuff. What's next? Well, the use of cell magics and other tips, as shown in 28 Jupyter Notebook tips, tricks, and shortcuts, is worth checking out. A couple of examples I use commonly are:

%matplotlib inline -- Activates matplotlib plotting inline within your notebook
%%time -- Useful for finding out how long a cell takes to run
%%writefile -- Export the contents of the cell to an external file

I don’t use this one currently but I definitely will be in future academic projects:

When you write LaTeX in a Markdown cell, it will be rendered as a formula using MathJax.
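As an example, putting something like the following into a Markdown cell will render as a neatly typeset equation (the formula itself, Bayes' rule, is purely an illustration):

$$ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} $$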

This should hopefully give you enough to get started!

Good luck and thanks for reading 🙂

Pt.2 of So, you want to be a Data Scientist?!

Hello!

In part one of this blog post I covered some starting areas, considerations and general resources (MOOCs, YouTube channels) for people generally interested in Data Science. There are heaps of Data Science resources and blogs to learn from out there, and I'd encourage you to read as many as you can by looking at sites such as Top 50 Data Science Blogs (you'll find some of the top sites there are blogs I follow too 🙂 )

That aside, let's look at podcasts, Meetups and further steps for the people who want to go deep with Data Science, and then look at how you can start tackling Data Science problems on your own machine today!

Podcasts

Podcasts should hopefully need no introduction; however, for those of you who are less audio attuned, they are an episodic series of audio or video files, about pretty much anything, which you can choose to download and/or stream – Podcast, Wikipedia.

Personally, I feel they are a brilliant way to hear new ideas and approaches, or learn something new, whilst you do something else. For me, I'll usually listen to Data Science/Freakonomics podcasts on the way to work with a coffee in hand :coffee: 😀 as listening and walking go together so seamlessly.

Here are some of my favourite and recommended Data Science podcasts:

  1. Partially Derivative – Possibly one of my favourite series of all time; it’s unfortunately stopped, however there are close to 3 years of episodes. Often funny, often sweary, always informative, with a healthy dose of alcohol during episodes.
  2. Linear Digressions – Running for around 2 years is Linear Digressions, a slightly more sober version of PD above that has buckets of decent content on varying DS topics. Katie & Ben are a great duo and come from some interesting backgrounds (Katie is a Physicist and has worked on the Udacity Machine learning MOOC)
  3. Talking Machines – I’ve only listened to a few of the episodes from this one but they are pretty content packed and can go quite deep, worth a listen!
  4. The Data Skeptic – One of the first podcasts I listened to regarding Data Science, a good start for beginners although the style of the podcast was not to my taste back in 2014/5

The above should give you a good breadth of listening material for some time, give it a shot when you’re next walking/commuting to work, hopefully you’ll enjoy them as much as I did 🙂

Meetups

If I were to give younger Dan some tips on how to develop a strong network of contacts to aid his Data Science or other tech experience, I wouldn't think twice about recommending Meetup.com! In the past five years in Sydney, Australia, I've been to a variety of tech & DS meetups, and it's an absolutely fantastic way not only to build out your network but also to meet other like-minded individuals who share your passion. Some good Data Science meetups in the Sydney area are, for example:

  1. Data Science Sydney
  2. Data Science Breakfast Meetup
  3. Minerva Collective – I’m biased but a great bunch of brainy people doing good with Data
  4. Data for Democracy – Sydney
  5. Sydney Users of R (SURF)

I’m sure there are others but these are all ones I’ve had the pleasure of attending.

‘Actual’ Data Science Areas

So, we've been through a myriad of podcasts, meetups, resources etc., but if you had no time deadlines and simply wanted to go deep with Data Science/Machine Learning et al., what would you need to know to practise it effectively? I'm still learning this (and probably will be for a few decades to come!) but I would strongly recommend researching the following (in no particular order):


Linear Algebra, Khan Academy

  • Linear Algebra – Read this blog by Jason Brownlee to understand the relative importance of Linear Algebra in Machine learning/Data Science. It’s not crucial to know linear algebra for the running of a ML model, but it will certainly help for choosing the right algorithm for your problem and/or tuning your model when things can go south. Things like Singular Value Decomposition and Principal Component Analysis are heavily rooted in Linear Algebra, as is the understanding of higher dimensions and how to perform operations on them. A book worth checking out for this is by the legendary Gilbert Strang
  • Probability Theory – Also integral to portions of Data Science & Machine Learning is Probability Theory. Towards Data Science – Probability Theory summarises this meaty topic pretty well. I won’t spend any more time badly explaining it!

A conditional probability related to weather – In this example: Probability of rain occurring, given a sunny day.

  • Software Engineering – At some point as a Data Scientist you’ll need to write code. Whether this is Python, R or something else, it’s key to have a good grasp of Software Engineering areas such as:
    • Version control
    • Unit testing
    • Modular programming
    • API design/creation
    • Taking an idea from Development to Production (this is something a decent Data Science team would have Data Engineers assist with but not everyone is this lucky)
  • Statistics & the Scientific Method – Ok, saying Statistics is a bit of a catchall, I’ll admit, but knowing when to perform a Student’s t-test over an Analysis of Variance (ANOVA) for samples of data is just one example of when knowledge of Statistics is useful. What about if we were looking to determine the value of a parameter of a population? Would you know that bootstrapping/resampling is one way to achieve this (there’s a short sketch after this list)? What about how cross-validation works? I could go on..

Wikipedia, Scientific method

  • The scientific method is absolutely key to performing – in my opinion – robust Data Science. I’ve purposely called this out because having a business/research problem and formulating a testable hypothesis is something that is generally missed when you learn Data Science from MOOCs or Stack Overflow. By doing this you’re actually doing the ‘Science’ in Data Science!
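As promised above, here's a minimal sketch of the bootstrap idea: estimating a 95% confidence interval for a population mean by resampling with replacement. The sample data here is randomly generated purely for illustration:

import numpy as np

np.random.seed(42)

# Pretend this is our observed sample (illustrative data only)
sample = np.random.normal(loc=50, scale=10, size=200)

# Bootstrap: resample with replacement many times and record the statistic of interest
boot_means = [np.random.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(10000)]

# An approximate 95% confidence interval for the population mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print("Sample mean: %.2f, 95%% CI: (%.2f, %.2f)" % (sample.mean(), lower, upper))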

This blog post has gone on a bit longer than I originally intended, so next time we'll cover how to get Jupyter & Python running on your machine, possibly even using Binder with Git! I'm also keen to get your thoughts on the length of these posts, so please drop a comment if you like!

Until next time, thanks for reading 🙂

So, you want to be a Data Scientist?! – Pt.1

In my previous blog post (which you can read here) I linked to a 2012 HBR article that declares the role of the Data Scientist ‘The Sexiest Job of the 21st Century’. Leaving the rhetoric of this statement aside, if there is some truth in the title, how does one go about becoming a Data Scientist?

Before we can answer this I'd like to emphatically state that there is no single route or pathway to becoming a good Data Scientist. I myself am still learning the vast area of Data Science and will be for many years to come. I come from a more unusual background (database design, administration and development) than many other ‘true’ Data Scientists. What I mean is that everybody is different, with different skills meeting the needs of the job market at different times, and learning data science at different points. This means you may have good skills at a particularly sub-optimal time, and of course the opposite can be true too. I've met many a student studying a combined Comp Sci & Statistics degree and have to say that this is a very powerful combination.

Anyway, everyone knows or has heard the phrase that a picture paints a thousand words, and I feel the below Venn diagram does the job of explaining the varying areas of a Data Scientist pretty well:

 


[https://www.kdnuggets.com/2016/10/battle-data-science-venn-diagrams.html]

It’s not new but still summarises the key areas for a successful|useful|not winging it Data Scientist.

Specifically, you ideally need to be experienced in several areas to carry out solid data science work. One of my favourite sayings regarding the role of a Data Scientist is:

“Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.” Josh Wills – Twitter.

Why solid? Well, how do you deal with data drift or unforeseen changes in a production environment with the model you’ve deployed? What if the distribution of the data you’re dealing with changes over time in such a way that the algorithm you’re applying is no longer relevant or accurate (e.g. your data is no longer normally distributed)? How do you even check this? (See the small sketch after this paragraph.) How can you ensure that the model you’ve deployed is robustly tested and doesn’t, by some freak of nature or statistics, achieve a 95% accuracy/F1 score from a very, very lucky sample of data? I could go on. The point of the above diagram is that, ideally, as an expert Data Scientist you should have solid software engineering skills and solid maths & statistical knowledge, backed up by in-depth substantive expertise in a given area of interest.
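To make one of those questions concrete, checking whether production data still looks like the data you trained on can be as simple as a two-sample test. This is only a rough sketch using made-up arrays, not a recipe:

import numpy as np
from scipy import stats

np.random.seed(0)

# Hypothetical values of one feature at training time vs. in production
train_feature = np.random.normal(loc=0.0, scale=1.0, size=1000)
prod_feature = np.random.normal(loc=0.5, scale=1.2, size=1000)

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the distribution has drifted
statistic, p_value = stats.ks_2samp(train_feature, prod_feature)
print(statistic, p_value)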

I make this point because throughout the years I've been learning data science I've found it all too easy to use off-the-shelf packages from Python or R and essentially go nuts with machine learning models to get a result that looks good (see the danger zone! above). I've seen it in my colleagues' work (and I've been guilty of it myself) and it's all too easy to do.

Anyway, before I step further on to my mini Data Science soapbox and continue this rather lame diatribe how do you get started? Well, it’s simple really.

First off, start by asking yourself what you enjoy doing with data and its analysis. Because Data Science takes a while to learn, you need to enjoy it! There is little point picking the area up without considering what you would like to achieve from it, or knowing what you already enjoy in it. It is, however, only natural not to know where you want to start, and that's absolutely OK!

But all the job descriptions say you need a Masters?

Before we look at the options for courses and resources you can read, study and try, I'd also like to make one other point: it is not absolutely essential to do a Masters degree in Data Science. Many (if not all) jobs on the market these days ask for a Masters or PhD in Data Science, and that's fair enough; however, if you are able to show your learnings and achievements through GitHub or Kaggle, that can go some way. I'm not suggesting that this alone will replace 1-4 years of postgraduate study, but if someone presented me with a resume for a Data Science role with the candidate placing well in Kaggle, I'd definitely take a further look!

As of writing this post I'm a third of the way through my MSc in Data Science, and whilst I thoroughly enjoy it, there are some strong reasons why some people may not take this option and I think it's only fair to share them. Specifically:

  1. The cost – In Australia, at the University of Sydney, each subject is around $4k AUD, and you essentially need 6 subjects plus a capstone (2 units), over 1 year full time or 2-4 years part time
  2. The relative rigidity/irrelevance of teaching – This is a tough one to stay on top of for universities. Data Science is in a golden age at the moment and new deep learning/machine learning algorithms are being created or updated daily.
  3. The bureaucracy/poor organisation of university – Trust me, this can be somewhat of an issue and a potential waste of your time.

My choice for taking the MSc was pretty clear. I currently lead a team of engineers and analysts at PwC that focuses on Data Analytics; however, I really wanted to know the low-level nuts and bolts of Data Science for when I lead teams of other Data Scientists in the future. I wouldn't personally feel comfortable leading others without knowing something of the subject area myself. This won't be the same for others, but this was my decision.

Starting options

This blog post is pretty lengthy now and part 2 will be coming soon, however, what options do you have immediately available to get you started with Data Science?

  1. Massive Open Online Courses (MOOCs) – MOOCs can get a bad rep from some hardcore Data Scientists (sometimes with good reason, the content can vary); however, if you are brand new to Data Science they can be a good start. I’ve completed or been through many of the below ones:
  2. Kaggle – Kaggle.com not only hosts some pretty cool Data Science competitions that you can earn REAL MONEY in (typically more advanced projects, mind you), but also has a host of datasets you can learn to apply a variety of Data Science techniques to
  3. Siraj Raval – YouTube – Siraj Raval – Siraj Raval definitely has a ‘unique’ sense of humour (which I love!) and his videos are not only entertaining but interesting and loaded with ideas. Worth looking at.
  4. Mathematical Monk – YouTube – Mathematical Monk – We’ve been quite light on Math so far in this blog post, it’ll come more in Pt.2 however, this channel is great for looking more at the Math behind Data Science.

So, to summarise: No, you don't need a Masters in Data Science to be a Data Scientist (although it will help), but you do need bags of enthusiasm, patience and a general love of data, maths/stats and coding to keep you moving along the Data Science pathway 🙂

In Part 2 I'll be picking up where we left off here with some AWESOME podcasts for those who are audio champions and like listening to data, as well as next steps for intermediate Data Scientists out there, how useful Meetups are for networking, and what you need to start doing Data Science on your own machine!

Until next time, thanks for reading 🙂