I’ve been using Jupyter notebooks for a few years now, and whilst I don’t consider myself a ‘god-like’ user of them, I do feel pretty comfortable with their innards.
First, there was a why
I find that asking why something is worth using is more powerful than just blathering on about how you can use it and what it does.
So, with that in mind, why should you look at using Jupyter notebooks? And what for?
- They are fantastic for exploratory and prototype development
- They are easy to share and collaborate on with others to show research, findings, analysis, etc.
- They can serve as a common medium for teams that carry out data analysis and data science work
And as the reverse of this, what shouldn’t you really look to use them for?
- Developing production ready Python or R application systems – Bad idea!
- Front end applications or web facing applications
- As a replacement for a full IDE like Visual Studio/PyCharm etc.
Before Jupyter, we have to deal with snakes…
So, before we start talking exclusively about what Jupyter notebooks are and how you can use them, I want to introduce a common Data Science software bundle: Anaconda. Anaconda essentially simplifies the use of Python and R for Data Science by bundling a load of the common scientific packages you will most likely use whilst carrying out Data Science work.
It’s available on Windows, Mac OS and Linux, and I would strongly recommend you check it out rather than having to install a Python engine, R engine, Jupyter server and DB connectors separately. In addition, with the Anaconda distribution you get the Conda package & environment manager, which makes downloading more esoteric packages a piece of cake!
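As a quick, hedged illustration (the environment name and package list here are just examples, not anything the distribution mandates), creating an isolated Conda environment with a few common data-science packages looks like this:

```
:: Create an isolated environment (named "ds-env" here) with some common packages
conda create -n ds-env python=3 pandas scikit-learn jupyter

:: Switch into it before working
conda activate ds-env
```

Keeping each project in its own environment like this means package upgrades for one piece of work can’t break another.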
That’s awesome, but I came here for Jupyter notebooks
Ok! Rather than paraphrase what Jupyter notebooks are, I’ve taken the liberty of using Jupyter.org’s description of them:
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Essentially, think of a Jupyter notebook as an interactive place to write Python, R or Julia code (many other languages, including Scala, are supported via community kernels), write instructions in markdown, and display visualisations and results from machine learning and exploratory data analysis, as well as everything else described above. If you’ve used IPython notebooks you’ll definitely see the similarities.
As an example they look something like this:
Jupyter notebooks are built around a number of ‘cells’ that you can run individually, in tandem or as a selection. This is generally great, but as you develop bigger notebooks you’ll also need to consider things like effective pipeline design and the right order of these cells. As is mentioned in Reproducible Analysis Through Automated Jupyter Notebook Pipelines, they are also not great for subsequent analysis runs. In other words, you can’t easily automate the re-running of your analysis, which becomes more of a problem as the notebook begins to bloat.
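To see why cell order matters, here is a plain-Python sketch. The “cells” are just functions mutating one shared namespace, which is roughly how all the cells in a notebook share a single kernel; the variable names are made up for illustration:

```python
# Each "cell" is a function mutating shared interpreter state,
# mimicking how notebook cells all share one kernel namespace.
state = {}

def cell_1():
    state["threshold"] = 10

def cell_2():
    # a later cell that overwrites the same variable
    state["threshold"] = 99

def cell_3():
    return state["threshold"]

# Run top to bottom: cell_3 sees the value set by cell_2
cell_1(); cell_2()
print(cell_3())  # 99

# Re-run an earlier cell out of order and cell_3 now sees different state,
# even though no code changed
cell_2(); cell_1()
print(cell_3())  # 10
```

Because results depend on execution order like this, a “Restart & Run All” before sharing a notebook, or automating a full top-to-bottom run with something like `jupyter nbconvert --to notebook --execute`, is a good habit.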
So, how do I go about setting up & using Jupyter?
To use Jupyter you’ve got a few options:
- https://try.jupyter.org/ – A place to try out Jupyter in the browser
- Download and install your own Jupyter notebook server/machine – A bit more involved but not that difficult and gives greater flexibility.
- https://mybinder.org/ – I saw Binder for the first time the other day (shout out to data4democracy – Sydney and the guys!) and I have to say the concept is just awesome. The idea is that instead of running Jupyter locally on your own machine or using a remote notebook server, Binder builds a Docker image of your GitHub repo and serves the notebooks in it as live, interactive Jupyter notebooks. One point to note is that this offering is still in beta; however, if you’re not after a bullet-proof, production-ready way of hosting Jupyter notebooks, I’d check this one out.
The trial and Binder options need little extra explanation; however, I’ll now show you how easy it is to install and set up Jupyter on your own machine to carry out your own data analysis, cleaning, visualisation, whatever you want Data Science-wise.
Also, from this point onwards I’ll be assuming you’re using Python 2 or 3, are on Windows (only because it’s what I am using right now; I actually prefer Mac OS) and have it already installed via Anaconda or some other means. Ok, first:
For Python 2:

python -m pip install --upgrade pip
python -m pip install jupyter

For Python 3:

python3 -m pip install --upgrade pip
python3 -m pip install jupyter
Assuming no error messages, you can now start the Jupyter notebook server with:

jupyter notebook
Easy, huh? As an addition, I’d recommend you create some form of batch file/executable you can easily run or schedule, so that it starts the notebook server without you having to type the above command manually each time. Here is an example:
cd "C:\Locationtostorenotebooks"
:: "start" launches the server in its own window so the next line still runs
start jupyter notebook
start chrome "http://localhost:8888"
Now, Jupyter is pretty great, however, like any product it does have its gotchas. One is that sometimes you’ll find that the server won’t let you ‘in’ without supplying a token for the server. To get around this you can do the following:
$ jupyter notebook list
Currently running servers:
http://localhost:8888/?token=abc... :: /home/you/notebooks
https://0.0.0.0:9999/?token=123... :: /tmp/public
http://localhost:8889/ :: /tmp/has-password
You can then copy the token into the box that it prompts you with.
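If it’s a local, single-user machine and you’d rather not deal with tokens at all, you can also disable them in the notebook config file (generate one with `jupyter notebook --generate-config`). The trade-off is that anyone who can reach the port gets straight in, so treat this as a local-only convenience:

```
# In your jupyter_notebook_config.py
# WARNING: disables token auth entirely; only safe on a trusted local machine
c.NotebookApp.token = ''
```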
Another more trivial issue that I’ve heard of (but not yet seen) is that checking Jupyter notebooks into source-control repos like Git can make a mess of diffs and merges: a notebook is stored as a single JSON file with the cell outputs and execution counts embedded, so the file changes even when your code doesn’t. This is something I think can be sidestepped by using something like Binder, as above.
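To see where those noisy diffs come from, here’s a sketch of a minimal `.ipynb` file built by hand (the cell contents are made up, but the field names follow the nbformat 4 layout): code, outputs and the execution counter all live in the same JSON document.

```python
import json

# A minimal notebook in nbformat 4: source and outputs share one file
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [{
        "cell_type": "code",
        "execution_count": 1,
        "metadata": {},
        "source": ["print(1 + 1)"],
        "outputs": [
            {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
        ],
    }],
}

# Serialising shows it really is just one JSON document on disk
as_json = json.dumps(notebook, indent=1)

# Each code cell stores its outputs and execution_count alongside the source,
# so merely re-running a cell changes the file
print(sorted(notebook["cells"][0]))
# → ['cell_type', 'execution_count', 'metadata', 'outputs', 'source']
```

Tools such as `nbstripout` exist to strip outputs before committing, if the diff noise becomes a problem.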
Other funky Jupyter stuff you can do
So, you’ve got your Jupyter on and you’ve started writing your own notebooks to harness the power of Data Science and all the cool stuff. What’s next? Well, the use of cell magics and other tips, as shown in 28 Jupyter Notebook tips, tricks, and shortcuts, is worth checking out. A couple of examples I use commonly are:
%matplotlib inline -- This renders your matplotlib plots inline in the notebook itself rather than in a separate window
%%time -- Useful for finding out how long a cell block takes to run
%%writefile -- Export the contents of the cell to an external file
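As a sketch of how these cell magics are used, `%%time` goes on the very first line of a cell and reports how long the rest of the cell took once it finishes (the body here is just a throwaway workload I made up):

```
%%time
# a throwaway workload; the cell reports wall and CPU time when it completes
total = sum(i * i for i in range(1_000_000))
```

`%%writefile some_file.py` works the same way positionally, except the cell body is written out to the named file instead of being executed.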
I don’t use this one currently, but I definitely will be using it in future academic projects:
When you write LaTeX in a Markdown cell, it will be rendered as a formula using MathJax.
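For example, putting the following in a Markdown cell (the formulas are just samples) renders as typeset mathematics, with `$...$` for inline maths and `$$...$$` for display maths:

```
The estimator is $\hat{\beta} = (X^T X)^{-1} X^T y$, and display maths works too:

$$e^{i\pi} + 1 = 0$$
```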
This should hopefully give you enough to get started!
Good luck and thanks for reading 🙂