Sunday, March 5, 2017

This blog will not be updated anymore

From now on I will be posting about data science, Python, R, statistics, etc. on a new blog: Astronomical Data Science. This blog will not be updated anymore.

Monday, February 27, 2017

Calling R from inside a Jupyter notebook

I mostly use Python + Jupyter in my research workflow, which in this case is astronomical data science. One challenge I have been facing recently is dealing with time series. Some of the routines I need were written in R. There are Python versions of them, but they seem to be inferior. The question then is: is it possible to easily interface with R functions from inside a Jupyter Python notebook? The answer is, amazingly, yes!

To illustrate how easy this is, I created a Jupyter notebook, available on Gist. This notebook demonstrates how to:

  1. generate some simple mock data with python/numpy
  2. import that data into R
  3. perform a linear fit using R's methods and load the results back to python
  4. plot the R fit with python
This may sound complicated but it really isn't.
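
The notebook itself is not reproduced here. As a rough sketch of the first three steps outside a notebook, the snippet below uses only the Python standard library and shells out to `Rscript`, which is not necessarily the bridge the notebook uses. The helper names (`make_mock_data`, `fit_with_r`), the CSV hand-off and the small R script are all assumptions made for illustration, and plotting is left out. If `Rscript` is not on the PATH, the fit is simply skipped.

```python
import csv
import os
import random
import shutil
import subprocess
import tempfile

# Hypothetical R script: read a CSV with columns x,y, fit y ~ x with lm(),
# and print the intercept and slope on one line.
R_SCRIPT = """
args <- commandArgs(trailingOnly = TRUE)
d <- read.csv(args[1])
fit <- lm(y ~ x, data = d)
cat(coef(fit), sep = " ")
"""


def make_mock_data(n=50, slope=2.0, intercept=1.0, noise=0.5, seed=42):
    """Step 1: mock data scattered around a straight line."""
    rng = random.Random(seed)
    xs = [10.0 * i / (n - 1) for i in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def fit_with_r(xs, ys):
    """Steps 2-3: hand the data to R, fit, and read the coefficients back.

    Returns (intercept, slope), or None if Rscript is not installed.
    """
    if shutil.which("Rscript") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        data_path = os.path.join(tmp, "data.csv")
        script_path = os.path.join(tmp, "fit.R")
        with open(data_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["x", "y"])
            writer.writerows(zip(xs, ys))
        with open(script_path, "w") as f:
            f.write(R_SCRIPT)
        out = subprocess.run(
            ["Rscript", script_path, data_path],
            capture_output=True, text=True, check=True,
        ).stdout
    intercept, slope = (float(v) for v in out.split())
    return intercept, slope


if __name__ == "__main__":
    xs, ys = make_mock_data()
    print(fit_with_r(xs, ys))
```

With R installed, the printed pair should be close to the intercept and slope used to generate the mock data; the returned numbers can then be plotted from Python with any plotting library.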

Friday, February 19, 2016

How to set up an IPython parallel cluster in your LAN via SSH

It has been a long hiatus since I last posted anything here so it is time to get back.

Today I will describe the following scenario: you have two or more machines (Linux boxes or OS X) available in your LAN, and you would like to harness the power of their CPUs to perform parallel computing with Python. This is possible with IPython parallel and there are several ways to accomplish it.

I will describe the steps required to configure a private IPython parallel cluster in your LAN using SSH. If everything works well, this should take about 30 minutes to 1 hour, depending on your level of experience.

1. Create an IPython parallel profile:

ipython profile create --parallel --profile=ssh

2. Edit the config file in .ipython/profile_ssh/ in your home directory.

Specify the number of hosts and cores to be used:

c.SSHEngineSetLauncher.engines = {
 'macnemmen' : 4,
 'pcnemmen' : 8,
 'pcraniere' : 4,
}
where you specify the appropriate names of the machines in your LAN.

Specify the IP of the controller (the main machine you use to launch ipcluster):

c.LocalControllerLauncher.controller_args = [""]
where you make sure the IP is correct (the argument is normally of the form --ip=<controller IP address>).

3. Make sure Python, Jupyter, IPython and ipyparallel are installed on each computer.

In my case, I use the Anaconda distribution on all machines.

4. Set up SSH

Create a .ssh/config file such that all hosts and corresponding usernames that will run the servers are conveniently referenced.

Set up passwordless SSH login from the controller machine to each client machine.
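
Before moving on, it can help to confirm that key-based login really works for every host. The following is a minimal sketch, assuming the host aliases from your .ssh/config match the machine names used in step 2. The helper names `ssh_check_command` and `can_login` are made up for this example. It relies on the standard OpenSSH options `BatchMode` and `ConnectTimeout`, so a host that would ask for a password fails instead of prompting.

```python
import subprocess

# Hypothetical host names; replace with the aliases from your ~/.ssh/config.
HOSTS = ["macnemmen", "pcnemmen", "pcraniere"]


def ssh_check_command(host, timeout=5):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",              # never ask for a password
        "-o", f"ConnectTimeout={timeout}",  # give up quickly on dead hosts
        host,
        "true",                             # trivial remote command
    ]


def can_login(host, timeout=5):
    """Return True if a key-based login to `host` succeeds."""
    try:
        result = subprocess.run(
            ssh_check_command(host, timeout),
            capture_output=True,
            timeout=timeout + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Only print the commands here; call can_login(host) for a live check.
    for host in HOSTS:
        print(" ".join(ssh_check_command(host)))
```

Running `can_login(host)` from the controller for each host should return True everywhere before you try to launch the engines.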

5. Create a common alias in each client host pointing to the engine launcher binary: 

This prevents the clients from failing to find the binary when it is in a nonstandard location,
e.g. create the alias at /opt/ipengine

6. Edit the config file in .ipython/ so that it points to the launcher alias

c.SSHEngineSetLauncher.engine_cmd = ['/opt/ipengine']

7. Launch the engines on all machines in your "cluster"

ipcluster start --profile='ssh' --debug

Testing the engines

Now it is time to test if the engines were launched successfully.

Test that they are active in IPython:

import ipyparallel
ipyparallel.Client(profile='ssh').ids

The output of the last command should be a list with the number of elements matching the number of engines you launched. Otherwise, something went wrong.
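
To make "matching the number of engines you launched" concrete, here is a small sketch that compares the length of that list with the total implied by the engines dictionary from step 2. The function names are invented for the example, and the `ids` list is a hard-coded stand-in (an assumption) for what a live cluster would report.

```python
# Same structure as c.SSHEngineSetLauncher.engines: host name -> number of engines.
engines = {"macnemmen": 4, "pcnemmen": 8, "pcraniere": 4}


def expected_engine_count(engines):
    """Total number of engines requested across all hosts."""
    return sum(engines.values())


def missing_engines(engines, ids):
    """How many engines failed to register (0 means all are up)."""
    return expected_engine_count(engines) - len(ids)


# Stand-in for the list of engine ids reported by the client (an assumption:
# in a real session this comes from the running cluster, not a literal).
ids = list(range(16))
print(missing_engines(engines, ids))
```

A nonzero result tells you how many engines did not come up, which is a hint to look at the --debug output of ipcluster for the affected host.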

Don't forget that the configuration files are located in .ipython/profile_ssh.

To learn how to use such a cluster in practice for CPU-intensive tasks, you can read this tutorial.


Thursday, October 16, 2014

Parallel computing with IPython and Python

I uploaded a quick tutorial to GitHub on how to parallelize simple computing tasks. I have chosen embarrassingly parallel examples which illustrate some of the powerful features of IPython.parallel and the multiprocessing module.

Examples included:

  1. Parallel function mapping to a list of arguments (multiprocessing module)
  2. Parallel execution of array function (scatter/gather) + parallel execution of scripts
  3. Easy parallel Monte Carlo (parallel magics)
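
The tutorial itself is on GitHub. As a flavor of what an embarrassingly parallel task looks like, here is a minimal sketch that combines the first and third items: a Monte Carlo estimate of pi in which a function is mapped over a list of arguments with `multiprocessing.Pool`. It is not taken from the tutorial; the function names and the choice of problem are assumptions, and it uses the multiprocessing module rather than the IPython parallel magics.

```python
import random
from multiprocessing import Pool


def count_hits(args):
    """Count random points in the unit square that fall inside the quarter circle."""
    n_samples, seed = args
    rng = random.Random(seed)  # one independent, seeded generator per task
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(hit_counts, samples_per_task):
    """Combine the per-task counts into an estimate of pi."""
    total = samples_per_task * len(hit_counts)
    return 4.0 * sum(hit_counts) / total


if __name__ == "__main__":
    samples_per_task = 100_000
    tasks = [(samples_per_task, seed) for seed in range(8)]
    with Pool(processes=4) as pool:
        # Each (n_samples, seed) tuple is handled by whichever worker is free.
        hit_counts = pool.map(count_hits, tasks)
    print(estimate_pi(hit_counts, samples_per_task))
```

The tasks never talk to each other, which is what makes the problem embarrassingly parallel: the only communication is sending out the arguments and collecting the counts.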

Parallel computing with Python. 

Please stop using colormaps which don't translate well to grayscale

I recently printed a paper with very nice results in B&W but the color images simply did not make sense when printed in grayscale (here is the paper if you are curious). Why? Not the best choice of colormap. Jake Vanderplas reminded me of this issue with his very nice blog post.

Please don't choose colormaps for the images in your paper which do not translate well when printed in grayscale.
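
One rough way to see the problem is to convert a color ramp to grayscale and check whether the gray values still increase or decrease steadily. The sketch below does this with the standard library only. The two ramps are hand-made stand-ins (assumptions), not actual plotting-library colormaps, and the grayscale conversion uses the common Rec. 601 luma weights as an approximation of what a B&W printer does.

```python
import colorsys


def luminance(rgb):
    """Approximate grayscale value of an RGB color (Rec. 601 luma weights)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def is_monotonic(values):
    """True if the sequence never changes direction."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def rainbow(n):
    """Rainbow-like ramp: sweep the hue from blue to red at full saturation."""
    return [
        colorsys.hsv_to_rgb((2 / 3) * (1 - i / (n - 1)), 1.0, 1.0)
        for i in range(n)
    ]


def sequential(n, start=(0.0, 0.0, 0.3), end=(1.0, 1.0, 0.8)):
    """Simple sequential ramp: linear blend from a dark to a light color."""
    return [
        tuple(s + (e - s) * i / (n - 1) for s, e in zip(start, end))
        for i in range(n)
    ]


for name, ramp in [("rainbow", rainbow(9)), ("sequential", sequential(9))]:
    gray = [luminance(c) for c in ramp]
    print(name, "monotonic in grayscale:", is_monotonic(gray))
```

The rainbow-like ramp gets brighter and darker several times along its length, so two very different data values can print as the same gray; the sequential ramp does not have this problem.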

Check out Nature's advice on color coding. Also check out the advice here.

Feel free to suggest other useful references about this issue in the comments.

Wednesday, October 8, 2014

Python Installation instructions (including IPython / IPython Notebook)

This page describes how to install Python and the other packages (Numpy, Scipy, IPython, IPython Notebook, Matplotlib) required for the course on Mac OS X, Linux and Windows.


In Linux, the installation instructions are pretty straightforward. Assuming that you are running Debian or Ubuntu, you just need to execute the following command in the terminal:

sudo apt-get install python-numpy python-scipy python-matplotlib ipython-notebook

Fedora users can use the yum tool instead.

Mac OS X, Linux, Windows

We recommend downloading and installing the Anaconda Python distribution. The installation instructions are available here.

Just download the installer and execute it with bash.

Anaconda includes most of the packages we will use and it is pretty easy to install additional packages if required, using the conda or pip command-line tools.
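
Whichever installation method you use, a quick way to confirm that the packages are visible to your Python interpreter is to ask the import system whether it can find them. This is a minimal sketch using only the standard library; the `REQUIRED` list and the helper name `missing_packages` are assumptions based on the packages named at the top of this page.

```python
import importlib.util

# Import names of the packages listed at the top of this page.
# (The list is an assumption: adjust it to whatever the course actually needs.)
REQUIRED = ["numpy", "scipy", "matplotlib", "IPython"]


def missing_packages(names):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

Run it with the same interpreter you plan to use for the course; if several Python installations coexist on the machine, each one has its own set of packages.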

If the above two methods do not work for OS X

The MacPorts way

You can try installing everything using MacPorts. First download and install MacPorts and then issue the following command in a terminal:

sudo port install py27-zmq py27-tornado py27-nose

The above dependencies are required in order to run IPython notebook. Then run:

sudo port install py27-numpy py27-matplotlib py27-scipy py27-ipython

The advantage of this method is that it is easy to do. The downsides:

  • It can take a couple of hours to finish the installation depending on your machine and internet connection, since MacPorts will download and compile everything as it goes.
  • If you like having the bleeding edge versions, note that it can take a while for them to be released on MacPorts.
  • Finally, MacPorts can create conflicts between different Python interpreters installed in your system.

Using Apple’s Python interpreter and pip

If you feel adventurous, you can use Apple’s built-in Python interpreter and install everything using pip. Please follow the instructions described in this blog.

If you run into trouble

Leave a comment here with the issue you found.

Wednesday, August 27, 2014

Distributed arrays for parallel applications

I recently came across a very promising module: DistArray. The idea behind DistArray is to provide general multidimensional NumPy-like distributed arrays to Python. It intends to bring the strengths of NumPy to data-parallel high-performance computing.

Some examples for easily creating distributed arrays are given in this IPython notebook.
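
To give a feel for what "distributed" means here, the sketch below splits a one-dimensional array into contiguous blocks, one per process, and combines per-block results. It does not use DistArray or its API; it is a serial, pure-Python illustration of a block distribution, and the function names are invented for the example.

```python
def block_ranges(n_items, n_procs):
    """Split range(n_items) into n_procs contiguous, nearly equal blocks."""
    base, extra = divmod(n_items, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)  # first `extra` blocks get one more
        ranges.append((start, start + size))
        start += size
    return ranges


def distributed_sum(data, n_procs):
    """Each 'process' sums its own block; the partial sums are then combined."""
    partial = [sum(data[lo:hi]) for lo, hi in block_ranges(len(data), n_procs)]
    return sum(partial)


data = list(range(10))
print(block_ranges(len(data), 4))
print(distributed_sum(data, 4))
```

In a real distributed array each block would live in the memory of a different process, and only the small partial results would travel over the network.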

Unfortunately I have not been able to test DistArray so far because I am getting weird errors on my system, probably related to problems with my MPI installation.