
Friday, February 19, 2016

How to setup an IPython parallel cluster in your LAN via SSH

It has been a long hiatus since I last posted anything here, so it is time to get back to it.

Today I will describe the following scenario: you have two or more machines (Linux boxes or OS X) available in your LAN, and you would like to harness the power of their CPUs to perform parallel computing with Python. This is possible with IPython parallel, and there are several ways to accomplish it.

I will describe the steps required to configure a private IPython parallel cluster in your LAN using SSH. If everything works well, this should take about 30 minutes to one hour to complete, depending on your level of experience.

1. Create an IPython parallel profile with the following command:


ipython profile create --parallel --profile=ssh


2. Edit the config file .ipython/profile_ssh/ipcluster_config.py in your home directory.


Specify the number of hosts and cores to be used:

c.SSHEngineSetLauncher.engines = {
 'macnemmen' : 4,
 'pcnemmen' : 8,
 'pcraniere' : 4,
}
where you specify the appropriate names of the machines in your LAN.

Specify the IP of the controller (the main machine you use to launch ipcluster):

c.LocalControllerLauncher.controller_args = ["--ip=xx.xxx.x.xxx"]
where you make sure the IP is correct.

3. Make sure python, jupyter, ipython and ipyparallel are installed in each computer.


In my case, I use the Anaconda distribution in all machines.

4. Setup SSH


Create a ~/.ssh/config file so that all hosts, and the corresponding usernames that will run the engines, are conveniently referenced.

Set up passwordless SSH login from the controller machine to each client machine.
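
For example, a minimal ~/.ssh/config sketch (the host names, usernames and IPs below are placeholders, matching the machine names used in the ipcluster_config.py example above):

Host pcnemmen
    HostName 192.168.0.10
    User nemmen

Host pcraniere
    HostName 192.168.0.11
    User raniere

Passwordless login can then be set up with the standard ssh-keygen / ssh-copy-id pair:

ssh-keygen              # only if you do not have a key pair yet
ssh-copy-id pcnemmen
ssh-copy-id pcraniere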

5. Create a common symlink on each client host pointing to the engine launcher binary:


This avoids the clients failing to find the binary when it lives in a nonstandard location,
e.g. create the symlink at /opt/ipengine.
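
For example, assuming Anaconda lives under ~/anaconda on each client (adjust the path to your installation):

sudo ln -s ~/anaconda/bin/ipengine /opt/ipengine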

6. Edit the config file .ipython/profile_ssh/ipcluster_config.py, pointing to the launcher symlink:


c.SSHEngineSetLauncher.engine_cmd = ['/opt/ipengine']


7. Launch the engines on all machines in your "cluster":


ipcluster start --profile='ssh' --debug


Testing the engines


Now it is time to test if the engines were launched successfully.

Test that they are active in IPython:

import ipyparallel
c=ipyparallel.Client(profile='ssh')
c.ids

The output of the last command should be a list with the number of elements matching the number of engines you launched. Otherwise, something went wrong.

Don't forget that the configuration files are located in .ipython/profile_ssh.

To learn how to use such a cluster in practice for CPU-intensive tasks, you can read this tutorial.
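
As a minimal sketch of what that looks like (slow_square below is just a stand-in for your actual computation), once the client sees the engines you can farm out work with a parallel map:

import ipyparallel

c = ipyparallel.Client(profile='ssh')
dview = c[:]   # direct view on all engines

def slow_square(x):
    import time
    time.sleep(1)   # pretend this is an expensive computation
    return x**2

# the calls are distributed among all the engines in the cluster
results = dview.map_sync(slow_square, range(32))
print(results[:5])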



Thursday, October 16, 2014

Please stop using colormaps which don't translate well to grayscale

I recently printed a paper with very nice results in B&W but the color images simply did not make sense when printed in grayscale (here is the paper if you are curious). Why? Not the best choice of colormap. Jake Vanderplas reminded me of this issue with his very nice blog post.

Please don't choose colormaps for the images in your paper which do not translate well when printed in grayscale.
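
A quick way to check a figure before submitting, in the spirit of Jake's post (a sketch; the luminance weights are the standard Rec. 601 values):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

def grayscale_version(cmap_name):
    """Returns a grayscale version of a named colormap, based on luminance."""
    cmap = plt.get_cmap(cmap_name)
    colors = cmap(np.linspace(0., 1., 256))
    # Rec. 601 luminance weights for the RGB channels
    luminance = colors[:, :3].dot(np.array([0.299, 0.587, 0.114]))
    colors[:, :3] = luminance[:, np.newaxis]
    return LinearSegmentedColormap.from_list(cmap_name + '_gray', colors)

# Side-by-side comparison: the colormap vs. how it would print in B&W
gradient = np.outer(np.ones(10), np.linspace(0., 1., 256))
fig, (ax1, ax2) = plt.subplots(2, 1)
ax1.imshow(gradient, cmap='jet', aspect='auto')
ax2.imshow(gradient, cmap=grayscale_version('jet'), aspect='auto')
plt.show()

If the grayscale strip is not monotonic, readers of the printed version will not be able to order the values.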

Check out Nature's advice on color coding. Also check out the advice here.

Feel free to suggest other useful references about this issue in the comments.

Wednesday, October 8, 2014

Python Installation instructions (including IPython / IPython Notebook)

This page describes how to install Python and the other packages (Numpy, Scipy, IPython, IPython Notebook, Matplotlib) required for the course for Mac OS X, Linux and Windows.

Linux

In Linux, the installation instructions are pretty straightforward. Assuming that you are running Debian or Ubuntu, you just need to execute the following command in the terminal:

sudo apt-get install python-numpy python-scipy python-matplotlib ipython-notebook

For Fedora users, you can use the yum tool.

Mac OS X, Linux, Windows

We recommend downloading and installing the Anaconda Python distribution. The installation instructions are available here.

Just download the installer and execute it with bash.

Anaconda includes most of the packages we will use and it is pretty easy to install additional packages if required, using the conda or pip command-line tools.
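
For example (the package names below are just illustrations):

conda install astropy      # for packages available in the conda repositories
pip install corner         # for packages only available on PyPI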


If the above two methods do not work for OS X

The MacPorts way

You can try installing everything using MacPorts. First download and install macports and then issue the following command in a terminal:

sudo port install py27-zmq py27-tornado py27-nose

The above dependencies are required in order to run the IPython notebook. Then run:

sudo port install py27-numpy py27-matplotlib py27-scipy py27-ipython

The advantage of this method is that it is easy to do. The downsides:

  • It can take a couple of hours to finish the installation depending on your machine and internet connection, since MacPorts will download and compile everything as it goes.
  • If you like having bleeding-edge versions, note that it can take a while for them to be released on MacPorts.
  • Finally, MacPorts can create conflicts between different Python interpreters installed on your system.

Using Apple’s Python interpreter and pip

If you feel adventurous, you can use Apple’s builtin python interpreter and install everything using pip. Please follow the instructions described in this blog.

If you run into trouble

Leave a comment here with the issue you found.

Wednesday, August 27, 2014

Distributed arrays for parallel applications

I recently came across a very promising module: DistArray. The idea behind DistArray is to
provide general multidimensional NumPy-like distributed arrays to Python. It intends to bring the strengths of NumPy to data-parallel high-performance computing.
Neat!

Some examples for easily creating distributed arrays are given in this IPython notebook.

Unfortunately I could not test DistArray so far because I am getting weird errors on my system, probably related to problems with my MPI installation.

Monday, July 21, 2014

Frequentism and Bayesianism

Jake VanderPlas has been writing a series of posts discussing frequentism and bayesianism. They are well-written, clear and insightful and use IPython for the statistical analysis. Here, I compiled his posts on the topic for convenience.

Frequentism and Bayesianism: A Practical Introduction
where he synthesizes the philosophical and pragmatic aspects of the frequentist and Bayesian approaches as they relate to the analysis of scientific data.

Frequentism and Bayesianism II: When Results Differ
where he discusses the difference between frequentist and Bayesian in the treatment of nuisance parameters.

Frequentism and Bayesianism III: Confidence, Credibility, and why Frequentism and Science do not Mix
where he discusses the subtle difference between frequentist confidence intervals and Bayesian credible intervals.

Frequentism and Bayesianism IV: How to be a Bayesian in Python
where he describes how to do Bayesian statistics in python with emcee, PyMC and PyStan.


Thursday, May 15, 2014

Linear regression with errors in X and Y with Python: BCES

I finally had the chance to upload my BCES linear regression python code to Github!

If you need to do linear regression with measurement errors in X and Y, including intrinsic scatter, please check it out. Even better, if you have suggestions to improve or speed up the code, please contribute by all means!

Monday, November 19, 2012

MacPorts update broke your ipython/matplotlib installation?

I recently updated my MacPorts installation just to find out that it broke my ipython installation, even though my python environment is configured independently from macports.

The error message was more or less like this: I run the command

ipython --pylab

and matplotlib does not work, complaining about the missing library file libpng14.14.dylib with the message "image not found".

If you installed python, ipython and matplotlib following my tutorial (i.e. independently from the macports python installation), you just need to reinstall matplotlib with pip:

pip uninstall matplotlib
pip install matplotlib

This will reconfigure matplotlib with the updated dependencies.

Thursday, July 19, 2012

How to switch from IDL to Python

There is a new DIY tutorial on how to switch from IDL to Python hosted by AstroBetter. It is especially useful if you are an IDL programmer and want to grasp the basic concepts of python.

Go check it out.

Tuesday, June 12, 2012

Parallel computing in Python for the masses


Typical scenario: you wrote a Python routine that does some kind of time-consuming computation. Then you think: wow, my computer has N cores but my program is using only one of them at a time. What a waste of computing resources. Is there a reasonably easy way of modifying my code to make it exploit all the cores of my multicore machine?

The answer is yes and there are different ways of doing it. It depends on how complex your code is and which method you choose to parallelize your computation.

I will talk here about one relatively easy way of speeding up your code using the multiprocessing Python package. I should mention that there are many other options out there, but the multiprocessing package comes with Python by default as part of the standard library.

I am assuming that you really need to make your code parallel. You will have to stop and spend time thinking about how to break your computation into smaller parts that can be sent to the different cores. And I should mention that debugging parallel code is obviously harder than debugging serial code.

Parallelization is one way of optimizing your code. Other options for optimizing your code are Cython and f2py. Both of these approaches can yield >10x speedups and are worth exploring depending on your situation. But both will involve using the C or Fortran languages along with your Python code.

The ideal case is when your problem is "embarrassingly parallel". What I mean by this is: your problem can be made parallel in a reasonably easy way, since the computations which correspond to the bottleneck of the code can be carried out independently and do not need to communicate with each other. Examples:

  • You have a "grid" of parameters that you need to pass to a time-consuming model (e.g., a 1000x1000 matrix with the values of two parameters). Your model needs to evaluate those parameters and provide some output.
  • Your code performs a Monte Carlo simulation with 100000 trials which are carried out in a loop. You can then easily "dismember" this loop and send it to be computed independently by the cores in your machine.

Instead of giving code examples myself, I will point out the material I used to learn parallelization. I learned the basics of parallel programming by reading the excellent tutorial "introduction to parallel programming" written by Blaise Barney.

The next step was learning how to use the multiprocessing package. I learned this with the examples posted in the AstroBetter blog. I began by reading the example implemented with the pprocess package. The caveat here is that 'pprocess' is a non-standard package. The multiprocessing package which comes with python should be used instead. Somebody posted the original example discussed in the blog ported to the multiprocessing package.

As the posts above explain, the basic idea behind using 'multiprocessing' is to use the parallel map method to evaluate your time-consuming function using the many cores in your machine. Once you figure out a way of expressing your calculation in terms of the 'map' method, the rest is easy.
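
Here is a minimal sketch of that idea (expensive_model below is a placeholder for your actual time-consuming computation):

import multiprocessing

def expensive_model(p):
    """Placeholder for a time-consuming computation on one parameter."""
    return p**2

if __name__ == '__main__':
    params = range(100000)

    # one worker process per core by default
    pool = multiprocessing.Pool()
    # parallel 'map': evaluates expensive_model on each element of params
    results = pool.map(expensive_model, params)
    pool.close()   # don't forget to close the pool when done!
    pool.join()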

In my experience doing parallel programming in python using 'multiprocessing' I learned a few things which I want to share:

  1. Do not forget to close the parallel engine with the close() method after your computation is done! If you do not do this, you will end up leaving a lot of orphan processes which can quickly consume the available memory on your machine.
  2. Avoid using lambda functions when passing arguments to the parallel 'map' at all costs! Trust me, multiprocessing does not play well with lambda constructs, since they cannot be pickled and shipped to the worker processes.
  3. Finally, as I mentioned before, parallelizing a code increases development time and the complexity of debugging. Only resort to parallelization if you really need it, i.e. if you expect a big speedup in your code's execution. For example, if your code takes 24 hours to execute and you can get a 6x speedup with 'multiprocessing', the execution time is reduced to 4 hours, which is not bad.

Thursday, May 24, 2012

Hidden features of Python

I learned about this link with many useful hidden features of Python via Eduardo.

I particularly like the following (all three are sketched in the snippet after the list):

  • the use of enumerate in loops: for i,x in enumerate(array)
  • decorators as a simple way of enhancing functions: @decorator
  • one-line swapping of variables: a,b=b,a
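
Here is a quick sketch of all three in action (the functions are made up for illustration):

# enumerate: loop index and element at once
for i, x in enumerate(['a', 'b', 'c']):
    print('%d %s' % (i, x))

# a minimal decorator that uppercases a function's output
def shout(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name):
    return 'hello, ' + name

print(greet('world'))   # prints HELLO, WORLD

# one-line swap of variables
a, b = 1, 2
a, b = b, a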

Thursday, April 26, 2012

Using git to manage source code and more

Recently I learned how to use Git to manage source code (thanks to this guy). Let me tell you, it is such a fantastic tool! Especially when you have thousands of lines of source code constantly evolving and you need to keep track of the changes.

In my case, I have been using it to manage the source code I wrote for my different scientific projects. And I will soon begin using it even to manage the writing of one paper.

Let me list the tutorials that I read and have been very useful in getting me started quickly:

  • Git Magic: I began learning git with this one. It goes straight to the point and illustrates the most important commands.
  • Pro Git: need more detailed information and have more time to spend learning git? Have a look at this one.

Quick reference for git:
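
A few everyday commands cover most of the workflow (an illustrative sample, not exhaustive):

git init                    # start tracking a project
git status                  # what changed since the last commit?
git add myscript.py         # stage a file's changes
git commit -m "message"     # record a snapshot
git log                     # browse the history
git diff                    # inspect unstaged changes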

I use SourceTree, a GUI on Mac, to check the evolution of the source code.



Changelog
May 24th 2012: replaced suggestion of GUI GitX -> SourceTree.

Thursday, March 22, 2012

Simple progress bar in terminal

If you need to incorporate a simple progress bar in your code, there is a module that does that and is very easy to use: fish.

The following script illustrates how to implement a simple progress bar which advances each time a loop counter increases.

import fish
import time

steps = int(input('How many steps? '))

# Progress bar initialization
peixe = fish.ProgressFish(total=steps)

for i in range(steps):
    # Advance the progress bar
    peixe.animate(amount=i)

    time.sleep(0.1)

Here is a screenshot of what the progress bar looks like in action:


Wednesday, March 7, 2012

How to install a scientific Python environment on Mac OS X Lion

My Mac OS X workstation was just updated to Lion and I had to reinstall Python and associated scientific tools for plotting, statistics etc. What a pain.

After many trial-and-error procedures I finally found a way to get a scientific Python environment (Python + SciPy + IPython + NumPy + matplotlib) working correctly on Mac OS X Lion. I am reporting the steps I carried out, hoping that this will help other people.

You will need a Python installation (in my example I use the one that comes by default with OS X), gfortran and Xcode. Here are the steps:
  1. Install the requirements: Xcode (via the App Store), gfortran (via MacPorts), virtualenv, and additional libraries required by matplotlib
  2. Create a python environment with virtualenv
  3. Install Numpy, Scipy, matplotlib, ipython with pip. Install readline with easy_install
  4. Create an alias in .profile or .bash_profile (depending on your shell) to run ipython
After these steps are completed, you will get a working Python environment for scientific analysis, visualization and statistics with Mac OS X Lion. 

Requirements
  1. Xcode
  2. Python 2.7, which comes pre-installed by default with OS X
  3. gfortran
  4. virtualenv
  5. additional libraries required by matplotlib (optional)

1. How to get the requirements working

gfortran

In my case, I installed it by installing MacPorts and installing GCC which comes with gfortran:

 sudo port install gcc44  

To make gfortran visible to the system I created a symlink in /usr/local/bin:

 cd /usr/local/bin/  
 sudo ln -s /opt/local/bin/gfortran-mp-4.4 gfortran  

virtualenv

I went to the web page that hosts virtualenv and downloaded virtualenv.py. You will use virtualenv.py below.


Additional libraries required by matplotlib (optional)

I use the graphical backend TkAgg, which requires the following additional libraries for matplotlib to work: tk, freetype, libpng. I installed them using macports:

sudo port install tk
sudo port install freetype
sudo port install libpng


2. Create a python environment with virtualenv

Create a directory stdpy (in my example) somewhere and issue the command

 /usr/bin/python virtualenv.py stdpy  

to create an isolated python environment based on the python provided by default with Mac OS X. This avoids trouble with mixing libraries. Activate the environment by running

 source stdpy/bin/activate  

You should now see (stdpy) showing up in your terminal prompt.

3. Install Numpy, Scipy, matplotlib, ipython with pip and readline with easy_install

After activating the python environment, let's proceed and install the additional modules with pip and easy_install:

 pip install numpy  
 pip install scipy  
 pip install matplotlib  
 pip install ipython  
 easy_install readline  

You may need to install additional libraries in order to get matplotlib compiled, depending on the kind of graphical backend that you choose. In my case, I use TkAgg which depends on Tk, freetype and libpng libraries which I installed via macports.

4. Create an alias in .profile or .bash_profile (depending on your shell) to run ipython

In my case I use Bash and I added the following line to the file .bash_profile in my home directory:

 alias ipy='source ~/stdpy/bin/activate && ipython --pylab'   

Now, when I open the terminal and issue the command

 ipy  

it will automatically activate the python environment and run ipython.





Changelog:

  • Aug. 18th 2012: added instructions about additional libraries in matplotlib
  • Sep. 1st 2012: made explanation about matplotlib dependencies clearer (hopefully)

Friday, February 24, 2012

Plots with several histograms

Creating a plot with two histograms

Here is a method that you can use to plot two histograms in the same figure sharing the same X-axis, keeping some distance between the histograms:

import pylab

def twohists(x1, x2, xmin, xmax, x1leg='$x_1$', x2leg='$x_2$', xlabel='', sharey=False):
    """
Plots two histograms of the quantities x1 and x2 in the same figure,
sharing the X-axis.

Arguments:
- x1, x2: arrays with the data to be plotted
- xmin, xmax: lower and upper range of plotted values, used to set a
  consistent x-range for both histograms
- x1leg, x2leg: legends for each histogram
- xlabel: label of the shared X-axis
- sharey: share the Y-axis among the histograms?

Inspired by http://www.scipy.org/Cookbook/Matplotlib/Multiple_Subplots_with_One_Axis_Label.

Rodrigo Nemmen
v1 Dec. 2011
v1.1 Feb. 2012: added sharey argument.
    """
    pylab.clf()
    pylab.rcParams.update({'font.size': 15})
    fig = pylab.figure()

    # Bottom panel shares the X-axis (and optionally the Y-axis) with the top one
    a = fig.add_subplot(2, 1, 1)
    if sharey:
        b = fig.add_subplot(2, 1, 2, sharex=a, sharey=a)
    else:
        b = fig.add_subplot(2, 1, 2, sharex=a)

    a.hist(x1, label=x1leg, color='b')
    a.legend(loc='best', frameon=False)
    a.set_xlim(xmin, xmax)

    b.hist(x2, label=x2leg, color='r')
    b.legend(loc='best', frameon=False)

    # Hide the X tick labels of the top histogram
    pylab.setp(a.get_xticklabels(), visible=False)

    b.set_xlabel(xlabel)
    b.set_ylabel('Number', verticalalignment='bottom')
    pylab.minorticks_on()
    pylab.subplots_adjust(hspace=0.15)
    pylab.draw()

... and here is an example script that uses the method above:

 """  
 Illustrates how to use the twohists method.  
 """  
 import nemmen  
 import scipy, pylab  
   
 # Generates a normal distribution  
 x1=scipy.random.standard_normal(100)  
   
 # Generates a uniform random distribution  
 x2=scipy.random.uniform(-3,3,100)  
   
 nemmen.twohists(x1,x2,-3,3,'Normal','Uniform')  
   
 pylab.show()  

... to create the following plot:



Creating a plot with three histograms


I also wrote a recipe that makes a plot with three histograms:

import pylab

def threehists(x1, x2, x3, xmin, xmax, x1leg='$x_1$', x2leg='$x_2$', x3leg='$x_3$', xlabel='', sharey=False):
    """
Plots three histograms of the quantities x1, x2 and x3 in the same figure,
sharing the X-axis.

Arguments:
- x1, x2, x3: arrays with the data to be plotted
- xmin, xmax: lower and upper range of plotted values, used to set a
  consistent x-range for the histograms
- x1leg, x2leg, x3leg: legends for each histogram
- xlabel: label of the shared X-axis
- sharey: share the Y-axis among the histograms?

Example:
x1=Lbol(AD), x2=Lbol(JD), x3=Lbol(EHF10)
>>> threehists(x1,x2,x3,38,44,'AD','JD','EHF10','$\log L_{\\rm bol}$ (erg s$^{-1}$)',sharey=True)

Inspired by http://www.scipy.org/Cookbook/Matplotlib/Multiple_Subplots_with_One_Axis_Label.

Rodrigo Nemmen
v1 Dec. 2011
v1.1 Feb. 2012: added sharey keyword.
    """
    pylab.clf()
    pylab.rcParams.update({'font.size': 15})
    fig = pylab.figure()

    # Middle and bottom panels share the X-axis (and optionally the Y-axis) with the top one
    a = fig.add_subplot(3, 1, 1)
    if sharey:
        b = fig.add_subplot(3, 1, 2, sharex=a, sharey=a)
        c = fig.add_subplot(3, 1, 3, sharex=a, sharey=a)
    else:
        b = fig.add_subplot(3, 1, 2, sharex=a)
        c = fig.add_subplot(3, 1, 3, sharex=a)

    a.hist(x1, label=x1leg, color='b')
    a.legend(loc='best', frameon=False)
    a.set_xlim(xmin, xmax)

    b.hist(x2, label=x2leg, color='r')
    b.legend(loc='best', frameon=False)

    c.hist(x3, label=x3leg, color='y')
    c.legend(loc='best', frameon=False)

    # Hide the X tick labels of the top and middle histograms
    pylab.setp(a.get_xticklabels(), visible=False)
    pylab.setp(b.get_xticklabels(), visible=False)

    c.set_xlabel(xlabel)
    b.set_ylabel('Number')
    pylab.minorticks_on()
    pylab.subplots_adjust(hspace=0.15)
    pylab.draw()

... and as before, here is a script that illustrates how to use the above method:

 """  
 Illustrates how to use the threehists method.  
 """  
 import nemmen  
 import scipy, pylab  
   
 # Generates a normal distribution  
 x1=scipy.random.standard_normal(100)  
   
 # Generates a uniform random distribution  
 x2=scipy.random.uniform(-3,3,100)  
   
 x3=scipy.random.standard_normal(1000)  
   
 nemmen.threehists(x1,x2,x3,-3,3,'Normal ($n=100$)','Uniform','Normal ($n=1000$)')  
   
 pylab.show()  

... creating this plot:



Thursday, February 9, 2012

Additional color names for matplotlib plots

When creating plots with matplotlib, we usually use the "default" color names:
  • b : blue
  • g : green
  • r : red
  • c : cyan
  • m : magenta
  • y : yellow
  • k : black
  • w : white
It turns out that matplotlib accepts not only these default color names, but the full range of HTML color names! So, for example, you can plot some data like this:

 pylab.plot(x, y, 'DarkOrange')  

and get a "dark orange" color.

This is an easy way of specifying colors in addition to the standard ones.

Wednesday, February 8, 2012

How to begin learning Python


Many (perhaps most) people who want to learn Python get confused by the overwhelming number of reference sources available. Where to start? So many options!

Motivated by this, I list in this post the references that I used to learn Python (and object-oriented programming as well), which can serve as a starting point for other people. This post is biased towards scientists interested in learning Python.

Beginner material

I learned the basic syntax and capabilities of the language with the official Python tutorial. You can download all of this as PDF files. I suggest this for people with previous programming experience. For absolute beginners, have a look at the Think Python book below.

Introductory lecture about Python, its syntax and science applications. It shows what Python is capable of for data analysis and plotting. Inspiring. The audio is also available for download as an MP3 file.

Tutorial on using Python for data analysis! How to replace IDL/Matlab with Python. Includes: plotting, FITS files, signal processing.

I learned object-oriented programming using this material. Very clear and "application-oriented" approach. You don't need to be a biologist to understand this.

Longer introduction for people with no previous extensive programming experience.

Quick reference

This is a cheat sheet with the basic commands needed for data analysis, array processing and plotting.

Migrating from IDL/Matlab to Python.

If you are going to do serious stuff with Python, I suggest using the enhanced interactive Python terminal IPython.

Note: this post is a revised version of the text originally posted here.

Friday, February 3, 2012

Computing the chi-squared and reduced chi-squared of a model

Here are two codes for computing the chi-squared of a model compared to some data. Very useful when judging the goodness-of-fit of a model.

Source code for the method that returns the chi-squared:

import numpy

def chisqg(ydata, ymod, sd=None):
    """
Returns the chi-square error statistic as the sum of squared errors between
Ydata(i) and Ymodel(i). If individual standard deviations (array sd) are supplied,
then the chi-square error statistic is computed as the sum of squared errors
divided by the standard deviations. Inspired by the IDL procedure linfit.pro.
See http://en.wikipedia.org/wiki/Goodness_of_fit for reference.

ydata, ymod, sd assumed to be Numpy arrays.
Returns the float chisq with the chi-square statistic.

Rodrigo Nemmen
http://goo.gl/8S1Oo
    """
    # Chi-square statistic (Bevington, eq. 6.9)
    if sd is None:
        chisq = numpy.sum((ydata - ymod)**2)
    else:
        chisq = numpy.sum(((ydata - ymod) / sd)**2)

    return chisq




Source code for method that returns the reduced chi-squared of a model. You need to provide the number of free parameters of the model as input to the method.

def redchisqg(ydata, ymod, deg=2, sd=None):
    """
Returns the reduced chi-square error statistic for an arbitrary model,
chisq/nu, where nu is the number of degrees of freedom. If individual
standard deviations (array sd) are supplied, then the chi-square error
statistic is computed as the sum of squared errors divided by the standard
deviations. See http://en.wikipedia.org/wiki/Goodness_of_fit for reference.

ydata, ymod, sd assumed to be Numpy arrays. deg integer.

Usage:
>>> chisq=redchisqg(ydata,ymod,n,sd)
where
 ydata : data
 ymod : model evaluated at the same x points as ydata
 n : number of free parameters in the model
 sd : uncertainties in ydata

Rodrigo Nemmen
http://goo.gl/8S1Oo
    """
    # Chi-square statistic
    if sd is None:
        chisq = numpy.sum((ydata - ymod)**2)
    else:
        chisq = numpy.sum(((ydata - ymod) / sd)**2)

    # Number of degrees of freedom (deg is the number of free parameters in the model)
    nu = ydata.size - 1 - deg

    return chisq / nu
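
As a quick illustration of both methods (the synthetic data below are made up for this example; it assumes chisqg and redchisqg are defined as above):

import numpy

# Synthetic data: a straight line plus Gaussian noise
x = numpy.linspace(0., 10., 50)
sd = 0.5 * numpy.ones(x.size)
ydata = 2.*x + 1. + numpy.random.normal(0., 0.5, x.size)
ymod = 2.*x + 1.   # the "model": the true underlying line

print(chisqg(ydata, ymod, sd))         # roughly the number of data points
print(redchisqg(ydata, ymod, 2, sd))   # roughly 1 for a good fit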

Monday, December 19, 2011

To get the new "notebook" functionality working in IPython 0.12 ...

... I needed to install these additional modules under Mac OS X Snow Leopard:
  • sudo pip-2.7 install tornado
  • sudo fink install zmq-py27

Friday, December 16, 2011

Script illustrating how to do a linear regression and plot the confidence band of the fit

The script below illustrates how to carry out a simple linear regression of data stored in an ASCII file, plot the linear fit and the 2 sigma confidence band.

The script invokes the confband method described here to plot the confidence bands. confband is assumed to live inside the "nemmen" module (not yet publicly available, sorry), but you can place it in any module you want.

I got the test data here to perform the fitting.

After running the script you should get a plot where the best-fit line is displayed in green and the confidence band is the shaded gray area.

The script below is also available at Github Gist.

   
import numpy, pylab
import scipy.stats
import nemmen

# Data taken from http://orion.math.iastate.edu/burkardt/data/regression/x01.txt.
# I removed the header from the file and left only the data in 'testdata.dat'.
xdata, ydata = numpy.loadtxt('testdata.dat', unpack=True, usecols=(1, 2))
xdata = numpy.log10(xdata)   # take the logs
ydata = numpy.log10(ydata)

# Linear fit
a, b, r, p, err = scipy.stats.linregress(xdata, ydata)

# Generates arrays with the fit
x = numpy.linspace(xdata.min(), xdata.max(), 100)
y = a*x + b

# Calculates the 2 sigma confidence band contours for the fit
lcb, ucb, xcb = nemmen.confband(xdata, ydata, a, b, conf=0.95)

# Plots the fit and data
pylab.clf()
pylab.plot(xdata, ydata, 'o')
pylab.plot(x, y)

# Plots the confidence band as a shaded area
pylab.fill_between(xcb, lcb, ucb, alpha=0.3, facecolor='gray')

pylab.show()
pylab.draw()