Monday, December 19, 2011

To get the new "notebook" functionality working in iPython 0.12 ...

... I needed to install these additional modules under Mac OS X Snow Leopard:
  • sudo pip-2.7 install tornado
  • sudo fink install zmq-py27

Friday, December 16, 2011

Script illustrating how to do a linear regression and plot the confidence band of the fit

The script below illustrates how to carry out a simple linear regression of data stored in an ASCII file, plot the linear fit and the 2 sigma confidence band.

The script invokes the confband method described here to plot the confidence bands. The confband is assumed to be lying inside the "nemmen" module (not yet publicly available, sorry) but you can place it in any module you want.

I got the test data here to perform the fitting.

After running the script you should get the following plot:
where the best-fit line is displayed in green and the shaded area is in gray.

The script below is also available at Github Gist.

   
 import numpy, pylab, scipy  
 import nemmen  
   
 # Data taken from http://orion.math.iastate.edu/burkardt/data/regression/x01.txt.  
 # I removed the header from the file and left only the data in 'testdata.dat'.  
 xdata,ydata = numpy.loadtxt('testdata.dat',unpack=True,usecols=(1,2))  
 xdata=numpy.log10(xdata) # take the logs  
 ydata=numpy.log10(ydata)  
   
 # Linear fit  
 a, b, r, p, err = scipy.stats.linregress(xdata,ydata)  
   
 # Generates arrays with the fit  
 x=numpy.linspace(xdata.min(),xdata.max(),100)  
 y=a*x+b  
   
 # Calculates the 2 sigma confidence band contours for the fit  
 lcb,ucb,xcb=nemmen.confband(xdata,ydata,a,b,conf=0.95)  
   
 # Plots the fit and data  
 pylab.clf()  
 pylab.plot(xdata,ydata,'o')  
 pylab.plot(x,y)  
   
 # Plots the confidence band as shaded area  
 pylab.fill_between(xcb, lcb, ucb, alpha=0.3, facecolor='gray')  
   
 pylab.show()  
 pylab.draw()  

Wednesday, December 14, 2011

Quick Python reference documents

These quick reference sheets will help you to quickly learn and practice data analysis, plotting and statistics with Python:

Python data analysis reference card (created by Fabricio Ferrari)



Generating random numbers from an arbitrary probability distribution using the rejection method

I am pretty used to generating random numbers from a normal distribution. There are built-in methods to carry out that task in Python and many other data analysis softwares.

But what if you need to produce random numbers based on a non-standard probability distribution? In my case that challenge appeared and I needed to produce random numbers from a power-law (?!) probability density function. Astronomy can be pretty interesting.

The method below solves that problem. It is not very efficient since I did not try to optimize it. But it gets the job done.

Let me know if you have a more clever way of solving this problem in Python or if you use the method and find bugs.


 def randomvariate(pdf,n=1000,xmin=0,xmax=1):  
  """  
 Rejection method for random number generation  
 ===============================================  
 Uses the rejection method for generating random numbers derived from an arbitrary   
 probability distribution. For reference, see Bevington's book, page 84. Based on  
 rejection*.py.  
   
 Usage:  
 >>> randomvariate(P,N,xmin,xmax)  
  where  
  P : probability distribution function from which you want to generate random numbers  
  N : desired number of random values  
  xmin,xmax : range of random numbers desired  
    
 Returns:   
  the sequence (ran,ntrials) where  
   ran : array of shape N with the random variates that follow the input P  
   ntrials : number of trials the code needed to achieve N  
   
 Here is the algorithm:  
 - generate x' in the desired range  
 - generate y' between Pmin and Pmax (Pmax is the maximal value of your pdf)  
 - if y'<P(x') accept x', otherwise reject  
 - repeat until desired number is achieved  
   
 Rodrigo Nemmen  
 Nov. 2011  
  """  
  # Calculates the minimal and maximum values of the PDF in the desired  
  # interval. The rejection method needs these values in order to work  
  # properly.  
  x=numpy.linspace(xmin,xmax,1000)  
  y=pdf(x)  
  pmin=y.min()  
  pmax=y.max()  
   
  # Counters  
  naccept=0  
  ntrial=0  
   
  # Keeps generating numbers until we achieve the desired n  
  ran=[] # output list of random numbers  
  while naccept<n:  
  x=numpy.random.uniform(xmin,xmax) # x'  
  y=numpy.random.uniform(pmin,pmax) # y'  
   
  if y<pdf(x):  
   ran.append(x)  
   naccept=naccept+1  
  ntrial=ntrial+1  
    
  ran=numpy.asarray(ran)  
    
  return ran,ntrial  

Update Oct. 23rd 2014: This code snippet is available (and updated) at github. Correction: pmin=0 (thanks PZ!)

Calculating the prediction band of a linear regression model


The method below calculates the prediction band of an arbitrary linear regression model at a given confidence level in Python.

If you use it, let me know if you find any bugs.




  1. def predband(xd,yd,a,b,conf=0.95,x=None):
  2.     """
  3. Calculates the prediction band of the linear regression model at the desired confidence
  4. level.
  5. Clarification of the difference between confidence and prediction bands:
  6. "The 2sigma confidence interval is 95% sure to contain the best-fit regression line.
  7. This is not the same as saying it will contain 95% of the data points. The prediction bands are
  8. further from the best-fit line than the confidence bands, a lot further if you have many data
  9. points. The 95% prediction interval is the area in which you expect 95% of all data points to fall."
  10. (from http://graphpad.com/curvefit/linear_regression.htm)
  11. Arguments:
  12. - conf: desired confidence level, by default 0.95 (2 sigma)
  13. - xd,yd: data arrays
  14. - a,b: linear fit parameters as in y=ax+b
  15. - x: (optional) array with x values to calculate the confidence band. If none is provided, will
  16.  by default generate 100 points in the original x-range of the data.
  17.  
  18. Usage:
  19. >>> lpb,upb,x=nemmen.predband(all.kp,all.lg,a,b,conf=0.95)
  20. calculates the prediction bands for the given input arrays
  21. >>> pylab.fill_between(x, lpb, upb, alpha=0.3, facecolor='gray')
  22. plots a shaded area containing the prediction band  
  23. Returns:
  24. Sequence (lpb,upb,x) with the arrays holding the lower and upper confidence bands
  25. corresponding to the [input] x array.
  26. References:
  27. 1. http://www.JerryDallal.com/LHSP/slr.htm, Introduction to Simple Linear Regression, Gerard
  28. E. Dallal, Ph.D.
  29. Rodrigo Nemmen
  30. v1 Dec. 2011
  31. v2 Jun. 2012: corrected bug in dy.
  32.     """
  33.     alpha=1.-conf   # significance
  34.     n=xd.size   # data sample size
  35.     if x==None: x=numpy.linspace(xd.min(),xd.max(),100)
  36.     # Predicted values (best-fit model)
  37.     y=a*x+b
  38.     # Auxiliary definitions
  39.     sd=scatterfit(xd,yd,a,b)    # Scatter of data about the model
  40.     sxd=numpy.sum((xd-xd.mean())**2)
  41.     sx=(x-xd.mean())**2 # array
  42.     # Quantile of Student's t distribution for p=1-alpha/2
  43.     q=scipy.stats.t.ppf(1.-alpha/2.,n-2)
  44.     # Prediction band
  45.     dy=q*sd*numpy.sqrt( 1.+1./n + sx/sxd )
  46.     upb=y+dy    # Upper prediction band
  47.     lpb=y-dy    # Lower prediction band
  48.     return lpb,upb,x

Computing the scatter of data around linear fits

The Python method below computes the scatter of data around a given linear model.

Let me know if you find any bugs.



   
 def scatterfit(x,y,a=None,b=None):  
  """  
 Compute the mean deviation of the data about the linear model given if A,B  
 (y=ax+b) provided as arguments. Otherwise, compute the mean deviation about   
 the best-fit line.  
   
 x,y assumed to be Numpy arrays. a,b scalars.  
 Returns the float sd with the mean deviation.  
   
 Author: Rodrigo Nemmen  
  """  
   
  if a==None:   
  # Performs linear regression  
  a, b, r, p, err = scipy.stats.linregress(x,y)  
    
  # Std. deviation of an individual measurement (Bevington, eq. 6.15)  
  N=numpy.size(x)  
  sd=1./(N-2.)* numpy.sum((y-a*x-b)**2); sd=numpy.sqrt(sd)  
    
  return sd  

Calculating and plotting confidence bands for linear regression models

This method calculates the confidence band of an arbitrary linear regression model at a given confidence level in Python. Using this method you can get plots like this:
where the shaded area corresponds to the 2sigma confidence band of the linear fit shown in green. A script illustrating how to use the confband method below is available.

If you use it, let me know if you find any bugs.

Note: the scatterfit method called below is available here.