# How to find a Pearson’s correlation and ordinary least squares in Python

I'm going to try to blog more technical posts here, alongside the simple recipes. I'm still working out the best way to present them - whether to bundle everything together or split the technical posts out. Today we're going to use Python to find a simple correlation, then fit a straight line to the data. First you want to install the SciPy and NumPy libraries - they have a lot of cool functions for Python. On a Mac, if you have MacPorts installed, this is trivial:
```
$ sudo port install py27-numpy py27-scipy
```
Then you find a Pearson's correlation as follows:
```
# The independent variable
>>> x = [1, -2, 2, 3, 1]

# The dependent variable
>>> y = [7.5, -3.5, 14.5, 19, 6.6]

# Find the correlation
>>> from scipy.stats import pearsonr

# The first value is the r-value, the second is the p-value
>>> pearsonr(x, y)
(0.98139984935586166, 0.0030366388199721478)

# To find the best-fit line, use the NumPy library
>>> import numpy
>>> from numpy.linalg import lstsq

# Stack x with a column of ones so lstsq can fit an intercept.
>>> A = numpy.vstack([x, numpy.ones(len(x))]).T

>>> A
array([[ 1.,  1.],
       [-2.,  1.],
       [ 2.,  1.],
       [ 3.,  1.],
       [ 1.,  1.]])

>>> lstsq(A, y)
(array([ 4.5 ,  4.32]), array([ 10.848]), 2, array([ 4.53897844,  1.84327826]))
# 4.5 is the slope, 4.32 is the y-intercept. 10.848 is the sum of squared
# residuals, 2 is the rank of A, and the last array holds its singular values.
```
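As a sanity check, you can get the same numbers from the closed-form formulas - Pearson's r is the covariance divided by the product of the standard deviations, and the slope and intercept come from the normal equations. A minimal sketch using only NumPy:

```python
import numpy as np

x = np.array([1, -2, 2, 3, 1], dtype=float)
y = np.array([7.5, -3.5, 14.5, 19, 6.6])

# Pearson's r: covariance over the product of the standard deviations
xd, yd = x - x.mean(), y - y.mean()
r = (xd * yd).sum() / (np.sqrt((xd ** 2).sum()) * np.sqrt((yd ** 2).sum()))

# Least-squares slope and intercept from the normal equations
slope = (xd * yd).sum() / (xd ** 2).sum()
intercept = y.mean() - slope * x.mean()

print(r, slope, intercept)  # roughly 0.9814, 4.5, 4.32
```

The values agree with what `pearsonr` and `lstsq` returned above.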
For more details, including how to plot the fit, see the NumPy documentation for `lstsq`.
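If you want to see the fit, a minimal plotting sketch with matplotlib (assuming it is installed; the `Agg` backend renders to a file without a display) might look like:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render to a file, no display needed
import matplotlib.pyplot as plt

x = np.array([1, -2, 2, 3, 1], dtype=float)
y = np.array([7.5, -3.5, 14.5, 19, 6.6])

# Re-fit the line with lstsq, as above; rcond=None silences a
# FutureWarning on newer NumPy versions.
A = np.vstack([x, np.ones(len(x))]).T
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]

plt.plot(x, y, "o", label="data")
plt.plot(x, slope * x + intercept, label="best fit")
plt.legend()
plt.savefig("fit.png")
```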
