How to do a linear regression with sklearn


Generalized Linear Models

Step 1: Installing scikit-learn

This tutorial assumes you already have a Python Anaconda environment installed.

conda install scikit-learn

Test that scikit-learn was correctly installed:

from sklearn.linear_model import LinearRegression
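If you prefer a check that does not raise an error when the package is missing, a minimal sketch using only the standard library could look like this. The helper name is_installed is made up for this example; it is not part of scikit-learn.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("sklearn"):
    import sklearn
    print("scikit-learn version:", sklearn.__version__)
else:
    print("scikit-learn is not installed in this environment")
```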

Step 2: Generate random linear data

We are going to choose fixed values of m and b for the formula y = x*m + b. Then we will generate random points with a random error of up to 1%. Usually you will not know these values in advance; we generate this data for teaching purposes. With these points we are going to use sklearn to fit a linear regression and verify how close we get to the fixed m and b values that we chose.

import random
# let's create a linear function with some error called f
def f(x):
    res = x* 25 + 3
    error = res * random.uniform(-0.01, 0.01) # you can play with the error to see how it affects the result
    return res + error

values = []
# now using f we are going to create 300 values.
for i in range(0, 300):
    x = random.uniform(1, 1000)
    y = f(x)
    values.append((x, y))

In our code we first define the function f, which is a linear function that adds up to +/- 1% of random error to the result; without the error, all the points would lie on a straight line. Then we generate 300 random points that we will use to train and test a model.
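The following is a small sketch of the same generation step made reproducible, so that repeated runs give the same points. The seed value 42 and the names noisy_line, TRUE_M, TRUE_B and MAX_REL_ERROR are assumptions chosen for this example. It also checks that no point deviates from the true line by more than 1%.

```python
import random

TRUE_M, TRUE_B = 25, 3      # same slope and intercept as in f above
MAX_REL_ERROR = 0.01        # +/- 1% relative error


def noisy_line(x, rng):
    """Evaluate the line and add a relative error drawn uniformly from +/- 1%."""
    res = x * TRUE_M + TRUE_B
    return res + res * rng.uniform(-MAX_REL_ERROR, MAX_REL_ERROR)


rng = random.Random(42)     # assumption: any fixed seed makes the run repeatable
points = []
for _ in range(300):
    x = rng.uniform(1, 1000)
    points.append((x, noisy_line(x, rng)))

# largest relative distance between a generated point and the true line
worst = max(abs(y - (x * TRUE_M + TRUE_B)) / (x * TRUE_M + TRUE_B)
            for x, y in points)
print("largest relative deviation:", worst)
```

Because the error is relative, the absolute error grows with x: a point near x = 1000 can be off by about 250, while a point near x = 1 is off by less than 1.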

If we plot the points we will get:

Random points of a linear function with an error of 1%

Step 3: Use scikit-learn to do a linear regression

Now we are ready to use scikit-learn to do a linear regression. We will feed the values list to the fit method of the linear regression. We also separate the data into two pieces: train and test. With the test data, in step 5 we are going to measure the error of the trained linear model.

from sklearn import linear_model
regr = linear_model.LinearRegression()
# split the values into two series instead of a list of tuples
x, y = zip(*values)
max_x = max(x)
min_x = min(x)
# split the values into train and test data; X must be a list of rows, one per sample.
# (a list comprehension is used because map() returns a one-shot iterator in Python 3)
train_data_X = [[v] for v in x[:-20]]
train_data_Y = list(y[:-20])
test_data_X = [[v] for v in x[-20:]]
test_data_Y = list(y[-20:])
# feed the linear regression with the train data to obtain a model.
regr.fit(train_data_X, train_data_Y)
# check that the coefficients are the expected ones.
m = regr.coef_[0]
b = regr.intercept_
print(' y = {0} * x + {1}'.format(m, b))

The output should be similar to:

y = 25.177610247385566 * x + -7.94027696509
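To see what the fit step computes, here is a sketch of the closed-form ordinary least squares solution for a single input variable, written in plain Python. It is not how scikit-learn is implemented internally; the function name fit_line and the seed 0 are assumptions made for this example.

```python
import random


def fit_line(xs, ys):
    """Least squares slope and intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


rng = random.Random(0)  # assumption: fixed seed for repeatable output
xs = [rng.uniform(1, 1000) for _ in range(300)]
ys = [(25 * x + 3) * (1 + rng.uniform(-0.01, 0.01)) for x in xs]

m, b = fit_line(xs, ys)
print('y = {0} * x + {1}'.format(m, b))
```

The slope usually lands close to 25, while the intercept can be off by several units, as in the sample output above. The error is proportional to y, so it is large compared to the true intercept of 3.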

Step 4: Plot the result

Now we are going to plot the fitted regression:

import matplotlib.pyplot as plt

# now we are going to plot the points and the model obtained
plt.scatter(x, y, color='blue')  # you can use test_data_X and test_data_Y instead.
# draw the fitted line between its values at min_x and max_x
plt.plot([min_x, max_x], [m*min_x + b, m*max_x + b], 'r')
plt.title('Fitted linear regression', fontsize=16)
plt.xlabel('x', fontsize=13)
plt.ylabel('y', fontsize=13)

Final result of the fitted linear regression

Step 5: Measure the error

import numpy as np
# The mean squared error
print("Mean squared error: %.2f" % np.mean((regr.predict(test_data_X) - test_data_Y) ** 2))
# score() returns R^2, the coefficient of determination: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(test_data_X, test_data_Y))

The output should be similar to:

Mean squared error: 3352.52
Variance score: 1.00
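A large mean squared error next to a score of 1.00 can look contradictory. The sketch below computes both quantities by hand on four made-up test points, each predicted with a 1% error. The helper names mse and r_squared and the sample numbers are assumptions for illustration only.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# made-up values: every prediction is off by exactly 1%
y_true = [1000.0, 5000.0, 10000.0, 20000.0]
y_pred = [1010.0, 4950.0, 10100.0, 19800.0]

print("Mean squared error: %.2f" % mse(y_true, y_pred))   # 13150.00
print("R^2 score: %.2f" % r_squared(y_true, y_pred))      # 1.00
```

The mean squared error is measured in squared units of y, so with y values in the thousands even a 1% error gives a large number. R^2 compares that error with the total spread of y, which is far larger, so it stays very close to 1.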


Simple linear regression is a statistical method that allows us to summarize and study the relationship between two continuous (quantitative) variables.

It's a good idea to start doing a linear regression for learning or when you start to analyze data, since linear models are simple to understand. If a linear model is not the way to go, then you can move to more complex models.

Linear regression tries to find a linear relation between variables.

Scikit-learn is a Python library that is used for machine learning, data processing, cross-validation and more. In this tutorial we are going to do a simple linear regression using this library. In particular, we are going to play with some randomly generated data that we will use to fit a model.

When to use linear regression

Use it when you think there is a linear relation between the variables you want to analyze (X and Y), like the one we see in figure 0.
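One informal way to check for such a relation before fitting is the Pearson correlation coefficient, which is close to 1 or -1 when the points lie near a straight line. This is only a sketch of one possible check, not part of the scikit-learn workflow in this tutorial, and the helper name pearson_r is made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


line_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])          # points on a rising line
line_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])        # points on a falling line
parabola = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x**2, not linear
print(line_up, line_down, parabola)
```

A value near 0, as for the parabola, does not mean the variables are unrelated. It only means a straight line is a poor description of the relation.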

What we are going to fit is the slope (m) and the y-intercept (b), so we are going to get a function like: y = x*m + b.

In this case the linear combination only has x since we are using 2D data, but the general linear model, where y is the predicted value, is:
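In the notation commonly used in the scikit-learn documentation (the symbols w and p come from there, not from the code in this tutorial), the model with p input variables can be written as:

```latex
\hat{y}(w, x) = w_0 + w_1 x_1 + \dots + w_p x_p
```

With a single input variable (p = 1), w_0 plays the role of b (the intercept_ attribute) and w_1 the role of m (the first entry of coef_), which gives back y = x*m + b.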


How to fix ImportError: numpy.core.multiarray failed to import

Try upgrading numpy with:

conda install numpy

If you are not using anaconda:

pip install -U numpy

How to fix ValueError: Found arrays with inconsistent numbers of samples

Check that the X argument passed to the fit function is a list of lists, with one inner list of input values per sample, and that it has the same number of entries as Y.
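As a small sketch of the expected shape, the snippet below wraps each x value in its own list and checks that X and Y have the same number of samples before they would be passed to fit. The helper name as_column and the sample values are assumptions for this example.

```python
def as_column(values):
    """Wrap each scalar in its own list: [1, 2] -> [[1], [2]]."""
    return [[v] for v in values]


xs = [1.0, 2.0, 3.0]
ys = [28.0, 53.0, 78.0]   # y = 25 * x + 3 without error

X = as_column(xs)
# fit expects one row per sample, so both lengths must match
assert len(X) == len(ys)
print(X)   # [[1.0], [2.0], [3.0]]
```

With NumPy, np.asarray(xs).reshape(-1, 1) produces the same one-column layout.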