Problem 1 - Regression

Contents

  1. Preparing the Data
  2. Least Squares Regression
  3. Ridge Regression
  4. LASSO
  5. RMSE of Models
  6. Sparsity of LASSO (answer to question 2)

Data source: https://archive.ics.uci.edu/ml/datasets/BlogFeedback

Task Summary

  1. Perform Least Squares Regression, Ridge Regression, and LASSO to predict the target variable.
  2. Train on blogData_train.csv and test on blogData_test-2012.03.31.01_00.csv.
  3. Report RMSE for each model.
  4. Answer: what are the most important features according to LASSO?

Importing packages

In [1]:
import pandas
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import svd
from scipy.stats import describe
from sklearn.linear_model import Lasso

Preparing the Data

In [2]:
# read data and compute correlation
df_train = pandas.read_csv('blogData_train.csv', header=None)
df_test = pandas.read_csv('blogData_test-2012.03.31.01_00.csv', header=None)
trainCorrMat = df_train.corr().to_numpy()
In [3]:
x_train_raw = df_train.to_numpy()[:, :-1]
x_train_attr_mean = np.mean(x_train_raw, axis=0)
x_train_attr_std = np.std(x_train_raw, axis=0)
x_train_attr_std[np.abs(x_train_attr_std) < 1.0e-4] = 1.0  # avoid division by zero for (near-)constant columns

# standardize every column using the training mean and std (applied to both train and test)
def get_xy(df):
    npArr = df.to_numpy()
    x = npArr[:, :-1].copy()
    x -= x_train_attr_mean
    x /= x_train_attr_std
    y = npArr[:, -1]
    return x, y

x_train, y_train = get_xy(df_train)
x_test, y_test = get_xy(df_test)
In [4]:
df_train.describe()
Out[4]:
0 1 2 3 4 5 6 7 8 9 ... 271 272 273 274 275 276 277 278 279 280
count 52397.000000 52397.000000 52397.000000 52397.000000 52397.000000 52397.000000 52397.000000 52397.000000 52397.000000 52397.000000 ... 52397.000000 52397.000000 52397.000000 52397.000000 52397.000000 52397.000000 52397.0 52397.000000 52397.000000 52397.000000
mean 39.444167 46.806717 0.358914 339.853102 24.681661 15.214611 27.959159 0.002748 258.666030 5.829151 ... 0.171327 0.162242 0.154455 0.096151 0.088917 0.119167 0.0 1.242094 0.769505 6.764719
std 79.121821 62.359996 6.840717 441.430109 69.598976 32.251189 38.584013 0.131903 321.348052 23.768317 ... 0.376798 0.368676 0.361388 0.294800 0.284627 1.438194 0.0 27.497979 20.338052 37.706565
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000
25% 2.285714 5.214318 0.000000 29.000000 0.000000 0.891566 3.075076 0.000000 22.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000
50% 10.630660 19.353120 0.000000 162.000000 4.000000 4.150685 11.051215 0.000000 121.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000
75% 40.304670 77.442830 0.000000 478.000000 15.000000 15.998589 45.701206 0.000000 387.000000 2.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 1.000000
max 1122.666600 559.432600 726.000000 2044.000000 1314.000000 442.666660 359.530060 14.000000 1424.000000 588.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 136.000000 0.0 1778.000000 1778.000000 1424.000000

8 rows × 281 columns

In [5]:
plt.pcolor(trainCorrMat, cmap='bwr')
plt.title('Pairwise correlation coefficients between columns')
plt.colorbar()
plt.show()
In [6]:
lastRowCorr = trainCorrMat[-1, :]
plt.title('Correlation coefficients between columns and last column (target variable)')
plt.plot(np.arange(lastRowCorr.size), lastRowCorr, '.')
plt.show()

Least Squares Regression

The linear model is $Y = XW + \epsilon$. We want to minimize the squared error term $\lVert \epsilon \rVert^2=\lVert Y-XW\rVert^2$. Setting the gradient with respect to $W$ to zero gives the normal-equations solution $W = (X^TX)^{-1}X^TY$. The (thin) SVD lets us write $X=U\Sigma V^T$, where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal. This allows us to write

$$ \begin{align*} W &= (X^TX)^{-1}X^TY\\ &= (V\Sigma U^TU\Sigma V^T)^{-1}V\Sigma U^TY\\ &= V\Sigma^{-2}V^TV\Sigma U^TY\\ &= V\Sigma^{-1}U^TY. \end{align*} $$

Inverting $\Sigma$ is cheap because it is diagonal; singular values that are (nearly) zero are discarded rather than inverted.
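
As a quick sanity check of this formula, the SVD-based solution can be compared against numpy's built-in least-squares solver on a small random problem (a minimal sketch; the random matrix and variable names here are only for illustration):

# sanity check: SVD-based formula vs. np.linalg.lstsq on a small random problem
rng = np.random.RandomState(0)
X_small = rng.randn(50, 5)
y_small = rng.randn(50)
u_s, s_s, vh_s = svd(X_small, full_matrices=False)
w_svd = vh_s.T @ np.diag(1.0 / s_s) @ u_s.T @ y_small
w_lstsq, *_ = np.linalg.lstsq(X_small, y_small, rcond=None)
print(np.allclose(w_svd, w_lstsq))  # expected: True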

In [7]:
svd_u, svd_diags, svd_vh = svd(x_train, full_matrices=False, compute_uv=True)
In [8]:
# discard small singular values and invert the remaining diagonal elements
def invert_diags(diags, eps=1.0e-3):
    newDiags = diags.copy()
    newDiags[np.abs(newDiags) < eps] = 0.0
    newDiags = 1.0 / newDiags # division by zero is expected
    newDiags[np.invert(np.isfinite(newDiags))] = 0.0
    return newDiags
In [9]:
svd_v = svd_vh.T

svd_inv_diags = invert_diags(svd_diags)
# compute the W for least square regression
w_lsr = svd_v @ np.diag(svd_inv_diags) @ svd_u.T @ y_train
/home/alan/.local/lib/python3.6/site-packages/ipykernel_launcher.py:5: RuntimeWarning: divide by zero encountered in true_divide
  """
In [10]:
# compute the abs error term and observe the statistics
eps_lsr = np.abs(y_train - x_train @ w_lsr)
print(describe(eps_lsr))
DescribeResult(nobs=52397, minmax=(1.0227716217059424e-05, 1351.616203506152), mean=11.327067807487191, variance=820.7285439804408, skewness=13.33809224294594, kurtosis=307.5671332620857)

Ridge Regression

In ridge regression, we add a regularization term to the objective function, namely $\lVert Y-XW\rVert^2 + \lambda \lVert W \rVert^2$. Matrix calculus again gives $W=(X^TX + \lambda I)^{-1}X^TY$. Using the SVD (here $V$ is a square orthogonal matrix because the number of samples exceeds the number of features, so $VV^T = I$), we can write

$$ \begin{align*} W&=(X^TX + \lambda I)^{-1}X^TY\\ &= (V\Sigma^2V^T + \lambda VV^T)^{-1}V\Sigma U^TY\\ &= [V(\Sigma^2 + \lambda I)V^T]^{-1}V\Sigma U^TY\\ &= V(\Sigma^2 +\lambda I)^{-1}\Sigma U^TY. \end{align*} $$
In [11]:
def get_w_ridge(l):
    sqrDiags = np.square(svd_diags)
    ones = np.ones(sqrDiags.shape, sqrDiags.dtype)
    invDiags = invert_diags(sqrDiags + l * ones)
    return svd_v @ np.diag(invDiags) @ np.diag(svd_diags) @ svd_u.T @ y_train


def get_w_ridge_sklearn(l):
    from sklearn.linear_model import Ridge
    skRidge = Ridge(l, solver='svd', fit_intercept=False)
    skRidge.fit(x_train, y_train)
    return skRidge.coef_
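
Since both functions above implement the same closed form, they can be cross-checked against each other. A quick consistency sketch (assuming the standardized x_train and y_train defined above):

# consistency check: the hand-rolled SVD solution should match sklearn's SVD solver
w_a = get_w_ridge(10.0)
w_b = get_w_ridge_sklearn(10.0)
print(np.max(np.abs(w_a - w_b)))  # expected to be near zero (numerical noise only)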
In [12]:
w_ridge_test = get_w_ridge(10.0)
eps_ridge_test = np.abs(y_train - x_train @ w_ridge_test)
print(np.median(eps_ridge_test))
print(describe(eps_ridge_test))
6.614339697407586
DescribeResult(nobs=52397, minmax=(0.0002106446266787465, 1356.2535849452177), mean=11.31797824839462, variance=821.0893876727262, skewness=13.354403324460842, kurtosis=308.66058720907637)

LASSO

The objective function of LASSO is $\lVert Y-XW\rVert^2 + \lambda \lVert W \rVert_1$. Since LASSO has no closed-form solution, I use scikit-learn's coordinate-descent implementation. Note that scikit-learn's Lasso minimizes $\frac{1}{2n}\lVert Y-XW\rVert_2^2 + \alpha\lVert W\rVert_1$, so its alpha parameter is not on exactly the same scale as the $\lambda$ written above.

In [13]:
def get_w_lasso(l):
    skLasso = Lasso(alpha=l, fit_intercept=False)
    skLasso.fit(x_train, y_train)
    return skLasso.coef_
In [14]:
w_lasso_test = get_w_lasso(3.0)
eps_lasso_test = np.abs(y_train - x_train @ w_lasso_test)
print(describe(eps_lasso_test))
DescribeResult(nobs=52397, minmax=(0.0005799679534010949, 1408.451113722044), mean=9.789800393042356, variance=899.1453471321606, skewness=14.125710697038981, kurtosis=326.138684245896)

RMSE of Models

In [15]:
def compute_rmse(y_true, y_pred):
    dist = np.square(y_pred - y_true)
    #return np.sqrt(np.median(dist))
    return np.sqrt(np.sum(dist)/dist.size)

def compute_test_rmse(y_pred):
    return compute_rmse(y_test, y_pred)
In [16]:
# test RMSE of methods
print(compute_rmse(y_train, x_train @ w_lasso_test))
print(compute_test_rmse(x_test @ w_lsr))
print(compute_test_rmse(x_test @ w_ridge_test))
print(compute_test_rmse(x_test @ w_lasso_test))
31.54311935467014
41.22146051641091
41.17887985975923
42.6591094563153
In [17]:
lambdaVals = [1.0e-3, 1.0, 10.0, 50.0, 100.0, 200.0, 500.0, 1000.0, 2000.0]

lsr_rmse = compute_test_rmse(x_test @ w_lsr)
ridge_rmses=[]
lasso_rmses=[]

for val in lambdaVals:
    w_ridge = get_w_ridge(val)
    w_lasso = get_w_lasso(val)
    ridge_rmse = compute_test_rmse(x_test @ w_ridge)
    lasso_rmse = compute_test_rmse(x_test @ w_lasso)
    ridge_rmses.append(ridge_rmse)
    lasso_rmses.append(lasso_rmse)
/home/alan/.local/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:475: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 19256368.409870643, tolerance: 7689.360900000001
  positive)
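
The ConvergenceWarning comes from the coordinate-descent solver hitting its iteration limit for the smallest alpha values. One workaround is to raise the solver's iteration budget; a sketch (the max_iter and tol values are only suggestions):

# variant of get_w_lasso with a larger iteration budget, to silence the ConvergenceWarning
def get_w_lasso_patient(l):
    skLasso = Lasso(alpha=l, fit_intercept=False, max_iter=100000, tol=1.0e-4)
    skLasso.fit(x_train, y_train)
    return skLasso.coef_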
In [18]:
fig, ax = plt.subplots(1, 2)
ax[0].set_title('Test RMSE of ridge regression')
ax[0].plot(np.arange(len(ridge_rmses)), ridge_rmses, label='Ridge')
ax[1].set_title('Test RMSE of LASSO')
ax[1].plot(np.arange(len(lasso_rmses)), lasso_rmses, label='LASSO')

for i in range(2):
    ax[i].set_xticks(np.arange(len(lambdaVals)))
    ax[i].set_xticklabels(lambdaVals, rotation=90)
    ax[i].set_xlabel('$\\lambda$')
    ax[i].axhline(lsr_rmse, color='g', label='LSR')
    ax[i].legend()

The RMSE is inflated by a few extreme target values (the number of comments is heavy-tailed). The mean absolute error (MAE) is less sensitive to these outliers and comes out noticeably smaller.

In [24]:
def compute_mae(y_true, y_pred):
    dist = np.abs(y_pred - y_true)
    return np.sum(dist)/dist.size

def compute_test_mae(y_pred):
    return compute_mae(y_test, y_pred)

lsr_mae = compute_test_mae(x_test @ w_lsr)
ridge_maes=[]
lasso_maes=[]

for val in lambdaVals:
    w_ridge = get_w_ridge(val)
    w_lasso = get_w_lasso(val)
    ridge_mae = compute_test_mae(x_test @ w_ridge)
    lasso_mae = compute_test_mae(x_test @ w_lasso)
    ridge_maes.append(ridge_mae)
    lasso_maes.append(lasso_mae)

fig, ax = plt.subplots(1, 2)
ax[0].set_title('Test MAE of ridge regression')
ax[0].plot(np.arange(len(ridge_maes)), ridge_maes, label='Ridge')
ax[1].set_title('Test MAE of LASSO')
ax[1].plot(np.arange(len(lasso_maes)), lasso_maes, label='LASSO')

for i in range(2):
    ax[i].set_xticks(np.arange(len(lambdaVals)))
    ax[i].set_xticklabels(lambdaVals, rotation=90)
    ax[i].set_xlabel('$\\lambda$')
    ax[i].axhline(lsr_mae, color='g', label='LSR')
    ax[i].legend()
/home/alan/.local/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:475: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 19256368.409870643, tolerance: 7689.360900000001
  positive)

Sparsity of LASSO

LASSO tends to produce a sparse weight vector $W$: most coefficients are driven exactly to zero, which makes storing the weights and making predictions cheaper, and the surviving non-zero coefficients directly identify the most important features.
In [30]:
# use the same regularization parameter
w_ridge = get_w_ridge(100.0)
w_lasso = get_w_lasso(10.0)

print('number of zeros in ridge regression:', np.count_nonzero(np.abs(w_ridge) < 1.0e-3))
print('number of zeros in LASSO:', np.count_nonzero(np.abs(w_lasso) < 1.0e-3))
print('non-zero indices of LASSO: ', np.where(np.abs(w_lasso) >= 1.0e-3)[0])
number of zeros in ridge regression: 5
number of zeros in LASSO: 277
non-zero indices of LASSO:  [ 9 20 51]
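
To see which of these features matter most, the surviving coefficients can be ranked by absolute magnitude (a short follow-up sketch using the w_lasso fitted above; the indices refer to column numbers in the raw data):

# rank the non-zero LASSO coefficients by absolute magnitude
nonzero_idx = np.where(np.abs(w_lasso) >= 1.0e-3)[0]
order = np.argsort(-np.abs(w_lasso[nonzero_idx]))
for idx in nonzero_idx[order]:
    print('feature', idx, 'coefficient', w_lasso[idx])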