A Succinct Solution For Building Decision Stumps

Background

I encountered this code while studying at university. The logic was intriguing, so I translated it into Python myself. It applies some mathematics and may look intimidating at first sight, but once you work through it you will appreciate its elegance.

Why decision stump for boosting algorithms?

When it comes to boosting algorithms, we have to talk about weak learners. A weak learner is a classifier that achieves only slightly better than 50% accuracy. A boosting algorithm combines many weak learners, using the training data to learn an appropriate weight for each one, and the resulting strong classifier predicts by a weighted vote of the weak learners, which gives much higher accuracy. A decision stump is a decision tree of depth one (please refer to the decision tree algorithm). Because it is such a simple weak learner, it is unlikely to overfit, which makes it a safe choice as the weak classifier, and its performance is usually good. Of course, you can also choose other classifiers as weak learners and test them out.

What if the prediction accuracy is less than 50%?

In the binary case this is easy: just flip the predicted labels from one class to the other, which turns an error rate above 50% into one below 50%. As you will see, the code handles this through the sign of the score.
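To see why flipping works, here is a tiny sketch (my own illustration, not part of the original code): a threshold rule that is wrong more than half of the time becomes better than chance once its output is negated. In the stump code below, this flip is absorbed into the sign of the score.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1, 1, -1, -1])            # true labels
pred = np.where(x > 2.5, 1, -1)         # a rule pointing in the wrong direction
print(np.mean(pred != y))               # 1.0  -> worse than 50%
print(np.mean(-pred != y))              # 0.0  -> flipping the output fixes it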

Matlab

% build a stump from each component and return the best
function [stump] = build_stump(X,y,w)

d = size(X,2);
w = w/sum(w); % normalized the weights (if not already)

stump = cell(d,1);
werr = zeros(d,1);
for i=1:d,
  stump{i} = build_onedim_stump(X(:,i),y,w);
  stump{i}.ind = i;
  werr(i) = stump{i}.werr;
end;

[min_werr,ind] = min(werr);

stump = stump{ind(1)}; % return the best stump


% build a stump from a single input component

function [stump] = build_onedim_stump(x,y,w)

[xsorted,I] = sort(x); % ascending
Ir = I(end:-1:1); % descending

score_left  = cumsum(w(I).*y(I)); % left to right sums
score_right = cumsum(w(Ir).*y(Ir));  % right to left sums

% score the -1 -> 1 boundary between successive points
score = -score_left(1:end-1) + score_right(end-1:-1:1);

% find distinguishable points (possible boundary locations)

Idec = find( xsorted(1:end-1)<xsorted(2:end) );

% locate the boundary or give up

if (length(Idec)>0),
  [maxscore,ind] = max( abs(score(Idec)) ); % maximum weighted agreement
  ind = Idec(ind(1));

  stump.werr = 0.5-0.5*maxscore; % weighted error
  stump.x0   = (xsorted(ind)+xsorted(ind+1))/2; % threshold
  stump.s    = sign(score(ind)); % direction of -1 -> 1 change
else
  stump.werr = 0.5;
  stump.x0   = 0;
  stump.s    = 1;
end;

Core idea

score_left  = cumsum(w(I).*y(I)); % left to right sums
score_right = cumsum(w(Ir).*y(Ir));  % right to left sums

% score the -1 -> 1 boundary between successive points
score = -score_left(1:end-1) + score_right(end-1:-1:1);

Explanation: “score_left” accumulates the weighted labels to the left of every possible split and “score_right” accumulates the weighted labels to the right of every possible split. “score” is then an array that measures how good each split is: for a given boundary, it is the weighted agreement obtained by predicting +1 on the right side and -1 on the left side. The split with the largest absolute score separates the two label types as cleanly as possible, and its weighted error is 0.5 - 0.5*|score|.
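As a small worked example (my own sketch in NumPy, using the same formulas as the code): take four sorted points with labels -1, -1, +1, +1 and uniform weights of 0.25.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])      # already sorted
y = np.array([-1, -1, 1, 1])
w = np.full(4, 0.25)                    # normalised weights

score_left  = np.cumsum(w * y)          # [-0.25, -0.5, -0.25, 0.0]
score_right = np.cumsum((w * y)[::-1])  # [ 0.25,  0.5,  0.25, 0.0]
score = -score_left[:-1] + score_right[::-1][1:]
print(score)                            # [0.5, 1.0, 0.5]

The absolute score peaks at the middle boundary, so the best split is (2+3)/2 = 2.5 with sign +1 and weighted error 0.5 - 0.5*1.0 = 0.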

Python3

Here is the same logic translated into Python (it uses NumPy, imported below as np).

import numpy as np

#tree stump
class Stump:
    def __init__(self, weighted_error, split, dimension, sign):
        self.weighted_error = weighted_error
        self.split = split
        self.dimension = dimension
        self.sign = sign

#####################################################################
#Build the Decision Tree Stump
#INPUT:    X_train -- the training features
#          y_train -- the training labels
#          w -- weight of each sample
#OUTPUT:   dtStumps[index] -- the best split stump
#####################################################################
def build_DTStump(X_train, y_train, w):
    #normalise the weights (if not already)
    w = w / np.sum(w)
    dimension_num = X_train.shape[1]
    dtStumps = []
    weighted_error = np.zeros(dimension_num)
    #initialise decision tree stumps for each dimension
    for i in range(dimension_num):
        dtStumps.append(build_onedim_stump(X_train[:,i], y_train, w))
        dtStumps[i].dimension = i
        weighted_error[i] = dtStumps[i].weighted_error
    index = np.argmin(weighted_error)
    return dtStumps[index]

#####################################################################
#Build one dimensional Decision Tree Stump
#INPUT:    x -- the training feature values for one dimension
#          y -- the training labels
#          w -- weight of each sample
#OUTPUT:   stump -- the best split stump in this dimension
#####################################################################
def build_onedim_stump(x, y, w):
    stump = Stump(None,None,None,None)

    sortedx = np.sort(x, axis = 0)
    #indices in ascending order
    I = (np.argsort(x, axis = 0)).reshape(len(x))
    #indices in descending order
    Ir = np.flip(I, axis = 0)
    #left to right sums of the weighted labels
    score_left = np.cumsum(w[I]*y[I])
    #right to left sums of the weighted labels
    score_right = np.cumsum(w[Ir]*y[Ir])

    score = -score_left[:-1] + (np.flip(score_right, axis = 0))[1:]

    indices = np.argwhere(sortedx[:-1] < sortedx[1:])[:,0]

    #locate the boundary
    if(len(indices) > 0):
        maxscore = np.max(abs(score[indices]))
        index = np.argmax(abs(score[indices]))
        index  = indices[index]

        stump.weighted_error = 0.5 - 0.5*maxscore
        #get the split value
        stump.split = (sortedx[index]+sortedx[index+1])/2
        stump.sign = np.sign(score[index])
    else:
        stump.weighted_error = 0.5
        stump.split = 0
        stump.sign = 1
    return stump
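Below is one possible way to use it (the stump_predict helper is my own addition for illustration, not part of the original code; it assumes the labels are +1/-1 and that the functions above have been defined).

def stump_predict(stump, X):
    #predict +sign above the split value and -sign below it
    return np.where(X[:, stump.dimension] > stump.split, stump.sign, -stump.sign)

X_train = np.array([[1.0, 5.0],
                    [2.0, 4.0],
                    [3.0, 1.0],
                    [4.0, 2.0]])
y_train = np.array([-1, -1, 1, 1])
w = np.full(len(y_train), 1/len(y_train))   # uniform initial weights

stump = build_DTStump(X_train, y_train, w)
print(stump.dimension, stump.split, stump.sign, stump.weighted_error)
print(stump_predict(stump, X_train))        # matches y_train here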

Discussion

If we use a decision stump as the weak learner, can the boosted classifier still overfit?
The answer is yes. This blog will not go into the mathematical mechanism behind it, but it is known that boosting can still overfit if it is trained for too many iterations.
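To make “iterations” concrete, here is a rough AdaBoost-style training loop (a sketch of the standard algorithm, not code from the original source), reusing build_DTStump and the stump_predict helper from the example above; labels are assumed to be +1/-1 and T is the number of boosting rounds.

def adaboost_train(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start with uniform sample weights
    stumps, alphas = [], []
    for t in range(T):
        stump = build_DTStump(X, y, w)
        pred = stump_predict(stump, X)
        err = max(np.sum(w[pred != y]), 1e-12)   # weighted error of this stump
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this weak learner
        w = w * np.exp(-alpha * y * pred)        # emphasise misclassified samples
        w = w / np.sum(w)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    votes = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(votes)

Each round re-weights the samples so that the next stump focuses on the points the current ensemble gets wrong; running this loop for too many rounds T is exactly the situation the paragraph above refers to.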

Please leave a comment if you have any questions or insights about this blog, or email bowen.liu_owen@qq.com if you would like to help build this website.