A regression tree is essentially a decision tree whose splits predict a numeric target. It is a form of nonparametric regression built from step functions: the idea is that, with a sufficient number of sample splits, any functional form can be approximated arbitrarily well. Regression trees are particularly useful for regression with categorical variables, where traditional nonparametric methods such as kernel and series estimators do not apply [1].
[1] indicates that the literature on regression trees has developed a colorful language for describing tools based on the metaphor of a living tree:
i) A split point is a node. ii) A subsample is a branch. iii) Increasing the set of nodes is growing a tree. iv) Decreasing the set of nodes is pruning a tree.
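To make the step-function idea concrete, here is a minimal sketch (my own illustrative example, not from [1]) that fits a shallow regression tree to a one-dimensional sample and shows that its predictions are piecewise constant:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
# Toy 1-D data: a noisy sine curve (illustrative only).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y_toy = np.sin(x).ravel() + rng.normal(0, 0.1, 80)
# A shallow tree approximates the curve with a handful of constant steps:
# depth 2 gives at most 4 leaves, hence at most 4 distinct predicted values.
toy_tree = DecisionTreeRegressor(max_depth=2).fit(x, y_toy)
grid = np.linspace(0, 6, 200).reshape(-1, 1)
print(np.unique(toy_tree.predict(grid)))

With more splits, the steps become finer and the tree tracks the underlying curve more closely, which is exactly the sense in which a regression tree is a nonparametric estimator.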
In this post, we focus on fitting regression trees with Scikit-Learn. Let's start with a sample data set, described at https://nl.mathworks.com/help/stats/select-predictors-for-random-forests.html. You can download the xlsx version of the data from https://github.com/halilibrahimgunduz/data.
First, we load the car dataset. Second, we focus on predicting a car's fuel efficiency from a number of potential predictor variables: number of cylinders, engine displacement, horsepower, weight, acceleration, model year, and country of origin. Finally, we look at the measurement types of these variables.
import pandas as pd
import matplotlib.pyplot as plt
# Load the car dataset from the Excel file and inspect its structure.
df=pd.read_excel("test.xlsx")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   MPG           398 non-null    float64
 1   Cylinders     406 non-null    int64
 2   Displacement  406 non-null    float64
 3   Horsepower    400 non-null    float64
 4   Weight        406 non-null    int64
 5   Acceleration  406 non-null    float64
 6   Model_Year    406 non-null    int64
 7   Origin        406 non-null    object
dtypes: float64(4), int64(3), object(1)
memory usage: 25.5+ KB
This data set contains 8 variables measured on 406 vehicles. Seven of these variables are quantitative and one is qualitative. We also see that some observations are missing: MPG has 398 non-null values and Horsepower has 400, while the remaining variables have all 406. We shall remove any rows that have a missing value:
df=df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 405
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   MPG           392 non-null    float64
 1   Cylinders     392 non-null    int64
 2   Displacement  392 non-null    float64
 3   Horsepower    392 non-null    float64
 4   Weight        392 non-null    int64
 5   Acceleration  392 non-null    float64
 6   Model_Year    392 non-null    int64
 7   Origin        392 non-null    object
dtypes: float64(4), int64(3), object(1)
memory usage: 27.6+ KB
The variable Origin is qualitative. We use OrdinalEncoder to transform these categorical observations into numeric values:
from sklearn.preprocessing import OrdinalEncoder
enc=OrdinalEncoder()
df[["Origin"]]=enc.fit_transform(df[["Origin"]])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 405
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   MPG           392 non-null    float64
 1   Cylinders     392 non-null    int64
 2   Displacement  392 non-null    float64
 3   Horsepower    392 non-null    float64
 4   Weight        392 non-null    int64
 5   Acceleration  392 non-null    float64
 6   Model_Year    392 non-null    int64
 7   Origin        392 non-null    float64
dtypes: float64(5), int64(3)
memory usage: 27.6 KB
From the test.xlsx file, you can see that this variable has 7 different categories. We can access these categories through enc.categories_, which returns them sorted alphabetically rather than by order of appearance.
enc.categories_
[array([0., 1., 2., 3., 4., 5., 6.])]
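If we later need the original labels back, the fitted encoder can reverse the mapping; here is a quick illustration using its inverse_transform method:

# Map the numeric codes in Origin back to the original category labels.
print(enc.inverse_transform(df[["Origin"]].head()))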
We then define X as the matrix of predictors used in our regression tree and y as MPG (the fuel efficiency of a car):
y=df[["MPG"]]
X=df[["Cylinders","Displacement","Horsepower","Weight","Acceleration","Model_Year","Origin"]]
After saving the features and MPG values to X and y, we create the training and test sets using train_test_split. In my opinion, we should allocate roughly 80% of the data for training and 20% for testing.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=78)
Setting random_state is useful so that your scripts always split the dataset in the same way, keeping the results reproducible.
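As a quick sanity check (a minimal sketch of my own, not part of the original workflow), repeating the split with the same random_state reproduces exactly the same rows:

# Splitting again with random_state=78 yields identical train/test indices.
X_tr2,X_te2,y_tr2,y_te2=train_test_split(X,y,test_size=0.2,random_state=78)
print(X_train.index.equals(X_tr2.index))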
First, we need to import DecisionTreeRegressor and pass it some useful settings. When using a family of functions, it is always worth studying the options they expose; detailed information is available at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html. For instance, I prefer criterion="poisson", which chooses splits by the reduction in Poisson deviance. I also set max_depth=3, since you should cap how far the tree is allowed to grow (see [1] for further details on limiting tree size), and random_state=78 may be selected again. A minimal sketch of the Poisson deviance calculation is given below, before we fit the model.
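For intuition only, here is a sketch of the quantity the "poisson" criterion works with: the Poisson deviance of a node, which scikit-learn exposes as sklearn.metrics.mean_poisson_deviance. The helper function and the example threshold below are my own illustration, not scikit-learn's internal code; a split is attractive when it produces a large reduction in the children's total deviance relative to the parent.

import numpy as np
from sklearn.metrics import mean_poisson_deviance
def split_deviance_reduction(y_parent,y_left,y_right):
    # Deviance of a node that predicts its own mean, weighted by node size.
    def node_dev(y_node):
        y_node=np.asarray(y_node,dtype=float)
        return len(y_node)*mean_poisson_deviance(y_node,np.full_like(y_node,y_node.mean()))
    # Larger values mean the candidate split explains more of the variation.
    return node_dev(y_parent)-(node_dev(y_left)+node_dev(y_right))
# Example (hypothetical threshold): split the training MPG values at Weight >= 3000.
# heavy=X_train["Weight"]>=3000
# print(split_deviance_reduction(y_train["MPG"],y_train["MPG"][~heavy],y_train["MPG"][heavy]))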
from sklearn.tree import DecisionTreeRegressor
# Build a depth-3 regression tree that uses the Poisson split criterion.
Regression_Trees_model=DecisionTreeRegressor(criterion="poisson",max_depth=3,random_state=78)
Regression_Trees_model.fit(X_train, y_train)
# Predict fuel efficiency for the held-out test observations.
predictions=Regression_Trees_model.predict(X_test)
print(predictions)
print("Length of the predictions array: ", len(predictions))
[29.0325 28.9375 29.0325 16.70714286 23.08135593 29.0325
 18.95849057 13.7734375 36.1974359 16.70714286 27.25 13.7734375
 13.7734375 23.08135593 23.08135593 23.08135593 13.7734375 16.70714286
 18.95849057 23.08135593 18.95849057 29.0325 18.95849057 36.1974359
 16.70714286 18.95849057 29.0325 29.0325 36.1974359 29.0325
 29.0325 28.9375 13.7734375 36.1974359 36.1974359 23.08135593
 23.08135593 18.95849057 13.7734375 16.70714286 18.95849057 28.9375
 36.1974359 29.0325 36.1974359 18.95849057 29.0325 13.7734375
 23.08135593 29.0325 23.08135593 13.7734375 29.0325 13.7734375
 18.95849057 28.9375 23.08135593 23.08135593 36.1974359 18.95849057
 18.95849057 13.7734375 27.25 28.9375 13.7734375 18.95849057
 23.08135593 23.08135593 28.9375 23.08135593 36.1974359 23.08135593
 23.08135593 16.70714286 13.7734375 18.95849057 18.95849057 29.0325
 18.95849057]
Length of the predictions array:  79
Note that the number of predicted values (79) matches the number of observations reserved for testing.
y_test
|     | MPG  |
|-----|------|
| 390 | 34.0 |
| 136 | 31.0 |
| 348 | 23.5 |
| 271 | 18.1 |
| 274 | 27.5 |
| ... | ...  |
| 9   | 15.0 |
| 54  | 19.0 |
| 207 | 18.0 |
| 326 | 31.3 |
| 234 | 19.0 |

79 rows × 1 columns
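To put the predicted and actual values side by side, one could build a small comparison frame (a quick sketch; the Predicted_MPG column name is my own choice):

# Align the model's predictions with the actual MPG values in the test set.
comparison=y_test.copy()
comparison["Predicted_MPG"]=predictions
print(comparison.head())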
With predictions and y_test, model performance can be calculated using the root mean square error (RMSE):
import math
import sklearn.metrics
mse = sklearn.metrics.mean_squared_error(y_test,predictions)
rmse = math.sqrt(mse)
print("The difference between actual and predicted values", rmse)
The difference between actual and predicted values 4.110725703480176
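RMSE is not the only yardstick; as a supplementary check (a brief sketch of my own), the same sklearn.metrics module also provides the mean absolute error and the R² score:

from sklearn.metrics import mean_absolute_error, r2_score
# Two complementary summaries of fit quality on the test set.
print("MAE:",mean_absolute_error(y_test,predictions))
print("R2:",r2_score(y_test,predictions))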
Let's visualize our regression tree (classification trees tend to be even more informative in their graphical representation):
from sklearn.tree import plot_tree
plt.figure(figsize=(10,10), dpi=200)
plot_tree(Regression_Trees_model, feature_names=X.columns);
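In addition to the plot, a text summary of the same fitted tree can be printed with scikit-learn's export_text (a small sketch):

from sklearn.tree import export_text
# Print the split rules and leaf predictions as plain text.
print(export_text(Regression_Trees_model,feature_names=list(X.columns)))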
As you will notice from the plot, the tree stops splitting after 3 levels below the root, which is the maximum depth we specified. The significance of what we've obtained lies in our ability to estimate numerical values using a regression tree. In subsequent stages, we will move on to bagging and random forests!
[1] Hansen, B. (2022). Econometrics. Princeton University Press.