XY Titel

Pushing Ordinary Least Squares to the limit with Xy()

André Bleier Blog, Data Science

Introduction to Xy()

Simulation is mostly about answering particular research questions. Whenever the word simulation appears somewhere in a discussion, everyone knows that this means additional effort. At STATWORX we are using simulations as a first step to proof concepts we are developing. Sometimes such a simulation is simple, in other cases a simulation is plenty of work. Though, research questions are always pretty unique, which complicates the construction of a simulation framework. With the Xy() function I try to establish such a simulation framework for supervised learning tasks by trying to find the right balance between capability and simplicity.

xy hexagon logo

I have already introduced this function in a recent blog post. Since then more and more functionalities were implemented and the overall interface was improved. If you are interested in the motivation behind the function, I strongly encourage you to read my last blog post. In the following we are going to analyze OLS coefficients under weak signal to noise ratios and get to know all revisions of the function.

Key Functionalities

The purpose of Xy() is to simulate supervised learning data with maximum degrees of freedom. The user can independently choose how features generate the target by choosing interaction depths, user-defined nonlinearity transformations, categorical variables and all that jazz.
In the recent update I wrapped a package around the function which can be conveniently forked on GitHub or installed via library(devtools):

# Install the package 
# load the package 
# start hacking 
my_first_simulation <- Xy::Xy() 

By default, the function simulates regression learning data with 1000 observations, two linear and two nonlinear features as well as five randomly generated variables, which are not used to generate the target. One main goal of the function is to challenge feature selection algorithms, thus there is an option to generate random variables. Let us create a simple example:

# Simulating regression data with four numeric effects,
# one categorical effect and five irrelevant features 

reg_sim = Xy(n = 1000,  
             numvars = c(2, 2), 
             catvars = c(1, 2), 
             noisevars = 5, 
             family = Xy_task(), 
             nlfun = function(x) {x^2}, 
             interactions = 1, 
             sig = c(1,10),  
             cor = c(0.1,0.3), 
             weights = c(5,10), 
             sigma = NULL, 
             stn = 4, 
             noise.coll = FALSE, 
             intercept = TRUE) 

Xy Simulation
  | + task regression 
  | + observations 1000 
  | + interactions 1D 
  | + signal to noise ratio 4 
  | + effects  
    | - linear 2 
    | - nonlinear 2 
    | - categorical 1 
    | - noise 5 
  | + intervals  
    | - correlation [0.1,0.3] 
    | - weights [5,10] 
    | - sd [1,10] 

Target generating process:  
y = 416.07 + 9.81NLIN_1 + 5.23NLIN_2 + 5.53LIN_1 + 5.69LIN_2 + 6.75DUMMY_1__2 + e ~ N(0,487.15) 

The function comes with a print(), plot() and coef() method to further analyze the simulation object. The print method summarizes the settings that were used for the simulation. On the bottom you can see the target generating process, where you can see the coefficients and their respective weights as well as the noise. There are four key arguments in the function. The n argument controls the number of observations. The user can choose a desired amount of linear and nonlinear features with the numvars argument. Further, categorical variables and their respective levels can be determined with catvars. Of course there are many other functionalities, however going through all these would be a little dry, so why not use a toy example?

Pushing Ordinary Least Squares to the limit

Suppose we are interested how OLS can cope with different signal strengths. The signal (strength) is defined as

    \[\newcommand{\Var}{\operatorname{Var}} signal = \frac{\Var[f(X)]}{\Var[\epsilon]}\]

Hence the signal strength controls how many information about the target can be extracted from our features. Higher values mean that there is relatively more variation in our features opposed to the noise. In such cases we can be more confident about the estimation results of our statistical model. In the following simulation we will try to find the sweet spot for the signal strength. In the Xy() function the signal strength can be manipulated with the stn argument.

# Simulating one supervised learning task with signal to noise ratio of 0.01 

sig_sim = Xy(n = 1000,
             numvars = c(2, 0),
             catvars = 0,
             noisevars = 0,
             family = Xy_task(),
             sig = c(1,30),
             cor = 0,
             weights = c(-10,10),
             stn = 0.01,
             intercept = TRUE)

# extract the training data
training <- sig_sim$data

# underlying effects
(Intercept)       LIN_1       LIN_2  
   44.68159    -6.07000     3.50000
coef(lm(y ~ ., data = training))
`(Intercept)`         LIN_1         LIN_2  
   -47.455572     -4.807992      2.936850 

So in this setup we simulate a regression target which is generated by 2 linear features and dominating noise, since the signal to noise ratio is very weak. We can see that the OLS estimators are pretty off, which is not surprising with such a weak ratio.
In the following we are going to analyze the accuracy of the estimator by repeating the above simulation a thousand times for a single signal to noise ratio. We are going to analyze a total of nine different signal to noise ratios.

# create a signal to noise ratio 
ratios <- c(rev(1/seq(1, 5, 1)), seq(2, 5, 1)) 
[1] 0.2000000 0.2500000 0.3333333 0.5000000 1.0000000 2.0000000 3.0000000 4.0000000 
[9] 5.0000000 

You can find the complete code of this simulation on my GitHub. I’ve summarized the results in the following figure.

You can observe the mean squared error of estimated coefficients against the real underlying coefficients of the Xy() simulation. The dotted line marks the break-even point of a one to one signal to noise ratio. The marginal decrease in the MSE is saturates from a signal of 2 onwards. Out of curiosity I ran this simulation with a signal to noise ratio of 1000, to get a sense of the absolute differences in terms of MSE between a signal of 5 and 1000. The overall MSE of the OLS coefficients estimated with such a strong signal lies within the fourth decimal place, whereas it is only in the second decimal place at a signal to noise ratio of 5.

I hope you got a sense of the function and by now identified some applications for yourself. Feel free to drop me some feedback on the functionalities and on the overall interface. I am glad to improve the function and I will try to implement new functionalities on request as soon as possible.

In my next blog post I will introduce an upcoming feature of Xy() which will be especially useful in the field of machine learning.

Über den Autor
André Bleier

André Bleier

André ist im Data Science Team und organisiert intern unsere Afterwork Veranstaltungen. In seiner Freizeit treibt er Sport und programmiert gerne kleine und große R Codes.