Applying Differencing on a time series, before or after train and test split?

I am attempting to improve my RNN model by making my dependent variable, a stock price, stationary. I aim to do this by removing the trend with a log transformation and then performing moving-average differencing to remove noise.

I have a function that first log-transforms the series, to penalise the larger values, and then performs rolling-mean differencing on the log values.

import numpy as np

def moving_avg_differencing(col, n_roll=30, drop=False):
    # log-transform to damp large values, then subtract the rolling mean
    log_values = np.log(col)
    moving_avg = log_values.rolling(n_roll).mean()
    ma_diff = log_values - moving_avg
    return ma_diff.dropna() if drop else ma_diff

My conundrum is: if I perform this differencing before my train-val-test split, my validation and test sets will be differenced against rolling means computed from values that precede the split, i.e. from training data.

If I perform the differencing after my train-val-test split and transform each set individually, the first n_roll - 1 rows of my validation and test sets will be NaN, as shown in the sketch below. This seems messy.
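For illustration, suppose the prices sit in a pandas Series prices of 1,000 rows with a 70/15/15 split (the name and sizes are just placeholders):

train, val, test = prices[:700], prices[700:850], prices[850:]

# differencing each split separately leaves NaNs at the head of every split
val_diff = moving_avg_differencing(val)    # first n_roll - 1 rows are NaN
test_diff = moving_avg_differencing(test)  # same again for the test set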

Is there a better approach to differencing?

Topic: python-3.x, time-series, python

Category: Data Science


According to the econometrics literature, the standard approach is to convert your data into log returns: $r_t = \log(P_t / P_{t-1})$, where $P_t$ is the price at timestep $t$. This improves results because it de-trends the input, and the resulting series is close to stationary, unlike raw prices.
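A minimal sketch of that transformation with pandas (the function name, and the assumption that the prices are held in a pandas Series, are illustrative rather than part of the question):

import numpy as np
import pandas as pd

def log_returns(prices: pd.Series) -> pd.Series:
    # r_t = log(P_t / P_{t-1}), i.e. the first difference of the log prices
    return np.log(prices / prices.shift(1)).dropna()

Equivalently, np.log(prices).diff().dropna() produces the same series.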

There is little difference between doing this before or after the train-test split, because each log return depends only on the immediately preceding row. If you do it after the split, you simply lose the first row of each split.
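As a quick check of that claim (continuing the illustrative prices Series from above, with a hypothetical 80/20 split):

split = int(len(prices) * 0.8)
returns_all = log_returns(prices)           # transform first, then split
returns_test = log_returns(prices[split:])  # split first, then transform

# identical values on the overlap; only the return at the first test timestamp is missing
np.allclose(returns_all.loc[returns_test.index], returns_test)  # True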
