Use Correlation to predict Market Index

Published in

Analytics Vidhya

5 min read · Jan 1, 2021

A market index is built from the stock prices of a list of major companies, so there should be a correlation between the index and those prices. Here I use a machine learning model (LSTM) to predict the market index from the historical data of individual stocks.

Data Collection

I use the Python package yfinance to get daily stock prices. I downloaded 3 years of figures, including “Open”, “Close”, “High”, “Low”, and “Volume”.

Data Preprocessing

The target variable of the prediction is the next-day rate of return of the index ETF close price, defined as

target(t) = ( Close(t+1) - Close(t) )/Close(t) * 100%

Since raw trading volumes are very large numbers, I transformed them into:

log-volume = log(volume+1)

P.S. The +1 avoids taking the log of zero.
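To make the two transformations concrete, here is a minimal NumPy sketch. The function names and the sample numbers are made up for illustration and are not taken from the original notebook.

```python
import numpy as np

def next_day_return_pct(close):
    """Percentage return from day t to day t+1; the last day has no target (NaN)."""
    close = np.asarray(close, dtype=float)
    target = np.full(close.shape, np.nan)
    target[:-1] = (close[1:] - close[:-1]) / close[:-1] * 100.0
    return target

def log_volume(volume):
    """log(volume + 1), computed with log1p so a zero-volume day maps to 0."""
    return np.log1p(np.asarray(volume, dtype=float))

# Made-up sample values, only to show the shapes involved.
close = [100.0, 101.0, 99.99]
volume = [0, 1_500_000, 2_000_000]

target = next_day_return_pct(close)
log_vol = log_volume(volume)
print(target)
print(log_vol)
```

The last row has no next-day close, so its target is NaN and that row would have to be dropped before training.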

Feature Engineering

Since the price of the index ETF is assumed to depend on historical stock prices, I used the figures “Open”, “Close”, “High”, “Low”, and “Volume” to construct the features.

1. For each column in [“Open”, “Close”, “High”, “Low”, “Volume”], take the lagged values of the previous 5 days, that is, the figure from i days earlier (i ranges from 1 to 5).

2. Compute the lag-i-return by:

lag-i-return(t) = ( value(t) - value(t-i) )/ value(t-i) * 100%

3. Apply a PCA transform to all the lag-i features of [“Open”, “Close”, “High”, “Low”, “Volume”], 5 × 5 = 25 features in total.

4. Use the first 3 PCA components as the final features, because the first 3 components already explain over 80% of the total variance.
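The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original notebook code: the PCA is done with a plain NumPy SVD, the columns are standardized first (the article does not say whether they were), the PCA input is taken to be the lag-i returns, and the price series are synthetic random walks.

```python
import numpy as np

def lag_returns(values, max_lag=5):
    """Shape (T - max_lag, max_lag); column i-1 holds the lag-i return in percent."""
    values = np.asarray(values, dtype=float)
    cols = []
    for i in range(1, max_lag + 1):
        now = values[max_lag:]
        past = values[max_lag - i : len(values) - i]
        cols.append((now - past) / past * 100.0)
    return np.column_stack(cols)

def pca_fit_transform(X, n_components=3):
    """PCA via SVD on standardized columns; returns (scores, explained variance ratios)."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # assumption: standardize first
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    scores = Z @ Vt[:n_components].T
    return scores, explained[:n_components]

# Synthetic stand-in for one stock's 60 days of data (hypothetical values).
rng = np.random.default_rng(0)
names = ["Open", "Close", "High", "Low", "Volume"]
columns = {name: 100.0 + np.cumsum(rng.normal(0, 1, 60)) for name in names}

features = np.hstack([lag_returns(columns[name]) for name in names])  # 5 x 5 = 25 columns
scores, ratio = pca_fit_transform(features, n_components=3)
print(features.shape, scores.shape, ratio.sum())
```

Note that the lag-i return divides by the past value, so a column containing zeros (a raw volume of 0, for instance) would need special handling before this step.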

Model training

I feed the 3 PCA features and the target variable into an LSTM model, a recurrent neural network commonly used for time series. Of the 3 years of downloaded data, about 80% is used for model training and the remaining 20% is held out as unseen data for model testing. The LSTM model is coded with PyTorch-Lightning. You can find the code, and the notebook as well, in my Git.
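As a rough illustration of how daily rows can be turned into LSTM inputs and split 80/20, here is a small NumPy sketch. The helper names, the chronological (unshuffled) split, and the random data are my assumptions for the example; the original notebook may organize this differently.

```python
import numpy as np

def make_windows(features, target, seq_len):
    """X[k] holds rows k..k+seq_len-1; y[k] is the target of the window's last row."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(features) - seq_len + 1
    X = np.stack([features[k : k + seq_len] for k in range(n)])
    y = target[seq_len - 1 :]
    return X, y

def chronological_split(X, y, train_frac=0.8):
    """Keep time order: the first train_frac of windows train, the rest test."""
    cut = int(len(X) * train_frac)
    return (X[:cut], y[:cut]), (X[cut:], y[cut:])

# Hypothetical data: 100 days, 3 PCA features, a sequence length of 5.
rng = np.random.default_rng(1)
pca_features = rng.normal(size=(100, 3))
returns = rng.normal(size=100)

X, y = make_windows(pca_features, returns, seq_len=5)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y)
print(X.shape, len(train_X), len(test_X))
```

Splitting by time rather than at random keeps the test windows strictly later than the training windows, which is what "unseen data" means for a price series.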

Hyperparameter Tuning

I used the Ray Tune package for hyper-parameter tuning of the PyTorch model. The hyper-parameters include:

sequence length of the time series
no. of hidden states in the LSTM layer
batch size for the model training
dropout rate for the LSTM output
learning rate (lr) for model training
no. of LSTM layers
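To show where each of these hyper-parameters enters the model, here is a minimal sketch in plain PyTorch. It is not the PyTorch-Lightning code from the original notebook: the class name, the default values, and the choice to read only the last time step are assumptions for illustration.

```python
import torch
from torch import nn

class LSTMRegressor(nn.Module):
    """LSTM over a window of daily features, predicting one next-day return."""

    def __init__(self, n_features=3, hidden_size=50, num_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_features,
            hidden_size=hidden_size,   # no. of hidden states
            num_layers=num_layers,     # no. of stacked LSTM layers
            batch_first=True,          # input shape: (batch, seq_len, n_features)
        )
        self.dropout = nn.Dropout(dropout)   # dropout on the LSTM output
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)                # (batch, seq_len, hidden_size)
        last = self.dropout(out[:, -1])      # keep the last time step only
        return self.head(last)               # (batch, 1)

# One batch of 30 windows, sequence length 5, 3 PCA features (random values).
model = LSTMRegressor(hidden_size=10, num_layers=2, dropout=0.1)
x = torch.randn(30, 5, 3)
pred = model(x)
loss = nn.functional.mse_loss(pred, torch.zeros(30, 1))
print(pred.shape, loss.item())
```

The sequence length and batch size show up only in the shape of `x`, while the learning rate would be passed to the optimizer.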

"seq_len": tune.choice([5, 10]),
"hidden_size": tune.choice([10, 50, 100]),
"batch_size": tune.choice([30,60]),
"dropout": tune.choice([0.1, 0.2]),
"lr": tune.loguniform(1e-4, 1e-1),
"num_layers": tune.choice([2, 3, 4])

Performance evaluation

Although the model predicts the future rate of return of the market index and is trained with an MSE loss, I evaluate performance only by accuracy, that is, whether the predicted direction (rise or drop) matches the actual one.

This is because real trading only cares whether a trade makes or loses money. If the index is predicted to rise and it actually rises, the trade is profitable. If it is predicted to drop and it actually drops, we can still profit by buying the inverse ETF.

MSE, however, does not capture this. If the predicted value is +0.1% and the actual value is -0.1%, the error is only 0.2%, yet the trade loses money. If the predicted value is +0.3% and the actual rise is just +0.1%, the error is the same 0.2%, but the direction is right and the trade is still profitable.
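The two cases above can be checked with a few lines of Python. The function names are mine, and counting a zero return as a miss is an assumption of this sketch.

```python
def directional_accuracy(predicted, actual):
    """Fraction of days where prediction and outcome have the same sign."""
    hits = sum(1 for p, a in zip(predicted, actual) if p * a > 0)
    return hits / len(predicted)

def mse(predicted, actual):
    """Mean squared error between predicted and actual returns."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)

# Case 1: predicted +0.1%, actual -0.1% (wrong direction).
wrong_direction = ([0.1], [-0.1])
# Case 2: predicted +0.3%, actual +0.1% (right direction).
right_direction = ([0.3], [0.1])

for name, (pred, act) in [("wrong", wrong_direction), ("right", right_direction)]:
    print(name, "MSE:", round(mse(pred, act), 4),
          "accuracy:", directional_accuracy(pred, act))
```

Both cases have essentially the same squared error, but only the second one counts as a hit, which is why accuracy is the metric used here.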

Stock chosen for training and testing

As I have limited computing resources, I chose only 10 stocks to predict the market index. For each of the 10 stocks, I joined its historical prices with the market index, and then ran the model training, testing, and hyper-parameter tuning.

I compared the HK market (HSI) and the US market (NASDX).

For HSI, the chosen stocks are:

For the US market, the chosen stocks are:

‘FB’, ‘AAPL’, ‘AMZN’, ‘GOOG’, ‘NFLX’, ‘SQ’, ‘MTCH’, ‘AYX’, ‘ROKU’, ‘TTD’

Here is the final result.

Using Mei Tuan (3690) data to predict HSI achieves the highest accuracy, 58%. Tencent (700) and Ping An (2318) also achieve a high accuracy of ~56%.
For the US market, Facebook (FB) and Netflix (NFLX) achieve the highest accuracies, ~61% and 58% respectively.
Here is the graph of the HSI testing data (yellow line) and the predicted values (red line) using Mei Tuan (3690) data.

Trading Strategy

A very simple trading strategy could look like this:

1. For each of the 10 chosen stocks, compute the predicted future rate of return of the market index.

2. Shortlist the results of the stocks whose prediction accuracy is > 50%.

3. Compute the expected value of the future rate of return as the weighted average of the predicted rates of return, where the prediction made with stock-i data is weighted by the accuracy of the stock-i prediction.

E(Return) = sum( accuracy(i) * predicted value(i) ) / sum( accuracy(i) )

4. If the expected return > a predefined threshold (e.g. the 0.03% transaction fee), we buy the index ETF with take-profit/stop-loss (e.g. 0.5%).

5. If the expected return is negative, we can still do the same with the inverse ETF.

6. Repeat the same strategy every day with the latest daily data.

7. (Optional) Close all positions at the market close every day.
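Steps 2 to 5 can be sketched as below. The accuracies for 3690, 700 and 2318 are the figures reported above, while the predicted returns, the extra ticker "XXXX", the function names, and the symmetric threshold for the inverse ETF are assumptions made up for this example.

```python
def expected_return(predictions, accuracies, min_accuracy=0.5):
    """Accuracy-weighted mean of predicted returns (%), using stocks above min_accuracy."""
    kept = [(accuracies[s], predictions[s])
            for s in predictions if accuracies[s] > min_accuracy]
    if not kept:
        return None  # no stock passed the shortlist
    return sum(a * p for a, p in kept) / sum(a for a, _ in kept)

def decide(expected, threshold=0.03):
    """Map the expected return (%) to an action; threshold covers the transaction fee."""
    if expected is None:
        return "hold"
    if expected > threshold:
        return "buy index ETF"
    if expected < -threshold:   # assumption: same threshold on the short side
        return "buy inverse ETF"
    return "hold"

# Predicted next-day index returns (%) per stock model: hypothetical values.
predictions = {"3690.HK": 0.4, "0700.HK": 0.2, "2318.HK": -0.1, "XXXX": 1.0}
accuracies = {"3690.HK": 0.58, "0700.HK": 0.56, "2318.HK": 0.56, "XXXX": 0.48}

e = expected_return(predictions, accuracies)
print(round(e, 4), decide(e))
```

The hypothetical "XXXX" model is dropped by the shortlist because its accuracy is below 50%, so its large prediction does not affect the weighted average.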

Profit/Loss Backtesting

For simplicity, I use the ETF (7200.HK) as the target rate of return, and backtest only with this simplified trading strategy:

If the predicted rate of return (p[i][0]) is positive, we buy the index ETF at the market open and sell it at the market close of the same day, taking the actual return (y[i][0]) as P/L.

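# p[i][0]: predicted return (%) for day i; y[i][0]: actual return (%) for day i
n = len(p)
total, hit, bal = 0, 0, 0.0   # trades taken, profitable trades, cumulative P/L (%)

for i in range(n):
    if p[i][0] > 0:                      # trade only when a rise is predicted
        total = total + 1
        if y[i][0] > 0:
            hit = hit + 1
            delta = min(y[i][0], 3)      # take-profit caps the gain at 3%
        else:
            delta = max(y[i][0], -3)     # stop-loss caps the loss at 3%

        bal = bal + delta

avg = bal / total if total else 0.0      # average P/L per traded day (%)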

The code above calculates the daily P/L (delta). It assumes each executed order is bounded by a take-profit limit and a stop-loss limit at 3%. Finally, it returns the daily average (bal/total).

The backtest result is:

Average daily profit of 0.2% to 0.3%

Written by Matthew Leung

No responses yet