Stock price prediction through an artificial intelligence model using basic, technical, and macroeconomic indicators

Eui-Bang Lee1 and Heon Baek2*

*Correspondence:
Heon Baek,
huny01017@hanmail.net

Received: 14 November 2024; Accepted: 08 January 2025; Published: 17 May 2025

License: CC BY 4.0

Copyright Statement: Copyright © 2024; The Author(s).

This study aims to predict future stock values more accurately. We propose a stock price prediction model based on an artificial intelligence model that uses basic stock price data together with technical and macroeconomic indicators. The experiments show that the model whose input features add technical indicators to the actual stock index (basic indicators) outperforms both the model using basic indicators alone and the model combining basic, technical, and macroeconomic indicators. Comparing artificial intelligence algorithms, the LightGBM model performed better than the deep neural network and random forest models.

Keywords: stock price prediction, macroeconomic indicator, technical indicator, LightGBM, random forest

1. Introduction

Predicting the stock market is a primary concern for stock traders, individual investors, and portfolio managers. Due to the COVID-19 pandemic, stock markets around the world plummeted. The stock markets of major countries began to recover after the sharp decline of 2020 relative to 2019, because liquidity in the stock market increased as the share of the IT industry in the stock market expanded (1). In 2021, the recovery accelerated sharply, and in 2022, a sharp return to pre-pandemic levels began. However, it is not easy to predict trading patterns and stock prices, and gauging market direction has become even more difficult due to complex international relations and economic indicators.

Recently, research has been conducted using actual stock indices and technical indicators to predict stock indices influenced by various market fluctuations. In addition, research is being conducted using macroeconomic indicator data, which are important indicators of national management and of economic growth or fluctuation (2, 3). As stock price prediction research using artificial intelligence algorithms is conducted across various stock markets, alongside different data-combination methods, the prospect of helping users make investment decisions with high prediction performance is increasing (4). In particular, the performance of artificial intelligence algorithms is improving as new algorithms are continuously introduced. Machine learning and deep learning are popular in the stock prediction market due to their efficient big-data-based modeling methods, which depend less on prior knowledge (5, 6).

Therefore, in this study, to predict stock price fluctuations more accurately, basic stock price data, technical indicators derived from the stock price data, and various macroeconomic indicators are generated, and these extended input features are applied to machine learning algorithms to test whether they improve stock price prediction. For the forecast period, we compare the closing-price prediction performance for horizons from the next day up to 5 days ahead.

2. Literature review

2.1. Stock price prediction

There has been much research on stock price prediction in the stock market, and experiments using various methodologies are still ongoing. In particular, with the advent of artificial intelligence, various techniques are being explored for predicting stock market movements using machine learning methodologies (7). Recent related studies in this field are as follows, and a brief overview is shown in Table 1. Recognizing the difficulty of forecasting stock prices, tree bagging, random forest, and logit models have been developed under the premise that they are more successful at predicting stock price direction; a performance comparison suggested that tree bagging and random forest are the more useful methods (8). Zhang and Bai (9) created a model to predict the prices of 10 high-yielding stocks in the CSI 300 index, in which random forest outperformed an ANN and logistic regression. Sharma and Juneja (10) used LSBoost and showed that a model using technical indicators achieved better prediction performance than support vector regression (SVR). Bucci (11) developed a model to predict the volatility of the S&P 500 index; capturing the long-term dependence of volatility through LSTM (long short-term memory) and NARX (nonlinear autoregressive with exogenous inputs) neural network models was shown to improve forecasting accuracy even during periods of high volatility. Guo et al. (12) presented a hybrid model combining the LSTM and LightGBM models that showed high performance.

TABLE 1
www.bohrpub.com

Table 1. Related study summary.

In addition, studies using data processed to suit the characteristics of time series data have been conducted. Yong et al. (13) created a deep neural network (DNN) model to predict the FTSE Straits Times Index t days in the future and proposed a trading system with 70% profitability. Montenegro and Molina (14) presented a sliding window technique based on S&P 500 index data, which has given scholars experience with new approaches and supported investment decisions in the stock market. Wu et al. (15) combined the k-means and AprioriAll algorithms into a model and showed that it can be used over long periods to make profits in real markets. Wen et al. (16) presented results showing that models using augmentation methods for time series classification had low accuracy, but that performance improved slightly when a scaling method was applied to the DNN algorithm; accordingly, in this study, we developed our models by applying the scaling method to the machine learning and deep learning algorithms.

As the studies discussed above show, stock prediction research uses a variety of methodologies to test predictability, typically ML algorithms and technical indicators. However, stock prices can be affected by macroeconomic factors in addition to information about the issuing company, the transaction price, and the volume. Kim and Kim (17) analyzed the correlation between macroeconomic variables and stock prices; Granger causality analysis showed that many variables were correlated. In this study, various technical and economic indicators are used, and stock price prediction is performed using a range of recent artificial intelligence algorithms.

2.2. Algorithm

2.2.1. Random forest

Random forest (RF) is an algorithm that performs machine learning based on bagging (bootstrap aggregating), an ensemble learning method. RF draws a final conclusion by aggregating the results of multiple decision trees, each trained on a subdataset bootstrapped from the single original training set (18, 19). The random forest algorithm further reduces prediction error by decreasing the correlation between decision trees. It is robust to noise and outliers, and as the number of trees increases, the overfitting problem of individual decision trees is mitigated. In addition, because several predictors are randomly combined during the random node optimization process, stable results are obtained even when there are many predictors, and the influence of individual predictors can be analyzed without the model being dominated by any single variable (20).
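The bagging procedure described above can be sketched in a few lines of plain Python. The toy data, the depth-1 "stump" base learner, and the half-of-features sampling rule are illustrative assumptions for brevity, not the configuration used in this study.

```python
import random
from statistics import mean

def fit_stump(X, y):
    """Depth-1 regression tree over a random subset of features
    (a toy stand-in for the random node optimization described above)."""
    feats = random.sample(range(len(X[0])), max(1, len(X[0]) // 2))
    best_sse, best = float("inf"), None
    for j in feats:
        for t in sorted({row[j] for row in X}):
            left = [yi for xi, yi in zip(X, y) if xi[j] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[j] > t]
            if not left or not right:
                continue
            sse = (sum((v - mean(left)) ** 2 for v in left)
                   + sum((v - mean(right)) ** 2 for v in right))
            if sse < best_sse:
                best_sse, best = sse, (j, t, mean(left), mean(right))
    return best

def predict_stump(stump, x):
    j, t, left_mean, right_mean = stump
    return left_mean if x[j] <= t else right_mean

def random_forest(X, y, n_trees=25):
    """Bagging: fit each tree on a bootstrap resample, average the predictions."""
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(len(X)) for _ in range(len(X))]  # with replacement
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        if stump is not None:
            forest.append(stump)
    return lambda x: mean(predict_stump(s, x) for s in forest)

random.seed(0)
# Toy data: the target depends only on the first feature; the second is noise.
X = [[float(i), random.random()] for i in range(20)]
y = [10.0 if row[0] >= 10 else 0.0 for row in X]
model = random_forest(X, y)
print(model([15.0, 0.5]), model([2.0, 0.5]))
```

Because each stump sees a different bootstrap sample and a different feature subset, the trees are decorrelated and their averaged prediction is more stable than any single tree's.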

2.2.2. LightGBM

LightGBM is a gradient boosting decision tree model developed by Microsoft in 2017. Compared with existing machine learning algorithms, it reduces computation time and memory consumption and is optimized for parallel learning (21). While conventional tree-based algorithms grow trees level-wise, LightGBM uses a leaf-wise tree partitioning method.
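The difference between level-wise and leaf-wise growth can be illustrated with a toy tree whose split gains are hypothetical numbers chosen only for this example:

```python
import heapq

# Hypothetical split gains per node (illustrative numbers, not real data).
gains = {0: 9.0, 1: 1.0, 2: 8.0, 3: 0.5, 4: 0.2, 5: 7.0, 6: 0.1}
children = {0: (1, 2), 1: (3, 4), 2: (5, 6)}

def level_wise(budget):
    """Conventional growth: split every leaf on a level before going deeper."""
    order, frontier = [], [0]
    while frontier and len(order) < budget:
        next_frontier = []
        for node in frontier:
            if len(order) == budget:
                break
            order.append(node)
            next_frontier.extend(children.get(node, ()))
        frontier = next_frontier
    return order

def leaf_wise(budget):
    """LightGBM-style growth: always split the single leaf with the best gain."""
    order, heap = [], [(-gains[0], 0)]
    while heap and len(order) < budget:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in children.get(node, ()):
            heapq.heappush(heap, (-gains[child], child))
    return order

print(level_wise(3))  # → [0, 1, 2]: breadth-first, regardless of gain
print(leaf_wise(3))   # → [0, 2, 5]: follows the high-gain branch instead
```

With the same splitting budget, leaf-wise growth spends it where the loss reduction is largest, which is why LightGBM tends to reach lower loss with fewer splits (at some risk of deeper, overfit-prone trees on small data).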

2.2.3. Deep neural network

A DNN uses more layers than a conventional artificial neural network, which greatly increases its power to represent features of the data; the greater the number of hidden layers, the greater the feature extraction capacity (22). The DNN is trained with the back-propagation algorithm (23). To improve the learning performance of the built model, optimal results can be derived by appropriately applying various techniques that address the model's weaknesses.
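As a minimal illustration of the layered structure (forward pass only; back-propagation training is omitted), the weights below are arbitrary toy values, not parameters from this study:

```python
def relu(v):
    """Rectified linear activation applied elementwise."""
    return [max(0.0, x) for x in v]

def dense(x, weights, biases):
    """Fully connected layer: weights[j] is the weight vector of neuron j."""
    return [sum(xi * wi for xi, wi in zip(x, w)) + b
            for w, b in zip(weights, biases)]

def forward(x, layers):
    """Dense + ReLU for every hidden layer, then a linear output layer."""
    *hidden, (w_out, b_out) = layers
    for w, b in hidden:
        x = relu(dense(x, w, b))
    return dense(x, w_out, b_out)

# Arbitrary toy weights: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 2.0, 1.0]], [0.1]),
]
print(forward([2.0, 1.0], layers))  # → [4.1]
```

Stacking more `(weights, biases)` pairs into `layers` adds hidden layers, which is exactly the depth the paragraph above credits with greater feature extraction capacity.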

3. Methodology

The analysis environment of this study is shown in Figure 1. The analysis data were stored on network-attached storage, and the server was operated using a laptop and a tablet PC. Data preprocessing includes replacing null values according to a standard rule and generating technical indicators. Features are then extracted for final use and saved to a file. The training ratio is 80% and the test ratio is 20%. The data were then min-max scaled and converted into sliding-window datasets, after which principal component analysis was performed conditionally. After models are created with the various algorithms, the results are aggregated through scoring.
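The scaling and windowing steps of this pipeline can be sketched as follows. The window length and the integer stand-in series are illustrative choices only; note that the training min/max is reused on the test split to avoid look-ahead leakage:

```python
def minmax_scale(series, lo=None, hi=None):
    """Scale values to [0, 1]; fit lo/hi on the training data only."""
    lo = min(series) if lo is None else lo
    hi = max(series) if hi is None else hi
    return [(v - lo) / (hi - lo) for v in series], lo, hi

def sliding_windows(series, window, horizon=1):
    """Pair each window-length slice with the value `horizon` steps ahead."""
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])
        y.append(series[i + window + horizon - 1])
    return X, y

closes = [float(v) for v in range(100)]  # stand-in for a closing-price series
split = int(len(closes) * 0.8)           # 80% train / 20% test, in time order
train, test = closes[:split], closes[split:]
train_scaled, lo, hi = minmax_scale(train)
test_scaled, _, _ = minmax_scale(test, lo, hi)  # reuse the training min/max
X_train, y_train = sliding_windows(train_scaled, window=5)
print(len(X_train), y_train[0])
```

Scaled test values may fall outside [0, 1] when the test period moves beyond the training range; this is expected and preferable to fitting the scaler on future data.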

Figure 1. Analysis environment.

3.1. Data source and descriptive analysis

The major indices were selected from the indices of major countries used in previous studies, together with exchange rate information between major currencies and a digital currency. Futures included major raw materials and foods. For the analysis, we extracted 12 years of stock price data from finance.yahoo.com. Two of those years are used only to generate technical variables, not for training or testing. After null processing, the variables used are stock indices (KOSPI, KOSDAQ, FTSE, S&P 500, Dow Jones, NASDAQ, Nikkei, and HSI); exchange rates (Korea, UK, EUR, Japan, China, and Bitcoin); bonds (10-year and 2-year); and futures (Gold, Copper, Crude Oil, Natural Gas, Corn, Rough Rice, Soybean Meal, and Coffee). The technical variables are computed from the Yahoo Finance data: simple moving average (SMA), weighted moving average (WMA), Relative Strength Index (RSI), Stochastic %K and %D, and Moving Average Convergence Divergence (MACD).
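A few of these indicators can be computed directly from a closing-price series, as sketched below. Note that this RSI uses a plain n-period average of gains and losses, a simplification of the Wilder-smoothed version commonly used in practice, and the sample closes are made-up numbers:

```python
def sma(prices, n):
    """Simple moving average of the last n closes."""
    return sum(prices[-n:]) / n

def wma(prices, n):
    """Weighted moving average: the most recent close gets the largest weight."""
    window = prices[-n:]
    weights = range(1, n + 1)
    return sum(w * p for w, p in zip(weights, window)) / sum(weights)

def rsi(prices, n=14):
    """Relative Strength Index from average gain/loss over the last n changes
    (simplified: plain averages rather than Wilder smoothing)."""
    window = prices[-(n + 1):]
    deltas = [b - a for a, b in zip(window, window[1:])]
    gains = sum(d for d in deltas if d > 0) / n
    losses = -sum(d for d in deltas if d < 0) / n
    if losses == 0:
        return 100.0
    return 100 - 100 / (1 + gains / losses)

closes = [100, 101, 103, 102, 105, 107]  # made-up sample closes
print(sma(closes, 5), wma(closes, 3), rsi(closes, 5))
```

Each call returns a single point of the indicator; applying the functions over a rolling view of the full series yields the indicator columns used as input features.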

3.2. Model optimization

The model training and optimization process was performed as follows. First, the training set was divided and various parameters were tested using 5-fold cross-validation: for each candidate parameter setting, the model is trained on combinations of the five divided training folds, and the best parameters are selected according to the criteria before performance is measured on the test set. This method was used to train the random forest and LightGBM models. Second, early stopping, a method mainly used to prevent overfitting of DNN models, halts training when the loss computed on a validation set stops improving while the training set is being learned. In this study, the training and validation sets were split 8:2, and training was stopped early when the reduction in validation loss fell below 0.00001.
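The two procedures can be sketched as follows. The contiguous fold layout and the patience of one epoch are simplifying assumptions; the paper specifies only the 0.00001 improvement threshold:

```python
def kfold_indices(n, k=5):
    """Split range(n) into k contiguous validation folds; the remaining
    indices form the training part for each fold."""
    fold = n // k
    splits = []
    for i in range(k):
        val = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in set(val)]
        splits.append((train, val))
    return splits

def early_stop_epoch(val_losses, min_delta=0.00001, patience=1):
    """Return the epoch at which training halts: the first epoch where the
    validation loss has failed to improve by min_delta for `patience` epochs."""
    best, bad = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if best - loss > min_delta:
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= patience:
                return epoch
    return len(val_losses) - 1

splits = kfold_indices(10, k=5)
print(len(splits), splits[0][1])                          # 5 folds; first val fold
print(early_stop_epoch([0.50, 0.40, 0.39, 0.39, 0.38]))   # → 3
```

In the last call, training stops at epoch 3 because the loss fails to improve on epoch 2's value by more than `min_delta`, even though a better value arrives later; larger `patience` values trade longer training for tolerance of such plateaus.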

4. Analysis results and discussion

4.1. Results

The following steps are taken for prediction: (1) extract 12 years of data; (2) use the first 2 years only for technical indicator generation; (3) use 8 of the remaining 10 years for training/validation; and (4) report results on the remaining 2 years of test data.

The test set results are reported in Tables 2–6. The data collection period was the 12 years before the reference date (FTAS 2022/11/03, GSPC 2022/10/31, HSI 2022/11/04, KS11 2022/10/28, and N225 2022/11/02), of which 10 years were used for training/validation/testing; the remaining 2 years were used to calculate moving-average features. The test set must cover a time interval not included in the training/validation set, so 8 years were used for training/validation and 2 years for testing. That is, data from 2020 to 2022 were used for testing (KOSPI test set: 2020/10/29 to 2022/10/28; training set: 2012/10/29 to 2020/10/28). For training/validation, various parameters were tested to improve performance, because the pattern of the stock market changed greatly, with a sharp rise in 2020 and a further rise in late 2021 and 2022 after a decline in 2019.

Table 2. Prediction performance of the KOSPI.

Table 3. Prediction performance of the S&P 500 Index.

Table 4. Prediction performance of the FTSE.

Table 5. Prediction performance of the Nikkei Index.

Table 6. Prediction performance of the Hang Seng Index.

4.2. Discussion

A summary is given in Table 7. We applied several models to predict the composite stock indices of the major countries and selected the optimal model in each case. Each optimal model is trained according to the parameter selection method presented above and is chosen from nine candidates: three feature sets (basic, basic plus technical, and overall) crossed with three algorithms (DNN, RF, and LGBM). Across the 25 prediction tasks, the basic-plus-technical feature set was selected 15 times and the basic and overall feature sets 5 times each; by algorithm, LGBM was selected 16 times, DNN 5 times, and RF 4 times. Therefore, the features that should be applied first for stock price prediction are the basic indicators combined with the technical indicators (SMA and WMA over 5, 10, 20, 60, and 120 days; RSI; Stochastic %K; Stochastic %D; MACD; signal line; and histogram). The basic-plus-technical feature set outperforms both the basic indicators alone (the fewest indicators) and the overall set of basic, technical, and macroeconomic indicators (the most indicators) because indices highly correlated (over 0.995) with the T+1 to T+5 targets, such as the KOSPI and S&P, are already included among the basic indicators; thus, it was concluded that adding technical indicators yields the best performance. In addition, LightGBM is proposed as the algorithm to apply before RF or DNN for stock price prediction; its structure is considered to minimize overfitting and maximize performance due to the characteristics of the algorithm.

Table 7. The best model for predicting each index.

In previous studies, exchange rates, raw materials, and bonds are mentioned as indicators that affect stock price prediction, and some studies have shown them to be important factors; in our results, however, these indicators were effective only for the Nikkei and Hang Seng indices. In other words, the results of this study provide an important reference for choosing algorithms and input indicators depending on the target index.

5. Conclusion

The COVID-19 pandemic significantly shook stock markets around the world, and the need for more accurate stock price predictions has grown as stock prices undergo major changes. Various studies have reported that adding technical and macroeconomic indicators to the input features leads to substantial improvements in predictive performance. This study combines the effective practices presented in those studies to present a model suitable for predicting the global stock market during and after the COVID-19 pandemic. To this end, models were developed to predict the closing prices of the composite stock indices of five major countries 1, 2, 3, 4, and 5 days ahead, and the results were compared. As a result, models whose input features combine basic and technical indicators were selected most often as optimal, as were models using the LGBM algorithm.

The implications of this study are as follows. First, to predict stock prices, models must be optimized and the best model selected according to the nationality of the index and the forecast horizon, but some input indicators can be used preferentially. In other words, the model using only basic and technical indicators either outperforms the model that additionally incorporates macroeconomic indicators or differs from it only marginally. Therefore, for short-term forecasting, applying a model that uses only basic indicators, or technical indicators together with basic indicators, is more efficient for model creation and operation (the basic/technical feature set was selected as the best model in 15 of 25 cases). Second, the algorithm that can be used preferentially is LightGBM (models using LGBM were selected as the best model in 16 of 25 cases). A variant of this study using multiple sliding windows has also been presented (24). For model optimization, it is necessary to identify and compare the performance of various algorithms, which demands substantial system resources and time; the optimization method and results presented in this study thus have implications for implementing an efficient stock price prediction system.

6. Limitations

In this study, various economic indicators and algorithms were used, and the best models were selected and compared using optimization techniques. However, macroeconomic indicators such as interest rates and price indices are also suggested to be important factors in the stock market. This study has the following limitations. First, because it relies on the data provided on finance.yahoo.com, interest rates, price indices, real estate price indices, and similar variables were not considered, making it difficult to generalize the model's use of macroeconomic input features. Second, the study concentrated on a few representative algorithms, so the comparison results cannot be generalized across all algorithms. Third, the models were built for short-term prediction, so no direction could be presented for mid- to long-term stock price prediction. Finally, the various prediction techniques used for time series forecasting, such as sliding-window and multistep methods, could not all be compared and analyzed, which also limits generalization. Future research along these lines is expected to yield more generalizable findings.

Data availability statements

Data derived from public domain resources.

Conflict of Interest

The authors have no relevant financial or non-financial interests to disclose. No funds, grants, or other support were received during the preparation of this manuscript.

References

1. Korea Capital Market Institute. Comparison of Global Stock Market Performance After the Spread of COVID-19. (2022). Available online at: https://www.kcmi.re.kr/publications/pub_detail_view?syear=2020&zcd=002001016&zno=1551&cno=5537 (accessed December 11, 2022).

2. Choi J, Lee O. Correlation analysis among the price of apartments in Seoul, stock market and main economic indicators. J Digital Convergence. (2014) 12:45–59.

3. Neifar M, Dhouib S, Bouhamed J, Abdallah B, Islem A, Braiek B et al. The Impact of Macroeconomic Variables on Stock Market in United Kingdom. Munich Personal RePEc Archive (2021). Available online at: https://mpra.ub.uni-muenchen.de/106246/ (accessed February 25, 2021).

4. Phoon K, Koh F. Robo-advisors and wealth management. J Alternative Investments. (2017) 20:79–94.

5. Chong E, Han C, Park F. Deep learning networks for stock market analysis and prediction: Methodology, data representations, and case studies. Exp Syst Appl. (2017) 83:187–205.

6. Song Y, Lee J, Lee J. A study on novel filtering and relationship between input-features and target-vectors in a deep learning model for stock price prediction. Appl Intell. (2019) 49:897–911.

7. Soni P, Tewari Y, Krishnan D. Machine learning approaches in stock price prediction: A systematic review. in Proceedings of the Journal of Physics: Conference Series. IOP Publishing (2022). 012065

8. Sadorsky P. A random forests approach to predicting clean energy stock prices. J Risk Financial Manag. (2021) 14:48.

9. Zhang C, Bai Y. Chinese a share stock ranking with machine learning approach. in Proceedings of the 2019 6th International Conference on Information Science and Control Engineering (ICISCE). Piscataway, NJ: IEEE (2019). 195–9.

10. Sharma N, Juneja A. Combining of random forest estimates using LSboost for stock market index prediction. in Proceedings of the 2017 2nd International Conference for Convergence in Technology (I2CT). Piscataway, NJ: IEEE (2017). p. 1199–202.

11. Bucci A. Realized volatility forecasting with neural networks. J Financial Econometrics. (2020) 18:502–31.

12. Guo Y, Li Y, Xu Y. Study on the application of LSTM-LightGBM Model in stock rise and fall prediction. in Proceedings of the MATEC Web of Conferences, MATEC Web Conf. (Vol 336), EDP Sciences (2021). 05011

13. Yong B, Abdul Rahim M, Abdullah A. A stock market trading system using deep neural network. in Proceedings of the Asian Simulation Conference. Berlin: Springer (2017). 356–64.

14. Montenegro C, Molina M. A DNN approach to improving the short-term investment criteria for S&P500 index stock market. in Proceedings of the 2019 3rd International Conference on E-commerce, E-Business and E-Government. ACM (2019). p. 100–4.

15. Wu K, Wu Y, Lee H. Stock trend prediction by using k-means and AprioriAll algorithm for sequential chart pattern mining. J Inf Sci Eng. (2014) 30:653–67.

16. Wen Q, Sun L, Yang F, Song X. Time series data augmentation for deep learning: A survey. arXiv [preprint] (2020): doi: 10.24963/ijcai.2021/631

17. Kim J, Kim J. Study on interrelation between stock price and macroeconomic variables. J CEO Manag Stud. (2014) 17:163–86.

18. Breiman L. Random forests. Mach Learn. (2001) 45:5–32.

19. Yang B, Di X, Han T. Random forests classifier for machine fault diagnosis. J Mech Sci Technol. (2008) 22:1716–25.

20. James G, Witten D, Hastie T. An Introduction to Statistical Learning. Berlin: Springer (2013).

21. Wang D, Zhang Y, Zhao Y. LightGBM: An effective miRNA classification method in breast cancer patients. in Proceedings of the 2017 International Conference on Computational Biology and Bioinformatics. Piscataway, NJ: IEEE (2017). 7–11.

22. Sun Y, Wang X, Tang X. Deep learning face representation from predicting 10,000 classes. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE (2014). 1891–8.

23. Hinton G, Osindero S, Teh Y. A fast learning algorithm for deep belief nets. Neural Comput. (2006) 18:1527–54.

24. Baek H, Lee E-B. Feature expansion effect approach for improving stock price prediction performance. Comput Econ (2024): 1–26.


© The Author(s). 2024 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.