Can a web-searching index help to predict the renminbi exchange rate?

The stability of exchange rates plays a decisive role in a country's economic development and in its internal and external equilibrium. Predicting the short-term exchange rate is therefore vital for maintaining a country's economic stability and financial security. Traditional time series forecasting models rely only on historical data and cannot reflect other important factors, such as investors' exchange rate expectations and emotions. This study builds a web-searching index to predict the short-term exchange rate, based on web search data, the Chinese word segmentation tool of the natural language processing and information retrieval sharing platform (NLPIR), and the TextRank keyword extraction system. The study then establishes a Conditional Autoregressive (CAR) model integrated with the web-searching index, so that investors' expectations and emotions enter the prediction and improve its accuracy. The results show that the CAR model based on search data is significantly more accurate than the other models. Compared with traditional exchange rate prediction models, the integrated CAR model also has a better fit and a lower prediction error.


INTRODUCTION
Since the exchange rate reform on 21 July 2005, China has operated a managed floating exchange rate regime that tracks a basket of currencies. From July 2005 to December 2014, the nominal exchange rate of the Chinese yuan, or renminbi (CNY), appreciated cumulatively by 34%, and the CNY real exchange rate appreciated cumulatively by 46.21%, which illustrates that the CNY exchange rate regime has become more flexible and market-oriented than before (Data source: Wind Database). The continuous appreciation and greater volatility of the CNY not only lead to a relative devaluation of foreign exchange reserves, but also increase the exchange rate risk of financial and trading activities that are not CNY denominated. Predicting the short-term exchange rate is therefore vital for managing exchange rate risk. At the same time, as the CNY exchange rate formation mechanism has become more market-oriented, exchange rate forecasting has become significantly more difficult. It is increasingly relevant to explore a precise and effective approach to predicting the short-term exchange rate.
Previous forecasting methods generally relied on time series models to forecast short-term exchange rates, such as the autoregressive (AR), autoregressive integrated moving average (ARIMA) and generalized autoregressive conditional heteroskedasticity (GARCH) models. More recent approaches employ neural networks and chaos theory. However, these models assume that market transactions already contain all the relevant information, which would make accurate prediction possible from historical data alone. In reality, price itself cannot contain all the information, especially in developing countries. Investors' expectations and emotions are not reflected promptly in historical price data. Therefore, to obtain an accurate prediction from time series models, expectations and emotions have to be taken into consideration.
With the advent of the internet era, it has become possible to infer people's expectations and anticipated trading behaviors by analyzing data from the internet. When the psychological expectations of the two trading parties change, the change is reflected in their online behavior. Expectations can appear in multiple ways, such as comments written by economists, news and financial columns related to exchange rate fluctuations, "weibo" messages and forums. Through the web-searching index, we can capture changes in expectations and emotions, make preliminary judgments about exchange rate fluctuations, and thereby improve prediction accuracy. In 2015, the number of internet users exceeded 668 million (Data source: China Internet Network Information Center statistics). The behavior of such a huge number of internet users has produced big data of unprecedented scale, covering their hobbies, expectations, concerns and doubts. These data can help to predict the exchange rate more accurately.
Against this big data background, this paper constructs a web-searching index to predict the exchange rate. First, we use web-searching data, the NLPIR Chinese segmentation technique and the TextRank keyword extraction system to obtain the keywords that are highly related to the CNY exchange rate. Then, the study builds a keyword web-searching index system and develops a qualitative measure of internet users' expectations and judgments from big data. Afterwards, the study constructs a comprehensive web-searching index based on the keyword web-searching indexes. Finally, the study employs a Conditional Autoregressive (CAR) model integrated with the web-searching index to forecast the short-term CNY exchange rate. The results show that the CAR model integrated with the index has a lower prediction error than traditional exchange rate forecasting models.

Literature related to exchange rate prediction
Fluctuations of the exchange rate play a decisive role in a country's economic development and in its internal and external equilibrium. Researchers have therefore paid a lot of attention to predicting the exchange rate accurately. This study classifies the research into three strands. The first strand is exchange rate forecasting based on macroeconomic theory. However, this method cannot explain exchange rate volatility, the amplification of forex trading volume, or the divergence of market exchange rates from equilibrium exchange rates, all of which are common phenomena in markets. The second strand is exchange rate forecasting based on market microstructure. In response to the problems of the first method, scholars established financial market microstructure theory. The theory points out that information is private and that it is the key to price formation (Lyons, 2001). Forex traders who possess different information tend to act strategically against one another while trading, which leads to exchange rate fluctuations. The third strand is time series forecasting based on historical data. After years of development, this method has yielded fruitful results. Box et al. (1994) put forward the well-known autoregressive moving average (ARMA) forecasting model. Based on previous research, Engle et al. (1982) improved the model and created the autoregressive conditional heteroskedasticity (ARCH) model. Building on Engle's study, Bollerslev (1986) further considered likelihood estimation and testing and established the generalized autoregressive conditional heteroskedasticity (GARCH) model. These forecasting models have had a profound impact on exchange rate prediction research.
With the development of exchange rate prediction models, more and more scholars use time series models, such as AR, ARIMA, ARCH and GARCH, to predict short-term CNY exchange rates. Compared with earlier prediction models, time series models make better use of historical data and can provide accurate short-term forecasts. However, one assumption of the time series method is that market transactions contain all the relevant information, and that traders make decisions based on this information. In fact, price itself does not contain all the information, especially in developing countries. Traders' expectations are not reflected promptly in the price. Therefore, traditional time series models, which are based only on historical data, cannot reflect factors affecting exchange rates such as traders' expectations and emotions, and their forecasting accuracy is not high enough.

Literature related to web-searching system
With the development of the internet, information that is not covered by market transactions can be reflected in web big data. Some web-searching platforms have launched keyword-based web-searching index queries, which provide an approach for scholars to use web big data.
The keyword web-searching index, which is based on searching platforms, has many advantages. For example, it contains a large volume of information, and it covers a wide field and multiple samples. Forecasting with a web-searching index therefore has rich data advantages and tends to be more accurate. The reason is that people use search engines to look up information that satisfies their needs (Wilson, 2000), and this behavior is a response to changes in the environment (Moe and Fader, 2004). Internet users' keyword searches therefore inevitably have close and natural links to external conditions and environmental changes (Sun and Lv, 2011). That is why many scholars use web-searching data to study and forecast economic issues.
Forecasting based on web-searching was first applied to health issues. Scientists discovered that the search index and the incidence of influenza have a long-term stable relationship. They used the search index to predict the incidence of influenza and influenza mortality in advance (Ginsberg et al., 2009). From an index study based on Google keywords, they found that the web-searching indexes of some keywords have a long-term positive correlation with influenza incidence. In other studies, scientists also built a prediction model that predicted the flu outbreak trend two weeks in advance. Scholars gradually discovered that combining the traditional forecasting model with a keyword index gave better results than the traditional forecasting model alone.
In macroeconomic studies, researchers found that monitoring and early warning based on web-searching data give better predictions for indicators such as the consumer price index (CPI), gross domestic product (GDP), the unemployment rate, the private consumption index and the consumer confidence index. Many studies show that the precision of a prediction model integrated with a web-searching index is significantly higher than that of traditional models (Francesco et al., 2012; Askitas et al., 2009; Vosen and Schmidt, 2011; Nicolás et al., 2009; Kholodilin, Konstantin, 2010). The web-searching index is also commonly used for prediction in micro-markets. Preis et al. (2013) found that the web-searching index is related to quantified trading behaviors in stock markets. Moreover, by using the web-searching index to grasp investors' preferences and behavioral characteristics, stock market performance can be predicted effectively. Meanwhile, making full use of the web-searching index to build investment strategies in stock markets can reduce risk and yield stable profits (Kristoufek, 2013). With the internet search index, we can forecast not only economic issues but also social problems, because the search index provides extensive behavioral data sets that help us better understand human behavior, emotion and expectation (Preis et al., 2013).
The internet search engine is the most popular tool for finding information. It is also the channel and carrier through which users obtain news. It has evolved into a central source of information that is closely linked to people's day-to-day decisions, and internet search data therefore reflect the behavior and patterns hidden behind people's search actions (Curme et al., 2014). This study integrates the web-searching index into time series models to forecast the short-term CNY exchange rate, and is expected to achieve better results than the traditional models.

Prediction mechanism
Traditional time series models rely only on historical data for forecasting. For instance, the traditional ARIMA model, and even the improved ARIMA models, are autoregressive moving averages of a single variable, under the assumption that market transactions already contain all the relevant information, including traders' expectations. Hence, some believe that traditional time series models can make accurate predictions from historical data. In reality, however, price itself does not contain all the information, especially in developing countries. The expectations of buyers and sellers are not reflected promptly in market prices. In this situation, traditional time series models cannot reach high precision, because they do not contain all the information, such as traders' emotions and their expectations of exchange rates.
Combining traditional time series models with the web-searching index would solve the problem caused by insufficient information. Web-based big data contain a great deal of news, including information on traders' behavior and psychology, which is reflected in current exchange rates but not contained in the predictions of traditional time series models. Therefore, the study compiles the network index as a proxy for web information, which can characterize factors that influence exchange rates such as psychological expectations and traders' emotions. This study constructs a comprehensive web-searching index and integrates it into the CAR model to make a short-term prediction of exchange rates. In Figure 1, the right side illustrates the logic of the traditional time series model, and the left side presents the integration of web-searching data. The study also extracts the keywords with the NLPIR system and the TextRank algorithm, and constructs a comprehensive web-searching index to give a short-term prediction of exchange rates.
Figure 2 presents the process through which the web-searching index influences exchange rates. The literature shows that exchange rate fluctuations are mainly influenced by three factors: macroeconomic factors such as monetary policy and inflation, political factors such as trade disputes and political games, and microeconomic factors such as hot money flows. The exchange rate fluctuations caused by these factors change the psychological expectations of the two trading parties, and the change is reflected in their online behavior. These expectations appear in comments written by economists, news and finance columns related to exchange rate fluctuations, weibo messages and forums. By capturing the changes in expectations and emotions through the web-searching index, we can anticipate traders' next steps, make a preliminary judgment about exchange rate fluctuations, and thereby improve prediction accuracy. For this reason, this study integrates the web-searching index, which contains a large amount of information, with the traditional time series model. Unlike the old model, which uses a single argument (historical prices), the new model accepts more diverse arguments. In other words, the traditional time series model integrated with the web-searching index becomes a multivariate time series model. There are many methods for building multivariate time series models, for example the continuous-time autoregressive moving-average (CARMA) model and the recursively identified multivariable CAR model. Because modeling with CARMA is complex and inefficient, this study chose the CAR model (Bollerslev, 1986). The model is:

$$y_t = a_1 y_{t-1} + a_2 y_{t-2} + \cdots + a_n y_{t-n} + b_0 \mathrm{Index}_t + b_1 \mathrm{Index}_{t-1} + b_2 \mathrm{Index}_{t-2} + \cdots + b_n \mathrm{Index}_{t-n} + u_t$$

The construction of comprehensive web-searching index
There are five steps to construct the comprehensive web-searching index: construction of a corpus containing a large amount of text related to exchange rates; segmentation of Chinese words; keyword screening based on the TextRank algorithm; obtaining the keyword web-searching indexes; and construction of the comprehensive web-searching index.

The construction of corpus
This study focuses on the CNY to US dollar exchange rate. A large amount of text was collected from financial reports, financial columns, forums and the weibo posts of financial commentators. The time span is from 2011 to 2014. After filtering and duplicate removal, the corpus contained tens of thousands of words.

Segmentation of Chinese words
Then, the study performed semantic segmentation of the corpus with the NLPIR Chinese word segmentation system. Chinese segmentation uses computer programs to split large amounts of text into separate words based on Chinese semantics and tones. For instance, segmenting the sentence "The exchange rate of the CNY to US dollars has appreciated" yields words such as "CNY", "US dollars" and "appreciated".
The Chinese segmentation system NLPIR (formerly ICTCLAS) was designed in 2000 by Dr. Zhang Huaping of the department of computer science and technology at the Chinese Academy of Sciences. After 14 years of development and improvement, the system is used by more than 300 thousand people. It was awarded first prize in the Qian Weichang Chinese Information Processing Science and Technology Award in 2010 and first prize in the SIGHAN segmentation competition in 2003. The 2014 edition of NLPIR can recognize English word prototypes automatically, tag parts of speech, and identify named entities and keywords. Besides, based on Chinese lexical analysis, the system can perform a complete semantic analysis of documents, including names, places, institution names, article authors, publishing media, keywords and abstracts. Therefore, this study used the open-source NLPIR Chinese segmentation system for word splitting.

Keywords screening based on TextRank algorithm
Mihalcea (2004) proposed the TextRank model, which first applied the PageRank algorithm to keyword screening. Using the TextRank algorithm to screen keywords has several advantages. First, TextRank does not need a large corpus for training, which saves time and cost. Second, TextRank learns in an unsupervised way, making it applicable to text of any content, subject and length. Third, TextRank converges quickly, so the matrix computation is fast and gives good results (Liang, 2010).
Note 1: The introduction to the NLPIR Chinese segmentation system is taken from "NLPIR natural language processing and information retrieval platform", http://ictclas.nlpir.org/
Note 2: The idea of the TextRank model comes from PageRank. It treats any text as a complex network formed by groups of words. The central nodes of the network are the keywords we are looking for. Nodes adjacent to a central node may still be keyword candidates, even if they appear with very low frequency.
Through segmentation, the study decomposed the corpus into a preliminary vocabulary. The TextRank algorithm was then used to screen, from this large vocabulary, the keywords that are closely related to exchange rate fluctuations. This yielded 281 keywords, including "CNY exchange rates", "investments", "capital flows", "trading" and "US dollars". After removing the keywords without a web-searching index, 88 effective keywords remained. These keywords include not only words closely related to the CNY to US dollar exchange rate, such as "the CNY", "US dollars" and "exchange rates", but also seemingly unrelated words such as "markets", "programs", "space", "stocks", "data" and "processing". However, these seemingly unrelated words were selected by statistical analysis and may have some hidden relationship with exchange rates, so they were kept in this research.
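The keyword screening step can be illustrated with a minimal sketch of a TextRank-style ranking. It assumes the text has already been segmented into a list of tokens. The function name `textrank_keywords`, the co-occurrence window, the damping factor of 0.85 and the fixed number of iterations are illustrative choices, not settings reported in this study.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=5):
    """Rank words with a PageRank-style score on a word co-occurrence graph."""
    # Link every pair of distinct words that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)

    # Iterate the PageRank update on the undirected, unweighted graph.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            rank = sum(scores[v] / len(neighbors[v]) for v in neighbors[word])
            updated[word] = (1 - damping) + damping * rank
        scores = updated

    # Highest score first; ties are broken alphabetically for reproducibility.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

In this toy form, a word that co-occurs with many different words receives a high score even if no training corpus is available, which is the property the text attributes to TextRank.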

Obtaining the web-searching index of keywords
The study obtained the keyword web-searching indexes through a search engine. In past research, scholars have usually used the Google search engine, while Chinese scholars have used the Baidu search engine or the Sina weibo micro-index. However, each of these has its own disadvantages. The Google search engine is sometimes blocked in China, and few people use it, so the Google search index differs significantly from economic reality. The Baidu search index is not openly available, so the study cannot obtain the precise daily web-searching index needed for the model. Sina weibo provides a micro-index for weibo content and keywords, but it reflects only weibo users' degree of concern about the exchange rate rather than that of the whole internet. Moreover, few weibo users analyze exchange rate fluctuations professionally. Therefore, the study selected the Haosou search engine to obtain the web-searching index of each keyword. Haosou provides an openly available search index and holds a leading position in the search field. In order to acquire as much data as possible, the study downloaded the web-searching index for the 88 keywords from 1 November 2013, obtaining 66 thousand data points in total.

The construction of comprehensive web-searching index
The study calculated the Pearson correlation coefficient between each keyword web-searching index and the CNY exchange rate, and selected the highly correlated keywords. The threshold was set with care. If the threshold is too high, keywords that are important to exchange rates are lost. If the threshold is too low, many noisy keywords are included. Ginsberg was among the earliest researchers on web-searching indexes. In his 2009 paper he chose the threshold as follows. He added keywords to the comprehensive web-searching index one by one, observing the correlation coefficient between the comprehensive index and the target benchmark. When adding a new keyword no longer raised the correlation coefficient, he took that number of keywords as the cut-off. Scholars usually apply this method to determine the threshold. This study follows the same rule. As the threshold rose from 0.1 to 0.4, the correlation between the search index series and the target series grew substantially, reaching 0.84. With the threshold set to 0.5 the correlation coefficient was 0.86, and it declined to 0.85 with the threshold set to 0.6, which suggests that 0.5 is the best threshold value. However, the correlations of a large number of keywords are concentrated in the range 0.4 to 0.6, so raising the threshold from 0.4 to 0.5 would discard many keywords and reduce prediction accuracy.
In this study, the threshold of the Pearson correlation coefficient is therefore set to 0.4, which gives 40 keywords in total. Some of them are listed in Table 1. In the experiment, the keyword web-searching indexes show multiple outliers and violent volatility. The study therefore combines the individual keyword indexes into a single robust index, in order to exhibit a clear trend in the web big data and to include traders' behavior in the prediction model. The comprehensive index is calculated as a weighted average of the keyword web-searching indexes.
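The threshold scan described above can be sketched as follows. This is only an illustration under stated assumptions: `keyword_series` is a hypothetical dictionary mapping each keyword to its daily index, the composite at each threshold is a simple average, and only positively correlated keywords qualify. The study does not state how negatively correlated keywords were handled.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / sqrt(var_x * var_y)


def scan_thresholds(keyword_series, target, thresholds):
    """For each threshold, report (keywords kept, correlation of composite)."""
    corr = {k: pearson(s, target) for k, s in keyword_series.items()}
    results = {}
    for th in thresholds:
        # Assumption: keep keywords whose correlation is at least the threshold.
        kept = [k for k, r in corr.items() if r >= th]
        if not kept:
            continue
        composite = [
            sum(keyword_series[k][t] for k in kept) / len(kept)
            for t in range(len(target))
        ]
        results[th] = (len(kept), pearson(composite, target))
    return results
```

Reading the result for thresholds such as 0.1 to 0.6 shows the trade-off discussed in the text: a higher threshold may raise the correlation of the composite slightly while sharply reducing the number of keywords retained.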
To calculate the comprehensive web-searching index, the weight of each keyword must first be determined.
In related research, scholars usually apply a weighted average or an arithmetic average to construct the index, choosing between them mainly by prediction accuracy. In this research, the weighted average with each keyword's correlation coefficient as its weight was found to produce better predictions. This study therefore adopts this method and takes the correlation coefficient of each web-search keyword as its weight. In the index formula, a_n denotes the weight of each keyword in the web-searching index.
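A minimal sketch of the correlation-weighted average is given below. It assumes that the weights are normalized by their sum, which is one natural reading of the description but is not confirmed by the text. The function name `composite_index` and the input layout are illustrative.

```python
def composite_index(keyword_series, weights):
    """Weighted average of keyword indexes for each day.

    keyword_series: dict mapping keyword -> list of daily index values.
    weights: dict mapping keyword -> weight a_n (here, its correlation
             coefficient with the exchange rate).
    Assumption: weights are normalized by their sum.
    """
    total = sum(weights.values())
    length = len(next(iter(keyword_series.values())))
    return [
        sum(weights[k] * keyword_series[k][t] for k in weights) / total
        for t in range(length)
    ]
```

Because each keyword is scaled by its correlation coefficient, a keyword with correlation 0.8 contributes twice as much to the comprehensive index as one with correlation 0.4.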

Data and stationary test
In this study, the daily CNY to US dollar central parity rate was selected as the proxy for the exchange rate. For non-trading days, an arithmetic average price was used. The traditional time series model requires the input data to be stationary. Therefore, unit root tests were performed on the CNY exchange rate and the comprehensive web-searching index. The ADF test was employed, and the result is shown in Table 2. Both the CNY exchange rate and the comprehensive web-searching index are non-stationary. The ADF test on the logarithms of the two variables rejects the null hypothesis at the 1% significance level, which illustrates that the two variables are first-order integrated series. By the ADF test, the logarithms of the variables y_t and index_t were found to be stationary. Hence, log_y_t and log_index_t are used to construct the models instead of y_t and index_t.
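The logic of a unit root test can be illustrated with a simplified sketch. The code below computes the basic Dickey-Fuller statistic with a constant and no lagged differences, which is a reduced form of the ADF test used in the study. The function names are illustrative. The statistic is compared with Dickey-Fuller critical values (about -3.43 at 1% and -2.86 at 5% for large samples with a constant), not with ordinary t-tables.

```python
from math import sqrt


def first_difference(series):
    """Return the series of changes y_t - y_{t-1}."""
    return [b - a for a, b in zip(series[:-1], series[1:])]


def dickey_fuller_stat(series):
    """t-statistic of rho in  dy_t = c + rho * y_{t-1} + e_t."""
    lagged = series[:-1]
    diff = first_difference(series)
    n = len(lagged)
    mean_x = sum(lagged) / n
    mean_d = sum(diff) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    sxy = sum((x - mean_x) * (d - mean_d) for x, d in zip(lagged, diff))
    rho = sxy / sxx
    const = mean_d - rho * mean_x
    # Residual variance with two estimated parameters (constant and rho).
    rss = sum((d - const - rho * x) ** 2 for x, d in zip(lagged, diff))
    std_err = sqrt(rss / (n - 2) / sxx)
    return rho / std_err
```

A strongly negative statistic rejects the unit root hypothesis. In practice a full ADF routine with lag selection, as provided by standard econometric software, would be preferred over this reduced form.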
To check whether there is a long-term stable relationship between the variables, the study also conducted a co-integration test. There are two kinds of co-integration test: one is based on regression coefficients, and the other is based on regression residuals. The study applied the Engle-Granger two-step method to test whether there is a long-term co-integration relationship between the CNY exchange rate and the web-searching index.
After the regression, a unit root test was conducted on the regression residuals. If the independent and dependent variables are co-integrated, the residual series should be stationary, and vice versa. The results of the co-integration test illustrate that the CNY exchange rate has a long-term co-integration relationship with the web-searching index, which shows that the web-searching index has a long-term influence on the fluctuations of exchange rates.
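The two steps of the Engle-Granger procedure can be sketched as follows, as an illustration rather than a reproduction of the estimation in this study. Step one regresses one series on the other by least squares. Step two applies a unit root regression without a constant to the residuals. The function names are illustrative, and the resulting statistic must be judged against Engle-Granger critical values, which are more negative than the standard Dickey-Fuller ones.

```python
from math import sqrt


def ols_line(x, y):
    """Intercept and slope of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((u - mean_x) ** 2 for u in x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_unit_root_stat(resid):
    """t-statistic of rho in  d(e_t) = rho * e_{t-1} + v_t  (no constant)."""
    lagged = resid[:-1]
    diff = [b - a for a, b in zip(resid[:-1], resid[1:])]
    sxx = sum(e * e for e in lagged)
    rho = sum(e * d for e, d in zip(lagged, diff)) / sxx
    rss = sum((d - rho * e) ** 2 for e, d in zip(lagged, diff))
    return rho / sqrt(rss / (len(lagged) - 1) / sxx)


def engle_granger(y, x):
    """Step 1: regress y on x. Step 2: unit root statistic of the residuals."""
    intercept, slope = ols_line(x, y)
    resid = [v - intercept - slope * u for u, v in zip(x, y)]
    return (intercept, slope), residual_unit_root_stat(resid)
```

If the two series share a common stochastic trend, the residuals behave like a stationary series and the statistic is strongly negative.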

CAR model integrated with comprehensive web-searching index
As the variables y_t and index_t are not stationary, the study used their logarithms, labeled log_y_t and log_index_t, to construct the models.
There are two approaches to constructing the CAR model: one is recursive least squares estimation, and the other is least squares estimation of a combined regression model. This study used the second method. First, by calculating the Pearson correlation coefficient between the web-searching index at different lag orders and the exchange rate series (Table 3a and b), the study found that the index series leads the exchange rate series by three periods. Therefore, the study set the lead order of the web-searching index log_index_t to three. Meanwhile, the partial autocorrelation graph of the variable log_y_t shows first-order autocorrelation, so the study included log_y_{t-1} in the model (Figure 3a). The autocorrelation graph of the residuals was also analyzed, and the result shows that the residual has eighth-order autocorrelation (Figure 3b). The CAR model is as follows:

$$\text{log\_y}_t = 0.019491 - 0.0000465\,\text{log\_index}_{t-3} + 0.989323\,\text{log\_y}_{t-1} + u_t$$

Here, log_y_t is the logarithm of the daily CNY to US dollar central parity rate, log_index_{t-3} is the web-searching index series lagged by three periods, and u_t is the autoregressive term. The model was estimated with EViews, and the result is shown in Table 3.
The regression result shows that R² is 0.995, which means the model fits the fluctuations of exchange rates well. The DW statistic is equal to 2, indicating that there is no serial correlation. The coefficient of the web-searching index lagged by three periods is significantly negative, illustrating that a higher web-searching index is related to exchange rate appreciation. Figure 4 exhibits the model's fit and residual errors.
The study further examined the regression residual plot together with its autocorrelation coefficients, partial autocorrelation coefficients and Q statistics, which show that the autocorrelation and partial autocorrelation coefficients at each order are not significantly different from zero. In other words, there is no serial autocorrelation in the residual series.
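The structure of the estimated equation, with one lag of the exchange rate and the index lagged by three periods, can be fitted by ordinary least squares. The sketch below is illustrative: it omits the autoregressive treatment of the residual that the EViews estimation includes, and the names `fit_car`, `forecast_next` and `solve_linear` are hypothetical helpers rather than part of the study.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_car(log_y, log_index, index_lag=3):
    """OLS fit of log_y[t] = c + b * log_index[t - index_lag] + a * log_y[t-1].

    Returns the coefficients in the order (c, b, a).
    """
    start = max(index_lag, 1)
    rows = [
        [1.0, log_index[t - index_lag], log_y[t - 1]]
        for t in range(start, len(log_y))
    ]
    target = log_y[start:]
    k = 3
    # Normal equations (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve_linear(xtx, xty)


def forecast_next(log_y, log_index, coef, index_lag=3):
    """One-step-ahead forecast of log_y for the period after the sample."""
    c, b, a = coef
    t = len(log_y)
    return c + b * log_index[t - index_lag] + a * log_y[-1]
```

Because the index enters with a three-period lag, the one-step-ahead forecast needs only index values that are already observed, which is what makes the web-searching index usable as a leading indicator.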

Comparison of different models under normal circumstances
As mentioned previously, the study used the data from 11 January 2013 to 13 January 2015 as training data, and the data from 14 January 2015 to 21 January 2015 as testing data. The study compared the results of the CAR model integrated with the web-searching index with those of other models, including the AR, ARIMA and GARCH models. The comparison of results is listed in Table 4.
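In order to measure prediction precision accurately and objectively, the study applies several error indicators: absolute error, relative error, mean absolute percentage error (MAPE), mean absolute error (MAE) and the Theil inequality coefficient.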

The comparison of different models under sudden fluctuation of exchange rate
In order to verify the accuracy and robustness of the CAR model in predicting exchange rate fluctuations, another seven-day prediction, from 8 August 2015 to 15 August 2015, was carried out. During this period the RMB depreciated against the US dollar, and the depreciation continued for several days. The CAR model can make a very accurate prediction of the exchange rate during such sudden fluctuations, and the major error indicators, which are commonly used to measure precision, are much lower than those of the other models. The study therefore verifies the accuracy and robustness of the CAR model under both normal circumstances and sudden fluctuations.

CONCLUSIONS
The traditional time series models commonly used in prediction ignore the fact that historical prices cannot reflect all the factors that influence exchange rates, and their predictions are therefore not accurate. In order to reduce the prediction errors caused by insufficient information, the study constructs a comprehensive web-searching index to forecast the short-term fluctuations of the CNY to US dollar exchange rate. With the CAR model integrated with the web-searching index, the study includes a large amount of information on traders' behaviors and expectations in the prediction. This is expected to further reduce errors in short-term exchange rate prediction.
In the empirical research, this study employs the CAR model integrated with the web-searching index to predict the short-term CNY exchange rate, and compares the prediction results with those of traditional time series models, including the AR, ARIMA and GARCH models. The MAPE of the new CAR model is 1.32%, while the MAPEs of the ARIMA, AR and GARCH models are 3.11%, 1.92% and 2.19% respectively, which indicates that the prediction accuracy of the new CAR model is significantly higher than that of the other models. It also shows an advantage on the other prediction indicators. The study also made predictions for a period of sudden exchange rate fluctuation (from 8 August 2015 to 15 August 2015), and concludes that the accuracy of the CAR model integrated with the web-searching index is still higher than that of the other models in such situations. This means that the prediction results of the CAR model are robust.
Although the CAR model integrated with the web-searching index can overcome the problem of insufficient information, it still has drawbacks. For example, there are only a few explanatory variables in the model, which makes it difficult to explain the mechanism of prediction. Each prediction model therefore has its own drawbacks.
Further research will try to combine these models into an integrated one, and to enhance forecasting precision by synthesizing different prediction models. Specifically, it would explore new methods of combining models, such as arithmetic averages, weighted averages and principal component analysis, and compare their differences so as to obtain a more accurate combination-forecasting model.

Figure 2. Linkage between web-search index and exchange rate fluctuations.

Figure 3. Graph of autocorrelation and partial autocorrelation.
Note: Absolute error = |predicted value - actual value|; relative error = |predicted value - actual value| / actual value * 100%; MAPE (mean absolute percentage error) = sum[|predicted value - actual value| * 100 / actual value] / sample size; MAE (mean absolute error) = sum[|predicted value - actual value|] / sample size. The Theil inequality coefficient lies between 0 and 1: a coefficient equal to 1 indicates that the model predicts poorly, while a coefficient of 0 indicates that the predicted and actual values match completely, that is, good predictive power.
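The indicators in the note can be computed as in the following sketch. MAPE and MAE follow the definitions given in the note. The note does not state the formula of the Theil inequality coefficient, so the code assumes one common definition: the root mean squared error divided by the sum of the root mean squares of the predicted and actual series, which lies between 0 and 1.

```python
from math import sqrt


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    total = sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def theil_u(actual, predicted):
    """Theil inequality coefficient (assumed definition, bounded by 0 and 1)."""
    n = len(actual)
    rmse = sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)
    scale = sqrt(sum(p * p for p in predicted) / n) + sqrt(
        sum(a * a for a in actual) / n
    )
    return rmse / scale
```

For example, with actual values 100 and 200 and predictions 110 and 190, the absolute errors are both 10, so the MAE is 10 and the MAPE is the average of 10% and 5%, that is 7.5%.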

Figure 5. Search index and prediction value under sudden fluctuation of exchange rate.

Table 1. Correlation coefficients of part of the web-search keywords.

Table 2. ADF test of the variables.

Table 3. Modeling analysis of the CAR model.

Table 4. Different time series models.
The integrated CAR model shows a significant advantage over the other models. In terms of the mean absolute percentage error (MAPE), the other models' values are higher than that of the integrated CAR model.

Table 6. Comparison of prediction accuracy between different models.