In order to answer this question, we have to choose some variables to monitor economic conditions, to seek a proper election date in terms of government. The variables we are going to use are the chain-weighted GDP, which clears the inflation effect from GDP, and CPI.

We will predict for the next 8 quarters the CPI and GDP with the ARIMA method from the fable package. In the dataset, we will use the quarters’ values of related variables between 2010-Q1 and 2021-Q2.

```
library(tidyverse)
library(readxl)
library(fable)
library(tsibble)
library(feasts)
#Building and tidying dataset
df <- read_excel("election.xlsx")
options(digits = 11)#for printing the decimal digits of the gdp variables
df %>%
mutate(cpi=round(as.numeric(cpi),2),) %>%
na.omit() -> df
#Creating tsibble variables for CPI and GDP variables
cpi_ts <- df %>%
select(cpi) %>%
ts(start = 2010,frequency = 4) %>%
as_tsibble()
gdp_ts <- df %>%
select(gdp) %>%
ts(start = 2010, frequency = 4) %>%
as_tsibble()
```

```
#Analyzing the GDP data
gdp_ts %>%
autoplot(color="blue",size=1) +
scale_y_continuous(labels = scales::label_number_si()) + #for SI prefix
xlab("") + ylab("") +
theme_minimal()
```

When we examine the above plot, there is no significant indication of a change of variance. Hence, we won’t take the logarithmic transformation of the data. Besides that, it is clearly seen that there is strong seasonality in the data. Therefore, we will first take a seasonal difference of the data. Although the mean of the data seems non-stationary either, if there is strong seasonality in the data, we would first take a seasonal difference; because, after the seasonal differencing, the mean can also be stabilized.

```
#Seasonally differenced of the data
gdp_ts %>%
gg_tsdisplay(difference(value, 4),
plot_type='partial') +
labs(title="Seasonally differenced", y="")
```

To the plot above, the mean seems to be stabilized; we don’t need to take a further difference. In order to find appropriate the ARIMA model, we first would look at the **ACF **graph. There is a spike at lag 4 which leads to a seasonal **MA(1)** component. On the other hand, based on **PACF**, there is a significant spike at lag 4 which also leads to a seasonal **AR(1)** component.

We can start with the below models according to the ACF/PACF graphs mentioned above. And, we will also add the automatic selection process to find the best results. We will set the ** stepwise** parameters to

```
#Modeling the GDP
fit_gdp <- gdp_ts %>%
model(
arima000011 = ARIMA(value ~ pdq(0,0,0) + PDQ(0,1,1)),
arima000110 = ARIMA(value ~ pdq(0,0,0) + PDQ(1,1,0)),
auto = ARIMA(value, stepwise = FALSE)
)
```

```
#Shows all models detailed
fit_gdp %>%
pivot_longer(everything(),
names_to = "Model name",
values_to = "Orders")
# A mable: 3 x 2
# Key: Model name [3]
# `Model name` Orders
# <chr> <model>
#1 arima000011 <ARIMA(0,0,0)(0,1,1)[4] w/ drift>
#2 arima000110 <ARIMA(0,0,0)(1,1,0)[4] w/ drift>
#3 auto <ARIMA(1,0,0)(0,1,1)[4] w/ drift>
```

```
#Ranks the models by AICc criteria
fit_gdp %>%
glance() %>%
arrange(AICc) %>%
select(.model,AICc)
# A tibble: 3 x 2
# .model AICc
# <chr> <dbl>
#1 auto 1518.
#2 arima000011 1520.
#3 arima000110 1524.
```

If we analyze the above results, we would conclude that the auto model is better than the others by AICc results. The model consists of non-seasonal **AR(1)** and seasonal **MA(1)** components in seasonally differenced data which is very similar to our guess based on **ACF**. **[4]** indicates the quarterly time series. Drift indicates the intercept in the model formula.

After finding our model, it would be better to check the residuals to see that if there is any correlation in the data.

```
#Checks the residuals
fit_gdp %>%
select(auto) %>%
gg_tsresiduals(lag=12)
```

We clearly see that there is no spike within threshold limits, which means that the residuals are white noise by the ACF graph above. Now, we can pass on to calculating forecasts with certainty.

```
#Forecasting GDP
(fit_gdp %>%
forecast(h=8) %>%
filter(.model=="auto") %>%
as.data.frame() %>% #to use the select function below, we convert data to a data frame
select(index,.mean) -> f_gdp)
# index .mean
#1 2021 Q3 521226607.35
#2 2021 Q4 526962302.61
#3 2022 Q1 454107505.86
#4 2022 Q2 476945092.38
#5 2022 Q3 536502403.29
#6 2022 Q4 544456726.15
#7 2023 Q1 472308990.28
#8 2023 Q2 495371912.09
```

Now, we will do the same steps for the CPI data.

```
#Analyzing the CPI data
cpi_ts %>%
autoplot(color="red",size=1) +
scale_y_continuous(labels = scales::label_number_si()) + #for SI prefix
xlab("") + ylab("") +
theme_minimal()
```

It is seen a very strong uptrend in the data while no seasonality is detected on the graph. Therefore, we will do the first difference to stabilize the mean. But, because of the exponential uptrend, we would better do log-transforming to stabilize the variance, as well.

```
#The first difference of the log-transformed data
cpi_ts %>%
gg_tsdisplay(difference(log(value)),
plot_type='partial') +
labs(title="", y="")
```

The data seem to be stationary in terms of the mean. As we analyzed the **ACF/PACF** plots, we would see that we have seasonal **AR(1)** and seasonal **MA(1)** components; because both plots have spikes at seasonal lags(**4**).

```
#Modeling the CPI
fit_cpi <- cpi_ts %>%
model(
arima010001 = ARIMA(log(value) ~ pdq(0,1,0) + PDQ(0,0,1)),
arima010100 = ARIMA(log(value) ~ pdq(0,1,0) + PDQ(1,0,0)),
auto = ARIMA(log(value), stepwise = FALSE)
)
#Shows all CPI models detailed
fit_cpi %>%
pivot_longer(everything(),
names_to = "Model name",
values_to = "Orders")
# A mable: 3 x 2
# Key: Model name [3]
# `Model name` Orders
# <chr> <model>
#1 arima010001 <ARIMA(0,1,0)(0,0,1)[4] w/ drift>
#2 arima010100 <ARIMA(0,1,0)(1,0,0)[4] w/ drift>
#3 auto <ARIMA(0,2,1)(1,0,0)[4]>
#Ranks the models by AICc criteria
fit_cpi %>%
glance() %>%
arrange(AICc) %>%
select(.model,AICc)
# A tibble: 3 x 2
# .model AICc
# <chr> <dbl>
#1 arima010100 -246.
#2 arima010001 -245.
#3 auto -240.
```

To the **AICc** results, the best model has seasonal **AR(1)** part with the first difference. In the last step for the fitness of the model, we will check for the residuals.

```
#Checks the residuals
fit_cpi %>%
select(arima010100) %>%
gg_tsresiduals(lag=12)
```

The residuals are within thresholds of the ACF plot, which means no remained information in the data. We can safely forecast with our model.

```
#Forecasting CPI
(fit_cpi %>%
forecast(h=8) %>%
filter(.model=="arima010100") %>%
as.data.frame() %>% #to use the select function below, we convert data to a data frame
select(index,.mean) -> f_cpi)
# index .mean
#1 2021 Q3 552.9687
#2 2021 Q4 572.1427
#3 2022 Q1 590.3115
#4 2022 Q2 608.4114
#5 2022 Q3 624.0214
#6 2022 Q4 642.0021
#7 2023 Q1 659.8191
#8 2023 Q2 677.8676
```

Now, we can investigate the convenient time for the election in terms of ruling parties. In order to do that, a graph might be useful. But first of all, we need a decisive variable. The rate of GDP per CPI would be an effective indicator for people’s perception of the country’s situation because it is related to people’s consumer level and the country’s growth rate either.

```
#Finding the proper time on the plot for the election
tibble(
ratio=f_gdp$.mean/f_cpi$.mean,
date=f_gdp$index
) %>%
ggplot(aes(x=date))+
geom_line(aes(y=ratio),
size=1,
color="green")+
geom_point(aes(x=date[5],y=ratio[5]),#red point
color="red",
size=2)+
ylab("GDP/CPI")+
xlab("")+
theme_minimal()+
theme(axis.title.y = element_text(margin = margin(r=0.5, unit = "cm")))#moves away from the tick numbers
```

When we examine the above plot, a significant decrease is observed from the third quarter of this year to the end first quarter of the next year. It is too risky for an election, but after the first quarter of 2022, an increasing trend is seen. But, from the third quarter of 2022, there is again a big decrease until the quarter, in which officially announced election date. The ideal election calendar seems to be the third quarter of 2022 for the ruling parties.

]]>The countries that we will compare are Greece, Italy, and Turkey, which are all in the same Mediterranean climate zone and had brutal damage of fires. We will compare the hectares(ha) of damaged areas and the number of fires. The data set has annual observations between the 2009-2021 periods. Of course, the data for this year had to be taken until 20 august based on the EFFIS database; but it doesn’t change the underlying concept of this article.

```
library(tidyverse)
library(tsfknn)
library(forecast)
df <- read_excel("fires_statistics.xlsx")
```

For comparison purposes, we will use the second y-axis in our plot. We have to use transformation rate to create a second axis because the ** sec.axis** function, which we are going to use for building the second y-axis, can not build an independent y-axis; it creates related to the main y-axis via some mathematical transformations. The transformation rate could be taken as approximate

```
fun <- function(country="Greece"){
df %>%
.[.$country==country,] %>%
ggplot(aes(x=year))+
geom_bar(aes(y=burnt_areas_ha),stat = "identity",fill="orange")+
#Bar labels(burnt_areas_ha)
geom_text(aes(y=burnt_areas_ha,label=burnt_areas_ha),vjust=-0.2,color="orange")+
#multiplied by transformation rate to tune in with the main y-axis(Burnt Areas)
geom_line(aes(y=fires_number*100),size=2,color="lightblue")+
#Line labels(fires_number)
geom_text(aes(y=fires_number,label=fires_number),vjust=1,color="lightblue")+
scale_x_continuous(breaks = seq(2009,2021,1))+
scale_y_continuous(
#for the main y-axis
labels = scales::label_number_si(),
breaks = scales::pretty_breaks(n=7),
#for the second y-axis
sec.axis = sec_axis(~./100,#divided by transformation rate, in order to be represented based on the first y-axis
name = "Number of Fires",
breaks = scales::pretty_breaks(7)),
)+
xlab("")+ylab("Burnt Areas(ha)")+
ggtitle(paste("The comparison of the volume(ha) and numbers of Forest Fires in",
country))+
theme_minimal()+
theme(
#Main y-axis
axis.title.y = element_text(color = "orange",
#puts distance with the main axis labels
margin = margin(r=0.5,unit = "cm"),
hjust = 1),#sets the main axis title top left
#Second y-axis
axis.title.y.right = element_text(color = "lightblue",
#puts distance with the second axis labels
margin = margin(l=0.5,unit = "cm"),
hjust = 0.01),#sets the second axis title bottom right
#adjusts the date labels to avoid overlap line numbers
axis.text.x = element_text(angle = 30,
vjust = 1,
hjust = 1),
panel.grid.minor = element_blank(),#removes the minor grid of panel
plot.title=element_text(face = "bold.italic",hjust = 0.5)
)
}
```

```
fun("Turkey")
```

```
fun("Italy")
```

When we take a look above plots, the damaged areas have been quite increased compared to the past years for all countries. Probably the underlying reason, that confused Turkish people was the high ratio of the damaged areas to the number of fires. The other two countries also seem to have the same problem, compared to the past years. Undoubtedly, the reason beyond that is global warming. Of course, the insufficient treats of the governments to the fires had also played a part in the issue.

Now let’s forecast the number of fires this year for determining if there is an anomaly in Turkey. Because our data set is too small, it would be convenient to choose the k-nearest neighbor(KNN) algorithm. We will set ** k **to 3 and 4 for optimization because the squared root of the number of observations is approximately 3,6.

Since the PACF order of the pattern are zero, we set the ** lags** parameter as 1 to 5. In order to determine the order of PACF of the model, we first check the stationary of the data.

```
#Checking the pattern if it is stationary
df %>%
filter(country=="Turkey") %>%
ggplot(aes(x=year,y=fires_number))+
geom_line(size=1)+
xlab("")+ylab("")+
theme_minimal()
```

There seems to be exponential growth in the above graph. Hence, we will take the first difference of the data to stabilize the mean.

```
#Making the time series stationary and determining the order of ACF and PACF
df %>%
filter(country=="Turkey") %>%
select(fires_number) %>%
pull() %>% #converts the dataframe column to a vector
ts(start = 2009,frequency = 1) %>% #converts vector to a time series
diff() %>% #gets first difference of the time series
ggtsdisplay()# gets ACF/PACF plots
```

It seems the mean is stabilized, so we don’t have to take a second difference. To the PACF plot, there is no significant spike beyond the limits; that was why we took the order of PACF as zero.

```
#Forecasting with KNN
df %>%
filter(country=="Turkey") %>%
select(fires_number) %>%
pull() %>% #converts the dataframe column to a vector
head(-1) %>% #removes the last observation for prediction
ts(start = 2009,frequency = 1) %>% #converts the vector to a time series
knn_forecasting(h=1,lags = 1:5, k=c(3,4)) %>%
#.$pred (gets prediction value for 2021, which is 424.575)
plot()
#actual value(the last observation)
points(x = 2021,
y = last(df$fires_number[df$country=="Turkey"]),
pch = 16,
col="blue")
legend("topleft",
legend = c("Predicted Value","Actual Value"),
col = c("red","blue"),
pch = c(16,16)
)
```

When we analyze the plot, we see that the predicted value for this year is quite higher than the current value, which shows the speculation about the number of fires in Turkey seems to be baseless.

]]>The dataset we’re going to use for this article is from Our World in Data. What we want to do here is that we will determine the partial and total vaccination rates as a percentage relative to the relevant population. And we will examine the relationship between the number of cases and deaths given in certain monthly periods and the vaccination rates.

```
library(readxl)
library(dplyr)
library(lubridate)
library(rlang)
library(scales)
library(tidyr)
library(stringr)
library(purrr)
library(plotly)
df_raw <- read_excel("covid19_vacc.xlsx")
cement <- function(...){
args <- ensyms(...)
paste(purrr::map(args,as_string))
}
```

The function we’ve built called ** cement** is taken from Advanced R book. The purpose of the function is that saves us from writing our arguments in string form, which sometimes can be quite annoying; with this function, we can write our arguments without in quotes.

```
fun <- function(...){
#Assigning the symbols as a strings
var_loc <- cement(...)[1]
var_toll <- cement(...)[2]
df <- df_raw %>%
.[.$location==var_loc,] %>%
transmute(
date=date %>% as.Date(),
#First phase vaccinated people
partly=people_vaccinated-people_fully_vaccinated,
#Fully vaccinated people
fully=people_fully_vaccinated,
total_vaccinations,
new_cases,
new_deaths,
population
) %>% na.omit() %>%
group_by(month = format(date, "%Y%m")) %>%
mutate(
#The monthly amount
new_cases=cumsum(new_cases),
new_deaths=cumsum(new_deaths),
) %>%
#The last observation for each month
slice(which.max(cumsum(partly & fully))) %>%
ungroup() %>%
select(-month) %>%
#For percentage demonstration
mutate(partly_rate=round(partly/population,4)*100,
fully_rate=round(fully/population,4)*100)
fig <- plot_ly(df)
fig %>%
add_trace(x = ~date, y = ~partly_rate, type = 'bar',
name = 'partly',
marker = list(color = 'ligthblue'),
hoverinfo = "text",
text = ~paste(partly_rate, '%')) %>%
add_trace(x = ~date, y = ~fully_rate, type = 'bar',
name = 'fully',
marker = list(color = 'orange'),
hoverinfo = "text",
text = ~paste(fully_rate, '%')) %>%
add_trace(x=~date, y=~.data[[var_toll]], type='scatter',mode='lines',
name=str_replace(var_toll,"_"," "),
yaxis="y2",
line = list(color = 'red',width=4),
hoverinfo="text",
#.data is the pronoun provided by data masks:https://adv-r.hadley.nz/evaluation.html#pronouns
text=~paste(comma(.data[[var_toll]]))) %>%
layout(barmode = "stack",
legend=list(orientation='h', #horizontal
xanchor="center", #center of x axis
x=0.5),
xaxis=list(title=""),
yaxis=list(side='left',
ticksuffix="%",
title='Total Percentage of Vaccinations'),
yaxis2=list(overlaying="y",
side='right',
automargin=TRUE, #prevents axis title from overlapping the labels
title=paste("The number of",
str_replace(var_toll,"_"," "),
"given monthly interval")))
}
```

```
fun(Turkey,new_cases)
```

```
fun(Germany,new_cases)
```

When we examine the graphs, we see that the total vaccinations are below 50% in Turkey and above 60% in Germany. Both countries peaked in the number of cases in May, but there has been a steady decline in the cases line since then as the proportion of people fully vaccinated has risen. We can clearly see that the vaccine has worked so far. So get vaccinated as soon as possible, please!

]]>In order to check that we have to model inflation rates with some variables. The most common view of the economic authorities is that the variables affecting the rates are currency exchange rates, and CDS(credit default swap). Of course, we will also add the funding rates variable, the president mentioned, to the model to compare with the other explanatory variables.

Because the variables can be highly correlated with each other, we will prefer the random forest model. This algorithm also has a built-in function to compute the feature importance.

**Random Forest**; for **regression**, constructs multiple decision trees and, inferring the average estimation result of each decision tree. This algorithm is more robust to overfitting than the classical decision trees.

The random forest algorithms average these results; that is, it reduces the variation by training the different parts of the train set. This increases the performance of the final model, although this situation creates a small increase in bias.

The random forest uses bootstrap aggregating(**bagging**) algortihms. We would take for training sample, ** X = x_{1}, …, x_{n}** and,

The unseen values, * x’*, would be fitted by

The **permutation feature importance** method would be used to determine the effects of the variables in the random forest model. This method calculates the increase in the prediction error(**MSE**) after permuting the feature values. If the permuting wouldn’t change the model error, the related feature is considered unimportant. We will use the ** varImp** function to calculate variable importance. It has a default parameter,

The data we are going to use can be download here. The variables we will examine:

: The annual consumer price index. It is also called the inflation rate. This is our target variable.*cpi*: The one-week repo rate, which determined by the Turkish Central Bank. It is also called the political rate.*funding_rate*: The currency exchange rates between Turkish Liras and American dollars.*exchange_rate*: The credit defaults swap. It kind of measures the investment risk of a country or company. It is mostly affected by foreign policy developments in Turkey. Although the most popular CDS is 5year, I will take CDS of 1 year USD in this article.*CDS*

```
library(readxl)
library(dplyr)
library(ggplot2)
library(randomForest)
library(varImp)
#Create the dataframe
df <- read_excel("cpi.xlsx")
#Random Forest Modelling
model <- randomForest(CPI ~ funding_rate+exchange_rate,
data = df, importance=TRUE)
#Conditional=True, adjusts for correlations between predictors.
i_scores <- varImp(model, conditional=TRUE)
#Gathering rownames in 'var' and converting it to the factor
#to provide 'fill' parameter for the bar chart.
i_scores <- i_scores %>% tibble::rownames_to_column("var")
i_scores$var<- i_scores$var %>% as.factor()
#Plotting the bar and polar charts for comparing variables
i_bar <- ggplot(data = i_scores) +
geom_bar(
stat = "identity",#it leaves the data without count and bin
mapping = aes(x = var, y=Overall, fill = var),
show.legend = FALSE,
width = 1
) +
labs(x = NULL, y = NULL)
i_bar + coord_polar() + theme_minimal()
i_bar + coord_flip() + theme_minimal()
```

When we examine the charts above, we can clearly see that the funding and exchange rates have similar effects on the model and, CDS has significant importance in the behavior of the CPI. So the theory of the president seems to fall short of what he claims.

**References**

**Neural Network Structure**

The neural network consists of three layers. The predictors form the bottom layer, the predicted output form the top layer and, the ones of the middle between them form the intermediate layers called the hidden neurons.

As we can see from the above graph, each layer of intermediate nodes takes the input from the previous layer. Each node and corresponding weight coefficients are combined in linear regression. Because of the hidden neurons(j), the output is recalculated by a nonlinear function. This is called the **multilayer feed-forward network**.

As mentioned before, the output is recalculated by the sigmoid function, which is nonlinear.

The modified output is given to the next layer as an input. This process is useful for the network to be robust to outliers. The weights () are randomly chosen and then learn from the observations by minimizing the cost function (e.g. MSE).

In time series regression, lagged values of data are used as input by a neural network, which is called the **neural network autoregression(NNAR)**. We will use feed-forward with one hidden layer, which is denoted NNAR(p, k). **p **indicates lagged values, and **k **denotes the nodes in the hidden layer.

For seasonal time series, it is better to include that the last observation from the same term from the previous year. In this case, it is denoted as indicates that:

- for the lagged inputs(predictors).
- for the seasonal terms.

For non-seasonal time series data, the default **p **value is determined by the optimal number of lagged inputs according to the Akaike’s information criterion.

**T**is the number of observations.**SSE**: sum of the squared of errors.**k**is the number of predictors.

If k is not selected at first, it is determined as . The network repeats 20 times as default and is trained for starting points which are different random weights. All the results are averaged for ultimate output. The process uses lagged values as inputs for a one-step forecast. This forecast along with the lagged values is used to predict a two-step forecast. This process continues until it meets the required forecasts.

We will prefer the non-seasonal approach because the data we will be using is less than the two periodic cycles required for detecting seasonality. We are going to predict daily cases in the UK for the next 30 days. The data file(**covid_uk.csv**) can be downloaded from here.

```
library(tidyverse)
library(forecast)
library(purrr)
library(lubridate)
#Building the dataframe
df <- read_csv("covid_uk.csv",
col_types = list(col_character(),
col_date(format = "%d/%m/%Y"),
col_double()))
#Converting to the time series data
inds <- seq(as.Date("2020-01-31"), as.Date("2021-05-21"), by = "day")
df_ts <- df %>%
.$new_cases %>%
ts(start = c(2020, as.numeric(format(inds[1], "%j"))),frequency = 365)
```

Before modeling the data, we will use bootstrap aggregating to prevent overfitting and improve accuracy. This process has been detailed in one of the previous articles. We will make a function to predict, and show the results in a plot, but before doing that, we create a date transform variable to be used for readable dates.

```
#The function to convert decimal date to the proper format
date_transform <- function(x) {format(date_decimal(x), "%b %Y")}
#Forecasting with Simulated Neural Network and plotting the results
sim_nn <- function(data, freq, h, n=100){
#For reproducible results
set.seed(123)
#Bootstraping the time series
sim <- bld.mbb.bootstrap(data, n)
#Bagged Neural Network
future <- sim %>%
map(function(x){simulate(nnetar(x, P=0),nsim=h)}) %>%
unlist() %>%
matrix(ncol = h, nrow = n, byrow = TRUE) %>%
pmax(0)#preventing the negative values
#The beginning date of the prediction
start <- tsp(data)[2]+1/freq
#Forecast object with prediction intervals
#at %5 significance level
sim_fc <- structure(list(
mean = ts(colMeans(future), start=start, frequency=freq),
lower = future %>% as.data.frame() %>%
map_dbl(quantile, prob = 0.025) %>%
ts(start = start,frequency = freq),
upper = future %>% as.data.frame() %>%
map_dbl(quantile, prob = 0.975) %>%
ts(start = start,frequency = freq),
level=95),
class="forecast")
#Making predictions global variable for easy access
assign("simfc",simfc,envir = .GlobalEnv)
#Plotting the prediction results with the observations
sim_plot <- autoplot(data)+
autolayer(simfc)+
scale_x_continuous(labels = date_transform,breaks = seq(2020,2022,0.2))+
ylab("")+xlab("")+
ggtitle("Daily Cases of United Kingdom")+
theme_light()+
theme(plot.title = element_text(hjust = 0.5))
#Converting time series with decimal date
#to the dataframe for proper demonstration
start <- start %>% as.numeric() %>% date_decimal() %>% as.Date()
h_date <- seq(start, start+29, by = "day")
simfc_df <- simfc %>% data.frame()
rownames(simfc_df)<- h_date
#The output list
list(simfc_df,sim_plot)
}
```

```
sim_nn(df_ts,365,30)
```

**References**

- Forecasting: Principles and Practice,
*Rob J Hyndman and George Athanasopoulos* - Our World in Data

In order to that, we will compare the performance of some developed countries; these are France, Germany, the United Kingdom, and the United States. The dataset for this article can be found here.

In the dataset, we have the number of new cases and new deaths per day, and the population number of related countries. In the below code block, we are replacing blank and NA data with zeros for new cases and new deaths variables. We will limit the data to only this year.

```
library(tidyverse)
library(lubridate)
library(readxl)
library(plotly)
library(scales)
#Preparing the dataset
data <- read_excel("covid-19_comparing.xlsx")
data %>%
mutate(
date=as.Date(.$date),
new_deaths=if_else(is.na(.$new_deaths), 0, .$new_deaths),
new_cases=if_else(is.na(.$new_cases), 0 , .$new_cases)
) %>%
filter(.$date >= "2021-01-01", .$date <= "2021-04-26" ) -> df
```

First, we will compare the number of cases per day(new_cases) but we have to scale the related variable because each country has a different population. The scaling metric we are going to use is the normalized function. I preferred the ** ggplotly** function for ease of analysis on the graph because we can click on each legend to extract the related countries from the plot, and re-click to put it on back.

```
#Normalizing function
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
#Draw normalized results for comparising
df %>%
group_by(location) %>%
mutate(new_cases=normalize(new_cases)) %>%
ggplot(aes(date,new_cases,color=location))+
geom_line(size=1, alpha = 0.7) +
scale_y_continuous(limits = c(0, 1))+
guides(color=guide_legend(title="Countries"))+
labs(x="", y="New Cases")+
theme_minimal() -> cases_norm
ggplotly(cases_norm)
```

Since the beginning of the year, the spread of the disease has decreased dramatically in the UK and the USA, but the opposite has happened in France. As for Germany, it declined until March, but then their numbers rose. But in April, all countries declined radically.

To confirm the above trends for all countries, this time we will draw the number of daily cases per population on the graph. It looks like the same trends for the countries when we examine the below plot.

```
#New cases per population
df %>%
group_by(location) %>%
mutate(casesPerPop=new_cases/population) %>%
ggplot(aes(date,casesPerPop,color=location))+
geom_line(size=1, alpha = 0.7) +
scale_y_continuous(limits = c(0, 0.002),labels = percent_format())+
guides(color=guide_legend(title="Countries"))+
labs(x="",y="",title = "New Cases per Population")+
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5)) -> cases_perpop
ggplotly(cases_perpop)
```

I also want to know that the death rate, in order to shed on light the health system for the related countries. To do that we will plot the death toll per population and per case amount.

```
#Deaths per Population
df %>%
group_by(location) %>%
mutate(deathsPerPop=sum(new_deaths)/population) %>%
ggplot(aes(location,deathsPerPop,color=location))+
geom_bar(stat = "identity",position = "identity",fill=NA)+
scale_y_continuous(limits = c(0, NA),labels = percent_format())+
labs(x="",y="",title = "Deaths per Population")+
theme_minimal()+
theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
```

When we examine the below graph, despite the UK is better for the spread of the virus, it looks worst for the death rate per population.

```
#Deaths per Case
df %>%
group_by(location) %>%
mutate(deathsPerCase=sum(new_deaths)/sum(new_cases)) %>%
ggplot(aes(location,deathsPerCase,color=location))+
geom_bar(stat = "identity",position = "identity",fill=NA)+
scale_y_continuous(limits = c(0, NA),labels = percent_format())+
labs(x="",y="",title = "Deaths per Case")+
theme_minimal()+
theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
```

This time Germany performs worse than others. It should be noted that despite France less performed for the daily cases, it seems to have best results for both death rates.

While the effect of the pandemic decreases in developed countries, we had to be a full closure in Turkey by the government, unfortunately. As much as the virus is dangerous, bad management is also dangerous.

**References**

*The data source*: Our World in Data- R for Data Science:
*Hadley Wickham*& Garrett Grolemund - Predicting injuries for Chicago traffic crashes:
*Julia Silge*

Time series modeling, most of the time, uses past observations as predictor variables. But sometimes, we need external variables that affect the target variables. To include those variables, we have to use regression models. However, we are going to use **dynamic regression** to capture elaborated patterns; the difference from the orthodox regression models is that residuals are not white noise and are modeled by** ARIMA**.

The residuals we mentioned above, have autocorrelation, which means contain information. To indicate that, we will show as . In this way, the residuals term can be modeled by ARIMA. For instance, dynamic regression with ARIMA(1,1,1) as described:

denotes the white noise and **B**, the backshift notation. As we can see above equation, There two error terms: the one from regression model, , and the other from ARIMA model, .

In the previous article, we have created the dataset variable,* df_xautry*. We will transform it into the multivariate time series and split it as a test and training set. Finally, we will model the training data.

```
library(dplyr)
library(forecast)
#Building the multivariate time series
df <- df_xautry[-1]
df_mts <- df %>% ts(start=c(2013,1),frequency=12)
#Split the dataset
train <- df_mts %>% window(end=c(2020,12))
test <- df_mts %>% window(start=2021)
#Modeling the training data
fit_dynamic <- auto.arima(train[,"xau_try_gram"], xreg =train[,c(1,2)])
#Series: train[, "xau_try_gram"]
#Regression with ARIMA(1,0,2) errors
#Coefficients:
# ar1 ma1 ma2 intercept xe xau_usd_ounce
# 0.9598 -0.0481 0.4003 -150.8309 43.8402 0.1195
#s.e. 0.0390 0.0992 0.1091 23.7781 2.1198 0.0092
#sigma^2 estimated as 27.15: log likelihood=-293.31
#AIC=600.62 AICc=601.89 BIC=618.57
```

Based on the above results, we have ARIMA(1,0,2) model as described below:

Now, we will do forecasting and then calculate accuracy. The accuracy for xgboost will be calculated from the *forecast_xautrygram* variable.

```
#Forecasting
fcast_dynamic <- forecast(fit_dynamic, xreg = test[,1:2])
#Accuracy
acc_dynamic <- fcast_dynamic %>%
accuracy(test[,3]) %>%
.[,c("RMSE","MAPE")]
acc_xgboost <- forecast_xautrygram %>%
accuracy(test[,3]) %>%
.[,c("RMSE","MAPE")]
```

In order to visualize the accuracy results, we’re going to build the data frame and prepare it for a suitable bar chart.

```
#Tidying the dataframe
df_comparison <- data.frame(
"dynamic"=acc_dynamic,
"xgboost"=acc_xgboost
)
df_comparison
# dynamic.RMSE dynamic.MAPE xgboost.RMSE xgboost.MAPE
#Training set 5.044961 2.251683 0.001594868 0.000805107
#Test set 10.695489 2.501123 11.038134819 2.060825426
library(tidyr)
library(tibble)
df_comparison %>%
rownames_to_column(var = "data") %>%
gather(`dynamic.RMSE`, `dynamic.MAPE`,`xgboost.RMSE`, `xgboost.MAPE`,
key = "models", value="score") -> acc_comparison
acc_comparison
# data models score
#1 Training set dynamic.RMSE 5.044960948
#2 Test set dynamic.RMSE 10.695489161
#3 Training set dynamic.MAPE 2.251682989
#4 Test set dynamic.MAPE 2.501122965
#5 Training set xgboost.RMSE 0.001594868
#6 Test set xgboost.RMSE 11.038134819
#7 Training set xgboost.MAPE 0.000805107
#8 Test set xgboost.MAPE 2.060825426
```

```
#Plotting comparing models
ggplot(acc_comparison,aes(x=data,y=score,fill = models)) +
geom_bar(stat = "identity",position = "dodge") +
theme_bw()
```

**Conclusion**

When we examined the above results and the bar chart for unseen data, we are seeing some interesting results. For the training set, the xgboost model has near-zero accuracy rates which can lead to overfitting. The dynamic model looks slightly better for the RMSE but vice versa for the MAPE criteria.

This shed light that while the first-month dynamic model is better, for the second month the xgboost looks closer to the actual observation. Of course, the reason for that is maybe we have few test data but I wanted to predict the first two months of the current year. Maybe next time, if we have more test data, we can try again.

**References**

- Forecasting: Principles and Practice,
*Rob J Hyndman and George Athanasopoulos*

In recent years, XGBoost is an uptrend machine learning algorithm in time series modeling. XGBoost (**Extreme Gradient Boosting**) is a supervised learning algorithm based on boosting tree models. This kind of algorithms can explain how relationships between features and target variables which is what we have intended. We will try this method for our time series data but first, explain the mathematical background of the related tree model.

- K represents the number of tree
- represents the basic tree model.

We need a function that trains the model by measuring how well it fits the training data. This is called** the objective function**.

- represents the loss function which is the error between the predicted values and observed values.
- is
**the regularization function**to prevent overfitting.

- T represents the leaves, the number of
**leaf nodes**of each tree. Leaf node or terminal node means that has no child nodes. - represents the score (weight) of the leaves of each tree, so it is calculated in euclidean norm.
- represents the learning rate which is also called the shrinkage parameter. With shrinking the weights, the model is more robust against the closeness to the observed values. This prevents overfitting. It is between 0 and 1. The lower values mean that the more trees, the better performance, and the longer training time.
- represents the splitting threshold. The parameter is used to prevent the growth of a tree so the model is less complex and more robust against overfitting. The leaf node would split if the information gain less than . Its range is at

Now, we can start to examine our case we mention at the beginning of the article. In order to do that, we are downloading the dataset we are going to use, from here.

```
#Building data frame
library(readxl)
df_xautry <- read_excel("datasource/xautry_reg.xlsx")
df_xautry$date <- as.Date(df_xautry$date)
#Splitting train and test data set
train <- df_xautry[df_xautry$date < "2021-01-01",]
test <- df_xautry[-(1:nrow(train)),]
```

We will transform the train and test dataset to the DMatrix object to use in the xgboost process. And we will get the target values of the train set in a different variable to use in training the model.

```
#Transform train and test data to DMatrix form
library(dplyr)
library(xgboost)
train_Dmatrix <- train %>%
dplyr::select(xe, xau_usd_ounce) %>%
as.matrix() %>%
xgb.DMatrix()
pred_Dmatrix <- test %>%
dplyr::select(xe, xau_usd_ounce) %>%
as.matrix() %>%
xgb.DMatrix()
targets <- train$xau_try_gram
```

We will execute the cross-validation to prevent overfitting, and set the parallel computing parameters enable because the xgboost algorithm needs it. We will adjust all the parameter we’ve just mentioned above with * trainControl* function in caret package.

We also will make a list of parameters to train the model. Some of them are:

: A maximum number of iterations. It was shown by**nrounds**at the tree model equation. We will set a vector of values. It executes the values separately to find the optimal result. Too large values can lead to overfitting however, too small values can also lead to underfitting.**t**: The maximum number of trees. The greater the value of depth, the more complex and robust the model is but also the more likely it would be overfitting.*max_depth*As we mentioned before with**min_child_weight:**in the objective function, it determines the minimum sum of weights of leaf nodes to prevent overfitting.*w*: It is by subsetting the train data before the boosting tree process, it prevents overfitting. It is executed once at every iteration.*subsample*

```
#Cross-validation
library(caret)
xgb_trcontrol <- trainControl(
method = "cv",
number = 10,
allowParallel = TRUE,
verboseIter = FALSE,
returnData = FALSE
)
#Building parameters set
xgb_grid <- base::expand.grid(
list(
nrounds = seq(100,200),
max_depth = c(6,15,20),
colsample_bytree = 1,
eta = 0.5,
gamma = 0,
min_child_weight = 1,
subsample = 1)
)
```

Now that all the parameters and needful variables are set, we can build our model.

```
#Building the model
model_xgb <- caret::train(
train_Dmatrix,targets,
trControl = xgb_trcontrol,
tuneGrid = xgb_grid,
method = "xgbTree",
nthread = 10
)
```

We can also see the best optimal parameters.

```
model_xgb$bestTune
# nrounds max_depth eta gamma colsample_bytree min_child_weight #subsample
#1 100 6 0.5 0 1 1 1
```

To do some visualization in the ** forecast** function, we have to transform the predicted results into the

```
#Making the variables used in forecast object
fitted <- model_xgb %>%
stats::predict(train_Dmatrix) %>%
stats::ts(start = c(2013,1),frequency = 12)
ts_xautrygram <- ts(targets,start=c(2013,1),frequency=12)
forecast_xgb <- model_xgb %>% stats::predict(pred_Dmatrix)
forecast_ts <- ts(forecast_xgb,start=c(2021,1),frequency=12)
#Preparing forecast object
forecast_xautrygram <- list(
model = model_xgb$modelInfo,
method = model_xgb$method,
mean = forecast_ts,
x = ts_xautrygram,
fitted = fitted,
residuals = as.numeric(ts_xautrygram) - as.numeric(fitted)
)
class(forecast_xautrygram) <- "forecast"
```

We will show train, unseen, and predicted values for comparison in the same graph.

```
#The function to convert decimal time label to wanted format
library(lubridate)
date_transform <- function(x) {format(date_decimal(x), "%Y")}
#Making a time series varibale for observed data
observed_values <- ts(test$xau_try_gram,start=c(2021,1),frequency=12)
#Plot forecasting
library(ggplot2)
library(forecast)
autoplot(forecast_xautrygram)+
autolayer(forecast_xautrygram$mean,series="Predicted",size=0.75) +
autolayer(forecast_xautrygram$x,series ="Train",size=0.75 ) +
autolayer(observed_values,series = "Observed",size=0.75) +
scale_x_continuous(labels =date_transform,breaks = seq(2013,2021,2) ) +
guides(colour=guide_legend(title = "Time Series")) +
ylab("Price") + xlab("Time") +
ggtitle("") +
theme_bw()
```

To satisfy that curiosity we mentioned at the very beginning of the article, we will find the ratio that affects the target variable of each explanatory variable separately.

```
#Feature importance
library(Ckmeans.1d.dp)
xgb_imp <- xgb.importance(
feature_names = colnames(train_Dmatrix),
model = model_xgb$finalModel)
xgb.ggplot.importance(xgb_imp,n_clusters = c(2))+
ggtitle("") +
theme_bw()+
theme(legend.position="none")
xgb_imp$Importance
#[1] 0.92995147 0.07004853
```

**Conclusion**

When we examine the above results and plot, contrary to popular belief, it is seen that the exchange rate has a more dominant effect than the price of ounce gold. In the next article, we will compare this method with the dynamic regression ARIMA model.

**References**

The reason for this statement was the pressure of the Istanbul Metropolitan Mayor. He has said that according to data released by the cemetery administration, a municipal agency, the daily number of infected deaths were nearly that two times the daily number of the death tolls explained by the ministry.

So, I decided to check the mayor’s claims. To do that, I have to do some predictions; but, not for the future, for the past. Fortunately, there is a method for this that is called **Backcasting**. Let’s take a vector of time series and estimate .

- One-step estimation for
**backcasting**: with .

- One- step estimation for
**forecasting**, with

As you can see above, the backcasting coefficients are the same as the forecasting coefficients(). For instance, in this case, the model for new cases is **ARIMA(0, 1, 2) with drift**:

- For
**forecasting**: - For
**backcasting**:

```
#Function to reverse the time series
reverse_ts <- function(y)
{
y %>%
rev() %>%
ts(start=tsp(y)[1L], frequency=frequency(y))
}
#Function to reverse the forecast
reverse_forecast <- function(object)
{
h <- object[["mean"]] %>% length()
f <- object[["mean"]] %>% frequency()
object[["x"]] <- object[["x"]] %>% reverse_ts()
object[["mean"]] <- object[["mean"]] %>% rev() %>%
ts(end=tsp(object[["x"]])[1L]-1/f, frequency=f)
object[["lower"]] <- object[["lower"]][h:1L,]
object[["upper"]] <- object[["upper"]][h:1L,]
return(object)
}
```

We would first reverse the time series and then make predictions and again reverse the forecast results. The data that we are going to model is the number of daily new cases and daily new deaths, between the day the health minister’s explanation was held and the day the vaccine process in Turkey has begun. We will try to predict the ten days before the date 26-11-2020.

```
#Creating datasets
df <- read_excel("datasource/covid-19_dataset.xlsx")
df$date <- as.Date(df$date)
#The data after the date 25-11-2020:Train set
df_after<- df[df$date > "2020-11-25",]
#The data between 15-11-2020 and 26-11-2020:Test set
df_before <- df[ df$date > "2020-11-15" & df$date < "2020-11-26",]
#Creating dataframes for daily cases and deaths
df_cases <- bc_cases %>% data.frame()
df_deaths <- bc_deaths %>% data.frame()
#Converting the numeric row names to date object
options(digits = 9)
date <- df_cases %>%
rownames() %>%
as.numeric() %>%
date_decimal() %>%
as.Date()
#Adding date object created above to the data frames
df_cases <- date %>% cbind(df_cases) %>% as.data.frame()
colnames(df_cases)[1] <- "date"
df_deaths <- date %>% cbind(df_deaths) %>% as.data.frame()
colnames(df_deaths)[1] <- "date"
#Convert date to numeric to use in ts function
n <- as.numeric(as.Date("2020-11-26")-as.Date("2020-01-01")) + 1
#Creating time series variables
ts_cases <- df_after$new_cases %>%
ts(start = c(2020, n),frequency = 365 )
ts_deaths <- df_after$new_deaths %>%
ts(start = c(2020, n),frequency = 365 )
#Backcast variables
ts_cases %>%
reverse_ts() %>%
auto.arima() %>%
forecast(h=10) %>%
reverse_forecast() -> bc_cases
ts_deaths %>%
reverse_ts() %>%
auto.arima() %>%
forecast(h=10) %>%
reverse_forecast() -> bc_deaths
```

It might be very useful to make a function to plot the comparison for backcast values and observed data.

```
#Plot function for comparison
plot_fun <- function(data,column){
ggplot(data = data,aes(x=date,y=Point.Forecast))+
geom_line(aes(color="blue"))+
geom_line(data = df_before,aes(x=date,y=.data[[column]],color="red"))+
geom_line(data = df_after,aes(x=date,y=.data[[column]],color="black"))+
geom_ribbon(aes(ymin=Lo.95, ymax=Hi.95), linetype=2,alpha=0.1,fill="blue")+
geom_ribbon(aes(ymin=Lo.80, ymax=Hi.80), linetype=2, alpha=0.1,fill="blue")+
scale_color_identity(name = "Lines",
breaks = c("black", "red", "blue"),
labels = c("After", "Real", "Backcast"),
guide = "legend")+
ylab(str_replace(column,"_"," "))+
theme_light()
}
```

```
plot_fun(df_cases, "new_cases")
```

```
plot_fun(df_deaths, "new_deaths")
```

**Conclusion**

When we examine the graph, the difference in death toll seems relatively close. However, the levels of daily cases are significantly different from each other. Although this estimate only covers ten days, it suggests that there is inconsistency in the numbers given.

**References**

- Forecasting: Principles and Practice,
*Rob J Hyndman and George Athanasopoulos* *Joan Bruna*, Stat 153: Introduction to Time Series- Our World in Data

Modeling time series data is difficult because the data are autocorrelated. In this case, **moving block bootstrap (MBB)** should be preferred because MBB resamples the data inside overlapping blocks to imitate the autocorrelation in the data. If the length of a time series, **n**, and the block size **l**, the number of overlapping blocks are found as below:

What we mean by overlapping block is that observation 1 to l would be block 1, observation 2 to l+1 would be block 2, etc. We should use a block size for at least two years(**l=24**) for monthly data because we have to be certain whether there is any remaining seasonality in the block.

From these **n-l+1** blocks, **n/l** blocks will be selected randomly and they will be gathered in order, to build the bootstrap observations. The time series values can be repetitive in different blocks.

This bootstrap process would be exercised to the remainder component after the time series decomposition. If there is seasonality it is used the stl function(trend, seasonal, remainder) otherwise the loess function(trend, remainder) is chosen for the decomposition. It should not be forgotten that the data has to be stationary in the first place.

Box-Cox transformation is made at the beginning but back-transformed at the end of the process; as we mentioned before, when we do average all the bootstrapped series, which is called **bagging**, we could handle the non-stability data problem and improve accuracy compared to the original series.

As we remembered from the previous two articles, we have tried to model gold prices per gram in Turkey. We have determined the ARIMA model the best for forecasting. This time, we will try to improve using the bagging mentioned above.

In order to that, we will create a function that makes bootstrapping simulations and builds the prediction intervals we want. We will adjust the simulation number, model, and confidence level as default. We will use the assign function to make the **bagged **data(**simfc**) as a global variable, so we will able to access it outside the function as well.

```
#Simulation function
library(purrr)
library(forecast)
sim_forecast <- function(data, nsim=100L, h, mdl=auto.arima, level=95){
sim <- bld.mbb.bootstrap(data, nsim)
h <- as.integer(h)
future <- matrix(0, nrow=nsim, ncol=h)
future <- sim %>% map(function(x){simulate(mdl(x),nsim=h)}) %>%
unlist() %>% matrix(ncol = h, nrow = nsim, byrow = TRUE)
start < - tsp(data)[2]+1/12
simfc <- structure(list(
mean = future %>% colMeans() %>% ts(start = start, frequency = 12),
lower = future %>% as.data.frame() %>%
map_dbl(quantile, prob = (1-level/100)/2) %>%
ts(start = start,frequency = 12),
upper = future %>% as.data.frame() %>%
map_dbl(quantile, prob = (1-level/100)/2+level/100) %>%
ts(start = start,frequency = 12),
level=level),
class="forecast")
assign("simfc",simfc,envir = .GlobalEnv)
simfc
}
```

Because of the averaging part of the bagging, we don’t use the lambda parameter of Box-Cox transformation for the stability of the variance. We can see the forecasting results for 18 months in %95 confidence interval for training set below. We can also change the model type or confidence level if we want.

```
sim_forecast(train, h=18)
Point Forecast Lo 95 Hi 95
#Mar 2019 242.3121 215.5464 268.2730
#Apr 2019 243.4456 206.4015 274.5155
#May 2019 249.9275 216.8712 283.4226
#Jun 2019 252.8518 219.7168 283.0535
#Jul 2019 259.0699 216.7776 302.4991
#Aug 2019 267.2599 219.5771 310.7458
#Sep 2019 270.8745 214.1733 324.4255
#Oct 2019 272.0894 215.1619 333.2733
#Nov 2019 275.5566 213.8802 337.9301
#Dec 2019 280.3914 219.2063 349.1284
#Jan 2020 291.4792 215.9117 364.1899
#Feb 2020 296.3475 221.9117 380.2887
#Mar 2020 302.0706 219.0779 399.1135
#Apr 2020 304.4595 217.5600 400.7724
#May 2020 310.8251 217.5561 420.6515
#Jun 2020 315.5942 221.5791 431.9727
#Jul 2020 322.4536 220.4798 452.4229
#Aug 2020 331.1163 223.3746 465.2015
```

We will create the Arima model as same as we did before and compare it with a bagged version of it in a graph.

```
arimafc <- train %>%
auto.arima(stepwise = FALSE,approximation = FALSE,
seasonal = FALSE, lambda = "auto") %>%
forecast(h=h,level=95)
autoplot(train) +
ggtitle("Monthly Golden Price per Gram") +
xlab("Year") + ylab("Price") +
autolayer(test, series="Real values",PI=FALSE) +
autolayer(simfc, series="Bagged ARIMA",PI=FALSE) +
autolayer(arimafc, series="ARIMA",PI=FALSE)+
theme_light()
```

When we examine the above plot, we can see that the bagged Arima model is smoother and more accurate compared to the classic version; but it is seen that when the forecasting horizon increases, both models are failed to capture the uptrend.

In the below, we are comparing the accuracy of models in numeric. We can easily see the difference in the accuracy level that we saw in the plot. The reason **NaN** values of the simulated version is that there is no estimation of fitted values(one-step forecasts) in the training set.

```
#Accuracy comparison
acc_arimafc <- arimafc %>%accuracy(test)
acc_arimafc[,c("RMSE","MAPE")]
# RMSE MAPE
#Training set 9.045056 3.81892
#Test set 67.794358 14.87034
acc_simu <- simfc %>% accuracy(test)
acc_simu[,c("RMSE","MAPE")]
# RMSE MAPE
#Training set NaN NaN
#Test set 54.46326 8.915361
```

**Conclusion**

When we examine the results we have found, it is seen that bootstrapping simulation with averaging (bagging) improves the accuracy significantly. Besides that, due to the simulation process, it can be very time-consuming.

**References**

- Forecasting: Principles and Practice,
*Rob J Hyndman and George Athanasopoulos* - Bagging Exponential Smoothing Methods usingSTL Decomposition and Box-Cox Transformation,
*Christoph Bergmeira, Rob J Hyndmanb, Jos ́e M Benitez* - Block Bootstrap, Wikipedia
- Bootstrap aggregating, Wikipedia