Why time series matter

A time series is data ordered by time.

Examples:

| Domain | Time series example |
| --- | --- |
| Weather | hourly temperature |
| Finance | stock prices |
| Energy | electricity demand |
| Healthcare | heart-rate signal |
| Industry | vibration sensor readings |
| Language/audio | waveform samples |
| Traffic | vehicle count per minute |
| AI systems | GPU utilization over time |
| Video | frames evolving through time |

A time series is different from an ordinary table because order matters.

If we observe:

$$ 10,\; 12,\; 15,\; 14,\; 18 $$

this is not the same as:

$$ 18,\; 10,\; 15,\; 12,\; 14. $$

The values are the same, but the temporal structure is different.

Time-series analysis studies exactly this structure: dependence, trend, seasonality, cycles, shocks, noise, regime changes, long memory, and uncertainty. MIT’s graduate time-series course covers stationarity, lag operators, ARMA models, covariance structure, spectral analysis, GMM, VARs, structural breaks, and related econometric theory. (MIT OpenCourseWare)

Part I — Beginner foundation

A simple time series

Suppose we record daily temperature:

| Day | Temperature |
| --- | --- |
| 1 | 20 |
| 2 | 21 |
| 3 | 19 |
| 4 | 22 |
| 5 | 24 |
| 6 | 23 |

We write:

$$ y_1=20,\quad y_2=21,\quad y_3=19,\quad y_4=22,\quad y_5=24,\quad y_6=23. $$

The full time series is:

$$ y_{1:6} = (y_1,y_2,y_3,y_4,y_5,y_6). $$

In general:

$$ y_{1:T}=(y_1,y_2,\dots,y_T). $$

Here:

  • \(t\) is the time index,
  • \(T\) is the total number of observations,
  • \(y_t\) is the value at time \(t\).

If the values are scalar, then:

$$ y_t\in\mathbb{R}. $$

This is a univariate time series.

Multivariate time series

Sometimes we observe many variables at each time.

For example, a weather station may record:

| Time | Temperature | Humidity | Wind speed |
| --- | --- | --- | --- |
| 1 | 20 | 60 | 5 |
| 2 | 21 | 58 | 6 |
| 3 | 19 | 70 | 4 |

At time \(t\), we have a vector:

$$ y_t = \begin{bmatrix} \text{temperature}_t \\ \text{humidity}_t \\ \text{wind}_t \end{bmatrix} \in\mathbb{R}^{3}. $$

The whole series is:

$$ y_{1:T} = (y_1,y_2,\dots,y_T). $$

As a matrix:

$$ Y = \begin{bmatrix} y_1^\top \\ y_2^\top \\ \vdots \\ y_T^\top \end{bmatrix} \in\mathbb{R}^{T\times d}. $$

Here:

  • \(T\) is sequence length,
  • \(d\) is number of variables.

For a batch of \(B\) time series:

$$ Y\in\mathbb{R}^{B\times T\times d}. $$

So time-series deep learning is tensor learning.

Forecasting

The most common time-series task is forecasting.

Given past values:

$$ y_1,y_2,\dots,y_T, $$

predict future values:

$$ y_{T+1},y_{T+2},\dots,y_{T+H}. $$

Here:

  • \(T\) is the history length,
  • \(H\) is the forecast horizon.

We write:

$$ \hat{y}_{T+1:T+H} {}={} f_\theta(y_{1:T}). $$

For one-step forecasting:

$$ \hat{y}_{T+1}=f_\theta(y_{1:T}). $$

For multi-step forecasting:

$$ \hat{y}_{T+1:T+H}=f_\theta(y_{1:T}). $$

Hyndman and Athanasopoulos’ Forecasting: Principles and Practice is a standard open textbook that introduces forecasting methods and practical modeling choices, including time-series features, exponential smoothing, ARIMA, dynamic regression, hierarchical forecasting, and practical evaluation. (OTexts: Online, open-access textbooks)

Input window and output horizon

In machine learning, we often convert a time series into supervised examples.

Suppose:

$$ y_1,y_2,\dots,y_{100}. $$

Choose input length:

$$ L=5 $$

and forecast horizon:

$$ H=2. $$

Then one training example is:

$$ x_1= \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix}, \qquad \text{target}= \begin{bmatrix} y_6 \\ y_7 \end{bmatrix}. $$

Another example is:

$$ x_2= \begin{bmatrix} y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{bmatrix}, \qquad \text{target}= \begin{bmatrix} y_7 \\ y_8 \end{bmatrix}. $$

This is called a sliding window construction.

In general:

$$ x_t=(y_t,y_{t+1},\dots,y_{t+L-1}), $$

and:

$$ z_t=(y_{t+L},\dots,y_{t+L+H-1}). $$

The model learns:

$$ f_\theta:x_t\mapsto z_t. $$
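
The construction is easy to write down concretely. Below is a minimal NumPy sketch; the helper name `sliding_windows` and the toy series are illustrative, not from any library:

```python
import numpy as np

def sliding_windows(y, L, H):
    """Turn a 1-D series into (input window, target horizon) training pairs."""
    N = len(y) - L - H + 1                                  # number of complete examples
    X = np.stack([y[i : i + L] for i in range(N)])          # inputs, shape (N, L)
    Z = np.stack([y[i + L : i + L + H] for i in range(N)])  # targets, shape (N, H)
    return X, Z

y = np.arange(1.0, 101.0)            # y_1, ..., y_100
X, Z = sliding_windows(y, L=5, H=2)
print(X[0], Z[0])                    # [1. 2. 3. 4. 5.] [6. 7.]
print(X.shape, Z.shape)              # (94, 5) (94, 2)
```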

Time-series tasks

Time-series machine learning is not only forecasting.

It includes:

| Task | Goal |
| --- | --- |
| Forecasting | predict future values |
| Classification | assign a label to a sequence |
| Regression | predict a continuous target from a sequence |
| Anomaly detection | find unusual time points or segments |
| Imputation | fill missing values |
| Segmentation | divide a sequence into regimes |
| Clustering | group similar time series |
| Representation learning | learn useful embeddings |
| Simulation/generation | generate realistic future trajectories |
| Control | choose actions over time |

The UCR and UEA archives became major public benchmarks for time-series classification; the UEA 2018 multivariate archive was introduced to improve rigorous evaluation for multivariate time-series classification, where each example may contain multiple channels. (arXiv)

Part II — Core statistical time-series mathematics

Deterministic signal plus noise

A useful beginner model is:

$$ y_t = s_t+\varepsilon_t. $$

Here:

  • \(s_t\) is the true signal,
  • \(\varepsilon_t\) is noise.

For example:

$$ y_t = \text{trend}_t + \text{seasonality}_t + \text{noise}_t. $$

So:

$$ y_t = T_t + S_t + R_t. $$

Here:

  • \(T_t\) is trend,
  • \(S_t\) is seasonal pattern,
  • \(R_t\) is residual noise.

This decomposition is central in classical forecasting and remains useful in modern deep learning. Hyndman and Athanasopoulos present decomposition as a core tool for understanding trend and seasonal structure before modeling. (OTexts: Online, open-access textbooks)

Trend

A trend is a long-term direction.

A linear trend can be written:

$$ T_t = a+bt. $$

Here:

  • \(a\) is intercept,
  • \(b\) is slope.

If \(b>0\), the series increases.

If \(b<0\), the series decreases.

For example:

$$ y_t = 10+0.5t+\varepsilon_t. $$

This means the average value grows by 0.5 per time step.

Seasonality

Seasonality is a repeated pattern.

For daily data with weekly seasonality, the period is:

$$ m=7. $$

For monthly data with yearly seasonality:

$$ m=12. $$

A simple seasonal model is:

$$ y_t = T_t + S_{t \bmod m}+\varepsilon_t. $$

Fourier terms can model smooth seasonality:

$$ S_t = \sum_{k=1}^{K} \left[ a_k\cos\left(\frac{2\pi kt}{m}\right) + b_k\sin\left(\frac{2\pi kt}{m}\right) \right]. $$

This connects time series to frequency-domain mathematics.

Lag operator

The lag operator \(L\) shifts a series backward:

$$ Ly_t=y_{t-1}. $$

Then:

$$ L^2y_t=y_{t-2}. $$

MIT’s graduate time-series notes begin with stationarity, lag operators, ARMA models, and covariance structure, because lag notation is the compact language of classical time-series theory. (MIT OpenCourseWare)

For example:

$$ y_t = 0.8y_{t-1}+\varepsilon_t $$

can be written:

$$ y_t = 0.8Ly_t+\varepsilon_t. $$

So:

$$ (1-0.8L)y_t=\varepsilon_t. $$

Stationarity

A time series is stationary if its statistical properties do not change over time.

Weak stationarity means:

  1. constant mean:
$$ \mathbb{E}[y_t]=\mu, $$
  2. constant variance:
$$ \operatorname{Var}(y_t)=\sigma^2, $$
  3. autocovariance that depends only on lag, not absolute time:
$$ \operatorname{Cov}(y_t,y_{t-k})=\gamma(k). $$

Stationarity matters because many classical time-series models assume stable dependence structure. MIT’s time-series notes treat stationarity as foundational for ARMA models and covariance analysis. (MIT OpenCourseWare)

Autocovariance

The autocovariance at lag \(k\) is:

$$ \gamma(k) {}={} \operatorname{Cov}(y_t,y_{t-k}) {}={} \mathbb{E}[(y_t-\mu)(y_{t-k}-\mu)]. $$

For \(k=0\):

$$ \gamma(0)=\operatorname{Var}(y_t). $$

Autocovariance measures how values separated by \(k\) time steps move together.

Autocorrelation

Autocorrelation normalizes autocovariance:

$$ \rho(k)=\frac{\gamma(k)}{\gamma(0)}. $$

So:

$$ -1\leq \rho(k)\leq 1. $$

If:

$$ \rho(1)>0, $$

then nearby values tend to move together.

If:

$$ \rho(1)<0, $$

then high values tend to be followed by low values.

If:

$$ \rho(k)\approx 0, $$

then values \(k\) steps apart are weakly linearly related.
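
These quantities are easy to estimate from data. Here is a minimal NumPy sketch of the sample ACF, applied to a simulated series with strong positive dependence; the helper is our own, and this variant averages over the available pairs at each lag:

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelation rho(k) = gamma(k) / gamma(0) for k = 0..max_lag."""
    y = np.asarray(y, float) - np.mean(y)
    gamma0 = np.mean(y * y)               # sample variance, gamma(0)
    rho = [1.0]
    for k in range(1, max_lag + 1):
        rho.append(np.mean(y[k:] * y[:-k]) / gamma0)
    return np.array(rho)

rng = np.random.default_rng(0)
y = np.zeros(1000)
for t in range(1, 1000):                  # AR(1) with phi = 0.8, as in the lag-operator example
    y[t] = 0.8 * y[t - 1] + rng.normal()
print(sample_acf(y, 3).round(2))          # close to [1.0, 0.8, 0.64, 0.51]
```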

White noise

A white-noise process satisfies:

$$ \mathbb{E}[\varepsilon_t]=0, $$

$$ \operatorname{Var}(\varepsilon_t)=\sigma^2, $$

$$ \operatorname{Cov}(\varepsilon_t,\varepsilon_s)=0 \quad \text{for } t\neq s. $$

If:

$$ \varepsilon_t\sim\mathcal{N}(0,\sigma^2) $$

independently, then it is Gaussian white noise.

White noise is the basic random innovation in ARMA, state-space, Kalman filtering, SDEs, and probabilistic forecasting.

Random walk

A random walk is:

$$ y_t=y_{t-1}+\varepsilon_t. $$

Equivalently:

$$ \Delta y_t = y_t-y_{t-1}=\varepsilon_t. $$

The variance grows over time.

If:

$$ y_0=0, $$

then:

$$ y_t=\sum_{i=1}^{t}\varepsilon_i. $$

If the innovations have variance \(\sigma^2\), then:

$$ \operatorname{Var}(y_t)=t\sigma^2. $$

So the process is not stationary.

Random walks are important in finance, stochastic processes, diffusion limits, and non-stationary forecasting.

Part III — Classical forecasting models

Autoregressive model AR\(p\)

An autoregressive model predicts the present from past values.

An AR(1) model is:

$$ y_t = c+\phi y_{t-1}+\varepsilon_t. $$

An AR\(p\) model is:

$$ y_t {}={} c+ \phi_1y_{t-1} + \phi_2y_{t-2} + \cdots + \phi_py_{t-p} + \varepsilon_t. $$

Using lag notation:

$$ \phi(L)y_t=c+\varepsilon_t, $$

where:

$$ \phi(L)=1-\phi_1L-\phi_2L^2-\cdots-\phi_pL^p. $$

AR models are linear memory models.

They say:

the future depends linearly on the past.
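
Because the model is linear in past values, AR coefficients can be estimated by ordinary least squares on lagged copies of the series. A minimal sketch, assuming a simulated AR(2) with true coefficients \(\phi_1=0.5\) and \(\phi_2=0.3\):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2000
y = np.zeros(T)
for t in range(2, T):                   # simulate y_t = 0.5 y_{t-1} + 0.3 y_{t-2} + eps_t
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

# Regress y_t on (1, y_{t-1}, y_{t-2}) for t = 2 .. T-1.
X = np.column_stack([np.ones(T - 2), y[1 : T - 1], y[0 : T - 2]])
c, phi1, phi2 = np.linalg.lstsq(X, y[2:], rcond=None)[0]
print(round(c, 2), round(phi1, 2), round(phi2, 2))   # approximately 0.0, 0.5, 0.3
```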

Moving-average model MA\(q\)

A moving-average model uses past noise terms.

An MA(1) model is:

$$ y_t=\mu+\varepsilon_t+\theta\varepsilon_{t-1}. $$

An MA\(q\) model is:

$$ y_t {}={} \mu+ \varepsilon_t+ \theta_1\varepsilon_{t-1} + \cdots + \theta_q\varepsilon_{t-q}. $$

This means shocks can influence future observations for several time steps.

ARMA model

An ARMA\(p,q\) model combines AR and MA terms:

$$ y_t {}={} c+ \sum_{i=1}^{p}\phi_i y_{t-i} + \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}. $$

This is useful for stationary time series.

MIT’s graduate time-series material covers ARMA and covariance structure as central tools for stationary time-series analysis. (MIT OpenCourseWare)

ARIMA model

Many real time series are non-stationary.

ARIMA handles non-stationarity using differencing.

The first difference is:

$$ \nabla y_t = y_t-y_{t-1}. $$

If the differenced series is stationary, we can apply ARMA to:

$$ \nabla^d y_t. $$

An ARIMA\(p,d,q\) model means:

  • \(p\): autoregressive order,
  • \(d\): differencing order,
  • \(q\): moving-average order.

Hyndman and Athanasopoulos present ARIMA as a core classical forecasting model, alongside exponential smoothing and dynamic regression. (OTexts: Online, open-access textbooks)

Seasonal ARIMA

Seasonal ARIMA adds seasonal lags.

A common notation is:

$$ \operatorname{ARIMA}(p,d,q)(P,D,Q)_m. $$

Here:

  • \(m\) is seasonal period,
  • \(P,D,Q\) are seasonal AR, differencing, and MA orders.

For monthly data with yearly seasonality:

$$ m=12. $$

For hourly data with daily seasonality:

$$ m=24. $$

Seasonal models are important because many time series contain repeating patterns.

Exponential smoothing

Simple exponential smoothing updates a level estimate:

$$ \ell_t=\alpha y_t+(1-\alpha)\ell_{t-1}. $$

Here:

$$ 0<\alpha<1. $$

The forecast is:

$$ \hat{y}_{t+h|t}=\ell_t. $$
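
The recursion is a few lines of code. A minimal sketch, assuming the common convention of initializing the level at the first observation:

```python
import numpy as np

def simple_exponential_smoothing(y, alpha):
    """Recursively update the level; the forecast is the final level."""
    level = y[0]                         # one common initialization choice
    for value in y[1:]:
        level = alpha * value + (1 - alpha) * level
    return level                         # y_hat_{t+h|t} = level, for every horizon h

y = np.array([20.0, 21.0, 19.0, 22.0, 24.0, 23.0])
print(simple_exponential_smoothing(y, alpha=0.3))
```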

Holt’s method adds trend:

$$ \ell_t=\alpha y_t+(1-\alpha)(\ell_{t-1}+b_{t-1}), $$

$$ b_t=\beta(\ell_t-\ell_{t-1})+(1-\beta)b_{t-1}. $$

Holt-Winters methods add seasonality.

ETS models formalize exponential smoothing in terms of error, trend, and seasonal components. Hyndman’s forecasting material gives the standard treatment of ETS, ARIMA, and their state-space relationships. (Rob J Hyndman)

Part IV — State-space models and Kalman filtering

Why state-space models matter

Sometimes the observed time series is only a noisy measurement of a hidden state.

For example:

  • observed GPS position is noisy,
  • true physical position is hidden,
  • sensor readings contain measurement error,
  • economic indicators reflect hidden market state.

A state-space model separates:

  1. hidden state dynamics,
  2. observation process.

Linear Gaussian state-space model

A standard model is:

$$ x_t = A x_{t-1}+w_t, $$

$$ y_t = C x_t+v_t. $$

Here:

  • \(x_t\) is hidden state,
  • \(y_t\) is observation,
  • \(A\) is state transition matrix,
  • \(C\) is observation matrix,
  • \(w_t\sim\mathcal{N}(0,Q)\) is process noise,
  • \(v_t\sim\mathcal{N}(0,R)\) is observation noise.

Stanford and MIT lecture notes present state-space models and Kalman filtering as efficient algorithms for state estimation in linear Gaussian systems. (Stanford University)

Kalman prediction step

Suppose we have an estimate:

$$ \hat{x}_{t-1|t-1} $$

with covariance:

$$ P_{t-1|t-1}. $$

Prediction:

$$ \hat{x}_{t|t-1}=A\hat{x}_{t-1|t-1}, $$

$$ P_{t|t-1}=AP_{t-1|t-1}A^\top+Q. $$

This predicts the next hidden state before seeing \(y_t\).

Kalman update step

The predicted observation is:

$$ \hat{y}_{t|t-1}=C\hat{x}_{t|t-1}. $$

The innovation is:

$$ r_t=y_t-\hat{y}_{t|t-1}. $$

The innovation covariance is:

$$ S_t=CP_{t|t-1}C^\top+R. $$

The Kalman gain is:

$$ K_t=P_{t|t-1}C^\top S_t^{-1}. $$

Update:

$$ \hat{x}_{t|t} {}={} \hat{x}_{t|t-1}+K_t r_t, $$

$$ P_{t|t} {}={} (I-K_tC)P_{t|t-1}. $$

The Kalman filter is important because it is a mathematically exact recursive estimator for linear Gaussian state-space models.
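
One predict-update cycle translates directly into matrix code. A minimal NumPy sketch; the function name and the scalar toy system are illustrative:

```python
import numpy as np

def kalman_step(x_hat, P, y, A, C, Q, R):
    """One predict-update cycle of the Kalman filter."""
    # Predict.
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q
    # Update.
    r = y - C @ x_pred                      # innovation
    S = C @ P_pred @ C.T + R                # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ r
    P_new = (np.eye(len(x_hat)) - K @ C) @ P_pred
    return x_new, P_new

# Toy example: a scalar random-walk state observed with noise.
A = np.array([[1.0]]); C = np.array([[1.0]])
Q = np.array([[0.01]]); R = np.array([[1.0]])
x_hat, P = np.array([0.0]), np.array([[1.0]])
for y_obs in [0.9, 1.1, 1.0, 1.2]:
    x_hat, P = kalman_step(x_hat, P, np.array([y_obs]), A, C, Q, R)
print(x_hat, P)
```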

Hidden Markov models

A hidden Markov model, or HMM, uses a discrete hidden state:

$$ z_t\in{1,\dots,K}. $$

The hidden state evolves as a Markov chain:

$$ p(z_t\mid z_{t-1}). $$

The observation is generated from:

$$ p(y_t\mid z_t). $$

So the joint distribution is:

$$ p(z_{1:T},y_{1:T}) {}={} p(z_1)p(y_1\mid z_1) \prod_{t=2}^{T} p(z_t\mid z_{t-1})p(y_t\mid z_t). $$

Rabiner’s classic tutorial formalized the three central HMM problems: likelihood evaluation, decoding the most likely hidden state sequence, and parameter estimation. (Computer Science at UBC)

Part V — Spectral and frequency-domain mathematics

Time domain versus frequency domain

A time series can be studied in the time domain:

$$ y_t $$

or in the frequency domain.

Frequency analysis asks:

which oscillations are present in the signal?

For example, electricity demand may have:

  • daily frequency,
  • weekly frequency,
  • annual frequency.

MIT’s time-series course includes spectrum and spectrum estimation as core topics after stationarity and ARMA modeling. (MIT OpenCourseWare)

Discrete Fourier transform

For a finite sequence:

$$ y_0,y_1,\dots,y_{T-1}, $$

the discrete Fourier transform is:

$$ Y_k {}={} \sum_{t=0}^{T-1} y_t e^{-2\pi i kt/T}. $$

The inverse transform is:

$$ y_t {}={} \frac{1}{T} \sum_{k=0}^{T-1} Y_k e^{2\pi i kt/T}. $$

The magnitude:

$$ |Y_k| $$

shows the strength of frequency \(k\).
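
NumPy computes the DFT directly. A minimal sketch that recovers a known period from a noisy sinusoid; the signal, period, and noise level are illustrative:

```python
import numpy as np

T = 240
t = np.arange(T)
# Signal with period 24 (e.g., a daily cycle in hourly data) plus noise.
y = 3.0 * np.sin(2 * np.pi * t / 24) + np.random.default_rng(0).normal(scale=0.5, size=T)

Y = np.fft.rfft(y)                 # DFT of a real signal: frequencies k = 0 .. T/2
k = np.abs(Y[1:]).argmax() + 1     # strongest nonzero frequency
print(k, T / k)                    # 10 cycles in 240 samples -> period 24
```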

Spectral density

For a stationary process with autocovariance \(\gamma(k)\), the spectral density is:

$$ f(\omega) {}={} \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} \gamma(k)e^{-i\omega k}. $$

This is the Fourier transform of the autocovariance function.

So the time-domain dependence structure and frequency-domain power distribution are mathematically connected.

Part VI — Machine learning formulation

Supervised time-series learning

In machine learning, we define input-output pairs:

$$ (x_i,z_i)_{i=1}^{N}. $$

For forecasting:

$$ x_i = y_{i:i+L-1}, $$

$$ z_i = y_{i+L:i+L+H-1}. $$

The model is:

$$ \hat{z}_i=f_\theta(x_i). $$

A standard mean squared error loss is:

$$ \mathcal{L}(\theta) {}={} \frac{1}{N} \sum_{i=1}^{N} \|\hat{z}_i-z_i\|_2^2. $$

For scalar one-step forecasting:

$$ \mathcal{L}(\theta) {}={} \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_{i+1}-y_{i+1})^2. $$

Time-aware train-test split

Ordinary random splitting can leak future information into training.

For time series, we usually split chronologically:

$$ \text{train}: y_1,\dots,y_{T_{\text{train}}}, $$

$$ \text{test}: y_{T_{\text{train}}+1},\dots,y_T. $$

This respects causality.

A model should not learn from the future when evaluated on the past.
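
In code the split is one slicing operation; the 80/20 proportion below is an arbitrary illustrative choice:

```python
import numpy as np

y = np.arange(100.0)
T_train = int(0.8 * len(y))              # first 80% of the timeline for training
train, test = y[:T_train], y[T_train:]   # no shuffling: every test point lies after training
print(train[-1], test[0])                # 79.0 80.0
```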

Direct versus recursive forecasting

For multi-step forecasting, there are several strategies.

Recursive forecasting

Train one-step model:

$$ \hat{y}_{t+1}=f_\theta(y_{t-L+1:t}). $$

Then feed predictions back:

$$ \hat{y}_{t+2}=f_\theta(y_{t-L+2:t},\hat{y}_{t+1}). $$

Problem: errors accumulate.

Direct forecasting

Train separate models:

$$ \hat{y}_{t+h}=f_{\theta_h}(y_{t-L+1:t}) $$

for each horizon \(h\).

Problem: many models.

Multi-output forecasting

Train one model:

$$ \hat{y}_{t+1:t+H}=f_\theta(y_{t-L+1:t}). $$

This is common in deep learning.
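
The recursive strategy in particular is worth seeing in code. A minimal sketch in which `model` is any one-step predictor; the window-mean "model" here is a hypothetical stand-in, not a real forecaster:

```python
import numpy as np

def recursive_forecast(model, history, L, H):
    """Roll a one-step model forward H steps, feeding predictions back in."""
    window = list(history[-L:])
    preds = []
    for _ in range(H):
        y_next = model(np.array(window))   # one-step-ahead prediction
        preds.append(y_next)
        window = window[1:] + [y_next]     # slide the window, appending the prediction
    return np.array(preds)

naive_model = lambda w: w.mean()           # hypothetical stand-in for a trained model
print(recursive_forecast(naive_model, np.array([1.0, 2.0, 3.0, 4.0, 5.0]), L=3, H=4))
```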

Point forecasting and probabilistic forecasting

A point forecast gives one value:

$$ \hat{y}_{t+h}. $$

A probabilistic forecast gives a distribution:

$$ p(y_{t+h}\mid y_{1:t}). $$

For decision-making, probabilistic forecasts are often more useful.

For example, energy planning needs:

$$ P(y_{t+h}> \text{capacity}). $$

DeepAR explicitly frames forecasting as probabilistic forecasting: estimating future distributions given the past, using an autoregressive recurrent neural network trained across many related time series. (arXiv)

Part VII — Probabilistic forecasting losses

Gaussian likelihood

Suppose the model predicts:

$$ \mu_\theta(x) $$

and:

$$ \sigma_\theta(x)>0. $$

Assume:

$$ y\mid x\sim\mathcal{N}(\mu_\theta(x),\sigma_\theta(x)^2). $$

The negative log-likelihood is:

$$ -\log p(y\mid x) {}={} \frac{1}{2} \log(2\pi\sigma_\theta^2) + \frac{(y-\mu_\theta)^2}{2\sigma_\theta^2}. $$

This teaches the model both center and uncertainty.

If uncertainty is high, \(\sigma_\theta\) can be large.

But predicting large \(\sigma_\theta\) is penalized by the log term.
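
A minimal NumPy sketch of this loss; the helper name is our own:

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2), averaged over samples."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2))

y = np.array([1.0, 2.0])
print(gaussian_nll(y, mu=np.array([1.1, 1.8]), sigma=np.array([0.5, 0.5])))
```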

Quantile loss

A quantile forecast predicts \(q_\tau(x)\), the \(\tau\)-quantile.

The pinball loss is:

$$ \ell_\tau(y,q) {}={} \max \left( \tau(y-q), (\tau-1)(y-q) \right). $$

For example:

  • \(\tau=0.5\) gives median forecasting,
  • \(\tau=0.9\) gives upper quantile forecasting.

Multiple quantiles form prediction intervals.
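
A minimal sketch of the pinball loss; the example forecast values are illustrative:

```python
import numpy as np

def pinball_loss(y, q, tau):
    """Quantile (pinball) loss: asymmetric penalty around the tau-quantile forecast q."""
    diff = y - q
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

y = np.array([10.0, 12.0, 15.0])
q90 = np.array([14.0, 14.0, 14.0])    # an assumed 0.9-quantile forecast
print(pinball_loss(y, q90, tau=0.9))  # under-prediction costs 9x more than over-prediction
```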

CRPS

The continuous ranked probability score compares a full predictive CDF \(F\) with observation \(y\):

$$ \operatorname{CRPS}(F,y) {}={} \int_{-\infty}^{\infty} \left( F(z)-\mathbf{1}\{y\leq z\} \right)^2\,dz. $$

Gneiting and Raftery describe CRPS as a strictly proper scoring rule for probabilistic forecasts and note that it generalizes absolute error to predictive distributions. (Statistical Consulting Service)

Proper scoring rules

A scoring rule is proper if the best expected score is achieved by reporting the true distribution.

This matters because we do not want a model to cheat by giving overconfident or underconfident forecasts.

For probabilistic forecasting, common proper scoring rules include:

  • negative log-likelihood,
  • CRPS,
  • energy score,
  • variogram score.

Part VIII — RNNs, LSTMs, and GRUs

Recurrent neural networks

A recurrent neural network updates a hidden state:

$$ h_t = \phi(W_xx_t+W_hh_{t-1}+b). $$

The output may be:

$$ \hat{y}_{t+1}=W_oh_t+c. $$

Here:

  • \(x_t\) is input at time \(t\),
  • \(h_t\) is memory state,
  • \(W_h\) controls recurrence.

RNNs are natural for time series because they process data sequentially.

Vanishing and exploding gradients

Backpropagation through time multiplies many Jacobians.

A simplified gradient contains products like:

$$ \prod_{t=1}^{T} W_h^\top D_t. $$

If eigenvalues are small, gradients vanish.

If eigenvalues are large, gradients explode.

This makes long-term dependency learning difficult.

LSTM

LSTM was introduced to address long-term dependency problems in recurrent training. The original LSTM paper explicitly discusses insufficient error backflow and introduces memory cells and gates to improve long-duration information storage. (bioinf.jku.at)

An LSTM uses gates:

$$ f_t=\sigma(W_fx_t+U_fh_{t-1}+b_f), $$

$$ i_t=\sigma(W_ix_t+U_ih_{t-1}+b_i), $$

$$ o_t=\sigma(W_ox_t+U_oh_{t-1}+b_o), $$

candidate memory:

$$ \tilde{c}_t=\tanh(W_cx_t+U_ch_{t-1}+b_c), $$

cell update:

$$ c_t=f_t\odot c_{t-1}+i_t\odot \tilde{c}_t, $$

hidden state:

$$ h_t=o_t\odot\tanh(c_t). $$

The forget gate \(f_t\) decides what to keep.

The input gate \(i_t\) decides what to write.

The output gate \(o_t\) decides what to expose.

GRU

A GRU is a simpler gated recurrent unit.

It uses update and reset gates:

$$ z_t=\sigma(W_zx_t+U_zh_{t-1}+b_z), $$

$$ r_t=\sigma(W_rx_t+U_rh_{t-1}+b_r). $$

Candidate state:

$$ \tilde{h}_t= \tanh(W_hx_t+U_h(r_t\odot h_{t-1})+b_h). $$

Update:

$$ h_t=(1-z_t)\odot h_{t-1}+z_t\odot \tilde{h}_t. $$

GRUs were introduced in neural sequence modeling and later compared empirically with LSTMs as gated recurrent architectures. (arXiv)

Part IX — Temporal convolutional networks

Causal convolution

A causal convolution ensures the output at time \(t\) uses only current and past inputs.

For kernel size \(K\):

$$ h_t= \sum_{k=0}^{K-1} w_k x_{t-k}. $$

No future value \(x_{t+1}\) is used.

This is necessary for forecasting.

Dilated convolution

A dilated convolution skips time steps:

$$ h_t= \sum_{k=0}^{K-1} w_k x_{t-dk}. $$

Here \(d\) is dilation.

With increasing dilations:

$$ 1,2,4,8,\dots $$

the receptive field grows quickly.

WaveNet used dilated causal convolutions to obtain very large receptive fields without huge computational cost. (arXiv)
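
A minimal NumPy sketch of a dilated causal convolution, written as an explicit loop rather than any library call:

```python
import numpy as np

def causal_dilated_conv(x, w, d):
    """h_t = sum_k w_k * x_{t - d*k}, with zero padding where t - d*k < 0."""
    T, K = len(x), len(w)
    h = np.zeros(T)
    for t in range(T):
        for k in range(K):
            if t - d * k >= 0:          # only current and past inputs: causal
                h[t] += w[k] * x[t - d * k]
    return h

x = np.arange(8, dtype=float)
print(causal_dilated_conv(x, w=np.array([0.5, 0.5]), d=2))
# h_t = 0.5*x_t + 0.5*x_{t-2}: the output at t never touches x_{t+1}
```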

TCN

Temporal convolutional networks combine:

  • causal convolutions,
  • dilation,
  • residual blocks,
  • stable parallel training.

A large empirical comparison found that simple convolutional sequence models can outperform canonical recurrent models such as LSTMs on several sequence modeling tasks while showing longer effective memory. (arXiv)

A TCN block may be written:

$$ H^{(\ell+1)} {}={} H^{(\ell)} + \operatorname{Conv}_{\text{causal,dilated}} ( \phi(H^{(\ell)}) ). $$

This is similar to residual deep learning, but adapted to time.

Part X — Attention and transformers for time series

Why attention helps

RNNs compress the past into a hidden state.

Attention directly compares time points.

Given:

$$ X\in\mathbb{R}^{T\times d}, $$

we compute:

$$ Q=XW_Q, $$

$$ K=XW_K, $$

$$ V=XW_V. $$

Attention is:

$$ \operatorname{Attention}(Q,K,V) {}={} \operatorname{softmax} \left( \frac{QK^\top}{\sqrt{d_k}} \right)V. $$

Transformers were originally introduced for sequence transduction using attention alone, avoiding recurrence and convolution. (arXiv)

Attention as temporal dependency learning

The attention matrix is:

$$ A= \operatorname{softmax} \left( \frac{QK^\top}{\sqrt{d_k}} \right). $$

Here:

$$ A_{ij} $$

measures how much time point \(i\) attends to time point \(j\).

For forecasting, causal masking may be used:

$$ A_{ij}=0 \quad \text{if} \quad j>i. $$

For encoder-style forecasting, the model may attend over the whole observed context.
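
A minimal NumPy sketch of scaled dot-product attention with a causal mask; the dimensions and random weights are illustrative:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask: position i sees only j <= i."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)             # (T, T) pairwise comparisons
    scores[np.triu_indices(T, k=1)] = -np.inf   # mask future positions j > i
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)        # row-wise softmax
    return A @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                     # T = 6 time steps, d = 4 features
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
out = causal_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)                                # (6, 4)
```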

Quadratic complexity

Standard attention forms a \(T\times T\) matrix.

So memory and compute scale like:

$$ O(T^2). $$

This is a problem for long time series.

If:

$$ T=10{,}000, $$

then:

$$ T^2=100{,}000{,}000. $$

So many time-series transformer papers modify tokenization, attention, decomposition, or variable representation to reduce cost or improve inductive bias.

Temporal Fusion Transformer

Temporal Fusion Transformer combines recurrent layers, attention, gating, static covariate encoders, known future inputs, and variable selection for multi-horizon forecasting. Its paper emphasizes both high-performance multi-horizon forecasting and interpretability over temporal dynamics. (ScienceDirect)

A simplified multi-horizon forecast is:

$$ \hat{y}_{t+1:t+H} {}={} f_\theta ( \text{past observed}, \text{known future covariates}, \text{static features} ). $$

This is important because real forecasting often includes:

  • static covariates,
  • known future calendar features,
  • observed historical covariates,
  • target history.

PatchTST

PatchTST treats time-series segments as patches.

Instead of tokenizing each time point, it tokenizes subseries:

$$ p_i = [y_{s_i},y_{s_i+1},\dots,y_{s_i+P-1}]. $$

Then patches become transformer tokens.

PatchTST argues that patching retains local semantic information, reduces attention cost quadratically for a fixed lookback window, and allows longer histories; it also uses channel independence, sharing embedding and transformer weights across univariate channels. (arXiv)

Mathematically:

$$ Y\in\mathbb{R}^{T\times d} \quad \longrightarrow \quad P\in\mathbb{R}^{N_p\times P\times d}. $$

Then each patch is embedded into:

$$ z_i\in\mathbb{R}^{d_{\text{model}}}. $$
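
The patching step itself is just a strided reshape. A minimal sketch, assuming non-overlapping patches; PatchTST also allows overlapping strides:

```python
import numpy as np

T, d = 96, 3                      # lookback length and number of channels
P, stride = 16, 16                # patch length; non-overlapping patches here
Y = np.random.default_rng(0).normal(size=(T, d))

starts = range(0, T - P + 1, stride)
patches = np.stack([Y[s : s + P] for s in starts])   # shape (N_p, P, d)
print(patches.shape)              # (6, 16, 3): six patch tokens per channel
```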

iTransformer

Traditional transformers often treat each timestamp as a token.

iTransformer inverts this idea.

It treats each variable as a token whose feature vector is the lookback history.

If:

$$ Y\in\mathbb{R}^{T\times d}, $$

then for variable \(j\):

$$ v_j= [y_{1j},y_{2j},\dots,y_{Tj}] \in\mathbb{R}^{T}. $$

The model embeds variate tokens, and attention captures multivariate correlations.

The iTransformer paper argues that standard timestamp-token transformers may mix delayed events and distinct physical variables poorly, and shows that applying transformer components on inverted dimensions can improve forecasting and generalization. (arXiv)

TimesNet

TimesNet transforms 1D time series into 2D tensors based on discovered periods.

If a period is \(p\), a sequence can be reshaped into:

$$ \text{rows}=\text{number of periods}, \qquad \text{columns}=p. $$

This separates:

  • intraperiod variation,
  • interperiod variation.

TimesNet motivates this by multi-periodicity and models temporal variation in 2D space using parameter-efficient 2D kernels. It reports results across forecasting, imputation, classification, and anomaly detection. (arXiv)

TimeMixer

TimeMixer uses multiscale decomposition.

It builds representations at multiple sampling scales:

$$ Y^{(1)},Y^{(2)},\dots,Y^{(S)}. $$

Then it decomposes each scale into trend and seasonal parts:

$$ Y^{(s)}=T^{(s)}+S^{(s)}. $$

TimeMixer proposes past-decomposable mixing and future-multipredictor mixing to combine microscopic seasonal information and macroscopic trend information across scales. (ICLR Proceedings)

Part XI — Deep forecasting architectures

DeepAR

DeepAR is a probabilistic autoregressive RNN.

At each time:

$$ h_t=\operatorname{RNN}_\theta(h_{t-1},y_{t-1},x_t), $$

then:

$$ \theta_t = g_\theta(h_t), $$

where \(\theta_t\) parameterizes a probability distribution:

$$ p(y_t\mid y_{1:t-1},x_{1:t}). $$

For Gaussian output:

$$ y_t\mid h_t\sim\mathcal{N}(\mu_t,\sigma_t^2). $$

Training maximizes likelihood:

$$ \sum_t \log p_\theta(y_t\mid y_{1:t-1},x_{1:t}). $$

DeepAR was designed to train one global probabilistic model across many related time series, allowing information sharing across items. (arXiv)

N-BEATS

N-BEATS uses fully connected residual blocks for univariate forecasting.

Each block produces:

  • a backcast,
  • a forecast.

Let the input window be:

$$ x\in\mathbb{R}^{L}. $$

A block computes:

$$ \theta_b = f_b(x), $$

then:

$$ \hat{x}_b = B_b\theta_b, $$

$$ \hat{y}_b = F_b\theta_b. $$

The residual input to the next block is:

$$ x \leftarrow x-\hat{x}_b. $$

The total forecast is:

$$ \hat{y}=\sum_b \hat{y}_b. $$

N-BEATS introduced neural basis expansion with backward and forward residual links, and it reported strong results on M3, M4, and TOURISM datasets. (arXiv)

N-HiTS

N-HiTS extends this idea using hierarchical interpolation and multi-rate sampling.

It is designed for long-horizon forecasting, where naive deep models can become computationally expensive and volatile.

N-HiTS reports that hierarchical interpolation can approximate long horizons efficiently under smoothness assumptions and shows strong accuracy and runtime advantages in long-horizon experiments. (arXiv)

Part XII — State-space deep sequence models

Continuous-time state-space model

A linear state-space model can be written:

$$ \frac{dh(t)}{dt}=Ah(t)+Bu(t), $$

$$ y(t)=Ch(t)+Du(t). $$

Here:

  • \(u(t)\) is input,
  • \(h(t)\) is hidden state,
  • \(y(t)\) is output.

This bridges control theory, signal processing, differential equations, and sequence modeling.

S4

S4 uses structured state-space models to model long sequences efficiently.

The S4 paper starts from continuous-time state-space equations and develops a structured parameterization allowing efficient computation while preserving long-range dependency modeling. (arXiv)

A discretized form is:

$$ h_t=\bar{A}h_{t-1}+\bar{B}x_t, $$

$$ y_t=Ch_t+Dx_t. $$

The impulse response defines a convolution kernel:

$$ K_k=C\bar{A}^k\bar{B}. $$

Then:

$$ y_t=\sum_{k=0}^{t}K_kx_{t-k}. $$

So SSMs can be viewed as recurrent models or convolutional models.
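
This equivalence can be checked numerically. A minimal sketch, using a small hand-picked stable \(\bar{A}\) and taking \(D=0\) so that \(y_t=Ch_t\):

```python
import numpy as np

# Discretized SSM: h_t = A h_{t-1} + B x_t,  y_t = C h_t  (h_{-1} = 0, D = 0).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([1.0, 0.5])
C = np.array([1.0, -1.0])

rng = np.random.default_rng(0)
T = 16
x = rng.normal(size=T)

# Recurrent view.
h = np.zeros(2)
y_rec = np.empty(T)
for t in range(T):
    h = A @ h + B * x[t]
    y_rec[t] = C @ h

# Convolution view with kernel K_k = C A^k B.
K = np.array([C @ np.linalg.matrix_power(A, k) @ B for k in range(T)])
y_conv = np.array([np.dot(K[: t + 1], x[t::-1]) for t in range(T)])

print(np.allclose(y_rec, y_conv))   # True: the two views coincide
```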

Mamba and selective state spaces

A limitation of static SSMs is that their dynamics may not depend enough on the current input.

Mamba introduces selective state-space models where SSM parameters depend on the input, enabling selective propagation or forgetting along the sequence. The paper emphasizes linear scaling in sequence length and hardware-aware parallel algorithms. (arXiv)

A simplified selective update is:

$$ h_t=A(x_t)h_{t-1}+B(x_t)x_t, $$

$$ y_t=C(x_t)h_t. $$

So the model decides what to remember based on the current input.

Part XIII — Neural differential equations for time series

Neural ODE

A neural ODE models hidden dynamics continuously:

$$ \frac{dh(t)}{dt}=f_\theta(h(t),t). $$

Given initial state \(h(t_0)\), the solution is:

$$ h(t_1)=h(t_0)+\int_{t_0}^{t_1}f_\theta(h(t),t)\,dt. $$

Neural ODEs use black-box ODE solvers and support continuous-depth models, adaptive computation, and latent time-series models. (arXiv)

Latent ODE for irregular time series

Suppose observations occur at irregular times:

$$ t_1,t_2,\dots,t_n. $$

A latent ODE defines:

$$ z(t)=\operatorname{ODESolve}(z(t_0),f_\theta,t_0,t). $$

Then observations are generated by:

$$ y_i\sim p_\theta(y_i\mid z(t_i)). $$

This is useful when data is not sampled at regular intervals.

Neural CDE

A neural controlled differential equation is:

$$ dh_t = f_\theta(h_t)\,dX_t. $$

Here:

  • \(X_t\) is an interpolated input path,
  • \(h_t\) is hidden state.

Neural CDEs were introduced as a continuous-time limit of RNNs and are directly applicable to irregular, partially observed multivariate time series. (arXiv)

A more explicit integral form is:

$$ h_t {}={} h_0+ \int_0^t f_\theta(h_s)\,dX_s. $$

This is advanced mathematics because it uses controlled differential equations, rough path ideas, and continuous-time sequence modeling.

Part XIV — Gaussian processes for time series

Gaussian process

A Gaussian process is a distribution over functions:

$$ f(t)\sim\mathcal{GP}(m(t),k(t,t')). $$

This means any finite collection:

$$ [f(t_1),\dots,f(t_n)] $$

has a multivariate Gaussian distribution.

Rasmussen and Williams’ Gaussian Processes for Machine Learning is a standard MIT Press reference that presents GPs as a principled probabilistic approach to regression and classification with covariance kernels. (the Gaussian Process web site)

Kernel

The kernel controls similarity:

$$ k(t,t')=\operatorname{Cov}(f(t),f(t')). $$

A squared exponential kernel is:

$$ k(t,t') {}={} \sigma_f^2 \exp \left( -\frac{(t-t')^2}{2\ell^2} \right). $$

A periodic kernel is:

$$ k(t,t') {}={} \sigma_f^2 \exp \left( -\frac{2\sin^2(\pi |t-t'|/p)}{\ell^2} \right). $$

Kernels encode assumptions about smoothness, periodicity, and long-range correlation.

GP prediction

Let:

$$ y=f(X)+\varepsilon, \qquad \varepsilon\sim\mathcal{N}(0,\sigma_n^2I). $$

For test inputs \(X_*\), the predictive distribution is Gaussian:

$$ p(f_*\mid X,y,X_*)= \mathcal{N}(\mu_*,\Sigma_*). $$

The mean is:

$$ \mu_* {}={} K_{*X} (K_{XX}+\sigma_n^2I)^{-1}y. $$

The covariance is:

$$ \Sigma_* {}={} K_{**} - K_{*X} (K_{XX}+\sigma_n^2I)^{-1} K_{X*}. $$

This gives both prediction and uncertainty.
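
The predictive equations translate directly into a few linear solves. A minimal NumPy sketch with a squared exponential kernel; all data here are illustrative:

```python
import numpy as np

def rbf_kernel(t1, t2, sigma_f=1.0, ell=1.0):
    """Squared exponential kernel k(t, t') on 1-D inputs."""
    return sigma_f**2 * np.exp(-0.5 * (t1[:, None] - t2[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(0)
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.sin(X) + rng.normal(scale=0.1, size=len(X))
X_star = np.linspace(0, 4, 9)
sigma_n = 0.1

K_xx = rbf_kernel(X, X) + sigma_n**2 * np.eye(len(X))
K_sx = rbf_kernel(X_star, X)
K_ss = rbf_kernel(X_star, X_star)

mu_star = K_sx @ np.linalg.solve(K_xx, y)                  # posterior mean
Sigma_star = K_ss - K_sx @ np.linalg.solve(K_xx, K_sx.T)   # posterior covariance
print(mu_star.round(2))
print(np.sqrt(np.maximum(np.diag(Sigma_star), 0)).round(2))  # pointwise uncertainty
```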

Part XV — Self-supervised and representation learning

Why representation learning matters

Sometimes labels are scarce.

Instead of training directly for forecasting or classification, we learn representations:

$$ z_t = f_\theta(y_{1:T})_t. $$

Then use \(z_t\) for:

  • classification,
  • anomaly detection,
  • forecasting,
  • clustering,
  • imputation.

Contrastive learning

A contrastive objective pulls related views together and pushes unrelated views apart.

Let:

$$ z_i $$

and:

$$ z_i^+ $$

be two views of the same time series.

Let:

$$ z_j^- $$

be a negative example.

An InfoNCE-style loss is:

$$ \mathcal{L} {}={} -\log \frac{ \exp(\operatorname{sim}(z_i,z_i^+)/\tau) }{ \exp(\operatorname{sim}(z_i,z_i^+)/\tau) + \sum_j \exp(\operatorname{sim}(z_i,z_j^-)/\tau) }. $$

TS2Vec learns time-series representations using hierarchical contrastive learning over augmented context views and reports strong results across UCR/UEA classification, forecasting, and anomaly detection settings. (arXiv)

Part XVI — Imputation and missing data

Missing values

A time series may have missing observations.

Let:

$$ m_t= \begin{cases} 1, & y_t \text{ observed},\\ 0, & y_t \text{ missing}. \end{cases} $$

The observed data is:

$$ y_t^{\text{obs}}=m_ty_t. $$

The goal is to estimate missing values:

$$ p(y_{\text{miss}}\mid y_{\text{obs}}). $$

This is a conditional distribution.

Simple imputation methods

Basic methods include:

  • forward fill,
  • mean fill,
  • interpolation,
  • seasonal average.

But these may fail when the dynamics are complex.

Diffusion-based imputation

CSDI uses conditional score-based diffusion for probabilistic time-series imputation, conditioning on observed values and learning a conditional distribution over missing values. It reports improvements on healthcare and environmental datasets over earlier probabilistic imputation methods. (arXiv)

Mathematically, the goal is:

$$ p_\theta(y_{\text{miss}}\mid y_{\text{obs}},m). $$

A diffusion model gradually denoises missing values while respecting observed values.

Part XVII — Anomaly detection

What is an anomaly?

An anomaly is a time point or segment that does not fit the expected pattern.

Examples:

  • sudden spike in traffic,
  • abnormal heart rhythm,
  • machine vibration before failure,
  • cyberattack in network traffic,
  • sensor malfunction.

Reconstruction-based anomaly score

Train an autoencoder:

$$ \hat{y}_{1:T}=g_\theta(f_\theta(y_{1:T})). $$

Compute reconstruction error:

$$ s_t = |y_t-\hat{y}_t|^2. $$

If:

$$ s_t>\tau, $$

then \(t\) is anomalous.

Forecasting-based anomaly score

Train a forecasting model:

$$ \hat{y}_t=f_\theta(y_{1:t-1}). $$

Define the error:

$$ e_t=y_t-\hat{y}_t. $$

An anomaly score is:

$$ s_t=|e_t| $$

or:

$$ s_t=e_t^\top\Sigma_t^{-1}e_t. $$

Large forecast error suggests unusual behavior.

Transformer-based anomaly detection

Anomaly Transformer introduces association discrepancy: anomalies tend to have different attention-association patterns from normal points. The method uses an anomaly-attention mechanism and a minimax strategy to distinguish normal and abnormal temporal associations. (arXiv)

Part XVIII — Diffusion models for time series

Probabilistic generation

A diffusion model defines a noising process:

$$ q(x_t\mid x_{t-1}) $$

and learns a reverse process:

$$ p_\theta(x_{t-1}\mid x_t,\text{condition}). $$

For forecasting, the condition is historical context:

$$ y_{1:T}. $$

The target is future trajectory:

$$ y_{T+1:T+H}. $$

TimeGrad

TimeGrad combines autoregressive forecasting with denoising diffusion.

At each future time step, it samples from the conditional distribution by estimating gradients of the data distribution.

TimeGrad is an autoregressive denoising diffusion model for multivariate probabilistic time-series forecasting. (arXiv)

A simplified objective is:

$$ \mathbb{E} \left[ \|\epsilon-\epsilon_\theta(x_t,t,\text{history})\|^2 \right]. $$

This is similar to image diffusion, but the conditioning is temporal history.

Diffusion forecasting survey view

Recent surveys organize diffusion-based time-series forecasting by how conditional information is supplied and how it is integrated into the denoising network. They emphasize that diffusion models provide probabilistic forecasts by sampling future trajectories through iterative denoising. (arXiv)

Part XIX — Time-series foundation models

Why foundation models for time series?

Traditional forecasting often trains one model per dataset.

Foundation models try to pretrain on many time series from many domains and then generalize to new series.

This is analogous to language models, but the data is numerical temporal data rather than text.

A 2025 survey identifies transformer-based time-series foundation models as a major emerging paradigm for forecasting, anomaly detection, classification, trend analysis, and related tasks, and categorizes them by architecture, prediction type, scale, data type, and training objective. (arXiv)

Tokenization-based foundation models

Chronos converts time-series values into discrete tokens using scaling and quantization, then trains language-model architectures with cross-entropy loss. It reports strong zero-shot performance across many datasets. (arXiv)

Mathematically:

$$ y_t \longrightarrow \tilde{y}_t {}={} \frac{y_t-\mu}{s} $$

then quantize:

$$ z_t=Q(\tilde{y}_t)\in\{1,\dots,V\}. $$

Train:

$$ p_\theta(z_{t+1}\mid z_{\leq t}). $$

Then sample token futures and dequantize.

Decoder-only foundation forecasting

TimesFM is a decoder-only foundation model for time-series forecasting and was designed for zero-shot forecasting across public datasets. Its paper reports that out-of-the-box zero-shot performance can approach supervised models trained specifically on individual datasets. (arXiv)

A decoder-only forecasting model learns:

$$ p_\theta(y_{T+1:T+H}\mid y_{1:T}). $$

Depending on design, outputs may be:

  • point forecasts,
  • quantiles,
  • distribution parameters,
  • sampled trajectories.

Lag-based foundation forecasting

Lag-Llama is a decoder-only transformer foundation model for univariate probabilistic forecasting that uses lagged values as covariates and is pretrained on diverse time-series data. (arXiv)

Lag features may be:

$$ [y_{t-1},y_{t-7},y_{t-24},y_{t-168}] $$

for daily, hourly, or weekly structures.

The model learns:

$$ p_\theta(y_t\mid y_{t-\ell_1},y_{t-\ell_2},\dots,\text{context}). $$

Universal forecasting models

Moirai frames universal forecasting as pretraining one model across many time-series datasets to handle diverse downstream forecasting tasks. It supports probabilistic zero-shot forecasting and exogenous features. (arXiv)

The mathematical goal is no longer:

$$ \text{one dataset} \rightarrow \text{one model}. $$

It becomes:

$$ \text{many datasets} \rightarrow \text{one general model} \rightarrow \text{zero-shot or fine-tuned forecasting}. $$

General-purpose time-series foundation models

MOMENT introduces open time-series foundation models for general-purpose tasks, including forecasting, classification, anomaly detection, and imputation, and discusses challenges such as lack of a large cohesive public repository and heterogeneous time-series characteristics. (arXiv)

So foundation models must handle:

  • different frequencies,
  • different units,
  • different scales,
  • different lengths,
  • missing values,
  • multivariate inputs,
  • diverse downstream tasks.

Continuous-valued foundation models

Some recent models avoid discrete tokenization.

Sundial proposes TimeFlow Loss based on flow matching to predict next-patch distributions directly for continuous-valued time series, enabling probabilistic zero-shot forecasting with generated sample trajectories. (arXiv)

This connects time-series foundation models to:

  • continuous probability flows,
  • generative modeling,
  • stochastic forecasting,
  • distribution learning.

A flow-matching-style objective learns a vector field:

$$ v_\theta(x,t) \approx v^\star(x,t) $$

that transports a simple distribution toward the data distribution.

Part XX — Evaluation metrics

Point forecast metrics

Mean absolute error:

$$ \operatorname{MAE} {}={} \frac{1}{N} \sum_{i=1}^{N} |y_i-\hat{y}_i|. $$

Mean squared error:

$$ \operatorname{MSE} {}={} \frac{1}{N} \sum_{i=1}^{N} (y_i-\hat{y}_i)^2. $$

Root mean squared error:

$$ \operatorname{RMSE} {}={} \sqrt{\operatorname{MSE}}. $$

Mean absolute percentage error:

$$ \operatorname{MAPE} {}={} \frac{100}{N} \sum_i \left| \frac{y_i-\hat{y}_i}{y_i} \right|. $$

MAPE is problematic when \(y_i\) is near zero.

Scale-free metric: MASE

Mean absolute scaled error compares forecast error to a naive baseline:

$$ \operatorname{MASE} {}={} \frac{ \frac{1}{N}\sum_i |y_i-\hat{y}_i| }{ \frac{1}{T-1}\sum_{t=2}^{T}|y_t-y_{t-1}| }. $$

This helps compare across series with different scales.
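
A minimal sketch of MASE, using the one-step naive forecast on the training series as the scaling baseline; the example numbers are illustrative:

```python
import numpy as np

def mase(y_true, y_pred, y_train):
    """MASE: forecast MAE scaled by the in-sample naive (lag-1) forecast MAE."""
    mae_model = np.mean(np.abs(y_true - y_pred))
    mae_naive = np.mean(np.abs(np.diff(y_train)))   # |y_t - y_{t-1}| over training
    return mae_model / mae_naive

y_train = np.array([20.0, 21.0, 19.0, 22.0, 24.0, 23.0])
print(mase(np.array([25.0, 26.0]), np.array([24.0, 27.0]), y_train))
# Values below 1 beat the naive one-step baseline on the training scale.
```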

Probabilistic metrics

For probabilistic forecasts, use:

  • negative log-likelihood,
  • CRPS,
  • pinball loss,
  • coverage,
  • interval width,
  • calibration error.

Prediction interval coverage:

$$ \operatorname{Coverage} {}={} \frac{1}{N} \sum_i \mathbf{1}\{y_i\in [\ell_i,u_i]\}. $$

If a 90% interval is calibrated, coverage should be close to 0.9.

Conformal prediction for time series

Conformal forecasting uses residuals to build prediction intervals with frequentist coverage guarantees under assumptions.

A conformal score may be:

$$ s_i=|y_i-\hat{y}_i|. $$

Let:

$$ q_{1-\alpha} $$

be a quantile of calibration scores.

Then interval:

$$ [\hat{y}_{t}-q_{1-\alpha},\hat{y}_{t}+q_{1-\alpha}] $$

has approximate coverage.

Conformal time-series forecasting extends inductive conformal prediction to forecasting and proposes lightweight uncertainty intervals for multi-horizon predictors. (NeurIPS Proceedings)
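
A minimal sketch of a split-conformal interval built from absolute residuals on a held-out calibration set; note that the exchangeability assumption behind the coverage guarantee is delicate for time series:

```python
import numpy as np

def split_conformal_interval(y_cal, yhat_cal, yhat_test, alpha=0.1):
    """Split-conformal interval from absolute residuals on a calibration set."""
    scores = np.abs(y_cal - yhat_cal)                  # s_i = |y_i - yhat_i|
    n = len(scores)
    # Finite-sample-adjusted quantile level, a standard split-conformal choice.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    return yhat_test - q, yhat_test + q

rng = np.random.default_rng(0)
y_cal = rng.normal(size=200)
yhat_cal = y_cal + rng.normal(scale=0.5, size=200)     # imperfect forecasts
lo, hi = split_conformal_interval(y_cal, yhat_cal, yhat_test=np.array([0.0]))
print(lo, hi)                                          # roughly [-0.82] [0.82]
```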

Part XXI — Advanced mathematical view

Time series as stochastic processes

A time series is one realization of a stochastic process:

$$ \{Y_t:t\in\mathcal{T}\}. $$

If \(\mathcal{T}=\mathbb{Z}\), it is discrete time.

If \(\mathcal{T}=\mathbb{R}\), it is continuous time.

The complete mathematical object is not just observed values. It is the probability law:

$$ P(Y_{1:T}\in A). $$

Learning a time-series model means estimating or approximating this probability law.

Conditional distribution view

The general forecasting problem is:

$$ p(y_{T+1:T+H}\mid y_{1:T},x_{1:T+H}). $$

Here:

  • \(y\) is target series,
  • \(x\) are covariates,
  • past covariates may be observed,
  • future covariates may be known,
  • future targets are unknown.

A deep model parameterizes:

$$ p_\theta(y_{T+1:T+H}\mid \mathcal{F}_T), $$

where \(\mathcal{F}_T\) is the information available at time \(T\).

This is the mathematically correct forecasting object.

Filtration

In stochastic process theory, information up to time \(t\) is represented by a filtration:

$$ \mathcal{F}_t. $$

A forecast is conditional on this information:

$$ \mathbb{E}[Y_{t+h}\mid \mathcal{F}_t]. $$

A probabilistic forecast is:

$$ P(Y_{t+h}\in A\mid \mathcal{F}_t). $$

This notation is advanced but extremely important.

It formalizes “no future leakage.”

Martingale difference noise

A noise sequence \(\varepsilon_t\) is a martingale difference if:

$$ \mathbb{E}[\varepsilon_t\mid \mathcal{F}_{t-1}]=0. $$

This means the noise is unpredictable from the past.

Many forecasting models aim to make residuals behave like martingale differences.

If residuals still have predictable structure, the model has not captured all available information.

Ergodicity

Ergodicity allows time averages to approximate expectations.

For a stationary ergodic process:

$$ \frac{1}{T} \sum_{t=1}^{T} g(Y_t) \to \mathbb{E}[g(Y)] $$

as:

$$ T\to\infty. $$

This is important because in time series we often observe one long trajectory, not many independent samples.

Non-stationarity and distribution shift

Real time series often violate stationarity.

The distribution may change:

$$ p_t(y)\neq p_{t+1}(y). $$

Sources of non-stationarity include:

  • climate change,
  • market regime shifts,
  • sensor drift,
  • policy changes,
  • equipment aging,
  • seasonal changes,
  • user behavior shifts.

Modern models use:

  • normalization,
  • decomposition,
  • adaptive training,
  • online learning,
  • covariate conditioning,
  • foundation pretraining,
  • conformal recalibration.

But non-stationarity remains one of the hardest problems in time-series ML.

Part XXII — The complete mathematical map

The mathematics of time-series machine learning and deep learning includes:

| Area | Mathematics |
| --- | --- |
| Basic time series | sequences, indexing, windows |
| Classical statistics | stationarity, autocorrelation, ARMA, ARIMA |
| Forecasting | conditional expectation, conditional distribution |
| Seasonality | Fourier analysis, periodic functions |
| State-space models | linear systems, Gaussian conditioning |
| Kalman filtering | Bayesian recursion, matrix covariance updates |
| HMMs | Markov chains, latent variables, dynamic programming |
| ML forecasting | supervised learning, empirical risk minimization |
| Probabilistic forecasting | likelihoods, quantiles, CRPS, calibration |
| RNNs | recurrence, dynamical systems, BPTT |
| LSTMs/GRUs | gating, memory, gradient flow |
| TCNs | causal convolution, dilation, receptive fields |
| Transformers | attention, tensor algebra, positional encodings |
| Patch models | local subsequence tokenization |
| Inverted models | variate-token attention |
| Multiscale models | decomposition, trend-season mixing |
| SSMs | differential equations, convolution kernels |
| Neural ODE/CDE | continuous-time dynamics, adjoint methods |
| Gaussian processes | kernels, Bayesian function priors |
| Diffusion models | score matching, denoising, conditional generation |
| Foundation models | pretraining, zero-shot transfer, scaling |
| Anomaly detection | residuals, reconstruction error, association discrepancy |
| Imputation | conditional distributions, missing-data inference |
| Evaluation | proper scoring rules, calibration, backtesting |

Section summary

A time series is ordered data:

$$ y_{1:T}=(y_1,\dots,y_T). $$

A multivariate time series is:

$$ Y\in\mathbb{R}^{T\times d}. $$

A forecasting model learns:

$$ \hat{y}_{T+1:T+H}=f_\theta(y_{1:T}). $$

A probabilistic forecasting model learns:

$$ p_\theta(y_{T+1:T+H}\mid y_{1:T}). $$

Classical models include:

$$ \text{AR},\quad \text{MA},\quad \text{ARMA},\quad \text{ARIMA},\quad \text{ETS}. $$

State-space models separate hidden dynamics and noisy observations:

$$ x_t=Ax_{t-1}+w_t, $$

$$ y_t=Cx_t+v_t. $$

Deep learning models include:

$$ \text{RNN},\quad \text{LSTM},\quad \text{GRU},\quad \text{TCN},\quad \text{Transformer}. $$

Modern SOTA families include:

$$ \text{N-BEATS},\quad \text{N-HiTS},\quad \text{DeepAR},\quad \text{TFT},\quad \text{PatchTST},\quad \text{iTransformer},\quad \text{TimesNet},\quad \text{TimeMixer}. $$

Advanced models include:

$$ \text{S4},\quad \text{Mamba},\quad \text{Neural ODEs},\quad \text{Neural CDEs},\quad \text{Gaussian Processes},\quad \text{Diffusion Models}. $$

Foundation models aim to learn across many datasets:

$$ \text{many time series} \to \text{one pretrained model} \to \text{zero-shot or fine-tuned forecasting}. $$

The deepest view is:

time-series learning is conditional probability modeling over ordered stochastic processes.

Mathematically:

$$ p_\theta(y_{T+1:T+H}\mid \mathcal{F}_T) $$

is the central object.

Everything else — ARIMA, Kalman filters, RNNs, transformers, diffusion models, and foundation models — is a different way of approximating this conditional law.

Source anchors used for this section

  • Hyndman and Athanasopoulos, Forecasting: Principles and Practice, for classical forecasting, decomposition, ETS, ARIMA, and practical evaluation. (OTexts: Online, open-access textbooks)
  • MIT OCW graduate time-series notes for stationarity, lag operators, ARMA, covariance, spectral analysis, and econometric time-series theory. (MIT OpenCourseWare)
  • Stanford and MIT notes for state-space models and Kalman filtering. (Stanford University)
  • Rabiner’s HMM tutorial for Markov chains, hidden states, likelihood evaluation, decoding, and parameter estimation. (Computer Science at UBC)
  • Rasmussen and Williams, Gaussian Processes for Machine Learning, for Gaussian processes, kernels, and probabilistic function modeling. (the Gaussian Process web site)
  • DeepAR, N-BEATS, N-HiTS, TFT, PatchTST, TimesNet, iTransformer, and TimeMixer papers for representative deep forecasting architectures. (arXiv)
  • LSTM, GRU, WaveNet, and TCN papers for recurrent and convolutional sequence modeling. (bioinf.jku.at)
  • S4 and Mamba papers for structured and selective state-space sequence models. (arXiv)
  • Neural ODE and Neural CDE papers for continuous-time and irregular time-series modeling. (arXiv)
  • TS2Vec and Anomaly Transformer for representation learning and anomaly detection. (arXiv)
  • TimeGrad, CSDI, and diffusion time-series surveys for diffusion-based probabilistic forecasting and imputation. (arXiv)
  • Chronos, TimesFM, Lag-Llama, Moirai, MOMENT, Timer, and Sundial for modern time-series foundation models. (arXiv)