Why time series matter
A time series is data ordered by time.
Examples:
| Domain | Time series example |
|---|---|
| Weather | hourly temperature |
| Finance | stock prices |
| Energy | electricity demand |
| Healthcare | heart-rate signal |
| Industry | vibration sensor readings |
| Language/audio | waveform samples |
| Traffic | vehicle count per minute |
| AI systems | GPU utilization over time |
| Video | frames evolving through time |
A time series is different from an ordinary table because order matters.
If we observe:
$$ 10,\; 12,\; 15,\; 14,\; 18 $$this is not the same as:
$$ 18,\; 10,\; 15,\; 12,\; 14. $$The values are the same, but the temporal structure is different.
Time-series analysis studies exactly this structure: dependence, trend, seasonality, cycles, shocks, noise, regime changes, long memory, and uncertainty. MIT’s graduate time-series course covers stationarity, lag operators, ARMA models, covariance structure, spectral analysis, GMM, VARs, structural breaks, and related econometric theory. (MIT OpenCourseWare)
Part I — Beginner foundation
A simple time series
Suppose we record daily temperature:
| Day | Temperature |
|---|---|
| 1 | 20 |
| 2 | 21 |
| 3 | 19 |
| 4 | 22 |
| 5 | 24 |
| 6 | 23 |
We write:
$$ y_1=20,\quad y_2=21,\quad y_3=19,\quad y_4=22,\quad y_5=24,\quad y_6=23. $$The full time series is:
$$ y_{1:6} = (y_1,y_2,y_3,y_4,y_5,y_6). $$In general:
$$ y_{1:T}=(y_1,y_2,\dots,y_T). $$Here:
- \(t\) is the time index,
- \(T\) is the total number of observations,
- \(y_t\) is the value at time \(t\).
If the values are scalar, then:
$$ y_t\in\mathbb{R}. $$This is a univariate time series.
Multivariate time series
Sometimes we observe many variables at each time.
For example, a weather station may record:
| Time | Temperature | Humidity | Wind speed |
|---|---|---|---|
| 1 | 20 | 60 | 5 |
| 2 | 21 | 58 | 6 |
| 3 | 19 | 70 | 4 |
At time \(t\), we have a vector:
$$ y_t = \begin{bmatrix} \text{temperature}_t \\ \text{humidity}_t \\ \text{wind}_t \end{bmatrix} \in\mathbb{R}^{3}. $$The whole series is:
$$ y_{1:T} = (y_1,y_2,\dots,y_T). $$As a matrix:
$$ Y = \begin{bmatrix} y_1^\top \\ y_2^\top \\ \vdots \\ y_T^\top \end{bmatrix} \in\mathbb{R}^{T\times d}. $$Here:
- \(T\) is sequence length,
- \(d\) is number of variables.
For a batch of \(B\) time series:
$$ Y\in\mathbb{R}^{B\times T\times d}. $$So time-series deep learning is tensor learning.
Forecasting
The most common time-series task is forecasting.
Given past values:
$$ y_1,y_2,\dots,y_T, $$predict future values:
$$ y_{T+1},y_{T+2},\dots,y_{T+H}. $$Here:
- \(T\) is the history length,
- \(H\) is the forecast horizon.
We write:
$$ \hat{y}_{T+1:T+H} {}={} f_\theta(y_{1:T}). $$For one-step forecasting:
$$ \hat{y}_{T+1}=f_\theta(y_{1:T}). $$For multi-step forecasting:
$$ \hat{y}_{T+1:T+H}=f_\theta(y_{1:T}). $$Hyndman and Athanasopoulos’ Forecasting: Principles and Practice is a standard open textbook that introduces forecasting methods and practical modeling choices, including time-series features, exponential smoothing, ARIMA, dynamic regression, hierarchical forecasting, and practical evaluation. (OTexts: Online, open-access textbooks)
Input window and output horizon
In machine learning, we often convert a time series into supervised examples.
Suppose:
$$ y_1,y_2,\dots,y_{100}. $$Choose input length:
$$ L=5 $$and forecast horizon:
$$ H=2. $$Then one training example is:
$$ x_1= \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix}, \qquad \text{target}= \begin{bmatrix} y_6 \\ y_7 \end{bmatrix}. $$Another example is:
$$ x_2= \begin{bmatrix} y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{bmatrix}, \qquad \text{target}= \begin{bmatrix} y_7 \\ y_8 \end{bmatrix}. $$This is called a sliding window construction.
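The sliding-window construction can be sketched in a few lines of NumPy (a minimal illustration; the helper name `sliding_windows` is ours, not from any library):

```python
import numpy as np

def sliding_windows(y, L, H):
    """Build supervised pairs: x_t = y[t:t+L], z_t = y[t+L:t+L+H]."""
    N = len(y) - L - H + 1                     # number of complete examples
    X = np.stack([y[i:i + L] for i in range(N)])
    Z = np.stack([y[i + L:i + L + H] for i in range(N)])
    return X, Z

y = np.arange(1, 101)                          # y_1 .. y_100
X, Z = sliding_windows(y, L=5, H=2)
# X[0] holds y_1..y_5 with target y_6, y_7; X[1] holds y_2..y_6, and so on
```

With `L=5` and `H=2` on 100 points this yields 94 examples, matching the two examples written out above.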
In general:
$$ x_t=(y_t,y_{t+1},\dots,y_{t+L-1}), $$and:
$$ z_t=(y_{t+L},\dots,y_{t+L+H-1}). $$The model learns:
$$ f_\theta:x_t\mapsto z_t. $$
Time-series tasks
Time-series machine learning is not only forecasting.
It includes:
| Task | Goal |
|---|---|
| Forecasting | predict future values |
| Classification | assign a label to a sequence |
| Regression | predict a continuous target from a sequence |
| Anomaly detection | find unusual time points or segments |
| Imputation | fill missing values |
| Segmentation | divide sequence into regimes |
| Clustering | group similar time series |
| Representation learning | learn useful embeddings |
| Simulation/generation | generate realistic future trajectories |
| Control | choose actions over time |
The UCR and UEA archives became major public benchmarks for time-series classification; the UEA 2018 multivariate archive was introduced to improve rigorous evaluation for multivariate time-series classification, where each example may contain multiple channels. (arXiv)
Part II — Core statistical time-series mathematics
Deterministic signal plus noise
A useful beginner model is:
$$ y_t = s_t+\varepsilon_t. $$Here:
- \(s_t\) is the true signal,
- \(\varepsilon_t\) is noise.
For example:
$$ y_t = \text{trend}_t + \text{seasonality}_t + \text{noise}_t. $$So:
$$ y_t = T_t + S_t + R_t. $$Here:
- \(T_t\) is trend,
- \(S_t\) is seasonal pattern,
- \(R_t\) is residual noise.
This decomposition is central in classical forecasting and remains useful in modern deep learning. Hyndman and Athanasopoulos present decomposition as a core tool for understanding trend and seasonal structure before modeling. (OTexts: Online, open-access textbooks)
Trend
A trend is a long-term direction.
A linear trend can be written:
$$ T_t = a+bt. $$Here:
- \(a\) is intercept,
- \(b\) is slope.
If \(b>0\), the series increases.
If \(b<0\), the series decreases.
For example:
$$ y_t = 10+0.5t+\varepsilon_t. $$This means the average value grows by 0.5 per time step.
Seasonality
Seasonality is a repeated pattern.
For daily data with weekly seasonality, the period is:
$$ m=7. $$For monthly data with yearly seasonality:
$$ m=12. $$A simple seasonal model is:
$$ y_t = T_t + S_{t \bmod m}+\varepsilon_t. $$Fourier terms can model smooth seasonality:
$$ S_t = \sum_{k=1}^{K} \left[ a_k\cos\left(\frac{2\pi kt}{m}\right) + b_k\sin\left(\frac{2\pi kt}{m}\right) \right]. $$This connects time series to frequency-domain mathematics.
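The periodicity of the Fourier representation can be checked numerically (a sketch; the coefficients below are arbitrary choices, not fitted values):

```python
import numpy as np

def fourier_seasonal(t, m, a, b):
    """S_t = sum_k [ a_k cos(2*pi*k*t/m) + b_k sin(2*pi*k*t/m) ]."""
    k = np.arange(1, len(a) + 1)
    ang = 2 * np.pi * np.outer(t, k) / m       # shape (len(t), K)
    return np.cos(ang) @ a + np.sin(ang) @ b

t = np.arange(21)
S = fourier_seasonal(t, m=7, a=np.array([1.0, 0.5]), b=np.array([0.3, 0.0]))
# Every harmonic has period m/k, so the sum repeats every m = 7 steps
```

Because every term has frequency \(k/m\) with integer \(k\), the resulting seasonal pattern satisfies \(S_t = S_{t+m}\).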
Lag operator
The lag operator \(L\) shifts a series backward:
$$ Ly_t=y_{t-1}. $$Then:
$$ L^2y_t=y_{t-2}. $$MIT’s graduate time-series notes begin with stationarity, lag operators, ARMA models, and covariance structure, because lag notation is the compact language of classical time-series theory. (MIT OpenCourseWare)
For example:
$$ y_t = 0.8y_{t-1}+\varepsilon_t $$can be written:
$$ y_t = 0.8Ly_t+\varepsilon_t. $$So:
$$ (1-0.8L)y_t=\varepsilon_t. $$
Stationarity
A time series is stationary if its statistical properties do not change over time.
Weak stationarity means:
- constant mean: \(\mathbb{E}[y_t]=\mu\) for all \(t\),
- constant variance: \(\operatorname{Var}(y_t)=\sigma^2\) for all \(t\),
- autocovariance depends only on lag, not absolute time: \(\operatorname{Cov}(y_t,y_{t-k})=\gamma(k)\).
Stationarity matters because many classical time-series models assume stable dependence structure. MIT’s time-series notes treat stationarity as foundational for ARMA models and covariance analysis. (MIT OpenCourseWare)
Autocovariance
The autocovariance at lag \(k\) is:
$$ \gamma(k) {}={} \operatorname{Cov}(y_t,y_{t-k}) {}={} \mathbb{E}[(y_t-\mu)(y_{t-k}-\mu)]. $$For \(k=0\):
$$ \gamma(0)=\operatorname{Var}(y_t). $$Autocovariance measures how values separated by \(k\) time steps move together.
Autocorrelation
Autocorrelation normalizes autocovariance:
$$ \rho(k)=\frac{\gamma(k)}{\gamma(0)}. $$So:
$$ -1\leq \rho(k)\leq 1. $$If:
$$ \rho(1)>0, $$then nearby values tend to move together.
If:
$$ \rho(1)<0, $$then high values tend to be followed by low values.
If:
$$ \rho(k)\approx 0, $$then values \(k\) steps apart are weakly linearly related.
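These three cases can be checked with a sample estimate of \(\rho(k)\) (a minimal sketch; the helper `acf` is ours, and statistics libraries provide more careful estimators):

```python
import numpy as np

def acf(y, k):
    """Sample autocorrelation rho(k) = gamma(k) / gamma(0)."""
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    gamma0 = np.mean((y - mu) ** 2)
    gammak = np.mean((y[k:] - mu) * (y[:-k] - mu)) if k > 0 else gamma0
    return gammak / gamma0

# An alternating series: high values are always followed by low values,
# so rho(1) is strongly negative and rho(2) strongly positive.
y = np.array([1.0, -1.0] * 50)
```

Here `acf(y, 1)` is \(-1\) and `acf(y, 2)` is \(+1\), the extreme versions of the negative and positive cases above.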
White noise
A white-noise process satisfies:
$$ \mathbb{E}[\varepsilon_t]=0, $$$$ \operatorname{Var}(\varepsilon_t)=\sigma^2, $$$$ \operatorname{Cov}(\varepsilon_t,\varepsilon_s)=0 \quad \text{for } t\neq s. $$If:
$$ \varepsilon_t\sim\mathcal{N}(0,\sigma^2) $$independently, then it is Gaussian white noise.
White noise is the basic random innovation in ARMA, state-space, Kalman filtering, SDEs, and probabilistic forecasting.
Random walk
A random walk is:
$$ y_t=y_{t-1}+\varepsilon_t. $$Equivalently:
$$ \Delta y_t = y_t-y_{t-1}=\varepsilon_t. $$The variance grows over time.
If:
$$ y_0=0, $$then:
$$ y_t=\sum_{i=1}^{t}\varepsilon_i. $$If the innovations have variance \(\sigma^2\), then:
$$ \operatorname{Var}(y_t)=t\sigma^2. $$So the process is not stationary.
Random walks are important in finance, stochastic processes, diffusion limits, and non-stationary forecasting.
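The variance formula \(\operatorname{Var}(y_t)=t\sigma^2\) can be verified by Monte Carlo (a sketch; the seed and path count are arbitrary choices):

```python
import numpy as np

# Simulate many random-walk paths y_t = sum of innovations, sigma = 1
rng = np.random.default_rng(0)
T, n_paths = 50, 20000
eps = rng.normal(0.0, 1.0, size=(n_paths, T))
paths = np.cumsum(eps, axis=1)                 # y_t for each path
var_hat = paths.var(axis=0)                    # empirical Var(y_t)
# var_hat grows roughly linearly in t: near 10 at t=10, near 50 at t=50
```

The empirical variance grows linearly in \(t\), confirming that the process is not stationary.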
Part III — Classical forecasting models
Autoregressive model AR\(p\)
An autoregressive model predicts the present from past values.
An AR(1) model is:
$$ y_t = c+\phi y_{t-1}+\varepsilon_t. $$An AR\(p\) model is:
$$ y_t {}={} c+ \phi_1y_{t-1} + \phi_2y_{t-2} + \cdots + \phi_py_{t-p} + \varepsilon_t. $$Using lag notation:
$$ \phi(L)y_t=c+\varepsilon_t, $$where:
$$ \phi(L)=1-\phi_1L-\phi_2L^2-\cdots-\phi_pL^p. $$AR models are linear memory models.
They say:
the future depends linearly on the past.
Moving-average model MA\(q\)
A moving-average model uses past noise terms.
An MA(1) model is:
$$ y_t=\mu+\varepsilon_t+\theta\varepsilon_{t-1}. $$An MA\(q\) model is:
$$ y_t {}={} \mu+ \varepsilon_t+ \theta_1\varepsilon_{t-1} + \cdots + \theta_q\varepsilon_{t-q}. $$This means shocks can influence future observations for several time steps.
ARMA model
An ARMA\(p,q\) model combines AR and MA terms:
$$ y_t {}={} c+ \sum_{i=1}^{p}\phi_i y_{t-i} + \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}. $$This is useful for stationary time series.
MIT’s graduate time-series material covers ARMA and covariance structure as central tools for stationary time-series analysis. (MIT OpenCourseWare)
ARIMA model
Many real time series are non-stationary.
ARIMA handles non-stationarity using differencing.
The first difference is:
$$ \nabla y_t = y_t-y_{t-1}. $$If the differenced series is stationary, we can apply ARMA to:
$$ \nabla^d y_t. $$An ARIMA\(p,d,q\) model means:
- \(p\): autoregressive order,
- \(d\): differencing order,
- \(q\): moving-average order.
Hyndman and Athanasopoulos present ARIMA as a core classical forecasting model, alongside exponential smoothing and dynamic regression. (OTexts: Online, open-access textbooks)
Seasonal ARIMA
Seasonal ARIMA adds seasonal lags.
A common notation is:
$$ \operatorname{ARIMA}(p,d,q)(P,D,Q)_m. $$Here:
- \(m\) is seasonal period,
- \(P,D,Q\) are seasonal AR, differencing, and MA orders.
For monthly data with yearly seasonality:
$$ m=12. $$For hourly data with daily seasonality:
$$ m=24. $$Seasonal models are important because many time series contain repeating patterns.
Exponential smoothing
Simple exponential smoothing updates a level estimate:
$$ \ell_t=\alpha y_t+(1-\alpha)\ell_{t-1}. $$Here:
$$ 0<\alpha<1. $$The forecast is:
$$ \hat{y}_{t+h|t}=\ell_t. $$Holt’s method adds trend:
$$ \ell_t=\alpha y_t+(1-\alpha)(\ell_{t-1}+b_{t-1}), $$$$ b_t=\beta(\ell_t-\ell_{t-1})+(1-\beta)b_{t-1}. $$Holt-Winters methods add seasonality.
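Simple exponential smoothing is a one-line recursion (a minimal sketch; libraries such as statsmodels provide full ETS implementations with fitted parameters):

```python
def ses_forecast(y, alpha, l0):
    """Run l_t = alpha*y_t + (1-alpha)*l_{t-1}; the final level is the
    forecast for every future horizon h."""
    level = l0
    for yt in y:
        level = alpha * yt + (1 - alpha) * level
    return level

y = [20, 21, 19, 22, 24, 23]                   # the daily temperatures from Part I
print(ses_forecast(y, alpha=0.5, l0=y[0]))     # → 22.71875
```

Larger \(\alpha\) tracks recent observations closely; smaller \(\alpha\) smooths more heavily.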
ETS models formalize exponential smoothing in terms of error, trend, and seasonal components. Hyndman’s forecasting material gives the standard treatment of ETS, ARIMA, and their state-space relationships. (Rob J Hyndman)
Part IV — State-space models and Kalman filtering
Why state-space models matter
Sometimes the observed time series is only a noisy measurement of a hidden state.
For example:
- observed GPS position is noisy,
- true physical position is hidden,
- sensor readings contain measurement error,
- economic indicators reflect hidden market state.
A state-space model separates:
- hidden state dynamics,
- observation process.
Linear Gaussian state-space model
A standard model is:
$$ x_t = A x_{t-1}+w_t, $$$$ y_t = C x_t+v_t. $$Here:
- \(x_t\) is hidden state,
- \(y_t\) is observation,
- \(A\) is state transition matrix,
- \(C\) is observation matrix,
- \(w_t\sim\mathcal{N}(0,Q)\) is process noise,
- \(v_t\sim\mathcal{N}(0,R)\) is observation noise.
Stanford and MIT lecture notes present state-space models and Kalman filtering as efficient algorithms for state estimation in linear Gaussian systems. (Stanford University)
Kalman prediction step
Suppose we have an estimate:
$$ \hat{x}_{t-1|t-1} $$with covariance:
$$ P_{t-1|t-1}. $$Prediction:
$$ \hat{x}_{t|t-1}=A\hat{x}_{t-1|t-1}, $$$$ P_{t|t-1}=AP_{t-1|t-1}A^\top+Q. $$This predicts the next hidden state before seeing \(y_t\).
Kalman update step
The predicted observation is:
$$ \hat{y}_{t|t-1}=C\hat{x}_{t|t-1}. $$The innovation is:
$$ r_t=y_t-\hat{y}_{t|t-1}. $$The innovation covariance is:
$$ S_t=CP_{t|t-1}C^\top+R. $$The Kalman gain is:
$$ K_t=P_{t|t-1}C^\top S_t^{-1}. $$Update:
$$ \hat{x}_{t|t} {}={} \hat{x}_{t|t-1}+K_t r_t, $$$$ P_{t|t} {}={} (I-K_tC)P_{t|t-1}. $$The Kalman filter is important because it is a mathematically exact recursive estimator for linear Gaussian state-space models.
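The prediction and update steps combine into one short function (a sketch of the textbook recursion; the matrices below form a toy scalar example, not a tuned filter):

```python
import numpy as np

def kalman_step(x_hat, P, y, A, C, Q, R):
    """One predict/update cycle of the Kalman filter."""
    # Predict
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q
    # Update
    r = y - C @ x_pred                          # innovation
    S = C @ P_pred @ C.T + R                    # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_pred + K @ r
    P_new = (np.eye(len(x_hat)) - K @ C) @ P_pred
    return x_new, P_new

# Scalar example: a slowly drifting state observed with noise
A = C = np.eye(1)
Q = np.array([[0.01]]); R = np.array([[1.0]])
x, P = np.zeros(1), np.eye(1)
for y_obs in [1.0, 1.2, 0.9, 1.1]:
    x, P = kalman_step(x, P, np.array([y_obs]), A, C, Q, R)
# x moves toward the observations while P (uncertainty) shrinks
```

After four observations near 1, the estimate sits below 1 (it started at 0 and the filter trusts its prior), and the posterior variance has dropped well below the initial value of 1.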
Hidden Markov models
A hidden Markov model, or HMM, uses a discrete hidden state:
$$ z_t\in\{1,\dots,K\}. $$The hidden state evolves as a Markov chain:
$$ p(z_t\mid z_{t-1}). $$The observation is generated from:
$$ p(y_t\mid z_t). $$So the joint distribution is:
$$ p(z_{1:T},y_{1:T}) {}={} p(z_1)p(y_1\mid z_1) \prod_{t=2}^{T} p(z_t\mid z_{t-1})p(y_t\mid z_t). $$Rabiner’s classic tutorial formalized the three central HMM problems: likelihood evaluation, decoding the most likely hidden state sequence, and parameter estimation. (Computer Science at UBC)
Part V — Spectral and frequency-domain mathematics
Time domain versus frequency domain
A time series can be studied in the time domain:
$$ y_t $$or in the frequency domain.
Frequency analysis asks:
which oscillations are present in the signal?
For example, electricity demand may have:
- daily frequency,
- weekly frequency,
- annual frequency.
MIT’s time-series course includes spectrum and spectrum estimation as core topics after stationarity and ARMA modeling. (MIT OpenCourseWare)
Discrete Fourier transform
For a finite sequence:
$$ y_0,y_1,\dots,y_{T-1}, $$the discrete Fourier transform is:
$$ Y_k {}={} \sum_{t=0}^{T-1} y_t e^{-2\pi i kt/T}. $$The inverse transform is:
$$ y_t {}={} \frac{1}{T} \sum_{k=0}^{T-1} Y_k e^{2\pi i kt/T}. $$The magnitude:
$$ |Y_k| $$shows the strength of frequency \(k\).
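A quick numerical check (a sketch; `numpy.fft.fft` implements the DFT defined above): a pure sinusoid with 5 cycles over the window puts its power at bin \(k=5\).

```python
import numpy as np

T = 128
t = np.arange(T)
y = np.sin(2 * np.pi * 5 * t / T)              # exactly 5 cycles in the window
Y = np.fft.fft(y)
mag = np.abs(Y)
k_star = int(np.argmax(mag[:T // 2]))          # dominant bin in the first half
print(k_star)                                  # → 5
```

For a real signal the spectrum is symmetric, so only the first \(T/2\) bins carry independent information; the peak magnitude here is \(T/2\), the standard DFT scaling for a unit-amplitude sinusoid.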
Spectral density
For a stationary process with autocovariance \(\gamma(k)\), the spectral density is:
$$ f(\omega) {}={} \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} \gamma(k)e^{-i\omega k}. $$This is the Fourier transform of the autocovariance function.
So the time-domain dependence structure and frequency-domain power distribution are mathematically connected.
Part VI — Machine learning formulation
Supervised time-series learning
In machine learning, we define input-output pairs:
$$ (x_i,z_i)_{i=1}^{N}. $$For forecasting:
$$ x_i = y_{i:i+L-1}, $$$$ z_i = y_{i+L:i+L+H-1}. $$The model is:
$$ \hat{z}_i=f_\theta(x_i). $$A standard mean squared error loss is:
$$ \mathcal{L}(\theta) {}={} \frac{1}{N} \sum_{i=1}^{N} \|\hat{z}_i-z_i\|_2^2. $$For scalar one-step forecasting:
$$ \mathcal{L}(\theta) {}={} \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_{i+1}-y_{i+1})^2. $$
Time-aware train-test split
Ordinary random splitting can leak future information into training.
For time series, we usually split chronologically:
$$ \text{train}: y_1,\dots,y_{T_{\text{train}}}, $$$$ \text{test}: y_{T_{\text{train}}+1},\dots,y_T. $$This respects causality.
A model should not learn from the future when evaluated on the past.
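A chronological split is a one-liner (a sketch; the 80/20 fraction is an arbitrary choice, and rolling-origin evaluation is the more thorough alternative):

```python
import numpy as np

def time_split(y, train_frac=0.8):
    """Chronological split: everything before the cut is train, after is test."""
    T_train = int(len(y) * train_frac)
    return y[:T_train], y[T_train:]

y = np.arange(100)
train, test = time_split(y)
# Every training timestamp precedes every test timestamp — no leakage
```

Contrast this with a random split, which would scatter future points into the training set.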
Direct versus recursive forecasting
For multi-step forecasting, there are several strategies.
Recursive forecasting
Train one-step model:
$$ \hat{y}_{t+1}=f_\theta(y_{t-L+1:t}). $$Then feed predictions back:
$$ \hat{y}_{t+2}=f_\theta(y_{t-L+2:t},\hat{y}_{t+1}). $$Problem: errors accumulate.
Direct forecasting
Train separate models:
$$ \hat{y}_{t+h}=f_{\theta_h}(y_{t-L+1:t}) $$for each horizon \(h\).
Problem: many models.
Multi-output forecasting
Train one model:
$$ \hat{y}_{t+1:t+H}=f_\theta(y_{t-L+1:t}). $$This is common in deep learning.
Point forecasting and probabilistic forecasting
A point forecast gives one value:
$$ \hat{y}_{t+h}. $$A probabilistic forecast gives a distribution:
$$ p(y_{t+h}\mid y_{1:t}). $$For decision-making, probabilistic forecasts are often more useful.
For example, energy planning needs:
$$ P(y_{t+h}> \text{capacity}). $$DeepAR explicitly frames forecasting as probabilistic forecasting: estimating future distributions given the past, using an autoregressive recurrent neural network trained across many related time series. (arXiv)
Part VII — Probabilistic forecasting losses
Gaussian likelihood
Suppose the model predicts:
$$ \mu_\theta(x) $$and:
$$ \sigma_\theta(x)>0. $$Assume:
$$ y\mid x\sim\mathcal{N}(\mu_\theta(x),\sigma_\theta(x)^2). $$The negative log-likelihood is:
$$ -\log p(y\mid x) {}={} \frac{1}{2} \log(2\pi\sigma_\theta^2) + \frac{(y-\mu_\theta)^2}{2\sigma_\theta^2}. $$This teaches the model both center and uncertainty.
If uncertainty is high, \(\sigma_\theta\) can be large.
But predicting large \(\sigma_\theta\) is penalized by the log term.
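This trade-off is visible directly in the loss (a minimal sketch of the formula above; deep learning frameworks provide equivalent built-in losses):

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)

# With the mean correct, claiming huge uncertainty is penalized by the log term
tight = gaussian_nll(0.0, 0.0, 0.1)
wide = gaussian_nll(0.0, 0.0, 10.0)
# With the mean wrong, claiming high confidence is penalized even more heavily
overconfident_miss = gaussian_nll(1.0, 0.0, 0.1)
```

So the minimum is achieved by reporting an honest \(\sigma_\theta\): large enough to cover the errors, no larger.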
Quantile loss
A quantile forecast predicts \(q_\tau(x)\), the \(\tau\)-quantile.
The pinball loss is:
$$ \ell_\tau(y,q) {}={} \max \left( \tau(y-q), (\tau-1)(y-q) \right). $$For example:
- \(\tau=0.5\) gives median forecasting,
- \(\tau=0.9\) gives upper quantile forecasting.
Multiple quantiles form prediction intervals.
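The asymmetry of the pinball loss is easy to see numerically (a minimal sketch of the formula above):

```python
def pinball(y, q, tau):
    """Quantile (pinball) loss: max(tau*(y-q), (tau-1)*(y-q))."""
    d = y - q
    return max(tau * d, (tau - 1) * d)

# For tau = 0.9, under-prediction costs 9x more than over-prediction,
# which pushes the optimal forecast up to the 0.9-quantile.
under = pinball(10.0, 9.0, tau=0.9)   # y one unit above the forecast
over = pinball(9.0, 10.0, tau=0.9)    # y one unit below the forecast
```

This 9:1 penalty ratio is exactly what makes the minimizer the \(\tau=0.9\) quantile rather than the median.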
CRPS
The continuous ranked probability score compares a full predictive CDF \(F\) with observation \(y\):
$$ \operatorname{CRPS}(F,y) {}={} \int_{-\infty}^{\infty} \left( F(z)-\mathbf{1}\{y\leq z\} \right)^2 dz. $$Gneiting and Raftery describe CRPS as a strictly proper scoring rule for probabilistic forecasts and note that it generalizes absolute error to predictive distributions. (Statistical Consulting Service)
Proper scoring rules
A scoring rule is proper if the best expected score is achieved by reporting the true distribution.
This matters because we do not want a model to cheat by giving overconfident or underconfident forecasts.
For probabilistic forecasting, common proper scoring rules include:
- negative log-likelihood,
- CRPS,
- energy score,
- variogram score.
Part VIII — RNNs, LSTMs, and GRUs
Recurrent neural networks
A recurrent neural network updates a hidden state:
$$ h_t = \phi(W_xx_t+W_hh_{t-1}+b). $$The output may be:
$$ \hat{y}_{t+1}=W_oh_t+c. $$Here:
- \(x_t\) is input at time \(t\),
- \(h_t\) is memory state,
- \(W_h\) controls recurrence.
RNNs are natural for time series because they process data sequentially.
Vanishing and exploding gradients
Backpropagation through time multiplies many Jacobians.
A simplified gradient contains products like:
$$ \prod_{t=1}^{T} W_h^\top D_t. $$If eigenvalues are small, gradients vanish.
If eigenvalues are large, gradients explode.
This makes long-term dependency learning difficult.
LSTM
LSTM was introduced to address long-term dependency problems in recurrent training. The original LSTM paper explicitly discusses insufficient error backflow and introduces memory cells and gates to improve long-duration information storage. (bioinf.jku.at)
An LSTM uses gates:
$$ f_t=\sigma(W_fx_t+U_fh_{t-1}+b_f), $$$$ i_t=\sigma(W_ix_t+U_ih_{t-1}+b_i), $$$$ o_t=\sigma(W_ox_t+U_oh_{t-1}+b_o), $$candidate memory:
$$ \tilde{c}_t=\tanh(W_cx_t+U_ch_{t-1}+b_c), $$cell update:
$$ c_t=f_t\odot c_{t-1}+i_t\odot \tilde{c}_t, $$hidden state:
$$ h_t=o_t\odot\tanh(c_t). $$The forget gate \(f_t\) decides what to keep.
The input gate \(i_t\) decides what to write.
The output gate \(o_t\) decides what to expose.
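One step of the gate equations can be written directly in NumPy (a minimal sketch with random weights; the stacked-weight layout is our own convention, and real implementations add batching and learned parameters):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; W, U, b stack the f, i, o, and candidate blocks."""
    d = h.shape[0]
    z = W @ x + U @ h + b                       # shape (4d,)
    f = sigmoid(z[0:d])                         # forget gate
    i = sigmoid(z[d:2*d])                       # input gate
    o = sigmoid(z[2*d:3*d])                     # output gate
    c_tilde = np.tanh(z[3*d:4*d])               # candidate memory
    c_new = f * c + i * c_tilde                 # cell update
    h_new = o * np.tanh(c_new)                  # exposed hidden state
    return h_new, c_new

rng = np.random.default_rng(0)
dx, dh = 3, 4
W = rng.normal(size=(4*dh, dx)); U = rng.normal(size=(4*dh, dh)); b = np.zeros(4*dh)
h, c = np.zeros(dh), np.zeros(dh)
h, c = lstm_step(rng.normal(size=dx), h, c, W, U, b)
# |h| < 1 always: h is a (0,1)-gated tanh of the cell state
```

Note that the cell state \(c_t\) is updated additively, which is exactly what protects gradients from the multiplicative shrinkage described above.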
GRU
A GRU is a simpler gated recurrent unit.
It uses update and reset gates:
$$ z_t=\sigma(W_zx_t+U_zh_{t-1}+b_z), $$$$ r_t=\sigma(W_rx_t+U_rh_{t-1}+b_r). $$Candidate state:
$$ \tilde{h}_t= \tanh(W_hx_t+U_h(r_t\odot h_{t-1})+b_h). $$Update:
$$ h_t=(1-z_t)\odot h_{t-1}+z_t\odot \tilde{h}_t. $$GRUs were introduced in neural sequence modeling and later compared empirically with LSTMs as gated recurrent architectures. (arXiv)
Part IX — Temporal convolutional networks
Causal convolution
A causal convolution ensures the output at time \(t\) uses only current and past inputs.
For kernel size \(K\):
$$ h_t= \sum_{k=0}^{K-1} w_k x_{t-k}. $$No future value \(x_{t+1}\) is used.
This is necessary for forecasting.
Dilated convolution
A dilated convolution skips time steps:
$$ h_t= \sum_{k=0}^{K-1} w_k x_{t-dk}. $$Here \(d\) is dilation.
With increasing dilations:
$$ 1,2,4,8,\dots $$the receptive field grows quickly.
WaveNet used dilated causal convolutions to obtain very large receptive fields without huge computational cost. (arXiv)
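Both causality and the receptive-field growth can be checked in a few lines (a sketch; the closed form \(1+(K-1)\sum_\ell d_\ell\) assumes one causal convolution of kernel size \(K\) per layer with dilation \(d_\ell\)):

```python
import numpy as np

def causal_dilated_conv(x, w, d):
    """h_t = sum_k w_k * x_{t - d*k}; indices before the start count as zero."""
    h = np.zeros_like(x, dtype=float)
    for k, wk in enumerate(w):
        s = d * k
        if s == 0:
            h += wk * x
        else:
            h[s:] += wk * x[:-s]                # shift by d*k, never forward
    return h

def receptive_field(K, dilations):
    """Past steps visible to one output after stacking dilated causal layers."""
    return 1 + (K - 1) * sum(dilations)

x = np.zeros(10); x[5] = 1.0
h = causal_dilated_conv(x, w=[1.0, 1.0], d=2)
# The impulse at t=5 affects only t=5 and t=7 — never earlier outputs
print(receptive_field(2, [1, 2, 4, 8]))         # → 16
```

With kernel size 2 and dilations 1, 2, 4, 8, four layers already see 16 past steps; doubling dilations makes the receptive field grow exponentially with depth.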
TCN
Temporal convolutional networks combine:
- causal convolutions,
- dilation,
- residual blocks,
- stable parallel training.
A large empirical comparison found that simple convolutional sequence models can outperform canonical recurrent models such as LSTMs on several sequence modeling tasks while showing longer effective memory. (arXiv)
A TCN block may be written:
$$ H^{(\ell+1)} {}={} H^{(\ell)} + \operatorname{Conv}_{\text{causal,dilated}} ( \phi(H^{(\ell)}) ). $$This is similar to residual deep learning, but adapted to time.
Part X — Attention and transformers for time series
Why attention helps
RNNs compress the past into a hidden state.
Attention directly compares time points.
Given:
$$ X\in\mathbb{R}^{T\times d}, $$we compute:
$$ Q=XW_Q, $$$$ K=XW_K, $$$$ V=XW_V. $$Attention is:
$$ \operatorname{Attention}(Q,K,V) {}={} \operatorname{softmax} \left( \frac{QK^\top}{\sqrt{d_k}} \right)V. $$Transformers were originally introduced for sequence transduction using attention alone, avoiding recurrence and convolution. (arXiv)
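Scaled dot-product attention with an optional causal mask fits in a short NumPy sketch (self-attention here, so \(Q=K=V=X\); real layers first apply the learned projections \(W_Q, W_K, W_V\)):

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, optionally masked so j <= i."""
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)              # (T, T) pairwise comparisons
    if causal:
        T = scores.shape[0]
        keep = np.tril(np.ones((T, T), dtype=bool))
        scores = np.where(keep, scores, -np.inf)  # block attention to the future
    A = softmax(scores, axis=-1)
    return A @ V, A

rng = np.random.default_rng(0)
T, dk = 6, 4
X = rng.normal(size=(T, dk))
out, A = attention(X, X, X, causal=True)
# Each row of A sums to 1, and A[i, j] = 0 whenever j > i
```

The masked entries become exactly zero after the softmax, which is the \(A_{ij}=0\) for \(j>i\) condition used for forecasting.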
Attention as temporal dependency learning
The attention matrix is:
$$ A= \operatorname{softmax} \left( \frac{QK^\top}{\sqrt{d_k}} \right). $$Here:
$$ A_{ij} $$measures how much time point \(i\) attends to time point \(j\).
For forecasting, causal masking may be used:
$$ A_{ij}=0 \quad \text{if} \quad j>i. $$For encoder-style forecasting, the model may attend over the whole observed context.
Quadratic complexity
Standard attention forms a \(T\times T\) matrix.
So memory and compute scale like:
$$ O(T^2). $$This is a problem for long time series.
If:
$$ T=10{,}000, $$then:
$$ T^2=100{,}000{,}000. $$So many time-series transformer papers modify tokenization, attention, decomposition, or variable representation to reduce cost or improve inductive bias.
Temporal Fusion Transformer
Temporal Fusion Transformer combines recurrent layers, attention, gating, static covariate encoders, known future inputs, and variable selection for multi-horizon forecasting. Its paper emphasizes both high-performance multi-horizon forecasting and interpretability over temporal dynamics. (ScienceDirect)
A simplified multi-horizon forecast is:
$$ \hat{y}_{t+1:t+H} {}={} f_\theta ( \text{past observed}, \text{known future covariates}, \text{static features} ). $$This is important because real forecasting often includes:
- static covariates,
- known future calendar features,
- observed historical covariates,
- target history.
PatchTST
PatchTST treats time-series segments as patches.
Instead of tokenizing each time point, it tokenizes subseries:
$$ p_i = [y_{s_i},y_{s_i+1},\dots,y_{s_i+P-1}]. $$Then patches become transformer tokens.
PatchTST argues that patching retains local semantic information, reduces attention cost quadratically for a fixed lookback window, and allows longer histories; it also uses channel independence, sharing embedding and transformer weights across univariate channels. (arXiv)
Mathematically:
$$ Y\in\mathbb{R}^{T\times d} \quad \longrightarrow \quad P\in\mathbb{R}^{N_p\times P\times d}. $$Then each patch is embedded into:
$$ z_i\in\mathbb{R}^{d_{\text{model}}}. $$
iTransformer
Traditional transformers often treat each timestamp as a token.
iTransformer inverts this idea.
It treats each variable as a token whose feature vector is the lookback history.
If:
$$ Y\in\mathbb{R}^{T\times d}, $$then for variable \(j\):
$$ v_j= [y_{1j},y_{2j},\dots,y_{Tj}] \in\mathbb{R}^{T}. $$The model embeds variate tokens, and attention captures multivariate correlations.
The iTransformer paper argues that standard timestamp-token transformers may mix delayed events and distinct physical variables poorly, and shows that applying transformer components on inverted dimensions can improve forecasting and generalization. (arXiv)
TimesNet
TimesNet transforms 1D time series into 2D tensors based on discovered periods.
If a period is \(p\), a sequence can be reshaped into:
$$ \text{rows}=\text{number of periods}, \qquad \text{columns}=p. $$This separates:
- intraperiod variation,
- interperiod variation.
TimesNet motivates this by multi-periodicity and models temporal variation in 2D space using parameter-efficient 2D kernels. It reports results across forecasting, imputation, classification, and anomaly detection. (arXiv)
TimeMixer
TimeMixer uses multiscale decomposition.
It builds representations at multiple sampling scales:
$$ Y^{(1)},Y^{(2)},\dots,Y^{(S)}. $$Then it decomposes each scale into trend and seasonal parts:
$$ Y^{(s)}=T^{(s)}+S^{(s)}. $$TimeMixer proposes past-decomposable mixing and future-multipredictor mixing to combine microscopic seasonal information and macroscopic trend information across scales. (ICLR Proceedings)
Part XI — Deep forecasting architectures
DeepAR
DeepAR is a probabilistic autoregressive RNN.
At each time:
$$ h_t=\operatorname{RNN}_\theta(h_{t-1},y_{t-1},x_t), $$then:
$$ \theta_t = g_\theta(h_t), $$where \(\theta_t\) parameterizes a probability distribution:
$$ p(y_t\mid y_{1:t-1},x_{1:t}). $$Training maximizes the log-likelihood:
$$ \sum_t \log p_\theta(y_t\mid y_{1:t-1},x_{1:t}). $$
N-BEATS
N-BEATS uses fully connected residual blocks for univariate forecasting.
Each block produces:
- a backcast,
- a forecast.
Let the input window be:
$$ x\in\mathbb{R}^{L}. $$A block computes:
$$ \theta_b = f_b(x), $$then:
$$ \hat{x}_b = B_b\theta_b, $$$$ \hat{y}_b = F_b\theta_b. $$The residual input to the next block is:
$$ x \leftarrow x-\hat{x}_b. $$The total forecast is:
$$ \hat{y}=\sum_b \hat{y}_b. $$N-BEATS introduced neural basis expansion with backward and forward residual links, and it reported strong results on M3, M4, and TOURISM datasets. (arXiv)
N-HiTS
N-HiTS extends this idea using hierarchical interpolation and multi-rate sampling.
It is designed for long-horizon forecasting, where naive deep models can become computationally expensive and volatile.
N-HiTS reports that hierarchical interpolation can approximate long horizons efficiently under smoothness assumptions and shows strong accuracy and runtime advantages in long-horizon experiments. (arXiv)
Part XII — State-space deep sequence models
Continuous-time state-space model
A linear state-space model can be written:
$$ \frac{dh(t)}{dt}=Ah(t)+Bu(t), $$$$ y(t)=Ch(t)+Du(t). $$Here:
- \(u(t)\) is input,
- \(h(t)\) is hidden state,
- \(y(t)\) is output.
This bridges control theory, signal processing, differential equations, and sequence modeling.
S4
S4 uses structured state-space models to model long sequences efficiently.
The S4 paper starts from continuous-time state-space equations and develops a structured parameterization allowing efficient computation while preserving long-range dependency modeling. (arXiv)
A discretized form is:
$$ h_{t+1}=\bar{A}h_t+\bar{B}x_t, $$$$ y_t=Ch_t+Dx_t. $$The impulse response defines a convolution kernel:
$$ K_k=C\bar{A}^k\bar{B}. $$Then:
$$ y_t=\sum_{k=0}^{t}K_kx_{t-k}. $$So SSMs can be viewed as recurrent models or convolutional models.
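The equivalence of the two views can be verified numerically (a sketch with random matrices, \(D=0\), and the convention \(h_t=\bar{A}h_{t-1}+\bar{B}x_t\), \(y_t=Ch_t\), which gives \(K_k=C\bar{A}^k\bar{B}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 20                                    # state size, sequence length
A = 0.5 * rng.normal(size=(n, n)) / np.sqrt(n)  # scaled small for stability
B = rng.normal(size=(n, 1)); C = rng.normal(size=(1, n))
x = rng.normal(size=T)

# Recurrent view: step the hidden state through time
h = np.zeros((n, 1)); y_rec = np.zeros(T)
for t in range(T):
    h = A @ h + B * x[t]
    y_rec[t] = (C @ h).item()

# Convolutional view: precompute the kernel K_k = C A^k B, then convolve
Kker = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(T)])
y_conv = np.array([sum(Kker[k] * x[t - k] for k in range(t + 1)) for t in range(T)])
# y_rec and y_conv agree: the SSM is both an RNN and a causal convolution
```

This dual view is what lets S4-style models train convolutionally in parallel and then run recurrently at inference time.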
Mamba and selective state spaces
A limitation of static SSMs is that their dynamics may not depend enough on the current input.
Mamba introduces selective state-space models where SSM parameters depend on the input, enabling selective propagation or forgetting along the sequence. The paper emphasizes linear scaling in sequence length and hardware-aware parallel algorithms. (arXiv)
A simplified selective update is:
$$ h_t=A(x_t)h_{t-1}+B(x_t)x_t, $$$$ y_t=C(x_t)h_t. $$So the model decides what to remember based on the current input.
Part XIII — Neural differential equations for time series
Neural ODE
A neural ODE models hidden dynamics continuously:
$$ \frac{dh(t)}{dt}=f_\theta(h(t),t). $$Given initial state \(h(t_0)\), the solution is:
$$ h(t_1)=h(t_0)+\int_{t_0}^{t_1}f_\theta(h(t),t)\,dt. $$Neural ODEs use black-box ODE solvers and support continuous-depth models, adaptive computation, and latent time-series models. (arXiv)
Latent ODE for irregular time series
Suppose observations occur at irregular times:
$$ t_1,t_2,\dots,t_n. $$A latent ODE defines:
$$ z(t)=\operatorname{ODESolve}(z(t_0),f_\theta,t_0,t). $$Then observations are generated by:
$$ y_i\sim p_\theta(y_i\mid z(t_i)). $$This is useful when data is not sampled at regular intervals.
Neural CDE
A neural controlled differential equation is:
$$ dh_t = f_\theta(h_t)\,dX_t. $$Here:
- \(X_t\) is an interpolated input path,
- \(h_t\) is hidden state.
Neural CDEs were introduced as a continuous-time limit of RNNs and are directly applicable to irregular, partially observed multivariate time series. (arXiv)
A more explicit integral form is:
$$ h_t {}={} h_0+ \int_0^t f_\theta(h_s)\,dX_s. $$This is advanced mathematics because it uses controlled differential equations, rough path ideas, and continuous-time sequence modeling.
Part XIV — Gaussian processes for time series
Gaussian process
A Gaussian process is a distribution over functions:
$$ f(t)\sim\mathcal{GP}(m(t),k(t,t')). $$This means any finite collection:
$$ [f(t_1),\dots,f(t_n)] $$has a multivariate Gaussian distribution.
Rasmussen and Williams’ Gaussian Processes for Machine Learning is a standard MIT Press reference that presents GPs as a principled probabilistic approach to regression and classification with covariance kernels. (the Gaussian Process web site)
Kernel
The kernel controls similarity:
$$ k(t,t')=\operatorname{Cov}(f(t),f(t')). $$A squared exponential kernel is:
$$ k(t,t') {}={} \sigma_f^2 \exp \left( -\frac{(t-t')^2}{2\ell^2} \right). $$A periodic kernel is:
$$ k(t,t') {}={} \sigma_f^2 \exp \left( -\frac{2\sin^2(\pi |t-t'|/p)}{\ell^2} \right). $$Kernels encode assumptions about smoothness, periodicity, and long-range correlation.
GP prediction
Let:
$$ y=f(X)+\varepsilon, \qquad \varepsilon\sim\mathcal{N}(0,\sigma_n^2I). $$For test inputs \(X_*\), the predictive distribution is Gaussian:
$$ p(f_*\mid X,y,X_*)= \mathcal{N}(\mu_*,\Sigma_*). $$The mean is:
$$ \mu_* {}={} K_{*X} (K_{XX}+\sigma_n^2I)^{-1}y. $$The covariance is:
$$ \Sigma_* {}={} K_{**} - K_{*X} (K_{XX}+\sigma_n^2I)^{-1} K_{X*}. $$This gives both prediction and uncertainty.
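The predictive equations translate directly into NumPy (a minimal sketch with a squared exponential kernel and fixed hyperparameters; production code would use a Cholesky factorization and optimize the hyperparameters):

```python
import numpy as np

def rbf(t1, t2, sf=1.0, ell=1.0):
    """Squared exponential kernel k(t, t')."""
    d = t1[:, None] - t2[None, :]
    return sf**2 * np.exp(-d**2 / (2 * ell**2))

def gp_predict(X, y, Xs, sn=0.1):
    """GP posterior mean and covariance at test inputs Xs."""
    Kxx = rbf(X, X) + sn**2 * np.eye(len(X))    # noisy training covariance
    Ksx = rbf(Xs, X)
    Kss = rbf(Xs, Xs)
    mu = Ksx @ np.linalg.solve(Kxx, y)
    Sigma = Kss - Ksx @ np.linalg.solve(Kxx, Ksx.T)
    return mu, Sigma

X = np.array([0.0, 1.0, 2.0]); y = np.sin(X)
mu, Sigma = gp_predict(X, y, np.array([1.0, 5.0]))
# Near the data (t=1) posterior variance is small; far away (t=5) it
# reverts toward the prior variance sf^2 = 1
```

The growth of the posterior variance away from the data is the "uncertainty" half of the sentence above, and it comes for free from the covariance algebra.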
Part XV — Self-supervised and representation learning
Why representation learning matters
Sometimes labels are scarce.
Instead of training directly for forecasting or classification, we learn representations:
$$ z_t = f_\theta(y_{1:T})_t. $$Then use \(z_t\) for:
- classification,
- anomaly detection,
- forecasting,
- clustering,
- imputation.
Contrastive learning
A contrastive objective pulls related views together and pushes unrelated views apart.
Let:
$$ z_i $$and:
$$ z_i^+ $$be two views of the same time series.
Let:
$$ z_j^- $$be a negative example.
An InfoNCE-style loss is:
$$ \mathcal{L} {}={} -\log \frac{ \exp(\operatorname{sim}(z_i,z_i^+)/\tau) }{ \exp(\operatorname{sim}(z_i,z_i^+)/\tau) + \sum_j \exp(\operatorname{sim}(z_i,z_j^-)/\tau) }. $$TS2Vec learns time-series representations using hierarchical contrastive learning over augmented context views and reports strong results across UCR/UEA classification, forecasting, and anomaly detection settings. (arXiv)
Part XVI — Imputation and missing data
Missing values
A time series may have missing observations.
Let:
$$ m_t= \begin{cases} 1, & y_t \text{ observed}, \\ 0, & y_t \text{ missing}. \end{cases} $$The observed data is:
$$ y_t^{\text{obs}}=m_ty_t. $$The goal is to estimate missing values:
$$ p(y_{\text{miss}}\mid y_{\text{obs}}). $$This is a conditional distribution.
Simple imputation methods
Basic methods include:
- forward fill,
- mean fill,
- interpolation,
- seasonal average.
But these may fail when the dynamics are complex.
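Forward fill and linear interpolation, two of the basic methods above, can be sketched in numpy using the 0/1 mask \(m_t\) defined earlier (helper names are illustrative):

```python
import numpy as np

def forward_fill(y, mask):
    # Carry the last observed value forward; mask[t] == 1 means y[t] is observed.
    out = y.astype(float).copy()
    last = np.nan
    for t in range(len(y)):
        if mask[t] == 1:
            last = out[t]
        else:
            out[t] = last
    return out

def linear_interpolate(y, mask):
    # Linearly interpolate between observed points.
    t = np.arange(len(y))
    obs = mask == 1
    return np.interp(t, t[obs], y[obs])

y = np.array([1.0, 0.0, 3.0, 0.0, 5.0])   # zeros stand in for missing values
m = np.array([1, 0, 1, 0, 1])
ff = forward_fill(y, m)
li = linear_interpolate(y, m)
```

Both methods ignore seasonality and cross-variable structure, which is exactly the failure mode that motivates model-based imputation below.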
Diffusion-based imputation
CSDI uses conditional score-based diffusion for probabilistic time-series imputation, conditioning on observed values and learning a conditional distribution over missing values. It reports improvements on healthcare and environmental datasets over earlier probabilistic imputation methods. (arXiv)
Mathematically, the goal is:
$$ p_\theta(y_{\text{miss}}\mid y_{\text{obs}},m). $$A diffusion model gradually denoises missing values while respecting observed values.
Part XVII — Anomaly detection
What is an anomaly?
An anomaly is a time point or segment that does not fit the expected pattern.
Examples:
- sudden spike in traffic,
- abnormal heart rhythm,
- machine vibration before failure,
- cyberattack in network traffic,
- sensor malfunction.
Reconstruction-based anomaly score
Train an autoencoder:
$$ \hat{y}_{1:T}=g_\theta(f_\theta(y_{1:T})). $$Compute reconstruction error:
$$ s_t = |y_t-\hat{y}_t|^2. $$If:
$$ s_t>\tau, $$then \(t\) is anomalous.
Forecasting-based anomaly score
Train a forecasting model:
$$ \hat{y}_t=f_\theta(y_{1:t-1}). $$Compute the forecast error: $$ e_t=y_t-\hat{y}_t. $$An anomaly score is:
$$ s_t=|e_t| $$or:
$$ s_t=e_t^\top\Sigma_t^{-1}e_t. $$Large forecast error suggests unusual behavior.
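A minimal numpy sketch of the Mahalanobis-style score \(s_t=e_t^\top\Sigma_t^{-1}e_t\), assuming for simplicity a constant error covariance \(\Sigma\) and injecting one artificial anomaly:

```python
import numpy as np

def anomaly_scores(y, y_hat, Sigma):
    # Mahalanobis-style score s_t = e_t^T Sigma^{-1} e_t on forecast errors.
    e = y - y_hat                            # forecast errors, shape (T, d)
    Sinv = np.linalg.inv(Sigma)
    return np.einsum('td,de,te->t', e, Sinv, e)

rng = np.random.default_rng(1)
T, d = 100, 2
y_hat = rng.normal(size=(T, d))              # stand-in model forecasts
y = y_hat + 0.1 * rng.normal(size=(T, d))    # observations = forecast + small error
y[50] += 5.0                                 # inject an anomaly at t = 50
scores = anomaly_scores(y, y_hat, Sigma=0.01 * np.eye(d))
flagged = int(np.argmax(scores))             # the injected point scores highest
```

Thresholding `scores` at some \(\tau\) (for example, a high quantile of scores on normal data) turns this into a detector.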
Transformer-based anomaly detection
Anomaly Transformer introduces association discrepancy: anomalies tend to have different attention-association patterns from normal points. The method uses an anomaly-attention mechanism and a minimax strategy to distinguish normal and abnormal temporal associations. (arXiv)
Part XVIII — Diffusion models for time series
Probabilistic generation
A diffusion model defines a noising process:
$$ q(x_t\mid x_{t-1}) $$and learns a reverse process:
$$ p_\theta(x_{t-1}\mid x_t,\text{condition}). $$For forecasting, the condition is historical context:
$$ y_{1:T}. $$The target is future trajectory:
$$ y_{T+1:T+H}. $$TimeGrad
TimeGrad combines autoregressive forecasting with denoising diffusion.
At each future time step, it samples from the conditional distribution by estimating gradients of the data distribution.
TimeGrad is an autoregressive denoising diffusion model for multivariate probabilistic time-series forecasting. (arXiv)
A simplified objective is:
$$ \mathbb{E} \left[ |\epsilon-\epsilon_\theta(x_t,t,\text{history})|^2 \right]. $$This is similar to image diffusion, but the conditioning is temporal history.
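The forward noising step behind that objective can be sketched in numpy. The linear noise schedule here is an assumed choice, and the network \(\epsilon_\theta\) is replaced by a trivial placeholder, so this shows the data path of the training objective, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear beta schedule; alpha_bar_t = prod_s (1 - beta_s).
betas = np.linspace(1e-4, 0.02, 100)
alpha_bar = np.cumprod(1.0 - betas)

def noise_and_target(x0, t):
    # Forward process sample: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps.
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps   # the network is trained to predict eps from (x_t, t, history)

x0 = np.sin(np.linspace(0, 2 * np.pi, 24))   # a "future trajectory" to generate
x_t, eps = noise_and_target(x0, t=50)

# Training minimizes E[ ||eps - eps_theta(x_t, t, history)||^2 ];
# here eps_theta is a zero placeholder, standing in for a real network.
loss = np.mean((eps - np.zeros_like(eps))**2)
```

In a real TimeGrad-style model, `eps_theta` would condition on an RNN summary of the history and the loss would be backpropagated through it.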
Diffusion forecasting survey view
Recent surveys organize diffusion-based time-series forecasting by how conditional information is supplied and how it is integrated into the denoising network. They emphasize that diffusion models provide probabilistic forecasts by sampling future trajectories through iterative denoising. (arXiv)
Part XIX — Time-series foundation models
Why foundation models for time series?
Traditional forecasting often trains one model per dataset.
Foundation models try to pretrain on many time series from many domains and then generalize to new series.
This is analogous to language models, but the data consists of numerical temporal observations rather than text.
A 2025 survey identifies transformer-based time-series foundation models as a major emerging paradigm for forecasting, anomaly detection, classification, trend analysis, and related tasks, and categorizes them by architecture, prediction type, scale, data type, and training objective. (arXiv)
Tokenization-based foundation models
Chronos converts time-series values into discrete tokens using scaling and quantization, then trains language-model architectures with cross-entropy loss. It reports strong zero-shot performance across many datasets. (arXiv)
Mathematically:
$$ y_t \longrightarrow \tilde{y}_t {}={} \frac{y_t-\mu}{s} $$then quantize:
$$ z_t=Q(\tilde{y}_t)\in\{1,\dots,V\}. $$Train:
$$ p_\theta(z_{t+1}\mid z_{\leq t}). $$Then sample token futures and dequantize.
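A coarse numpy sketch of the scale-then-quantize pipeline (this is an illustrative uniform-bin scheme with tokens indexed from 0, not the exact quantizer used by Chronos):

```python
import numpy as np

def tokenize(y, V=16):
    # Mean scaling, then uniform quantization into V bins over [-3, 3];
    # values outside the range fall into the end bins.
    mu = y.mean()
    s = np.abs(y - mu).mean() + 1e-8   # mean absolute scaling
    y_tilde = (y - mu) / s
    edges = np.linspace(-3, 3, V - 1)
    z = np.digitize(y_tilde, edges)    # tokens in {0, ..., V-1}
    return z, mu, s

def detokenize(z, mu, s, V=16):
    # Map each token to a representative value, then undo the scaling.
    centers = np.linspace(-3, 3, V)
    return centers[z] * s + mu

y = np.array([10.0, 12.0, 15.0, 14.0, 18.0])
z, mu, s = tokenize(y)
y_rec = detokenize(z, mu, s)           # lossy reconstruction of y
```

The reconstruction is lossy by design: quantization trades numeric precision for a discrete vocabulary that a language-model architecture can predict with cross-entropy.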
Decoder-only foundation forecasting
TimesFM is a decoder-only foundation model for time-series forecasting and was designed for zero-shot forecasting across public datasets. Its paper reports that out-of-the-box zero-shot performance can approach supervised models trained specifically on individual datasets. (arXiv)
A decoder-only forecasting model learns:
$$ p_\theta(y_{T+1:T+H}\mid y_{1:T}). $$Depending on design, outputs may be:
- point forecasts,
- quantiles,
- distribution parameters,
- sampled trajectories.
Lag-based foundation forecasting
Lag-Llama is a decoder-only transformer foundation model for univariate probabilistic forecasting that uses lagged values as covariates and is pretrained on diverse time-series data. (arXiv)
Lag features may be:
$$ [y_{t-1},y_{t-7},y_{t-24},y_{t-168}] $$for daily, hourly, or weekly structures.
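Such lag features can be assembled into a design matrix; a minimal numpy sketch (the helper name is illustrative):

```python
import numpy as np

def lag_matrix(y, lags):
    # Build a matrix whose columns are the lagged values y_{t-l} for each l,
    # aligned with targets y_t; rows start once all lags are available.
    t0 = max(lags)
    T = len(y)
    X = np.column_stack([y[t0 - l: T - l] for l in lags])
    targets = y[t0:]
    return X, targets

y = np.arange(10.0)                      # y_t = t for t = 0, ..., 9
X, targets = lag_matrix(y, lags=[1, 2, 4])
# The first row (t = 4) is [y_3, y_2, y_0] = [3, 2, 0]; its target is y_4 = 4.
```

In Lag-Llama the analogous lagged values enter the decoder as covariates rather than as a regression design matrix, but the alignment logic is the same.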
The model learns:
$$ p_\theta(y_t\mid y_{t-\ell_1},y_{t-\ell_2},\dots,\text{context}). $$Universal forecasting models
Moirai frames universal forecasting as pretraining one model across many time-series datasets to handle diverse downstream forecasting tasks. It supports probabilistic zero-shot forecasting and exogenous features. (arXiv)
The mathematical goal is no longer:
$$ \text{one dataset} \rightarrow \text{one model}. $$It becomes:
$$ \text{many datasets} \rightarrow \text{one general model} \rightarrow \text{zero-shot or fine-tuned forecasting}. $$General-purpose time-series foundation models
MOMENT introduces open time-series foundation models for general-purpose tasks, including forecasting, classification, anomaly detection, and imputation, and discusses challenges such as lack of a large cohesive public repository and heterogeneous time-series characteristics. (arXiv)
So foundation models must handle:
- different frequencies,
- different units,
- different scales,
- different lengths,
- missing values,
- multivariate inputs,
- diverse downstream tasks.
Continuous-valued foundation models
Some recent models avoid discrete tokenization.
Sundial proposes TimeFlow Loss based on flow matching to predict next-patch distributions directly for continuous-valued time series, enabling probabilistic zero-shot forecasting with generated sample trajectories. (arXiv)
This connects time-series foundation models to:
- continuous probability flows,
- generative modeling,
- stochastic forecasting,
- distribution learning.
A flow-matching-style objective learns a vector field:
$$ v_\theta(x,t) \approx v^\star(x,t) $$that transports a simple distribution toward the data distribution.
Part XX — Evaluation metrics
Point forecast metrics
Mean absolute error:
$$ \operatorname{MAE} {}={} \frac{1}{N} \sum_{i=1}^{N} |y_i-\hat{y}_i|. $$Mean squared error:
$$ \operatorname{MSE} {}={} \frac{1}{N} \sum_{i=1}^{N} (y_i-\hat{y}_i)^2. $$Root mean squared error:
$$ \operatorname{RMSE} {}={} \sqrt{\operatorname{MSE}}. $$Mean absolute percentage error:
$$ \operatorname{MAPE} {}={} \frac{100}{N} \sum_i \left| \frac{y_i-\hat{y}_i}{y_i} \right|. $$MAPE is problematic when \(y_i\) is near zero.
Scale-free metric: MASE
Mean absolute scaled error compares forecast error to a naive baseline:
$$ \operatorname{MASE} {}={} \frac{ \frac{1}{N}\sum_i |y_i-\hat{y}_i| }{ \frac{1}{T-1}\sum_{t=2}^{T}|y_t-y_{t-1}| }. $$This helps compare across series with different scales.
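The point metrics above, including MASE with its naive one-step denominator computed on the training series, can be sketched in numpy:

```python
import numpy as np

def mae(y, y_hat):  return np.mean(np.abs(y - y_hat))
def mse(y, y_hat):  return np.mean((y - y_hat)**2)
def rmse(y, y_hat): return np.sqrt(mse(y, y_hat))
def mape(y, y_hat): return 100.0 * np.mean(np.abs((y - y_hat) / y))

def mase(y, y_hat, y_train):
    # Scale forecast MAE by the in-sample MAE of the naive forecast y_t = y_{t-1}.
    naive = np.mean(np.abs(np.diff(y_train)))
    return mae(y, y_hat) / naive

y_train = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_true = np.array([14.0, 13.0])
y_pred = np.array([13.0, 13.5])
scores = {
    "MAE": mae(y_true, y_pred),     # 0.75
    "RMSE": rmse(y_true, y_pred),
    "MASE": mase(y_true, y_pred, y_train),  # 0.5: beats the naive baseline
}
```

A MASE below 1 means the forecast is, on average, more accurate than the naive last-value baseline measured on the training data.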
Probabilistic metrics
For probabilistic forecasts, use:
- negative log-likelihood,
- CRPS,
- pinball loss,
- coverage,
- interval width,
- calibration error.
Prediction interval coverage:
$$ \operatorname{Coverage} {}={} \frac{1}{N} \sum_i \mathbf{1}\{y_i\in [\ell_i,u_i]\}. $$If a 90% interval is calibrated, coverage should be close to 0.9.
Conformal prediction for time series
Conformal forecasting uses held-out residuals to build prediction intervals with distribution-free coverage guarantees; the guarantees hold exactly under exchangeability and only approximately for dependent time series.
A conformal score may be:
$$ s_i=|y_i-\hat{y}_i|. $$Let:
$$ q_{1-\alpha} $$be a quantile of calibration scores.
Then interval:
$$ [\hat{y}_{t}-q_{1-\alpha},\hat{y}_{t}+q_{1-\alpha}] $$has approximate coverage.
Conformal time-series forecasting extends inductive conformal prediction to forecasting and proposes lightweight uncertainty intervals for multi-horizon predictors. (NeurIPS Proceedings)
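The split-conformal recipe above, together with the coverage check from the previous subsection, in a minimal numpy sketch (the finite-sample quantile correction is omitted for brevity):

```python
import numpy as np

def conformal_interval(residuals, y_hat, alpha=0.1):
    # Split-conformal interval: q is the (1 - alpha) empirical quantile of
    # absolute calibration residuals; interval is y_hat +/- q.
    q = np.quantile(np.abs(residuals), 1 - alpha)
    return y_hat - q, y_hat + q

rng = np.random.default_rng(0)
cal_residuals = rng.normal(scale=1.0, size=500)   # calibration-set errors
y_hat_test = np.zeros(1000)                        # stand-in point forecasts
y_test = rng.normal(scale=1.0, size=1000)          # same error distribution
lo, hi = conformal_interval(cal_residuals, y_hat_test, alpha=0.1)
coverage = np.mean((y_test >= lo) & (y_test <= hi))  # should be near 0.9
```

The interval width here is constant across time; adaptive variants rescale the residuals, for example by a predicted volatility, to get time-varying intervals.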
Part XXI — Advanced mathematical view
Time series as stochastic processes
A time series is one realization of a stochastic process:
$$ \{Y_t:t\in\mathcal{T}\}. $$If \(\mathcal{T}=\mathbb{Z}\), it is discrete time.
If \(\mathcal{T}=\mathbb{R}\), it is continuous time.
The complete mathematical object is not just observed values. It is the probability law:
$$ P(Y_{1:T}\in A). $$Learning a time-series model means estimating or approximating this probability law.
Conditional distribution view
The general forecasting problem is:
$$ p(y_{T+1:T+H}\mid y_{1:T},x_{1:T+H}). $$Here:
- \(y\) is target series,
- \(x\) are covariates,
- past covariates may be observed,
- future covariates may be known,
- future targets are unknown.
A deep model parameterizes:
$$ p_\theta(y_{T+1:T+H}\mid \mathcal{F}_T), $$where \(\mathcal{F}_T\) is the information available at time \(T\).
This is the mathematically correct forecasting object.
Filtration
In stochastic process theory, information up to time \(t\) is represented by a filtration:
$$ \mathcal{F}_t. $$A forecast is conditional on this information:
$$ \mathbb{E}[Y_{t+h}\mid \mathcal{F}_t]. $$A probabilistic forecast is:
$$ P(Y_{t+h}\in A\mid \mathcal{F}_t). $$This notation is advanced but extremely important.
It formalizes “no future leakage.”
Martingale difference noise
A noise sequence \(\varepsilon_t\) is a martingale difference if:
$$ \mathbb{E}[\varepsilon_t\mid \mathcal{F}_{t-1}]=0. $$This means the noise is unpredictable from the past.
Many forecasting models aim to make residuals behave like martingale differences.
If residuals still have predictable structure, the model has not captured all available information.
Ergodicity
Ergodicity allows time averages to approximate expectations.
For a stationary ergodic process:
$$ \frac{1}{T} \sum_{t=1}^{T} g(Y_t) \to \mathbb{E}[g(Y)] $$as:
$$ T\to\infty. $$This is important because in time series we often observe one long trajectory, not many independent samples.
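This convergence can be illustrated by simulating a stationary AR(1) process, which is ergodic for \(|\phi|<1\), and checking that the time average of a single long trajectory approaches the theoretical mean of zero:

```python
import numpy as np

# Simulate y_t = phi * y_{t-1} + eps_t with standard normal noise.
rng = np.random.default_rng(42)
phi, T = 0.5, 100_000
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.normal()

time_average = y.mean()   # close to E[Y] = 0 for large T, by ergodicity
```

The same time average computed on a non-ergodic or non-stationary process (for example, a random walk with \(\phi = 1\)) would not settle down, which is why stationarity assumptions matter for estimation from a single trajectory.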
Non-stationarity and distribution shift
Real time series often violate stationarity.
The distribution may change:
$$ p_t(y)\neq p_{t+1}(y). $$Sources of non-stationarity include:
- climate change,
- market regime shifts,
- sensor drift,
- policy changes,
- equipment aging,
- seasonal changes,
- user behavior shifts.
Modern models use:
- normalization,
- decomposition,
- adaptive training,
- online learning,
- covariate conditioning,
- foundation pretraining,
- conformal recalibration.
But non-stationarity remains one of the hardest problems in time-series ML.
Part XXII — The complete mathematical map
The mathematics of time-series machine learning and deep learning includes:
| Area | Mathematics |
|---|---|
| Basic time series | sequences, indexing, windows |
| Classical statistics | stationarity, autocorrelation, ARMA, ARIMA |
| Forecasting | conditional expectation, conditional distribution |
| Seasonality | Fourier analysis, periodic functions |
| State-space models | linear systems, Gaussian conditioning |
| Kalman filtering | Bayesian recursion, matrix covariance updates |
| HMMs | Markov chains, latent variables, dynamic programming |
| ML forecasting | supervised learning, empirical risk minimization |
| Probabilistic forecasting | likelihoods, quantiles, CRPS, calibration |
| RNNs | recurrence, dynamical systems, BPTT |
| LSTMs/GRUs | gating, memory, gradient flow |
| TCNs | causal convolution, dilation, receptive fields |
| Transformers | attention, tensor algebra, positional encodings |
| Patch models | local subsequence tokenization |
| Inverted models | variate-token attention |
| Multiscale models | decomposition, trend-season mixing |
| SSMs | differential equations, convolution kernels |
| Neural ODE/CDE | continuous-time dynamics, adjoint methods |
| Gaussian processes | kernels, Bayesian function priors |
| Diffusion models | score matching, denoising, conditional generation |
| Foundation models | pretraining, zero-shot transfer, scaling |
| Anomaly detection | residuals, reconstruction error, association discrepancy |
| Imputation | conditional distributions, missing-data inference |
| Evaluation | proper scoring rules, calibration, backtesting |
Section summary
A time series is ordered data:
$$ y_{1:T}=(y_1,\dots,y_T). $$A multivariate time series is:
$$ Y\in\mathbb{R}^{T\times d}. $$A forecasting model learns:
$$ \hat{y}_{T+1:T+H}=f_\theta(y_{1:T}). $$A probabilistic forecasting model learns:
$$ p_\theta(y_{T+1:T+H}\mid y_{1:T}). $$Classical models include:
$$ \text{AR},\quad \text{MA},\quad \text{ARMA},\quad \text{ARIMA},\quad \text{ETS}. $$State-space models separate hidden dynamics and noisy observations:
$$ x_t=Ax_{t-1}+w_t, $$$$ y_t=Cx_t+v_t. $$Deep learning models include:
$$ \text{RNN},\quad \text{LSTM},\quad \text{GRU},\quad \text{TCN},\quad \text{Transformer}. $$Modern SOTA families include:
$$ \text{N-BEATS},\quad \text{N-HiTS},\quad \text{DeepAR},\quad \text{TFT},\quad \text{PatchTST},\quad \text{iTransformer},\quad \text{TimesNet},\quad \text{TimeMixer}. $$Advanced models include:
$$ \text{S4},\quad \text{Mamba},\quad \text{Neural ODEs},\quad \text{Neural CDEs},\quad \text{Gaussian Processes},\quad \text{Diffusion Models}. $$Foundation models aim to learn across many datasets:
$$ \text{many time series} \to \text{one pretrained model} \to \text{zero-shot or fine-tuned forecasting}. $$The deepest view is:
time-series learning is conditional probability modeling over ordered stochastic processes.
Mathematically:
$$ p_\theta(y_{T+1:T+H}\mid \mathcal{F}_T) $$is the central object.
Everything else — ARIMA, Kalman filters, RNNs, transformers, diffusion models, and foundation models — is a different way of approximating this conditional law.
Source anchors used for this section
- Hyndman and Athanasopoulos, Forecasting: Principles and Practice, for classical forecasting, decomposition, ETS, ARIMA, and practical evaluation. (OTexts: Online, open-access textbooks)
- MIT OCW graduate time-series notes for stationarity, lag operators, ARMA, covariance, spectral analysis, and econometric time-series theory. (MIT OpenCourseWare)
- Stanford and MIT notes for state-space models and Kalman filtering. (Stanford University)
- Rabiner’s HMM tutorial for Markov chains, hidden states, likelihood evaluation, decoding, and parameter estimation. (Computer Science at UBC)
- Rasmussen and Williams, Gaussian Processes for Machine Learning, for Gaussian processes, kernels, and probabilistic function modeling. (the Gaussian Process web site)
- DeepAR, N-BEATS, N-HiTS, TFT, PatchTST, TimesNet, iTransformer, and TimeMixer papers for representative deep forecasting architectures. (arXiv)
- LSTM, GRU, WaveNet, and TCN papers for recurrent and convolutional sequence modeling. (bioinf.jku.at)
- S4 and Mamba papers for structured and selective state-space sequence models. (arXiv)
- Neural ODE and Neural CDE papers for continuous-time and irregular time-series modeling. (arXiv)
- TS2Vec and Anomaly Transformer for representation learning and anomaly detection. (arXiv)
- TimeGrad, CSDI, and diffusion time-series surveys for diffusion-based probabilistic forecasting and imputation. (arXiv)
- Chronos, TimesFM, Lag-Llama, Moirai, MOMENT, Timer, and Sundial for modern time-series foundation models. (arXiv)
