[{"categories":["Research","Physics"],"content":"This note is based on my master\u0026rsquo;s thesis, Exciton Diffusion as a Blind Search: First-Hitting and First-Return Time Analysis. The central idea is simple: instead of treating exciton transport only as a diffusion problem, we can also interpret it as a search process in a disordered landscape.\nIn that interpretation, an exciton moves through a semiconductor as a random walker while traps, defects, or recombination centers play the role of targets. This makes it natural to ask two different questions:\nHow efficiently does the walker find a new target? How quickly does it return to where it started? Those two questions are captured by first-hitting time (FHT) and first-return time (FRT), and the thesis shows that they are optimized by different movement regimes.\nMain takeaway: The most efficient exploratory regime in the simulations is not the canonical Levy-foraging value $\\alpha = 1$, but a lower value near $\\alpha \\approx 0.6$. That difference matters because it shows the \u0026ldquo;best\u0026rdquo; search strategy depends strongly on the physical environment and objective. Why frame exciton transport as search? Excitons are bound electron-hole pairs, and their transport is critical for optoelectronic systems. In disordered materials, however, transport is not simply smooth Brownian motion. Defects, traps, and heterogeneous structure can produce:\nanomalous diffusion, coexistence of free and trapped populations, strong sensitivity to geometry and disorder. Blind-search theory gives a language for studying exactly these features. It lets us ask whether long jumps help or hurt, whether revisits are beneficial, and how disorder changes transport statistics.\nModel The thesis studies a two-dimensional search space populated by fixed traps. The exciton is modeled as a Levy-like random walker with stability index $\\alpha$, where the jump-length tail scales like\n$$ p(\\ell) \\sim \\ell^{-(\\alpha + 1)}. 
$$Smaller $\\alpha$ means heavier tails and more frequent long relocations. Larger $\\alpha$ approaches Brownian-like behavior. The simulations used large Monte Carlo ensembles, reaching roughly\n$$ N \\approx 10^6 $$trajectories in the highest-confidence runs.\nThe transport environment is deliberately disordered:\na 2D continuous domain with periodic boundaries, a fixed set of traps representing quenched disorder, a finite reaction radius around each trap, sub-stepped motion so long jumps do not simply leap over targets without being detected. This makes the comparison between exploration and recurrence statistically meaningful.\nThe two observables First-hitting time FHT measures the first time the walker reaches a new target region. In the exciton interpretation, this corresponds to successful transport toward a new trap or interaction site.\nLow mean FHT means high exploratory efficiency.\nFirst-return time FRT measures the first time the walker comes back to its original region after leaving it. This is a recurrence statistic rather than a discovery statistic.\nLow FRT means the dynamics remain locally revisiting or recurrent.\nMain findings 1. Exploration is optimized near alpha approximately 0.6 The clearest result of the thesis is that the minimum mean FHT appears near\n$$ \\alpha_{\\mathrm{opt}} \\approx 0.6. $$That means the best exploratory behavior is obtained neither in the Brownian limit nor at the standard Levy-foraging benchmark $\\alpha = 1$, but in a heavier-tailed regime at lower $\\alpha$.\n2. Strategies with alpha greater than 1 are poor for discovering new targets Once the dynamics move into the $\\alpha \u0026gt; 1$ regime, the simulations show a sharp degradation in FHT performance:\nlonger mean FHT, larger inefficiency in reaching new traps, a very low probability of a successful hit within the simulated time window. So if the physical goal is efficient discovery of new trap sites, these regimes are strongly suboptimal.\n3. 
Mean FRT stays low across the range Unlike FHT, the mean FRT remains comparatively low and fairly stable across the explored range of $\\alpha$.\nThis means recurrence is governed by a different statistical structure than exploration. A strategy that is good at coming back is not necessarily good at finding something new.\n4. The variance of FRT changes sharply around alpha = 1 Although the mean FRT stays low, the variance of FRT undergoes a clear transition:\nfor $\\alpha \\le 1$, return-time variance is extremely large for $\\alpha \u0026gt; 1$, the variance drops substantially That suggests a trade-off:\nlower $\\alpha$ improves exploration higher $\\alpha$ makes returns more predictable Interpretation This leads to a physically useful conclusion: exploration and recurrence are different optimization problems.\nIf the objective is to reach new sites efficiently, the system prefers a heavier-tailed motion near $\\alpha \\approx 0.6$. If the objective is stable local revisitation, then larger $\\alpha$ values become more attractive.\nThis is important because many discussions of Levy optimality focus on a single universal value. The thesis argues against that kind of universality in disordered exciton transport. The best exponent depends on:\ndisorder structure target definition trapping geometry the metric being optimized Why it matters For semiconductor and excitonic systems, these results suggest that transport pathways are not just governed by average diffusion constants. 
They are shaped by the full statistical structure of motion in a disordered environment.\nThat matters for:\nunderstanding transport losses interpreting anomalous diffusion experiments thinking about trap engineering designing materials where energy transport toward or away from localized sites matters Methodological note The thesis combines:\nLevy-stable jump statistics discrete Langevin-style updates cKDTree-based nearest-neighbor searches large Monte Carlo ensembles This combination makes it possible to study both asymptotic statistics and practical hit/return behavior in large disordered systems.\nThesis contribution in one sentence The thesis shows that when exciton transport in a disordered semiconductor is treated as a blind-search problem, the best strategy for discovering new targets emerges near $\\alpha \\approx 0.6$, while return dynamics obey a different and more stable statistical regime.\n","date":"2026-04-14","description":"Master's thesis note on exciton diffusion, blind search, and first-hitting versus first-return statistics in disordered semiconductor environments.","featured":true,"featured_image":null,"permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/research/exciton-diffusion-blind-search/","popular":true,"readingTime":5,"relPermalink":"/Sang_e_Mehrab/research/exciton-diffusion-blind-search/","section":"research","series":"Thesis Notes","summary":"A research note based on my master's thesis, showing that exciton transport modeled as a blind search is most efficient near alpha approximately 0.6 rather than the canonical Levy-foraging value alpha = 1.","tags":["mathematics","physics","anomalous diffusion","levy flights","semiconductors"],"title":"Exciton Diffusion as a Blind Search","type":"research"},{"categories":["Course"],"content":"Why start with one house? Artificial intelligence begins with data, but data does not enter a model as a vague idea. It must be represented mathematically.\nA house, for example, is a real-world object. 
It has a price, a location, an area, a number of rooms, an age, a height, a neighborhood, and many other properties. A machine-learning model cannot directly understand the sentence:\nThis is a two-floor house in a good location with a price of 9.0.\nThe model needs numbers.\nSo the first question of applied mathematics for AI is:\nHow do we turn a real object into mathematical objects that a model can compute with?\nWe begin with the simplest possible case: one number.\nA single data point as a scalar Suppose we only know one thing about a house:\n$$ \\text{price} = 9.0 $$This number may mean 9.0 million rupees, 9.0 hundred-thousand dollars, or any other chosen unit. The unit is not the main point yet. The mathematical point is that we have one number.\nWe can name this number:\n$$ x = 9.0 $$Here, \\(x\\) is a scalar.\nA scalar is a single number. It has magnitude, but it does not have multiple components.\nExamples of scalars are:\n$$ x = 9.0 $$$$ a = 1200 $$$$ r = 0.05 $$$$ n = 100 $$In machine learning, scalars appear everywhere:\none house price, one temperature value, one loss value, one learning rate, one probability, one model output for a single regression task. Key idea: A scalar is the smallest numerical object we usually use in machine learning. It is one value, not a list, not a table, and not a tensor with many axes. If we write:\n$$ x = 9.0 $$then \\(x\\) is a scalar data point.\nBut one number is rarely enough to describe a house.\nFrom one number to many features A real house is not described only by price. Suppose we observe the following information:\nPrice: 9.0, Area: 1200 square feet, Bedrooms: 3, Floors: 2, Age: 10 years, Distance from city center: 3.2 km. Now the house is no longer represented by one number. 
It is represented by several numbers.\nWe may collect these numbers into one ordered list:\n$$ [9.0,\\;1200,\\;3,\\;2,\\;10,\\;3.2] $$This list is the beginning of a vector representation.\nEach number describes one property of the house. In machine learning, these properties are usually called features.\nSo we may write:\n$$ x = \\begin{bmatrix} 9.0 \\\\ 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\end{bmatrix} $$This object is no longer a scalar. It is a vector.\nIt has six components, so we say:\n$$ x \\in \\mathbb{R}^6 $$This means:\n\\(x\\) is a vector with 6 real-valued components.\nNumerical example: one house Let the features be ordered as:\n$$ \\text{features} = [ \\text{price}, \\text{area}, \\text{bedrooms}, \\text{floors}, \\text{age}, \\text{distance} ] $$For one house:\n$$ x = \\begin{bmatrix} 9.0 \\\\ 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\end{bmatrix} $$The first component is the price:\n$$ x_1 = 9.0 $$The second component is the area:\n$$ x_2 = 1200 $$The third component is the number of bedrooms:\n$$ x_3 = 3 $$The fourth component is the number of floors:\n$$ x_4 = 2 $$The fifth component is the age:\n$$ x_5 = 10 $$The sixth component is the distance from the city center:\n$$ x_6 = 3.2 $$So the vector is not just a list of numbers. It is a structured representation of one house.\nPrice as an input versus price as a target There is an important machine-learning distinction here.\nIf we are simply describing a house, we may include price inside \\(x\\):\n$$ x = \\begin{bmatrix} 9.0 \\\\ 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\end{bmatrix} $$But if our task is to predict house price, then price is usually not placed inside the input vector. Instead, price becomes the target value.\nThen we write:\n$$ x = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\end{bmatrix} $$and\n$$ y = 9.0 $$Here:\n\\(x\\) is the input vector, \\(y\\) is the target output, the model learns a function that maps \\(x\\) to \\(y\\). 
That is:\n$$ f(x) \\approx y $$For house-price prediction:\n$$ f \\left( \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\end{bmatrix} \\right) \\approx 9.0 $$Important distinction: The same number can play different roles depending on the task. If price is just an observed property, it may be part of \\(x\\). If price is what we want to predict, it is usually written as \\(y\\). Why words must become numbers Suppose we also know the location:\nLocation: North. The word \u0026quot;North\u0026quot; cannot directly enter most mathematical models. We need to encode it numerically.\nA poor encoding would be:\n$$ \\text{North}=1,\\qquad \\text{South}=2,\\qquad \\text{East}=3 $$This may accidentally tell the model that East is somehow greater than South, and South is greater than North. But neighborhoods usually do not have such a natural numerical order.\nA better simple encoding is one-hot encoding.\nSuppose the possible locations are:\n$$ \\{\\text{North},\\text{South},\\text{East}\\} $$Then:\n$$ \\text{North} = \\begin{bmatrix} 1 \\\\ 0 \\\\ 0 \\end{bmatrix} $$$$ \\text{South} = \\begin{bmatrix} 0 \\\\ 1 \\\\ 0 \\end{bmatrix} $$$$ \\text{East} = \\begin{bmatrix} 0 \\\\ 0 \\\\ 1 \\end{bmatrix} $$Now a house in the North location may be represented as:\n$$ x = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\\\ 1 \\\\ 0 \\\\ 0 \\end{bmatrix} $$This vector has 8 components:\n$$ x \\in \\mathbb{R}^8 $$The first five components are numerical house properties. 
The last three components encode the location.\nA single row of data In a spreadsheet or dataset, one house is often stored as one row.\nArea: 1200, Bedrooms: 3, Floors: 2, Age: 10, Distance: 3.2, North: 1, South: 0, East: 0, Price: 9.0. This row contains one training example.\nIf the goal is to predict price, then the input features are:\n$$ x = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\\\ 1 \\\\ 0 \\\\ 0 \\end{bmatrix} $$and the target is:\n$$ y = 9.0 $$What one row means One row usually means:\none object, one observation, one sample, or one data point.\nIn house-price prediction:\none row = one house, one column = one feature, one target column = the value we want to predict. In image classification:\none row may represent one image after flattening, columns may represent pixel values, the target may represent the class label. In natural language processing:\none row may represent one sentence, columns may represent token IDs or embedding values, the target may represent sentiment, next word, or class label. Features, dimensions, and units If a vector has \\(d\\) features, we write:\n$$ x \\in \\mathbb{R}^d $$The number \\(d\\) is called the dimension of the vector.\nFor example:\n$$ x = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\end{bmatrix} \\in \\mathbb{R}^5 $$Here:\n$$ d = 5 $$The vector has five components.\nHowever, each component may have a different unit:\n\\(x_1\\): Area (square feet), \\(x_2\\): Bedrooms (count), \\(x_3\\): Floors (count), \\(x_4\\): Age (years), \\(x_5\\): Distance (kilometers). This matters because raw numerical values can have very different scales.\nFor example, area may be around 1200, while bedrooms may be around 3. If we compute distances or dot products directly, the area feature may dominate simply because its numerical scale is larger.\nML intuition: A feature with large numerical values is not automatically more important. It may only have a larger unit scale. 
This is why normalization and standardization are important in many ML models. Row vectors and column vectors The same numbers can be written horizontally or vertically.\nHorizontally:\n$$ x = \\begin{bmatrix} 1200 \u0026 3 \u0026 2 \u0026 10 \u0026 3.2 \\end{bmatrix} $$Vertically:\n$$ x = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\end{bmatrix} $$These look similar, but mathematically they have different shapes.\nRow vector A row vector has one row and many columns.\nFor example:\n$$ x_{\\text{row}} = \\begin{bmatrix} 1200 \u0026 3 \u0026 2 \u0026 10 \u0026 3.2 \\end{bmatrix} $$This has shape:\n$$ 1 \\times 5 $$So:\n$$ x_{\\text{row}} \\in \\mathbb{R}^{1 \\times 5} $$A row vector is useful when we think of a data point as one row in a dataset.\nColumn vector A column vector has many rows and one column.\nFor example:\n$$ x_{\\text{col}} = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\end{bmatrix} $$This has shape:\n$$ 5 \\times 1 $$So:\n$$ x_{\\text{col}} \\in \\mathbb{R}^{5 \\times 1} $$A column vector is common in linear algebra because many formulas are written using column vectors.\nSame numbers, different mathematical shape The row vector is:\n$$ x_{\\text{row}} = \\begin{bmatrix} 1200 \u0026 3 \u0026 2 \u0026 10 \u0026 3.2 \\end{bmatrix} \\in \\mathbb{R}^{1 \\times 5} $$The column vector is:\n$$ x_{\\text{col}} = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\end{bmatrix} \\in \\mathbb{R}^{5 \\times 1} $$They contain the same values, but they are not the same matrix-shaped object.\nTheir shapes are different:\n$$ 1 \\times 5 \\neq 5 \\times 1 $$This difference matters when we multiply vectors and matrices.\nFor example, suppose:\n$$ a = \\begin{bmatrix} 2 \u0026 3 \u0026 4 \\end{bmatrix} $$and\n$$ b = \\begin{bmatrix} 5 \\\\ 6 \\\\ 7 \\end{bmatrix} $$Then we multiply \\(a\\) and \\(b\\) in that order. 
Writing \\(ab\\) means the usual matrix product: a \\(1 \\times 3\\) row times a \\(3 \\times 1\\) column, so the result is a single number (a scalar).\nHow the entries combine: the first component of \\(a\\) multiplies the first component of \\(b\\), the second multiplies the second, the third multiplies the third. Those three products are added, and that sum is the value of \\(ab\\) (also called the dot product of the row and the column).\n$$ ab = \\begin{bmatrix} 2 \u0026 3 \u0026 4 \\end{bmatrix} \\begin{bmatrix} 5 \\\\ 6 \\\\ 7 \\end{bmatrix} $$So:\n$$ ab = 2(5) + 3(6) + 4(7) $$$$ ab = 10 + 18 + 28 = 56 $$So:\n$$ ab = 56 $$This is a scalar.\nBut if we reverse the order and multiply \\(b\\) by \\(a\\) instead, a \\(3 \\times 1\\) column times a \\(1 \\times 3\\) row, we get a \\(3 \\times 3\\) matrix: the entry in row \\(i\\), column \\(j\\) is the \\(i\\)-th component of \\(b\\) times the \\(j\\)-th component of \\(a\\).\n$$ ba = \\begin{bmatrix} 5 \\\\ 6 \\\\ 7 \\end{bmatrix} \\begin{bmatrix} 2 \u0026 3 \u0026 4 \\end{bmatrix} $$The result is:\n$$ ba = \\begin{bmatrix} 10 \u0026 15 \u0026 20 \\\\ 12 \u0026 18 \u0026 24 \\\\ 14 \u0026 21 \u0026 28 \\end{bmatrix} $$This is a \\(3 \\times 3\\) matrix.\nSo:\n$$ ab \\neq ba $$The order and shape matter.\nTranspose The transpose changes rows into columns and columns into rows.\nThe transpose of an object is written using the symbol:\n$$ ^\\top $$If \\(A\\) is a matrix, then its transpose is:\n$$ A^\\top $$Scalar transpose A scalar has no row or column direction.\nIf:\n$$ x = 9.0 $$then:\n$$ x^\\top = x $$So:\n$$ 9.0^\\top = 9.0 $$The transpose of a scalar is the same scalar.\nRow-to-column transpose Let:\n$$ x_{\\text{row}} = \\begin{bmatrix} 1200 \u0026 3 \u0026 2 \u0026 10 \u0026 3.2 \\end{bmatrix} $$Then:\n$$ x_{\\text{row}}^\\top = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\end{bmatrix} $$So the transpose changes a row vector into a column vector.\nShape-wise:\n$$ x_{\\text{row}} \\in \\mathbb{R}^{1 \\times 5} 
$$and:\n$$ x_{\\text{row}}^\\top \\in \\mathbb{R}^{5 \\times 1} $$Column-to-row transpose Similarly, if:\n$$ x_{\\text{col}} = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\end{bmatrix} $$then:\n$$ x_{\\text{col}}^\\top = \\begin{bmatrix} 1200 \u0026 3 \u0026 2 \u0026 10 \u0026 3.2 \\end{bmatrix} $$Shape-wise:\n$$ x_{\\text{col}} \\in \\mathbb{R}^{5 \\times 1} $$and:\n$$ x_{\\text{col}}^\\top \\in \\mathbb{R}^{1 \\times 5} $$Matrix transpose What is a matrix (first pass)? For now, think of a matrix in whichever picture helps: two (or more) rows written one under the other, or two (or more) columns placed side by side — equivalently, row vectors stacked vertically or column vectors lined up horizontally. We are about to see how transpose swaps that row/column view. A fuller story — how we read off entries, shapes like \\(n \\times d\\), and how matrices act in models — comes when we build the dataset matrix in the next major section, From one house to a dataset matrix.\nSuppose we have a matrix:\n$$ A = \\begin{bmatrix} 1 \u0026 2 \u0026 3 \\\\ 4 \u0026 5 \u0026 6 \\end{bmatrix} $$This matrix has 2 rows and 3 columns:\n$$ A \\in \\mathbb{R}^{2 \\times 3} $$The transpose is:\n$$ A^\\top = \\begin{bmatrix} 1 \u0026 4 \\\\ 2 \u0026 5 \\\\ 3 \u0026 6 \\end{bmatrix} $$Now the shape is:\n$$ A^\\top \\in \\mathbb{R}^{3 \\times 2} $$The first row of \\(A\\) becomes the first column of \\(A^\\top\\). 
The second row of \\(A\\) becomes the second column of \\(A^\\top\\).\nIn general, if:\n$$ A \\in \\mathbb{R}^{m \\times n} $$then:\n$$ A^\\top \\in \\mathbb{R}^{n \\times m} $$Code warning: 1-D arrays are not row or column vectors In mathematics, we distinguish between:\n$$ \\mathbb{R}^{1 \\times d} $$and:\n$$ \\mathbb{R}^{d \\times 1} $$But in NumPy or PyTorch, a one-dimensional array may have shape:\n(d,)\nThis is not the same as:\n(1, d)\nand it is not the same as:\n(d, 1)\nFor example:\nimport numpy as np\nx = np.array([1200, 3, 2, 10, 3.2])\nprint(x.shape)\nOutput:\n(5,)\nThis is a 1-D array.\nIf we write:\nprint(x.T.shape)\nthe shape is still:\n(5,)\nThe transpose did not turn it into a column vector, because the array has only one axis.\nTo create a row vector:\nx_row = x.reshape(1, 5)\nprint(x_row.shape)\nOutput:\n(1, 5)\nTo create a column vector:\nx_col = x.reshape(5, 1)\nprint(x_col.shape)\nOutput:\n(5, 1)\nNow transposition behaves as expected:\nprint(x_row.T.shape)\nOutput:\n(5, 1)\nand:\nprint(x_col.T.shape)\nOutput:\n(1, 5)\nCommon coding mistake: In mathematical writing, \\(x^\\top\\) changes a column vector into a row vector. In NumPy, x.T does not change the shape of a 1-D array. Use reshape, [:, None], or another dimension-expanding operation when you need an explicit row or column. From one house to a dataset matrix Machine learning rarely uses only one house. 
Usually, we have many houses.\nSuppose we have three houses:\nHouse Area Bedrooms Floors Age Distance North South Price 1 1200 3 2 10 3.2 1 0 9.0 2 850 2 1 18 8.5 0 1 6.5 3 1600 4 2 5 1.1 1 0 12.0 If price is the target, then the input matrix is:\n$$ X = \\begin{bmatrix} 1200 \u0026 3 \u0026 2 \u0026 10 \u0026 3.2 \u0026 1 \u0026 0 \\\\ 850 \u0026 2 \u0026 1 \u0026 18 \u0026 8.5 \u0026 0 \u0026 1 \\\\ 1600 \u0026 4 \u0026 2 \u0026 5 \u0026 1.1 \u0026 1 \u0026 0 \\end{bmatrix} $$The target vector is:\n$$ y = \\begin{bmatrix} 9.0 \\\\ 6.5 \\\\ 12.0 \\end{bmatrix} $$Here:\n$$ X \\in \\mathbb{R}^{3 \\times 7} $$and:\n$$ y \\in \\mathbb{R}^{3} $$or, if written explicitly as a column:\n$$ y \\in \\mathbb{R}^{3 \\times 1} $$Rows as examples Each row of \\(X\\) is one house.\nThe first row is:\n$$ \\begin{bmatrix} 1200 \u0026 3 \u0026 2 \u0026 10 \u0026 3.2 \u0026 1 \u0026 0 \\end{bmatrix} $$This represents House 1.\nThe second row is:\n$$ \\begin{bmatrix} 850 \u0026 2 \u0026 1 \u0026 18 \u0026 8.5 \u0026 0 \u0026 1 \\end{bmatrix} $$This represents House 2.\nThe third row is:\n$$ \\begin{bmatrix} 1600 \u0026 4 \u0026 2 \u0026 5 \u0026 1.1 \u0026 1 \u0026 0 \\end{bmatrix} $$This represents House 3.\nSo the number of rows is the number of data points:\n$$ n = 3 $$Columns as features Each column of \\(X\\) is one feature.\nThe first column is area:\n$$ \\begin{bmatrix} 1200 \\\\ 850 \\\\ 1600 \\end{bmatrix} $$The second column is bedrooms:\n$$ \\begin{bmatrix} 3 \\\\ 2 \\\\ 4 \\end{bmatrix} $$The fifth column is distance:\n$$ \\begin{bmatrix} 3.2 \\\\ 8.5 \\\\ 1.1 \\end{bmatrix} $$So the number of columns is the number of input features:\n$$ d = 7 $$The design matrix In machine learning, the input matrix is often called the design matrix.\nWe usually write:\n$$ X \\in \\mathbb{R}^{n \\times d} $$where:\n\\(n\\) is the number of examples, \\(d\\) is the number of features. 
The \\(i\\)-th data point is often written as:\n$$ x_i \\in \\mathbb{R}^d $$If \\(x_i\\) is treated as a column vector, then the dataset matrix is written as:\n$$ X = \\begin{bmatrix} x_1^\\top \\\\ x_2^\\top \\\\ \\vdots \\\\ x_n^\\top \\end{bmatrix} $$This notation is very important.\nIt says:\neach \\(x_i\\) is naturally a column vector, each row of \\(X\\) is \\(x_i^\\top\\), the full dataset stacks transposed data points row by row. So:\n$$ x_i \\in \\mathbb{R}^{d} $$but:\n$$ x_i^\\top \\in \\mathbb{R}^{1 \\times d} $$and:\n$$ X \\in \\mathbb{R}^{n \\times d} $$What is a vector mathematically? A vector is not merely a list of numbers. A vector is an object that can be added to other vectors and multiplied by scalars while staying inside the same space.\nFor example, if:\n$$ u = \\begin{bmatrix} 1 \\\\ 2 \\\\ 3 \\end{bmatrix} $$and:\n$$ v = \\begin{bmatrix} 4 \\\\ 5 \\\\ 6 \\end{bmatrix} $$then:\n$$ u + v = \\begin{bmatrix} 1+4 \\\\ 2+5 \\\\ 3+6 \\end{bmatrix} = \\begin{bmatrix} 5 \\\\ 7 \\\\ 9 \\end{bmatrix} $$This result is still a vector in \\(\\mathbb{R}^3\\).\nVector addition For two vectors:\n$$ u = \\begin{bmatrix} u_1 \\\\ u_2 \\\\ \\vdots \\\\ u_d \\end{bmatrix} $$and:\n$$ v = \\begin{bmatrix} v_1 \\\\ v_2 \\\\ \\vdots \\\\ v_d \\end{bmatrix} $$their sum is:\n$$ u+v = \\begin{bmatrix} u_1+v_1 \\\\ u_2+v_2 \\\\ \\vdots \\\\ u_d+v_d \\end{bmatrix} $$Vectors can be added only when their dimensions match.\nFor example:\n$$ \\begin{bmatrix} 1 \\\\ 2 \\\\ 3 \\end{bmatrix} + \\begin{bmatrix} 4 \\\\ 5 \\\\ 6 \\end{bmatrix} = \\begin{bmatrix} 5 \\\\ 7 \\\\ 9 \\end{bmatrix} $$But this is not valid:\n$$ \\begin{bmatrix} 1 \\\\ 2 \\\\ 3 \\end{bmatrix} + \\begin{bmatrix} 4 \\\\ 5 \\end{bmatrix} $$because the first vector is in \\(\\mathbb{R}^3\\), while the second is in \\(\\mathbb{R}^2\\).\nScalar multiplication If:\n$$ u = \\begin{bmatrix} 1 \\\\ 2 \\\\ 3 \\end{bmatrix} $$and:\n$$ \\alpha = 2 $$then:\n$$ \\alpha u = 2 \\begin{bmatrix} 1 \\\\ 2 \\\\ 3 
\\end{bmatrix} = \\begin{bmatrix} 2 \\\\ 4 \\\\ 6 \\end{bmatrix} $$In general:\n$$ \\alpha u = \\begin{bmatrix} \\alpha u_1 \\\\ \\alpha u_2 \\\\ \\vdots \\\\ \\alpha u_d \\end{bmatrix} $$This operation scales the vector.\nIn feature space, scaling a vector by \\(\\alpha \u003e 0\\) multiplies its distance from the origin by \\(\\alpha\\) but leaves its direction unchanged.\nNorm, distance, and similarity The length of a vector is called its norm.\nThe most common norm is the Euclidean norm:\n$$ \\|x\\|_2 = \\sqrt{x_1^2+x_2^2+\\cdots+x_d^2} $$For example:\n$$ x = \\begin{bmatrix} 3 \\\\ 4 \\end{bmatrix} $$Then:\n$$ \\|x\\|_2 = \\sqrt{3^2+4^2} $$$$ \\|x\\|_2 = \\sqrt{9+16} $$$$ \\|x\\|_2 = \\sqrt{25} $$$$ \\|x\\|_2 = 5 $$This is the same geometry as the Pythagorean theorem.\nDistance between two vectors is computed by subtracting them first:\n$$ \\text{distance}(u,v) = \\|u-v\\|_2 $$Suppose:\n$$ u = \\begin{bmatrix} 1200 \\\\ 3 \\end{bmatrix} $$and:\n$$ v = \\begin{bmatrix} 1500 \\\\ 4 \\end{bmatrix} $$Then:\n$$ u-v = \\begin{bmatrix} -300 \\\\ -1 \\end{bmatrix} $$So:\n$$ \\|u-v\\|_2 = \\sqrt{(-300)^2+(-1)^2} $$$$ \\|u-v\\|_2 = \\sqrt{90000+1} $$$$ \\|u-v\\|_2 \\approx 300.0017 $$The distance is dominated by the area difference because area has a much larger numerical scale than bedrooms.\nThis is why raw feature distances can be misleading.\nIf we standardize the features, the comparison becomes more balanced.\nFor a feature \\(x_j\\), standardization often uses:\n$$ z_j = \\frac{x_j-\\mu_j}{\\sigma_j} $$where:\n\\(\\mu_j\\) is the mean of feature \\(j\\), \\(\\sigma_j\\) is the standard deviation of feature \\(j\\). 
For example, suppose:\n$$ \\mu_{\\text{area}} = 1200,\\qquad \\sigma_{\\text{area}} = 300 $$and:\n$$ \\mu_{\\text{bedrooms}} = 3,\\qquad \\sigma_{\\text{bedrooms}} = 1 $$For a house with:\n$$ \\text{area}=1500,\\qquad \\text{bedrooms}=4 $$we get:\n$$ z_{\\text{area}} = \\frac{1500-1200}{300}=1 $$and:\n$$ z_{\\text{bedrooms}} = \\frac{4-3}{1}=1 $$So the standardized vector is:\n$$ z = \\begin{bmatrix} 1 \\\\ 1 \\end{bmatrix} $$Now both features are measured in comparable standardized units.\nLinear combinations A linear combination of vectors is formed by multiplying vectors by scalars and adding them.\nIf:\n$$ u = \\begin{bmatrix} 1 \\\\ 0 \\end{bmatrix} $$and:\n$$ v = \\begin{bmatrix} 0 \\\\ 1 \\end{bmatrix} $$then:\n$$ 3u + 4v = 3 \\begin{bmatrix} 1 \\\\ 0 \\end{bmatrix} + 4 \\begin{bmatrix} 0 \\\\ 1 \\end{bmatrix} $$$$ 3u + 4v = \\begin{bmatrix} 3 \\\\ 0 \\end{bmatrix} + \\begin{bmatrix} 0 \\\\ 4 \\end{bmatrix} $$$$ 3u + 4v = \\begin{bmatrix} 3 \\\\ 4 \\end{bmatrix} $$Machine-learning models are full of linear combinations.\nA neuron computes a weighted combination of input features.\nA linear regression model computes a weighted combination of input features.\nA matrix multiplication computes many linear combinations at once.\nThe first machine-learning model: a weighted sum Now suppose we want to predict house price from features.\nUse four features:\n$$ x = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 10 \\\\ 3.2 \\end{bmatrix} $$where:\n\\(x_1=1200\\) is area, \\(x_2=3\\) is bedrooms, \\(x_3=10\\) is age, \\(x_4=3.2\\) is distance. 
Let the model weights be:\n$$ w = \\begin{bmatrix} 0.003 \\\\ 0.8 \\\\ -0.02 \\\\ -0.15 \\end{bmatrix} $$and let the bias be:\n$$ b = 1.5 $$A simple linear model predicts:\n$$ \\hat{y} = x^\\top w + b $$Dot product The dot product is:\n$$ x^\\top w = \\begin{bmatrix} 1200 \u0026 3 \u0026 10 \u0026 3.2 \\end{bmatrix} \\begin{bmatrix} 0.003 \\\\ 0.8 \\\\ -0.02 \\\\ -0.15 \\end{bmatrix} $$Multiply matching components and add:\n$$ x^\\top w = 1200(0.003) + 3(0.8) + 10(-0.02) + 3.2(-0.15) $$Now compute each part:\n$$ 1200(0.003)=3.6 $$$$ 3(0.8)=2.4 $$$$ 10(-0.02)=-0.2 $$$$ 3.2(-0.15)=-0.48 $$So:\n$$ x^\\top w = 3.6 + 2.4 - 0.2 - 0.48 $$$$ x^\\top w = 5.32 $$Now add the bias:\n$$ \\hat{y} = 5.32 + 1.5 $$$$ \\hat{y} = 6.82 $$So the model predicts:\n$$ \\hat{y} = 6.82 $$If the true price is:\n$$ y = 9.0 $$then the model underpredicts the price.\nPrediction The prediction formula:\n$$ \\hat{y} = x^\\top w + b $$has a simple interpretation.\nEach feature contributes something:\nFeature Value Weight Contribution Area 1200 0.003 3.6 Bedrooms 3 0.8 2.4 Age 10 -0.02 -0.2 Distance 3.2 -0.15 -0.48 The bias contributes:\n$$ b = 1.5 $$Total:\n$$ 3.6 + 2.4 - 0.2 - 0.48 + 1.5 = 6.82 $$The positive weights increase the prediction. The negative weights decrease the prediction.\nIn this toy example:\nlarger area increases predicted price, more bedrooms increase predicted price, older age decreases predicted price, larger distance from city center decreases predicted price. 
Error and loss The prediction error is:\n$$ e = \\hat{y} - y $$Using our values:\n$$ e = 6.82 - 9.0 $$$$ e = -2.18 $$The model prediction is 2.18 units below the true price.\nA common loss for regression is squared error:\n$$ L = (\\hat{y}-y)^2 $$So:\n$$ L = (-2.18)^2 $$$$ L = 4.7524 $$The goal of training is to adjust \\(w\\) and \\(b\\) so that the loss becomes smaller across the training dataset.\nFirst machine-learning pattern: A large part of machine learning can be seen as choosing parameters so that predictions become close to targets. Batch prediction Now suppose we have three houses and four features:\n$$ X = \\begin{bmatrix} 1200 \u0026 3 \u0026 10 \u0026 3.2 \\\\ 850 \u0026 2 \u0026 18 \u0026 8.5 \\\\ 1600 \u0026 4 \u0026 5 \u0026 1.1 \\end{bmatrix} $$The weight vector is:\n$$ w = \\begin{bmatrix} 0.003 \\\\ 0.8 \\\\ -0.02 \\\\ -0.15 \\end{bmatrix} $$The bias is:\n$$ b = 1.5 $$The batch prediction is:\n$$ \\hat{y} = Xw + b $$Strictly, the scalar \\(b\\) is added to every entry of \\(Xw\\), one entry per house.\nFirst compute:\n$$ Xw = \\begin{bmatrix} 1200 \u0026 3 \u0026 10 \u0026 3.2 \\\\ 850 \u0026 2 \u0026 18 \u0026 8.5 \\\\ 1600 \u0026 4 \u0026 5 \u0026 1.1 \\end{bmatrix} \\begin{bmatrix} 0.003 \\\\ 0.8 \\\\ -0.02 \\\\ -0.15 \\end{bmatrix} $$For House 1:\n$$ 1200(0.003)+3(0.8)+10(-0.02)+3.2(-0.15)=5.32 $$For House 2:\n$$ 850(0.003)+2(0.8)+18(-0.02)+8.5(-0.15) $$$$ =2.55+1.6-0.36-1.275 $$$$ =2.515 $$For House 3:\n$$ 1600(0.003)+4(0.8)+5(-0.02)+1.1(-0.15) $$$$ =4.8+3.2-0.1-0.165 $$$$ =7.735 $$So:\n$$ Xw = \\begin{bmatrix} 5.32 \\\\ 2.515 \\\\ 7.735 \\end{bmatrix} $$Now add \\(b=1.5\\):\n$$ \\hat{y} = \\begin{bmatrix} 5.32 \\\\ 2.515 \\\\ 7.735 \\end{bmatrix} + \\begin{bmatrix} 1.5 \\\\ 1.5 \\\\ 1.5 \\end{bmatrix} $$Therefore:\n$$ \\hat{y} = \\begin{bmatrix} 6.82 \\\\ 4.015 \\\\ 9.235 \\end{bmatrix} $$This is one prediction for each house.\nShape-wise:\n$$ X \\in \\mathbb{R}^{3 \\times 4} $$$$ w \\in \\mathbb{R}^{4 \\times 1} $$Therefore:\n$$ Xw \\in \\mathbb{R}^{3 \\times 1} $$The feature 
dimension (4) must match:\n$$ (3 \\times 4)(4 \\times 1) = (3 \\times 1) $$Neural-network view A neural network uses the same idea, but repeats it many times.\nOne neuron A single neuron receives an input vector:\n$$ x = \\begin{bmatrix} x_1 \\\\ x_2 \\\\ \\vdots \\\\ x_d \\end{bmatrix} $$It has a weight vector:\n$$ w = \\begin{bmatrix} w_1 \\\\ w_2 \\\\ \\vdots \\\\ w_d \\end{bmatrix} $$It computes:\n$$ z = w^\\top x + b $$Expanded:\n$$ z = w_1x_1+w_2x_2+\\cdots+w_dx_d+b $$Then it applies an activation function:\n$$ a = \\sigma(z) $$For example, if \\(\\sigma\\) is ReLU:\n$$ a = \\max(0,z) $$So a neuron is a weighted sum followed by a nonlinear function.\nOne layer A layer contains many neurons.\nSuppose the input has \\(d=4\\) features and the layer has \\(h=3\\) neurons.\nThe input is:\n$$ x \\in \\mathbb{R}^{4} $$The weight matrix is:\n$$ W \\in \\mathbb{R}^{4 \\times 3} $$The bias vector is:\n$$ b \\in \\mathbb{R}^{3} $$If \\(x\\) is written as a row vector, then the layer computes:\n$$ z = xW + b $$The shape is:\n$$ (1 \\times 4)(4 \\times 3) = 1 \\times 3 $$So:\n$$ z \\in \\mathbb{R}^{3} $$Each component of \\(z\\) is the pre-activation value of one neuron.\nFor example:\n$$ z = \\begin{bmatrix} z_1 \u0026 z_2 \u0026 z_3 \\end{bmatrix} $$Then:\n$$ a = \\begin{bmatrix} \\sigma(z_1) \u0026 \\sigma(z_2) \u0026 \\sigma(z_3) \\end{bmatrix} $$Batch input In deep learning, we usually process many examples at once.\nLet:\n$$ X \\in \\mathbb{R}^{n \\times d} $$where:\n\\(n\\) is the batch size, \\(d\\) is the number of features. Let:\n$$ W \\in \\mathbb{R}^{d \\times h} $$where:\n\\(h\\) is the number of neurons in the layer. 
Then:\n$$ Z = XW + b $$The shape is:\n$$ (n \\times d)(d \\times h) = n \\times h $$So:\n$$ Z \\in \\mathbb{R}^{n \\times h} $$For example, if:\n$$ n=32,\\qquad d=7,\\qquad h=64 $$then:\n$$ X \\in \\mathbb{R}^{32 \\times 7} $$$$ W \\in \\mathbb{R}^{7 \\times 64} $$and:\n$$ Z = XW + b \\in \\mathbb{R}^{32 \\times 64} $$This means:\n32 examples are processed together, each example has 7 input features, the layer produces 64 hidden features for each example. In code, this often appears as:\nimport torch\nX = torch.randn(32, 7)  # 32 houses, 7 features each\nW = torch.randn(7, 64)  # maps 7 input features to 64 hidden features\nb = torch.randn(64)  # one bias per hidden neuron\nZ = X @ W + b\nprint(Z.shape)\nOutput:\ntorch.Size([32, 64])\nThis is the same mathematics:\n$$ Z = XW + b $$The code and the mathematics are two views of the same operation.\nCommon mistakes Mistake 1: Confusing a scalar with a vector A scalar is one number:\n$$ x = 9.0 $$A vector is an ordered collection of numbers:\n$$ x = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\end{bmatrix} $$Do not call every number a vector.\nMistake 2: Forgetting what the components mean The vector:\n$$ x = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\end{bmatrix} $$is meaningful only if we know the feature order.\nIt may mean:\n$$ [ \\text{area}, \\text{bedrooms}, \\text{floors}, \\text{age} ] $$Without feature names, the numbers lose their interpretation.\nMistake 3: Treating categorical labels as ordinary numbers If:\n$$ \\text{North}=1,\\qquad \\text{South}=2,\\qquad \\text{East}=3 $$the model may interpret this as an artificial ordering.\nOne-hot encoding avoids this simple problem by representing categories as indicator vectors.\nMistake 4: Ignoring units and scales A feature like area may be around 1200, while bedrooms may be around 3.\nA model may be affected strongly by the feature with the larger numerical scale.\nThis is why standardization is often useful:\n$$ z_j = \\frac{x_j-\\mu_j}{\\sigma_j} $$Mistake 5: 
Thinking a 1-D array has row or column orientation In mathematical notation:\n$$ x^\\top $$has a clear meaning.\nBut in NumPy, an array with shape:\n(5,)\nis neither explicitly a row vector nor explicitly a column vector.\nFor explicit orientation, use:\n(1, 5)\nor:\n(5, 1)\nMistake 6: Multiplying incompatible shapes The product:\n$$ Xw $$is valid when:\n$$ X \\in \\mathbb{R}^{n \\times d} $$and:\n$$ w \\in \\mathbb{R}^{d \\times 1} $$because the inner dimensions match:\n$$ (n \\times d)(d \\times 1) $$But this is not valid:\n$$ (n \\times d)(n \\times 1) $$unless \\(d=n\\), and even then it would usually not mean the desired operation.\nShape habit: Before doing matrix multiplication, always write the shapes. Most linear algebra mistakes in machine learning are shape mistakes. Section summary We began with one number:\n$$ x = 9.0 $$This is a scalar.\nThen we described one house using many numbers:\n$$ x = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\end{bmatrix} $$This is a vector.\nWe learned that a row vector has shape:\n$$ 1 \\times d $$while a column vector has shape:\n$$ d \\times 1 $$We learned that transpose changes rows into columns:\n$$ x_{\\text{row}}^\\top = x_{\\text{col}} $$We then stacked many examples into a dataset matrix:\n$$ X \\in \\mathbb{R}^{n \\times d} $$where rows are examples and columns are features.\nFinally, we saw the first machine-learning model:\n$$ \\hat{y} = x^\\top w + b $$and its batch form:\n$$ \\hat{y} = Xw + b $$This same structure appears inside neural networks:\n$$ Z = XW + b $$So the path from data to deep learning begins with a simple idea:\none object becomes numbers, numbers become a vector, vectors become matrices, and matrices become the language of machine learning.\n","date":"2026-04-27","description":"A careful introduction to how a single house measurement becomes a scalar, how many measurements become a vector, how rows become datasets, and how vectors are used in machine learning and neural 
networks.","featured":false,"featured_image":"mathematicsforai/Chapter_01_Language_illustrative_image.png","permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-01/section-01/","popular":false,"readingTime":23,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-01/section-01/","section":"courses","series":"","summary":"We begin with one number, then build toward feature vectors, row and column vectors, transposes, dataset matrices, and the first mathematical form of a machine-learning model.","tags":["linear algebra","vectors","machine learning","data representation","transpose"],"title":"1.1 From Scalars to Vectors: Data Points, Rows, Columns, and Transpose","type":"courses"},{"categories":["Course"],"content":"Why matrices and tensors come next In the previous section, we learned how one object can become a vector.\nA house became:\n$$ x = \\begin{bmatrix} 1200 \\\\ 3 \\\\ 2 \\\\ 10 \\\\ 3.2 \\end{bmatrix} $$This vector represented one house using several features.\nBut machine learning almost never works with only one object.\nUsually, we have many houses, many images, many words, many patients, many time steps, or many training examples. Once we collect many vectors together, we naturally arrive at a matrix.\nAnd once data has more than two directions — for example height, width, and color channels in an image — we naturally arrive at a tensor.\nMIT’s 18.06 course describes linear algebra as matrix theory with emphasis on systems of equations, vector spaces, determinants, eigenvalues, and positive definite matrices, all of which are foundational for applied mathematics and AI. (MIT OpenCourseWare)\nSo the path is:\n$$ \\text{scalar} \\longrightarrow \\text{vector} \\longrightarrow \\text{matrix} \\longrightarrow \\text{tensor}. $$A scalar is one number.\nA vector is a one-dimensional collection of numbers.\nA matrix is a two-dimensional collection of numbers.\nA tensor, in the machine-learning sense, is a multidimensional array. 
Kolda and Bader define a tensor as a multidimensional or \\(N\\)-way array, while also noting the more formal view of an \\(N\\)-th order tensor as an element of a tensor product of \\(N\\) vector spaces.\nImportant language note: In deep learning libraries such as PyTorch and TensorFlow, the word tensor usually means a multidimensional numerical array. In more advanced mathematics, a tensor can mean an element of a tensor product space or a multilinear map. These views are related, but they are not always used with the same level of precision. Real-world example 1: a house dataset as a matrix Suppose we collect data from three houses.\nHouse | Area | Bedrooms | Floors | Age | Distance | Price\n1 | 1200 | 3 | 2 | 10 | 3.2 | 9.0\n2 | 850 | 2 | 1 | 18 | 8.5 | 6.5\n3 | 1600 | 4 | 2 | 5 | 1.1 | 12.0\nIf price is the target, then the input features are:\n$$ X = \\begin{bmatrix} 1200 \u0026 3 \u0026 2 \u0026 10 \u0026 3.2 \\\\ 850 \u0026 2 \u0026 1 \u0026 18 \u0026 8.5 \\\\ 1600 \u0026 4 \u0026 2 \u0026 5 \u0026 1.1 \\end{bmatrix} $$and the target vector is:\n$$ y = \\begin{bmatrix} 9.0 \\\\ 6.5 \\\\ 12.0 \\end{bmatrix}. $$Here:\n$$ X \\in \\mathbb{R}^{3 \\times 5} $$and:\n$$ y \\in \\mathbb{R}^{3}. $$The matrix \\(X\\) has:\n3 rows, 5 columns, 15 entries. Each row is one house.\nEach column is one feature.\nSo a matrix lets us store many feature vectors in one object.\nIn machine learning, this matrix is often called the design matrix.\nReal-world example 2: a grayscale image as a matrix A grayscale image is also naturally represented as a matrix.\nSuppose we have a tiny \\(4 \\times 5\\) grayscale image. Each number represents pixel intensity.\n$$ I = \\begin{bmatrix} 0 \u0026 20 \u0026 50 \u0026 20 \u0026 0 \\\\ 10 \u0026 80 \u0026 150 \u0026 80 \u0026 10 \\\\ 20 \u0026 120 \u0026 255 \u0026 120 \u0026 20 \\\\ 0 \u0026 30 \u0026 60 \u0026 30 \u0026 0 \\end{bmatrix} $$Here:\n$$ I \\in \\mathbb{R}^{4 \\times 5}. 
$$The first dimension is height.\nThe second dimension is width.\nThe entry \\(I_{ij}\\) is the brightness of the pixel in row \\(i\\), column \\(j\\).\nFor example:\n$$ I_{3,3} = 255 $$is the brightest pixel in this small image.\nSo a grayscale image is not just a picture. To a model, it is a matrix.\nReal-world example 3: a color image as a tensor A color image has more structure.\nEach pixel usually has three color channels:\n$$ \\text{red},\\quad \\text{green},\\quad \\text{blue}. $$So instead of one number per pixel, we have three numbers per pixel.\nFor a color image with height \\(H\\), width \\(W\\), and 3 color channels, we can write:\n$$ X \\in \\mathbb{R}^{H \\times W \\times 3}. $$This is no longer a matrix.\nIt has three axes:\nheight, width, color channel. So it is a third-order tensor.\nFor a tiny \\(2 \\times 2\\) color image:\n$$ X \\in \\mathbb{R}^{2 \\times 2 \\times 3}. $$One pixel may be:\n$$ X_{1,1,:} {}={} \\begin{bmatrix} 255 \\\\ 0 \\\\ 0 \\end{bmatrix} $$which represents a red pixel.\nAnother pixel may be:\n$$ X_{1,2,:} {}={} \\begin{bmatrix} 0 \\\\ 255 \\\\ 0 \\end{bmatrix} $$which represents a green pixel.\nThe colon notation means:\nkeep all values along that axis.\nSo \\(X_{1,1,:}\\) means:\nrow 1, column 1, all color channels.\nReal-world example 4: a batch of images as a fourth-order tensor Deep learning usually processes many examples at once.\nSuppose we have a batch of 32 color images.\nEach image has size:\n$$ 224 \\times 224 $$and 3 color channels.\nThen the batch may be stored as:\n$$ X \\in \\mathbb{R}^{32 \\times 224 \\times 224 \\times 3}. 
$$The axes are:\nAxis | Meaning\n1 | batch/example index\n2 | image height\n3 | image width\n4 | color channel\nSo:\n$$ X_{n,i,j,c} $$means:\nthe value of color channel \\(c\\) at pixel row \\(i\\), pixel column \\(j\\), in image \\(n\\).\nThis is a fourth-order tensor.\nIn PyTorch, images are often stored in another layout:\n$$ X \\in \\mathbb{R}^{32 \\times 3 \\times 224 \\times 224} $$where the axes are:\nAxis | Meaning\n1 | batch\n2 | channel\n3 | height\n4 | width\nThe mathematics is the same, but the axis order is different.\nAxis order matters: The same data can be stored with different axis conventions. For images, TensorFlow often uses batch-height-width-channel format, while PyTorch often uses batch-channel-height-width format. The numerical values may be the same, but the shape interpretation changes. Real-world example 5: language data as tensors In natural language processing, a sentence is often converted into a sequence of token embeddings.\nSuppose:\nbatch size is \\(B\\), sequence length is \\(L\\), embedding dimension is \\(d\\). Then the input to a language model layer may be:\n$$ X \\in \\mathbb{R}^{B \\times L \\times d}. $$For example:\n$$ X \\in \\mathbb{R}^{16 \\times 128 \\times 768}. $$This means:\n16 sentences or text chunks, 128 tokens per sequence, 768 numbers per token embedding. So one token is a vector:\n$$ X_{b,\\ell,:} \\in \\mathbb{R}^{768}. $$One full sentence is a matrix:\n$$ X_{b,:,:} \\in \\mathbb{R}^{128 \\times 768}. $$The full batch is a third-order tensor:\n$$ X \\in \\mathbb{R}^{16 \\times 128 \\times 768}. $$In transformer models, attention scores may have shape:\n$$ A \\in \\mathbb{R}^{B \\times H \\times L \\times L} $$where \\(H\\) is the number of attention heads.\nSo modern AI is full of matrices and tensors.\nWhat is a matrix? A matrix is a rectangular array of numbers.\nIf a matrix has \\(m\\) rows and \\(n\\) columns, we write:\n$$ A \\in \\mathbb{R}^{m \\times n}. 
$$Stanford CS229 uses this notation: \\(A \\in \\mathbb{R}^{m \\times n}\\) denotes a matrix with \\(m\\) rows and \\(n\\) columns, with real-valued entries. (CS229)\nA general matrix looks like:\n$$ A = \\begin{bmatrix} a_{11} \u0026 a_{12} \u0026 \\cdots \u0026 a_{1n} \\\\ a_{21} \u0026 a_{22} \u0026 \\cdots \u0026 a_{2n} \\\\ \\vdots \u0026 \\vdots \u0026 \\ddots \u0026 \\vdots \\\\ a_{m1} \u0026 a_{m2} \u0026 \\cdots \u0026 a_{mn} \\end{bmatrix}. $$The entry \\(a_{ij}\\) means:\nthe entry in row \\(i\\), column \\(j\\).\nSo:\n$$ a_{23} $$means:\nrow 2, column 3.\nMatrix shape The shape of \\(A\\) is:\n$$ m \\times n. $$The first number counts rows.\nThe second number counts columns.\nFor example:\n$$ A = \\begin{bmatrix} 1 \u0026 2 \u0026 3 \\\\ 4 \u0026 5 \u0026 6 \\end{bmatrix} $$has 2 rows and 3 columns, so:\n$$ A \\in \\mathbb{R}^{2 \\times 3}. $$The entry in row 2, column 3 is:\n$$ a_{23} = 6. $$Rows and columns A matrix can be viewed in two complementary ways.\nIt can be viewed as a stack of rows:\n$$ A = \\begin{bmatrix} a_1^\\top \\\\ a_2^\\top \\\\ \\vdots \\\\ a_m^\\top \\end{bmatrix} $$where each row is a row vector.\nIt can also be viewed as a collection of columns:\n$$ A = \\begin{bmatrix} | \u0026 | \u0026 \u0026 | \\\\ a_1 \u0026 a_2 \u0026 \\cdots \u0026 a_n \\\\ | \u0026 | \u0026 \u0026 | \\end{bmatrix} $$where each column is a column vector.\nThis row-column viewpoint is important because matrix multiplication can be understood using rows, columns, dot products, and linear combinations. Stanford CS229 explicitly describes matrix-vector multiplication both as row inner products and as a linear combination of columns.\nMatrix addition Two matrices can be added only when they have the same shape.\nLet:\n$$ A = \\begin{bmatrix} 1 \u0026 2 \\\\ 3 \u0026 4 \\end{bmatrix} $$and:\n$$ B = \\begin{bmatrix} 10 \u0026 20 \\\\ 30 \u0026 40 \\end{bmatrix}. 
$$Then:\n$$ A+B = \\begin{bmatrix} 1+10 \u0026 2+20 \\\\ 3+30 \u0026 4+40 \\end{bmatrix} {}={} \\begin{bmatrix} 11 \u0026 22 \\\\ 33 \u0026 44 \\end{bmatrix}. $$In general, if:\n$$ A,B \\in \\mathbb{R}^{m \\times n}, $$then:\n$$ (A+B)_{ij} = A_{ij}+B_{ij}. $$But this is not valid:\n$$ \\mathbb{R}^{2 \\times 3} + \\mathbb{R}^{3 \\times 2}. $$The shapes do not match.\nScalar multiplication of a matrix A matrix can be multiplied by a scalar.\nLet:\n$$ A = \\begin{bmatrix} 1 \u0026 2 \\\\ 3 \u0026 4 \\end{bmatrix} $$and:\n$$ \\alpha = 3. $$Then:\n$$ \\alpha A = 3 \\begin{bmatrix} 1 \u0026 2 \\\\ 3 \u0026 4 \\end{bmatrix} {}={} \\begin{bmatrix} 3 \u0026 6 \\\\ 9 \u0026 12 \\end{bmatrix}. $$In general:\n$$ (\\alpha A)_{ij} = \\alpha A_{ij}. $$So scalar multiplication scales every entry.\nMatrix-vector multiplication Matrix-vector multiplication is one of the most important operations in AI.\nLet:\n$$ A = \\begin{bmatrix} 1 \u0026 2 \u0026 3 \\\\ 4 \u0026 5 \u0026 6 \\end{bmatrix} $$and:\n$$ x = \\begin{bmatrix} 10 \\\\ 20 \\\\ 30 \\end{bmatrix}. $$The shape of \\(A\\) is:\n$$ 2 \\times 3. $$The shape of \\(x\\) is:\n$$ 3 \\times 1. $$The product \\(Ax\\) is valid because the inner dimensions match:\n$$ (2 \\times 3)(3 \\times 1) = 2 \\times 1. $$Now compute:\n$$ Ax = \\begin{bmatrix} 1 \u0026 2 \u0026 3 \\\\ 4 \u0026 5 \u0026 6 \\end{bmatrix} \\begin{bmatrix} 10 \\\\ 20 \\\\ 30 \\end{bmatrix}. $$The first entry is:\n$$ 1(10)+2(20)+3(30)=10+40+90=140. $$The second entry is:\n$$ 4(10)+5(20)+6(30)=40+100+180=320. $$So:\n$$ Ax = \\begin{bmatrix} 140 \\\\ 320 \\end{bmatrix}. $$Row-dot-product view Each output entry is a dot product between one row of \\(A\\) and the vector \\(x\\).\nThe first row of \\(A\\) is:\n$$ \\begin{bmatrix} 1 \u0026 2 \u0026 3 \\end{bmatrix}. $$The second row is:\n$$ \\begin{bmatrix} 4 \u0026 5 \u0026 6 \\end{bmatrix}. $$So:\n$$ Ax = \\begin{bmatrix} \\text{row}_1(A)\\cdot x \\\\ \\text{row}_2(A)\\cdot x \\end{bmatrix}. 
$$In general, if:\n$$ A \\in \\mathbb{R}^{m \\times n} $$and:\n$$ x \\in \\mathbb{R}^{n}, $$then:\n$$ y = Ax \\in \\mathbb{R}^{m} $$with:\n$$ y_i = \\sum_{j=1}^{n} A_{ij}x_j. $$Column-linear-combination view The same product can be viewed another way.\nWrite \\(A\\) in columns:\n$$ A = \\begin{bmatrix} | \u0026 | \u0026 | \\\\ a_1 \u0026 a_2 \u0026 a_3 \\\\ | \u0026 | \u0026 | \\end{bmatrix}. $$Then:\n$$ Ax = x_1a_1+x_2a_2+x_3a_3. $$Using our example:\n$$ a_1 = \\begin{bmatrix} 1 \\\\ 4 \\end{bmatrix}, \\quad a_2 = \\begin{bmatrix} 2 \\\\ 5 \\end{bmatrix}, \\quad a_3 = \\begin{bmatrix} 3 \\\\ 6 \\end{bmatrix}. $$Since:\n$$ x = \\begin{bmatrix} 10 \\\\ 20 \\\\ 30 \\end{bmatrix}, $$we get:\n$$ Ax = 10 \\begin{bmatrix} 1 \\\\ 4 \\end{bmatrix} + 20 \\begin{bmatrix} 2 \\\\ 5 \\end{bmatrix} + 30 \\begin{bmatrix} 3 \\\\ 6 \\end{bmatrix}. $$So:\n$$ Ax = \\begin{bmatrix} 10 \\\\ 40 \\end{bmatrix} + \\begin{bmatrix} 40 \\\\ 100 \\end{bmatrix} + \\begin{bmatrix} 90 \\\\ 180 \\end{bmatrix} {}={} \\begin{bmatrix} 140 \\\\ 320 \\end{bmatrix}. $$Two views of Ax: The product \\(Ax\\) can be seen as row dot products or as a linear combination of the columns of \\(A\\). Both views are correct, and both are important in machine learning. Matrix-matrix multiplication Now suppose:\n$$ A \\in \\mathbb{R}^{m \\times n} $$and:\n$$ B \\in \\mathbb{R}^{n \\times p}. $$Then:\n$$ C = AB \\in \\mathbb{R}^{m \\times p}. $$Stanford CS229 defines the matrix product \\(C=AB\\) entrywise as:\n$$ C_{ij} = \\sum_{k=1}^{n} A_{ik}B_{kj}, $$with the requirement that the number of columns of \\(A\\) equals the number of rows of \\(B\\).\nNumerical example Let:\n$$ A = \\begin{bmatrix} 1 \u0026 2 \u0026 3 \\\\ 4 \u0026 5 \u0026 6 \\end{bmatrix} $$and:\n$$ B = \\begin{bmatrix} 1 \u0026 0 \\\\ 0 \u0026 1 \\\\ 1 \u0026 1 \\end{bmatrix}. $$Then:\n$$ A \\in \\mathbb{R}^{2 \\times 3}, \\qquad B \\in \\mathbb{R}^{3 \\times 2}. $$So:\n$$ AB \\in \\mathbb{R}^{2 \\times 2}. 
$$Now compute:\n$$ AB = \\begin{bmatrix} 1 \u0026 2 \u0026 3 \\\\ 4 \u0026 5 \u0026 6 \\end{bmatrix} \\begin{bmatrix} 1 \u0026 0 \\\\ 0 \u0026 1 \\\\ 1 \u0026 1 \\end{bmatrix}. $$The entry in row 1, column 1 is:\n$$ 1(1)+2(0)+3(1)=4. $$The entry in row 1, column 2 is:\n$$ 1(0)+2(1)+3(1)=5. $$The entry in row 2, column 1 is:\n$$ 4(1)+5(0)+6(1)=10. $$The entry in row 2, column 2 is:\n$$ 4(0)+5(1)+6(1)=11. $$Therefore:\n$$ AB = \\begin{bmatrix} 4 \u0026 5 \\\\ 10 \u0026 11 \\end{bmatrix}. $$Matrix multiplication is not usually commutative In ordinary arithmetic:\n$$ 2 \\cdot 3 = 3 \\cdot 2. $$But matrices do not usually behave this way.\nUsually:\n$$ AB \\neq BA. $$Sometimes \\(BA\\) is not even defined.\nIn the previous example:\n$$ A \\in \\mathbb{R}^{2 \\times 3} $$and:\n$$ B \\in \\mathbb{R}^{3 \\times 2}. $$So:\n$$ AB \\in \\mathbb{R}^{2 \\times 2} $$but:\n$$ BA \\in \\mathbb{R}^{3 \\times 3}. $$The shapes are different.\nSo \\(AB\\) and \\(BA\\) cannot be equal.\nOrder matters: Matrix multiplication represents structured operations. Changing the order changes the operation. In neural networks, \\(XW\\) and \\(WX\\) are usually completely different, or one of them is invalid. Matrices as linear maps A matrix is not only a table of numbers.\nA matrix can represent a linear transformation.\nLet:\n$$ A \\in \\mathbb{R}^{m \\times n}. $$Then \\(A\\) defines a function:\n$$ A:\\mathbb{R}^{n} \\to \\mathbb{R}^{m} $$by:\n$$ x \\mapsto Ax. $$This means:\nthe input vector has \\(n\\) components, the output vector has \\(m\\) components. For example, let:\n$$ A = \\begin{bmatrix} 2 \u0026 0 \\\\ 0 \u0026 1 \\end{bmatrix}. $$Then:\n$$ A \\begin{bmatrix} 3 \\\\ 4 \\end{bmatrix} {}={} \\begin{bmatrix} 6 \\\\ 4 \\end{bmatrix}. 
$$This matrix stretches the first coordinate by a factor of 2 and leaves the second coordinate unchanged.\nSo a matrix can be understood geometrically.\nIt transforms space.\nLinear maps preserve addition and scalar multiplication A function \\(T:\\mathbb{R}^n \\to \\mathbb{R}^m\\) is linear if:\n$$ T(u+v)=T(u)+T(v) $$and:\n$$ T(\\alpha u)=\\alpha T(u). $$Matrix multiplication satisfies these properties.\nIf:\n$$ T(x)=Ax, $$then:\n$$ T(u+v)=A(u+v)=Au+Av=T(u)+T(v) $$and:\n$$ T(\\alpha u)=A(\\alpha u)=\\alpha Au=\\alpha T(u). $$Therefore every matrix multiplication map is linear.\nThis is why matrices are the natural language of linear models.\nThe identity matrix The identity matrix is the matrix that does nothing.\nFor dimension \\(n\\), it is written:\n$$ I_n. $$For example:\n$$ I_3 = \\begin{bmatrix} 1 \u0026 0 \u0026 0 \\\\ 0 \u0026 1 \u0026 0 \\\\ 0 \u0026 0 \u0026 1 \\end{bmatrix}. $$For any vector \\(x \\in \\mathbb{R}^3\\):\n$$ I_3x=x. $$For example:\n$$ I_3 \\begin{bmatrix} 5 \\\\ -2 \\\\ 7 \\end{bmatrix} {}={} \\begin{bmatrix} 5 \\\\ -2 \\\\ 7 \\end{bmatrix}. $$The identity matrix plays the same role as the number 1 in ordinary multiplication.\nDiagonal matrices A diagonal matrix has nonzero entries only on the main diagonal.\nFor example:\n$$ D = \\begin{bmatrix} 2 \u0026 0 \u0026 0 \\\\ 0 \u0026 5 \u0026 0 \\\\ 0 \u0026 0 \u0026 -1 \\end{bmatrix}. $$Multiplying by \\(D\\) scales each coordinate separately.\nIf:\n$$ x = \\begin{bmatrix} x_1 \\\\ x_2 \\\\ x_3 \\end{bmatrix}, $$then:\n$$ Dx = \\begin{bmatrix} 2x_1 \\\\ 5x_2 \\\\ -x_3 \\end{bmatrix}. $$The Deep Learning book notes that diagonal matrices are computationally efficient because multiplying by a diagonal matrix only scales individual elements.\nIn machine learning, diagonal matrices appear in:\nfeature scaling, covariance approximations, normalization, singular value decomposition, optimization preconditioning. Symmetric matrices A square matrix is symmetric if:\n$$ A = A^\\top. 
$$For example:\n$$ A = \\begin{bmatrix} 2 \u0026 5 \\\\ 5 \u0026 3 \\end{bmatrix} $$is symmetric.\nThe entries mirror across the main diagonal.\nA distance matrix is often symmetric because the distance from object \\(i\\) to object \\(j\\) is usually the same as the distance from object \\(j\\) to object \\(i\\). The Deep Learning book gives this as a typical example of symmetry in matrices.\nThe trace of a matrix For a square matrix:\n$$ A = \\begin{bmatrix} a_{11} \u0026 a_{12} \u0026 \\cdots \u0026 a_{1n} \\\\ a_{21} \u0026 a_{22} \u0026 \\cdots \u0026 a_{2n} \\\\ \\vdots \u0026 \\vdots \u0026 \\ddots \u0026 \\vdots \\\\ a_{n1} \u0026 a_{n2} \u0026 \\cdots \u0026 a_{nn} \\end{bmatrix}, $$the trace is the sum of diagonal entries:\n$$ \\operatorname{tr}(A)=a_{11}+a_{22}+\\cdots+a_{nn}. $$For example:\n$$ A = \\begin{bmatrix} 2 \u0026 1 \\\\ 4 \u0026 5 \\end{bmatrix}. $$Then:\n$$ \\operatorname{tr}(A)=2+5=7. $$Trace appears frequently in matrix calculus, optimization, covariance analysis, and neural-network derivations.\nMatrix rank The rank of a matrix measures how many independent directions it contains.\nConsider:\n$$ A = \\begin{bmatrix} 1 \u0026 2 \\\\ 2 \u0026 4 \\end{bmatrix}. $$The second column is twice the first column:\n$$ \\begin{bmatrix} 2 \\\\ 4 \\end{bmatrix} {}={} 2 \\begin{bmatrix} 1 \\\\ 2 \\end{bmatrix}. $$So the two columns do not provide two independent directions.\nThe rank is:\n$$ \\operatorname{rank}(A)=1. $$Now consider:\n$$ B = \\begin{bmatrix} 1 \u0026 2 \\\\ 3 \u0026 4 \\end{bmatrix}. $$The columns are not scalar multiples of each other, so:\n$$ \\operatorname{rank}(B)=2. $$Stanford’s 2024 CS229 linear algebra review states that column rank is the largest number of linearly independent columns, row rank is the largest number of linearly independent rows, and these are equal, so both are called the rank of the matrix. 
(CS229)\nColumn space The column space of a matrix is the set of all linear combinations of its columns.\nIf:\n$$ A = \\begin{bmatrix} | \u0026 | \u0026 | \\\\ a_1 \u0026 a_2 \u0026 a_3 \\\\ | \u0026 | \u0026 | \\end{bmatrix}, $$then:\n$$ \\operatorname{Col}(A) {}={} \\{c_1a_1+c_2a_2+c_3a_3 :\\; c_1,c_2,c_3 \\in \\mathbb{R}\\}. $$The column space tells us which outputs are reachable by \\(Ax\\).\nIf:\n$$ y = Ax, $$then \\(y\\) must lie in the column space of \\(A\\).\nThis is central in solving linear systems.\nNull space The null space of \\(A\\) is the set of all vectors that are sent to zero:\n$$ \\operatorname{Null}(A) {}={} \\{x :\\; Ax=0\\}. $$For example:\n$$ A = \\begin{bmatrix} 1 \u0026 2 \\\\ 2 \u0026 4 \\end{bmatrix}. $$We want:\n$$ Ax=0. $$Let:\n$$ x = \\begin{bmatrix} x_1 \\\\ x_2 \\end{bmatrix}. $$Then:\n$$ \\begin{bmatrix} 1 \u0026 2 \\\\ 2 \u0026 4 \\end{bmatrix} \\begin{bmatrix} x_1 \\\\ x_2 \\end{bmatrix} {}={} \\begin{bmatrix} x_1+2x_2 \\\\ 2x_1+4x_2 \\end{bmatrix} {}={} \\begin{bmatrix} 0 \\\\ 0 \\end{bmatrix}. $$The second equation is just twice the first.\nSo we need:\n$$ x_1+2x_2=0. $$Thus:\n$$ x_1=-2x_2. $$Let:\n$$ x_2=t. $$Then:\n$$ x = \\begin{bmatrix} -2t \\\\ t \\end{bmatrix} {}={} t \\begin{bmatrix} -2 \\\\ 1 \\end{bmatrix}. $$So:\n$$ \\operatorname{Null}(A) {}={} \\operatorname{span} \\left\\{ \\begin{bmatrix} -2 \\\\ 1 \\end{bmatrix} \\right\\}. $$The null space tells us which inputs are invisible to the transformation.\nMatrix norms A norm measures size.\nFor a vector, the Euclidean norm is:\n$$ \\|x\\|_2 = \\sqrt{x_1^2+x_2^2+\\cdots+x_n^2}. $$For a matrix, a common norm is the Frobenius norm:\n$$ \\|A\\|_F = \\sqrt{\\sum_{i,j} A_{ij}^2}. $$For example:\n$$ A = \\begin{bmatrix} 1 \u0026 2 \\\\ 3 \u0026 4 \\end{bmatrix}. $$Then:\n$$ \\|A\\|_F = \\sqrt{1^2+2^2+3^2+4^2} {}={} \\sqrt{30}. 
$$The Deep Learning book notes that the Frobenius norm is commonly used in deep learning and is analogous to the vector \\(L^2\\) norm.\nMatrix norms are used in:\nregularization, stability analysis, optimization, low-rank approximation, neural-network weight control. Eigenvalues and eigenvectors For a square matrix \\(A\\), an eigenvector is a nonzero vector \\(v\\) such that:\n$$ Av = \\lambda v. $$Here:\n\\(v\\) is the eigenvector, \\(\\lambda\\) is the eigenvalue. This equation says:\napplying \\(A\\) to \\(v\\) only scales \\(v\\); it does not change its direction.\nFor example:\n$$ A = \\begin{bmatrix} 2 \u0026 0 \\\\ 0 \u0026 3 \\end{bmatrix}. $$Then:\n$$ A \\begin{bmatrix} 1 \\\\ 0 \\end{bmatrix} {}={} \\begin{bmatrix} 2 \\\\ 0 \\end{bmatrix} {}={} 2 \\begin{bmatrix} 1 \\\\ 0 \\end{bmatrix}. $$So:\n$$ v_1 = \\begin{bmatrix} 1 \\\\ 0 \\end{bmatrix} $$is an eigenvector with eigenvalue:\n$$ \\lambda_1 = 2. $$Similarly:\n$$ A \\begin{bmatrix} 0 \\\\ 1 \\end{bmatrix} {}={} \\begin{bmatrix} 0 \\\\ 3 \\end{bmatrix} {}={} 3 \\begin{bmatrix} 0 \\\\ 1 \\end{bmatrix}. $$So:\n$$ v_2 = \\begin{bmatrix} 0 \\\\ 1 \\end{bmatrix} $$is an eigenvector with eigenvalue:\n$$ \\lambda_2 = 3. $$The Deep Learning book explains eigendecomposition as a way to understand matrices by decomposing them into eigenvectors and eigenvalues.\nSingular value decomposition The singular value decomposition, or SVD, is one of the most important matrix factorizations.\nFor a matrix:\n$$ A \\in \\mathbb{R}^{m \\times n}, $$the SVD writes:\n$$ A = U\\Sigma V^\\top. $$Here:\n\\(U\\) contains left singular vectors, \\(\\Sigma\\) contains singular values, \\(V\\) contains right singular vectors. 
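As a small worked illustration (the matrix below is chosen here for convenience and is not taken from the cited sources), consider:\n$$ A = \\begin{bmatrix} 0 \u0026 2 \\\\ 3 \u0026 0 \\end{bmatrix}. $$Because \\(A\\) sends the first standard basis vector to 3 times the second, and the second standard basis vector to 2 times the first, one valid SVD is:\n$$ U = \\begin{bmatrix} 0 \u0026 1 \\\\ 1 \u0026 0 \\end{bmatrix}, \\qquad \\Sigma = \\begin{bmatrix} 3 \u0026 0 \\\\ 0 \u0026 2 \\end{bmatrix}, \\qquad V = \\begin{bmatrix} 1 \u0026 0 \\\\ 0 \u0026 1 \\end{bmatrix}. $$Checking the product:\n$$ U\\Sigma V^\\top = \\begin{bmatrix} 0 \u0026 1 \\\\ 1 \u0026 0 \\end{bmatrix} \\begin{bmatrix} 3 \u0026 0 \\\\ 0 \u0026 2 \\end{bmatrix} \\begin{bmatrix} 1 \u0026 0 \\\\ 0 \u0026 1 \\end{bmatrix} {}={} \\begin{bmatrix} 0 \u0026 2 \\\\ 3 \u0026 0 \\end{bmatrix} = A. $$The singular values here are 3 and 2, listed from largest to smallest.\n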
The Deep Learning book describes SVD as a factorization of a matrix into singular vectors and singular values, with \\(U\\) and \\(V\\) orthogonal and \\(\\Sigma\\) diagonal.\nSVD is used in:\ndimensionality reduction, principal component analysis, low-rank approximation, recommendation systems, image compression, numerical stability, understanding neural-network weight matrices. AI intuition: SVD tells us which directions in the input space are amplified most strongly by a matrix. In machine learning, this helps us understand compression, rank, conditioning, and feature directions. Matrices in a neural network layer A fully connected neural-network layer is a matrix operation.\nSuppose:\n$$ x \\in \\mathbb{R}^{d} $$is one input vector.\nLet:\n$$ W \\in \\mathbb{R}^{d \\times h} $$be the weight matrix.\nLet:\n$$ b \\in \\mathbb{R}^{h} $$be the bias vector.\nThen the layer computes:\n$$ z = x^\\top W + b. $$If \\(x\\) is written as a row vector:\n$$ x^\\top \\in \\mathbb{R}^{1 \\times d}, $$then:\n$$ (1 \\times d)(d \\times h)=1 \\times h. $$So:\n$$ z \\in \\mathbb{R}^{h}. $$For a batch:\n$$ X \\in \\mathbb{R}^{n \\times d}, $$the layer computes:\n$$ Z = XW+b. $$The shape is:\n$$ (n \\times d)(d \\times h)=n \\times h. $$So:\n$$ Z \\in \\mathbb{R}^{n \\times h}. 
$$This is the matrix form of many neurons computed at once.\nFrom matrices to tensors A matrix has two axes.\nA tensor may have more than two axes.\nExamples:\nObject | Shape | Mathematical type\nscalar | \\(()\\) | order-0 tensor\nvector | \\((d)\\) | order-1 tensor\nmatrix | \\((m,n)\\) | order-2 tensor\ngrayscale image | \\((H,W)\\) | order-2 tensor\ncolor image | \\((H,W,3)\\) | order-3 tensor\nbatch of color images | \\((N,H,W,3)\\) | order-4 tensor\nvideo | \\((T,H,W,3)\\) | order-4 tensor\nbatch of videos | \\((N,T,H,W,3)\\) | order-5 tensor\nKolda and Bader state that a first-order tensor is a vector, a second-order tensor is a matrix, and tensors of order three or higher are called higher-order tensors.\nTensor order, modes, and shape The order of a tensor is the number of axes.\nThe axes are often called modes.\nFor example:\n$$ X \\in \\mathbb{R}^{I \\times J \\times K} $$is a third-order tensor.\nIt has three modes:\nmode 1 has size \\(I\\), mode 2 has size \\(J\\), mode 3 has size \\(K\\). An entry is written:\n$$ x_{ijk}. $$This means:\nindex \\(i\\) along mode 1, index \\(j\\) along mode 2, index \\(k\\) along mode 3.\nFor an \\(N\\)-th order tensor:\n$$ \\mathcal{X} \\in \\mathbb{R}^{I_1 \\times I_2 \\times \\cdots \\times I_N}. $$An entry is:\n$$ x_{i_1 i_2 \\cdots i_N}. $$The total number of entries is:\n$$ I_1 I_2 \\cdots I_N. $$A small numerical tensor Let:\n$$ \\mathcal{X} \\in \\mathbb{R}^{2 \\times 3 \\times 2}. $$This tensor has:\n2 entries along mode 1, 3 entries along mode 2, 2 entries along mode 3. We can display it using two frontal slices.\nFirst slice:\n$$ X_{:,:,1} {}={} \\begin{bmatrix} 1 \u0026 2 \u0026 3 \\\\ 4 \u0026 5 \u0026 6 \\end{bmatrix}. $$Second slice:\n$$ X_{:,:,2} {}={} \\begin{bmatrix} 7 \u0026 8 \u0026 9 \\\\ 10 \u0026 11 \u0026 12 \\end{bmatrix}. $$Then:\n$$ x_{1,1,1}=1, \\qquad x_{2,3,1}=6, \\qquad x_{1,2,2}=8, \\qquad x_{2,3,2}=12. 
$$This tensor has:\n$$ 2 \\cdot 3 \\cdot 2 = 12 $$entries.\nTensor slices A slice is obtained by fixing one index.\nFor:\n$$ \\mathcal{X} \\in \\mathbb{R}^{2 \\times 3 \\times 2}, $$the slice:\n$$ X_{:,:,1} $$fixes the third index at 1.\nThe slice:\n$$ X_{:,:,2} $$fixes the third index at 2.\nEach slice is a matrix.\nFor a color image:\n$$ X \\in \\mathbb{R}^{H \\times W \\times 3}, $$the red channel is:\n$$ X_{:,:,1}. $$The green channel is:\n$$ X_{:,:,2}. $$The blue channel is:\n$$ X_{:,:,3}. $$So a color image can be seen as three matrices stacked together.\nTensor fibers A fiber is obtained by fixing all indices except one.\nFor a third-order tensor:\n$$ \\mathcal{X} \\in \\mathbb{R}^{I \\times J \\times K}, $$examples of fibers are:\n$$ \\mathcal{X}_{:,j,k}, \\qquad \\mathcal{X}_{i,:,k}, \\qquad \\mathcal{X}_{i,j,:}. $$Each fiber is a vector.\nIn a color image:\n$$ X_{i,j,:} $$is the RGB vector at pixel \\((i,j)\\).\nSo the pixel itself is a vector.\nTensor unfolding or matricization Sometimes we convert a tensor into a matrix.\nThis is called:\nunfolding, flattening, matricization. Kolda and Bader describe matricization as the process of reordering the elements of an \\(N\\)-way array into a matrix and note that the exact ordering convention may vary, as long as it is used consistently.\nFor:\n$$ \\mathcal{X} \\in \\mathbb{R}^{2 \\times 3 \\times 2}, $$one possible mode-1 unfolding is:\n$$ X_{(1)} {}={} \\begin{bmatrix} 1 \u0026 2 \u0026 3 \u0026 7 \u0026 8 \u0026 9 \\\\ 4 \u0026 5 \u0026 6 \u0026 10 \u0026 11 \u0026 12 \\end{bmatrix}. $$This gives:\n$$ X_{(1)} \\in \\mathbb{R}^{2 \\times 6}. $$One possible mode-2 unfolding is:\n$$ X_{(2)} {}={} \\begin{bmatrix} 1 \u0026 4 \u0026 7 \u0026 10 \\\\ 2 \u0026 5 \u0026 8 \u0026 11 \\\\ 3 \u0026 6 \u0026 9 \u0026 12 \\end{bmatrix}. $$This gives:\n$$ X_{(2)} \\in \\mathbb{R}^{3 \\times 4}. 
$$One possible mode-3 unfolding, using the same column ordering as the two unfoldings above, is:\n$$ X_{(3)} {}={} \\begin{bmatrix} 1 \u0026 4 \u0026 2 \u0026 5 \u0026 3 \u0026 6 \\\\ 7 \u0026 10 \u0026 8 \u0026 11 \u0026 9 \u0026 12 \\end{bmatrix}. $$This gives:\n$$ X_{(3)} \\in \\mathbb{R}^{2 \\times 6}. $$Unfolding convention: Different books and libraries may unfold tensors in different orders. The important rule is consistency. Once you choose an unfolding convention, use it consistently across all related calculations. Tensor addition and scalar multiplication Two tensors can be added if they have the same shape.\nIf:\n$$ \\mathcal{X},\\mathcal{Y} \\in \\mathbb{R}^{I_1 \\times I_2 \\times \\cdots \\times I_N}, $$then:\n$$ (\\mathcal{X}+\\mathcal{Y})_{i_1i_2\\cdots i_N} {}={} x_{i_1i_2\\cdots i_N} + y_{i_1i_2\\cdots i_N}. $$A tensor can also be multiplied by a scalar:\n$$ (\\alpha \\mathcal{X})_{i_1i_2\\cdots i_N} {}={} \\alpha x_{i_1i_2\\cdots i_N}. $$So tensors form vector spaces when their shape is fixed.\nFor example, all tensors in:\n$$ \\mathbb{R}^{2 \\times 3 \\times 2} $$can be added and scaled.\nTensor inner product If two tensors have the same shape, their inner product is the sum of elementwise products:\n$$ \\langle \\mathcal{X},\\mathcal{Y} \\rangle {}={} \\sum_{i_1=1}^{I_1} \\sum_{i_2=1}^{I_2} \\cdots \\sum_{i_N=1}^{I_N} x_{i_1i_2\\cdots i_N} y_{i_1i_2\\cdots i_N}. $$For matrices, this becomes:\n$$ \\langle A,B\\rangle {}={} \\sum_{i,j} A_{ij}B_{ij}. $$The Frobenius norm of a tensor is:\n$$ \\|\\mathcal{X}\\|_F {}={} \\sqrt{ \\sum_{i_1=1}^{I_1} \\sum_{i_2=1}^{I_2} \\cdots \\sum_{i_N=1}^{I_N} x_{i_1i_2\\cdots i_N}^2 }. $$This generalizes the Euclidean norm of vectors and the Frobenius norm of matrices.\nOuter product The outer product combines vectors to create higher-order objects.\nLet:\n$$ a = \\begin{bmatrix} 1 \\\\ 2 \\end{bmatrix} $$and:\n$$ b = \\begin{bmatrix} 3 \\\\ 4 \\\\ 5 \\end{bmatrix}. 
$$The outer product is:\n$$ a \\circ b = \\begin{bmatrix} 1 \\cdot 3 \u0026 1 \\cdot 4 \u0026 1 \\cdot 5 \\\\ 2 \\cdot 3 \u0026 2 \\cdot 4 \u0026 2 \\cdot 5 \\end{bmatrix} {}={} \\begin{bmatrix} 3 \u0026 4 \u0026 5 \\\\ 6 \u0026 8 \u0026 10 \\end{bmatrix}. $$This creates a matrix from two vectors.\nNow add a third vector:\n$$ c = \\begin{bmatrix} 10 \\\\ 20 \\end{bmatrix}. $$Then:\n$$ \\mathcal{X}=a\\circ b\\circ c $$is a third-order tensor with entries:\n$$ x_{ijk}=a_i b_j c_k. $$For example:\n$$ x_{1,2,1}=a_1b_2c_1=1(4)(10)=40. $$And:\n$$ x_{2,3,2}=a_2b_3c_2=2(5)(20)=200. $$A tensor formed from the outer product of vectors is called a rank-one tensor.\nTensor rank and CP decomposition For matrices, rank is the minimum number of rank-one matrices needed to add up to the matrix.\nFor higher-order tensors, one important idea is similar:\n$$ \\mathcal{X} \\approx \\sum_{r=1}^{R} a_r^{(1)} \\circ a_r^{(2)} \\circ \\cdots \\circ a_r^{(N)}. $$This is the idea behind CP decomposition, also known as CANDECOMP/PARAFAC.\nKolda and Bader describe CP as decomposing a tensor as a sum of rank-one tensors.\nFor a third-order tensor:\n$$ \\mathcal{X} \\in \\mathbb{R}^{I \\times J \\times K}, $$a CP decomposition has the form:\n$$ \\mathcal{X} \\approx \\sum_{r=1}^{R} a_r \\circ b_r \\circ c_r. $$Elementwise:\n$$ x_{ijk} \\approx \\sum_{r=1}^{R} a_{ir} b_{jr} c_{kr}. $$This is useful when a large tensor can be approximated by a small number of structured components.\nThe \\(n\\)-mode product A central tensor operation is the \\(n\\)-mode product.\nLet:\n$$ \\mathcal{X} \\in \\mathbb{R}^{I_1 \\times I_2 \\times \\cdots \\times I_N}. $$Let:\n$$ U \\in \\mathbb{R}^{J \\times I_n}. $$Then the \\(n\\)-mode product is written:\n$$ \\mathcal{Y} {}={} \\mathcal{X} \\times_n U. $$The output shape is:\n$$ \\mathcal{Y} \\in \\mathbb{R}^{I_1 \\times \\cdots \\times I_{n-1} \\times J \\times I_{n+1} \\times \\cdots \\times I_N}. 
$$So the size of mode \\(n\\) changes from \\(I_n\\) to \\(J\\).\nKolda and Bader define the \\(n\\)-mode matrix product of a tensor with a matrix and give its elementwise formula.\nElementwise:\n$$ (\\mathcal{X}\\times_n U)_{i_1\\cdots i_{n-1}j i_{n+1}\\cdots i_N} {}={} \\sum_{i_n=1}^{I_n} x_{i_1 i_2 \\cdots i_N} u_{j i_n}. $$This means:\nmultiply each mode-\\(n\\) fiber by the matrix \\(U\\).\nIn unfolded form:\n$$ Y_{(n)} = U X_{(n)}. $$This connects tensor multiplication back to matrix multiplication.\nTensor contraction Tensor contraction means summing over one or more shared indices.\nThe dot product is the simplest contraction:\n$$ x^\\top y {}={} \\sum_{i=1}^{d} x_i y_i. $$Matrix multiplication is also a contraction:\n$$ C_{ij} {}={} \\sum_{k=1}^{n} A_{ik}B_{kj}. $$The index \\(k\\) is repeated and summed over.\nA tensor contraction generalizes this idea.\nFor example, suppose:\n$$ \\mathcal{X} \\in \\mathbb{R}^{I \\times J \\times K} $$and:\n$$ v \\in \\mathbb{R}^{K}. $$Then contracting over the third mode gives a matrix:\n$$ Y_{ij} {}={} \\sum_{k=1}^{K} x_{ijk}v_k. $$So:\n$$ Y \\in \\mathbb{R}^{I \\times J}. $$This is exactly what happens when we collapse one tensor dimension by taking weighted sums.\nEinstein summation notation Because tensor formulas can become long, mathematicians and machine-learning libraries often use Einstein summation notation.\nInstead of writing:\n$$ C_{ij} {}={} \\sum_{k=1}^{n} A_{ik}B_{kj}, $$we write:\n$$ C_{ij}=A_{ik}B_{kj}. $$The repeated index \\(k\\) is automatically summed over.\nSimilarly:\n$$ y_i=A_{ij}x_j $$means:\n$$ y_i=\\sum_j A_{ij}x_j. 
$$In NumPy, PyTorch, and JAX, this idea appears in functions such as einsum.\nFor example:\nimport torch\nA = torch.randn(2, 3)\nB = torch.randn(3, 4)\nC = torch.einsum(\"ik,kj-\u003eij\", A, B)\nprint(C.shape)\nOutput:\ntorch.Size([2, 4])\nThe notation:\n\"ik,kj-\u003eij\" means:\n\\(A\\) has indices \\(i,k\\), \\(B\\) has indices \\(k,j\\), \\(k\\) is summed, output has indices \\(i,j\\). Tucker decomposition Another major tensor decomposition is the Tucker decomposition.\nFor a third-order tensor:\n$$ \\mathcal{X} \\in \\mathbb{R}^{I \\times J \\times K}, $$the Tucker decomposition writes:\n$$ \\mathcal{X} \\approx \\mathcal{G} \\times_1 A \\times_2 B \\times_3 C. $$Here:\n\\(\\mathcal{G}\\) is the core tensor, \\(A\\), \\(B\\), and \\(C\\) are factor matrices, each factor matrix acts along one tensor mode. Kolda and Bader describe Tucker decomposition as a form of higher-order PCA that decomposes a tensor into a core tensor multiplied or transformed by a matrix along each mode.\nElementwise:\n$$ x_{ijk} \\approx \\sum_{p=1}^{P} \\sum_{q=1}^{Q} \\sum_{r=1}^{R} g_{pqr}a_{ip}b_{jq}c_{kr}. $$For an \\(N\\)-way tensor:\n$$ \\mathcal{X} \\approx \\mathcal{G} \\times_1 A^{(1)} \\times_2 A^{(2)} \\cdots \\times_N A^{(N)}. $$The Tucker decomposition is useful for:\ncompression, denoising, feature extraction, dimensionality reduction, multilinear PCA, multiway data analysis. Formal mathematical view: tensor products So far, we have used the machine-learning view:\na tensor is a multidimensional array.\nNow we move to the more mathematical view.\nLet \\(V\\) and \\(W\\) be vector spaces.\nThe tensor product \\(V \\otimes W\\) is a vector space built from formal products:\n$$ v \\otimes w, \\qquad v \\in V,\\; w \\in W. 
$$These formal products satisfy bilinearity:\n$$ (v_1+v_2)\\otimes w {}={} v_1\\otimes w+v_2\\otimes w, $$$$ v\\otimes(w_1+w_2) {}={} v\\otimes w_1+v\\otimes w_2, $$$$ (\\alpha v)\\otimes w {}={} \\alpha(v\\otimes w) {}={} v\\otimes(\\alpha w). $$Lerman’s multilinear algebra notes describe two standard ways to think about tensors: as multilinear maps and as elements of tensor products of two or more vector spaces.\nThe tensor product has a universal property: bilinear maps out of \\(V \\times W\\) correspond to linear maps out of \\(V \\otimes W\\). This is the formal reason tensor products convert bilinear structure into linear structure.\nBasis of a tensor product space Let:\n$$ V = \\mathbb{R}^{m} $$with basis:\n$$ e_1,\\dots,e_m. $$Let:\n$$ W = \\mathbb{R}^{n} $$with basis:\n$$ f_1,\\dots,f_n. $$Then a basis for:\n$$ V \\otimes W $$is:\n$$ \\{e_i \\otimes f_j:\\; 1\\le i\\le m,\\;1\\le j\\le n\\}. $$So:\n$$ \\dim(V\\otimes W)=mn. $$This matches the number of entries in an \\(m \\times n\\) matrix.\nThat is why a matrix can be viewed as coordinates of an element in a tensor product space.\nMore generally, if:\n$$ V_1,\\dots,V_N $$are finite-dimensional vector spaces, then:\n$$ V_1 \\otimes V_2 \\otimes \\cdots \\otimes V_N $$is the natural space for order-\\(N\\) tensors.\nIf:\n$$ \\dim(V_k)=I_k, $$then:\n$$ \\dim(V_1 \\otimes \\cdots \\otimes V_N) {}={} I_1I_2\\cdots I_N. $$This equals the number of entries in an array of shape:\n$$ I_1 \\times I_2 \\times \\cdots \\times I_N. $$Tensors as multilinear maps A multilinear map is a function that is linear in each argument separately.\nFor example, a bilinear map satisfies:\n$$ B(\\alpha u_1+\\beta u_2,v) {}={} \\alpha B(u_1,v)+\\beta B(u_2,v) $$and:\n$$ B(u,\\alpha v_1+\\beta v_2) {}={} \\alpha B(u,v_1)+\\beta B(u,v_2). $$The dot product is bilinear:\n$$ \\langle u,v\\rangle {}={} u^\\top v. 
$$The determinant is multilinear in the columns of a matrix.\nLerman’s notes give the determinant as an example of an \\(n\\)-linear map when a matrix is viewed as a tuple of column vectors.\nIn this deeper mathematical view, tensors are not just arrays.\nThey encode multilinear relationships.\nWhy matrices are second-order tensors A vector has one index:\n$$ x_i. $$A matrix has two indices:\n$$ A_{ij}. $$A third-order tensor has three indices:\n$$ x_{ijk}. $$An \\(N\\)-th order tensor has \\(N\\) indices:\n$$ x_{i_1i_2\\cdots i_N}. $$So a matrix is naturally a second-order tensor because it requires two indices to identify one entry.\nBut in linear algebra, matrices receive special attention because they represent linear maps:\n$$ A:\\mathbb{R}^{n}\\to\\mathbb{R}^{m}. $$A general higher-order tensor does not automatically represent a simple linear map from one vector space to another.\nInstead, it often represents a multilinear object.\nImportant distinction: array tensor versus geometric tensor In machine learning, we often say:\na tensor is a multidimensional array.\nThis is practical and correct for numerical computing.\nBut in geometry and physics, tensors must obey transformation rules under changes of coordinates.\nKolda and Bader explicitly warn that the tensor-array notion in tensor decomposition literature should not be confused with tensor fields in physics and engineering.\nFor this course, we mainly use tensors in the AI sense:\n$$ \\text{tensor} = \\text{structured multidimensional numerical array}. $$But we also keep the deeper mathematical meaning in mind:\n$$ \\text{tensor} = \\text{element of a tensor product space or multilinear map}. $$Both views matter.\nTensors in convolutional neural networks A convolutional neural network uses tensors everywhere.\nSuppose the input batch is:\n$$ X \\in \\mathbb{R}^{N \\times C_{\\text{in}} \\times H \\times W}. 
$$Here:\n\\(N\\) is batch size, \\(C_{\\text{in}}\\) is number of input channels, \\(H\\) is height, \\(W\\) is width. A convolution kernel may have shape:\n$$ K \\in \\mathbb{R}^{C_{\\text{out}} \\times C_{\\text{in}} \\times k_H \\times k_W}. $$Here:\n\\(C_{\\text{out}}\\) is number of output channels, \\(C_{\\text{in}}\\) is number of input channels, \\(k_H\\) is kernel height, \\(k_W\\) is kernel width. The output is:\n$$ Y \\in \\mathbb{R}^{N \\times C_{\\text{out}} \\times H' \\times W'}. $$A simplified convolution formula is:\n$$ Y_{n,c_{\\text{out}},i,j} {}={} \\sum_{c_{\\text{in}}} \\sum_{u} \\sum_{v} K_{c_{\\text{out}},c_{\\text{in}},u,v} X_{n,c_{\\text{in}},i+u,j+v}. $$This is a tensor contraction.\nWe multiply kernel entries with input entries and sum over:\ninput channels, kernel height, kernel width. So convolution is not mysterious.\nIt is a structured tensor operation.\nTensors in transformer models Transformers also use tensors heavily.\nSuppose:\n$$ X \\in \\mathbb{R}^{B \\times L \\times d}. $$This is a batch of token embeddings.\nLinear projections produce:\n$$ Q = XW_Q, \\qquad K = XW_K, \\qquad V = XW_V. $$For multi-head attention, these are reshaped into:\n$$ Q,K,V \\in \\mathbb{R}^{B \\times H \\times L \\times d_h}. $$The attention score tensor has shape:\n$$ S \\in \\mathbb{R}^{B \\times H \\times L \\times L}. $$One formula is:\n$$ S_{b,h,i,j} {}={} \\sum_{r=1}^{d_h} Q_{b,h,i,r}K_{b,h,j,r}. 
$$This means:\nfor each batch item and each attention head, compare token \\(i\\) with token \\(j\\) by taking a dot product over the per-head dimension \\(d_h\\).\nAgain, this is a tensor contraction.\nThe repeated index \\(r\\) is summed.\nThe hierarchy of objects We can now organize the objects clearly.\nObject | Notation | Example shape | Number of indices\nscalar | \\(x\\) | \\(()\\) | 0\nvector | \\(x_i\\) | \\((d)\\) | 1\nmatrix | \\(A_{ij}\\) | \\((m,n)\\) | 2\nthird-order tensor | \\(x_{ijk}\\) | \\((I,J,K)\\) | 3\n\\(N\\)-th order tensor | \\(x_{i_1\\cdots i_N}\\) | \\((I_1,\\dots,I_N)\\) | \\(N\\)\nThe key idea is:\neach additional axis adds one more index.\nWhy tensors matter in AI Tensors matter because AI data is naturally multi-axis.\nImages have:\n$$ \\text{height} \\times \\text{width} \\times \\text{channels}. $$Videos have:\n$$ \\text{time} \\times \\text{height} \\times \\text{width} \\times \\text{channels}. $$Text batches have:\n$$ \\text{batch} \\times \\text{sequence} \\times \\text{embedding}. $$Attention has:\n$$ \\text{batch} \\times \\text{heads} \\times \\text{query positions} \\times \\text{key positions}. $$Convolution kernels have:\n$$ \\text{output channels} \\times \\text{input channels} \\times \\text{kernel height} \\times \\text{kernel width}. $$So tensors are not optional in deep learning.\nThey are the native format of modern AI computation.\nCommon mistakes Mistake 1: Calling every array a matrix A matrix has exactly two axes.\nA color image:\n$$ X \\in \\mathbb{R}^{H \\times W \\times 3} $$is not a matrix.\nIt is a third-order tensor.\nMistake 2: Ignoring axis meaning The shape:\n$$ (32,224,224,3) $$means something different from:\n$$ (32,3,224,224). $$The numbers are the same, but the axis meanings are different: the first is typically read as (batch, height, width, channels) and the second as (batch, channels, height, width).\nMistake 3: Confusing matrix multiplication with elementwise multiplication Matrix multiplication:\n$$ C_{ij}=\\sum_k A_{ik}B_{kj} $$is not the same as elementwise multiplication:\n$$ C_{ij}=A_{ij}B_{ij}. 
$$In Python, these are usually different operations.\nFor example, in NumPy:\nA @ B  # matrix multiplication\nA * B  # elementwise multiplication\nMistake 4: Forgetting batch dimensions A single image may have shape:\n(224, 224, 3)\nA batch of images may have shape:\n(32, 224, 224, 3)\nThe extra dimension is the batch dimension.\nMany deep learning errors come from forgetting whether the batch axis is present.\nMistake 5: Thinking tensors are always deep and abstract In machine learning, a tensor can be very practical.\nA tensor may simply be:\na batch of images, a batch of embeddings, a stack of time-series measurements, a collection of attention scores. The abstract mathematics becomes useful because it gives us a precise language for these objects.\nSection summary A matrix is a two-dimensional array:\n$$ A \\in \\mathbb{R}^{m \\times n}. $$Its entries are:\n$$ A_{ij}. $$A matrix can represent:\na data table, a grayscale image, a system of equations, a linear transformation, a neural-network weight layer. Matrix-vector multiplication is:\n$$ y=Ax, $$with:\n$$ y_i=\\sum_j A_{ij}x_j. $$Matrix-matrix multiplication is:\n$$ C=AB, $$with:\n$$ C_{ij}=\\sum_k A_{ik}B_{kj}. $$A tensor generalizes vectors and matrices to more axes:\n$$ \\mathcal{X} \\in \\mathbb{R}^{I_1 \\times I_2 \\times \\cdots \\times I_N}. $$Its entries are:\n$$ x_{i_1i_2\\cdots i_N}. $$A vector is a first-order tensor.\nA matrix is a second-order tensor.\nA color image is often a third-order tensor.\nA batch of color images is often a fourth-order tensor.\nA tensor can be sliced, unfolded, contracted, multiplied along modes, and decomposed.\nThe \\(n\\)-mode product is:\n$$ \\mathcal{Y} {}={} \\mathcal{X} \\times_n U. $$The CP decomposition writes a tensor approximately as a sum of rank-one tensors:\n$$ \\mathcal{X} \\approx \\sum_{r=1}^{R} a_r^{(1)} \\circ a_r^{(2)} \\circ \\cdots \\circ a_r^{(N)}. 
$$The Tucker decomposition writes:\n$$ \\mathcal{X} \\approx \\mathcal{G} \\times_1 A^{(1)} \\times_2 A^{(2)} \\cdots \\times_N A^{(N)}. $$At the deeper mathematical level, tensors can be understood as elements of tensor product spaces or as multilinear maps.\nSo the conceptual path is:\n$$ \\text{numbers} \\to \\text{vectors} \\to \\text{matrices} \\to \\text{tensors} \\to \\text{multilinear computation}. $$And this is exactly the path modern AI follows.\nSource anchors used for this section MIT OCW 18.06 identifies linear algebra as matrix theory with applications to systems of equations, vector spaces, determinants, eigenvalues, similarity, and positive definite matrices. (MIT OpenCourseWare) Stanford CS229 linear algebra notes define matrix notation, row/column notation, and matrix multiplication using the entrywise summation formula. (CS229) Boyd \u0026 Vandenberghe’s Introduction to Applied Linear Algebra explicitly connects vectors, matrices, least squares, data fitting, machine learning, AI, image processing, and other applied areas. Goodfellow, Bengio, and Courville’s Deep Learning discusses Frobenius norms, diagonal matrices, symmetric matrices, orthogonal matrices, eigendecomposition, and SVD in a deep-learning linear algebra chapter. Kolda \u0026 Bader’s SIAM Review paper defines tensors as multidimensional or \\(N\\)-way arrays, distinguishes first-order vectors and second-order matrices, defines tensor unfolding/matricization, \\(n\\)-mode products, CP decomposition, and Tucker decomposition. Multilinear algebra notes by Lerman and LMU notes give the formal tensor-product and multilinear-map view, including the universal property of tensor products. 
","date":"2026-04-28","description":"A deep introduction to matrices and tensors for AI mathematics: data matrices, images as matrices, color images and batches as tensors, matrix operations, linear maps, rank, norms, eigenvalues, SVD, tensor order, modes, unfolding, n-mode products, tensor contractions, CP decomposition, Tucker decomposition, and the formal tensor-product view.","featured":false,"featured_image":"mathematicsforai/mathsforaisection02.png","permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-01/section-02/","popular":false,"readingTime":32,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-01/section-02/","section":"courses","series":"","summary":"We extend vectors into matrices and tensors, starting from real-world data tables, images, videos, and neural-network batches, then building toward formal matrix algebra, tensor notation, tensor products, tensor unfolding, contractions, and multilinear maps.","tags":["linear algebra","matrices","tensors","multilinear algebra","machine learning","deep learning","data representation"],"title":"1.2 From Matrices to Tensors: Tables, Images, Batches, and Multilinear Structure","type":"courses"},{"categories":["Course"],"content":"Why probability comes after tensors In the previous section, we learned that AI data is often stored as vectors, matrices, and tensors.\nA single house may be a vector:\n$$ x \\in \\mathbb{R}^d. $$A dataset may be a matrix:\n$$ X \\in \\mathbb{R}^{n \\times d}. $$An image may be a tensor:\n$$ X \\in \\mathbb{R}^{H \\times W \\times C}. $$A video may be a higher-order tensor:\n$$ X \\in \\mathbb{R}^{T \\times H \\times W \\times C}. $$But tensors alone are not enough.\nAI does not only need to store data. 
It must also reason about uncertainty.\nFor example:\na sensor measurement may be noisy, a house price may be uncertain, a pixel may be corrupted, a video frame may contain motion blur, a neural network may output probabilities, a generative model may sample many possible images from the same text prompt. This means we need probability.\nMIT’s probability and random variables course covers distribution functions, common distributions, conditional probability, Bayes’ theorem, joint distributions, the law of large numbers, and the central limit theorem, which is exactly the probability foundation used throughout machine learning. (MIT OpenCourseWare)\nIn modern generative AI, probability becomes even more central.\nA diffusion model begins with data, adds noise, and then learns how to remove noise.\nSo the path is:\n$$ \\text{data} \\longrightarrow \\text{random variables} \\longrightarrow \\text{distributions} \\longrightarrow \\text{noise} \\longrightarrow \\text{stochastic processes} \\longrightarrow \\text{diffusion models}. $$Main idea: Probability is the mathematics of uncertainty. Diffusion generative models use probability to describe how clean data becomes noisy and how noisy data can be transformed back into realistic samples. A real-world example: measuring a house Suppose the true area of a house is:\n$$ x = 1200 $$square feet.\nBut a measurement device may not return exactly 1200. It may return:\n$$ 1198,\\quad 1201,\\quad 1203,\\quad 1197. $$The measured value is not fixed. It changes because of measurement noise.\nWe may write:\n$$ Y = x + \\varepsilon $$where:\n\\(x\\) is the true value, \\(\\varepsilon\\) is random noise, \\(Y\\) is the observed noisy measurement. 
If the noise is small, the measurement is close to the truth.\nIf the noise is large, the measurement may be unreliable.\nThis simple equation:\n$$ Y = x + \\varepsilon $$is one of the most important ideas in statistics, machine learning, signal processing, and diffusion models.\nA real-world example: noisy images A clean grayscale image can be represented as a matrix:\n$$ X \\in \\mathbb{R}^{H \\times W}. $$A noisy image may be written as:\n$$ Y = X + E $$where:\n\\(X\\) is the clean image, \\(E\\) is a noise matrix, \\(Y\\) is the corrupted image. Entry by entry:\n$$ Y_{ij}=X_{ij}+E_{ij}. $$If each noise value \\(E_{ij}\\) is random, then the whole image is random.\nFor a color image:\n$$ X \\in \\mathbb{R}^{H \\times W \\times C} $$and:\n$$ Y_{i,j,c}=X_{i,j,c}+E_{i,j,c}. $$For a video:\n$$ X \\in \\mathbb{R}^{T \\times H \\times W \\times C} $$and:\n$$ Y_{t,i,j,c}=X_{t,i,j,c}+E_{t,i,j,c}. $$So noise is not only one random number. It may be a random vector, random matrix, random image, random video, or random field.\nRandom experiments A random experiment is a process whose outcome is uncertain.\nExamples:\nflipping a coin, rolling a die, measuring temperature, recording a noisy image, sampling an image from a diffusion model, sampling the next token from a language model. The set of all possible outcomes is called the sample space.\nIt is usually written:\n$$ \\Omega. $$For a coin flip:\n$$ \\Omega=\\{H,T\\}. $$For a die roll:\n$$ \\Omega=\\{1,2,3,4,5,6\\}. $$For an image generation model, the sample space is much larger. It may be the space of all possible image tensors:\n$$ \\Omega \\subseteq \\mathbb{R}^{H \\times W \\times C}. $$A probability assigns a number between 0 and 1 to events.\nIf an event \\(A\\) is impossible:\n$$ P(A)=0. $$If an event \\(A\\) is certain:\n$$ P(A)=1. $$Random variables A random variable is a function that assigns a numerical value to the outcome of a random experiment.\nFormally:\n$$ X:\\Omega \\to \\mathbb{R}. 
$$For example, if we roll a die, then \\(X\\) may be the number that appears.\nIf:\n$$ \\omega = \\text{the die lands on 4}, $$then:\n$$ X(\\omega)=4. $$In machine learning, random variables may represent:\na noisy pixel, a class label, a house price, a token ID, a latent variable, a model prediction, a diffusion timestep, a random noise vector. A random vector is a vector of random variables:\n$$ X = \\begin{bmatrix} X_1 \\\\ X_2 \\\\ \\vdots \\\\ X_d \\end{bmatrix}. $$A random image is a tensor-valued random variable:\n$$ X:\\Omega \\to \\mathbb{R}^{H \\times W \\times C}. $$This is the correct mathematical way to say:\nan image is sampled from a probability distribution.\nProbability distributions A probability distribution describes how likely different values of a random variable are.\nMIT’s statistics notes describe discrete distributions using probability functions and continuous distributions using probability density functions. (MIT OpenCourseWare)\nThere are two main cases:\ndiscrete random variables, continuous random variables. Discrete distributions A discrete random variable takes values in a finite or countable set.\nFor example, a die roll has values:\n$$ 1,2,3,4,5,6. $$Its probability mass function, or PMF, is:\n$$ p_X(x)=P(X=x). $$For a fair die:\n$$ P(X=1)=P(X=2)=\\cdots=P(X=6)=\\frac{1}{6}. $$The probabilities must satisfy:\n$$ p_X(x)\\geq 0 $$and:\n$$ \\sum_x p_X(x)=1. $$Bernoulli distribution A Bernoulli random variable has two possible values:\n$$ X \\in \\{0,1\\}. $$It is often used for yes/no events.\nIf:\n$$ P(X=1)=p, $$then:\n$$ P(X=0)=1-p. $$We write:\n$$ X \\sim \\operatorname{Bernoulli}(p). $$The PMF is:\n$$ P(X=x)=p^x(1-p)^{1-x}, \\qquad x\\in\\{0,1\\}. $$In AI, Bernoulli distributions appear in:\nbinary classification, dropout masks, binary pixels, success/failure events. Categorical distribution A categorical random variable takes one of \\(K\\) possible classes:\n$$ X \\in \\{1,2,\\dots,K\\}. 
$$If:\n$$ P(X=k)=p_k, $$then:\n$$ p_k \\geq 0 $$and:\n$$ \\sum_{k=1}^{K}p_k=1. $$A neural network classifier often outputs:\n$$ p = \\begin{bmatrix} p_1 \\\\ p_2 \\\\ \\vdots \\\\ p_K \\end{bmatrix} $$where \\(p_k\\) is the predicted probability of class \\(k\\).\nFor example:\n$$ p = \\begin{bmatrix} 0.1 \\\\ 0.7 \\\\ 0.2 \\end{bmatrix} $$means:\nclass 1 has probability 0.1, class 2 has probability 0.7, class 3 has probability 0.2. Continuous distributions A continuous random variable can take any value in an interval of real numbers, so its possible values are uncountably many.\nFor example, temperature, height, time, or measurement noise may be continuous.\nA continuous random variable is described by a probability density function, or PDF:\n$$ f_X(x). $$The probability that \\(X\\) lies between \\(a\\) and \\(b\\) is:\n$$ P(a\\leq X\\leq b) {}={} \\int_a^b f_X(x)\\,dx. $$The density must satisfy:\n$$ f_X(x)\\geq 0 $$and:\n$$ \\int_{-\\infty}^{\\infty} f_X(x)\\,dx = 1. $$For a continuous random variable:\n$$ P(X=a)=0 $$for any exact single value \\(a\\).\nThis does not mean values are impossible. It means probability is measured over intervals.\nExpectation The expectation is the average value of a random variable.\nFor a discrete random variable:\n$$ \\mathbb{E}[X] {}={} \\sum_x xP(X=x). $$For a continuous random variable:\n$$ \\mathbb{E}[X] {}={} \\int_{-\\infty}^{\\infty} x f_X(x)\\,dx. $$Expectation is also called the mean.\nFor example, if:\n$$ X = \\begin{cases} 0 \u0026 \\text{with probability } 0.3,\\\\ 1 \u0026 \\text{with probability } 0.7, \\end{cases} $$then:\n$$ \\mathbb{E}[X] {}={} 0(0.3)+1(0.7)=0.7. $$Variance Variance measures spread around the mean.\nIf:\n$$ \\mu=\\mathbb{E}[X], $$then:\n$$ \\operatorname{Var}(X) {}={} \\mathbb{E}[(X-\\mu)^2]. $$The standard deviation is:\n$$ \\sigma = \\sqrt{\\operatorname{Var}(X)}. 
$$Small variance means the values are concentrated near the mean.\nLarge variance means the values are spread out.\nIn machine learning, variance matters because noise level controls how corrupted the data becomes.\nThe Gaussian distribution The Gaussian distribution is the most important continuous distribution in diffusion models.\nA random variable \\(X\\) is Gaussian with mean \\(\\mu\\) and variance \\(\\sigma^2\\) if:\n$$ X \\sim \\mathcal{N}(\\mu,\\sigma^2). $$Its density is:\n$$ p(x) {}={} \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp \\left( -\\frac{(x-\\mu)^2}{2\\sigma^2} \\right). $$MIT notes define a normal random variable \\(X\\sim N(\\mu,\\sigma^2)\\) using this bell-shaped density, with mean \\(\\mu\\) and variance \\(\\sigma^2\\). (MIT OpenCourseWare)\nIf:\n$$ \\mu=0, \\qquad \\sigma^2=1, $$then:\n$$ X\\sim\\mathcal{N}(0,1) $$is called a standard normal random variable.\nWhy the Gaussian is important The Gaussian distribution is important because:\nit appears in measurement noise, it is mathematically convenient, sums of many small independent effects tend to become approximately Gaussian, multivariate Gaussians are easy to manipulate, diffusion models usually add Gaussian noise. In DDPMs, the forward diffusion process adds Gaussian noise step by step. The original DDPM paper defines diffusion models as latent-variable models with a Markov noising process and connects their training objective to denoising score matching and Langevin dynamics. (NeurIPS Proceedings)\nMultivariate Gaussian distribution A random vector:\n$$ X \\in \\mathbb{R}^{d} $$has a multivariate Gaussian distribution if:\n$$ X \\sim \\mathcal{N}(\\mu,\\Sigma), $$where:\n\\(\\mu \\in \\mathbb{R}^d\\) is the mean vector, \\(\\Sigma \\in \\mathbb{R}^{d\\times d}\\) is the covariance matrix. The density is:\n$$ p(x) {}={} \\frac{1}{(2\\pi)^{d/2}|\\Sigma|^{1/2}} \\exp \\left( -\\frac{1}{2}(x-\\mu)^\\top\\Sigma^{-1}(x-\\mu) \\right). 
$$The covariance matrix describes how components vary together.\nThe diagonal entries are variances:\n$$ \\Sigma_{ii}=\\operatorname{Var}(X_i). $$The off-diagonal entries are covariances:\n$$ \\Sigma_{ij}=\\operatorname{Cov}(X_i,X_j). $$Stanford’s probability review emphasizes useful Gaussian closure properties: sums of independent Gaussian random variables are Gaussian, marginals of joint Gaussians are Gaussian, and conditionals of joint Gaussians are Gaussian. (CS229)\nThese properties are one reason Gaussian noise is so common in machine learning and diffusion models.\nCovariance For two random variables \\(X\\) and \\(Y\\), covariance is:\n$$ \\operatorname{Cov}(X,Y) {}={} \\mathbb{E}[(X-\\mathbb{E}[X])(Y-\\mathbb{E}[Y])]. $$If covariance is positive, \\(X\\) and \\(Y\\) tend to increase together.\nIf covariance is negative, one tends to increase while the other decreases.\nIf covariance is zero, they are uncorrelated, which does not by itself imply independence.\nFor a random vector:\n$$ X = \\begin{bmatrix} X_1 \\\\ X_2 \\\\ \\vdots \\\\ X_d \\end{bmatrix}, $$the covariance matrix is:\n$$ \\Sigma {}={} \\mathbb{E}[(X-\\mu)(X-\\mu)^\\top]. $$This is a matrix of all pairwise covariances.\nIndependent Gaussian noise The simplest noise model is independent Gaussian noise:\n$$ \\varepsilon \\sim \\mathcal{N}(0,\\sigma^2 I). $$This means:\neach component has mean 0, each component has variance \\(\\sigma^2\\), components are independent, the covariance matrix is diagonal. For a vector:\n$$ x \\in \\mathbb{R}^d, $$a noisy version is:\n$$ y=x+\\varepsilon. $$Entrywise:\n$$ y_i=x_i+\\varepsilon_i, \\qquad \\varepsilon_i\\sim\\mathcal{N}(0,\\sigma^2). $$This is the basic finite-dimensional version of Gaussian white noise.\nWhite noise White noise is noise with no correlation across time or space.\nIn discrete time, a white noise sequence is often written:\n$$ \\varepsilon_1,\\varepsilon_2,\\dots $$with:\n$$ \\mathbb{E}[\\varepsilon_t]=0 $$and:\n$$ \\mathbb{E}[\\varepsilon_t\\varepsilon_s] {}={} \\sigma^2\\delta_{ts}. 
$$Here:\n$$ \\delta_{ts} {}={} \\begin{cases} 1, \u0026 t=s,\\\\ 0, \u0026 t\\neq s. \\end{cases} $$So different times are uncorrelated.\nIf the noise is Gaussian, then:\n$$ \\varepsilon_t \\sim \\mathcal{N}(0,\\sigma^2) $$and the sequence is called Gaussian white noise.\nMIT non-equilibrium statistical mechanics notes describe random forces satisfying Gaussian white noise assumptions and express their moment structure using two-time correlations. (MIT OpenCourseWare)\nWhite noise in images For an image:\n$$ X \\in \\mathbb{R}^{H \\times W}, $$white Gaussian noise means:\n$$ E_{ij}\\sim \\mathcal{N}(0,\\sigma^2) $$independently for each pixel location \\(i,j\\).\nThen:\n$$ Y_{ij}=X_{ij}+E_{ij}. $$This produces a noisy image where each pixel is independently perturbed.\nWhite noise in continuous time Continuous-time white noise is more subtle.\nIt is often informally written:\n$$ \\xi(t) $$with:\n$$ \\mathbb{E}[\\xi(t)]=0 $$and:\n$$ \\mathbb{E}[\\xi(t)\\xi(s)] {}={} \\sigma^2\\delta(t-s). $$Here \\(\\delta(t-s)\\) is the Dirac delta distribution.\nThis means continuous white noise is not an ordinary function.\nIt is a generalized random distribution.\nA useful way to understand it is:\n$$ \\xi(t)=\\frac{dW_t}{dt}, $$where \\(W_t\\) is Brownian motion.\nBut Brownian motion is almost surely nowhere differentiable, so this derivative does not exist in the ordinary classical sense. MIT Itô calculus notes emphasize that Brownian sample paths are nowhere differentiable with probability 1, which is why ordinary calculus must be replaced by Itô calculus. (MIT OpenCourseWare)\nImportant mathematical warning: In continuous time, white noise is not an ordinary function. It should be treated as a generalized random object, often interpreted through Brownian motion and stochastic integrals. 
Colored noise White noise has no correlation across time.\nColored noise has correlation.\nFor a discrete-time noise sequence:\n$$ \\varepsilon_t, $$white noise satisfies:\n$$ \\operatorname{Cov}(\\varepsilon_t,\\varepsilon_s)=0 \\quad \\text{for } t\\neq s. $$Colored noise may satisfy:\n$$ \\operatorname{Cov}(\\varepsilon_t,\\varepsilon_s)\\neq 0 \\quad \\text{for nearby } t,s. $$For example:\n$$ \\operatorname{Cov}(\\varepsilon_t,\\varepsilon_s) {}={} \\sigma^2 e^{-|t-s|/\\ell}. $$Here \\(\\ell\\) is a correlation length.\nIf \\(\\ell\\) is small, the noise loses memory quickly.\nIf \\(\\ell\\) is large, the noise remains correlated over longer times.\nGardiner’s stochastic methods text explicitly discusses white noise as a limit of nonwhite noise processes, which is important because many physical systems have finite correlation time rather than ideal delta-correlated noise. (Deutsche Nationalbibliothek)\nPower spectrum intuition Noise can also be described by frequency content.\nWhite noise has equal power at all frequencies.\nColored noise does not.\nExamples:\nwhite noise: flat spectrum, pink noise: power roughly decreases like \\(1/f\\), brown noise: power roughly decreases like \\(1/f^2\\). In images and video, colored noise often looks smoother because nearby pixels or frames are correlated.\nIn diffusion models, the standard assumption is usually Gaussian noise with independent components, but variants can use structured or correlated noise.\nRandom fields A random field is a collection of random variables indexed by space.\nFor example:\n$$ X(s), \\qquad s\\in D\\subseteq\\mathbb{R}^2. $$This may represent:\ntemperature over a 2D region, terrain height over a map, image intensity over a pixel grid, material properties in a physical simulation. A random image can be viewed as a discretized random field.\nA random video can be viewed as a random field indexed by time and space:\n$$ X(t,s), \\qquad t\\in[0,T], \\quad s\\in D. 
$$The Cambridge book An Introduction to Computational Stochastic PDEs develops stochastic processes, random fields, stochastic differential equations, and numerical methods together, especially for uncertainty quantification and stochastic PDE computation. (Cambridge University Press \u0026amp; Assessment)\nConditional probability Conditional probability tells us the probability of one event given another event.\nIt is written:\n$$ P(A\\mid B) {}={} \\frac{P(A\\cap B)}{P(B)} $$provided:\n$$ P(B)\u003e0. $$In machine learning, we often model:\n$$ p(y\\mid x), $$the probability of output \\(y\\) given input \\(x\\).\nFor example:\n$$ p(\\text{cat}\\mid \\text{image}) $$is the probability that an image contains a cat.\nIn diffusion models, we often model reverse conditional distributions:\n$$ p_\\theta(x_{t-1}\\mid x_t). $$This means:\ngiven a noisy sample \\(x_t\\), predict a slightly less noisy sample \\(x_{t-1}\\).\nBayes’ theorem Bayes’ theorem is:\n$$ P(A\\mid B) {}={} \\frac{P(B\\mid A)P(A)}{P(B)}. $$In density notation:\n$$ p(z\\mid x) {}={} \\frac{p(x\\mid z)p(z)}{p(x)}. $$This appears throughout latent-variable models.\nHere:\n\\(p(z)\\) is the prior, \\(p(x\\mid z)\\) is the likelihood, \\(p(z\\mid x)\\) is the posterior, \\(p(x)\\) is the evidence. Diffusion models are often interpreted as latent-variable models, where the noisy states \\(x_1,\\dots,x_T\\) are latent variables between clean data \\(x_0\\) and pure noise \\(x_T\\). The DDPM paper explicitly presents diffusion probabilistic models as latent-variable models trained with a variational bound. (arXiv)\nData distributions In machine learning, we assume data comes from an unknown distribution:\n$$ x \\sim p_{\\text{data}}(x). $$For images:\n$$ x \\in \\mathbb{R}^{H \\times W \\times C}. $$For videos:\n$$ x \\in \\mathbb{R}^{T \\times H \\times W \\times C}. $$A generative model tries to learn a distribution:\n$$ p_\\theta(x) $$that approximates:\n$$ p_{\\text{data}}(x). 
$$The goal is:\n$$ p_\\theta(x)\\approx p_{\\text{data}}(x). $$Then we can sample:\n$$ x \\sim p_\\theta(x) $$to generate new images, videos, audio, or text.\nMarkov chains A Markov chain is a stochastic process where the future depends on the present, not the full past.\nLet:\n$$ X_0,X_1,X_2,\\dots $$be random variables.\nThe Markov property is:\n$$ P(X_{t+1}=x_{t+1}\\mid X_t=x_t,X_{t-1}=x_{t-1},\\dots,X_0=x_0) {}={} P(X_{t+1}=x_{t+1}\\mid X_t=x_t). $$In words:\nonce we know the current state, the older history is not needed.\nMIT’s stochastic process material introduces stochastic processes as collections of random variables indexed by time and discusses Markov chains as a central example. (MIT OpenCourseWare)\nTransition matrices For a finite-state Markov chain, transition probabilities are stored in a matrix:\n$$ P = \\begin{bmatrix} P_{11} \u0026 P_{12} \u0026 \\cdots \u0026 P_{1K} \\\\ P_{21} \u0026 P_{22} \u0026 \\cdots \u0026 P_{2K} \\\\ \\vdots \u0026 \\vdots \u0026 \\ddots \u0026 \\vdots \\\\ P_{K1} \u0026 P_{K2} \u0026 \\cdots \u0026 P_{KK} \\end{bmatrix}. $$Here:\n$$ P_{ij}=P(X_{t+1}=j\\mid X_t=i). $$Each row sums to 1:\n$$ \\sum_{j=1}^K P_{ij}=1. $$Numerical example Suppose weather has two states:\n$$ 0=\\text{sunny}, \\qquad 1=\\text{rainy}. $$Let:\n$$ P= \\begin{bmatrix} 0.8 \u0026 0.2 \\\\ 0.4 \u0026 0.6 \\end{bmatrix}. $$This means:\nif today is sunny, tomorrow is sunny with probability 0.8, if today is sunny, tomorrow is rainy with probability 0.2, if today is rainy, tomorrow is sunny with probability 0.4, if today is rainy, tomorrow is rainy with probability 0.6. If today’s distribution is:\n$$ \\pi_0= \\begin{bmatrix} 1 \u0026 0 \\end{bmatrix}, $$then tomorrow’s distribution is:\n$$ \\pi_1=\\pi_0P {}={} \\begin{bmatrix} 1 \u0026 0 \\end{bmatrix} \\begin{bmatrix} 0.8 \u0026 0.2 \\\\ 0.4 \u0026 0.6 \\end{bmatrix} {}={} \\begin{bmatrix} 0.8 \u0026 0.2 \\end{bmatrix}. $$After two days:\n$$ \\pi_2=\\pi_1P. 
$$So probability evolves by matrix multiplication.\nStationary distributions A stationary distribution is a distribution that does not change after applying the transition matrix.\nIt satisfies:\n$$ \\pi P = \\pi. $$MIT’s Markov chain notes state that every finite-state Markov chain has at least one stationary distribution. (MIT OpenCourseWare)\nFor the weather example:\n$$ P= \\begin{bmatrix} 0.8 \u0026 0.2 \\\\ 0.4 \u0026 0.6 \\end{bmatrix}. $$Let:\n$$ \\pi = \\begin{bmatrix} a \u0026 b \\end{bmatrix}. $$We need:\n$$ \\pi P=\\pi $$and:\n$$ a+b=1. $$Compute:\n$$ \\begin{bmatrix} a \u0026 b \\end{bmatrix} \\begin{bmatrix} 0.8 \u0026 0.2 \\\\ 0.4 \u0026 0.6 \\end{bmatrix} {}={} \\begin{bmatrix} 0.8a+0.4b \u0026 0.2a+0.6b \\end{bmatrix}. $$So:\n$$ 0.8a+0.4b=a $$and:\n$$ 0.2a+0.6b=b. $$From the first equation:\n$$ 0.4b=0.2a $$so:\n$$ a=2b. $$Using \\(a+b=1\\):\n$$ 2b+b=1 $$so:\n$$ b=\\frac{1}{3}, \\qquad a=\\frac{2}{3}. $$Thus:\n$$ \\pi = \\begin{bmatrix} 2/3 \u0026 1/3 \\end{bmatrix}. $$In the long run, the chain spends about two-thirds of the time sunny and one-third rainy.\nMarkov chains in diffusion models Diffusion models use Markov chains.\nThe forward diffusion process gradually adds noise:\n$$ x_0 \\to x_1 \\to x_2 \\to \\cdots \\to x_T. $$A standard DDPM forward transition is:\n$$ q(x_t\\mid x_{t-1}) {}={} \\mathcal{N} \\left( x_t; \\sqrt{1-\\beta_t}x_{t-1}, \\beta_t I \\right). $$Here:\n\\(x_0\\) is clean data, \\(x_t\\) is noisy data at time \\(t\\), \\(\\beta_t\\) controls how much noise is added, \\(I\\) is the identity covariance matrix. The reverse model learns:\n$$ p_\\theta(x_{t-1}\\mid x_t). $$The DDPM paper presents this as a diffusion probabilistic model with a learned reverse denoising process. (NeurIPS Proceedings)\nDiffusion-model connection: In DDPMs, the forward process is a fixed Markov chain that corrupts data with Gaussian noise. The neural network learns the reverse Markov chain that removes noise. 
Closed-form noising in DDPMs Define:\n$$ \\alpha_t = 1-\\beta_t $$and:\n$$ \\bar{\\alpha}_t = \\prod_{s=1}^{t}\\alpha_s. $$Then the forward process has a closed form:\n$$ q(x_t\\mid x_0) {}={} \\mathcal{N} \\left( x_t; \\sqrt{\\bar{\\alpha}_t}x_0, (1-\\bar{\\alpha}_t)I \\right). $$Equivalently, we can sample \\(x_t\\) directly as:\n$$ x_t {}={} \\sqrt{\\bar{\\alpha}_t}x_0 + \\sqrt{1-\\bar{\\alpha}_t}\\epsilon, $$where:\n$$ \\epsilon\\sim\\mathcal{N}(0,I). $$This formula is central because it allows training at random timesteps without simulating every previous step.\nBrownian motion Brownian motion is the mathematical model of continuous random motion.\nA standard Brownian motion \\(W_t\\) satisfies:\n\\(W_0=0\\), \\(W_t-W_s\\sim\\mathcal{N}(0,t-s)\\) for \\(t\u003es\\), increments over disjoint intervals are independent, paths are continuous, paths are almost surely nowhere differentiable. Brownian motion is the continuous-time limit of random walks.\nIt is also the source of continuous-time Gaussian noise.\nThe derivative:\n$$ \\frac{dW_t}{dt} $$does not exist classically, but it is often informally called white noise.\nThis is why stochastic calculus is needed.\nStochastic differential equations An ordinary differential equation has the form:\n$$ \\frac{dx}{dt}=f(x,t). $$A stochastic differential equation, or SDE, adds random forcing:\n$$ dX_t = a(X_t,t)\\,dt + b(X_t,t)\\,dW_t. $$Here:\n\\(X_t\\) is a random process, \\(a(X_t,t)\\) is the drift, \\(b(X_t,t)\\) is the diffusion coefficient, \\(W_t\\) is Brownian motion. The term:\n$$ a(X_t,t)\\,dt $$describes deterministic motion.\nThe term:\n$$ b(X_t,t)\\,dW_t $$describes random motion.\nHigham’s SIAM Review paper gives a practical introduction to numerical SDE simulation and covers stochastic integration, Euler–Maruyama, Milstein, strong and weak convergence, and stability. (SIAM)\nExample: pure Brownian motion The simplest SDE is:\n$$ dX_t=dW_t. $$With:\n$$ X_0=0, $$the solution is:\n$$ X_t=W_t. 
$$So Brownian motion itself solves an SDE.\nExample: Brownian motion with drift Consider:\n$$ dX_t = \\mu\\,dt+\\sigma\\,dW_t. $$The solution is:\n$$ X_t=X_0+\\mu t+\\sigma W_t. $$The mean is:\n$$ \\mathbb{E}[X_t]=X_0+\\mu t. $$The variance is:\n$$ \\operatorname{Var}(X_t)=\\sigma^2t. $$So the drift controls average motion.\nThe diffusion coefficient controls random spread.\nExample: Ornstein–Uhlenbeck process The Ornstein–Uhlenbeck process is:\n$$ dX_t = \\theta(\\mu-X_t)\\,dt+\\sigma\\,dW_t. $$Here:\n\\(\\mu\\) is the long-term mean, \\(\\theta\u003e0\\) controls how strongly the process returns to the mean, \\(\\sigma\\) controls noise strength. If \\(X_t\\) is above \\(\\mu\\), then:\n$$ \\theta(\\mu-X_t)\u003c0, $$so the drift pulls it downward.\nIf \\(X_t\\) is below \\(\\mu\\), then:\n$$ \\theta(\\mu-X_t)\u003e0, $$so the drift pulls it upward.\nThis is a mean-reverting stochastic process.\nIt is important in physics, finance, stochastic filtering, and generative modeling intuition.\nItô calculus In ordinary calculus:\n$$ df(X_t)=f'(X_t)dX_t. $$But in stochastic calculus, Brownian motion has quadratic variation:\n$$ (dW_t)^2 = dt. $$So Itô’s formula includes a second derivative term.\nIf:\n$$ dX_t=a(X_t,t)\\,dt+b(X_t,t)\\,dW_t, $$then:\n$$ df(X_t,t) {}={} \\left( \\frac{\\partial f}{\\partial t} + a\\frac{\\partial f}{\\partial x} + \\frac{1}{2}b^2\\frac{\\partial^2 f}{\\partial x^2} \\right)dt + b\\frac{\\partial f}{\\partial x}dW_t. $$This extra term:\n$$ \\frac{1}{2}b^2\\frac{\\partial^2 f}{\\partial x^2} $$is one of the key differences between deterministic and stochastic calculus.\nMIT’s Itô calculus lecture notes explain that Brownian paths are nowhere differentiable, so expressions involving functions of Brownian motion require Itô calculus rather than ordinary differentiation. 
(MIT OpenCourseWare)\nFokker–Planck equation An SDE describes random sample paths.\nThe Fokker–Planck equation describes how the probability density evolves.\nFor the SDE:\n$$ dX_t=a(X_t,t)\\,dt+b(X_t,t)\\,dW_t, $$the density \\(p(x,t)\\) satisfies, under suitable regularity conditions:\n$$ \\frac{\\partial p}{\\partial t} {}={} -\\frac{\\partial}{\\partial x} \\left( a(x,t)p(x,t) \\right) + \\frac{1}{2} \\frac{\\partial^2}{\\partial x^2} \\left( b^2(x,t)p(x,t) \\right). $$In higher dimensions:\n$$ dX_t=f(X_t,t)\\,dt+G(X_t,t)\\,dW_t, $$the Fokker–Planck equation becomes:\n$$ \\frac{\\partial p}{\\partial t} {}={} -\\sum_i \\frac{\\partial}{\\partial x_i} \\left( f_i p \\right) + \\frac{1}{2} \\sum_{i,j} \\frac{\\partial^2}{\\partial x_i\\partial x_j} \\left( D_{ij}p \\right), $$where:\n$$ D=GG^\\top. $$This PDE view is very important in diffusion models because the noising process changes the entire probability distribution over time.\nScore function The score function of a probability density is:\n$$ s(x,t)=\\nabla_x \\log p_t(x). $$It points in the direction where the log-density increases fastest.\nScore-based generative modeling learns:\n$$ s_\\theta(x,t) \\approx \\nabla_x \\log p_t(x). $$Song et al. present a framework where an SDE transforms data into a known noise distribution, and the reverse-time SDE transforms noise back into data using the time-dependent score function. (arXiv)\nThis is the deep mathematical connection between:\nprobability distributions, gradients, SDEs, noise, generative AI. Reverse-time SDE in diffusion models A forward SDE may be written:\n$$ dX_t=f(X_t,t)\\,dt+g(t)\\,dW_t. $$Under suitable conditions, the reverse-time process has the form:\n$$ dX_t= \\left[ f(X_t,t) - g(t)^2\\nabla_x\\log p_t(X_t) \\right]dt + g(t)\\,d\\bar{W}_t. $$Here:\n\\(p_t\\) is the density of \\(X_t\\), \\(\\nabla_x\\log p_t(x)\\) is the score, \\(\\bar{W}_t\\) is reverse-time Brownian motion. The neural network estimates the score.\nThen numerical SDE solvers generate samples.\nSong et al. 
also derive a probability-flow ODE that samples from the same marginal distributions and enables likelihood computation. (arXiv)\nNumerical methods for SDEs Most SDEs cannot be solved exactly.\nWe approximate them numerically.\nConsider:\n$$ dX_t=a(X_t,t)\\,dt+b(X_t,t)\\,dW_t. $$Choose a timestep:\n$$ \\Delta t. $$Let:\n$$ t_n=n\\Delta t. $$Brownian increments satisfy:\n$$ \\Delta W_n=W_{t_{n+1}}-W_{t_n} \\sim \\mathcal{N}(0,\\Delta t). $$So we can sample:\n$$ \\Delta W_n=\\sqrt{\\Delta t}\\,Z_n, \\qquad Z_n\\sim\\mathcal{N}(0,1). $$Euler–Maruyama method The Euler–Maruyama method is the stochastic analogue of Euler’s method.\nIt approximates:\n$$ dX_t=a(X_t,t)\\,dt+b(X_t,t)\\,dW_t $$by:\n$$ X_{n+1} {}={} X_n + a(X_n,t_n)\\Delta t + b(X_n,t_n)\\Delta W_n. $$Since:\n$$ \\Delta W_n=\\sqrt{\\Delta t}Z_n, $$we can write:\n$$ X_{n+1} {}={} X_n + a(X_n,t_n)\\Delta t + b(X_n,t_n)\\sqrt{\\Delta t}Z_n. $$Higham’s article introduces Euler–Maruyama as a core numerical method for SDE simulation and uses practical programs to explain its behavior. (SIAM)\nNumerical example Consider:\n$$ dX_t = -X_t\\,dt + 0.5\\,dW_t. $$Let:\n$$ X_0=1, \\qquad \\Delta t=0.01. $$Euler–Maruyama gives:\n$$ X_{n+1} {}={} X_n - X_n\\Delta t + 0.5\\sqrt{\\Delta t}Z_n. $$Since:\n$$ \\sqrt{0.01}=0.1, $$we get:\n$$ X_{n+1} {}={} 0.99X_n+0.05Z_n. $$At each step, the process is pulled toward zero but perturbed by random Gaussian noise.\nMilstein method For scalar SDEs, the Milstein method adds a correction term:\n$$ X_{n+1} {}={} X_n + a(X_n,t_n)\\Delta t + b(X_n,t_n)\\Delta W_n + \\frac{1}{2} b(X_n,t_n)b_x(X_n,t_n) \\left[ (\\Delta W_n)^2-\\Delta t \\right]. $$Here:\n$$ b_x=\\frac{\\partial b}{\\partial x}. $$If \\(b\\) does not depend on \\(x\\), then:\n$$ b_x=0 $$and Milstein reduces to Euler–Maruyama.\nFor sufficiently smooth coefficients, Milstein has strong order 1, compared with strong order 1/2 for Euler–Maruyama, and Higham’s SIAM Review paper covers both Euler–Maruyama and Milstein methods, including strong and weak convergence. 
(SIAM)\nStrong and weak convergence There are two main ways to measure numerical accuracy for SDEs.\nStrong convergence Strong convergence measures pathwise accuracy.\nIt asks:\nDoes the simulated path stay close to the true path?\nA method has strong order \\(\\gamma\\) if:\n$$ \\mathbb{E} \\left[ |X_T-X_T^{\\Delta t}| \\right] \\leq C(\\Delta t)^\\gamma. $$This matters when individual sample paths are important.\nWeak convergence Weak convergence measures accuracy of expectations.\nIt asks:\nDoes the method estimate averages correctly?\nA method has weak order \\(\\gamma\\) if:\n$$ | \\mathbb{E}[\\phi(X_T)] - \\mathbb{E}[\\phi(X_T^{\\Delta t})] | \\leq C(\\Delta t)^\\gamma. $$This matters when we care about statistics, distributions, and expected values.\nThe Cambridge computational SPDE text explicitly discusses strong approximation of samples and weak approximation of averages, along with Euler–Maruyama, Milstein, and multilevel Monte Carlo. (PagePlace)\nMonte Carlo simulation Monte Carlo means estimating quantities by repeated random sampling.\nSuppose we want:\n$$ \\mathbb{E}[\\phi(X_T)]. $$We simulate \\(M\\) independent paths:\n$$ X_T^{(1)},X_T^{(2)},\\dots,X_T^{(M)}. $$Then estimate:\n$$ \\mathbb{E}[\\phi(X_T)] \\approx \\frac{1}{M} \\sum_{m=1}^{M} \\phi(X_T^{(m)}). $$The Monte Carlo error usually decreases like:\n$$ \\frac{1}{\\sqrt{M}}. $$This is slow but general.\nMonte Carlo methods are fundamental in stochastic simulation, Bayesian inference, uncertainty quantification, and stochastic PDE computation.\nStochastic PDEs A deterministic PDE may look like:\n$$ \\frac{\\partial u}{\\partial t} {}={} \\Delta u. $$This is the heat equation.\nA stochastic PDE adds randomness:\n$$ \\frac{\\partial u}{\\partial t} {}={} \\Delta u+\\xi. $$Here:\n$$ \\xi $$may be space-time white noise.\nMore formally, a stochastic heat equation may be written:\n$$ du {}={} \\Delta u\\,dt + \\sigma\\,dW_t. 
$$Here \\(W_t\\) may be an infinite-dimensional Brownian motion or a cylindrical Wiener process.\nHairer’s introduction to SPDEs discusses motivating examples such as stochastic heat equations and develops intuition for random evolution equations. (hairer.org)\nWhy SPDEs are harder than SDEs An SDE evolves a finite-dimensional random variable:\n$$ X_t \\in \\mathbb{R}^d. $$An SPDE evolves a random function:\n$$ u(t,x). $$So the unknown is infinite-dimensional.\nFor example:\n$$ u(t,\\cdot) $$is a function over space.\nThis means SPDEs combine:\nprobability, stochastic processes, functional analysis, PDE theory, numerical analysis, infinite-dimensional linear algebra. This is graduate-level mathematics.\nRandom-coefficient PDEs Not all stochastic PDEs have white-noise forcing.\nSometimes the PDE has random coefficients.\nFor example:\n$$ -\\nabla\\cdot(a(x,\\omega)\\nabla u(x,\\omega))=f(x). $$Here:\n\\(a(x,\\omega)\\) is a random field, \\(u(x,\\omega)\\) is the random solution, \\(\\omega\\) represents randomness. This appears in porous media flow, materials science, climate modeling, and uncertainty quantification.\nThe Cambridge computational SPDE book discusses elliptic PDEs with correlated random data, Monte Carlo estimation of mean and variance, Karhunen–Loève expansions, and stochastic Galerkin finite-element methods. (MIMS EPrints)\nKarhunen–Loève expansion A random field can often be approximated using a basis expansion.\nLet:\n$$ a(x,\\omega) $$be a random field with mean:\n$$ \\bar{a}(x) $$and covariance function:\n$$ C(x,y). $$A Karhunen–Loève expansion writes:\n$$ a(x,\\omega) {}={} \\bar{a}(x) + \\sum_{k=1}^{\\infty} \\sqrt{\\lambda_k} \\phi_k(x) \\xi_k(\\omega). $$Here:\n\\(\\lambda_k\\) are eigenvalues of the covariance operator, \\(\\phi_k(x)\\) are eigenfunctions, \\(\\xi_k\\) are uncorrelated random variables. A truncated approximation is:\n$$ a_M(x,\\omega) {}={} \\bar{a}(x) + \\sum_{k=1}^{M} \\sqrt{\\lambda_k} \\phi_k(x) \\xi_k(\\omega). 
$$This converts a random field into finitely many random variables.\nThis is important for numerical SPDE methods and uncertainty quantification.\nSpatial discretization of SPDEs To compute an SPDE, we must discretize space.\nCommon methods include:\nfinite differences, finite elements, spectral methods. Suppose:\n$$ u(t,x) $$is approximated at grid points:\n$$ x_1,x_2,\\dots,x_N. $$Then we form a vector:\n$$ U(t) {}={} \\begin{bmatrix} u(t,x_1)\\\\ u(t,x_2)\\\\ \\vdots\\\\ u(t,x_N) \\end{bmatrix}. $$The SPDE becomes a large system of SDEs:\n$$ dU_t = AU_t\\,dt + G(U_t)\\,dW_t. $$This approach is called the method of lines.\nThen we can apply SDE time-stepping methods such as Euler–Maruyama.\nStochastic heat equation: semi-discrete form Consider:\n$$ du = \\Delta u\\,dt + \\sigma\\,dW_t. $$After spatial discretization:\n$$ dU_t = AU_t\\,dt+\\sigma\\,dW_t. $$Here:\n\\(U_t\\in\\mathbb{R}^N\\), \\(A\\in\\mathbb{R}^{N\\times N}\\) approximates the Laplacian, \\(W_t\\in\\mathbb{R}^N\\) is a vector Brownian motion. Euler–Maruyama gives:\n$$ U_{n+1} {}={} U_n + AU_n\\Delta t + \\sigma\\Delta W_n. $$Since:\n$$ \\Delta W_n\\sim\\mathcal{N}(0,\\Delta t I), $$we write:\n$$ \\Delta W_n=\\sqrt{\\Delta t}Z_n, \\qquad Z_n\\sim\\mathcal{N}(0,I). $$Thus:\n$$ U_{n+1} {}={} U_n + AU_n\\Delta t + \\sigma\\sqrt{\\Delta t}Z_n. $$This is a finite-dimensional approximation of an infinite-dimensional stochastic equation.\nStability warning for stochastic PDEs PDE discretizations can be unstable if the timestep is too large.\nFor the heat equation, explicit Euler often requires:\n$$ \\Delta t \\leq C(\\Delta x)^2. $$With stochastic forcing, stability and accuracy become even more delicate.\nThis is why implicit methods, exponential integrators, spectral methods, and carefully designed schemes are often used for SPDEs.\nComputational SPDE texts develop these methods because naive discretization may be inaccurate or unstable. 
The Cambridge text specifically combines stochastic processes, random fields, SDEs, and computational methods for stochastic PDEs. (Cambridge University Press \u0026amp; Assessment)\nConnection to diffusion generative models Diffusion generative models are not usually introduced as SPDEs.\nThey are more often introduced as:\ndiscrete-time Markov chains, continuous-time SDEs, probability-flow ODEs, score-based generative models. However, the mathematics overlaps strongly with stochastic PDE ideas because both involve:\nnoise, probability densities, stochastic dynamics, Fokker–Planck equations, numerical simulation, high-dimensional random fields. For image generation:\n$$ x_t \\in \\mathbb{R}^{H\\times W\\times C}. $$For video generation:\n$$ x_t \\in \\mathbb{R}^{T\\times H\\times W\\times C}. $$The diffusion time \\(t\\) is not the same as video time.\nFor video, we may have both:\ndiffusion time: denoising step, physical/video time: frame index. So a video diffusion model may involve tensors like:\n$$ x_{\\tau,t,i,j,c} $$where:\n\\(\\tau\\) is diffusion time, \\(t\\) is video frame time, \\(i,j\\) are spatial coordinates, \\(c\\) is color channel. This is why video generation requires both stochastic process mathematics and tensor mathematics.\nNoise schedules In DDPMs, the amount of noise added at each step is controlled by:\n$$ \\beta_t. $$The forward process is:\n$$ q(x_t\\mid x_{t-1}) {}={} \\mathcal{N} \\left( \\sqrt{1-\\beta_t}x_{t-1}, \\beta_t I \\right). $$If \\(\\beta_t\\) is small, each step adds a little noise.\nIf \\(\\beta_t\\) is large, each step adds more noise.\nThe accumulated signal strength is:\n$$ \\bar{\\alpha}_t {}={} \\prod_{s=1}^{t}(1-\\beta_s). $$The noisy sample is:\n$$ x_t {}={} \\sqrt{\\bar{\\alpha}_t}x_0 + \\sqrt{1-\\bar{\\alpha}_t}\\epsilon. $$As \\(t\\) increases:\n$$ \\bar{\\alpha}_t \\to 0. $$Then:\n$$ x_t \\approx \\epsilon. 
$$So the data becomes almost pure Gaussian noise.\nDenoising objective A common training objective is:\n$$ \\mathcal{L} {}={} \\mathbb{E}_{x_0,\\epsilon,t} \\left[ \\|\\epsilon-\\epsilon_\\theta(x_t,t)\\|^2 \\right]. $$Here:\n\\(x_0\\sim p_{\\text{data}}\\), \\(\\epsilon\\sim\\mathcal{N}(0,I)\\), \\(t\\) is a random timestep, \\(x_t\\) is the noisy sample, \\(\\epsilon_\\theta\\) is the neural network’s predicted noise. The model learns to predict the noise that was added.\nOnce it can predict noise, it can remove noise.\nThis is the practical deep-learning version of score estimation.\nProbability flow ODE In score-based models, the reverse stochastic process has an associated deterministic ODE.\nA simplified probability-flow ODE has the form:\n$$ \\frac{dX_t}{dt} {}={} f(X_t,t) - \\frac{1}{2}g(t)^2\\nabla_x\\log p_t(X_t). $$This ODE has the same marginal probability densities as the SDE under suitable conditions.\nSong et al. derive this probability-flow ODE and use it for sampling and likelihood computation. (arXiv)\nThis is why modern diffusion samplers may be viewed as numerical ODE or SDE solvers.\nSection summary We began with uncertainty.\nA noisy measurement can be written:\n$$ Y=x+\\varepsilon. $$A noisy image can be written:\n$$ Y=X+E. $$We introduced random variables:\n$$ X:\\Omega\\to\\mathbb{R}. $$We defined probability distributions, PMFs, PDFs, expectation, and variance.\nWe studied the Gaussian distribution:\n$$ X\\sim\\mathcal{N}(\\mu,\\sigma^2) $$and the multivariate Gaussian:\n$$ X\\sim\\mathcal{N}(\\mu,\\Sigma). $$We introduced white noise:\n$$ \\mathbb{E}[\\varepsilon_t\\varepsilon_s] {}={} \\sigma^2\\delta_{ts} $$and colored noise with nonzero correlations.\nWe introduced Markov chains:\n$$ P(X_{t+1}\\mid X_t,X_{t-1},\\dots,X_0) {}={} P(X_{t+1}\\mid X_t). $$We saw that DDPMs use a Gaussian Markov noising process:\n$$ q(x_t\\mid x_{t-1}) {}={} \\mathcal{N} \\left( \\sqrt{1-\\beta_t}x_{t-1}, \\beta_t I \\right). 
$$We introduced Brownian motion and SDEs:\n$$ dX_t=a(X_t,t)\\,dt+b(X_t,t)\\,dW_t. $$We introduced the Fokker–Planck equation, which describes how probability densities evolve.\nWe introduced numerical SDE methods such as Euler–Maruyama:\n$$ X_{n+1} {}={} X_n + a(X_n,t_n)\\Delta t + b(X_n,t_n)\\Delta W_n. $$We introduced SPDEs, where the unknown is a random function:\n$$ u(t,x,\\omega). $$Finally, we connected all of this to diffusion generative models, where the model learns how to transform noise into data by reversing a stochastic noising process.\nThe big idea is:\nProbability describes uncertainty. Noise corrupts data. Markov chains and SDEs describe how randomness evolves. Diffusion models learn to reverse that evolution.\nSource anchors used for this section MIT OCW probability material covers distribution functions, common distributions, conditional probability, Bayes’ theorem, joint distributions, law of large numbers, and central limit theorem. (MIT OpenCourseWare) MIT statistics notes distinguish discrete probability functions and continuous probability density functions. (MIT OpenCourseWare) Stanford CS229 probability review discusses Gaussian random variables and closure properties of multivariate Gaussians. (CS229) MIT stochastic-process notes define stochastic processes as collections of random variables indexed by time and introduce Markov chains. (MIT OpenCourseWare) MIT Markov-chain notes state existence of stationary distributions for finite-state Markov chains. (MIT OpenCourseWare) MIT Itô calculus notes explain that Brownian paths are nowhere differentiable with probability 1, motivating stochastic calculus. (MIT OpenCourseWare) Higham’s SIAM Review paper introduces numerical SDE simulation, Euler–Maruyama, Milstein, strong and weak convergence, and stability. (SIAM) Gardiner’s Handbook of Stochastic Methods discusses white noise and nonwhite noise processes. 
(Deutsche Nationalbibliothek) Lord, Powell, and Shardlow’s Cambridge text covers stochastic processes, random fields, SDEs, SPDEs, Euler–Maruyama, Milstein, strong/weak approximation, Monte Carlo, and stochastic Galerkin finite-element methods. (Cambridge University Press \u0026amp; Assessment) Hairer’s SPDE notes introduce stochastic PDEs through motivating examples including the stochastic heat equation. (hairer.org) Ho, Jain, and Abbeel’s DDPM paper defines diffusion probabilistic models as latent-variable models with Gaussian Markov noising and learned reverse denoising. (NeurIPS Proceedings) Song et al.’s score-SDE paper formulates generative modeling through forward and reverse stochastic differential equations and connects SDE sampling with probability-flow ODEs. (arXiv) ","date":"2026-04-29","description":"A careful introduction to distributions and noise for AI mathematics, starting from real-world uncertainty and moving toward Gaussian random vectors, white and colored noise, Markov chains, Brownian motion, Itô SDEs, Fokker–Planck equations, stochastic PDEs, Euler–Maruyama, Milstein methods, and the mathematical foundation of diffusion generative models.","featured":false,"featured_image":"mathematicsforai/mathsforaisecion03.png","permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-01/section-03/","popular":false,"readingTime":26,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-01/section-03/","section":"courses","series":"","summary":"We build the probability language needed for modern generative AI: random variables, probability distributions, Gaussian noise, white noise, colored noise, Markov chains, stochastic differential equations, stochastic PDEs, and numerical methods for simulating random dynamics.","tags":["probability","distributions","Gaussian noise","white noise","colored noise","Markov chains","SDE","SPDE","diffusion models","generative AI"],"title":"1.3 Probability Distributions, Noise, Markov Chains, SDEs, 
and Stochastic PDEs","type":"courses"},{"categories":["Course"],"content":"Why LLM mathematics comes after probability and tensors In the previous sections, we studied:\n$$ \\text{vectors},\\quad \\text{matrices},\\quad \\text{tensors}, $$then probability distributions, Gaussian noise, Markov chains, stochastic processes, and diffusion models.\nLarge language models use all of these ideas.\nAn LLM is not just a “text generator.” Mathematically, it is a large parameterized probability model over sequences of tokens.\nA sentence becomes a sequence:\n$$ x_1,x_2,\\dots,x_T. $$Each token becomes a vector:\n$$ e_t \\in \\mathbb{R}^{d_{\\text{model}}}. $$A full sequence becomes a matrix:\n$$ X \\in \\mathbb{R}^{T \\times d_{\\text{model}}}. $$A batch becomes a tensor:\n$$ X \\in \\mathbb{R}^{B \\times T \\times d_{\\text{model}}}. $$A transformer maps this tensor into another tensor, and the final layer produces a probability distribution over the next token.\nThe key mathematical goal is:\n$$ p_\\theta(x_{t+1}\\mid x_1,\\dots,x_t). $$That means:\ngiven previous tokens, predict the probability distribution of the next token.\nModern LLMs are built on the Transformer architecture, introduced in Attention Is All You Need, which replaced recurrence and convolution with attention mechanisms and enabled much more parallel training. (arXiv)\nMain ideaAn LLM is a high-dimensional conditional probability model. It reads a sequence of tokens and predicts a probability distribution over the next token. Part I — Text as mathematics From words to numbers A computer cannot directly process the sentence:\nMathematics is the language of AI.\nIt must first convert text into numbers.\nA simple but bad idea is to assign one number to each word:\nWord Number Mathematics 1 is 2 the 3 language 4 of 5 AI 6 Then the sentence becomes:\n$$ [1,2,3,4,5,6]. $$But this has a problem.\nThe number 6 is not “larger” than the number 1 in any meaningful semantic way. 
These numbers are just identifiers.\nSo we need a better representation.\nThe usual pipeline is:\n$$ \\text{text} \\longrightarrow \\text{tokens} \\longrightarrow \\text{token IDs} \\longrightarrow \\text{embedding vectors} \\longrightarrow \\text{transformer computation} \\longrightarrow \\text{next-token probabilities}. $$Tokens A token is a unit of text used by the model.\nA token may be:\na word, part of a word, punctuation, a space, a byte sequence, or another subword unit. For example, the word:\nmathematics may be one token in one tokenizer, but split into subwords in another tokenizer.\nSubword tokenization became important because natural language has rare words, new words, names, code symbols, numbers, and multilingual forms. BPE-based subword modeling was introduced for neural machine translation to handle rare and unknown words by encoding them as subword units rather than relying on a fixed word vocabulary. (arXiv)\nSentencePiece later provided a language-independent tokenizer/detokenizer that can train directly from raw sentences, without assuming pre-tokenized word sequences. (arXiv)\nVocabulary Let the vocabulary be:\n$$ \\mathcal{V} {}={} \\{1,2,\\dots,V\\}. $$Here:\n\\(V\\) is the vocabulary size, each token is represented by an integer ID, \\(x_t \\in \\mathcal{V}\\) is the token at position \\(t\\). For example:\n$$ x = [x_1,x_2,x_3,x_4] {}={} [1542,89,301,7721]. $$This is a sequence of token IDs.\nThe model does not yet understand meaning. These are still just integers.\nOne-hot representation A token ID can be represented as a one-hot vector.\nIf:\n$$ V=5 $$and the token ID is:\n$$ x_t=3, $$then the one-hot vector is:\n$$ o_t = \\begin{bmatrix} 0\\\\ 0\\\\ 1\\\\ 0\\\\ 0 \\end{bmatrix}. $$So:\n$$ o_t \\in \\mathbb{R}^{V}. 
$$But one-hot vectors are huge and sparse.\nIf:\n$$ V=100{,}000, $$then every token would be a 100,000-dimensional vector with only one nonzero entry.\nSo LLMs use embeddings.\nEmbedding matrix An embedding matrix is:\n$$ E \\in \\mathbb{R}^{V \\times d}. $$Here:\n\\(V\\) is vocabulary size, \\(d=d_{\\text{model}}\\) is embedding dimension. Each row of \\(E\\) is a learned vector for one token.\nIf token \\(x_t\\) has ID \\(i\\), then its embedding is:\n$$ e_t = E_i. $$Equivalently, using a one-hot vector:\n$$ e_t = o_t^\\top E. $$Shape-wise:\n$$ (1\\times V)(V\\times d)=1\\times d. $$So a token ID becomes a dense vector:\n$$ e_t \\in \\mathbb{R}^{d}. $$This is the first major mathematical transformation:\n$$ \\text{discrete token} \\longrightarrow \\text{continuous vector}. $$Sequence embedding matrix For a sequence of \\(T\\) tokens:\n$$ x_1,x_2,\\dots,x_T, $$we get embeddings:\n$$ e_1,e_2,\\dots,e_T. $$Stack them row by row:\n$$ X = \\begin{bmatrix} e_1^\\top\\\\ e_2^\\top\\\\ \\vdots\\\\ e_T^\\top \\end{bmatrix} \\in \\mathbb{R}^{T\\times d}. $$For a batch of \\(B\\) sequences:\n$$ X \\in \\mathbb{R}^{B\\times T\\times d}. $$This is the tensor that enters the transformer.\nPart II — Language modeling as probability The language-modeling problem A language model assigns probabilities to token sequences.\nGiven:\n$$ x_1,x_2,\\dots,x_T, $$the model wants:\n$$ p_\\theta(x_1,x_2,\\dots,x_T). $$Using the chain rule of probability:\n$$ p_\\theta(x_1,\\dots,x_T) {}={} \\prod_{t=1}^{T} p_\\theta(x_t\\mid x_1,\\dots,x_{t-1}). $$This is the central factorization of autoregressive language modeling.\nThe model predicts each token using the previous tokens.\nFor example:\n$$ p_\\theta(\\text{`AI''}\\mid \\text{`Mathematics for''}). $$Autoregressive modeling An autoregressive model generates tokens one at a time.\nAt step \\(t\\), it computes:\n$$ p_\\theta(x_{t+1}\\mid x_1,\\dots,x_t). 
$$Then it chooses or samples the next token.\nAfter generating \\(x_{t+1}\\), it repeats:\n$$ p_\\theta(x_{t+2}\\mid x_1,\\dots,x_t,x_{t+1}). $$So generation is sequential:\n$$ x_1 \\rightarrow x_2 \\rightarrow x_3 \\rightarrow \\cdots. $$This sequential nature is one reason inference is expensive: generating \\(K\\) new tokens usually requires \\(K\\) decoding steps. Speculative decoding papers explicitly identify autoregressive token-by-token decoding as a major latency bottleneck. (arXiv)\nLog-likelihood Suppose the training dataset contains sequences:\n$$ \\mathcal{D}={x^{(1)},x^{(2)},\\dots,x^{(N)}}. $$The likelihood is:\n$$ L(\\theta) {}={} \\prod_{n=1}^{N} p_\\theta(x^{(n)}). $$Because products of many probabilities become extremely small, we use log-likelihood:\n$$ \\log L(\\theta) {}={} \\sum_{n=1}^{N} \\log p_\\theta(x^{(n)}). $$For autoregressive models:\n$$ \\log p_\\theta(x^{(n)}) {}={} \\sum_{t=1}^{T_n} \\log p_\\theta(x_t^{(n)}\\mid x_{","date":"2026-05-01","description":"A beginner-to-advanced mathematical introduction to LLMs, covering autoregressive language modeling, tokenization, vector embeddings, positional encodings, transformer blocks, attention, softmax, cross-entropy, maximum likelihood, backpropagation, AdamW, scaling laws, compute-optimal training, MoE, efficient attention, KV caching, speculative decoding, quantization, LoRA, RLHF, DPO, PPO, and inference-time reasoning.","featured":false,"featured_image":"mathematicsforai/mathsforaisection04.png","permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-01/section-04/","popular":false,"readingTime":5,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-01/section-04/","section":"courses","series":"","summary":"We build the mathematics behind modern large language models from scratch: text as probability, tokenization, embeddings, transformers, attention, loss functions, backpropagation, scaling laws, distributed training, inference, quantization, preference 
optimization, and reasoning-time computation.","tags":["large language models","transformers","attention","language modeling","optimization","scaling laws","inference","alignment","RLHF","DPO","MoE"],"title":"1.4 Mathematics of Large Language Models: Training, Inference, Attention, Scaling, and Alignment","type":"courses"},{"categories":["Course"],"content":"Why time series matter A time series is data ordered by time.\nExamples:\nDomain Time series example Weather hourly temperature Finance stock prices Energy electricity demand Healthcare heart-rate signal Industry vibration sensor readings Language/audio waveform samples Traffic vehicle count per minute AI systems GPU utilization over time Video frames evolving through time A time series is different from an ordinary table because order matters.\nIf we observe:\n$$ 10,; 12,; 15,; 14,; 18 $$this is not the same as:\n$$ 18,; 10,; 15,; 12,; 14. $$The values are the same, but the temporal structure is different.\nTime-series analysis studies exactly this structure: dependence, trend, seasonality, cycles, shocks, noise, regime changes, long memory, and uncertainty. MIT’s graduate time-series course covers stationarity, lag operators, ARMA models, covariance structure, spectral analysis, GMM, VARs, structural breaks, and related econometric theory. (MIT OpenCourseWare)\nMain ideaA time series is not just a vector. It is an ordered stochastic object whose values depend on time and often depend on previous values. Part I — Beginner foundation A simple time series Suppose we record daily temperature:\nDay Temperature 1 20 2 21 3 19 4 22 5 24 6 23 We write:\n$$ y_1=20,\\quad y_2=21,\\quad y_3=19,\\quad y_4=22,\\quad y_5=24,\\quad y_6=23. $$The full time series is:\n$$ y_{1:6} = (y_1,y_2,y_3,y_4,y_5,y_6). $$In general:\n$$ y_{1:T}=(y_1,y_2,\\dots,y_T). $$Here:\n\\(t\\) is the time index, \\(T\\) is the total number of observations, \\(y_t\\) is the value at time \\(t\\). 
If the values are scalar, then:\n$$ y_t\\in\\mathbb{R}. $$This is a univariate time series.\nMultivariate time series Sometimes we observe many variables at each time.\nFor example, a weather station may record:\nTime Temperature Humidity Wind speed 1 20 60 5 2 21 58 6 3 19 70 4 At time \\(t\\), we have a vector:\n$$ y_t = \\begin{bmatrix} \\text{temperature}_t\\ \\text{humidity}_t\\ \\text{wind}_t \\end{bmatrix} \\in\\mathbb{R}^{3}. $$The whole series is:\n$$ y_{1:T} = (y_1,y_2,\\dots,y_T). $$As a matrix:\n$$ Y = \\begin{bmatrix} y_1^\\top \\\\ y_2^\\top \\\\ \\vdots \\\\ y_T^\\top \\end{bmatrix} \\in\\mathbb{R}^{T\\times d}. $$Here:\n\\(T\\) is sequence length, \\(d\\) is number of variables. For a batch of \\(B\\) time series:\n$$ Y\\in\\mathbb{R}^{B\\times T\\times d}. $$So time-series deep learning is tensor learning.\nForecasting The most common time-series task is forecasting.\nGiven past values:\n$$ y_1,y_2,\\dots,y_T, $$predict future values:\n$$ y_{T+1},y_{T+2},\\dots,y_{T+H}. $$Here:\n\\(T\\) is the history length, \\(H\\) is the forecast horizon. We write:\n$$ \\hat{y}_{T+1:T+H} {}={} f_\\theta(y_{1:T}). $$For one-step forecasting:\n$$ \\hat{y}_{T+1}=f_\\theta(y_{1:T}). $$For multi-step forecasting:\n$$ \\hat{y}_{T+1:T+H}=f_\\theta(y_{1:T}). $$Hyndman and Athanasopoulos’ Forecasting: Principles and Practice is a standard open textbook that introduces forecasting methods and practical modeling choices, including time-series features, exponential smoothing, ARIMA, dynamic regression, hierarchical forecasting, and practical evaluation. (OTexts: Online, open-access textbooks)\nInput window and output horizon In machine learning, we often convert a time series into supervised examples.\nSuppose:\n$$ y_1,y_2,\\dots,y_{100}. $$Choose input length:\n$$ L=5 $$and forecast horizon:\n$$ H=2. $$Then one training example is:\n$$ x_1= \\begin{bmatrix} y_1\\ y_2\\ y_3\\ y_4\\ y_5 \\end{bmatrix}, \\qquad \\text{target}= \\begin{bmatrix} y_6\\ y_7 \\end{bmatrix}. 
$$Another example is:\n$$ x_2= \\begin{bmatrix} y_2\\ y_3\\ y_4\\ y_5\\ y_6 \\end{bmatrix}, \\qquad \\text{target}= \\begin{bmatrix} y_7\\ y_8 \\end{bmatrix}. $$This is called a sliding window construction.\nIn general:\n$$ x_t=(y_t,y_{t+1},\\dots,y_{t+L-1}), $$and:\n$$ z_t=(y_{t+L},\\dots,y_{t+L+H-1}). $$The model learns:\n$$ f_\\theta:x_t\\mapsto z_t. $$Time-series tasks Time-series machine learning is not only forecasting.\nIt includes:\nTask Goal Forecasting predict future values Classification assign a label to a sequence Regression predict a continuous target from a sequence Anomaly detection find unusual time points or segments Imputation fill missing values Segmentation divide sequence into regimes Clustering group similar time series Representation learning learn useful embeddings Simulation/generation generate realistic future trajectories Control choose actions over time The UCR and UEA archives became major public benchmarks for time-series classification; the UEA 2018 multivariate archive was introduced to improve rigorous evaluation for multivariate time-series classification, where each example may contain multiple channels. (arXiv)\nPart II — Core statistical time-series mathematics Deterministic signal plus noise A useful beginner model is:\n$$ y_t = s_t+\\varepsilon_t. $$Here:\n\\(s_t\\) is the true signal, \\(\\varepsilon_t\\) is noise. For example:\n$$ y_t = \\text{trend}_t + \\text{seasonality}_t + \\text{noise}_t. $$So:\n$$ y_t = T_t + S_t + R_t. $$Here:\n\\(T_t\\) is trend, \\(S_t\\) is seasonal pattern, \\(R_t\\) is residual noise. This decomposition is central in classical forecasting and remains useful in modern deep learning. Hyndman and Athanasopoulos present decomposition as a core tool for understanding trend and seasonal structure before modeling. (OTexts: Online, open-access textbooks)\nTrend A trend is a long-term direction.\nA linear trend can be written:\n$$ T_t = a+bt. $$Here:\n\\(a\\) is intercept, \\(b\\) is slope. 
If \\(b\u003e0\\), the series increases.\nIf \\(b\u003c0\\), the series decreases.\nFor example:\n$$ y_t = 10+0.5t+\\varepsilon_t. $$This means the average value grows by 0.5 per time step.\nSeasonality Seasonality is a repeated pattern.\nFor daily data with weekly seasonality, the period is:\n$$ m=7. $$For monthly data with yearly seasonality:\n$$ m=12. $$A simple seasonal model is:\n$$ y_t = T_t + S_{t \\bmod m}+\\varepsilon_t. $$Fourier terms can model smooth seasonality:\n$$ S_t = \\sum_{k=1}^{K} \\left[ a_k\\cos\\left(\\frac{2\\pi kt}{m}\\right) + b_k\\sin\\left(\\frac{2\\pi kt}{m}\\right) \\right]. $$This connects time series to frequency-domain mathematics.\nLag operator The lag operator \\(L\\) shifts a series backward:\n$$ Ly_t=y_{t-1}. $$Then:\n$$ L^2y_t=y_{t-2}. $$MIT’s graduate time-series notes begin with stationarity, lag operators, ARMA models, and covariance structure, because lag notation is the compact language of classical time-series theory. (MIT OpenCourseWare)\nFor example:\n$$ y_t = 0.8y_{t-1}+\\varepsilon_t $$can be written:\n$$ y_t = 0.8Ly_t+\\varepsilon_t. $$So:\n$$ (1-0.8L)y_t=\\varepsilon_t. $$Stationarity A time series is stationary if its statistical properties do not change over time.\nWeak stationarity means:\nconstant mean: $$ \\mathbb{E}[y_t]=\\mu, $$ constant variance: $$ \\operatorname{Var}(y_t)=\\sigma^2, $$ autocovariance depends only on lag, not absolute time: $$ \\operatorname{Cov}(y_t,y_{t-k})=\\gamma(k). $$Stationarity matters because many classical time-series models assume stable dependence structure. MIT’s time-series notes treat stationarity as foundational for ARMA models and covariance analysis. (MIT OpenCourseWare)\nAutocovariance The autocovariance at lag \\(k\\) is:\n$$ \\gamma(k) {}={} \\operatorname{Cov}(y_t,y_{t-k}) {}={} \\mathbb{E}[(y_t-\\mu)(y_{t-k}-\\mu)]. $$For \\(k=0\\):\n$$ \\gamma(0)=\\operatorname{Var}(y_t). 
$$Autocovariance measures how values separated by \\(k\\) time steps move together.\nAutocorrelation Autocorrelation normalizes autocovariance:\n$$ \\rho(k)=\\frac{\\gamma(k)}{\\gamma(0)}. $$So:\n$$ -1\\leq \\rho(k)\\leq 1. $$If:\n$$ \\rho(1)\u003e0, $$then nearby values tend to move together.\nIf:\n$$ \\rho(1)\u003c0, $$then high values tend to be followed by low values.\nIf:\n$$ \\rho(k)\\approx 0, $$then values \\(k\\) steps apart are weakly linearly related.\nWhite noise A white-noise process satisfies:\n$$ \\mathbb{E}[\\varepsilon_t]=0, $$$$ \\operatorname{Var}(\\varepsilon_t)=\\sigma^2, $$$$ \\operatorname{Cov}(\\varepsilon_t,\\varepsilon_s)=0 \\quad \\text{for } t\\neq s. $$If:\n$$ \\varepsilon_t\\sim\\mathcal{N}(0,\\sigma^2) $$independently, then it is Gaussian white noise.\nWhite noise is the basic random innovation in ARMA, state-space, Kalman filtering, SDEs, and probabilistic forecasting.\nRandom walk A random walk is:\n$$ y_t=y_{t-1}+\\varepsilon_t. $$Equivalently:\n$$ \\Delta y_t = y_t-y_{t-1}=\\varepsilon_t. $$The variance grows over time.\nIf:\n$$ y_0=0, $$then:\n$$ y_t=\\sum_{i=1}^{t}\\varepsilon_i. $$If the innovations have variance \\(\\sigma^2\\), then:\n$$ \\operatorname{Var}(y_t)=t\\sigma^2. $$So the process is not stationary.\nRandom walks are important in finance, stochastic processes, diffusion limits, and non-stationary forecasting.\nPart III — Classical forecasting models Autoregressive model AR\\(p\\) An autoregressive model predicts the present from past values.\nAn AR(1) model is:\n$$ y_t = c+\\phi y_{t-1}+\\varepsilon_t. $$An AR\\(p\\) model is:\n$$ y_t {}={} c+ \\phi_1y_{t-1} + \\phi_2y_{t-2} + \\cdots + \\phi_py_{t-p} + \\varepsilon_t. $$Using lag notation:\n$$ \\phi(L)y_t=c+\\varepsilon_t, $$where:\n$$ \\phi(L)=1-\\phi_1L-\\phi_2L^2-\\cdots-\\phi_pL^p. 
$$AR models are linear memory models.\nThey say:\nthe future depends linearly on the past.\nMoving-average model MA\\(q\\) A moving-average model uses past noise terms.\nAn MA(1) model is:\n$$ y_t=\\mu+\\varepsilon_t+\\theta\\varepsilon_{t-1}. $$An MA\\(q\\) model is:\n$$ y_t {}={} \\mu+ \\varepsilon_t+ \\theta_1\\varepsilon_{t-1} + \\cdots + \\theta_q\\varepsilon_{t-q}. $$This means shocks can influence future observations for several time steps.\nARMA model An ARMA\\(p,q\\) model combines AR and MA terms:\n$$ y_t {}={} c+ \\sum_{i=1}^{p}\\phi_i y_{t-i} + \\varepsilon_t + \\sum_{j=1}^{q}\\theta_j\\varepsilon_{t-j}. $$This is useful for stationary time series.\nMIT’s graduate time-series material covers ARMA and covariance structure as central tools for stationary time-series analysis. (MIT OpenCourseWare)\nARIMA model Many real time series are non-stationary.\nARIMA handles non-stationarity using differencing.\nThe first difference is:\n$$ \\nabla y_t = y_t-y_{t-1}. $$If the differenced series is stationary, we can apply ARMA to:\n$$ \\nabla^d y_t. $$An ARIMA\\(p,d,q\\) model means:\n\\(p\\): autoregressive order, \\(d\\): differencing order, \\(q\\): moving-average order. Hyndman and Athanasopoulos present ARIMA as a core classical forecasting model, alongside exponential smoothing and dynamic regression. (OTexts: Online, open-access textbooks)\nSeasonal ARIMA Seasonal ARIMA adds seasonal lags.\nA common notation is:\n$$ \\operatorname{ARIMA}(p,d,q)(P,D,Q)_m. $$Here:\n\\(m\\) is seasonal period, \\(P,D,Q\\) are seasonal AR, differencing, and MA orders. For monthly data with yearly seasonality:\n$$ m=12. $$For hourly data with daily seasonality:\n$$ m=24. $$Seasonal models are important because many time series contain repeating patterns.\nExponential smoothing Simple exponential smoothing updates a level estimate:\n$$ \\ell_t=\\alpha y_t+(1-\\alpha)\\ell_{t-1}. $$Here:\n$$ 0\u003c\\alpha\u003c1. $$The forecast is:\n$$ \\hat{y}_{t+h|t}=\\ell_t. 
$$Holt’s method adds trend:\n$$ \\ell_t=\\alpha y_t+(1-\\alpha)(\\ell_{t-1}+b_{t-1}), $$$$ b_t=\\beta(\\ell_t-\\ell_{t-1})+(1-\\beta)b_{t-1}. $$Holt-Winters methods add seasonality.\nETS models formalize exponential smoothing in terms of error, trend, and seasonal components. Hyndman’s forecasting material gives the standard treatment of ETS, ARIMA, and their state-space relationships. (Rob J Hyndman)\nPart IV — State-space models and Kalman filtering Why state-space models matter Sometimes the observed time series is only a noisy measurement of a hidden state.\nFor example:\nobserved GPS position is noisy, true physical position is hidden, sensor readings contain measurement error, economic indicators reflect hidden market state. A state-space model separates:\nhidden state dynamics, observation process. Linear Gaussian state-space model A standard model is:\n$$ x_t = A x_{t-1}+w_t, $$$$ y_t = C x_t+v_t. $$Here:\n\\(x_t\\) is hidden state, \\(y_t\\) is observation, \\(A\\) is state transition matrix, \\(C\\) is observation matrix, \\(w_t\\sim\\mathcal{N}(0,Q)\\) is process noise, \\(v_t\\sim\\mathcal{N}(0,R)\\) is observation noise. Stanford and MIT lecture notes present state-space models and Kalman filtering as efficient algorithms for state estimation in linear Gaussian systems. (Stanford University)\nKalman prediction step Suppose we have an estimate:\n$$ \\hat{x}_{t-1|t-1} $$with covariance:\n$$ P_{t-1|t-1}. $$Prediction:\n$$ \\hat{x}_{t|t-1}=A\\hat{x}_{t-1|t-1}, $$$$ P_{t|t-1}=AP_{t-1|t-1}A^\\top+Q. $$This predicts the next hidden state before seeing \\(y_t\\).\nKalman update step The predicted observation is:\n$$ \\hat{y}_{t|t-1}=C\\hat{x}_{t|t-1}. $$The innovation is:\n$$ r_t=y_t-\\hat{y}_{t|t-1}. $$The innovation covariance is:\n$$ S_t=CP_{t|t-1}C^\\top+R. $$The Kalman gain is:\n$$ K_t=P_{t|t-1}C^\\top S_t^{-1}. $$Update:\n$$ \\hat{x}_{t|t} {}={} \\hat{x}_{t|t-1}+K_t r_t, $$$$ P_{t|t} {}={} (I-K_tC)P_{t|t-1}. 
$$The Kalman filter is important because it is a mathematically exact recursive estimator for linear Gaussian state-space models.\nHidden Markov models A hidden Markov model, or HMM, uses a discrete hidden state:\n$$ z_t\\in\\{1,\\dots,K\\}. $$The hidden state evolves as a Markov chain:\n$$ p(z_t\\mid z_{t-1}). $$The observation is generated from:\n$$ p(y_t\\mid z_t). $$So the joint distribution is:\n$$ p(z_{1:T},y_{1:T}) {}={} p(z_1)p(y_1\\mid z_1) \\prod_{t=2}^{T} p(z_t\\mid z_{t-1})p(y_t\\mid z_t). $$Rabiner’s classic tutorial formalized the three central HMM problems: likelihood evaluation, decoding the most likely hidden state sequence, and parameter estimation. (Computer Science at UBC)\nPart V — Spectral and frequency-domain mathematics Time domain versus frequency domain A time series can be studied in the time domain:\n$$ y_t $$or in the frequency domain.\nFrequency analysis asks:\nwhich oscillations are present in the signal?\nFor example, electricity demand may have:\ndaily frequency, weekly frequency, annual frequency. MIT’s time-series course includes spectrum and spectrum estimation as core topics after stationarity and ARMA modeling. (MIT OpenCourseWare)\nDiscrete Fourier transform For a finite sequence:\n$$ y_0,y_1,\\dots,y_{T-1}, $$the discrete Fourier transform is:\n$$ Y_k {}={} \\sum_{t=0}^{T-1} y_t e^{-2\\pi i kt/T}. $$The inverse transform is:\n$$ y_t {}={} \\frac{1}{T} \\sum_{k=0}^{T-1} Y_k e^{2\\pi i kt/T}. $$The magnitude:\n$$ |Y_k| $$shows the strength of frequency \\(k\\).\nSpectral density For a stationary process with autocovariance \\(\\gamma(k)\\), the spectral density is:\n$$ f(\\omega) {}={} \\frac{1}{2\\pi} \\sum_{k=-\\infty}^{\\infty} \\gamma(k)e^{-i\\omega k}. 
$$This is the Fourier transform of the autocovariance function.\nSo the time-domain dependence structure and frequency-domain power distribution are mathematically connected.\nPart VI — Machine learning formulation Supervised time-series learning In machine learning, we define input-output pairs:\n$$ (x_i,z_i)_{i=1}^{N}. $$For forecasting:\n$$ x_i = y_{i:i+L-1}, $$$$ z_i = y_{i+L:i+L+H-1}. $$The model is:\n$$ \\hat{z}_i=f_\\theta(x_i). $$A standard mean squared error loss is:\n$$ \\mathcal{L}(\\theta) {}={} \\frac{1}{N} \\sum_{i=1}^{N} \\|\\hat{z}_i-z_i\\|_2^2. $$For scalar one-step forecasting:\n$$ \\mathcal{L}(\\theta) {}={} \\frac{1}{N} \\sum_{i=1}^{N} (\\hat{y}_{i+1}-y_{i+1})^2. $$Time-aware train-test split Ordinary random splitting can leak future information into training.\nFor time series, we usually split chronologically:\n$$ \\text{train}: y_1,\\dots,y_{T_{\\text{train}}}, $$$$ \\text{test}: y_{T_{\\text{train}}+1},\\dots,y_T. $$This respects causality.\nA model should not learn from the future when evaluated on the past.\nDirect versus recursive forecasting For multi-step forecasting, there are several strategies.\nRecursive forecasting Train one-step model:\n$$ \\hat{y}_{t+1}=f_\\theta(y_{t-L+1:t}). $$Then feed predictions back:\n$$ \\hat{y}_{t+2}=f_\\theta(y_{t-L+2:t},\\hat{y}_{t+1}). $$Problem: errors accumulate.\nDirect forecasting Train separate models:\n$$ \\hat{y}_{t+h}=f_{\\theta_h}(y_{t-L+1:t}) $$for each horizon \\(h\\).\nProblem: many models.\nMulti-output forecasting Train one model:\n$$ \\hat{y}_{t+1:t+H}=f_\\theta(y_{t-L+1:t}). $$This is common in deep learning.\nPoint forecasting and probabilistic forecasting A point forecast gives one value:\n$$ \\hat{y}_{t+h}. $$A probabilistic forecast gives a distribution:\n$$ p(y_{t+h}\\mid y_{1:t}). $$For decision-making, probabilistic forecasts are often more useful.\nFor example, energy planning needs:\n$$ P(y_{t+h}\u003e \\text{capacity}). 
$$DeepAR explicitly frames forecasting as probabilistic forecasting: estimating future distributions given the past, using an autoregressive recurrent neural network trained across many related time series. (arXiv)\nPart VII — Probabilistic forecasting losses Gaussian likelihood Suppose the model predicts:\n$$ \\mu_\\theta(x) $$and:\n$$ \\sigma_\\theta(x)\u003e0. $$Assume:\n$$ y\\mid x\\sim\\mathcal{N}(\\mu_\\theta(x),\\sigma_\\theta(x)^2). $$The negative log-likelihood is:\n$$ -\\log p(y\\mid x) {}={} \\frac{1}{2} \\log(2\\pi\\sigma_\\theta^2) + \\frac{(y-\\mu_\\theta)^2}{2\\sigma_\\theta^2}. $$This teaches the model both center and uncertainty.\nIf uncertainty is high, \\(\\sigma_\\theta\\) can be large.\nBut predicting large \\(\\sigma_\\theta\\) is penalized by the log term.\nQuantile loss A quantile forecast predicts \\(q_\\tau(x)\\), the \\(\\tau\\)-quantile.\nThe pinball loss is:\n$$ \\ell_\\tau(y,q) {}={} \\max \\left( \\tau(y-q), (\\tau-1)(y-q) \\right). $$For example:\n\\(\\tau=0.5\\) gives median forecasting, \\(\\tau=0.9\\) gives upper quantile forecasting. Multiple quantiles form prediction intervals.\nCRPS The continuous ranked probability score compares a full predictive CDF \\(F\\) with observation \\(y\\):\n$$ \\operatorname{CRPS}(F,y) {}={} \\int_{-\\infty}^{\\infty} \\left( F(z)-\\mathbf{1}\\{y\\leq z\\} \\right)^2dz. $$Gneiting and Raftery describe CRPS as a strictly proper scoring rule for probabilistic forecasts and note that it generalizes absolute error to predictive distributions. (Statistical Consulting Service)\nProper scoring rules A scoring rule is proper if the best expected score is achieved by reporting the true distribution.\nThis matters because we do not want a model to cheat by giving overconfident or underconfident forecasts.\nFor probabilistic forecasting, common proper scoring rules include:\nnegative log-likelihood, CRPS, energy score, variogram score. 
Part VIII — RNNs, LSTMs, and GRUs Recurrent neural networks A recurrent neural network updates a hidden state:\n$$ h_t = \\phi(W_xx_t+W_hh_{t-1}+b). $$The output may be:\n$$ \\hat{y}_{t+1}=W_oh_t+c. $$Here:\n\\(x_t\\) is input at time \\(t\\), \\(h_t\\) is memory state, \\(W_h\\) controls recurrence. RNNs are natural for time series because they process data sequentially.\nVanishing and exploding gradients Backpropagation through time multiplies many Jacobians.\nA simplified gradient contains products like:\n$$ \\prod_{t=1}^{T} W_h^\\top D_t. $$If eigenvalues are small, gradients vanish.\nIf eigenvalues are large, gradients explode.\nThis makes long-term dependency learning difficult.\nLSTM LSTM was introduced to address long-term dependency problems in recurrent training. The original LSTM paper explicitly discusses insufficient error backflow and introduces memory cells and gates to improve long-duration information storage. (bioinf.jku.at)\nAn LSTM uses gates:\n$$ f_t=\\sigma(W_fx_t+U_fh_{t-1}+b_f), $$$$ i_t=\\sigma(W_ix_t+U_ih_{t-1}+b_i), $$$$ o_t=\\sigma(W_ox_t+U_oh_{t-1}+b_o), $$candidate memory:\n$$ \\tilde{c}_t=\\tanh(W_cx_t+U_ch_{t-1}+b_c), $$cell update:\n$$ c_t=f_t\\odot c_{t-1}+i_t\\odot \\tilde{c}_t, $$hidden state:\n$$ h_t=o_t\\odot\\tanh(c_t). $$The forget gate \\(f_t\\) decides what to keep.\nThe input gate \\(i_t\\) decides what to write.\nThe output gate \\(o_t\\) decides what to expose.\nGRU A GRU is a simpler gated recurrent unit.\nIt uses update and reset gates:\n$$ z_t=\\sigma(W_zx_t+U_zh_{t-1}+b_z), $$$$ r_t=\\sigma(W_rx_t+U_rh_{t-1}+b_r). $$Candidate state:\n$$ \\tilde{h}_t= \\tanh(W_hx_t+U_h(r_t\\odot h_{t-1})+b_h). $$Update:\n$$ h_t=(1-z_t)\\odot h_{t-1}+z_t\\odot \\tilde{h}_t. $$GRUs were introduced in neural sequence modeling and later compared empirically with LSTMs as gated recurrent architectures. 
(arXiv)\nPart IX — Temporal convolutional networks Causal convolution A causal convolution ensures the output at time \\(t\\) uses only current and past inputs.\nFor kernel size \\(K\\):\n$$ h_t= \\sum_{k=0}^{K-1} w_k x_{t-k}. $$No future value \\(x_{t+1}\\) is used.\nThis is necessary for forecasting.\nDilated convolution A dilated convolution skips time steps:\n$$ h_t= \\sum_{k=0}^{K-1} w_k x_{t-dk}. $$Here \\(d\\) is dilation.\nWith increasing dilations:\n$$ 1,2,4,8,\\dots $$the receptive field grows quickly.\nWaveNet used dilated causal convolutions to obtain very large receptive fields without huge computational cost. (arXiv)\nTCN Temporal convolutional networks combine:\ncausal convolutions, dilation, residual blocks, stable parallel training. A large empirical comparison found that simple convolutional sequence models can outperform canonical recurrent models such as LSTMs on several sequence modeling tasks while showing longer effective memory. (arXiv)\nA TCN block may be written:\n$$ H^{(\\ell+1)} {}={} H^{(\\ell)} + \\operatorname{Conv}_{\\text{causal,dilated}} ( \\phi(H^{(\\ell)}) ). $$This is similar to residual deep learning, but adapted to time.\nPart X — Attention and transformers for time series Why attention helps RNNs compress the past into a hidden state.\nAttention directly compares time points.\nGiven:\n$$ X\\in\\mathbb{R}^{T\\times d}, $$we compute:\n$$ Q=XW_Q, $$$$ K=XW_K, $$$$ V=XW_V. $$Attention is:\n$$ \\operatorname{Attention}(Q,K,V) {}={} \\operatorname{softmax} \\left( \\frac{QK^\\top}{\\sqrt{d_k}} \\right)V. $$Transformers were originally introduced for sequence transduction using attention alone, avoiding recurrence and convolution. (arXiv)\nAttention as temporal dependency learning The attention matrix is:\n$$ A= \\operatorname{softmax} \\left( \\frac{QK^\\top}{\\sqrt{d_k}} \\right). 
$$Here:\n$$ A_{ij} $$measures how much time point \\(i\\) attends to time point \\(j\\).\nFor forecasting, causal masking may be used:\n$$ A_{ij}=0 \\quad \\text{if} \\quad j\u003ei. $$For encoder-style forecasting, the model may attend over the whole observed context.\nQuadratic complexity Standard attention forms a \\(T\\times T\\) matrix.\nSo memory and compute scale like:\n$$ O(T^2). $$This is a problem for long time series.\nIf:\n$$ T=10{,}000, $$then:\n$$ T^2=100{,}000{,}000. $$So many time-series transformer papers modify tokenization, attention, decomposition, or variable representation to reduce cost or improve inductive bias.\nTemporal Fusion Transformer Temporal Fusion Transformer combines recurrent layers, attention, gating, static covariate encoders, known future inputs, and variable selection for multi-horizon forecasting. Its paper emphasizes both high-performance multi-horizon forecasting and interpretability over temporal dynamics. (ScienceDirect)\nA simplified multi-horizon forecast is:\n$$ \\hat{y}_{t+1:t+H} {}={} f_\\theta ( \\text{past observed}, \\text{known future covariates}, \\text{static features} ). $$This is important because real forecasting often includes:\nstatic covariates, known future calendar features, observed historical covariates, target history. PatchTST PatchTST treats time-series segments as patches.\nInstead of tokenizing each time point, it tokenizes subseries:\n$$ p_i = [y_{s_i},y_{s_i+1},\\dots,y_{s_i+P-1}]. $$Then patches become transformer tokens.\nPatchTST argues that patching retains local semantic information, reduces attention cost quadratically for a fixed lookback window, and allows longer histories; it also uses channel independence, sharing embedding and transformer weights across univariate channels. (arXiv)\nMathematically:\n$$ Y\\in\\mathbb{R}^{T\\times d} \\quad \\longrightarrow \\quad P\\in\\mathbb{R}^{N_p\\times P\\times d}. $$Then each patch is embedded into:\n$$ z_i\\in\\mathbb{R}^{d_{\\text{model}}}. 
$$iTransformer Traditional transformers often treat each timestamp as a token.\niTransformer inverts this idea.\nIt treats each variable as a token whose feature vector is the lookback history.\nIf:\n$$ Y\\in\\mathbb{R}^{T\\times d}, $$then for variable \\(j\\):\n$$ v_j= [y_{1j},y_{2j},\\dots,y_{Tj}] \\in\\mathbb{R}^{T}. $$The model embeds variate tokens, and attention captures multivariate correlations.\nThe iTransformer paper argues that standard timestamp-token transformers may mix delayed events and distinct physical variables poorly, and shows that applying transformer components on inverted dimensions can improve forecasting and generalization. (arXiv)\nTimesNet TimesNet transforms 1D time series into 2D tensors based on discovered periods.\nIf a period is \\(p\\), a sequence can be reshaped into:\n$$ \\text{rows}=\\text{number of periods}, \\qquad \\text{columns}=p. $$This separates:\nintraperiod variation, interperiod variation. TimesNet motivates this by multi-periodicity and models temporal variation in 2D space using parameter-efficient 2D kernels. It reports results across forecasting, imputation, classification, and anomaly detection. (arXiv)\nTimeMixer TimeMixer uses multiscale decomposition.\nIt builds representations at multiple sampling scales:\n$$ Y^{(1)},Y^{(2)},\\dots,Y^{(S)}. $$Then it decomposes each scale into trend and seasonal parts:\n$$ Y^{(s)}=T^{(s)}+S^{(s)}. $$TimeMixer proposes past-decomposable mixing and future-multipredictor mixing to combine microscopic seasonal information and macroscopic trend information across scales. 
(ICLR Proceedings)\nPart XI — Deep forecasting architectures DeepAR DeepAR is a probabilistic autoregressive RNN.\nAt each time:\n$$ h_t=\\operatorname{RNN}_\\theta(h_{t-1},y_{t-1},x_t), $$then:\n$$ \\theta_t = g_\\theta(h_t), $$where \\(\\theta_t\\) parameterizes a probability distribution:\n$$ p(y_t\\mid y_{","date":"2026-05-01","description":"A beginner-to-advanced mathematical introduction to time-series machine learning and deep learning, covering classical forecasting, stochastic processes, spectral analysis, supervised time-series learning, recurrent neural networks, LSTMs, GRUs, temporal convolutional networks, transformers, PatchTST, iTransformer, TimesNet, TimeMixer, state-space models, S4, Mamba, DeepAR, N-BEATS, N-HiTS, Temporal Fusion Transformers, diffusion-based forecasting, neural ODEs, neural CDEs, foundation models, probabilistic losses, conformal prediction, anomaly detection, classification, and imputation.","featured":false,"featured_image":"mathematicsforai/mathsforaisection05.png","permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-01/section-05/","popular":false,"readingTime":16,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-01/section-05/","section":"courses","series":"","summary":"We build the mathematics of time-series learning from scratch: sequences, forecasting, stationarity, autocorrelation, ARIMA, state-space models, Kalman filtering, recurrent networks, temporal convolutions, transformers, state-space sequence models, probabilistic forecasting, diffusion models, neural differential equations, and time-series foundation models.","tags":["time series","forecasting","machine learning","deep learning","RNN","LSTM","transformers","state space models","probabilistic forecasting","foundation models"],"title":"1.5 Mathematics of Machine Learning and Deep Learning for Time 
Series","type":"courses"},{"categories":[],"content":"","date":"2026-01-01","description":"","featured":false,"featured_image":null,"permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-02/section-01/","popular":false,"readingTime":0,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-02/section-01/","section":"courses","series":"","summary":"","tags":[],"title":"Section 2.1","type":"courses"},{"categories":[],"content":"","date":"2026-01-01","description":"","featured":false,"featured_image":null,"permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-03/section-01/","popular":false,"readingTime":0,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-03/section-01/","section":"courses","series":"","summary":"","tags":[],"title":"Section 3.1","type":"courses"},{"categories":[],"content":"","date":"2026-01-01","description":"","featured":false,"featured_image":null,"permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-04/section-01/","popular":false,"readingTime":0,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-04/section-01/","section":"courses","series":"","summary":"","tags":[],"title":"Section 4.1","type":"courses"},{"categories":[],"content":"","date":"2026-01-01","description":"","featured":false,"featured_image":null,"permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-05/section-01/","popular":false,"readingTime":0,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-05/section-01/","section":"courses","series":"","summary":"","tags":[],"title":"Section 
5.1","type":"courses"},{"categories":[],"content":"","date":"2026-01-01","description":"","featured":false,"featured_image":null,"permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-06/section-01/","popular":false,"readingTime":0,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-06/section-01/","section":"courses","series":"","summary":"","tags":[],"title":"Section 6.1","type":"courses"},{"categories":[],"content":"","date":"2026-01-01","description":"","featured":false,"featured_image":null,"permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-06/section-02/","popular":false,"readingTime":0,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-06/section-02/","section":"courses","series":"","summary":"","tags":[],"title":"Section 6.2","type":"courses"},{"categories":[],"content":"","date":"2026-01-01","description":"","featured":false,"featured_image":null,"permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-06/section-03/","popular":false,"readingTime":0,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-06/section-03/","section":"courses","series":"","summary":"","tags":[],"title":"Inference at scale for Large Language Models","type":"courses"},{"categories":[],"content":"","date":"2026-01-01","description":"","featured":false,"featured_image":"mathematicsforai/chaptero07.png","permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-07/section-01/","popular":false,"readingTime":0,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-07/section-01/","section":"courses","series":"","summary":"","tags":[],"title":"Section 
7.1","type":"courses"},{"categories":[],"content":"","date":"2026-01-01","description":"","featured":false,"featured_image":"mathematicsforai/chaptero07.png","permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-07/section-02/","popular":false,"readingTime":0,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-07/section-02/","section":"courses","series":"","summary":"","tags":[],"title":"Section 7.2","type":"courses"},{"categories":[],"content":"Karpathy Autoresearch Explained Introduction This lesson introduces autoresearch as a practical workflow for letting an AI coding agent run experiments without waiting for a human to choose every next step. The basic pattern is simple: define the goal, freeze the evaluator, let the agent propose code changes, run the experiment, keep the change only if the metric improves, and repeat. The public examples make the idea concrete: single-GPU overnight runs improved val_bpb from 0.997900 to 0.969686 in 126 experiments on an H100, and those smaller depth-12 findings later transferred to larger depth-24 nanochat runs, reducing the “time to GPT-2” leaderboard entry from 2.02 hours to 1.80 hours, with a later entry at 1.65 hours. The rest of this section turns that workflow into a tutorial: first the naming and intuition, then the loop, comparisons, implementations, strengths, limitations, and a practical recipe for building a similar system.\nTerminology and intuition Use the name autoresearch for this method, not “autoregressive search.” The phrase “autoregressive” still matters, but it describes the language model used to propose edits; the overall workflow is the autoresearch loop.\nThe core intuition is simple. 
Instead of asking a human researcher to manually try one code change at a time, the human writes a research charter in program.md, freezes the evaluator, and lets an agent repeatedly: read the current best code, propose a patch, run a real experiment, observe the metric, and keep only improvements. Karpathy’s repo explicitly structures the system around three files: a read-only evaluator/data file (prepare.py), a mutable research target (train.py), and an instruction file (program.md). The loop is therefore not just “generate text”; it is: generate action proposals, execute them in the world, score them, and ratchet the best state forward.\nKarpathy’s own motivation in the March 20 interview was to “remove yourself as the bottleneck.” He explained autoresearch as an example of refactoring the workflow so the human is not the next-step trigger between experiments. That is the conceptual heart of the method: the human specifies the objective and constraints once, then the agent runs the edit → execute → evaluate → keep/revert loop autonomously for as long as the budget allows.\nFlowchart: (A) read current best code and instructions → (B) agent proposes code edit → (C) apply patch on experiment branch → (D) run fixed-budget experiment → (E) read metric and diagnostics → (F) improved and passes checks? If yes, (G) commit and advance incumbent; if no, (H) revert / reset to previous best. Both branches return to (A).\nA useful mental model is therefore:\nSearch space: code edits, hyperparameters, optimizer settings, architecture choices. Proposal distribution: an autoregressive coding agent. Environment: the actual training/evaluation harness. Fitness function: a scalar metric such as val_bpb. Selection rule: elitist keep-or-revert. 
That is why the method feels partly like automated ML experimentation, partly like evolutionary search, and partly like agentic software engineering.\nOriginal sources and timeline The easiest way to understand autoresearch is to follow how the workflow grows. It starts from Karpathy’s goal to “remove yourself as the bottleneck”: the human should define the objective and constraints, but should not have to trigger every experiment by hand. In the original GitHub setup, that idea becomes a compact loop: the human writes the Markdown instructions, the AI agent edits the training code, the evaluator runs a short experiment, and the result decides whether the edit is kept or reverted. The March 8-9 result reports show why this loop is useful in practice: depth-12 experiments found additive improvements, those changes transferred to larger depth-24 nanochat runs, and “time to GPT-2” dropped from 2.02 hours to 1.80 hours, with session logs showing runs such as 0.9979 → 0.9773 in 89 experiments and 0.9979 → 0.969686 in 126 experiments. The later March interview expands the same idea from one local agent into a possible collaborative system where many agents propose changes and humans or evaluators verify the useful ones. Read this section as a tutorial for that pattern: prompt the agent, let it edit, run a fixed evaluator, keep the improvement, revert the failure, and then scale the loop only after the basic version works.\nFormal algorithm and mathematical formulation According to the original setup, there is a read-only evaluator and data pipeline in prepare.py; a single editable file, train.py; a fixed wall-clock training budget of 300 seconds; a fixed context length MAX_SEQ_LEN = 2048; a validation budget EVAL_TOKENS = 40 * 524288; and a scalar objective val_bpb, where lower is better. 
The program.md instructions say to run a baseline first, then loop forever: edit train.py, commit, run the experiment, parse the results, record them, and keep the commit only if val_bpb improved.\nDefinition Let:\n\\(c_t\\) be the current incumbent code state at iteration \\(t\\). \\(h_t\\) be the experiment history up to \\(t\\), including descriptions, metrics, crashes, and commits. \\(p\\) be the human-written research program or instruction context. \\(q_\\phi(a \\mid c_t, h_t, p)\\) be the coding agent’s proposal distribution over edits \\(a\\). In practice, this is implemented by an autoregressive LLM acting over code and shell actions. \\(A(c_t, a_t)\\) be the application of edit \\(a_t\\) to code \\(c_t\\), yielding candidate code \\(c'_t\\). \\(J(c)\\) be the scalar evaluator returned after running the fixed-budget experiment. In the original setup, \\(J(c)=\\text{val\\_bpb}(c)\\), with lower being better. A minimal formalization is:\n\\[ a_t \\sim q_\\phi(\\cdot \\mid c_t, h_t, p) \\]\\[ c'_t = A(c_t, a_t) \\]\\[ y_t = J(c'_t) \\]\\[ c_{t+1} = \\begin{cases} c'_t \u0026 \\text{if } y_t \\le J(c_t)-\\varepsilon \\text{ and checks}(c'_t)=1 \\\\ c_t \u0026 \\text{otherwise} \\end{cases} \\]Here \\(\\varepsilon\\) is an optional acceptance margin. In the original implementation, the operational rule is effectively \\(\\varepsilon = 0\\): if the metric is lower, keep; otherwise revert. 
Crashes are treated as failures and reverted.\nPseudocode The following pseudocode is a faithful abstraction of the public loop:\nINPUT: immutable evaluator E initial code c0 instructions p max iterations T acceptance margin eps = 0 baseline = run(E, c0) log = [(c0, baseline.metric, \u0026#34;keep\u0026#34;, \u0026#34;baseline\u0026#34;)] c_best = c0 m_best = baseline.metric for t in 1..T: proposal = Agent.propose(current_code=c_best, history=log, instructions=p) c_try = apply_patch(c_best, proposal) result = safe_run(E, c_try) # returns metric, crash flag, aux stats if result.crash: revert(c_try) log.append((hash(c_try), None, \u0026#34;crash\u0026#34;, summarize(proposal))) continue if result.metric \u0026lt; m_best - eps and passes_policy_checks(c_try, result): commit(c_try) c_best = c_try m_best = result.metric status = \u0026#34;keep\u0026#34; else: revert(c_try) status = \u0026#34;discard\u0026#34; log.append((hash(c_try), result.metric, status, summarize(proposal))) return c_best, log This is not beam search, because the basic design keeps only one incumbent branch. It is not MCTS, because there is no explicit search tree, value backup, or visit-count policy. It is best described as LM-proposed stochastic local search with elitist acceptance and rollback.\nPublic-repo objective function prepare.py defines val_bpb as a vocab-size-independent metric: it sums cross-entropy in nats over target tokens, sums their byte lengths, and converts nats-per-byte to bits-per-byte. It also uses a fixed sequence length and fixed evaluation token count to keep runs comparable across configuration changes.\nFormally, if token losses are \\(\\ell_i\\) in nats and target byte lengths are \\(b_i\\), then:\n\\[ \\text{val\\_bpb} = \\frac{\\sum_i \\ell_i}{\\log 2 \\cdot \\sum_i b_i} \\]with zero-byte special tokens excluded from both sums.\nAssumptions This loop works well only under a fairly specific set of assumptions. 
The evaluator must be stable enough that a measured improvement is meaningful; the objective must be cheap enough to run many times; the editable surface must be small or well-scoped so patches are reviewable; and the metric must be machine-checkable so keep/revert can be automated. Karpathy’s public traces also show that some improvements are fragile: for example, seed changes and a 5% warmup looked promising in one session but did not reproduce in a later one, and nanochat’s leaderboard notes mild training nondeterminism across repeated runs.\nIf the evaluator is noisy, a more statistically careful version should replace \\(J(c)\\) by an average over repeated runs,\n\\[ \\hat J_K(c) = \\frac{1}{K}\\sum_{k=1}^{K} J(c; \\xi_k), \\]and accept only when the estimated gain exceeds noise by a chosen threshold or confidence bound. That is not part of the minimal loop, but it is often the right production modification. The need for such care is supported by the repeated-run spread observed on the nanochat leaderboard.\nComparison to related methods Use the comparison below as a map, not as a claim that these methods do the same thing. Beam search, sampling, reranking, and contrastive decoding usually operate on token sequences; autoresearch operates on program edits that are evaluated by actually running code. 
The shared idea is that each method spends limited compute exploring possible next states.\nMethod Determinism Diversity Compute Latency Memory Typical use cases Pros Cons Autoresearch Medium in practice; proposal and training noise matter Medium by default; higher with multi-agent / parallel variants High per step because each node requires execution High sequential latency; much better with parallel workers Low for single-branch search; higher with parallel experiments ML tuning, build/CI optimization, kernel optimization, any loop with executable metric Real-world feedback, can discover non-obvious interactions, accumulates improvements over time Expensive evaluations, metric hacking, local optima, reproducibility/safety concerns Beam search High given fixed model and tie-breaks Low Moderate Low to moderate Grows with beam width MAP-style sequence decoding, constrained generation Strong likelihood search, simple, widely available Low diversity, beam-search pathologies, still tied to model score Sampling Low High Low Low Low Open-ended text generation, exploration Cheap, diverse, easy to tune Noisy, unstable quality, weaker guarantees MCTS Low to medium Medium to high High High High Planning with delayed rewards, game-like search, tool-use planning Handles long-horizon decisions and sparse rewards better than greedy local search Heavy orchestration cost, requires good rollout/value heuristics Reranking / MBR-style selection Medium Depends on candidate generator Moderate to high Two-stage Moderate Translation, structured generation, best-of-N selection Lets you use an external quality metric instead of raw model likelihood Needs a good candidate set; quality ceiling is limited by proposal stage Contrastive decoding High to medium Low to medium Moderate Low to moderate Moderate Open-ended text generation with fewer degeneracies Better quality than plain greedy/beam in some open-ended settings, no retraining Still token decoding; not an execution-driven search loop 
The table gives a quick way to place autoresearch beside more familiar search and decoding methods.\nTwo comparisons are especially important.\nFirst, beam search versus autoresearch. Beam search keeps the top \\(K\\) partial token prefixes according to model score. Autoresearch keeps one current best code state and accepts only candidates that improve after real execution. So beam search is a breadth-limited decoder over symbolic prefixes; autoresearch is a ratcheting search over executable states. With many parallel workers, autoresearch can test more branches at once, but the basic single-worker version is closer to greedy hill-climbing.\nSecond, contrastive decoding and reranking versus autoresearch. Contrastive decoding is still a token-level inference objective: it prefers tokens that score well under a large model while penalizing those favored by a smaller “amateur” model. Reranking/MBR, similarly, selects among a candidate set using an external utility. Autoresearch is different because the candidate’s score is produced by executing the modified system in an environment. That makes it far more general, but also much more expensive.\nThe cleanest one-sentence summary is: beam search, sampling, reranking, and contrastive decoding mostly search over text continuations; autoresearch searches over executable research actions. That is why it is powerful when you have a trusted evaluator, and why it is overkill when simple token decoding is enough.\nImplementations and code repositories Start with the original Python implementation. It has three key components: prepare.py, train.py, and program.md. analysis.ipynb loads results.tsv and generates progress plots such as the running best frontier. 
The upstream benchmark context is nanochat, whose leaderboard shows how discovered changes transfer to larger training runs.\nRepository What it is Language Key files Why it matters karpathy/autoresearch Original minimal implementation Python, Jupyter program.md, train.py, prepare.py, analysis.ipynb Canonical source for the loop and defaults karpathy/nanochat Upstream training harness / benchmark target Python runs/speedrun.sh, dev/LEADERBOARD.md Where Karpathy measures transfer to “time to GPT-2” miolini/autoresearch-macos Mac-oriented fork preserving upstream structure Python, Jupyter program.md, train.py, prepare.py Useful if you want a close-to-upstream Mac path trevin-creator/autoresearch-mlx MLX port for Apple Silicon Macs Python program.md, train.py, prepare.py Native MLX version; explicitly keeps the same loop semantics jsegov/autoresearch-win-rtx Windows RTX fork for consumer NVIDIA GPUs Python, Jupyter program.md, train.py, prepare.py Native Windows path with consumer-GPU focus mutable-state-inc/autoresearch-at-home Collaborative SETI@home-style swarm fork Python, Jupyter collab.md, coordinator.py, plus upstream core files Adds coordinated multi-agent experiment claiming and result sharing gensyn-ai/collaborative-autoresearch-demo P2P collaborative demo Python program.md, train.py, prepare.py Shows real-time result sharing across agents RightNow-AI/autokernel Domain transfer of the pattern to GPU kernel optimization Python program.md, kernel.py, bench.py, profile.py, extract.py, verify.py Demonstrates that the pattern generalizes beyond model training Use the table as a guide to the main implementations and related adaptations.\nA few implementation notes are especially useful.\nThe original implementation is purposely tiny and opinionated: one mutable file, one metric, one five-minute budget, one incumbent branch. That simplicity matters because it reduces context bloat, makes diffs reviewable, and keeps acceptance decisions easy to automate. 
The program.md file even includes an explicit “simplicity criterion” saying that small metric gains are not worth keeping if they introduce ugly complexity.\nFor Apple Silicon, the MLX rewrite is the most useful path to study. It keeps the fixed-time loop, the same program.md idea, and the same keep-or-revert pattern, while changing the runtime substrate from PyTorch/CUDA to MLX. It also reports hardware-specific winners and notes that some findings on a Mac Mini did not transfer cleanly to Max-class hardware, which is exactly the sort of platform-specific effect autoresearch can expose.\nTo learn the general pattern rather than the exact training setup, compare it with autokernel, which applies the same structure to GPU kernel optimization. Instead of editing train.py, the agent edits kernel.py; instead of val_bpb, it uses a fixed correctness-and-performance harness in bench.py; and instead of model training, it profiles, extracts, optimizes, and verifies kernels. This shows that the autoresearch pattern is not limited to LLM pretraining.\nFor JAX, treat google-deepmind/simply as a useful substrate rather than a drop-in autoresearch port. It is a minimal JAX codebase for rapid LLM research iteration, so it can support the same style of workflow, but it is not the same implementation as Karpathy’s original setup.\nEvidence, strengths, limitations, and evaluation Public results and performance characteristics The clearest results come from the session reports and the nanochat leaderboard. In Discussion #32, a single H100 session improved val_bpb from 0.9979 to 0.9773 in 89 experiments, with early wins from halving batch size, longer warmdown, warmup, and a depth-9 reparameterization at roughly constant width. In Discussion #43, another H100 session improved val_bpb from 0.997900 to 0.969686 in 126 experiments, with 23 kept changes, 102 discarded changes, and 1 crash over about 10.5 hours. 
The largest gains in that second report came from halving the batch, moving to depth 9 at about the same dimensionality, raising embedding LR, changing RoPE base frequency, and adding small targeted weight decay to embeddings/value embeddings.\nThose gains mattered because the changes were then reported to transfer to nanochat’s larger depth-24 runs. The leaderboard documents the progression from 2.02 hours to 1.80 hours for “autoresearch round 1,” then to 1.65 hours for “autoresearch round 2.” The leaderboard notes that the first autoresearch-derived entry came from a private autoresearch run on a depth-12 model whose improvements translated to the depth-24 benchmark.\nWith the original five-minute experiment budget, the default throughput is about 12 experiments per hour and around 100 overnight. Community implementations show the same pattern scaling in both directions: the MLX port reports roughly 6–7 minutes per experiment on Apple Silicon setups, while autokernel reports about 90 seconds per kernel experiment and roughly 320 overnight across all kernels. A 16-GPU cluster experiment reported about 910 experiments in ~8 hours and reached the same best validation loss about 9× faster than a simulated sequential baseline.\nTypical use cases and strengths This pattern is best when the task has a trusted scalar metric, a cheap or moderate-cost evaluator, and a limited editable surface. That is why it works well for small-to-medium training harnesses, build-time reduction, inference-kernel optimization, and similar tasks where the agent can cheaply try many patches and get immediate pass/fail or better/worse feedback.\nA later engineering write-up from Shopify is especially informative because it generalizes the same pattern beyond ML training. Their write-up describes adapting the loop to a CI/build-time metric: measure the baseline, let the agent propose hypotheses, keep faster changes, and discard slower or crashing ones. 
The important point is not the specific codebase; it is that the same ratcheting loop transferred cleanly from val_bpb minimization to build-time minimization.\nThe biggest strengths are therefore straightforward. The method uses real execution feedback rather than pure model probability, so it can discover interaction effects that humans or one-shot prompting miss. It accumulates improvements compositionally when the metric is sufficiently informative. It also converts “background optimization work” into an always-on process: exactly the sort of boring-but-valuable work that humans tend not to prioritize manually.\nLimitations and failure modes The most obvious failure mode is metric hacking. Shopify’s write-up gives a crisp example: the agent sometimes found “ugly hacks,” such as deleting or bypassing things that technically made the build faster but were not acceptable engineering outcomes. That is the canonical autoresearch problem in one sentence: the optimizer is only as good as the metric and guardrails.\nA second failure mode is adaptive overfitting to the evaluator. The basic setup pins a validation shard and repeatedly compares candidates against that fixed target. That makes the loop simple and fast, but repeated adaptive selection on a fixed validation signal always raises the risk of overfitting to measurement noise or to idiosyncrasies of the validation slice. A stronger tutorial version should add held-out checks or repeated measurements before trusting late-stage improvements.\nA third issue is noise and reproducibility. nanochat’s leaderboard explicitly notes mild nondeterminism and shows a spread in repeated CORE scores across nominally identical runs. Session reports also show fragile findings: a seed change helped in one run and not another; warmup helped once and failed to reproduce later. 
This means false positives are possible unless you add repeated measurement, significance thresholds, or external confirmation.\nA fourth issue is local-optimum behavior. The public single-agent design keeps one incumbent and reverts worse candidates, which is efficient but inherently local. It can get stuck. Parallel and collaborative variants partly address this by exploring more combinations in parallel, but the default solo loop does not maintain a principled frontier of diverse hypotheses the way a beam or tree search would. The public cluster-scaling write-up makes exactly this point by contrasting one-at-a-time hill-climbing with parallel grids.\nA fifth issue is code complexity creep. Karpathy’s prompt explicitly tries to defend against this with a simplicity prior, but the danger is real: agents can often buy tiny metric improvements with brittle or ugly changes, especially late in the run. The simplicity criterion in program.md is therefore not cosmetic; it is a required regularizer on the search objective.\nFinally, there is a security and trust problem for collaborative variants. Karpathy’s interview discussion of large, untrusted pools of workers makes clear that distributed autoresearch only works if candidate solutions are easy to verify and isolated enough to run safely. That is much easier for metrics than for arbitrary code execution in a shared system.\nHow to implement it from scratch What follows is the shortest path to building an autoresearch-like loop yourself. The original version is PyTorch-based; the JAX notes below show how to translate the same workflow into a more functional setup.\nMinimal recipe Choose one machine-checkable metric. The metric should be scalar, cheap, and hard to game accidentally. In the original setup, this is val_bpb, computed by a read-only evaluator over a fixed sequence length and fixed validation budget. If your metric is noisy, define the repeated-run version up front.\nFreeze the evaluator. 
Put data loading, eval logic, time budget, and pass/fail conditions in an immutable file or module. Karpathy’s repo makes prepare.py read-only for exactly this reason. If the agent can change the evaluator, the loop stops being research and becomes reward hacking.\nConstrain the editable surface. Start with one mutable file or one mutable config block. Karpathy’s official design lets the agent touch only train.py. This dramatically improves debuggability and keeps diffs reviewable.\nEstablish a baseline first. The public prompt says the first run should always be the unmodified baseline. Log it, store the metric, and treat it as your incumbent.\nImplement the ratchet loop. The loop is: propose patch, apply patch, run evaluator, parse metric, keep if improved, otherwise revert, then log the result to a machine-readable history such as results.tsv. The official notebook then computes keep rate, running frontier, and “top hits” from that log.\nAdd safety and anti-gaming checks. Timeouts, lint/tests, memory ceilings, output-shape checks, and a simplicity prior are cheap and matter a lot. If you can, make the acceptance policy multi-objective—for example, improve metric while staying within a memory envelope. The official prompt already treats VRAM as a soft constraint and simplicity as a decision criterion.\nOnly then add parallelism. Start with the sequential loop first. If experiment runtime dominates planning time, parallel workers usually help a lot, but they also force you to manage experiment deduplication, candidate claiming, and merging. 
Collaborative versions need explicit coordination layers.\nRecommended hyperparameters These defaults are a good starting point for an implementation patterned on the official repo:\nKnob Recommended start Why Fixed training/eval budget 300 s Karpathy’s public default; forces comparable experiments Editable surface 1 file / 1 module Keeps context small and diffs reviewable Acceptance threshold \\(\\varepsilon\\) 0 for stable metrics; positive margin for noisy metrics Avoids keeping noise Crash retries 1–2 More than that usually wastes budget Timeout 2× nominal budget Catches hangs without punishing small overheads Proposal temperature Low to medium Enough diversity without chaotic patches Patch size limit Small-to-medium Encourages local search and easier debugging Repeated evaluations \\(K\\) 1 if stable, 3–5 if noisy Reduces false positives Parallel workers 1 initially Add more only after the sequential loop is trustworthy Complexity regularizer Explicit Prevents microscopic gains from bloating code The first three defaults come from the original setup; the rest are natural production hardening for the same loop.\nPyTorch sketch For a PyTorch implementation, the official repo is already the model to follow: keep the mutable training target as a normal Python script and invoke it as a subprocess. 
That is a good fit because it makes each candidate naturally sandboxable and lets you parse metrics from stdout or structured logs.\n# evaluator.py from dataclasses import dataclass import subprocess import re from pathlib import Path @dataclass class EvalResult: ok: bool metric: float | None status: str log_path: Path def run_candidate(repo_dir: str, timeout_s: int = 600) -\u0026gt; EvalResult: log_path = Path(repo_dir) / \u0026#34;run.log\u0026#34; cmd = f\u0026#34;cd {repo_dir} \u0026amp;\u0026amp; uv run train.py \u0026gt; run.log 2\u0026gt;\u0026amp;1\u0026#34; try: subprocess.run(cmd, shell=True, check=True, timeout=timeout_s) text = log_path.read_text() m = re.search(r\u0026#34;^val_bpb:\\s*([0-9.]+)\u0026#34;, text, flags=re.M) return EvalResult(ok=bool(m), metric=float(m.group(1)) if m else None, status=\u0026#34;ok\u0026#34; if m else \u0026#34;parse_fail\u0026#34;, log_path=log_path) except subprocess.TimeoutExpired: return EvalResult(ok=False, metric=None, status=\u0026#34;timeout\u0026#34;, log_path=log_path) except subprocess.CalledProcessError: return EvalResult(ok=False, metric=None, status=\u0026#34;crash\u0026#34;, log_path=log_path) This subprocess-centered design is one reason the approach is so practical for PyTorch and general Python codebases. It is not elegant in a functional-programming sense, but it lines up perfectly with how coding agents already operate on repositories.\nJAX sketch For JAX, avoid making the agent rewrite a huge monolithic script. A better JAX translation is: freeze a pure run_experiment(config) wrapper, let the agent modify a small config/module surface, and make compilation behavior explicit. 
Treat this as a tutorial translation of the workflow, not as a claim about an official JAX port.\n# jax_runner.py from dataclasses import dataclass import jax import jax.numpy as jnp @dataclass class EvalResult: ok: bool metric: float status: str def run_experiment(cfg) -\u0026gt; EvalResult: # compile/warmup should be separated from measured budget if possible params, state = init_model_and_state(cfg) params, state = train_for_fixed_steps_or_time(params, state, cfg) metric = evaluate_bpb_or_task_metric(params, state, cfg) return EvalResult(ok=True, metric=float(metric), status=\u0026#34;ok\u0026#34;) In JAX, the main extra engineering issue is compilation. Karpathy’s PyTorch repo explicitly excludes startup/compilation from the five-minute training budget, and you should preserve that idea in JAX even more carefully, because otherwise you risk optimizing for compilation artifacts rather than for steady-state training quality.\nComplexity analysis Let:\n\\(N\\) = number of experiments, \\(C_p\\) = agent proposal/planning time, \\(C_a\\) = patch-apply and bookkeeping time, \\(C_r\\) = runtime of the actual experiment plus evaluation, \\(B\\) = number of parallel workers. For the default sequential loop, wall-clock complexity is approximately\n\\[ T_{\\text{seq}} \\approx N(C_p + C_a + C_r). \\]In practice, \\(C_r\\) dominates when experiments take minutes, which is exactly why Karpathy’s loop is productive: the agent’s planning overhead is small relative to the experiment runtime.\nMemory usage for the single-branch version is modest on the orchestration side:\n\\[ M_{\\text{seq}} = O(|c| + |\\mathcal H|) + M_{\\text{runtime}}, \\]where \\(|c|\\) is the mutable codebase footprint, \\(|\\mathcal H|\\) is log/history size, and \\(M_{\\text{runtime}}\\) is dominated by the model/training process. 
In the basic design, orchestration memory is tiny compared with GPU memory.\nIf you parallelize to \\(B\\) independent workers with good coordination, idealized wall-clock drops to roughly\n\\[ T_{\\text{par}} \\approx \\frac{N}{B}(C_p + C_a + C_r) + C_{\\text{coord}}, \\]with total compute cost still scaling roughly linearly in \\(N\\). Public collaborative and cluster examples show that parallelism changes not only wall-clock but also search behavior, because it allows you to test combinations in waves rather than serially.\nThe important takeaway is that autoresearch is attractive exactly when evaluation is expensive enough that automation matters, but cheap enough that you can afford many iterations. If one experiment takes five minutes, you can do about a hundred overnight. If one experiment takes two days, you are in a very different regime and probably need more structured search, better priors, or heavier parallelism.\nIn one sentence: Karpathy’s public autoresearch is best understood as an autoregressive coding agent wrapped around a keep-or-revert experiment loop over executable states. 
It is rigorous enough to analyze as a search algorithm, practical enough to ship as a tiny repo, and different enough from standard decoding methods that it deserves to be thought of as its own pattern—provided you remember that the real hero is not the language model alone, but the frozen evaluator plus repeated external feedback.\n","date":"2026-01-01","description":"","featured":false,"featured_image":"mathematicsforai/chaptero07.png","permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-07/section-03/","popular":false,"readingTime":21,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-07/section-03/","section":"courses","series":"","summary":"\u003ch1 id=\"karpathy-autoresearch-explained\"\u003eKarpathy Autoresearch Explained\u003c/h1\u003e\n\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eThis lesson introduces \u003cstrong\u003eautoresearch\u003c/strong\u003e as a practical workflow for letting an AI coding agent run experiments without waiting for a human to choose every next step. The basic pattern is simple: define the goal, freeze the evaluator, let the agent propose code changes, run the experiment, keep the change only if the metric improves, and repeat. The public examples make the idea concrete: single-GPU overnight runs improved \u003ccode\u003eval_bpb\u003c/code\u003e from \u003ccode\u003e0.997900\u003c/code\u003e to \u003ccode\u003e0.969686\u003c/code\u003e in 126 experiments on an H100, and those smaller depth-12 findings later transferred to larger depth-24 \u003ccode\u003enanochat\u003c/code\u003e runs, reducing the \u0026ldquo;time to GPT-2\u0026rdquo; leaderboard entry from \u003cstrong\u003e2.02 hours to 1.80 hours\u003c/strong\u003e, with a later entry at \u003cstrong\u003e1.65 hours\u003c/strong\u003e. 
The rest of this section turns that workflow into a tutorial: first the naming and intuition, then the loop, comparisons, implementations, strengths, limitations, and a practical recipe for building a similar system.\u003c/p\u003e","tags":[],"title":"Karpathy autoresearch","type":"courses"},{"categories":[],"content":"","date":"2026-01-01","description":"","featured":false,"featured_image":null,"permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/courses/course/chapter-08/section-01/","popular":false,"readingTime":0,"relPermalink":"/Sang_e_Mehrab/courses/course/chapter-08/section-01/","section":"courses","series":"","summary":"","tags":[],"title":"Section 8.1","type":"courses"},{"categories":["AI","Guides"],"content":"This site is built as a structured technical notebook: part personal blog, part research archive, and part course platform.\nThe goal is not only to publish isolated posts, but to keep related ideas connected. Articles can stand alone, research notes can carry formal metadata, and course chapters can grow into long sequences without changing the visual language of the site.\nWhy structure matters Technical writing becomes easier to read when the page itself has predictable structure. A reader should be able to recognize the title, summary, table of contents, citation tools, figures, equations, and navigation without needing to relearn the interface on every page.\nThat is why the site uses shared cards, article surfaces, course lists, research panels, and consistent typography.\nA place for mathematics The design is also meant to support mathematical writing. Inline notation such as $x \\in \\mathbb{R}^d$ should feel natural inside prose, while display equations should have enough space to breathe:\n$$ \\hat{y} = x^\\top w + b $$The same structure can carry a short article, a research sketch, or a full course section.\nDirection The archive will grow around artificial intelligence, mathematics, physics, and research practice. 
The important thing is that each new page should feel like it belongs to the same intellectual room: warm, readable, and careful.\n","date":"2026-04-12","description":"A short note on the publishing system behind this blog: structured pages, mathematical typography, research notes, and reusable layouts for long-form technical writing.","featured":true,"featured_image":"mathematicsforai/Chapter_01_Language_illustrative_image.png","permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/posts/2026/04/building-a-systematic-technical-blog/","popular":true,"readingTime":1,"relPermalink":"/Sang_e_Mehrab/posts/2026/04/building-a-systematic-technical-blog/","section":"posts","series":"Foundations","summary":"A short note on building a technical blog that can hold articles, research notes, mathematics, and long-form course material without losing structure.","tags":["workflow","hugo","tutorial","mathematics"],"title":"Building a Systematic Technical Blog","type":"posts"},{"categories":["AI"],"content":"Hey, I am Anwar Shamim.\n","date":"2026-04-08","description":"Short introduction page used as a simple notebook-style entry in the archive.","featured":false,"featured_image":null,"permalink":"https://anwarshamim01.github.io/Sang_e_Mehrab/posts/2026/04/intro/","popular":true,"readingTime":1,"relPermalink":"/Sang_e_Mehrab/posts/2026/04/intro/","section":"posts","series":"Foundations","summary":"A brief introduction to the site, its themes, and how the writing is organized.","tags":["workflow"],"title":"Intro","type":"posts"}]