--- title: "IndexNumR: A Package for Index Number Calculation" author: "Graham White" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true bibliography: "indexnumrbib.bib" vignette: > %\VignetteIndexEntry{IndexNumR: A Package for Index Number Calculation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} library(IndexNumR) library(tidyr) #devtools::load_all() ``` # Introduction *IndexNumR* is a package for computing indices of aggregate prices or quantities using information on the prices and quantities on multiple products over multiple time periods. Such numbers are routinely computed by statistical agencies to measure, for example, the change in the general level of prices, production inputs and productivity for an economy. Well known examples are consumer price indices and producer price indices. In recent years, advances have been made in index number theory to address biases in many well known and widely used index number methods. One area of development has been the adaptation of multilateral methods, commonly used in cross-sectional comparisons, to the time series context. This typically involves more computational complexity than bilateral methods. *IndexNumR* provides functions that make it easy to estimate indices using common index number methods, as well as multilateral methods. ## Packages providing related functionality * micEconIndex: produces price or quantity indices using the Paasche, Laspeyres or Fisher index methods. * multilaterals: provides multilateral indices for cross-section and panel data. Can also produce bilateral indices using Paasche, Laspeyres, Fisher and Tornqvist methods. * productivity: calculates indices of productivity and profitability using common index number methods. * IndexNumber: calculates Laspeyres, Paasche or Fisher indexes. * PriceIndices: produces indices for a wide range of bilateral and multilateral methods, but limited to a monthly frequency. Provides splicing methods for extending multilateral indices as well as functions for preparation of the data. # The *IndexNumR* package ## Data organisation This first section covers the inputs into the main index number functions and how the data are to be organised to use these functions. ### Index number input dataframe The index number functions such as `priceIndex`, `quantityIndex` and `GEKSIndex` all take a dataframe as their first argument. This dataframe should contain everything needed to compute the index. In general this includes columns for, * prices * quantities * a time period variable (more on this below) * a product identifier that uniquely identifies each product. One exception to the above is when elementary indexes are estimated using the `priceIndex` function. A quantity variable is not required in this case because the index is unweighted, and in many cases quantities may not be available (for example, when statistical agencies collect sample prices on individual products). The dataframe must have column names, since character strings are used in other arguments to the index number functions to specify which columns contain the data listed above. Column names can be set with the `colnames` function of base R. The sample dataset CES_sigma_2 is an example of the minimum dataframe required to compute an index. ```{r head_CES} head(CES_sigma_2) ``` In this case, the dataframe is sorted by the product identifier *prodID*, but it need not be sorted at all. ### The time period variable To be able to compute indices, the data need to be subset in order to extract all observations on products for given periods. The approach used in *IndexNumR* is to require a time period variable as an input into many of its functions that will be used for subsetting. This time period variable must satisfy the following, * start at 1 * increase in integer increments of 1 * continuous (that is, no gaps). The variable may, and in fact likely will, have many observations for a given time period, since there are generally multiple items with price and quantity information. For example, the CES_sigma_2 dataset has observations on 4 products for each time period. We can see this by observing the first few rows of the dataset sorted by the time period. ```{r head_CES_ordered} head(CES_sigma_2[order(CES_sigma_2$time),]) ``` The user can provide their own time variable, or if a date variable is available, *IndexNumR* has four functions that can compute the required time variable: `yearIndex`, `quarterIndex`, `monthIndex` and `weekIndex`. Users should be aware that if there are a very large number of observations then these functions can take some time to compute, but once it has been computed it is easier and faster to work with than dates. ### Time aggregation A related issue is that of aggregating data collected at some higher frequency, to a lower frequency. When computing index numbers, this is often done by computing a *unit value* as follows, \begin{equation} UV_{t} = \frac{\sum_{i=1}^{N}p^{t}_{n}q^{t}_{n}}{\sum_{i=1}^{N}q^{t}_{n}} \end{equation} That is, sum up total expenditure on each item over the required period, and divide by the total quantity. Provided that a time period variable as described above is available, the unit values can be computed using the function `unitValues`. This function returns the unit values, along with the aggregate quantities for each time period and each product. The output will also include the product identifier and time period variable so the output dataframe from the `unitvalues` function contains everything needed to compute an index number. ## Sample data *IndexNumR* provides a sample dataset, `CES_sigma_2` that contains prices and quantities on four products over twelve time periods that are consistent with consumers displaying CES preferences with an elasticity of substitution equal to two. This dataset is calculated using the method described in [@df:2017]. We start with prices for each of $n$ products in each of $T$ time periods, an n-dimensional vector of preference parameters $\alpha$, and a T-dimensional vector of total expenditures. Then calculate the expenditure shares for each product in each time period using, \begin{equation} s_{tn} = \frac{\alpha_{n}p_{tn}^{1-\sigma}}{\sum_{n=1}^{N}\alpha_{n}p_{tn}^{1-\sigma}} \end{equation} and use those shares to calculate the quantities, \begin{equation} q_{tn} = \frac{e_{t}s_{tn}}{p_{tn}} \end{equation} *IndexNumR* provides the function `CESData` to produce datasets assuming CES preferences as above for any elasticity of substitution $\sigma$, using the prices, $\alpha$, and expenditure values assumed in [@df:2017]. The vector $\alpha$ is, \begin{equation} \alpha = \begin{bmatrix} 0.2 & 0.2 & 0.2 & 0.4 \end {bmatrix} \end{equation} and the prices and expenditures are, ----------------------------------- t p1 p2 p3 p4 e ----- ----- ----- ----- ----- ----- 1 2.00 1.00 1.00 0.50 10 2 1.75 0.50 0.95 0.55 13 3 1.60 1.05 0.90 0.60 11 4 1.50 1.10 0.85 0.65 12 5 1.45 1.12 0.40 0.70 15 6 1.40 1.15 0.80 0.75 13 7 1.35 1.18 0.75 0.70 14 8 1.30 0.60 0.72 0.65 17 9 1.25 1.20 0.70 0.70 15 10 1.20 1.25 0.40 0.75 18 11 1.15 1.28 0.70 0.75 16 12 1.10 1.30 0.65 0.80 17 ----------------------------------- ## Matched-sample indexes A common issue when computing index numbers is that the sample of products over which the index is computed changes over time. Since price and quantity information is generally needed on the same set of products for each pair of periods being compared, the index calculation functions provided in *IndexNumR* provide the option `sample="matched"` to use only a matched sample of products. How this performs the matching depends on whether the index is bilateral or multilateral. For bilateral indices the price and quantity information will be extracted for a pair of periods, any non-overlapping products removed, and the index computed over these matched products. This is repeated for each pair of periods over which the index is being computed. For multilateral indexes it is somewhat different. For the GEKS index, the matching is performed for each bilateral comparison that enters into the calculation of the multilateral index (see section on the GEKS index below). For the Geary-Khamis and Weighted-Time-Product-Dummy methods, matching can be performed over each window of data. That is, only products that appear in all time periods within each calculation window are kept. For these two indexes a matched sample is not required; by default, *IndexNumR* will set price and quantity to zero for all missing observations, to allow the index to be computed. For the WTPD index, this can be shown to give the same result as running a weighted least squares regression on the available pooled data. Matched-sample indexes may suffer from bias. As a simple assessment of the potential bias, the function `evaluateMatched` calculates the proportion of total expenditure that the matched sample covers in each time period. The function provides output for expenditure as well as counts and can evaluate overlap using either a chained or fixed base index. The first four columns of the output presents the base period information base_index (the time index of the base period), base (total base period expenditure or count), base_matched (the expenditure or count of the base period for matched products), base_share (share of total expenditure in the base period that remains after matching). Columns 5-8 report the same information for the current period. Columns 4 and 8 can be expressed as, \begin{equation} \lambda_{t} = \frac{\sum_{I\in I(1)\cap I(0)}p_{n}^{t}q_{n}^{t}}{\sum_{I\in I(t)}p_{n}^{t}q_{n}^{t}} \quad \text{for } t \in \{1,0\}, \end{equation} where $I(t)$ is the set of products available in period $t$, $t=1$ refers to the current period as is used to compute column 8 and $t=0$ refers to the comparison period, which is used to compute column 4. The count matrix has two additional columns, "new" and "leaving". The new column gives the number of products that exist in the current period but not the base period (products entering the sample). The leaving column gives the count of products that exist in the base period but not the current period (products leaving the sample). Matching removes both of these types of products. ## Data imputation ### Carry forward/backward prices An alternative to using a matched sample of products is to impute the missing data. One technique for doing this is to replace missing values with the last actual price observation. If the data has both prices and quantities then the corresponding quantities are set to zero. If the missing observations occur at the beginning of the time series then the first actual observation is carried backward to the first time period. *IndexNumR* performs this carry price imputation with the `imputeCarryPrices` function; however, this is only needed if the imputed data themselves are of interest. Otherwise, the price index functions can use carry price imputation by setting the parameter `imputePrices = "carry"`. In the example below, the first two observations on product 1 are missing, so the price from the third period is carried backwards to fill the missing observations. Observations 3 and 4 are missing on product 2, so the price in period 2 is carried forward to fill them. The corresponding quantities are set to zero. ```{r carry} # create a dataset with some missing observations on product 1 and 2 df <- CES_sigma_2[-c(1,2,15,16),] df <- df[df$prodID %in% 1:2 & df$time <= 6,] dfMissing <- df[, c("time", "prices", "prodID")] %>% tidyr::pivot_wider(id_cols = time, names_from = prodID, values_from = prices) dfMissing[order(dfMissing$time),] # compute carry prices carryPrices <- imputeCarryPrices(df, pvar = "prices", qvar = "quantities", pervar = "time", prodID = "prodID") # print the data with the product prices in columns to see the filled data carryPrices[, c("time", "prices", "prodID")] %>% tidyr::pivot_wider(id_cols = time, names_from = prodID, values_from = prices) # print the data with the product quantities in columns to see the corresponding zeros carryPrices[, c("time", "quantities", "prodID")] %>% tidyr::pivot_wider(id_cols = time, names_from = prodID, values_from = quantities) ``` ## Bilateral index numbers Bilateral index numbers are those that examine the movement between two periods. All of the bilateral index numbers can be computed as period-on-period, chained or fixed base. Period-on-period simply measures the change from one period to the next. Chained indices give the cumulative change, and it is calculated as the cumulative product of the period-on-period index. The fixed base index compares each period to the base period. This is also called a direct index, because unlike a chained index, it does not go through other periods to measure the change since the base period. Formulae used to compute the bilateral index numbers from period t-1 to period t are as below. * Carli index [@carli:1804], \begin{equation*} P(p^{t-1},p^{t}) = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{p^{t}_{n}}{p^{t-1}_{n}}\right) \end{equation*} * Jevons index [@jevons:1865], \begin{equation*} P(p^{t-1},p^{t}) = \prod_{n=1}^{N}\left(\frac{p^{t}_{n}}{p^{t-1}_{n}}\right)^{(1/N)} \end{equation*} * Dutot index [@dutot:1738], \begin{equation*} P(p^{t-1},p^{t}) = \frac{\sum_{n=1}^{N}p^{t}_{n}}{\sum_{n=1}^{N}p^{t-1}_{n}} \end{equation*} * Laspeyres index [@lasp:1871], \begin{equation*} P(p^{t-1},p^{t},q^{t-1}) = \frac{\sum_{n=1}^{N}p^{t}_{n}q^{t-1}_{n}}{\sum_{n=1}^{N}p^{t-1}_{n}q^{t-1}_{n}} \end{equation*} * Paasche index [@paasche:1874] \begin{equation*} P(p^{t-1},p^{t},q^{t}) = \frac{\sum_{n=1}^{N}p^{t}_{n}q^{t}_{n}}{\sum_{n=1}^{N}p^{t-1}_{n}q^{t}_{n}} \end{equation*} * Geometric Laspeyres index [@konusByush:1926] \begin{equation*} P(p^{t-1},p^{t},q^{t-1}) = \prod_{n=1}^{N}\left(\frac{p^{t}_{n}}{p^{t-1}_{n}}\right)^{s^{t-1}_{n}}, \end{equation*} where $s^{t}_{n} = \frac{p^{t}_{n}q^{t}_{n}}{\sum_{n=1}^{N}p^{t}_{n}q^{t}_{n}}$ is the share of period $t$ expenditure on good $n$. * Geometric Paasche index [@konusByush:1926] \begin{equation*} P(p^{t-1},p^{t},q^{t}) = \prod_{n=1}^{N}\left(\frac{p^{t}_{n}}{p^{t-1}_{n}}\right)^{s^{t}_{n}}, \end{equation*} where $s^{t}_{n}$ is defined as above for the geometric laspeyres index. * Lowe index [@lowe:1823] \begin{equation*} P(p^{t-1},p^{t},q^{b}) = \frac{\sum_{n=1}^{N}p^{t}_{n}q^{b}_{n}}{\sum_{n=1}^{N}p^{t-1}_{n}q^{b}_{n}}, \end{equation*} where $b$ can be any period, or range of periods, in the dataset. * Young index [@young:1812] \begin{equation*} P(p^{t-1},p^{t},p^{b},q^{b}) = \sum_{n=1}^{N}s^{b}_{n}\frac{p^{t}_{n}}{p^{t-1}_{n}}, \end{equation*} where $b$ can be any period, or range of periods, in the dataset. * Drobish index [@drobish:1871] \begin{equation*} P(p^{t-1},p^{t},q^{t-1},q^{t}) = (P_{L}+P_{P})/2, \end{equation*} where $P_{L}$ is the Laspeyres price index and $P_{P}$ is the Paasche price index. * Marshall-Edgeworth index [@marshall:1887], [@edgeworth:1925] \begin{equation*} P(p^{t-1},p^{t},q^{t-1},q^{t}) = \frac{\sum_{n=1}^{N}p_{n}^{t}(q_{n}^{t-1}+q_{n}^{t})}{\sum_{n=1}^{N}p_{n}^{t-1}(q_{n}^{t-1}+q_{n}^{t})} \end{equation*} * Palgrave index [@palgrave:1886] \begin{equation*} P(p^{t-1},p^{t},q^{t-1},q^{t}) = \sum_{n=1}^{N}s^{t}_{n}\frac{p^{t}_{n}}{p^{t-1}_{n}}, \end{equation*} where $s^{t}_{n}$ is defined as above for the geometric laspeyres index. * Fisher index [@fisher:1921], \begin{equation*} P(p^{t-1},p^{t},q^{t-1},q^{t}) = [P_{P}P_{L}]^{\frac{1}{2}}, \end{equation*} where $P_{P}$ is the Paasche index and $P_{L}$ is the Laspeyres index. The Fisher index has other representations, but this is the one used by *IndexNumR* in its computations. * Tornqvist index [@torn:1936; @torntorn:1937], \begin{equation*} P(p^{t-1},p^{t},q^{t-1},q^{t}) = \prod_{n=1}^{N}\left(\frac{p^{t}_{n}}{p^{t-1}_{n}}\right)^{\left(s^{t-1}_{n}+s^{t}_{n}\right)/2}, \end{equation*} where $s^{t}_{n}$ is defined as above for the geometric laspeyres index. * Walsh index, \begin{equation*} P(p^{t-1},p^{t},q^{t-1},q^{t}) = \frac{\sum_{n=1}^{N}\sqrt{q^{t-1}_{n}q^{t}_{n}}\cdot p^{t}_{n}}{\sum_{n=1}^{N}\sqrt{q^{t-1}_{n}q^{t}_{n}}\cdot p^{t-1}_{n}} \end{equation*} * Sato-Vartia index [@sato:1976; @vartia:1976], \begin{equation*} P(p^{t-1},p^{t},q^{t-1},q^{t}) = \prod_{n=1}^{N}\left(\frac{p^{t}_{n}}{p^{t-1}_{n}}\right)^{w_{n}} \end{equation*} where the weights are normalised to sum to one, \begin{equation*} w_{n} = \frac{w^{*}_{n}}{\sum_{n=1}^{N}w^{*}_{n}} \end{equation*} and $w^{*}_{n}$ is the logarithmic mean of the shares, \begin{equation*} w^{*}_{n} = \frac{s^{t}_{n}-s^{t-1}_{n}}{\log (s^{t}_{n}) - \log (s^{t-1}_{n})} \end{equation*} * Geary-Khamis [@khamis:1972] \begin{equation*} P(p^{t-1},p^{t},q^{t-1},q^{t}) = \frac{\sum_{n=1}^{N}h(q^{t-1}_{n}, q^{t}_{n})p^{t}_{n}}{\sum_{n=1}^{N}h(q^{t-1}_{n}, q^{t}_{n})p^{t-1}_{n}} \end{equation*} where h() is the harmonic mean. * Stuvel index [@stuvel:1957] \begin{equation*} P(p^{t-1},p^{t},q^{t-1},q^{t}) = A + \sqrt{A^2 + V^{t}/V^{t-1}}, \end{equation*} where $V^{t}$ is value of total sales in period $t$, $A = (P_{L}-Q_{L})/2$, $P_{L}$ is the laspeyres price index and $Q_{L}$ is the laspeyres quantity index. * CES index, also known as the Lloyd-Moulton index [@lloyd:1975; @Moult:1996], \begin{equation*} P(p^{t-1},p^{t},q^{t-1}) = \left[\sum_{n=1}^{N}s_{n}^{t-1}\left(\frac{p^{t}_{n}}{p^{t-1}_{n}}\right)^{(1-\sigma)}\right]^{\left(\frac{1}{1-\sigma}\right)}, \end{equation*} where $\sigma$ is the elasticity of substitution. ### Time dummy methods * Time-product-dummy This is a regression model approach where log prices are modelled as a function of time and product dummies. The regression equation is given by, \begin{equation*} \ln{p_{n}^{t}} = \alpha + \beta_{1} D^{t} + \sum_{n = 2}^{N}\beta_{n}D_{n} + \epsilon_{n}^{t}, \end{equation*} where $D^{t}$ is equal to 1 in period $t$ and 0 in period $t-1$, and $D_{n}$ is equal to 1 if the product is product $n$ and 0 otherwise. The price index is then given by, \begin{equation*} P(p^{t-1},p^{t},q^{t-1},q^{t}) = \exp({\hat{\beta_{1}}}) \end{equation*} However, this is a biased estimate [@kennedy:1981], so *IndexNumR* optionally calculates the following adjusted estimate, \begin{equation*} P(p^{t-1},p^{t},q^{t-1},q^{t}) = \exp({\hat{\beta_{1}} - 0.5 \times Var(\hat{\beta_{1}})}) \end{equation*} The time-product-dummy equation can be estimated using three methods in IndexNumR using the `weights` parameter: ordinary least squares; weighted least squares where the weights are the product expenditure shares; or weighted least squares where the weights are the average of the expenditure shares in the two periods. In the first case, the index produced is the same as the matched sample Jevons index, which does not use quantity information. The second option produces a matched sample harmonic share weights index, and the last option produces the matched sample Tornqvist index. See [@diewert:2005b] for a discussion of these results. ### Examples To estimate a simple chained Laspeyres price index, ```{r bilateral_examples} priceIndex(CES_sigma_2, pvar = "prices", qvar = "quantities", pervar = "time", prodID = "prodID", indexMethod = "laspeyres", output = "chained") ``` Estimating multiple different index numbers on the same data is straight-forward, ```{r multiple_bilateral} methods <- c("laspeyres","paasche","fisher","tornqvist") prices <- lapply(methods, function(x) {priceIndex(CES_sigma_2, pvar = "prices", qvar = "quantities", pervar = "time", prodID = "prodID", indexMethod = x, output = "chained")}) as.data.frame(prices, col.names = methods) ``` This illustrates the Laspeyres index's substantial positive bias, the Paasche index's substantial negative bias, and the similar estimates produced by the Fisher and Tornqvist superlative index numbers. ## The elasticity of substitution parameter The CES index number method requires an elasticity of substitution parameter in order to be calculated. *IndexNumR* provides a function `elasticity` to estimate the elasticity of substitution parameter, following the method of [@balk:2000]. The basic method is to solve for the value of the elasticity of substitution that equates the CES index to a comparison index. One comparison index noted by Balk is the 'current period' CES index, \begin{equation} \left[\sum_{n=1}^{N}s_{n}^{t}\left(\frac{p^{t}_{n}}{p^{t-1}_{n}}\right)^{-(1-\sigma)}\right]^{\left(\frac{-1}{1-\sigma}\right)}. \end{equation} Therefore, we numerically calculate the value of $\sigma$ that solves, \begin{equation} \left[\sum_{n=1}^{N}s_{n}^{t-1}\left(\frac{p^{t}_{n}}{p^{t-1}_{n}}\right)^{(1-\sigma)}\right]^{\left(\frac{1}{1-\sigma}\right)} - \left[\sum_{n=1}^{N}s_{n}^{t}\left(\frac{p^{t}_{n}}{p^{t-1}_{n}}\right)^{-(1-\sigma)}\right]^{\left(\frac{-1}{1-\sigma}\right)} = 0. \end{equation} This is done using the `uniroot` function of the *stats* package distributed with base R. Note that this equation can be used to solve for sigma for any $t=2,\cdots,T$, so there are $T-1$ potential estimates of sigma. The `elasticity` function will return all $T-1$ estimates as well as the arithmetic mean of the estimates. In addition to the current period CES index, Balk also notes that the Sato-Vartia index can be used, while [@idf:2010] note that a Fisher index could be used. Any of these three indexes can be used as the comparison index by specifying the `compIndex` option as either `"fisher"`, `"ces"` or `"satovartia"`. The current period CES index is the default. The dataset available with *IndexNumR*, CES_sigma_2, was calculated assuming a CES cost function with an elasticity of substitution equal to 2. Running the `elasticity` function on this dataset, ```{r elasticity} elasticity(CES_sigma_2, pvar="prices", qvar="quantities", pervar="time", prodID="prodID", compIndex="ces") ``` which recovers the value of $\sigma$ used to construct the dataset. There is one additional item of output labelled 'diff'. This is the value of the difference between the CES index and the comparison index and is returned so that the user can check that the value of this difference is indeed zero. If it is non-zero then it may indicate that `uniroot` was not able to find a solution, within the specified upper and lower bounds for $\sigma$. These bounds can be changed with the options `upper` and `lower` of the `elasticity` function. The defaults are 20 and -20 respectively. ## Chain-linked indices and the linking period problem One problem with chain-linked indices is the potential for chain drift. Take an example where prices increase in one period and then return to their original level in the next period. An index suffering from chain-drift will increase when prices increase, but won't return to its original level when prices do. In the above examples, it was noted that there is substantial positive bias in the Laspeyres index and substantial negative bias in the Paasche index. Part of this is due to chain drift. One way of reducing the amount of chain drift is to choose linking periods that are 'similar' in some sense (alternatively, use a multilateral method). This method of linking has been mentioned by Diewert and Fox [@df:2017], and Hill [@hill:2001] takes the concept further to choose the link period based on a minimum cost spanning tree. ### Dissimilarity measures To choose the linking period we need a measure of the similarity between two periods. For each period we have information on prices and quantities. The Hill (2001) method compares the two periods based on the **Paasche-Laspeyres spread**, \begin{equation} PL (p^{t},p^{T+1},q^{t},q^{T+1}) = \Bigg|{ln\Bigg(\frac{P_{T+1,t}^{L}}{P_{T+1,t}^{P}}\Bigg)}\Bigg|, \end{equation} where $P^{L}$ is a Laspeyres price index and $P^{P}$ is a Paasche price index. Since the Laspeyres and Paasche indices are biased in opposite directions, this choice of similarity measure is designed to choose linking periods that minimise the influence of index number method choice. Alternative measures exist that compute the dissimilarity of two vectors. Two such measures, recommended by Diewert [@Diewert:2002] are the **weighted log-quadratic index of relative price dissimilarity** and the **weighted asymptotically linear index of relative price dissimilarity**, given by the following, \begin{align} LQ(p^{t},p^{T+1},q^{t},q^{T+1}) = \sum_{n=1}^{N}\frac{1}{2}&(s_{T+1,n} + s_{t,n})[ln(p_{T+1,n}/P(p^{t},p^{T+1},q^{t},q^{T+1})p_{t,n})]^{2} \label{eq:logQuadratic} \\ AL(p^{t},p^{T+1},q^{t},q^{T+1}) = \sum_{n=1}^{N}\frac{1}{2}&(s_{T+1,n} + s_{t,n})[(p_{T+1,n}/P(p^{t},p^{T+1},q^{t},q^{T+1})p_{t,n}) + \nonumber \\ & (P(p^{t},p^{T+1},q^{t},q^{T+1})p_{t,n}/p_{T+1,n}) - 2] \end{align} where $P(p^{t},p^{T+1},q^{t},q^{T+1})$ is a superlative index number. Another measure proposed by Fox, Hill and Diewert [@fhd:2004] is a measure of **absolute dissimilarity** given by, \begin{equation} AD(x_{j},x_{k}) = \frac{1}{M+N}\sum_{l=1}^{M+N}\Bigg[ln\Bigg(\frac{x_{kl}}{x_{jl}}\Bigg) - \frac{1}{M+N}\sum_{i=1}^{M+N}ln\Bigg(\frac{x_{ki}}{x_{ji}}\Bigg)\Bigg]^{2} + \Bigg[\frac{1}{M+N}\sum_{i=1}^{M+N}ln\Bigg(\frac{x_{ki}}{x_{ji}}\Bigg)\Bigg]^{2}, \end{equation} where $M+N$ is the total number of items in the vector and $x_{j}$ and $x_{k}$ are the two vectors being compared. The authors use this in the context of detecting outliers, but it can be used to compare the price and quantity vectors of two time periods. One way to do this is to only use price information, or only use quantity information. There are two ways to use both price and quantity information: stack the price and quantity vectors for each time period into a single vector and compare the two `stacked' vectors; or calculate separate measures of absolute dissimilarity for prices and quantities before combining these into a single measure. The former method is simple to implement, but augments the price vector with a quantity vector that may be of considerably different magnitude and variance. Another option is to compute the absolute dissimilarity using prices and quantities separately, then combine them by taking the geometric average. The final measure is the **predicted share measure of relative price dissimilarity** employed by Diewert, Finkel, Sayag and White in the Seasonal Products chapter of the Update of the Consumer Price Index Manual, Consumer Price Index Theory (draft available [here](https://www.imf.org/-/media/Files/Data/CPI/companion-publication/seasonal-products.ashx)). To introduce this measure, first we define some notation. The share of expenditure on product $n$ in period $t$ is given by $s_{t,n} = p_{t,n}q_{t,n}/ \sum_{i=1}^{K}p_{t,i}q_{t,i}$. The 'predicted' share of expenditure on product $n$ in period $t$, using the quantities of period $t$ and the prices of period $r$ is given by $s_{r,t,n} = p_{r,n}q_{t,n}/ \sum_{i=1}^{K}p_{r,i}q_{t,i}$. We also define the predicted share error $e_{r,t,n}$ as the actual share, minus the predicted share $s_{t,n} - s_{r,t,n}$. The predicted share measure of relative price dissimilarity between the periods $t$ and $r$ is given by: \begin{equation} PS_{r,t} = \sum_{n=1}^{N} (e_{r,t,n})^2 + \sum_{n=1}^{N} (e_{t,r,n})^2 \end{equation} When the dataset being used does not have quantities, and an elementary index is being constructed, we cannot compute the shares in the above formulas. In this case, quantities are imputed in such a way that the expenditure shares on each product available in a time period are equal. The quantities are constructed by setting quantity equal to $1/P_{n,t}\times N_{t}$ where $N_{t}$ is the number of products available in period $t$. *IndexNumR* does this with the function `imputeQuantities`; however price indexes can be estimated without calling this function directly. Calling `priceIndex` and setting `qvar = ""` will trigger IndexNumR to impute the quantities used in the estimation of the predicted share relative price dissimilarity measure. ### Estimating similarity-linked indexes *IndexNumR* provides three functions, enabling the estimation of the dissimilarity measures above. The first function `relativeDissimilarity` calculates the Paasche-Laspeyres spread, log-quadratic, asymptotically linear and predicted share relative dissimilarity measures, and the second function `mixScaleDissimilarity` computes the mix, scale and absolute measures of dissimilarity. All three functions provide the same output - a data frame with three columns containing the indices of the pairs of periods being compared in the first two columns and the value of the dissimilarity measure in the third column. Once these have been computed, the function `maximiumSimilarityLinks` can take the output data frame from these functions and compute the maximum similarity linking periods. The function `priceIndex` effectively computes a similarity-linked index as follows, * Compute the measure of dissimilarity between all possible combinations of time periods. * Set the price index to 1 in the first period. * Compute the price index for the second period and chain it with the first period, \begin{equation*} P_{chain}^{2} = P_{chain}^{1} \times P(p^{1},p^{2},q^{1},q^{2}), \end{equation*} where $P(p^{1},p^{2},q^{1},q^{2})$ is any bilateral index number formula. * For each period $t$ from $3,\dots,T$, find the period $t^{min}$ with the minimum dissimilarity, comparing period $t$ to all periods $1, \dots, t-1$. * Compute the similarity chain-linked index number, \begin{equation*} P_{chain}^{t} = P_{chain}^{t^{min}} \times P(p^{t^{min}},p^{t},q^{t^{min}},q^{t}) \end{equation*} ### Examples Using the log-quadratic measure of relative dissimilarity, the dissimilarity between the periods in the `CES_sigma_2` dataset is as follows, ```{r relative_dissimilarity} lq <- relativeDissimilarity(CES_sigma_2, pvar="prices", qvar="quantities", pervar = "time", prodID = "prodID", indexMethod = "fisher", similarityMethod = "logquadratic") head(lq) ``` The output from estimating the dissimilarity between periods can than be used to estimate the maximum similarity links, ```{r similarity_links} maximumSimilarityLinks(lq) ``` To estimate a chained Laspeyres index linking together the periods with maximum similarity as estimated above, ```{r index_similarity} priceIndex(CES_sigma_2, pvar = "prices", qvar = "quantities", pervar = "time", prodID = "prodID", indexMethod = "laspeyres", output = "chained", chainMethod = "logquadratic") ``` ## Multilateral index numbers Multilateral index number methods use data from multiple periods to compute each term in the index. *IndexNumR* provides the functions `GEKSIndex`, `GKIndex` and `WTPDIndex` to use the GEKS, Geary-Khamis or Weighted Time-Product-Dummy multilateral index number methods respectively. ### The GEKS method The GEKS method is attributable to Gini [@gini:1931], Eltito and Koves [@ek:1964], and Szulc [@szulc:1964] in the cross-sectional context. The idea of adapting the method to the time series context is due to Balk [@balk:1981], and developed further by Ivancic, Diewert and Fox [@idf:2011]. The user must choose the size of the window over which to apply the GEKS method, typically one or two years of data plus one period to account for seasonality. Denote this as $w$.The basic method followed by the function `GEKSIndex` is as follows. Choose a period, denoted period $k$, within the window as the base period. Calculate a bilateral index number between period $k$ and every other period in the window. Repeat this for all possible choices of $k$. This gives a matrix of size $w\times w$ of bilateral indexes between all possible pairs of periods within the window. Then compute the GEKS indexes for the first $w$ periods as, \begin{equation} \left[ \prod_{k=1}^{w}P^{k,1} \right]^{1/w}, \left[ \prod_{k=1}^{w}P^{k,2} \right]^{1/w}, \cdots, \left[ \prod_{k=1}^{w}P^{k,w} \right]^{1/w}, \end{equation} where the term $P^{k,t}$ is the bilateral index between period $t$ and base period $k$. *IndexNumR* offers the Fisher, Tornqvist, Walsh, Jevons and time-product-dummy index number methods for the index $P$ via the `indexMethod` option. The Tornqvist index method is the default. The $w\times w$ matrix of bilateral indexes is as follows, \[P = \begin{pmatrix} P^{1,1} & \cdots & P^{1,w} \\ \vdots & \ddots & \vdots \\ P^{w,1} & \cdots & P^{w,w} \end{pmatrix} \] So that the first term of the GEKS index is the geometric mean of the elements in the first column of the above matrix, the second term is the geometric mean of the second column, and so on. Note that *IndexNumR* makes use of two facts about the matrix above to speed up computation: it is (inversely) symmetric so that $P^{j,k} = 1/P^{k,j}$; and the diagonal elements are 1. #### Intersection GEKS (int-GEKS) The intersection GEKS (int-GEKS) was developed by Claude Lamboray and Frances Krsinich [@lambkrsinich:2015] to deal with the asymmetry with which products enter the index, when there are appearing or disappearing products. The issue arises because when calculating the GEKS index comparing two adjacent periods, products that are not matched for the two periods may still contribute to the index, either in period $t-1$ or period $t$, but not both. To see this, note that the GEKS index between periods $t-1$ and $t$, using the window going from period 1 to $w$, can be written as: \begin{equation} P^{t-1,t}_{[1:w]} = \prod_{k=1}^{w}(P_{t-1,k}\times P_{k,t})^{1/w}, \end{equation} where $P_{t-1,k}$ is the bilateral price index between periods $t-1$ and $k$, and $P_{k,t}$ is similarly defined. The usual GEKS procedure performed by *IndexNumR* when using the function `GEKSIndex` and specifying `sample = matched`, performs matching between period $t$ and period $k$ only. The int-GEKS method performs matching between periods $t-1$, $t$ and $k$. Since more matching is performed, fewer data points are used in estimating the index, particularly if product turnover is high. It is also computationally somewhat slower, as more matching must be performed. ### The Geary-Khamis method The Geary-Khamis, or GK method, was introduced by Geary [@geary:1958] and extended by Khamis [@khamis:1970; @khamis:1972]. This method involves calculating a set of quality adjustment factors, $b_{n}$, simultaneously with the price levels, $P_{t}$. The two equations that determine both of these are: \begin{equation} b_{n} = \sum_{t=1}^{T}\left[\frac{q_{tn}}{q_{n}}\right]\left[\frac{p_{tn}}{P_{t}}\right] \end{equation} \begin{equation} P_{t} = \frac{p^{t} \cdot q^{t}} {b \cdot q^{t}} \end{equation} These equations can be solved by an iterative method, where a set of $b_{n}$ are arbitrarily chosen, which can then be used to calculate an initial vector of price levels. This vector of prices is then used to generate a new $b$ vector, and so on until the changes become smaller than some threshold. *IndexNumR* can use the iterative method by specifying the parameter `solveMethod = "iterative"`. However, there is an alternative method using matrix algebra that is significantly more efficient. To use the more efficient method discussed below, specify `solveMethod = "inverse"`. As discussed in [@df:2017] and following Diewert [@diewert:1999], the problem of finding the $b$ vector can be solved using the following system of equations: \begin{equation} \left[I_{N} - C\right]b = 0_{N}, \end{equation} where $I_{N}$ is the $N \times N$ identity matrix, $0_{N}$ is an n-dimensional vector of zeros and the $C$ matrix is given by, \begin{equation} C = \hat{q}^{-1} \sum_{t=1}^{T}s^{t}q^{t\textbf{T}}, \end{equation} where $\hat{q}^{-1}$ is the inverse of an $N \times N$ diagonal matrix $q$, where the diagonal elements are the total quantities purchased for each good over all time periods, $s^{t}$ is a vector of the expenditure shares for time period $t$, and $q^{t\textbf{T}}$ is the transpose of the vector of quantities purchased in time period $t$. It can be shown that the matrix $[I-C]$ is singular so a normalisation is required to solve for $b$. *IndexNumR* follows the method discussed by Irwin Collier Jr. in his comment on [@diewert:1999] and assumes the following normalisation, \begin{equation} \sum_{n=1}^{N}b_{n}q_{n} = 1, \end{equation} which is, in matrix form, \begin{equation} c = R\begin{bmatrix} b_{1}q_{1} \\ \vdots \\ b_{n}q_{n} \end{bmatrix}, \end{equation} where $c$ is the $N \times 1$ vector $\begin{bmatrix} 1 & 0 & \dots & 0 \end{bmatrix}^{\textbf{T}}$, and $R$ is the $N \times N$ matrix, \begin{equation} R = \begin{bmatrix} 1 & 1 & \dots & 1 \\ 0 & \dots & \dots & 0 \\ \vdots & & & \vdots \\ 0 & \dots & \dots & 0 \end{bmatrix} \end{equation} Adding the constraint to the original equation we now have the solution for $b$, \begin{equation} b = [I_{N} - C + R]^{-1}c. \end{equation} Once the $b$ vector has been calculated, the price levels can be computed from the GK equations above. ### The Weighted Time-Product-Dummy method The weighed time-product-dummy method can be seen as the country-product-dummy method [@summers:1973] adapted to the time-series context and supposes the following model for prices: \begin{equation} p_{tn} = \alpha_{t}b_{n}e_{tn}, \end{equation} where $\alpha_{t}$ can be interpreted as the price level in period $t$, $b_{n}$ is the quality adjustment factor for product $n$ and $e_{tn}$ is a stochastic error term. The problem is to solve for $\alpha$ and $b$ using least squares minimisation. Following [@rao:1995], it is formulated as a weighted least squares minimisation, where the weights are based on economic importance. Diewert and Fox show that this can be written as the solution to the system of equations, \begin{equation} [I_{N} - F]\beta = f, \end{equation} where $I_{N}$ is the $N \times N$ identity matrix, $F$ is the following $N \times N$ matrix, \begin{equation} F = \begin{bmatrix} f_{11} & \dots & f_{1N} \\ \vdots & \dots & \vdots \\ f_{N1} & \dots & f_{NN} \end{bmatrix}, \end{equation} the elements of $F$ are the following, \begin{equation} f_{nj} = w_{nj}/\sum_{k=1}^{N}w_{nk} \quad n,j = 1, \dots, N, \end{equation} with the $w_{nj}$ given by, \begin{equation} w_{nj} = \sum_{t=1}^{T}w_{tnj} \quad n,j = 1, \dots, N, \end{equation} and the $w_{tnj}$ given by, \begin{equation} w_{tnj} = s_{tn}s_{tj} \quad n \neq j, n = 1, \dots, N; j = 1, \dots, N; t = 1, \dots, T. \end{equation} $f$ on the right-hand-side is the following, \begin{equation} f = [f_{1}, \dots, f_{N}]^{\textbf{T}}, \end{equation} where the $f_{n}$ are given by, \begin{equation} \sum_{t=1}^{T}\sum_{j=1}^{N}f_{tnj}(y_{tn} - y_{tj}) \quad for \space n = 1, \dots, N \end{equation} and $y_{tn} = log(p_{tn})$. The matrix $[I_{N} - F]$ is singular so a normalisation must be used to solve the system of equations. *IndexNumR* uses the method discussed in [@df:2017]; $\beta_{N}$ is assumed to be zero and the last equation is dropped to solve for the remaining coefficients. ### Extending multilateral indexes The multilateral indexes are normalised by dividing by the first term, to give an index for the first $w$ periods that starts at 1. If the index only covers $w$ periods then no further calculation is required. However, if there are $T>w$ periods in the dataset then the index must be extended. Extending a multilateral index can be done in a multitude of ways. Statistical agencies generally do not revise price indices like the consumer price index, so the methods offered by *IndexNumR* to extend multilateral indexes are methods that do not lead to revisions. More specifically, these are called *splicing methods* and the options available are the *movement*, *window*, *half*, *mean*, *fbew (Fixed Base Expanding Window)*, *fbmw (Fixed Base Moving Window)*, *wisp (window splice on published data)*, *hasp (half-splice on published data)* and *mean splice on published data*. The idea behind most of these methods is that we start by moving the window forward by one period and calculate the index for the new window. There will be $w-1$ overlapping periods between the initial index and the index computed on the window that has been rolled forward one period. Any one of these overlapping periods can be used to extend the multilateral index. The variants of window, half and mean splice that are on published data use the same method as the classical counterparts, but splice onto the published series instead of the previously calculated window. Let $P_{OLD}$ be the index computed over periods $1$ to $w$ and let $P_{NEW}$ be the index computed over the window rolled forward one period, from periods $2$ to $w+1$. Let the final index simply be $P$. For the first $w$ periods $P = P_{OLD}$, then $P^{w+1}$ is computed using the splicing methods as follows. * Movement splice [@idf:2011] \begin{equation} P^{w+1} = P^{w} \times \frac{P_{NEW}^{w+1}}{P_{NEW}^{w}} \end{equation} That is, the movement between the final two periods of the index computed over the new window is used to extend the original index from period $w$ to $w+1$. * Window splice [@krsinich:2016] \begin{equation} P^{w+1} = P^{w} \times \frac{P_{NEW}^{w+1}/P_{NEW}^{2}}{P_{OLD}^{w}/P_{OLD}^{2}} \end{equation} In this case, the ratio of the movement between the first and last periods computed using the new window, to the movement between the first and last periods using the old window is used to extend the original index. * Half splice \begin{equation} P^{w+1} = P^{w} \times \frac{P_{NEW}^{w+1}/P_{NEW}^{\frac{w-1}{2}+1}}{P_{OLD}^{w}/P_{OLD}^{\frac{w-1}{2}+1}} \end{equation} The half splice uses the period in the middle of the window as the overlapping period to calculate the splice. * Mean splice [@idf:2011] \begin{equation} P^{w+1} = P^{w} \times \left( \prod_{t=1}^{w-1} \frac{P_{NEW}^{w+1}/P_{NEW}^{t+1}}{P_{OLD}^{w}/P_{OLD}^{t+1}} \right)^{\frac{1}{(w-1)}} \end{equation} The mean splice uses the geometric mean of the movements between the last period and every other period in the window to extend the original index. * FBMW [@lamboray:2017] \begin{equation} P^{w+1} = P^{base} \times \frac{P_{NEW}^{w+1}}{P_{NEW}^{base}} \end{equation} This method uses a fixed base period that is updated periodically. For example, if the data are monthly then the base period could be each December month, which would be achieved by ensuring that December is the first period in the data and specifying a window length of 13. The splice is calculated by using the movement between the final data point and the base period in the new window to extend the index. If the new data point being calculated is the first period after the base period, then this method produces the same price growth the same as the movement splice. Using the same example, if each December is the base period, then this method will produce the same price growth for January on December as the movement splice. * FBEW [@chessa:2016] This method uses the same calculation as FBMW, but uses a different set of data for the calculation. It expands the size of the window used to compute the new data point each period to include the latest period of data. If the data are monthly and the base period is each December, then the window used to compute the new data point in January includes only the December and January months. In February it includes the December, January and February months, and so on until the next December where it includes the full 13 months (assuming a window length of 13). This method produces the same result as the FBMW method when the new period being calculated is the base period. Using the same example, if each December is the base period, then each December this will produce the same result as the FBMW method. The splicing methods are used in this fashion to extend the series up to the final period in the data. ```{r geks_splicing} # Assume that the data in CES_sigma_2 are quarterly data with time period # 1 corresponding to the December quarter. splices <- c("window", "half", "movement", "mean", "fbew", "fbmw", "wisp", "hasp", "mean_pub") # estimate a GEKS index using the different splicing methods. Under # the above assumptions, the window must be 5 to ensure the base period is # each December quarter. result <- as.data.frame(lapply(splices, function(x){ GEKSIndex(CES_sigma_2, pvar = "prices", qvar = "quantities", pervar = "time", prodID = "prodID", indexMethod = "tornqvist", window=5, splice = x) })) colnames(result) <- splices result ``` On the assumptions in the above example, periods 1, 5 and 9 are Decembers. Periods 1-5 are computed using full information and periods 6-12 are computed using the splicing methods. Notice that fbew = fbmw in period 9 (December) and fbmw = movement in period 6 (January). ## The differences approach to index numbers The above index number methods are derived based on a ratio approach, which decomposes the value change from one period to the next into the product of a price index and a quantity index. An alternative approach is to decompose value change into the sum of a price indicator and a quantity indicator. The theory dates back to the 1920s, and an excellent paper on this approach has been written by Diewert [@diewert:2005a]. There are a number of methods available for computing the indicator, and *IndexNumR* exposes the following, via the `priceIndicator` function: * Laspeyres indicator \begin{equation} I(p^{t-1}, p^{t}) = \sum_{n=1}^{N}q_{n}^{t-1}\times(p_{n}^{t}-p_{n}^{t-1}) \end{equation} * Paasche indicator \begin{equation} I(p^{t-1}, p^{t}) = \sum_{n=1}^{N}q_{n}^{t}\times(p_{n}^{t}-p_{n}^{t-1}) \end{equation} * Bennet indicator [@bennet:1920] \begin{equation} I(p^{t-1}, p^{t}) = \sum_{n=1}^{N} \frac{(q_{n}^{t}+q_{n}^{t-1})}{2} \times(p_{n}^{t}-p_{n}^{t-1}) \end{equation} * Montgomery indicator [@montgomery:1929] \begin{equation} I(p^{t-1}, p^{t}) = \sum_{n=1}^{N} \frac{p_{n}^{t}q_{n}^{t}+p_{n}^{t-1}q_{n}^{t-1}}{log(p_{n}^{t}q_{n}^{t}) - log(p_{n}^{t-1}q_{n}^{t-1})} \times\left(\frac{p_{n}^{t}}{p_{n}^{t-1}}\right) \end{equation} ### Examples Price indicators for the `CES_sigma_2` dataset are as follows: ```{r indicators} methods <- c("laspeyres", "paasche", "bennet", "montgomery") p <- lapply(methods, function(x) {priceIndicator(CES_sigma_2, pvar = "prices", qvar = "quantities", pervar = "time", prodID = "prodID", method = x)}) as.data.frame(p, col.names = methods) ``` Quantity indicators can also be produced using the same methods as outlined above via the `quantityIndicator` function. This allows for the value change from one period to the next to be decomposed into price and quantity movements. To facilitate this, *IndexNumR* contains the `valueDecomposition` function, which can be used as follows to produce a decomposition of the value change for CES_sigma_2 using a Bennet indicator: ```{r value} valueDecomposition(CES_sigma_2, pvar = "prices", qvar = "quantities", pervar = "time", prodID = "prodID", priceMethod = "bennet") ``` Note that for this decomposition, the method is specified for the price indicator and *IndexNumR* uses the appropriate quantity indicator. For Bennet and Montgomery indicators the same method is used for the quantity indicator as for the price indicator. If a Laspeyres price indicator is requested then the corresponding volume indicator is a Paasche indicator. The reverse is true if the Paasche indicator is used for prices. ## Group indexes If a variable is available in the data set that identifies the group to which a product belongs, it is possible to estimate indexes on each of the groups in the sample using the function `groupIndexes`. An example is if products come from different geographic regions, or belong to different product categories. `groupIndexes` will split the data into the different groups and estimate a price index on each group. Any of the price index functions can be used by specifying the `indexFunction` parameter and then supplying the arguments to the price index function as a named list. ```{r groups} # add a group variable to the CES_sigma_2 dataset # products 1 and 2 will be in group 1, products 3 and 4 in group 2 df <- CES_sigma_2 df$group <- c(rep(1, 24), rep(2, 24)) # put the arguments to the priceIndex function into a named list argsList <- list(x = df, pvar = "prices", qvar = "quantities", pervar = "time", prodID = "prodID", indexMethod = "fisher", output = "chained") # estimate a bilateral chained fisher index on the groups groupIndexes("group", "priceIndex", argsList) # put the arguments for the GEKSIndex function in a named list argsGEKS <- list(x = df, pvar = "prices", qvar = "quantities", pervar = "time", prodID = "prodID", indexMethod = "fisher", window = 12) # estimate a GEKS index on the groups groupIndexes("group", "GEKSIndex", argsGEKS) ``` ## Year-over-year Indexes Year-over-year indexes are those that calculate the price change between the same periods across years. If the data are monthly then there are twelve indexes; one for each month of the year. Each element of the January index would measure the price movement from January in the base year to January in the comparison year. The second index does the same for February, and so on. *IndexNumR* provides the function `yearOverYearIndexes` to estimate these, given the frequency as either 'monthly' or 'quarterly'. This is effectively a form of group index, where the month or quarter gives the group to which the product belongs. *IndexNumR* will create a group variable based on the supplied frequency and call the `groupIndexes` function to estimate the indexes. The data must be structured in the frequency that you specify (if you specify 'quarterly' as the frequency in the `yearOverYearIndexes` function, the time period variable in the data set must be quarterly). The output from the `yearOverYearIndexes` function will have a column for the month or quarter. The quarter labelled as quarter 1 will have been constructed from the time periods 1, 5, 9, 13, etc of the data set. The quarter labelled 2 would have been estimated from time periods 2, 6, 10, 14, etc of the data set. ```{r yoy} # Assume the CES_sigma_2 data are quarterly observations over three years. # This results in 4 indexes (one for each quarter) of 3 periods each. # Estimate year-over-year chained fisher indexes. argsList <- list(x = CES_sigma_2, pvar = "prices", qvar = "quantities", pervar = "time", prodID = "prodID", indexMethod = "fisher", output = "chained") yearOverYearIndexes("quarterly", "priceIndex", argsList) ``` # Development *IndexNumR* is hosted on Github at [https://github.com/grahamjwhite/IndexNumR](https://github.com/grahamjwhite/IndexNumR). There users can find instructions to install the development version directly from Github, as well as report and view bugs or improvements. # References