Regression for M&V: Reference Guide
For use in a regression analysis, any categorical variable must be expressed in a binary form, such
as taking the value of 1 for Monday and taking the value of 0 for all other days. This is because all
the variables in a regression model must be linearly related to the dependent variable. A conceptual
category such as day-of-week therefore cannot be included in a regression if it takes values such
as 1 for Monday, 2 for Tuesday, on up through 7 for Sunday; Tuesday does not have twice the impact
on the dependent variable that Monday has, nor does Wednesday have three times the impact.
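As a minimal sketch of the binary form described above, the following Python fragment converts a
list of day names into a 0/1 Monday indicator. The sample days and the helper name is_monday are
hypothetical and chosen only for illustration.

```python
# Hypothetical sample of observation days (illustrative only, not from the guide).
days = ["Monday", "Tuesday", "Wednesday", "Monday", "Sunday"]

def is_monday(day: str) -> int:
    """Return 1 if the day is Monday and 0 for every other day."""
    return 1 if day == "Monday" else 0

monday_flag = [is_monday(d) for d in days]
print(monday_flag)  # [1, 0, 0, 1, 0]

# An ordinal code (Monday=1 ... Sunday=7) would instead force the model to assume
# that each step through the week shifts the dependent variable by the same amount.
```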
As mentioned at the end of the prior section, one needs to take care in adding variables
– such as multiple binary variables to describe a composite concept (such as day-of-week) –
because the model can become over-specified, and the parameter estimates inaccurate and
imprecise. Thus, when creating a set of binary variables to capture a composite categorical
concept, the M&V practitioner should consider the most concise way to express the underlying
relationships between these categories and the dependent variable. Continuing with the
day-of-week example, it may be that activity ramps up during the week; appropriate categories
might be Monday/Tuesday, Wednesday/Thursday/Friday, and Saturday/Sunday, where Mon_Tues has the
value of 1 if the day is a Monday or Tuesday and 0 otherwise, and similarly for the other variables.
Finally, when working with binary variables describing composite categories, the modeler includes
one fewer binary variable in the equation than the total number of categories in the set. Continuing
with the example, when the variables Mon_Tues and Wed_Thurs_Fri both have the value of 0, the
day must be a Saturday or Sunday; it would be redundant (that is, collinear) to add the variable
Sat_Sun.
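The redundancy of the third variable can be checked directly. The sketch below, in plain Python,
encodes one week using the three groupings from the example (the Wednesday/Thursday/Friday variable
is spelled Wed_Thurs_Fri here; the function name encode is hypothetical) and shows that Sat_Sun is
fully determined by the other two variables.

```python
# Day groupings from the example; Sat_Sun acts as the reference category.
GROUPS = {
    "Mon_Tues": {"Monday", "Tuesday"},
    "Wed_Thurs_Fri": {"Wednesday", "Thursday", "Friday"},
    "Sat_Sun": {"Saturday", "Sunday"},
}

def encode(day, include=("Mon_Tues", "Wed_Thurs_Fri")):
    """Return the binary variables named in `include` for one day."""
    return {name: int(day in GROUPS[name]) for name in include}

week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Two variables are enough to identify all three categories.
rows = [encode(day) for day in week]
sat_sun_implied = [1 - r["Mon_Tues"] - r["Wed_Thurs_Fri"] for r in rows]
print(sat_sun_implied)  # [0, 0, 0, 0, 0, 1, 1]

# With all three variables, every row sums to 1, which duplicates the
# constant column that the regression already uses for the intercept.
full_rows = [encode(day, include=tuple(GROUPS)) for day in week]
row_sums = [sum(r.values()) for r in full_rows]
print(row_sums)  # [1, 1, 1, 1, 1, 1, 1]
```

Because the three indicators always sum to the intercept column, a regression that includes all
three plus an intercept has no unique solution for the coefficients.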
According to ASHRAE RP-1050, practitioners using categorical variables commonly err by
inappropriately using them only to change the line’s intercept. The M&V practitioner needs to
carefully consider whether the categorical variable is expected to affect the model’s intercept term,
a slope term, or both. If the slope likely differs among categories, the model must include terms
to capture the interaction of the categorical and continuous variables, which can be tedious and
error-prone to accomplish in Microsoft Excel. (Another solution is to fit separate models for
different levels of the categorical variable.)
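To illustrate what such interaction terms look like, consider the simplest case of one binary
variable D and one continuous variable x. The symbols below are assumptions chosen for this sketch
and are not notation taken from ASHRAE RP-1050.

```latex
y = \beta_0 + \beta_1 x + \beta_2 D + \beta_3 (D \cdot x) + \varepsilon

% When D = 0:  y = \beta_0 + \beta_1 x + \varepsilon
% When D = 1:  y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x + \varepsilon
```

Here \beta_2 shifts the intercept and \beta_3 shifts the slope for the category where D = 1.
Leaving out the D \cdot x term forces both categories to share the slope \beta_1, so the
categorical variable can then change only the intercept.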
An appropriate statistical approach to apply with categorical variables is the General Linear Model
(GLM). Multiple regression is typically used where the independent variables are continuous, but
a general linear model can accommodate both categorical and continuous predictor variables. To
avoid the common pitfall of forcing all categories to share the same slope, it is important to use
the proper GLM method. (Please refer to a statistics text for further discussion of general linear
models. Some resources are noted in the References and Resources section of this document.)
Instead of using a multiple regression of the format in ASHRAE RP-1050, you can create separate
models for each category or combination of categories, and then combine these individual models
into a complete model. The basic process is similar to using IF statements to determine, for each
data point, the category of the categorical independent variable, and then using the intercept and
slope that are appropriate for that category.
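The separate-models process just described might be sketched as follows in plain Python. The
data values, the category names, and the helper names fit_line and predict are all hypothetical;
the sketch assumes one continuous independent variable and a simple straight-line fit per category.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical observations: (category, continuous variable, dependent variable).
data = [
    ("weekday", 10.0, 120.0), ("weekday", 20.0, 140.0), ("weekday", 30.0, 160.0),
    ("weekend", 10.0, 55.0), ("weekend", 20.0, 60.0), ("weekend", 30.0, 65.0),
]

# Fit one model per category, so each category gets its own intercept and slope.
models = {}
for category in sorted({row[0] for row in data}):
    xs = [x for c, x, _ in data if c == category]
    ys = [y for c, _, y in data if c == category]
    models[category] = fit_line(xs, ys)

def predict(category, x):
    """Combined model: look up the category (the IF step), then apply its line."""
    intercept, slope = models[category]
    return intercept + slope * x

print(models["weekday"])        # (100.0, 2.0)
print(models["weekend"])        # (50.0, 0.5)
print(predict("weekend", 25.0)) # 62.5
```

The dictionary lookup in predict plays the role of the IF statements: it selects the intercept
and slope that belong to the data point's category before computing the estimate.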