Technical Report

Introduction

This project analyzes residential housing listings from several cities in Salt Lake and Utah Counties, Utah, to explore, build, and evaluate predictive models for house price.

The motivation for this project is to provide data-driven guidance for buyers, sellers, and local planners by identifying key predictors of sale price and producing models that balance predictive accuracy and interpretability.

Some key goals of this project were (1) to produce clear exploratory analyses that reveal relationships between price and common housing attributes; (2) to fit defensible regression models (i.e. best-subsets by AIC and a LASSO-regularized model) and compare them using cross-validated PMSE; and (3) to document reproducible code and a small interactive Streamlit app for sharing results.

Data Source and Methodology

  1. Data acquisition:
  • Raw listings were collected from UtahRealEstate.com with the repository’s scraping scripts. The primary ingestion function used in this report is data_no_scape(), which returns the raw DataFrame used for downstream processing.
  2. Cleaning pipeline:
  • The cleaning pipeline removes obvious duplicates, drops invalid or malformed entries, and standardizes numeric columns (e.g. beds, baths, sqft). For purposes of analysis, we dropped the original address and mls fields. A minimal sketch of these steps appears after this list.
  3. Analysis workflow:
  • Exploratory data analysis used visual checks (scatterplots and boxplots) to inspect relationships between price and candidate predictors and to assess outliers and heterogeneity across selected cities.
  • Model selection used exhaustive best-subsets regression evaluated by AIC to identify an interpretable model, and LASSO (with a cross-validated penalty) to perform automated variable selection and shrinkage. Models were fit with statsmodels, and design matrices were constructed with patsy.
  • Model assessment: 5-fold cross-validation was used to estimate the predictive mean squared error (PMSE) for competing models.
  4. Tooling and reproducibility:
  • Primary libraries: pandas, numpy, matplotlib, seaborn, statsmodels, patsy, scikit-learn.
  • Project environment and packaging: see pyproject.toml.
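
The sketch below makes step 2 concrete under stated assumptions: the column names (price, beds, baths, sqft, address, mls) follow the report, but clean_listings and its exact rules are illustrative, not the repository's implementation.

import pandas as pd

def clean_listings(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass over the raw scraped listings."""
    out = raw.drop_duplicates()
    # Coerce numeric columns; malformed entries become NaN and are dropped
    for col in ['price', 'beds', 'baths', 'sqft']:
        out[col] = pd.to_numeric(out[col], errors='coerce')
    out = out.dropna(subset=['price', 'beds', 'baths', 'sqft'])
    # Drop identifier fields not used in the analysis
    return out.drop(columns=['address', 'mls'], errors='ignore')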

EDA

import matplotlib.pyplot as plt
import numpy as np
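# df is assumed to be the cleaned listings DataFrame from the pipeline described above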

# Create a 2x2 figure with 4 subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('EDA: Price vs Selected Variables', fontsize=16, fontweight='bold')

# Plot 1 (top-left): scatterplot of price vs bedrooms
axes[0, 0].scatter(df['beds'], df['price'], alpha=0.5, s=50, color='steelblue')
axes[0, 0].set_xlabel('Bedrooms', fontsize=11)
axes[0, 0].set_ylabel('Price ($)', fontsize=11)
axes[0, 0].set_title('Price vs Bedrooms', fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_xlim(0, 10)
axes[0, 0].set_ylim(0, 3_000_000)

# Plot 2 (top-right): scatterplot of price vs bathrooms
axes[0, 1].scatter(df['baths'], df['price'], alpha=0.5, s=50, color='coral')
axes[0, 1].set_xlabel('Bathrooms', fontsize=11)
axes[0, 1].set_ylabel('Price ($)', fontsize=11)
axes[0, 1].set_title('Price vs Bathrooms', fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_xlim(0, 10)
axes[0, 1].set_ylim(0, 3_000_000)

# Plot 3 (bottom-left): scatterplot of price vs square footage
axes[1, 0].scatter(df['sqft'], df['price'], alpha=0.5, s=50)
axes[1, 0].set_xlabel('Square Feet', fontsize=11)
axes[1, 0].set_ylabel('Price ($)', fontsize=11)
axes[1, 0].set_title('Price vs Square Footage', fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_xlim(0, 15_000)
axes[1, 0].set_ylim(0, 4_000_000)

# Plot 4 (bottom-right): boxplot of price by city (categorical)
sub_df = df[df['city'].isin(['lindon', 'orem', 'provo', 'spanish-fork'])]
sub_df.boxplot(column='price', by='city', ax=axes[1, 1])
axes[1, 1].set_xlabel('City', fontsize=11)
axes[1, 1].set_ylabel('Price ($)', fontsize=11)
axes[1, 1].set_title('Price by City', fontweight='bold')
# pandas' boxplot(by=...) overwrites the figure suptitle; restore the custom one
fig.suptitle('EDA: Price vs Selected Variables', fontsize=16, fontweight='bold')
axes[1, 1].set_ylim(0, 2_500_000)

plt.tight_layout(rect=(0, 0, 1, 0.96))  # leave room for the suptitle
plt.show()

Analysis

# Import necessary modules
import itertools
import statsmodels.formula.api as smf
from patsy import dmatrices
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, LassoCV, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

# -----------------------------
# Treat ONLY city and zipcode as categorical
# -----------------------------
cat_vars = ['city', 'zipcode']
for c in cat_vars:
    if c in df.columns:
        df[c] = df[c].astype('category')

# Candidate predictors
candidates = [c for c in df.columns if c != 'price']

# -----------------------------
# Best subsets selection via AIC
# -----------------------------
best_aic = np.inf
best_formula = None
best_model = None

for k in range(1, len(candidates) + 1):
    for subset in itertools.combinations(candidates, k):
        terms = [f"C({v})" if v in cat_vars else v for v in subset]
        formula = "price ~ " + " + ".join(terms)
        try:
            model = smf.ols(formula, data=df).fit()
            if model.aic < best_aic:
                best_aic = model.aic
                best_formula = formula
                best_model = model
        except Exception:
            continue

print("\n==============================")
print("BEST SUBSETS (AIC) MODEL")
print("==============================")
print("Best AIC:", best_aic)
print("Best formula:", best_formula)
print(best_model.summary())

# -----------------------------
# CV PMSE for Best-AIC model
# -----------------------------
y_aic, X_aic = dmatrices(best_formula, data=df, return_type='dataframe')
y_aic = np.ravel(y_aic)

kf = KFold(n_splits=5, shuffle=True, random_state=1)
mses_aic = []

for tr, te in kf.split(X_aic):
    lr = LinearRegression(fit_intercept=False)
    lr.fit(X_aic.iloc[tr], y_aic[tr])
    preds = lr.predict(X_aic.iloc[te])
    mses_aic.append(mean_squared_error(y_aic[te], preds))

pmse_aic = np.mean(mses_aic)

# -----------------------------
# LASSO (lambda = 1 SE rule)
# -----------------------------
# Build full design matrix with dummies
full_formula = "price ~ " + " + ".join(
    [f"C({v})" if v in cat_vars else v for v in candidates]
)

y_lasso, X_lasso = dmatrices(full_formula, data=df, return_type='dataframe')
y_lasso = np.ravel(y_lasso)

# LASSO with standardization
lasso_cv = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", LassoCV(cv=5, random_state=1))
])

lasso_cv.fit(X_lasso, y_lasso)

# Lambda_1se: the largest penalty whose CV error is within one standard error
# of the minimum. LassoCV stores alphas_ in decreasing order, so the first
# index meeting the criterion corresponds to the largest alpha.
mse_path = lasso_cv.named_steps["lasso"].mse_path_.mean(axis=1)
n_folds = lasso_cv.named_steps["lasso"].mse_path_.shape[1]
mse_se = lasso_cv.named_steps["lasso"].mse_path_.std(axis=1) / np.sqrt(n_folds)
idx_min = np.argmin(mse_path)
mse_1se = mse_path[idx_min] + mse_se[idx_min]
idx_1se = np.where(mse_path <= mse_1se)[0][0]

alpha_1se = lasso_cv.named_steps["lasso"].alphas_[idx_1se]

# Fit LASSO at lambda_1se
lasso_1se = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(alpha=alpha_1se))
])

lasso_1se.fit(X_lasso, y_lasso)

# Selected variables
coef = lasso_1se.named_steps["lasso"].coef_
selected = X_lasso.columns[coef != 0]

print("\n==============================")
print("LASSO SELECTED VARIABLES (λ_1se)")
print("==============================")
print(list(selected))

# -----------------------------
# Refit OLS using LASSO-selected variables
# -----------------------------

# Recover original variable names from the dummy-coded design-matrix columns
selected_vars = set()

for col in selected:
    if col.startswith("C("):
        # categorical: extract variable name inside C(...)
        var = col.split("[")[0]          # C(city)
        var = var.replace("C(", "").replace(")", "")
        selected_vars.add(f"C({var})")
    else:
        selected_vars.add(col)

selected_vars = sorted(selected_vars)

print("\nVariables used in LASSO refit:")
print(selected_vars)

lasso_formula = "price ~ " + " + ".join(selected_vars)
lasso_refit = smf.ols(lasso_formula, data=df).fit()

print("\n==============================")
print("OLS REFIT USING LASSO VARIABLES")
print("==============================")
print(lasso_refit.summary())

# -----------------------------
# CV PMSE for LASSO-selected model
# -----------------------------
y_lasso2, X_lasso2 = dmatrices(lasso_formula, data=df, return_type='dataframe')
y_lasso2 = np.ravel(y_lasso2)

mses_lasso = []
for tr, te in kf.split(X_lasso2):
    lr = LinearRegression(fit_intercept=False)
    lr.fit(X_lasso2.iloc[tr], y_lasso2[tr])
    preds = lr.predict(X_lasso2.iloc[te])
    mses_lasso.append(mean_squared_error(y_lasso2[te], preds))

pmse_lasso = np.mean(mses_lasso)

# -----------------------------
# Final comparison
# -----------------------------
print("\n==============================")
print("CROSS-VALIDATED PMSE COMPARISON")
print("==============================")
print(f"Best Subsets (AIC) PMSE : {pmse_aic:.3f}")
print(f"LASSO (1SE) PMSE        : {pmse_lasso:.3f}")

==============================
BEST SUBSETS (AIC) MODEL
==============================
Best AIC: 30231.181453641035
Best formula: price ~ beds + baths + sqft + year_built + lot_size
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.577
Model:                            OLS   Adj. R-squared:                  0.575
Method:                 Least Squares   F-statistic:                     265.4
Date:                Wed, 17 Dec 2025   Prob (F-statistic):          6.79e-179
Time:                        19:34:30   Log-Likelihood:                -15110.
No. Observations:                 978   AIC:                         3.023e+04
Df Residuals:                     972   BIC:                         3.026e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   1.709e+07   2.69e+06      6.345      0.000    1.18e+07    2.24e+07
beds       -3.414e+05   3.96e+04     -8.618      0.000   -4.19e+05   -2.64e+05
baths       1.452e+05   5.43e+04      2.672      0.008    3.86e+04    2.52e+05
sqft         645.0850     31.942     20.196      0.000     582.402     707.767
year_built -8704.3009   1362.142     -6.390      0.000   -1.14e+04   -6031.223
lot_size    5.773e+04   1.02e+04      5.671      0.000    3.78e+04    7.77e+04
==============================================================================
Omnibus:                     1750.557   Durbin-Watson:                   1.882
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          2673338.480
Skew:                          11.796   Prob(JB):                         0.00
Kurtosis:                     258.043   Cond. No.                     3.11e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.11e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

==============================
LASSO SELECTED VARIABLES (λ_1se)
==============================
['C(city)[T.american-fork]', 'C(city)[T.cottonwood-heights]', 'C(city)[T.draper]', 'C(city)[T.eagle-mountain]', 'C(city)[T.highland]', 'C(city)[T.holladay]', 'C(city)[T.lehi]', 'C(city)[T.lindon]', 'C(city)[T.midvale]', 'C(city)[T.millcreek]', 'C(city)[T.murray]', 'C(city)[T.orem]', 'C(city)[T.provo]', 'C(city)[T.salt-lake-city]', 'C(city)[T.sandy]', 'C(city)[T.saratoga-springs]', 'C(city)[T.south-jordan]', 'C(city)[T.south-salt-lake]', 'C(city)[T.spanish-fork]', 'C(city)[T.west-jordan]', 'beds', 'baths', 'sqft', 'year_built', 'lot_size', 'garage']

Variables used in LASSO refit:
['C(city)', 'baths', 'beds', 'garage', 'lot_size', 'sqft', 'year_built']

==============================
OLS REFIT USING LASSO VARIABLES
==============================
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.593
Model:                            OLS   Adj. R-squared:                  0.582
Method:                 Least Squares   F-statistic:                     53.23
Date:                Wed, 17 Dec 2025   Prob (F-statistic):          2.01e-165
Time:                        19:34:30   Log-Likelihood:                -15091.
No. Observations:                 978   AIC:                         3.024e+04
Df Residuals:                     951   BIC:                         3.037e+04
Df Model:                          26                                         
Covariance Type:            nonrobust                                         
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept                      8.897e+06    3.4e+06      2.616      0.009    2.22e+06    1.56e+07
C(city)[T.american-fork]       6.363e+05   2.94e+05      2.168      0.030    6.02e+04    1.21e+06
C(city)[T.cottonwood-heights]    6.8e+05   2.88e+05      2.360      0.018    1.15e+05    1.25e+06
C(city)[T.draper]              2.199e+05   2.85e+05      0.773      0.440   -3.39e+05    7.78e+05
C(city)[T.eagle-mountain]      3.095e+05   2.94e+05      1.052      0.293   -2.68e+05    8.87e+05
C(city)[T.highland]            3.528e+05   2.86e+05      1.235      0.217   -2.08e+05    9.13e+05
C(city)[T.holladay]            9.416e+05   2.83e+05      3.331      0.001    3.87e+05     1.5e+06
C(city)[T.lehi]                4.983e+05   2.96e+05      1.682      0.093    -8.3e+04    1.08e+06
C(city)[T.lindon]              7.268e+04    3.2e+05      0.227      0.820   -5.56e+05    7.01e+05
C(city)[T.midvale]             8.057e+05      3e+05      2.683      0.007    2.16e+05     1.4e+06
C(city)[T.millcreek]           7.996e+05   2.95e+05      2.713      0.007    2.21e+05    1.38e+06
C(city)[T.murray]              6.676e+05   2.96e+05      2.257      0.024    8.71e+04    1.25e+06
C(city)[T.orem]                6.939e+05   2.93e+05      2.369      0.018    1.19e+05    1.27e+06
C(city)[T.provo]               1.093e+06   2.94e+05      3.714      0.000    5.15e+05    1.67e+06
C(city)[T.salt-lake-city]       7.65e+05   3.04e+05      2.516      0.012    1.68e+05    1.36e+06
C(city)[T.sandy]               6.291e+05    2.9e+05      2.169      0.030    5.98e+04     1.2e+06
C(city)[T.saratoga-springs]    1.296e+05   2.96e+05      0.439      0.661    -4.5e+05     7.1e+05
C(city)[T.south-jordan]        4.107e+05   2.91e+05      1.409      0.159   -1.61e+05    9.83e+05
C(city)[T.south-salt-lake]     8.755e+05   4.16e+05      2.105      0.036    5.91e+04    1.69e+06
C(city)[T.spanish-fork]        5.297e+05    2.9e+05      1.825      0.068      -4e+04     1.1e+06
C(city)[T.west-jordan]         7.617e+05   2.95e+05      2.578      0.010    1.82e+05    1.34e+06
baths                           1.22e+05   5.61e+04      2.177      0.030     1.2e+04    2.32e+05
beds                          -3.344e+05      4e+04     -8.363      0.000   -4.13e+05   -2.56e+05
garage                        -1.696e+04   2.63e+04     -0.644      0.520   -6.86e+04    3.47e+04
lot_size                       5.208e+04   1.03e+04      5.065      0.000    3.19e+04    7.23e+04
sqft                            678.9545     33.569     20.226      0.000     613.076     744.833
year_built                    -4906.1679   1718.494     -2.855      0.004   -8278.646   -1533.689
==============================================================================
Omnibus:                     1708.334   Durbin-Watson:                   1.931
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          2336371.338
Skew:                          11.218   Prob(JB):                         0.00
Kurtosis:                     241.392   Cond. No.                     3.96e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.96e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

==============================
CROSS-VALIDATED PMSE COMPARISON
==============================
Best Subsets (AIC) PMSE : 2125253174429.225
LASSO (1SE) PMSE        : 2219847110171.425

Conclusion

The report compares two model-building strategies: exhaustive best-subsets selection using AIC and LASSO (with the 1-SE rule). Fitted model summaries are generated through statsmodels, and the models are evaluated using 5-fold cross-validated PMSE. The best-subsets AIC procedure selected a parsimonious model containing only five structural predictors (beds, baths, square footage, year built, and lot size), yielding an \(R^2\) of 0.577 and the lowest AIC (30231.2) among all candidate models. All included predictors are statistically significant at the \(\alpha=0.05\) level, with square footage and lot size positively associated with price, while the negative coefficient on beds likely reflects multicollinearity with square footage. This model emphasizes interpretability and achieves strong explanatory power with relatively few parameters.
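
One quick check of the multicollinearity explanation is to compute variance inflation factors for the AIC-selected predictors. The sketch below is a diagnostic aside, not part of the original pipeline; it reuses df from the analysis.

from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

y_v, X_v = dmatrices("price ~ beds + baths + sqft + year_built + lot_size",
                     data=df, return_type='dataframe')
# Report VIF for each predictor column, skipping the intercept
for i, name in enumerate(X_v.columns):
    if name != "Intercept":
        print(f"{name:12s} VIF = {variance_inflation_factor(X_v.values, i):.2f}")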

In contrast, the LASSO model using the one–standard‐error rule selected a much richer specification that included the same core structural variables, added garage size, and incorporated city as a categorical factor, resulting in 26 fitted coefficients in the refit OLS model. This model achieved a slightly higher in‐sample fit (\(R^2 = 0.593\)), suggesting that location effects explain additional variation in housing prices beyond physical characteristics. However, many individual city coefficients are not statistically significant, indicating that while location matters collectively, its contribution is diffuse across levels rather than driven by a small number of dominant cities.
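Although many city coefficients are individually insignificant, their joint contribution can be tested directly. The sketch below (a diagnostic aside reusing lasso_refit from the analysis) runs a joint Wald F-test that all city coefficients are zero, building the restriction matrix explicitly so the bracketed dummy names need not be parsed.

import numpy as np

# Each row of R selects one city dummy; the test is H0: all city effects = 0
param_names = lasso_refit.params.index
city_terms = [name for name in param_names if name.startswith("C(city)")]
R = np.zeros((len(city_terms), len(param_names)))
for row, term in enumerate(city_terms):
    R[row, param_names.get_loc(term)] = 1.0

print(lasso_refit.f_test(R))  # joint F-test for the city factor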

When comparing predictive performance via cross-validation, the best-subsets AIC model slightly outperformed the LASSO-selected model, with a lower PMSE (\(2.13 \times 10^{12}\) vs. \(2.22 \times 10^{12}\)). This suggests that the additional complexity introduced by the LASSO model does not translate into improved out-of-sample prediction. Overall, the results highlight a classic bias–variance tradeoff: the LASSO model captures more structure through location effects but incurs higher variance, while the simpler AIC-selected model provides comparable, and even slightly superior, predictive accuracy with greater interpretability.
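
Because both PMSE estimates were computed on the same KFold splits, the fold-wise errors are paired; a rough check of whether the gap exceeds fold-to-fold noise (a sketch reusing mses_aic and mses_lasso from the analysis) is:

import numpy as np

diffs = np.array(mses_lasso) - np.array(mses_aic)  # positive favors the AIC model
print(f"Mean fold-wise PMSE difference: {diffs.mean():.3e}")
print(f"Std. error of the mean        : {diffs.std(ddof=1) / np.sqrt(len(diffs)):.3e}")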

There are several limitations to consider. The scraped dataset may suffer from selection bias, as online listings do not represent all transactions, and measurement errors could exist in features such as square footage or number of bedrooms. Additionally, some predictors are coarse (e.g. city) and may obscure spatial heterogeneity. Residual heteroscedasticity and nonlinearity are also possible, suggesting that transforming the price variable (e.g., using \(\log(\texttt{price})\)) or applying robust, heteroscedasticity-consistent standard errors may be appropriate.
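
A minimal sketch of the two remedies mentioned above, reusing df and smf from the analysis: refit the AIC-selected model on \(\log(\texttt{price})\), and separately report HC3 heteroscedasticity-consistent standard errors for the original fit.

import numpy as np
import statsmodels.formula.api as smf

# Log-transformed response to tame the heavy right skew in price
log_fit = smf.ols("np.log(price) ~ beds + baths + sqft + year_built + lot_size",
                  data=df).fit()
# Same model, robust (HC3) standard errors instead of a transformation
robust_fit = smf.ols("price ~ beds + baths + sqft + year_built + lot_size",
                     data=df).fit(cov_type="HC3")
print(log_fit.summary().tables[1])
print(robust_fit.summary().tables[1])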

Future work could involve incorporating additional predictors, such as proximity to amenities, or exploring spatial models that explicitly account for location dependence. Alternative loss functions like MAE could be evaluated, and tree-based ensemble methods such as random forests or gradient boosting could be applied, with comparisons of their calibrated uncertainty against linear models. Finally, the best-performing model could potentially be packaged and deployed via an endpoint or dashboard, enabling users to query price predictions based on property attributes.
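
As one illustration of the tree-based and MAE directions, the sketch below compares cross-validated MAE for a random forest against the AIC-selected linear model on the same folds, reusing X_aic, y_aic, and kf from the analysis; the hyperparameters are illustrative defaults, not tuned values.

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rf = RandomForestRegressor(n_estimators=300, random_state=1)
mae_rf = -cross_val_score(rf, X_aic, y_aic, cv=kf,
                          scoring="neg_mean_absolute_error").mean()
mae_lin = -cross_val_score(LinearRegression(fit_intercept=False), X_aic, y_aic,
                           cv=kf, scoring="neg_mean_absolute_error").mean()
print(f"Random forest CV MAE : {mae_rf:,.0f}")
print(f"Linear (AIC) CV MAE  : {mae_lin:,.0f}")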

From a practical standpoint, the results suggest that a sizeable portion of the information needed to predict home prices comes from a small set of basic property features, especially square footage, lot size, and overall layout. We observed that adding many location indicators increases model complexity without meaningfully improving predictive accuracy. For real-world use, such as quick price estimates, reporting, or decision support, the simpler AIC-selected model is likely the better choice: it is easier to explain, easier to maintain, and performs just as well (or slightly better) on new data. More complex models may still be useful for deeper analysis of neighborhood effects.