Accessing the Data

This tutorial demonstrates how to use the utah_housing_stat386 package to scrape, clean, and analyze Utah housing data from UtahRealEstate.com. The package provides tools to collect data from cities in Utah County and Salt Lake County.

Installation

The first things that you will need to do is install the utah_housing_stat386 package.

pip install utah-housing-stat386

After installing, you’ll also need to install playwright browers to run the scraping functions.

Reading in the data

Using the package’s scraping functions get_data and get_cleaned_data are extremely memory intensive. Unless, dynamically updated data is required, it is highly recommended to use the functions data_no_scrape and cleaned_static_data, which are used to read in data that has been previously scraped (a process that took several hours). Details for using get_data and get_cleaned_data may be found on the Documentation page.

# Import libraries
import pandas as pd
import numpy as np
from utah_housing_stat386.cleaning import data_no_scape, cleaned_static_data

# Read in the raw, previously scraped data
df_raw = data_no_scape()
df_raw.head(10)
mls price address beds baths sqft year_built lot_size garage agent city
0 2124118 $3,495,973 769 W Ranch Cir, Alpine, UT 84004 8 9 14472 2008 1.02 Ac 2124118.0 Contact Agent Ned Chidester 801-420-7653 Co-Ag... alpine
1 2124041 $3,850,000 19 E Elk Ct, Alpine, UT 84004 7 7 8036 2020 0.61 Ac 2124041.0 Contact Agent Tayte Lackey 208-427-4460 Your N... alpine
2 2122882 $719,000 475 Grove Dr, Alpine, UT 84004 5 2 2150 1969 0.50 Ac 2122882.0 Contact Agent Caleb Eastman 801-419-4773 Your ... alpine
3 2121304 $1,500,000 48 N 100 E, Alpine, UT 84004 4 4 3808 1900 0.93 Ac 2121304.0 Contact Agent Will S Jones 801-376-7239 Your N... alpine
4 2119151 $1,300,000 177 W Canyon Crest Rd, Alpine, UT 84004 5 3 3818 2002 0.47 Ac 2119151.0 Contact Agent Cristie Berg 801-372-2751 Co-Age... alpine
5 2118246 $2,245,000 259 S Twin River Loop, Alpine, UT 84004 7 6 9119 2003 0.56 Ac 2.0 Contact Agent Kerry Oman 801-369-2507 Co-Agent... alpine
6 2116152 $1,350,000 1063 E Alpine Dr, Alpine, UT 84004 6 4 5102 1983 0.46 Ac 2116152.0 Contact Agent Kara Ragsdale 801-870-6073 Your ... alpine
7 2116014 $1,499,000 667 E 770 N, Alpine, UT 84004 9 5 6262 2022 0.55 Ac 2.0 Contact Agent Curry Jones 801-830-8199 Your Na... alpine
8 2115239 $1,235,000 19 N Lone Peak Dr, Alpine, UT 84004 5 5 5587 1993 0.90 Ac 1.0 Contact Agent Julie B. Pierce 801-830-9292 Co-... alpine
9 2115034 $2,400,000 1801 N Fort Canyon Rd, Alpine, UT 84004 6 3 4491 1980 1.74 Ac 2115034.0 Contact Agent Brett R Sellick 801-502-9955 You... alpine
# Read in the clean, previously scraped data
df_clean = cleaned_static_data()
df_clean.head(10)
mls price address beds baths sqft year_built lot_size garage city
0 2124118 3495973.0 769 W Ranch Cir, Alpine, UT 84004 8.0 9.0 14472.0 2008 1.02 0 alpine
1 2124041 3850000.0 19 E Elk Ct, Alpine, UT 84004 7.0 7.0 8036.0 2020 0.61 0 alpine
2 2122882 719000.0 475 Grove Dr, Alpine, UT 84004 5.0 2.0 2150.0 1969 0.50 0 alpine
3 2121304 1500000.0 48 N 100 E, Alpine, UT 84004 4.0 4.0 3808.0 1900 0.93 0 alpine
4 2119151 1300000.0 177 W Canyon Crest Rd, Alpine, UT 84004 5.0 3.0 3818.0 2002 0.47 0 alpine
5 2118246 2245000.0 259 S Twin River Loop, Alpine, UT 84004 7.0 6.0 9119.0 2003 0.56 2 alpine
6 2116152 1350000.0 1063 E Alpine Dr, Alpine, UT 84004 6.0 4.0 5102.0 1983 0.46 0 alpine
7 2116014 1499000.0 667 E 770 N, Alpine, UT 84004 9.0 5.0 6262.0 2022 0.55 2 alpine
8 2115239 1235000.0 19 N Lone Peak Dr, Alpine, UT 84004 5.0 5.0 5587.0 1993 0.90 1 alpine
9 2115034 2400000.0 1801 N Fort Canyon Rd, Alpine, UT 84004 6.0 3.0 4491.0 1980 1.74 0 alpine

Cleaning the data

If the data was scraped, the following cleaning functions may be used to clean it. For time and simplicity, the following example usage of these functions will be used with raw, previously scraped (i.e. static) data, df_raw.

Cleaning the data step-by-step, function-by-function:

from utah_housing_stat386.cleaning import clean_price, clean_numeric_field, clean_year_built, clean_lot_size, clean_garage, clean_address, clean_city

# Apply cleaning step-by-step
df_clean1 = df_raw.copy()
df_clean1.drop(columns=['agent'])
df_clean1['price'] = df_clean1['price'].apply(clean_price)
df_clean1['beds'] = df_clean1['beds'].apply(clean_numeric_field)
df_clean1['baths'] = df_clean1['baths'].apply(clean_numeric_field)
df_clean1['sqft'] = df_clean1['sqft'].apply(clean_numeric_field)
df_clean1['year_built'] = df_clean1['year_built'].apply(clean_year_built)
df_clean1['lot_size'] = df_clean1['lot_size'].apply(clean_lot_size)
df_clean1['garage'] = df_clean1['garage'].apply(clean_garage)
df_clean1['address'] = df_clean1['address'].apply(clean_address)
df_clean1['city'] = df_clean1['city'].apply(clean_city)

df_clean1.head(10)
mls price address beds baths sqft year_built lot_size garage agent city
0 2124118 3495973.0 769 W Ranch Cir, Alpine, UT 84004 8.0 9.0 14472.0 2008 1.02 0 Contact Agent Ned Chidester 801-420-7653 Co-Ag... alpine
1 2124041 3850000.0 19 E Elk Ct, Alpine, UT 84004 7.0 7.0 8036.0 2020 0.61 0 Contact Agent Tayte Lackey 208-427-4460 Your N... alpine
2 2122882 719000.0 475 Grove Dr, Alpine, UT 84004 5.0 2.0 2150.0 1969 0.50 0 Contact Agent Caleb Eastman 801-419-4773 Your ... alpine
3 2121304 1500000.0 48 N 100 E, Alpine, UT 84004 4.0 4.0 3808.0 1900 0.93 0 Contact Agent Will S Jones 801-376-7239 Your N... alpine
4 2119151 1300000.0 177 W Canyon Crest Rd, Alpine, UT 84004 5.0 3.0 3818.0 2002 0.47 0 Contact Agent Cristie Berg 801-372-2751 Co-Age... alpine
5 2118246 2245000.0 259 S Twin River Loop, Alpine, UT 84004 7.0 6.0 9119.0 2003 0.56 2 Contact Agent Kerry Oman 801-369-2507 Co-Agent... alpine
6 2116152 1350000.0 1063 E Alpine Dr, Alpine, UT 84004 6.0 4.0 5102.0 1983 0.46 0 Contact Agent Kara Ragsdale 801-870-6073 Your ... alpine
7 2116014 1499000.0 667 E 770 N, Alpine, UT 84004 9.0 5.0 6262.0 2022 0.55 2 Contact Agent Curry Jones 801-830-8199 Your Na... alpine
8 2115239 1235000.0 19 N Lone Peak Dr, Alpine, UT 84004 5.0 5.0 5587.0 1993 0.90 1 Contact Agent Julie B. Pierce 801-830-9292 Co-... alpine
9 2115034 2400000.0 1801 N Fort Canyon Rd, Alpine, UT 84004 6.0 3.0 4491.0 1980 1.74 0 Contact Agent Brett R Sellick 801-502-9955 You... alpine

Cleaning the data with clean_housing_data, which applies all cleaning functions to the DataFrame:

from utah_housing_stat386.cleaning import clean_housing_data, remove_duplicates, remove_invalid_entries

# Apply cleaning in consolidated steps
df_clean2 = clean_housing_data(df_raw)
df_clean2 = remove_duplicates(df_clean2)
df_clean2 = remove_invalid_entries(df_clean2)

df_clean2.head(10)
mls price address beds baths sqft year_built lot_size garage city
0 2124118 3495973.0 769 W Ranch Cir, Alpine, UT 84004 8.0 9.0 14472.0 2008 1.02 0 alpine
1 2124041 3850000.0 19 E Elk Ct, Alpine, UT 84004 7.0 7.0 8036.0 2020 0.61 0 alpine
2 2122882 719000.0 475 Grove Dr, Alpine, UT 84004 5.0 2.0 2150.0 1969 0.50 0 alpine
3 2121304 1500000.0 48 N 100 E, Alpine, UT 84004 4.0 4.0 3808.0 1900 0.93 0 alpine
4 2119151 1300000.0 177 W Canyon Crest Rd, Alpine, UT 84004 5.0 3.0 3818.0 2002 0.47 0 alpine
5 2118246 2245000.0 259 S Twin River Loop, Alpine, UT 84004 7.0 6.0 9119.0 2003 0.56 2 alpine
6 2116152 1350000.0 1063 E Alpine Dr, Alpine, UT 84004 6.0 4.0 5102.0 1983 0.46 0 alpine
7 2116014 1499000.0 667 E 770 N, Alpine, UT 84004 9.0 5.0 6262.0 2022 0.55 2 alpine
8 2115239 1235000.0 19 N Lone Peak Dr, Alpine, UT 84004 5.0 5.0 5587.0 1993 0.90 1 alpine
9 2115034 2400000.0 1801 N Fort Canyon Rd, Alpine, UT 84004 6.0 3.0 4491.0 1980 1.74 0 alpine

Perform some last cleaning steps:

import re

df = df_clean2.copy()

# Extract zipcodes
address_list = df['address'].tolist()
pattern = r"UT \d{5}"
zipcode_list = []
for address in address_list:
    zipcode_list.append(re.search(pattern, address).group()[3:])
df['zipcode'] = zipcode_list
df = df.drop(columns=['address', 'mls'])

df
price beds baths sqft year_built lot_size garage city zipcode
0 3495973.0 8.0 9.0 14472.0 2008 1.02 0 alpine 84004
1 3850000.0 7.0 7.0 8036.0 2020 0.61 0 alpine 84004
2 719000.0 5.0 2.0 2150.0 1969 0.50 0 alpine 84004
3 1500000.0 4.0 4.0 3808.0 1900 0.93 0 alpine 84004
4 1300000.0 5.0 3.0 3818.0 2002 0.47 0 alpine 84004
... ... ... ... ... ... ... ... ... ...
509 520000.0 4.0 3.0 2005.0 1995 0.16 0 west-jordan 84088
510 598990.0 3.0 3.0 2433.0 2025 0.08 0 west-jordan 84081
511 1054810.0 6.0 4.0 4903.0 2025 0.14 2 west-jordan 84081
512 589900.0 3.0 3.0 2861.0 2022 0.11 0 west-jordan 84081
513 1049990.0 5.0 4.0 5020.0 2025 0.18 0 south-jordan 84009

979 rows × 9 columns

The data is now ready for EDA and analysis (see the Technical Report page).