This tutorial demonstrates how to use the utah_housing_stat386 package to scrape, clean, and analyze Utah housing data from UtahRealEstate.com. The package provides tools to collect data from cities in Utah County and Salt Lake County.
Installation
The first things that you will need to do is install the utah_housing_stat386 package.
pip install utah-housing-stat386
After installing, you’ll also need to install playwright browers to run the scraping functions.
Reading in the data
Using the package’s scraping functions get_data and get_cleaned_data are extremely memory intensive. Unless, dynamically updated data is required, it is highly recommended to use the functions data_no_scrape and cleaned_static_data, which are used to read in data that has been previously scraped (a process that took several hours). Details for using get_data and get_cleaned_data may be found on the Documentation page.
# Import librariesimport pandas as pdimport numpy as npfrom utah_housing_stat386.cleaning import data_no_scape, cleaned_static_data# Read in the raw, previously scraped datadf_raw = data_no_scape()df_raw.head(10)
mls
price
address
beds
baths
sqft
year_built
lot_size
garage
agent
city
0
2124118
$3,495,973
769 W Ranch Cir, Alpine, UT 84004
8
9
14472
2008
1.02 Ac
2124118.0
Contact Agent Ned Chidester 801-420-7653 Co-Ag...
alpine
1
2124041
$3,850,000
19 E Elk Ct, Alpine, UT 84004
7
7
8036
2020
0.61 Ac
2124041.0
Contact Agent Tayte Lackey 208-427-4460 Your N...
alpine
2
2122882
$719,000
475 Grove Dr, Alpine, UT 84004
5
2
2150
1969
0.50 Ac
2122882.0
Contact Agent Caleb Eastman 801-419-4773 Your ...
alpine
3
2121304
$1,500,000
48 N 100 E, Alpine, UT 84004
4
4
3808
1900
0.93 Ac
2121304.0
Contact Agent Will S Jones 801-376-7239 Your N...
alpine
4
2119151
$1,300,000
177 W Canyon Crest Rd, Alpine, UT 84004
5
3
3818
2002
0.47 Ac
2119151.0
Contact Agent Cristie Berg 801-372-2751 Co-Age...
alpine
5
2118246
$2,245,000
259 S Twin River Loop, Alpine, UT 84004
7
6
9119
2003
0.56 Ac
2.0
Contact Agent Kerry Oman 801-369-2507 Co-Agent...
alpine
6
2116152
$1,350,000
1063 E Alpine Dr, Alpine, UT 84004
6
4
5102
1983
0.46 Ac
2116152.0
Contact Agent Kara Ragsdale 801-870-6073 Your ...
alpine
7
2116014
$1,499,000
667 E 770 N, Alpine, UT 84004
9
5
6262
2022
0.55 Ac
2.0
Contact Agent Curry Jones 801-830-8199 Your Na...
alpine
8
2115239
$1,235,000
19 N Lone Peak Dr, Alpine, UT 84004
5
5
5587
1993
0.90 Ac
1.0
Contact Agent Julie B. Pierce 801-830-9292 Co-...
alpine
9
2115034
$2,400,000
1801 N Fort Canyon Rd, Alpine, UT 84004
6
3
4491
1980
1.74 Ac
2115034.0
Contact Agent Brett R Sellick 801-502-9955 You...
alpine
# Read in the clean, previously scraped datadf_clean = cleaned_static_data()df_clean.head(10)
mls
price
address
beds
baths
sqft
year_built
lot_size
garage
city
0
2124118
3495973.0
769 W Ranch Cir, Alpine, UT 84004
8.0
9.0
14472.0
2008
1.02
0
alpine
1
2124041
3850000.0
19 E Elk Ct, Alpine, UT 84004
7.0
7.0
8036.0
2020
0.61
0
alpine
2
2122882
719000.0
475 Grove Dr, Alpine, UT 84004
5.0
2.0
2150.0
1969
0.50
0
alpine
3
2121304
1500000.0
48 N 100 E, Alpine, UT 84004
4.0
4.0
3808.0
1900
0.93
0
alpine
4
2119151
1300000.0
177 W Canyon Crest Rd, Alpine, UT 84004
5.0
3.0
3818.0
2002
0.47
0
alpine
5
2118246
2245000.0
259 S Twin River Loop, Alpine, UT 84004
7.0
6.0
9119.0
2003
0.56
2
alpine
6
2116152
1350000.0
1063 E Alpine Dr, Alpine, UT 84004
6.0
4.0
5102.0
1983
0.46
0
alpine
7
2116014
1499000.0
667 E 770 N, Alpine, UT 84004
9.0
5.0
6262.0
2022
0.55
2
alpine
8
2115239
1235000.0
19 N Lone Peak Dr, Alpine, UT 84004
5.0
5.0
5587.0
1993
0.90
1
alpine
9
2115034
2400000.0
1801 N Fort Canyon Rd, Alpine, UT 84004
6.0
3.0
4491.0
1980
1.74
0
alpine
Cleaning the data
If the data was scraped, the following cleaning functions may be used to clean it. For time and simplicity, the following example usage of these functions will be used with raw, previously scraped (i.e. static) data, df_raw.
Cleaning the data step-by-step, function-by-function:
The data is now ready for EDA and analysis (see the Technical Report page).
Source Code
---title: "Accessing the Data"jupyter: python3---This tutorial demonstrates how to use the utah_housing_stat386 package to scrape, clean,and analyze Utah housing data from UtahRealEstate.com.The package provides tools to collect data from cities in Utah County and Salt Lake County.## Installation ##*The first things that you will need to do is install the utah_housing_stat386 package.*```bashpip install utah-housing-stat386```After installing, you'll also need to install [playwright](https://playwright.dev/python/docs/intro) browersto run the scraping functions.## Reading in the data ##Using the package's scraping functions `get_data` and `get_cleaned_data` are extremely memory intensive. Unless, dynamically updated data is required, it is highly recommended to use the functions `data_no_scrape` and `cleaned_static_data`, which are used to read in data that has been previously scraped (a process that took several hours). Details for using `get_data` and `get_cleaned_data` may be found on the `Documentation` page.```{python}# Import librariesimport pandas as pdimport numpy as npfrom utah_housing_stat386.cleaning import data_no_scape, cleaned_static_data# Read in the raw, previously scraped datadf_raw = data_no_scape()df_raw.head(10)``````{python}# Read in the clean, previously scraped datadf_clean = cleaned_static_data()df_clean.head(10)```## Cleaning the data ##If the data was scraped, the following cleaning functions may be used to clean it. For time and simplicity, the following example usage of these functions will be used with raw, previously scraped (i.e. static) data, `df_raw`.Cleaning the data step-by-step, function-by-function:```{python}from utah_housing_stat386.cleaning import clean_price, clean_numeric_field, clean_year_built, clean_lot_size, clean_garage, clean_address, clean_city# Apply cleaning step-by-stepdf_clean1 = df_raw.copy()df_clean1.drop(columns=['agent'])df_clean1['price'] = df_clean1['price'].apply(clean_price)df_clean1['beds'] = df_clean1['beds'].apply(clean_numeric_field)df_clean1['baths'] = df_clean1['baths'].apply(clean_numeric_field)df_clean1['sqft'] = df_clean1['sqft'].apply(clean_numeric_field)df_clean1['year_built'] = df_clean1['year_built'].apply(clean_year_built)df_clean1['lot_size'] = df_clean1['lot_size'].apply(clean_lot_size)df_clean1['garage'] = df_clean1['garage'].apply(clean_garage)df_clean1['address'] = df_clean1['address'].apply(clean_address)df_clean1['city'] = df_clean1['city'].apply(clean_city)df_clean1.head(10)```Cleaning the data with `clean_housing_data`, which applies all cleaning functions to the DataFrame:```{python}from utah_housing_stat386.cleaning import clean_housing_data, remove_duplicates, remove_invalid_entries# Apply cleaning in consolidated stepsdf_clean2 = clean_housing_data(df_raw)df_clean2 = remove_duplicates(df_clean2)df_clean2 = remove_invalid_entries(df_clean2)df_clean2.head(10)```Perform some last cleaning steps:```{python}import redf = df_clean2.copy()# Extract zipcodesaddress_list = df['address'].tolist()pattern =r"UT \d{5}"zipcode_list = []for address in address_list: zipcode_list.append(re.search(pattern, address).group()[3:])df['zipcode'] = zipcode_listdf = df.drop(columns=['address', 'mls'])df```The data is now ready for EDA and analysis (see the `Technical Report` page).