41.2 Feature Engineering
Business-minded transforms often beat fancy models. The goal is interpretability and better signal-to-noise.
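Two of the transforms used in this section are easy to get subtly wrong: a month-level time index, and a rolling statistic that should restart for each city. The sketch below checks both on a toy data frame. The data, the city labels, and the column values are invented for illustration, and the rolling mean is written with dplyr::lag() only to keep the example light on dependencies. Note that a year-plus-fraction value truncated with as.integer() would collapse to the calendar year, which is why the index here is built from year and month explicitly.

```r
library(dplyr)

# Toy monthly panel: two hypothetical cities, four months each
toy <- tibble(
  city   = rep(c("A", "B"), each = 4),
  date   = rep(seq(as.Date("2000-11-01"), by = "month", length.out = 4), times = 2),
  median = c(100, 110, 120, 130, 200, 190, 180, 170)
)

toy_fe <- toy %>%
  group_by(city) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    # Month-level index: increases by exactly 1 per month, also across years
    t_index = 12L * as.integer(format(date, "%Y")) +
      as.integer(format(date, "%m")) - 1L,
    # Trailing 3-month mean; the first two months of each city are NA
    median_roll3 = (median + lag(median, 1) + lag(median, 2)) / 3
  ) %>%
  ungroup()

toy_fe
```

Because the window is computed within each city, the first two rows of median_roll3 are NA for both A and B, so prices from one city never leak into the next city's average.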
fe_tx <- tx %>%
  group_by(city) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    # Month-level time index (12 * year + month - 1) for time-aware models;
    # as.integer() on a yearmon alone would truncate to the calendar year
    t_index = as.integer(round(12 * as.numeric(zoo::as.yearmon(date)))),
    # Price momentum: trailing 3-month mean of the median price, within each city
    median_roll3 = slider::slide_dbl(median, mean, .before = 2, .complete = TRUE),
    # Volume-per-listing: marketing pipeline proxy (dollars per active listing)
    vol_per_listing = ifelse(listings > 0, volume / listings, NA_real_),
    # Seasonality flags
    is_peak_summer = lubridate::month(date) %in% 6:8,
    is_december = lubridate::month(date) == 12
  ) %>%
  ungroup()
head(fe_tx)
#> # A tibble: 6 × 18
#> city year month sales volume median listings inventory date avg_price
#> <fct> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <dbl>
#> 1 Abile… 2000 1 72 5.38e6 71400 701 6.3 2000-01-01 74722.
#> 2 Abile… 2000 2 98 6.51e6 58700 746 6.6 2000-02-01 66378.
#> 3 Abile… 2000 3 130 9.28e6 58100 784 6.8 2000-03-01 71423.
#> 4 Abile… 2000 4 98 9.73e6 68600 785 6.9 2000-04-01 99286.
#> 5 Abile… 2000 5 141 1.06e7 67300 794 6.8 2000-05-01 75106.
#> 6 Abile… 2000 6 156 1.39e7 66900 780 6.6 2000-06-01 89167.
#> # ℹ 8 more variables: absorption <dbl>, quarter <fct>, ym <fct>, t_index <int>,
#> # median_roll3 <dbl>, vol_per_listing <dbl>, is_peak_summer <lgl>,
#> # is_december <lgl>
One-hot encode high-utility discrete features (keep top cities to avoid huge design matrices).
Note: dummify() will create binary indicators and drop the original variable by default.
To avoid exploding dimensions, we lump rare cities first.
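As a minimal sketch of why lumping helps, the toy example below counts indicator columns before and after lumping. The factor levels are invented, and model.matrix() stands in for dummify() only to keep the example to base R plus forcats. It also illustrates a caveat: fct_lump_n() ranks levels by frequency, so when every level has the same count the ties are all kept and nothing is lumped, which can matter for a balanced panel.

```r
library(forcats)

# Hypothetical city labels with unequal frequencies: A = 5, B = 4, C = 3, D = 1, E = 2
city <- factor(c(rep("A", 5), rep("B", 4), rep("C", 3), "D", "E", "E"))

# Keep the 3 most frequent levels and pool the rest into "Other"
city_lumped <- fct_lump_n(city, n = 3, other_level = "Other")

# One indicator column per level (no intercept, so no reference level is dropped)
ncol(model.matrix(~ city - 1))         # 5 columns before lumping
ncol(model.matrix(~ city_lumped - 1))  # 4 columns after lumping

# Caveat: with perfectly balanced counts every level ties for rank 1,
# so fct_lump_n() keeps all of them
balanced <- factor(rep(c("A", "B", "C", "D"), each = 3))
nlevels(fct_lump_n(balanced, n = 2))   # still 4 levels
```

If ties are a concern, one option is to pass weights through the w argument of fct_lump_n() (for example, total volume) so that levels are ranked by something other than row count.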
fe_tx_small <- fe_tx %>%
mutate(city_lumped = forcats::fct_lump_n(city, n = 10, other_level = "Other"))
tx_dummy <-
dummify(fe_tx_small %>% select(-city), select = "city_lumped")
names(tx_dummy)[1:15]
#> [1] "year" "month" "sales" "volume"
#> [5] "median" "listings" "inventory" "avg_price"
#> [9] "absorption" "t_index" "median_roll3" "vol_per_listing"
#> [13] "date" "quarter" "ym"
# Separate continuous and discrete columns for targeted transformations/modeling
spl <- split_columns(tx)
names(spl)
#> [1] "discrete" "continuous" "num_discrete" "num_continuous"
#> [5] "num_all_missing"
Rules of thumb:
- Log-transform skewed monetary variables (e.g., \(\log(\text{volume}+1)\)) before linear modeling.
- Group rare categories and avoid too many dummies relative to \(n\).
- Create business KPIs: avg_price, absorption, vol_per_listing, and simple momentum features.
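The first rule of thumb can be checked numerically. The sketch below uses made-up volume figures and a small hand-written skewness helper (not taken from any package) to compare a right-skewed variable before and after log1p(), which computes \(\log(x+1)\).

```r
# Hypothetical dollar volumes that double at each step: strongly right-skewed
volume <- c(1e5, 2e5, 4e5, 8e5, 1.6e6, 3.2e6, 6.4e6)

# Simple moment-based skewness (illustrative helper, not a package function)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

log_volume <- log1p(volume)  # log(volume + 1), defined even when volume is 0

skew_raw <- skewness(volume)
skew_log <- skewness(log_volume)
c(raw = skew_raw, log = skew_log)

# expm1() inverts log1p(), so fitted values can be mapped back to dollars
all.equal(expm1(log_volume), volume)
```

On this toy series the raw skewness is clearly positive while the logged series is close to symmetric. Real sales data will rarely be this clean, so it is worth inspecting the transformed distribution rather than assuming the log fixes everything.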