41.2 Feature Engineering

Transforms grounded in business logic often add more than a fancier model does. The goal is interpretability and a better signal-to-noise ratio.

fe_tx <- tx %>%
  # Rolling features must be computed within each city, in date order
  group_by(city) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    # Monotone month counter for time-aware models
    # (as.integer() on a yearmon would truncate to the year)
    t_index = 12L * as.integer(year) + as.integer(month) - 1L,
    # Price momentum: 3-month rolling mean of the median price (by city) as a simple trend proxy
    median_roll3 = slider::slide_dbl(median, mean, .before = 2, .complete = TRUE),
    # Volume-per-listing: marketing pipeline proxy (dollars per active listing)
    vol_per_listing = ifelse(listings > 0, volume / listings, NA_real_),
    # Seasonality flags
    is_peak_summer = lubridate::month(date) %in% 6:8,
    is_december = lubridate::month(date) == 12
  ) %>%
  ungroup()
head(fe_tx)
#> # A tibble: 6 × 18
#>   city    year month sales volume median listings inventory date       avg_price
#>   <fct>  <int> <int> <dbl>  <dbl>  <dbl>    <dbl>     <dbl> <date>         <dbl>
#> 1 Abile…  2000     1    72 5.38e6  71400      701       6.3 2000-01-01    74722.
#> 2 Abile…  2000     2    98 6.51e6  58700      746       6.6 2000-02-01    66378.
#> 3 Abile…  2000     3   130 9.28e6  58100      784       6.8 2000-03-01    71423.
#> 4 Abile…  2000     4    98 9.73e6  68600      785       6.9 2000-04-01    99286.
#> 5 Abile…  2000     5   141 1.06e7  67300      794       6.8 2000-05-01    75106.
#> 6 Abile…  2000     6   156 1.39e7  66900      780       6.6 2000-06-01    89167.
#> # ℹ 8 more variables: absorption <dbl>, quarter <fct>, ym <fct>, t_index <int>,
#> #   median_roll3 <dbl>, vol_per_listing <dbl>, is_peak_summer <lgl>,
#> #   is_december <lgl>
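
To see why the rolling feature should be computed within each city, here is a minimal sketch on made-up data. The `toy` tibble and its values are hypothetical; only the dplyr, tibble, and slider packages are assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data: two cities, four months each
toy <- tibble::tibble(
  city   = rep(c("A", "B"), each = 4),
  date   = rep(seq(as.Date("2000-01-01"), by = "month", length.out = 4), times = 2),
  median = c(100, 110, 120, 130, 200, 190, 180, 170)
)

toy_roll <- toy %>%
  group_by(city) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    # Trailing 3-month mean; windows with fewer than 3 months stay NA
    median_roll3 = slider::slide_dbl(median, mean, .before = 2, .complete = TRUE)
  ) %>%
  ungroup()

toy_roll
```

Because of `group_by(city)`, the first two rows of each city are `NA` rather than being averaged with the last months of the previous city.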

One-hot encode high-utility discrete features, keeping only the top cities to avoid huge design matrices.

Note: dummify() creates binary indicators and drops the original variable by default.

To avoid exploding dimensions, we lump rare cities into a single "Other" level first.

fe_tx_small <- fe_tx %>%
    mutate(city_lumped = forcats::fct_lump_n(city, n = 10, other_level = "Other"))
tx_dummy <-
    dummify(fe_tx_small %>% select(-city), select = "city_lumped")
names(tx_dummy)[1:15]
#>  [1] "year"            "month"           "sales"           "volume"         
#>  [5] "median"          "listings"        "inventory"       "avg_price"      
#>  [9] "absorption"      "t_index"         "median_roll3"    "vol_per_listing"
#> [13] "date"            "quarter"         "ym"

# Separate continuous and discrete columns for targeted transformations/modeling
spl <- split_columns(tx)
names(spl)
#> [1] "discrete"        "continuous"      "num_discrete"    "num_continuous" 
#> [5] "num_all_missing"
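
One caveat about the lumping step above: as far as we can tell, `fct_lump_n()` ranks levels by row count and keeps all tied levels, so when every city has about the same number of rows it may lump nothing. A minimal sketch on a hypothetical toy factor, using the `w` argument to rank levels by a weight (here a made-up `sales` vector) instead:

```r
library(forcats)

# Hypothetical toy factor: four levels with identical row counts
f     <- factor(rep(c("a", "b", "c", "d"), each = 3))
sales <- c(10, 10, 10, 5, 5, 5, 1, 1, 1, 2, 2, 2)

# Ranking by count: all levels tie, so none is lumped
by_count <- fct_lump_n(f, n = 2, other_level = "Other")

# Ranking by total weight per level: only the two largest levels survive
by_sales <- fct_lump_n(f, n = 2, w = sales, other_level = "Other")

levels(by_count)
levels(by_sales)
```

It is worth checking `nlevels()` after lumping to confirm the design matrix will have the width you expect.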

Rules of thumb:

  • Log-transform right-skewed monetary variables (e.g., \(\log(\text{volume}+1)\)) before linear modeling.
  • Group rare categories and avoid too many dummies relative to the number of observations \(n\).
  • Create business KPIs: avg_price, absorption, vol_per_listing, and simple momentum features.
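
The first rule of thumb can be sketched in base R. The `volume` values below are made up for illustration, and `skewness()` is a small helper defined here (a moment-based estimate), not a function from any package used in this chapter.

```r
# Hypothetical right-skewed dollar volumes
volume <- c(1e5, 2e5, 3e5, 5e5, 1e6, 5e6, 5e7)

# Moment-based sample skewness (helper defined for this sketch only)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# log1p(x) computes log(x + 1) and is safe when x is 0
log_volume <- log1p(volume)

skew_raw <- skewness(volume)
skew_log <- skewness(log_volume)
c(raw = skew_raw, log = skew_log)
```

On data like this the log-transformed variable is markedly less skewed, which tends to make linear-model residuals better behaved.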