A Data
Because gathering data is costly and difficult, the data in many studies come from a specific region, period, or group. Such partial data capture only pieces of information, so the results represent local knowledge rather than the characteristics of the population. This makes it hard to draw any generalized conclusion. A countrywide or worldwide random sample can largely address this issue. Although new data-collection techniques have been developed and more detailed data have been gathered in recent years, randomization remains the fundamental requirement for eliminating bias.
Identifying the effects of urban form on travel behaviors is harder than proving global warming. Scientists can get Antarctic ice core samples but cannot gather everyone’s daily travel records, in part because of personal privacy protections, and in part because of limited techniques for gathering these data. Surveys, traffic flow detection, and event data recorders (EDRs) can each capture only a small piece of the whole picture. Thus, to obtain results without loss of generality, wide-ranging random-sample data are a critical condition.
A.1 Travel Survey
The National Household Travel Survey (NHTS) (U.S. Department of Transportation, Federal Highway Administration 2009) is a nationwide survey of travel and transportation patterns in the United States. It includes the essential trip variables: number of trips, purpose, mode, vehicle miles traveled (VMT), etc.
The NHTS was conducted in 2001, 2009, and 2017 by the Federal Highway Administration (FHWA). Its predecessor, the Nationwide Personal Transportation Survey (NPTS), was conducted in 1969, 1977, 1983, 1990, and 1995.
The NHTS is a well-designed survey. Its deliberate sampling design makes the NHTS data representative of population-level travel characteristics. Although it is not as detailed as GPS data or as large as social media data, the advantage of the NHTS is that it satisfies the fundamental requirement of statistical inference: random sampling.
NHTS data include the population density of the census tract and the gasoline-equivalent gallons consumed per year. Some studies use NHTS data to model automotive CO\(_2\) emissions (Kim and Brownstone 2013; Perumal and Timmons 2017). However, this dataset does not contain enough built environment information and cannot support a full study of VMT against the 5Ds variables. Some studies add psychological factors such as role preferences, motivations, and expectations in choices to the models. Hong, Shen, and Zhang (2014) use eight attitudinal questions in the 2006 Household Activity Survey and raise the models’ \(R^2\) above 0.7.
The state is the valid geographic level at which to use NHTS data. Only the Add-on Partners,11 the states and MPOs with sufficiently large samples, can produce valid estimates at smaller levels of geography (e.g., cities, counties).
- Sampling methods (Opt.)
A.2 Census Data
The more recent NHTS (2009 and 2017) “weights its person data based on control totals found in the American Community Survey (ACS).” This provides an opportunity to join travel data with other demographic, employment, and built environment data.
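The idea of weighting to control totals can be illustrated with a simplified post-stratification sketch. The strata labels, control totals, and respondent records below are invented for illustration, and the function name is hypothetical; the actual NHTS weighting procedure is more elaborate and is not reproduced here.

```python
from collections import Counter


def poststratification_weights(sample_strata, control_totals):
    """Return one weight per respondent so that the weighted count in each
    stratum equals the external control total (e.g., an ACS estimate).

    This is a simplified illustration, not the NHTS weighting procedure.
    """
    counts = Counter(sample_strata)
    missing = set(counts) - set(control_totals)
    if missing:
        raise KeyError(f"no control total for strata: {sorted(missing)}")
    return [control_totals[s] / counts[s] for s in sample_strata]


# Hypothetical stratum label for each surveyed person.
sample_strata = ["age_16_34", "age_16_34", "age_35_64", "age_65_plus"]
# Hypothetical population control totals for the same strata.
control_totals = {"age_16_34": 300.0, "age_35_64": 500.0, "age_65_plus": 200.0}

weights = poststratification_weights(sample_strata, control_totals)

# Hypothetical daily person-trips reported by each respondent.
trips = [4, 2, 5, 1]
weighted_mean_trips = sum(w * t for w, t in zip(weights, trips)) / sum(weights)
```

Because the weights sum to the control totals within each stratum, weighted estimates from the survey are aligned with the same population base as ACS tabulations.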
NHTS data are reported for four units: trip, person, vehicle, and household, whereas the Smart Location Database (SLD) is compiled at the Census block group (CBG) level. The attributes can be linked at the block, tract, county, and city scales through geographic identifiers (Table 2.2).
In the nested hierarchy, each spatial entity can be assembled from its lower-level units. For example, Multnomah County, Oregon, is made up of 171 tracts; Tract 56 is made up of 4 block groups; and Block Group 2 is made up of 22 blocks. The Urban Center at PSU is located inside Block 2014. Other geographic entities can also be assembled from smaller units, such as Core Based Statistical Areas (CBSAs)12 and Urban Areas (including 486 Urbanized Areas and 3,087 Urban Clusters in 2010).13
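The nesting described above is reflected in the structure of Census geographic identifiers, where a 15-digit block GEOID concatenates state (2 digits), county (3), tract (6), and block (4), and the block group is the first digit of the block code. The following is a minimal sketch of that decomposition. The function and class names are hypothetical, and the example GEOID assumes FIPS codes 41 for Oregon and 051 for Multnomah County and tract code 005600 for Tract 56.

```python
from typing import NamedTuple


class BlockGeoid(NamedTuple):
    state: str        # 2-digit state FIPS code
    county: str       # 3-digit county FIPS code
    tract: str        # 6-digit tract code
    block_group: str  # 1 digit, the first digit of the block code
    block: str        # 4-digit block code


def parse_block_geoid(geoid: str) -> BlockGeoid:
    """Split a 15-digit Census block GEOID into its nested components."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError(f"expected a 15-digit block GEOID, got {geoid!r}")
    return BlockGeoid(
        state=geoid[:2],
        county=geoid[2:5],
        tract=geoid[5:11],
        block_group=geoid[11],
        block=geoid[11:15],
    )


def parent_geoids(geoid: str) -> dict:
    """Return the full GEOID of every level that contains the block."""
    parse_block_geoid(geoid)  # validate the input first
    return {
        "state": geoid[:2],
        "county": geoid[:5],
        "tract": geoid[:11],
        "block_group": geoid[:12],
        "block": geoid,
    }


# Assumed GEOID for Block 2014 in Tract 56, Multnomah County, Oregon.
example = parse_block_geoid("410510056002014")
parents = parent_geoids("410510056002014")
```

Truncating the identifier to 12, 11, or 5 digits yields the block group, tract, or county key, which is what makes attributes recorded at one level attachable to records at another.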
This allows city-to-city and metro-to-metro comparisons with a uniform data source (Figure A.1).
- At the tract scale and below, units have similar population sizes but varying area sizes.
A.3 Smart Location Database
The Smart Location Database (SLD) (Ramsey and Bell 2014) is an informative data source covering the entire US. The Environmental Protection Agency (EPA) developed the SLD to measure location efficiency and the built environment.
The 2009 NHTS and the SLD cover the same period and can be joined by geographic location. The SLD synthesizes many data sources from around 2010, including the 2010 Census, 2010 ACS, LEHD Origin-Destination Employment Statistics (LODES), InfoUSA, NAVTEQ, PAD-US, the TOD Database, and GTFS.
The SLD has more than 90 variables covering ‘density of development’, ‘diversity of land use’, ‘street network design’, ‘transit service’, and ‘accessibility to destinations’. The initial data dimension is 150,145 observations by 192 variables. Hence, the joined dataset contains both trip and built environment information for fitting VMT-density models.
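A join of this kind can be sketched as a keyed lookup on the block group identifier. The sketch below assumes that each household record carries a 12-digit block group GEOID; the field names (`cbg_geoid`, `household_id`, `annual_vmt`) and all values are hypothetical, and only the SLD variable name `D1B` comes from the source.

```python
def join_households_to_sld(households, sld_rows, key="cbg_geoid"):
    """Attach SLD block group attributes to each household record.

    Returns (joined, unmatched) so that households without a matching
    block group are reported rather than silently dropped.
    """
    sld_by_cbg = {row[key]: row for row in sld_rows}
    joined, unmatched = [], []
    for household in households:
        sld = sld_by_cbg.get(household[key])
        if sld is None:
            unmatched.append(household)
        else:
            # Household fields take precedence if a name appears in both.
            joined.append({**sld, **household})
    return joined, unmatched


# Hypothetical household records (illustrative values only).
households = [
    {"household_id": "H1", "cbg_geoid": "410510056002", "annual_vmt": 8200.0},
    {"household_id": "H2", "cbg_geoid": "410510056003", "annual_vmt": 11500.0},
    {"household_id": "H3", "cbg_geoid": "999999999999", "annual_vmt": 9000.0},
]
# Hypothetical SLD rows, one per block group (illustrative values only).
sld_rows = [
    {"cbg_geoid": "410510056002", "D1B": 25.4},
    {"cbg_geoid": "410510056003", "D1B": 18.9},
]

joined, unmatched = join_households_to_sld(households, sld_rows)
```

Keeping the unmatched households separate makes it easy to check how much of the sample is lost in the join before fitting any model.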
The advantage is that this dataset offers close to the finest resolution available for analyzing travel versus the built environment.
However, among the 57 Ds variables in the SLD, the variables inside each D group are highly correlated. An ongoing study chooses only one variable from each D group, as listed below.
- D1B: Gross population density (people/acre) on unprotected land
- D2A_WRKEMP: Household workers per job, by CBG
- D3a: Total road network density
- D4b050: Proportion of CBG employment within 1/2 mile of a fixed-guideway transit stop
- D5ar: Jobs within 45 minutes auto travel time, time-decay (network travel time) weighted

Source: Ramsey and Bell (2014)
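One way to check whether variables within a D group are redundant is to compute their pairwise Pearson correlations and flag the pairs above a chosen cutoff. The sketch below is illustrative only: the column values are invented, `other_density_1` and `other_density_2` are hypothetical stand-ins for other density variables, and the 0.8 threshold is an assumption rather than a value taken from the study.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def highly_correlated_pairs(columns, threshold=0.8):
    """List (name_a, name_b, r) for column pairs with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Invented values for three density-type columns over five block groups.
density_group = {
    "D1B": [2.1, 3.9, 6.2, 8.0, 9.8],
    "other_density_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "other_density_2": [5.0, 1.0, 4.0, 2.0, 3.0],
}

flagged = highly_correlated_pairs(density_group)
```

A flagged pair carries largely the same information, which is the motivation for keeping a single representative variable per D group.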
A.4 New Data Sources
A growing number of transportation studies use new data sources such as Call Detail Records (CDR), GPS, ICT, Point of Interest (POI), and social media data. Liu et al. (2017) use POI demand modeling to analyze human mobility patterns. Damiani et al. (2020) use a trajectory summarization technique to extract high-quality information about locations of interest. Combined with image recognition techniques, satellite imagery and Street View data provide unprecedented built environment information. The advantage of these new data sources is their high resolution and detailed information.
As long as the data contain identifying variables, combining traditional and new data sources can greatly enrich the information and has tremendous potential to answer VMT-urban form questions. For example, points of interest may represent the number of opportunities at a place or the utility of a place. Although some data have high resolution and abundant information, their geographic restrictions and self-report bias should not be neglected.
However, the enormous amount of emerging data is also a huge challenge. The analysis methods for extracting general information lag behind data collection. How to eliminate sample bias in new data sources is another unresolved problem.
References
11. There were 9 Add-on Partners in 2001, 20 in 2009, and 13 in 2017.
12. There are 384 metropolitan statistical areas and 543 micropolitan statistical areas as of 2020. The 2010 standards provide that each CBSA must contain at least one urban area of 10,000 or more population. Each metropolitan statistical area must have at least one urbanized area of 50,000 or more inhabitants. Each micropolitan statistical area must have at least one urban cluster of at least 10,000 but less than 50,000 population.
13. 2010 Census Urban and Rural Classification and Urban Area Criteria: Urbanized Areas (UAs) have 50,000 or more people; Urban Clusters (UCs) have at least 2,500 and fewer than 50,000 people (https://www.census.gov/programs-surveys/geography/guidance/geo-areas/urban-rural/2010-urban-rural.html).