The following formula can help decide the most suitable size for a sample:
\[\begin{equation} n_0 = \frac{Z^2p(1-p)}{e^2} \tag{1.1} \end{equation}\]
where \(n_0\) is the sample size, \(Z\) is the Z-score for the chosen confidence level (it can be read from a standard normal table), \(p\) is the estimated proportion of variability in the population and \(e\) is the margin of error, i.e. the half-width of the confidence interval.
Usually, for a large population with no prior knowledge of the proportion of variability, \(p = 0.5\) (the maximum variability) is used.
Suitable sample size for La Plata with a 5% margin of error at the 95% confidence level (\(Z = 1.96\)):
\[\begin{equation} n_0 = \frac{1.96^2 \times 0.5 \times (1-0.5)}{0.05^2} = 384.16 \tag{1.2} \end{equation}\]
Therefore the most suitable sample to collect for La Plata would be 384 building rooftops.
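The arithmetic in equation (1.2) can be reproduced with a few lines of Python. This is a minimal sketch; the function name `initial_sample_size` is hypothetical, and the Z-score is passed in directly as 1.96, as in the text.

```python
from statistics import NormalDist


def initial_sample_size(z: float, p: float, e: float) -> float:
    """Return n0 = Z^2 * p * (1 - p) / e^2 (equation 1.1)."""
    return z ** 2 * p * (1 - p) / e ** 2


# Values used in the text: Z = 1.96, p = 0.5, e = 0.05
n0 = initial_sample_size(z=1.96, p=0.5, e=0.05)
print(round(n0, 2))

# Where 1.96 comes from: the 0.975 quantile of the standard normal
# distribution, i.e. a two-sided 95% confidence level.
z_exact = NormalDist().inv_cdf(0.975)
print(round(z_exact, 2))
```

Setting `p = 0.5` maximizes `p * (1 - p)`, so it gives the largest, most conservative sample size for a given margin of error.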
In this analysis, two sets of random spatial samples were drawn from the city of La Plata: sample 1 with 100 points and sample 2 with 300 points.
The sample points were created in QGIS [1] with the Vector -> Research Tools -> Random Points Inside Polygons tool. The building rooftops were then digitized in QGIS over Google Satellite Hybrid [2], a Tile Map Service (TMS) layer.
For every sample, the rooftop area \((m^2)\), the mean global horizontal irradiation \((\frac{kWh}{m^2})\), the usable solar radiation \((kWh)\) and the renewable electricity production \((kWh)\) were calculated.
Rooftops with an area of 30 \(m^2\) or less are assigned an electricity production of 0. Some of the sample points fell on non built-up areas, roads, parks etc.; they were also given a 0, so that the calculation includes a density factor.
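The per-rooftop calculation can be sketched as follows. The 30 \(m^2\) threshold comes from the text. Two things are assumptions inferred from the printed rows rather than stated by the source: that the usable solar radiation is the area multiplied by the mean irradiation, and that the electricity production is the usable solar radiation multiplied by a factor of about 0.129. The function name is hypothetical.

```python
AREA_THRESHOLD_M2 = 30.0   # from the text: rooftops of 30 m^2 or less count as 0
CONVERSION_FACTOR = 0.129  # ASSUMPTION: inferred from elec_prod / usable_sr in the printed rows


def rooftop_production(area_m2: float, ghi_kwh_per_m2: float) -> dict:
    """Return usable solar radiation and electricity production for one polygon."""
    # ASSUMPTION: usable solar radiation = area * mean global horizontal irradiation
    usable_sr = area_m2 * ghi_kwh_per_m2
    if area_m2 <= AREA_THRESHOLD_M2:
        # Small polygons (or points on roads, parks etc.) produce nothing
        elec_prod = 0.0
    else:
        elec_prod = usable_sr * CONVERSION_FACTOR
    return {
        "usable_sr": usable_sr,               # kWh
        "elec_prod": elec_prod,               # kWh
        "elec_prod_mwh": elec_prod / 1000.0,  # MWh
    }


# First printed row of sample 1: area 91.0 m^2, irradiation about 1728 kWh/m^2
row = rooftop_production(91.0, 1728.0)
print(round(row["elec_prod"]), round(row["elec_prod_mwh"], 2))

# A tiny polygon placed on a non built-up sample point
empty = rooftop_production(0.189, 1729.0)
print(empty["elec_prod"])
```

Keeping the zero rows in the sample is what lets the sample mean reflect how densely the city is built up.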
## Simple feature collection with 6 features and 6 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -6480273 ymin: -4138361 xmax: -6455331 ymax: -4118445
## CRS: +proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs
## # A tibble: 6 x 7
## id area X_mean usable_sr elec_prod geometry
## <dbl> <dbl> <dbl> <dbl> <dbl> <POLYGON [m]>
## 1 3 91.0 1728. 157246. 20285. ((-6461504 -4132034, -6461500 -41320~
## 2 4 131. 1735. 227935. 29404. ((-6466451 -4118457, -6466438 -41184~
## 3 8 59.1 1736. 102639. 13240. ((-6465099 -4118684, -6465094 -41186~
## 4 1 0.189 1729. 327. 0 ((-6455332 -4132428, -6455331 -41324~
## 5 2 0.063 1725. 109. 0 ((-6480273 -4138360, -6480272 -41383~
## 6 5 0.077 1726. 133. 0 ((-6468889 -4135469, -6468889 -41354~
## # ... with 1 more variable: elec_prod_mwh <dbl>
## id area X_mean usable_sr
## Min. : 1.00 Min. : 0.002 Min. :1721 Min. : 3
## 1st Qu.: 25.75 1st Qu.: 0.021 1st Qu.:1724 1st Qu.: 36
## Median : 50.50 Median : 0.078 Median :1727 Median : 134
## Mean : 50.50 Mean : 243.218 Mean :1727 Mean : 419697
## 3rd Qu.: 75.25 3rd Qu.: 0.240 3rd Qu.:1729 3rd Qu.: 413
## Max. :100.00 Max. :10405.603 Max. :1740 Max. :17939051
## elec_prod geometry elec_prod_mwh
## Min. : 0 POLYGON :100 Min. : 0.00
## 1st Qu.: 0 epsg:NA : 0 1st Qu.: 0.00
## Median : 0 +proj=merc...: 0 Median : 0.00
## Mean : 54056 Mean : 54.06
## 3rd Qu.: 0 3rd Qu.: 0.00
## Max. :2314138 Max. :2314.14
## Simple feature collection with 6 features and 6 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -6459285 ymin: -4157358 xmax: -6437314 ymax: -4133646
## CRS: +proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs
## # A tibble: 6 x 7
## id area X_mean usable_sr elec_prod geometry
## <dbl> <dbl> <dbl> <dbl> <dbl> <POLYGON [m]>
## 1 1 90.4 1732. 156456. 20183. ((-6444962 -4133747, -6444957 -413374~
## 2 2 0.036 1722. 62.0 0 ((-6459285 -4157357, -6459285 -415735~
## 3 3 0.144 1730. 249. 0 ((-6444164 -4138267, -6444164 -413826~
## 4 4 0.065 1731. 112. 0 ((-6447967 -4135898, -6447966 -413589~
## 5 5 0.23 1732. 398. 0 ((-6445041 -4133647, -6445040 -413364~
## 6 6 0.212 1732. 367. 0 ((-6437315 -4138498, -6437314 -413849~
## # ... with 1 more variable: elec_prod_mwh <dbl>
## id area X_mean usable_sr
## Min. : 1.00 Min. : 0.007 Min. :1721 Min. : 12
## 1st Qu.: 75.75 1st Qu.: 0.057 1st Qu.:1725 1st Qu.: 98
## Median :150.50 Median : 0.117 Median :1727 Median : 204
## Mean :150.50 Mean : 365.856 Mean :1728 Mean : 631475
## 3rd Qu.:225.25 3rd Qu.: 0.405 3rd Qu.:1730 3rd Qu.: 699
## Max. :300.00 Max. :13225.659 Max. :1740 Max. :22824922
## elec_prod geometry elec_prod_mwh
## Min. : 0 POLYGON :300 Min. : 0.0
## 1st Qu.: 0 epsg:NA : 0 1st Qu.: 0.0
## Median : 0 +proj=merc...: 0 Median : 0.0
## Mean : 81401 Mean : 81.4
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :2944415 Max. :2944.4
The sample mean is denoted by \(\overline{y}\) and the sample variance by \(s^2\).
## [1] "The mean of sample 1: 54.06 (mWh)"
## [1] "The variance of sample 1: 78930.14 (mWh)"
The following equation gives an unbiased estimate of the variance of the estimator \(\overline{y}\), where \(N\) is the population size and \(n\) is the sample size:
\[\begin{equation} \widehat{\mathrm{var}}(\overline{y}) = \left(\frac{N-n}{N}\right)\left(\frac{s^2}{n}\right) \tag{1.3} \end{equation}\]
The following equation gives the estimated standard error of the estimator \(\overline{y}\):
\[\begin{equation} SEM = \sqrt{\widehat{\mathrm{var}}(\overline{y})} \tag{1.4} \end{equation}\]
## [1] "The variance of the sample mean: 757.5 (mWh)"
## [1] "The estimated standard error of the sample mean: 27.52 (mWh)"
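Equations (1.3) and (1.4) can be checked against the sample 1 output with a short Python sketch. The population size `N = 2482` is an assumption: the text does not state it, and the value is inferred from the ratio of the reported total to the reported mean.

```python
import math


def variance_of_mean(s2: float, n: int, N: int) -> float:
    """Equation (1.3): finite population correction times s^2 / n."""
    return ((N - n) / N) * (s2 / n)


N = 2482       # ASSUMPTION: population size inferred from the reported total / mean
n = 100        # size of sample 1
s2 = 78930.14  # reported variance of sample 1

var_mean = variance_of_mean(s2, n, N)
sem = math.sqrt(var_mean)  # equation (1.4)
print(round(var_mean, 2), round(sem, 2))
```

With these inputs the sketch reproduces the reported values of about 757.5 and 27.52.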
The following equation gives an unbiased estimator of the population total \(\hat{t}\):
\[\begin{equation} \hat{t} = N\overline{y} \tag{1.5} \end{equation}\]
The following equation gives an unbiased estimate of the variance of the estimator \(\hat{t}\):
\[\begin{equation} \widehat{\mathrm{var}}(\hat{t}) = N^2\,\widehat{\mathrm{var}}(\overline{y}) \tag{1.6} \end{equation}\]
The following equation gives the estimated standard error of the estimator \(\hat{t}\):
\[\begin{equation} SET = \sqrt{\widehat{\mathrm{var}}(\hat{t})} \tag{1.7} \end{equation}\]
## [1] "The estimation of the renewable electricity production potential by all the buildings in the city: 134167.17 (mWh)"
## [1] "The variance of the estimated total: 4666447948.84 (mWh)"
## [1] "The estimated standard error of the total: 68311.4 (mWh)"
## [1] "The 95% confidence interval estimation for sample 1 is: (20743.51 (mWh), 247590.82 (mwh))"
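As a rough check of equations (1.5) to (1.7), the sketch below recomputes the sample 1 total in Python. The population size `N = 2482` is again an assumption inferred from the reported numbers. The sketch also backs out the multiplier implied by the reported interval, since the text does not say which quantile was used. For these numbers the implied multiplier comes out near 1.66, whereas a two-sided 95% normal interval would use about 1.96. A value near 1.66 is what the 0.95 quantile of a t distribution with 99 degrees of freedom gives, so the reported bounds may correspond to a two-sided 90% interval. This is a reading of the numbers, not something the source confirms.

```python
import math
from statistics import NormalDist

N = 2482          # ASSUMPTION: population size inferred from the reported total / mean
y_bar = 54.056    # sample 1 mean in MWh (54056 kWh in the summary output)
var_mean = 757.5  # reported variance of the sample mean

t_hat = N * y_bar                 # equation (1.5)
var_total = N ** 2 * var_mean     # equation (1.6)
se_total = math.sqrt(var_total)   # equation (1.7)
print(round(t_hat, 2), round(se_total, 2))

# Reported interval for sample 1
lower_reported, upper_reported = 20743.51, 247590.82

# Half-width divided by the standard error gives the multiplier that was used
implied_multiplier = (upper_reported - lower_reported) / (2 * se_total)
print(round(implied_multiplier, 2))

# For comparison: the two-sided 95% multiplier under a normal approximation
z_95 = NormalDist().inv_cdf(0.975)
print(round(z_95, 2))
```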
The sample mean is denoted by \(\overline{y}\) and the sample variance by \(s^2\).
## [1] "The mean of sample 2: 81.4 (mWh)"
## [1] "The variance of sample 2: 109098.9 (mWh)"
Equation (1.3) gives the estimated variance of the estimator \(\overline{y}\) and equation (1.4) its estimated standard error.
## [1] "The variance of the sample mean: 319.71 (mWh)"
## [1] "The estimated standard error of the sample mean: 17.88 (mWh)"
Equation (1.5) gives the estimate of the population total \(\hat{t}\), equation (1.6) its estimated variance and equation (1.7) its estimated standard error.
## [1] "The estimation of the renewable electricity production potential by all the buildings in the city: 202036.58 (mWh)"
## [1] "The variance of the estimated total: 1969498517.02 (mWh)"
## [1] "The estimated standard error of the total: 44379.03 (mWh)"
## [1] "The 95% confidence interval estimation for sample 2 is: (128812.7 (mWh), 275260.47 (mwh))"
The following calculations are for both sample 1 and sample 2 together, resulting in a total sample of 400 building rooftops. We will call this sample, sample 3.
The sample mean is denoted by \(\overline{y}\) and the sample variance by \(s^2\).
## [1] "The mean of sample 3: 74.56 (mWh)"
## [1] "The variance of sample 3: 101480.54 (mWh)"
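The sample 3 figures can be cross-checked from the summaries of samples 1 and 2 alone, without the raw data. The sketch below pools group sizes, means and variances; the function name `pool` is hypothetical, and the inputs are the rounded values printed above, so small rounding differences are expected.

```python
def pool(groups):
    """Pool (n, mean, variance) summaries into one sample.

    Returns the combined size, mean and sample variance (n - 1 denominator).
    """
    n_total = sum(n for n, _, _ in groups)
    mean = sum(n * m for n, m, _ in groups) / n_total
    # Within-group sum of squares plus between-group sum of squares
    ss = sum((n - 1) * v + n * (m - mean) ** 2 for n, m, v in groups)
    return n_total, mean, ss / (n_total - 1)


sample_1 = (100, 54.06, 78930.14)  # n, mean (MWh), variance as reported
sample_2 = (300, 81.40, 109098.9)

n3, mean3, var3 = pool([sample_1, sample_2])
print(n3, round(mean3, 2), round(var3, 1))
```

The pooled values agree with the reported mean and variance of sample 3 up to rounding of the inputs.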
Equation (1.3) gives the estimated variance of the estimator \(\overline{y}\) and equation (1.4) its estimated standard error.
## [1] "The variance of the sample mean: 212.81 (mWh)"
## [1] "The estimated standard error of the sample mean: 14.59 (mWh)"
Equation (1.5) gives the estimate of the population total \(\hat{t}\), equation (1.6) its estimated variance and equation (1.7) its estimated standard error.
## [1] "The estimation of the renewable electricity production potential by all the buildings in the city: 185069.23 (mWh)"
## [1] "The variance of the estimated total: 1311007843.95 (mWh)"
## [1] "The estimated standard error of the total: 36207.84 (mWh)"
## [1] "The 95% confidence interval estimation for sample 3 is: (125374.03 (mWh), 244764.43 (mwh))"
La Plata is divided into two strata based on satellite imagery from the Copernicus Land Monitoring Service global maps of land cover, land cover changes and related surface area statistics [3].
Stratum 1 represents the built-up area of the city and stratum 2 the non built-up area.
For each stratum, 60 random points were generated and then 20 building rooftops were digitized using the same methods as in chapter 1.
## Simple feature collection with 6 features and 7 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -6476517 ymin: -4140116 xmax: -6455221 ymax: -4125471
## CRS: +proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs
## # A tibble: 6 x 8
## id area X_mean usable_sr elec_prod geometry landuse
## <dbl> <dbl> <dbl> <dbl> <dbl> <POLYGON [m]> <chr>
## 1 1 0.004 1725. 6.90 0 ((-6467786 -4139229, -64677~ Built ~
## 2 2 834. 1727. 1440503. 185825. ((-6476517 -4129667, -64764~ Built ~
## 3 3 0.143 1727. 247. 0 ((-6459586 -4135615, -64595~ Built ~
## 4 4 69.1 1728. 119366. 15398. ((-6455235 -4134598, -64552~ Built ~
## 5 5 0.057 1726. 98.4 0 ((-6464188 -4140116, -64641~ Built ~
## 6 6 0.06 1728. 104. 0 ((-6471555 -4125471, -64715~ Built ~
## # ... with 1 more variable: elec_prod_mwh <dbl>
## id area X_mean usable_sr
## Min. : 1.00 Min. : 0.001 Min. :1721 Min. : 2
## 1st Qu.:15.75 1st Qu.: 0.016 1st Qu.:1727 1st Qu.: 27
## Median :30.50 Median : 0.039 Median :1729 Median : 68
## Mean :30.50 Mean : 143.529 Mean :1730 Mean : 247773
## 3rd Qu.:45.25 3rd Qu.: 0.078 3rd Qu.:1734 3rd Qu.: 134
## Max. :60.00 Max. :11988.234 Max. :1740 Max. :20680615
## elec_prod geometry landuse elec_prod_mwh
## Min. : 0 POLYGON :120 Length:120 Min. : 0.00
## 1st Qu.: 0 epsg:NA : 0 Class :character 1st Qu.: 0.00
## Median : 0 +proj=merc...: 0 Mode :character Median : 0.00
## Mean : 31870 Mean : 31.87
## 3rd Qu.: 0 3rd Qu.: 0.00
## Max. :2667799 Max. :2667.80
After stratification, the strata are combined into one sample and the same computations as for random sampling are applied.
## landuse elec_prod_mwh
## 1 Built up 63.73959
## 2 Non built up 0.00000
## [1] "The mean of the stratfied sample is: 31.87 (mWh)"
## [1] "The variance of the stratfied sample is: 62862.69 (mWh)"
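Because the two strata are combined into one sample, the overall mean is the stratum means weighted by the number of points in each stratum. A minimal Python sketch with the 60 points per stratum described above (the function name is hypothetical):

```python
def combined_mean(strata: dict) -> float:
    """strata maps a stratum name to (number of sample points, stratum mean)."""
    n_total = sum(n for n, _ in strata.values())
    return sum(n * m for n, m in strata.values()) / n_total


strata = {
    "Built up": (60, 63.73959),  # mean in MWh from the output above
    "Non built up": (60, 0.0),
}
print(round(combined_mean(strata), 2))
```

With equal allocation, the combined mean is simply half of the built-up mean, which matches the reported 31.87.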
Equation (1.3) gives the estimated variance of the estimator \(\overline{y}\) and equation (1.4) its estimated standard error.
## [1] "The variance of the sample mean: 498.53 (mWh)"
## [1] "The estimated standard error of the sample mean: 22.33 (mWh)"
Equation (1.5) gives the estimate of the population total \(\hat{t}\), equation (1.6) its estimated variance and equation (1.7) its estimated standard error.
## [1] "The estimation of the renewable electricity production potential by all the buildings in the city: 79100.84 (mWh)"
## [1] "The variance of the estimated total: 3071095766.73 (mWh)"
## [1] "The estimated standard error of the total: 55417.47 (mWh)"
## [1] "The 95% confidence interval estimation for the stratified sample is: (-12767.99 (mWh), 170969.66 (mwh))"
When the city of La Plata is divided into built-up and non built-up areas, the built-up stratum turns out to be the more variable one. The optimum scheme allocates a larger sample size to the more variable strata and a smaller sample size to the more difficult-to-sample strata. [4]
Therefore, two new strata samples are computed, with a larger allocation to the built-up stratum.
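The allocation rule described above can be sketched in Python. This is the standard optimum allocation, where the share of stratum \(h\) is proportional to \(N_h \sigma_h / \sqrt{c_h}\) (stratum size, standard deviation and per-unit sampling cost). The function name and the numbers in the example are hypothetical; they are not the values used for La Plata.

```python
import math


def optimum_allocation(n, sizes, sds, costs=None):
    """Split a total sample size n across strata.

    Each stratum gets a share proportional to N_h * sd_h / sqrt(c_h).
    With equal costs this reduces to Neyman allocation.
    """
    if costs is None:
        costs = [1.0] * len(sizes)
    weights = [N_h * sd_h / math.sqrt(c_h)
               for N_h, sd_h, c_h in zip(sizes, sds, costs)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical example: two equally large strata, the first three times as variable
print(optimum_allocation(40, sizes=[100, 100], sds=[3.0, 1.0]))  # [30.0, 10.0]
```

A stratum with no variability at all, such as one where every value is 0, receives a share of zero under this rule, so in practice a small minimum sample per stratum is usually kept.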
## Simple feature collection with 6 features and 7 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -6463597 ymin: -4130860 xmax: -6448816 ymax: -4123499
## CRS: +proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs
## # A tibble: 6 x 8
## id area X_mean usable_sr elec_prod geometry landuse
## <dbl> <dbl> <dbl> <dbl> <dbl> <POLYGON [m]> <chr>
## 1 1 175. 1732. 303058. 39094. ((-6452163 -4130836, -645215~ Built ~
## 2 2 91.7 1735. 159086. 20522. ((-6448832 -4127833, -644882~ Built ~
## 3 3 0.066 1739. 115. 0 ((-6457547 -4125248, -645754~ Built ~
## 4 4 0.006 1739. 10.4 0 ((-6457997 -4123499, -645799~ Built ~
## 5 5 54.8 1731. 94900. 12242. ((-6463593 -4124949, -646358~ Built ~
## 6 6 56.2 1735. 97564. 12586. ((-6456051 -4129352, -645604~ Built ~
## # ... with 1 more variable: elec_prod_mwh <dbl>
## id area X_mean usable_sr
## Min. : 1.00 Min. : 0.001 Min. : 0 Min. : 0
## 1st Qu.:20.75 1st Qu.: 0.006 1st Qu.:1729 1st Qu.: 10
## Median :45.50 Median : 0.021 Median :1733 Median : 36
## Mean :45.75 Mean : 131.517 Mean :1716 Mean : 227277
## 3rd Qu.:70.25 3rd Qu.: 53.038 3rd Qu.:1736 3rd Qu.: 91834
## Max. :95.00 Max. :6109.904 Max. :1741 Max. :10546739
## elec_prod geometry landuse elec_prod_mwh
## Min. : 0 POLYGON :100 Length:100 Min. : 0.00
## 1st Qu.: 0 epsg:NA : 0 Class :character 1st Qu.: 0.00
## Median : 0 +proj=merc...: 0 Mode :character Median : 0.00
## Mean : 29316 Mean : 29.32
## 3rd Qu.: 11847 3rd Qu.: 11.85
## Max. :1360529 Max. :1360.53
After stratification, the strata are combined into one sample and the same computations as for random sampling are applied.
## landuse elec_prod_mwh
## 1 Built up 30.85859
## 2 Non built up 0.00000
## [1] "The mean of the stratfied sample is: 29.32 (mWh)"
## [1] "The variance of the stratfied sample is: 25362.42 (mWh)"
Equation (1.3) gives the estimated variance of the estimator \(\overline{y}\) and equation (1.4) its estimated standard error.
## [1] "The variance of the sample mean: 243.41 (mWh)"
## [1] "The estimated standard error of the sample mean: 15.6 (mWh)"
Equation (1.5) gives the estimate of the population total \(\hat{t}\), equation (1.6) its estimated variance and equation (1.7) its estimated standard error.
## [1] "The estimation of the renewable electricity production potential by all the buildings in the city: 72761.47 (mWh)"
## [1] "The variance of the estimated total: 1499457748.65 (mWh)"
## [1] "The estimated standard error of the total: 38722.83 (mWh)"
## [1] "The 95% confidence interval estimation for the stratified sample is: (8466.42 (mWh), 137056.52 (mwh))"
Sample (size) | Mean (MWh) | Var of Mean (MWh²) | SEM (MWh) | Total Estimation (MWh) | Var of Total (MWh²) | SET (MWh) | Lower CI (MWh) | Upper CI (MWh) |
---|---|---|---|---|---|---|---|---|
Sample 1 (n = 100) | 54.06 | 757.50 | 27.52 | 134,167.17 | 4,666,447,949 | 68,311.40 | 20,743.51 | 247,590.82 |
Sample 2 (n = 300) | 81.40 | 319.71 | 17.88 | 202,036.58 | 1,969,498,517 | 44,379.03 | 128,812.70 | 275,260.47 |
Both Samples (n = 400) | 74.56 | 212.81 | 14.59 | 185,069.23 | 1,311,007,844 | 36,207.84 | 125,374.03 | 244,764.43 |
Stratified Sample (n = 120) | 31.87 | 498.53 | 22.33 | 79,100.84 | 3,071,095,767 | 55,417.47 | -12,767.99 | 170,969.66 |
Allocated Stratified Sample (n = 100) | 29.32 | 243.41 | 15.60 | 72,761.47 | 1,499,457,749 | 38,722.83 | 8,466.42 | 137,056.52 |
[4] Thompson, S. (2012). Sampling. Hoboken, N.J.: Wiley, p. 147.