Chapter 51 Data Exploration and Validation with the {pointblank} Package
This chapter is part of the Data Validation pathway. Packages needed for this chapter include {tidyverse}, {pointblank}, and {medicaldata}.
It is quite common in clinical research to download a dataset and want to plunge right into data analysis. Unfortunately, we often find, to our regret, that we should have spent more time exploring and validating the data before starting the analysis. This chapter will introduce you to the {pointblank} package, which provides a powerful set of tools for data validation and exploration in R.
51.1 Goals for this Chapter
Understand how to explore and validate data using the {pointblank} package
Identify missing and outlier data
Learn how to document and fix data issues
51.2 Packages needed for this Chapter
{tidyverse}
{pointblank}
{medicaldata}
51.3 Pathway for this Chapter
This Chapter is part of the Data Validation pathway.
51.4 Getting Started with the {pointblank} package
We will start by installing and loading this package, using the code below.
This package can be downloaded from any CRAN mirror, and has practice datasets built in. Let’s take a look at small_table, one of the built-in datasets.
There are 8 variables and 13 rows in this dataset, with 6 distinct data types (date-time, date, integer, character, double, and logical).
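A minimal sketch of this setup step is shown here. It assumes the package is installed from CRAN; the install line is commented out so that it only runs when needed, and the name col_types is just an illustrative helper.

```r
# Install once per machine (uncomment to run)
# install.packages("pointblank")

library(pointblank)

# small_table ships with {pointblank}
small_table

# Number of rows and columns
dim(small_table)

# First class of each column, to see the data types at a glance
col_types <- sapply(small_table, function(col) class(col)[1])
col_types
```

Counting the distinct entries in col_types is a quick way to check how many data types a table contains.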
51.5 Starting with a Quick Scan of Your Table
The {pointblank} package provides a quick way to scan your dataset for potential issues. We can use the scan_data() function to get a summary of the data types, missing values, and unique values in each column. Let’s try this on small_table.
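A sketch of such a scan is shown here. The sections argument is used to request only the Overview, Variables, Missing values, and Sample sections, which keeps the scan fast. The object name table_scan and the output file name are illustrative choices, and the report is written to a temporary folder only for demonstration.

```r
library(pointblank)

# Overview, Variables, Missing values, and Sample sections only;
# the Interactions and Correlations sections are skipped to save time
table_scan <- scan_data(tbl = small_table, sections = "OVMS")

# Printing table_scan in an interactive session displays the report.
# Saving it as an HTML file is an alternative (temporary folder used here).
export_report(
  table_scan,
  filename = "small_table_scan.html",
  path = tempdir()
)
```

In your own project, you would likely save the report next to your analysis files rather than in tempdir().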
The first step in using {pointblank} is to create an “agent”. An agent is an object that holds the data and the validation rules. We can create an agent using the create_agent() function.
agent <- create_agent(tbl = small_table)
agent
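Pointblank Validation Plan (No Interrogation Performed)
[2025-11-20|18:18:16] tibble small_table

| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT |
|------|---------|--------|-----|------|-------|------|------|---|---|---|-----|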
This new agent, fresh from the academy, has no validation rules yet, but it now holds the small_table dataset, and we can add validation rules to it. We want to validate each variable and determine if there are any missing or outlier data points.
51.7 Examples of Validation Rules
The {pointblank} package provides a wide range of validation rules that we can use to check our data. Some examples include:
col_vals_not_null(): Check for missing values in a column
col_vals_between(): Check for values within a specified range
col_vals_in_set(): Check for values in a specified set
col_is_numeric(): Check if a column is numeric
col_is_character(): Check if a column is character
col_is_logical(): Check if a column is logical
col_vals_gte(): Check for values greater than or equal to a specified value
col_vals_lte(): Check for values less than or equal to a specified value
rows_distinct(): Check for duplicate rows
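As an illustration, several of the rules listed above can be combined in one plan for small_table. This is a sketch rather than a recommended set of checks: the object names agent_examples and x are illustrative, the limits 1 and 8 for column a are chosen only for demonstration, and get_agent_x_list() is used to pull the results out as ordinary R vectors.

```r
library(pointblank)

agent_examples <- create_agent(tbl = small_table) %>%
  col_is_logical(vars(e)) %>%            # is e a logical column?
  col_is_character(vars(f)) %>%          # is f a character column?
  col_vals_gte(vars(a), value = 1) %>%   # a >= 1 in every row?
  col_vals_lte(vars(a), value = 8) %>%   # a <= 8 in every row?
  rows_distinct() %>%                    # any duplicated rows?
  interrogate()

# Summarize the number of failing test units per step
x <- get_agent_x_list(agent_examples)
data.frame(step = x$i, rule = x$type, failed = x$n_failed)
```

Note that the column-type rules such as col_is_logical() test the column as a whole, so they have a single test unit, while the value rules have one test unit per row.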
51.8 Adding Validation Rules
We can add validation rules to the agent using various functions provided by {pointblank}. This will train up the agent to look for the kinds of data problems we care about. For example, we can use the col_vals_not_null() function to check for missing values in a column, or use col_vals_between() with a range to check for out-of-range values.
agent <- agent %>%
  col_vals_not_null(vars(a)) %>%
  col_vals_between(vars(b), left = 1, right = 10) %>%
  col_vals_in_set(vars(c), set = c("A", "B", "C"))
agent
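Pointblank Validation Plan (No Interrogation Performed)
[2025-11-20|18:18:16] tibble small_table

| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT |
|------|---------|--------|-----|------|-------|------|------|---|---|---|-----|
| 1 col_vals_not_null() | a | — | | | | | | | | | |
| 2 col_vals_between() | b | [1, 10] | | | | | | | | | |
| 3 col_vals_in_set() | c | A, B, C | | | | | | | | | |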
We can see that we have added three validation rules to the agent. The first rule checks that column a has no missing values, the second rule checks that column b has values between 1 and 10, and the third rule checks that column c has values in the set {“A”, “B”, “C”}.
51.9 Interrogating the Data with Your Agent
Once we have added validation rules to the agent, the agent can interrogate the data to see whether it passes the validation rules. The agent will come back with new intel on the dataset. We can use the interrogate() function to do this.
agent <- agent %>%
  interrogate()
agent
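Pointblank Validation
[2025-11-20|18:18:16]
tibble small_table

| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT |
|------|---------|--------|-----|------|-------|------|------|---|---|---|-----|
| 1 col_vals_not_null() | a | — | | ✓ | 13 | 13 (1.00) | 0 (0.00) | — | — | — | — |
| 2 col_vals_between() | b | [1, 10] | | ✓ | 13 | 2 (0.15) | 11 (0.85) | — | — | — | |
| 3 col_vals_in_set() | c | A, B, C | | ✓ | 13 | 0 (0.00) | 13 (1.00) | — | — | — | |

2025-11-20 18:18:16 EST | < 1 s | 2025-11-20 18:18:16 EST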
We can now see the result of the interrogation. The updated agent shows us how many rows passed and failed each validation rule in the PASS and FAIL columns, and in the rendered report a colored sidebar next to each step gives a quick visual summary of the same result. Every row passed the first rule, but 11 of the 13 rows failed the second rule and all 13 rows failed the third. We can dig deeper to see which rows failed. To do this, we can use the get_data_extracts() function.
This function returns a list of data frames, one for each validation step that had failing rows. Here the extracts hold the 11 rows that failed the second validation rule and all 13 rows that failed the third. So many failures are a sign that these two rules are a poor fit for the data: column b holds character ID strings such as "1-bcd-345" rather than numbers between 1 and 10, and column c holds numbers rather than the letters A, B, and C.
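A sketch of how the failing rows could be pulled out is shown here. It rebuilds the three-rule agent from this section so that it runs on its own; the names failed_rows and failed_step_3 are illustrative.

```r
library(pointblank)

# Re-create and interrogate the three-rule agent from this section
agent <- create_agent(tbl = small_table) %>%
  col_vals_not_null(vars(a)) %>%
  col_vals_between(vars(b), left = 1, right = 10) %>%
  col_vals_in_set(vars(c), set = c("A", "B", "C")) %>%
  interrogate()

# A named list: one data frame per step that had failing rows
failed_rows <- get_data_extracts(agent)
names(failed_rows)

# The extract for a single step, selected by step number
failed_step_3 <- get_data_extracts(agent, i = 3)
failed_step_3
```

Looking at the extract for one step at a time is usually easier to read than printing the whole list.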
51.10 Documenting and Fixing Data Issues
Once we have identified the data issues, we can document our validation work using the yaml_write() function.
This function writes the agent's validation plan (the table and its rules, rather than the interrogation results) to a YAML file, which can be kept for documentation purposes.
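A minimal sketch of writing a plan to YAML follows. It assumes the table is supplied as a formula (~ pointblank::small_table), which is the form used in the package's own examples so that the YAML file records how to fetch the table. The agent name, label, and file name are illustrative, and a temporary folder is used only for demonstration.

```r
library(pointblank)

# Supplying the table as a formula lets the YAML record where the data come from
agent_doc <- create_agent(
  tbl = ~ pointblank::small_table,
  tbl_name = "small_table",
  label = "Checks on small_table"
) %>%
  col_vals_not_null(vars(a)) %>%
  col_vals_between(vars(d), left = 100, right = 3000)

# Write the plan to a temporary folder (use a project folder in real work)
yaml_write(agent_doc, filename = "agent-small_table.yml", path = tempdir())

# Inspect the YAML text that was written
yaml_file <- file.path(tempdir(), "agent-small_table.yml")
cat(readLines(yaml_file), sep = "\n")
```

To keep a record of the interrogation results themselves, the export_report() function can save an interrogated agent's report as an HTML file.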
We can also fix the data issues. The {pointblank} package reports data problems rather than repairing them, so the fixes are made with data-wrangling functions from the {tidyverse}. For example, we can use replace_na() to replace missing values with a specified value, or use if_else() inside mutate() to replace out-of-bounds values with NA.
## # A tibble: 13 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <lgl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 NA 3423. TRUE high
## 2 2016-01-04 00:32:00 2016-01-04 3 <NA> NA 10000. TRUE low
## 3 2016-01-05 13:32:00 2016-01-05 6 <NA> NA 2343. TRUE high
## 4 2016-01-06 17:23:00 2016-01-06 2 <NA> NA 3892. FALSE mid
## 5 2016-01-09 12:36:00 2016-01-09 8 <NA> NA 284. TRUE low
## 6 2016-01-11 06:15:00 2016-01-11 4 <NA> NA 3291. TRUE mid
## 7 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 NA 843. TRUE high
## 8 2016-01-17 11:27:00 2016-01-17 4 <NA> NA 1036. FALSE low
## 9 2016-01-20 04:30:00 2016-01-20 3 <NA> NA 838. FALSE high
## 10 2016-01-20 04:30:00 2016-01-20 3 <NA> NA 838. FALSE high
## 11 2016-01-26 20:07:00 2016-01-26 4 <NA> NA 834. TRUE low
## 12 2016-01-28 02:51:00 2016-01-28 2 <NA> NA 108. FALSE low
## 13 2016-01-30 11:23:00 2016-01-30 1 <NA> NA 2230. TRUE high
This code replaces missing values in column a with 0, replaces out of range values in column b with NA, and replaces invalid values in column c with NA.
Conclusion
The {pointblank} package provides a powerful set of tools for data validation and exploration in R. By creating an agent, adding validation rules, interrogating the agent, and documenting and fixing data issues, we can ensure that our data is clean and ready for analysis.
51.11 Setting Action Levels
By default, the {pointblank} package sets no action levels for validation rules, which is why the W, S, and N columns in the reports above contain only dashes. We can set failure thresholds with the action_levels() function and pass the result to the actions argument of create_agent(). The available action levels are “warn”, “stop”, and “notify”, set with the warn_at, stop_at, and notify_at arguments.
al <- action_levels(warn_at = 2, stop_at = 4)
small_table %>%
  create_agent(actions = al) %>%
  col_vals_lt(a, value = 7) %>%
  interrogate()
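Pointblank Validation
[2025-11-20|18:18:16]
tibble | WARN 2 | STOP 4 | NOTIFY —

| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT |
|------|---------|--------|-----|------|-------|------|------|---|---|---|-----|
| 1 col_vals_lt() | a | 7 | | ✓ | 13 | 11 (0.85) | 2 (0.15) | ● | ○ | — | |

2025-11-20 18:18:16 EST | < 1 s | 2025-11-20 18:18:16 EST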
If we look at the validation report table, we can see that:
The FAIL column shows that 2 test units have failed.
The W column (short for ‘warning’) shows a filled yellow circle, indicating that the failing test units reached that threshold value.
The S column (short for ‘stop’) shows an open red circle, indicating that the number of failing test units is below that threshold.
The final action level, N (for ‘notify’), wasn’t set, so it appears on the validation table as a long dash.
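The same information can also be read in code rather than from the report table. The sketch below repeats the example with the same thresholds and uses get_agent_x_list() to extract the results as plain R values; the object names agent_al and x are illustrative.

```r
library(pointblank)

# Warn at 2 failing test units, stop at 4 (same thresholds as above)
al <- action_levels(warn_at = 2, stop_at = 4)

agent_al <- small_table %>%
  create_agent(actions = al) %>%
  col_vals_lt(vars(a), value = 7) %>%
  interrogate()

# Pull the results out as ordinary R vectors
x <- get_agent_x_list(agent_al)
x$n_failed  # number of failing test units in each step
x$warn      # was the warn threshold reached?
x$stop      # was the stop threshold reached?
```

Reading the thresholds this way is handy when a script needs to decide what to do next, for example halting an analysis when a stop condition is met.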
51.12 Try it Yourself
Now it’s your turn to try using the {pointblank} package. Use the code below to create your own agent, add validation rules, interrogate the agent, and document and fix any data issues you find.
Start with small_table. Create an agent, and devise validation rules to make sure that:
all values for a are between 1 and 7
all values for d are between 100 and 3000
variable e is always a logical
variable f is always low, mid, or high, and is turned into a factor variable (mutate)
Create the agent and run the interrogation, then display the updated agent.
agent <- create_agent(tbl = small_table) %>%
  col_vals_between(vars(a), left = 1, right = 7) %>%
  col_vals_between(vars(d), left = 100, right = 3000) %>%
  col_is_logical(vars(e)) %>%
  col_vals_in_set(vars(f), set = c("low", "mid", "high")) %>%
  interrogate()
agent
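Pointblank Validation
[2025-11-20|18:18:17]
tibble small_table

| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT |
|------|---------|--------|-----|------|-------|------|------|---|---|---|-----|
| 1 col_vals_between() | a | [1, 7] | | ✓ | 13 | 12 (0.92) | 1 (0.08) | — | — | — | |
| 2 col_vals_between() | d | [100, 3,000] | | ✓ | 13 | 9 (0.69) | 4 (0.31) | — | — | — | |
| 3 col_is_logical() | e | — | | ✓ | 1 | 1 (1.00) | 0 (0.00) | — | — | — | — |
| 4 col_vals_in_set() | f | low, mid, high | | ✓ | 13 | 13 (1.00) | 0 (0.00) | — | — | — | — |

2025-11-20 18:18:17 EST | < 1 s | 2025-11-20 18:18:17 EST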
Now identify the failures and fix them in small_table, creating a new data frame called small_table_fixed.
failed_rows <- get_data_extracts(agent)
failed_rows
## $`1`
## # A tibble: 1 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
##
## $`2`
## # A tibble: 4 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
## 2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
## 3 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA 3892. FALSE mid
## 4 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 3291. TRUE mid
## # A tibble: 13 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <fct>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 NA TRUE high
## 2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 NA TRUE low
## 3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
## 4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA NA FALSE mid
## 5 2016-01-09 12:36:00 2016-01-09 NA 3-ldm-038 7 284. TRUE low
## 6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 NA TRUE mid
## 7 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
## 8 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1036. FALSE low
## 9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 10 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 11 2016-01-26 20:07:00 2016-01-26 4 2-dmx-010 7 834. TRUE low
## 12 2016-01-28 02:51:00 2016-01-28 2 7-dmx-010 8 108. FALSE low
## 13 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 NA 2230. TRUE high
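One way to produce a small_table_fixed like the one shown above is sketched here with {dplyr}. This is an illustrative approach under stated assumptions, not the only possible fix: out-of-range values are set to NA rather than guessed, f is converted to a factor with the levels ordered low, mid, high, and the name recheck is a hypothetical follow-up agent that allows NA values to pass.

```r
library(pointblank)
library(dplyr)

small_table_fixed <- small_table %>%
  mutate(
    # Values of a outside 1 to 7 become NA
    a = if_else(between(a, 1, 7), a, NA_integer_),
    # Values of d outside 100 to 3000 become NA
    d = if_else(between(d, 100, 3000), d, NA_real_),
    # f becomes a factor with an explicit level order
    f = factor(f, levels = c("low", "mid", "high"))
  )
small_table_fixed

# Re-run the checks on the fixed table, letting NA values pass
recheck <- create_agent(tbl = small_table_fixed) %>%
  col_vals_between(vars(a), left = 1, right = 7, na_pass = TRUE) %>%
  col_vals_between(vars(d), left = 100, right = 3000, na_pass = TRUE) %>%
  col_is_logical(vars(e)) %>%
  col_is_factor(vars(f)) %>%
  interrogate()

all_passed(recheck)
```

Whether to set a suspicious value to NA, correct it from the source record, or drop the row is a judgment call that depends on the study, so it is worth recording the choice alongside the code.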
51.13 Further Challenges
Try using the {pointblank} package on your own datasets. Create an agent, add validation rules, interrogate the agent, and document and fix any data issues you find. An example dataset you can use is the psych dataset from the {medicaldata} package. Load this dataset and explore it using the {pointblank} package.
View the psych dataset. Think about the variable type, allowable values, and ranges for each variable. Create an agent and add at least 8 validation rules to check for variable data type, missing values, out of range values, and invalid values. Make sure that there are no duplicate rows. Interrogate the agent and document any data issues you find. Finally, fix the data issues and create a new data frame called psych_fixed.
agent <- create_agent(tbl = psych) %>%
  col_vals_not_null(vars(study_id)) %>%
  col_vals_in_set(vars(redcap_data_access_group), set = c("University of Michigan", "Mayo Clinic", "Oregon Health and Science University")) %>%
  col_vals_in_set(vars(psych_hx), set = c("Yes", "No")) %>%
  col_vals_in_set(vars(psych_dep), set = c("Yes", "No")) %>%
  col_vals_in_set(vars(psych_bipo), set = c("Yes", "No")) %>%
  col_vals_in_set(vars(psych_anx), set = c("Yes", "No")) %>%
  col_vals_in_set(vars(psych_ptsd), set = c("Yes", "No")) %>%
  col_vals_in_set(vars(psych_meds), set = c("Yes", "No")) %>%
  col_vals_in_set(vars(psych_visit), set = c("Yes", "No", "Not available at study site", "Unknown/not reported"))
interrogate(agent)
Pointblank Validation
[2025-11-20|18:18:17]
tibble psych

| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT |
|------|---------|--------|-----|------|-------|------|------|---|---|---|-----|
| 1 col_vals_not_null() | study_id | — | | ✓ | 1K | 1K (1.00) | 0 (0.00) | — | — | — | — |
| 2 col_vals_in_set() | redcap_data_access_group | University of Michigan, Mayo Clinic, Oregon Health and Science University | | ✓ | 1K | 1K (0.94) | 64 (0.06) | — | — | — | |
| 3 col_vals_in_set() | psych_hx | Yes, No | | ✓ | 1K | 1K (0.99) | 12 (0.01) | — | — | — | |
| 4 col_vals_in_set() | psych_dep | Yes, No | | ✓ | 1K | 771 (0.68) | 369 (0.32) | — | — | — | |
| 5 col_vals_in_set() | psych_bipo | Yes, No | | ✓ | 1K | 771 (0.68) | 369 (0.32) | — | — | — | |
| 6 col_vals_in_set() | psych_anx | Yes, No | | ✓ | 1K | 771 (0.68) | 369 (0.32) | — | — | — | |
| 7 col_vals_in_set() | psych_ptsd | Yes, No | | ✓ | 1K | 771 (0.68) | 369 (0.32) | — | — | — | |
| 8 col_vals_in_set() | psych_meds | Yes, No | | ✓ | 1K | 771 (0.68) | 369 (0.32) | — | — | — | |
| 9 col_vals_in_set() | psych_visit | Yes, No, Not available at study site, Unknown/not reported | | ✓ | 1K | 772 (0.68) | 368 (0.32) | — | — | — | |

2025-11-20 18:18:17 EST | < 1 s | 2025-11-20 18:18:17 EST
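agent
Pointblank Validation Plan (No Interrogation Performed)
[2025-11-20|18:18:17] tibble psych

| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT |
|------|---------|--------|-----|------|-------|------|------|---|---|---|-----|
| 1 col_vals_not_null() | study_id | — | | | | | | | | | |
| 2 col_vals_in_set() | redcap_data_access_group | University of Michigan, Mayo Clinic, Oregon Health and Science University | | | | | | | | | |
| 3 col_vals_in_set() | psych_hx | Yes, No | | | | | | | | | |
| 4 col_vals_in_set() | psych_dep | Yes, No | | | | | | | | | |
| 5 col_vals_in_set() | psych_bipo | Yes, No | | | | | | | | | |
| 6 col_vals_in_set() | psych_anx | Yes, No | | | | | | | | | |
| 7 col_vals_in_set() | psych_ptsd | Yes, No | | | | | | | | | |
| 8 col_vals_in_set() | psych_meds | Yes, No | | | | | | | | | |
| 9 col_vals_in_set() | psych_visit | Yes, No, Not available at study site, Unknown/not reported | | | | | | | | | |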
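One detail worth noticing in the output above is that interrogate(agent) was called without saving its result, so printing agent afterwards still shows “No Interrogation Performed”. The sketch below illustrates this with small_table, which is used here only because it ships with {pointblank}; the names plan and checked are illustrative. It also adds rows_distinct(), the duplicate-row check that the challenge asks for.

```r
library(pointblank)

# Build a validation plan; nothing has been checked yet
plan <- create_agent(tbl = small_table) %>%
  col_vals_not_null(vars(a)) %>%
  rows_distinct()

# interrogate() returns a new agent; `plan` itself is left unchanged
checked <- interrogate(plan)

# Results as plain R vectors, one entry per validation step
plan_failed <- get_agent_x_list(plan)$n_failed        # not yet filled in
checked_failed <- get_agent_x_list(checked)$n_failed  # failing test units
plan_failed
checked_failed
```

Assigning the result, as in agent <- interrogate(agent), keeps the results available for later steps such as get_data_extracts().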
51.14 Additional Resources
There is a lot more that the {pointblank} package can do, including automated reporting each time new data come in, custom validation rules, emailing automated reports daily or weekly, and integration with other packages.
There are several great articles and tutorials available online to help you learn more about the {pointblank} package and its capabilities.