Chapter 51 Data Exploration and Validation with the {pointblank} Package

51.1 Goals for this Chapter

Understand how to explore and validate data using the {pointblank} package
Identify missing and outlier data
Learn how to document and fix data issues

51.2 Packages needed for this Chapter

{tidyverse} {pointblank} {medicaldata}

51.3 Pathway for this Chapter

This Chapter is part of the Data Validation pathway.

51.4 Getting Started with the {pointblank} package

We will start by installing and loading this package, using the code below.

# install.packages('pointblank')
library(tidyverse)   # provides mutate() and select(), used later in this chapter
library(pointblank)

This package can be installed from any CRAN mirror, and it includes built-in practice datasets. Let’s take a look at small_table, one of these built-in datasets.

data("small_table", package = "pointblank")
small_table

## # A tibble: 13 × 8
##    date_time           date           a b             c      d e     f    
##    <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
##  1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high 
##  2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
##  3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3  2343. TRUE  high 
##  4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
##  5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low  
##  6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid  
##  7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3   843. TRUE  high 
##  8 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2  1036. FALSE low  
##  9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
## 10 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
## 11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low  
## 12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low  
## 13 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high

There are 8 variables and 13 rows in this dataset, with 6 distinct column types (date-time, date, integer, character, double, and logical).

51.5 Starting with a Quick Scan of Your Table

The {pointblank} package provides a quick way to scan your dataset for potential issues. We can use the scan_data() function to get a summary of the data types, missing values, and unique values in each column. Let’s try this on small_table.

scan_data(small_table)

Table Overview

Columns	8
Rows	13
`NA`s	2 (1.92%)
Duplicate Rows	1 (7.69%)

Column Types

character	2
numeric	2
Date	1
POSIXct	1
integer	1
logical	1

Reproducibility Information

Scan Build Time	`2025-11-20 18:18:11`
pointblank Version	`0.12.2`
R Version	R version 4.5.1 (2025–06–13) Great Square Root
Operating System	`aarch64-apple-darwin20`

date_time
datetime

Distinct	12
`NA`s	0
`Inf`/`-Inf`	0

date
date

Distinct	11
`NA`s	0
`Inf`/`-Inf`	0

a
integer

Distinct	7
`NA`s	0
`Inf`/`-Inf`	0

Mean	3.77
Minimum	1
Maximum	8

Quantile Statistics

Minimum	1.00
5th Percentile	1.60
Q1	2.00
Median	3.00
Q3	4.00
95th Percentile	7.40
Maximum	8.00
Range	7.00
IQR	2.00

Descriptive Statistics

Mean	3.77
Variance	4.36
Standard Deviation	2.09
Coefficient of Variation	0.55

Value	Count	Frequency
2	3	23.1%
3	3	23.1%
4	3	23.1%
1	1	7.7%
6	1	7.7%
7	1	7.7%
8	1	7.7%
Other Values (0)	0	0.0%

Maximum Values

Value	Count	Frequency
2	3	23.08%
3	3	23.08%
4	3	23.08%
1	1	7.69%
6	1	7.69%
7	1	7.69%
8	1	7.69%

Minimum Values

Value	Count	Frequency
1	1	7.69%
6	1	7.69%
7	1	7.69%
8	1	7.69%
2	3	23.08%
3	3	23.08%
4	3	23.08%

b
character

Distinct	12
`NA`s	0
`Inf`/`-Inf`	0

Value	Count	Frequency
5-bce-642	2	15.4%
1-bcd-345	1	7.7%
1-knw-093	1	7.7%
2-dhe-923	1	7.7%
2-dmx-010	1	7.7%
3-dka-303	1	7.7%
3-ldm-038	1	7.7%
5-boe-639	1	7.7%
5-egh-163	1	7.7%
Other Values (2)	2	15.4%

String Lengths

Mean	9.0
Minimum	9.0
Maximum	9.0

Histogram

c
numeric

Distinct	7
`NA`s	2
`Inf`/`-Inf`	0

Mean	5.73
Minimum	2
Maximum	9

Quantile Statistics

Minimum	2.00
5th Percentile	2.50
Q1	3.00
Median	7.00
Q3	8.00
95th Percentile	9.00
Maximum	9.00
Range	7.00
IQR	5.00

Descriptive Statistics

Mean	5.73
Variance	7.42
Standard Deviation	2.72
Coefficient of Variation	0.48

Value	Count	Frequency
3	3	23.1%
7	2	15.4%
8	2	15.4%
9	2	15.4%
`NA`	2	15.4%
2	1	7.7%
4	1	7.7%
Other Values (0)	0	0.0%

Maximum Values

Value	Count	Frequency
3	3	23.08%
7	2	15.38%
8	2	15.38%
9	2	15.38%
`NA`	2	15.38%
2	1	7.69%
4	1	7.69%

Minimum Values

Value	Count	Frequency
2	1	7.69%
4	1	7.69%
7	2	15.38%
8	2	15.38%
9	2	15.38%
`NA`	2	15.38%
3	3	23.08%

d
numeric

Distinct	12
`NA`s	0
`Inf`/`-Inf`	0

Mean	2,304.7
Minimum	108.34
Maximum	9,999.99

Quantile Statistics

Minimum	108.34
5th Percentile	213.70
Q1	837.93
Median	1,035.64
Q3	3,291.03
95th Percentile	6,335.44
Maximum	9,999.99
Range	9,891.65
IQR	2,453.10

Descriptive Statistics

Mean	2,304.70
Variance	6,924,068.46
Standard Deviation	2,631.36
Coefficient of Variation	1.14

Value	Count	Frequency
837.93	2	15.4%
108.34	1	7.7%
283.94	1	7.7%
833.98	1	7.7%
843.34	1	7.7%
1035.64	1	7.7%
2230.09	1	7.7%
2343.23	1	7.7%
3291.03	1	7.7%
Other Values (2)	2	15.4%

Maximum Values

Value	Count	Frequency
837.93	2	15.38%
108.34	1	7.69%
283.94	1	7.69%
833.98	1	7.69%
843.34	1	7.69%
1035.64	1	7.69%
2230.09	1	7.69%
2343.23	1	7.69%
3291.03	1	7.69%
3423.29	1	7.69%

Minimum Values

Value	Count	Frequency
108.34	1	7.69%
283.94	1	7.69%
833.98	1	7.69%
843.34	1	7.69%
1035.64	1	7.69%
2230.09	1	7.69%
2343.23	1	7.69%
3291.03	1	7.69%
3423.29	1	7.69%
3892.40	1	7.69%

e
logical

Distinct	2
`NA`s	0
`Inf`/`-Inf`	0

f
character

Distinct	3
`NA`s	0
`Inf`/`-Inf`	0

Value	Count	Frequency
high	6	46.2%
low	5	38.5%
mid	2	15.4%
Other Values (0)	0	0.0%

String Lengths

Mean	3.5
Minimum	3.0
Maximum	4.0

Histogram

	date_time	date	a	b	c	d	e	f
1	2016-01-04 11:00:00	2016-01-04	2	1-bcd-345	3	3423.29	TRUE	high
2	2016-01-04 00:32:00	2016-01-04	3	5-egh-163	8	9999.99	TRUE	low
3	2016-01-05 13:32:00	2016-01-05	6	8-kdg-938	3	2343.23	TRUE	high
4	2016-01-06 17:23:00	2016-01-06	2	5-jdo-903	`NA`	3892.40	FALSE	mid
5	2016-01-09 12:36:00	2016-01-09	8	3-ldm-038	7	283.94	TRUE	low
6..8
9	2016-01-20 04:30:00	2016-01-20	3	5-bce-642	9	837.93	FALSE	high
10	2016-01-20 04:30:00	2016-01-20	3	5-bce-642	9	837.93	FALSE	high
11	2016-01-26 20:07:00	2016-01-26	4	2-dmx-010	7	833.98	TRUE	low
12	2016-01-28 02:51:00	2016-01-28	2	7-dmx-010	8	108.34	FALSE	low
13	2016-01-30 11:23:00	2016-01-30	1	3-dka-303	`NA`	2230.09	TRUE	high
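The headline numbers in the Table Overview can be cross-checked with a few lines of base R. This is a minimal sketch, not part of scan_data() itself; the object names (n_missing, n_cells, n_duplicates) are made up for illustration.

```r
library(pointblank)

# Count missing cells across the whole table
n_missing <- sum(is.na(small_table))
n_cells <- nrow(small_table) * ncol(small_table)

# Count rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(small_table))

n_missing                                          # 2
round(100 * n_missing / n_cells, 2)                # 1.92
n_duplicates                                       # 1
round(100 * n_duplicates / nrow(small_table), 2)   # 7.69
```

These values agree with the 2 (1.92%) missing values and 1 (7.69%) duplicate row reported in the scan.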

51.6 Creating an Agent

The first step in using {pointblank} is to create an “agent”. An agent is an object that holds the data and the validation rules. We can create an agent using the create_agent() function.

agent <- create_agent(tbl = small_table)
agent

		STEP	COLUMNS	VALUES	TBL	EVAL	UNITS	PASS	FAIL	W	S	N	EXT
Pointblank Validation Plan (No Interrogation Performed)
[2025-11-20|18:18:16] tibble small_table

This newly created agent has no validation rules yet, but it now holds the small_table dataset, and we can add validation rules to it. We want to validate each variable and determine whether there are any missing or outlying data points.
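When a table is piped into create_agent(), the report header may show only the table type without a name, so it can help to name the table and label the plan explicitly. The sketch below uses the tbl_name and label arguments of create_agent(); the label text and the object name agent_labeled are made up for illustration.

```r
library(pointblank)

# Give the agent an explicit table name and a descriptive label
agent_labeled <- create_agent(
  tbl = small_table,
  tbl_name = "small_table",
  label = "Chapter 51 practice validation"
)

agent_labeled
```

The name and label appear in the header of the validation report, which makes saved reports easier to tell apart.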

51.7 Examples of Validation Rules

The {pointblank} package provides a wide range of validation rules that we can use to check our data. Some examples include:

col_vals_not_null(): Check for missing values in a column
col_vals_between(): Check for values within a specified range
col_vals_in_set(): Check for values in a specified set
col_is_numeric(): Check if a column is numeric
col_is_character(): Check if a column is character
col_is_logical(): Check if a column is logical
col_vals_gte(): Check for values greater than or equal to a specified value
col_vals_lte(): Check for values less than or equal to a specified value
rows_distinct(): Check for duplicate rows
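Each of these validation functions also has a test_*() variant that can be called directly on a table and returns a single TRUE or FALSE, which is handy for trying out a rule before adding it to an agent. The following is a small sketch on small_table; the result names are made up for illustration.

```r
library(pointblank)

# Each call returns TRUE if no test unit fails, otherwise FALSE
a_complete <- test_col_vals_not_null(small_table, columns = vars(a))
c_complete <- test_col_vals_not_null(small_table, columns = vars(c))
a_in_range <- test_col_vals_between(small_table, columns = vars(a), left = 1, right = 8)
e_logical  <- test_col_is_logical(small_table, columns = vars(e))
no_dups    <- test_rows_distinct(small_table)

a_complete   # TRUE:  column a has no missing values
c_complete   # FALSE: column c has 2 missing values
a_in_range   # TRUE:  a runs from 1 to 8
e_logical    # TRUE
no_dups      # FALSE: rows 9 and 10 are identical
```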

51.8 Adding Validation Rules

We can add validation rules to the agent using the validation functions provided by {pointblank}. Each function adds one step to the agent’s validation plan, telling the agent which kind of data problem to look for. For example, we can use col_vals_not_null() to check for missing values in a column, or col_vals_between() to check for out-of-range values.

agent <- agent %>%
  col_vals_not_null(vars(a)) %>%
  col_vals_between(vars(b), left = 1, right = 10) %>%
  col_vals_in_set(vars(c), set = c("A", "B", "C"))

agent

		STEP	COLUMNS	VALUES
Pointblank Validation Plan (No Interrogation Performed)
[2025-11-20|18:18:16] tibble small_table
	1	`col_vals_not_null()`	`▮a`	—
	2	`col_vals_between()`	`▮b`	`[1, 10]`
	3	`col_vals_in_set()`	`▮c`	`A, B, C`

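We can see that we have added three validation rules to the agent. The first rule checks that column a has no missing values, the second rule checks that column b has values between 1 and 10, and the third rule checks that column c has values in the set {“A”, “B”, “C”}. Keep in mind that b is a character column and c is a numeric column, so rules 2 and 3 do not fit the data well; this mismatch will show up as failures when we interrogate the data.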
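Column b holds text IDs such as 1-bcd-345 and column c holds numbers, so a pattern check and a numeric range are a more natural fit for them. The sketch below is one possible alternative plan, not the chapter's own; the regular expression and the 2 to 9 range are assumptions read off the printed data, and interrogate() (introduced in the next section) runs the plan.

```r
library(pointblank)

agent_typed <- create_agent(tbl = small_table) %>%
  col_vals_not_null(vars(a)) %>%
  # b is text, so check its pattern rather than a numeric range
  col_vals_regex(vars(b), regex = "^[0-9]-[a-z]{3}-[0-9]{3}$") %>%
  # c is numeric with some NAs; na_pass = TRUE lets missing values pass this rule
  col_vals_between(vars(c), left = 2, right = 9, na_pass = TRUE) %>%
  interrogate()

# TRUE when every step passed with no failing test units
all_passed(agent_typed)
```

Setting na_pass = TRUE keeps the range check separate from the missing-value check, so that each problem is reported by its own rule.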

51.9 Interrogating the Data with Your Agent

Once we have added validation rules to the agent, the agent can interrogate the data to see whether it passes each rule. We use the interrogate() function to do this; it returns an updated agent that also holds the results.

agent <- agent %>%
  interrogate()
agent

		STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	S	N	EXT
Pointblank Validation
[2025-11-20|18:18:16] tibble small_table
	1	`col_vals_not_null()`	`▮a`	—	✓	`13`	`13` `1.00`	`0` `0.00`	—	—	—	—
	2	`col_vals_between()`	`▮b`	`[1, 10]`	✓	`13`	`2` `0.15`	`11` `0.85`	—	—	—
	3	`col_vals_in_set()`	`▮c`	`A, B, C`	✓	`13`	`0` `0.00`	`13` `1.00`	—	—	—
2025-11-20 18:18:16 EST < 1 s 2025-11-20 18:18:16 EST

We can now see the result of the interrogation. The updated agent shows us how many rows passed and failed each validation rule. The dark green sidebars show the number of rows that passed, and the red sidebars show the number of rows that failed. No rows failed the first rule, 11 of the 13 rows failed the second rule, and all 13 rows failed the third rule. We can dig deeper to see which rows failed. To do this, we can use the get_data_extracts() function.

failed_rows <- get_data_extracts(agent)
failed_rows

## $`2`
## # A tibble: 11 × 8
##    date_time           date           a b             c      d e     f    
##    <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
##  1 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
##  2 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3  2343. TRUE  high 
##  3 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
##  4 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low  
##  5 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid  
##  6 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2  1036. FALSE low  
##  7 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
##  8 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
##  9 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low  
## 10 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low  
## 11 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high 
## 
## $`3`
## # A tibble: 13 × 8
##    date_time           date           a b             c      d e     f    
##    <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
##  1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high 
##  2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
##  3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3  2343. TRUE  high 
##  4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
##  5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low  
##  6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid  
##  7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3   843. TRUE  high 
##  8 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2  1036. FALSE low  
##  9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
## 10 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
## 11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low  
## 12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low  
## 13 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high

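This function returns a named list of data frames, one for each validation step that had failing rows; the names are the step numbers. Step 1 had no failures, so it has no entry. The extract for step 2 contains the 11 rows that failed the range check on column b (every row except the two whose b value begins with “1-”), and the extract for step 3 contains all 13 rows, because none of the numeric values in column c is in the set {“A”, “B”, “C”}.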
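To pull out the failing rows for a single step, get_data_extracts() accepts a step number through its i argument, and get_sundered_data() splits the whole table into rows that pass every row-wise step and rows that fail at least one. The sketch below uses a separate one-rule agent on column c so that the counts are easy to verify; the object names are made up for illustration.

```r
library(pointblank)

agent_c <- create_agent(tbl = small_table) %>%
  col_vals_not_null(vars(c)) %>%
  interrogate()

# Failing rows for step 1 only, returned as a single tibble
missing_c <- get_data_extracts(agent_c, i = 1)
nrow(missing_c)                        # 2

# Split the table into passing and failing rows
rows_pass <- get_sundered_data(agent_c, type = "pass")
rows_fail <- get_sundered_data(agent_c, type = "fail")
c(nrow(rows_pass), nrow(rows_fail))    # 11 2
```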

51.10 Documenting and Fixing Data Issues

Once we have identified the data issues, we can document them using the yaml_write() function:

# yaml_write(agent, filename = "data_issues.yaml")
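A YAML file is most useful when the plan can be read back and re-run later, which works best if the agent refers to its table through a formula rather than holding the data directly. The following is a sketch under that assumption: the file name small_table_plan.yml, the label, and the use of tempdir() are illustrative choices, and the two rules are examples only.

```r
library(pointblank)

# Refer to the table with a formula so that the YAML file records how to fetch it
agent_plan <- create_agent(
  tbl = ~ pointblank::small_table,
  tbl_name = "small_table",
  label = "Validation plan for small_table"
) %>%
  col_vals_not_null(vars(a)) %>%
  col_vals_between(vars(d), left = 100, right = 3000)

# Write the plan to a YAML file in a temporary folder
yaml_write(agent_plan, filename = "small_table_plan.yml", path = tempdir())

# Later: rebuild the agent from the file and interrogate in one step
agent_from_yaml <- yaml_agent_interrogate(
  filename = "small_table_plan.yml",
  path = tempdir()
)
```

Because the file is plain text, it can be kept under version control alongside the analysis code as a record of the plan written by yaml_write().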

The yaml_write() function writes the agent’s validation plan (the table reference and the list of validation steps) to a YAML file, which can be kept as documentation and re-used later. The {pointblank} package reports problems but does not change the data itself, so we fix the issues with ordinary {dplyr} code. For example, we can use mutate() with ifelse() to replace missing values with a specified value, or to set out-of-range and invalid values to NA.

small_table_fixed <- small_table %>%
  mutate(a = ifelse(is.na(a), 0, a),                    # missing a becomes 0
         b = ifelse(b < 1 | b > 10, NA, b),             # out-of-range b becomes NA
         c = ifelse(!c %in% c("A", "B", "C"), NA, c))   # c not in the set becomes NA
small_table_fixed

## # A tibble: 13 × 8
##    date_time           date           a b         c          d e     f    
##    <dttm>              <date>     <int> <chr>     <lgl>  <dbl> <lgl> <chr>
##  1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345 NA     3423. TRUE  high 
##  2 2016-01-04 00:32:00 2016-01-04     3 <NA>      NA    10000. TRUE  low  
##  3 2016-01-05 13:32:00 2016-01-05     6 <NA>      NA     2343. TRUE  high 
##  4 2016-01-06 17:23:00 2016-01-06     2 <NA>      NA     3892. FALSE mid  
##  5 2016-01-09 12:36:00 2016-01-09     8 <NA>      NA      284. TRUE  low  
##  6 2016-01-11 06:15:00 2016-01-11     4 <NA>      NA     3291. TRUE  mid  
##  7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093 NA      843. TRUE  high 
##  8 2016-01-17 11:27:00 2016-01-17     4 <NA>      NA     1036. FALSE low  
##  9 2016-01-20 04:30:00 2016-01-20     3 <NA>      NA      838. FALSE high 
## 10 2016-01-20 04:30:00 2016-01-20     3 <NA>      NA      838. FALSE high 
## 11 2016-01-26 20:07:00 2016-01-26     4 <NA>      NA      834. TRUE  low  
## 12 2016-01-28 02:51:00 2016-01-28     2 <NA>      NA      108. FALSE low  
## 13 2016-01-30 11:23:00 2016-01-30     1 <NA>      NA     2230. TRUE  high

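This code replaces missing values in column a with 0 (there are none, so a is unchanged), replaces out-of-range values in column b with NA, and replaces invalid values in column c with NA. Because the rules for b and c did not match the types of those columns, 11 of the 13 values in b and every value in c are lost, which shows why validation rules need to fit the data before they are used to drive fixes.

The {pointblank} package provides a powerful set of tools for data validation and exploration in R. By creating an agent, adding validation rules, interrogating the data, and documenting and fixing data issues, we can ensure that our data is clean and ready for analysis.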

51.11 Setting Action Levels

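By default, an agent has no action levels set, which is why the W, S, and N columns in the reports above show only dashes. Action levels are failure thresholds: when the number of failing test units in a step reaches a threshold, that step is flagged with the corresponding state. We create them with the action_levels() function and pass the result to the actions argument of create_agent(), which applies them to every step, or to the actions argument of an individual validation function. The three available levels are “warn”, “stop”, and “notify”, set with warn_at, stop_at, and notify_at. A whole number is treated as a count of failing test units, and a value between 0 and 1 is treated as a proportion.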

al <- action_levels(warn_at = 2, stop_at = 4)

small_table %>%
  create_agent(actions = al) %>% 
  col_vals_lt(a, value = 7) %>%
  interrogate()

		STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	S	N
Pointblank Validation
[2025-11-20|18:18:16] tibble WARN 2 STOP 4 NOTIFY —
	1	`col_vals_lt()`	`▮a`	`7`	✓	`13`	`11` `0.85`	`2` `0.15`	●	○	—
2025-11-20 18:18:16 EST < 1 s 2025-11-20 18:18:16 EST

In the validation report table, we can see that:

The FAIL column shows that 2 test units have failed.
The W column (short for ‘warning’) shows a filled yellow circle, indicating that the failing test units reached that threshold value.
The S column (short for ‘stop’) shows an open red circle, indicating that the number of failing test units is below that threshold.
The final action level, N (for ‘notify’), wasn’t set, so it appears in the validation table as a long dash.
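The same information can be read programmatically rather than from the report table, for example to decide in a script whether to continue. This sketch uses get_agent_x_list(), which returns the interrogation results as a list; the object names agent_al and x are made up for illustration.

```r
library(pointblank)

al <- action_levels(warn_at = 2, stop_at = 4)

agent_al <- create_agent(tbl = small_table, actions = al) %>%
  col_vals_lt(vars(a), value = 7) %>%
  interrogate()

# Extract per-step results as plain vectors
x <- get_agent_x_list(agent_al)

x$n_failed   # 2     (the values 7 and 8 are not less than 7)
x$warn       # TRUE  (2 failures reach warn_at = 2)
x$stop       # FALSE (2 failures are below stop_at = 4)
```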

51.12 Try it Yourself

Now it’s your turn to try using the {pointblank} package. Use the code below to create your own agent, add validation rules, interrogate the agent, and document and fix any data issues you find.

Start with small_table. Edit the agent, and devise interrogation rules to make sure that

all values for a are between 1 and 7
all values for d are between 100 and 3000
variable e is always a logical
variable f is always low, mid, or high, and is turned into a factor variable (mutate)

Create the agent and run the interrogation, then display the updated agent.

agent <- create_agent(tbl = small_table) %>%
  col_vals_between(vars(a), left = 1, right = 7) %>%
  col_vals_between(vars(d), left = 100, right = 3000) %>%
  col_is_logical(vars(e)) %>%
  col_vals_in_set(vars(f), set = c("low", "mid", "high")) %>%
  interrogate()

agent

		STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	S	N	EXT
Pointblank Validation
[2025-11-20|18:18:17] tibble small_table
	1	`col_vals_between()`	`▮a`	`[1, 7]`	✓	`13`	`12` `0.92`	`1` `0.08`	—	—	—
	2	`col_vals_between()`	`▮d`	`[100, 3,000]`	✓	`13`	`9` `0.69`	`4` `0.31`	—	—	—
	3	`col_is_logical()`	`▮e`	—	✓	`1`	`1` `1.00`	`0` `0.00`	—	—	—	—
	4	`col_vals_in_set()`	`▮f`	`low, mid, high`	✓	`13`	`13` `1.00`	`0` `0.00`	—	—	—	—
2025-11-20 18:18:17 EST < 1 s 2025-11-20 18:18:17 EST

Now identify the failures and fix them in small_table, creating a new data frame called small_table_fixed.

failed_rows <- get_data_extracts(agent)
failed_rows

## $`1`
## # A tibble: 1 × 8
##   date_time           date           a b             c     d e     f    
##   <dttm>              <date>     <int> <chr>     <dbl> <dbl> <lgl> <chr>
## 1 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7  284. TRUE  low  
## 
## $`2`
## # A tibble: 4 × 8
##   date_time           date           a b             c      d e     f    
##   <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high 
## 2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
## 3 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
## 4 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid

small_table_fixed <- small_table %>%
  mutate(a = ifelse(a < 1 | a > 7, NA, a),
         d = ifelse(d < 100 | d > 3000, NA, d),
         e = as.logical(e),
         f = factor(ifelse(!f %in% c("low", "mid", "high"), NA, f), levels = c("low", "mid", "high")))
small_table_fixed

## # A tibble: 13 × 8
##    date_time           date           a b             c     d e     f    
##    <dttm>              <date>     <int> <chr>     <dbl> <dbl> <lgl> <fct>
##  1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3   NA  TRUE  high 
##  2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8   NA  TRUE  low  
##  3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3 2343. TRUE  high 
##  4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA   NA  FALSE mid  
##  5 2016-01-09 12:36:00 2016-01-09    NA 3-ldm-038     7  284. TRUE  low  
##  6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4   NA  TRUE  mid  
##  7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3  843. TRUE  high 
##  8 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2 1036. FALSE low  
##  9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high 
## 10 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high 
## 11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7  834. TRUE  low  
## 12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8  108. FALSE low  
## 13 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA 2230. TRUE  high
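After fixing, it is worth re-running the rules on the new table to confirm that nothing out of range remains. The sketch below repeats the fixes for a and d and re-checks them; since the fix turns bad values into NA, the re-check uses na_pass = TRUE so that those NAs are not counted as range failures. The name agent_recheck is made up for illustration.

```r
library(pointblank)
library(dplyr)

# Repeat the range fixes for a and d from above
small_table_fixed <- small_table %>%
  mutate(
    a = ifelse(a < 1 | a > 7, NA, a),
    d = ifelse(d < 100 | d > 3000, NA, d)
  )

# Re-run the same range rules, letting the new NAs pass
agent_recheck <- create_agent(tbl = small_table_fixed) %>%
  col_vals_between(vars(a), left = 1, right = 7, na_pass = TRUE) %>%
  col_vals_between(vars(d), left = 100, right = 3000, na_pass = TRUE) %>%
  interrogate()

all_passed(agent_recheck)          # TRUE
sum(is.na(small_table_fixed$a))    # 1 value set to NA
sum(is.na(small_table_fixed$d))    # 4 values set to NA
```

The NA counts match the 1 and 4 failures reported for steps 1 and 2 in the interrogation above.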

51.13 Further Challenges

Try using the {pointblank} package on your own datasets. Create an agent, add validation rules, interrogate the agent, and document and fix any data issues you find. An example dataset you can use is the psych dataset from the {medicaldata} package. Load this dataset and explore it using the {pointblank} package.

library(here)  # provides here() for building project-relative file paths
psych <- readRDS(here('data/psych.Rd')) |>
  select(-form_status_complete)

View the psych dataset. Think about the variable type, allowable values, and ranges for each variable. Create an agent and add at least 8 validation rules to check for variable data type, missing values, out of range values, and invalid values. Make sure that there are no duplicate rows. Interrogate the agent and document any data issues you find. Finally, fix the data issues and create a new data frame called psych_fixed.

Use the documentation from the {pointblank} package at https://rstudio.github.io/pointblank/ to help you create the validation rules.

agent <- create_agent(tbl = psych) %>%
  col_vals_not_null(vars(study_id)) %>%
  col_vals_in_set(vars(redcap_data_access_group), set = c("University of Michigan", "Mayo Clinic", "Oregon Health and Science University")) %>%
  col_vals_in_set(vars(psych_hx), set = c("Yes", "No")) %>%
  col_vals_in_set(vars(psych_dep), set = c("Yes", "No")) %>%
  col_vals_in_set(vars(psych_bipo), set = c("Yes", "No")) %>%
  col_vals_in_set(vars(psych_anx), set = c("Yes", "No")) %>%
  col_vals_in_set(vars(psych_ptsd), set = c("Yes", "No")) %>%
  col_vals_in_set(vars(psych_meds), set = c("Yes", "No")) %>%
  col_vals_in_set(vars(psych_visit), set = c("Yes", "No", "Not available at study site", "Unknown/not reported"))

interrogate(agent)

		STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	S	N	EXT
Pointblank Validation
[2025-11-20|18:18:17] tibble psych
	1	`col_vals_not_null()`	`▮study_id`	—	✓	`1K`	`1K` `1.00`	`0` `0.00`	—	—	—	—
	2	`col_vals_in_set()`	`▮redcap_data_access_group`	`University of Michigan, Mayo Clinic, Oregon Health and Science University`	✓	`1K`	`1K` `0.94`	`64` `0.06`	—	—	—
	3	`col_vals_in_set()`	`▮psych_hx`	`Yes, No`	✓	`1K`	`1K` `0.99`	`12` `0.01`	—	—	—
	4	`col_vals_in_set()`	`▮psych_dep`	`Yes, No`	✓	`1K`	`771` `0.68`	`369` `0.32`	—	—	—
	5	`col_vals_in_set()`	`▮psych_bipo`	`Yes, No`	✓	`1K`	`771` `0.68`	`369` `0.32`	—	—	—
	6	`col_vals_in_set()`	`▮psych_anx`	`Yes, No`	✓	`1K`	`771` `0.68`	`369` `0.32`	—	—	—
	7	`col_vals_in_set()`	`▮psych_ptsd`	`Yes, No`	✓	`1K`	`771` `0.68`	`369` `0.32`	—	—	—
	8	`col_vals_in_set()`	`▮psych_meds`	`Yes, No`	✓	`1K`	`771` `0.68`	`369` `0.32`	—	—	—
	9	`col_vals_in_set()`	`▮psych_visit`	`Yes, No, Not available at study site, Unknown/not reported`	✓	`1K`	`772` `0.68`	`368` `0.32`	—	—	—
2025-11-20 18:18:17 EST < 1 s 2025-11-20 18:18:17 EST

Note that interrogate() returns an interrogated copy of the agent rather than changing the agent in place. Here the result was printed but not assigned back to agent, so printing agent afterwards shows only the validation plan, with no interrogation results.

agent

		STEP	COLUMNS	VALUES
Pointblank Validation Plan (No Interrogation Performed)
[2025-11-20|18:18:17] tibble psych
	1	`col_vals_not_null()`	`▮study_id`	—
	2	`col_vals_in_set()`	`▮redcap_data_access_group`	`University of Michigan, Mayo Clinic, Oregon Health and Science University`
	3	`col_vals_in_set()`	`▮psych_hx`	`Yes, No`
	4	`col_vals_in_set()`	`▮psych_dep`	`Yes, No`
	5	`col_vals_in_set()`	`▮psych_bipo`	`Yes, No`
	6	`col_vals_in_set()`	`▮psych_anx`	`Yes, No`
	7	`col_vals_in_set()`	`▮psych_ptsd`	`Yes, No`
	8	`col_vals_in_set()`	`▮psych_meds`	`Yes, No`
	9	`col_vals_in_set()`	`▮psych_visit`	`Yes, No, Not available at study site, Unknown/not reported`
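The challenge also asks for data type checks and a check for duplicate rows, which the plan above does not include. Since the psych file may not be available, the sketch below shows the pattern on small_table instead; the same functions could be applied to psych with its own column names. The object names are made up for illustration.

```r
library(pointblank)
library(dplyr)

# Check two column types and look for duplicate rows
agent_dups <- create_agent(tbl = small_table) %>%
  col_is_character(vars(b)) %>%
  col_is_numeric(vars(d)) %>%
  rows_distinct() %>%
  interrogate()

all_passed(agent_dups)                   # FALSE: the table has a duplicated row

# One way to fix duplicates: keep only the distinct rows, then re-test
small_table_dedup <- distinct(small_table)
nrow(small_table_dedup)                  # 12
test_rows_distinct(small_table_dedup)    # TRUE
```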

51.14 Additional Resources

There is a lot more that the pointblank package can do, including automated reporting each time new data come in, custom validation rules, emailing automated reports daily or weekly, and integration with other packages. There are several great articles and tutorials available online to help you learn more about the {pointblank} package and its capabilities.

For more information on the {pointblank} package, check out the following resources:

{pointblank} documentation: https://rich-iannone.github.io/pointblank/
{pointblank} GitHub repository: https://github.com/rstudio/pointblank

Reproducible Medical Research with R
