3 Basic Data Visualizations
Visualizing data is a crucial step in translating raw information into insights that can be quickly understood and acted upon. Before creating meaningful charts or plots, it is essential to understand the characteristics of the underlying data, including its type, structure, and key attributes. Without this foundational knowledge, visualizations may be misleading or fail to convey the intended message.
This section focuses on Basic Data Visualizations (Figure 3.1), explaining how data can be categorized into numeric (quantitative) and categorical (qualitative) forms, along with subtypes like discrete, continuous, nominal, and ordinal. It also discusses common data sources and the fundamental elements of a dataset, such as variables and observations, which are essential for selecting appropriate visualization methods.
As discussed in the Data Exploration section, understanding data types and structure is essential before creating visualizations. By considering the structure of datasets, including variables, observations, and data sources, readers will be able to select appropriate charts and plots (such as histograms for continuous data, bar charts for categorical data, or scatter plots for examining relationships) and thereby clearly reveal patterns, trends, and actionable insights from the dataset [1].
Following the mindmap, this section explores several fundamental data visualizations, emphasizing their types, purposes, applications, users, and tools. Starting with these essential visualizations is crucial before progressing to more advanced analytical techniques. These visuals not only help us understand distributions, comparisons, and relationships between variables in a simple yet informative way but also provide the foundation for deeper analysis. By mastering these basics, we can communicate insights more effectively, spot hidden patterns, and make data-driven decisions with greater confidence [2], [3], knaflic2015?.
3.1 Dataset
This dataset contains 200 simulated sales transactions from various cities across Indonesia during the year 2024. It is designed to illustrate the different types of data commonly found in business and analytics contexts, including nominal, ordinal, discrete, and continuous variables.
Each row in the dataset corresponds to a single customer transaction, recording essential details such as date, product type, city, customer tier, quantity sold, price, and payment method. The dataset is intentionally structured to be used for teaching and practicing data exploration, visualization, and analysis in tools like R, Python, Excel, or Power BI.
3.1.1 Purpose of the Dataset
The dataset can be used to:
- Demonstrate how to identify and classify different data types (nominal, ordinal, discrete, continuous).
- Practice generating and interpreting common visualizations such as line charts, bar charts, histograms, pie charts, boxplots, and scatter plots.
- Perform exploratory data analysis (EDA) on sales trends, customer segments, and pricing patterns.
- Explore relationships between variables, such as how quantity and price affect total sales or how customer tiers differ across payment methods.
3.1.2 Dataset Overview
Column | Example | Data Type | Description |
---|---|---|---|
TransactionID | T0045 | Nominal | Unique identifier for each transaction |
TransactionDate | 2024-05-14 | Date | Date of transaction |
ProductCategory | Electronics | Nominal | Category of the purchased product |
City | Jakarta | Nominal | City where the transaction occurred |
CustomerTier | Gold | Ordinal | Customer level (Bronze < Silver < Gold < Platinum) |
Quantity | 3 | Discrete | Number of items sold |
UnitPrice | 1,200,000 | Continuous | Price per unit of the product (IDR) |
TotalPrice | 3,600,000 | Continuous | Total transaction value (IDR) |
PaymentMethod | Credit Card | Nominal | Payment method used by the customer |
library(DT)

# Generate Sales Transaction Dataset in R
# ==============================
set.seed(123)  # to make results reproducible

# --- 1. Define base variables ---
TransactionID <- sprintf("T%04d", 1:200)

# random transaction dates across 2024
TransactionDate <- sample(seq(as.Date("2024-01-01"), as.Date("2024-12-31"),
                              by = "day"), 200, replace = TRUE)

# product categories
ProductCategory <- sample(c("Electronics", "Groceries", "Fashion",
                            "Furniture", "Beauty"), 200, replace = TRUE)

# 20 major cities in Indonesia
City <- sample(c(
  "Jakarta", "Surabaya", "Bandung", "Medan", "Semarang", "Palembang",
  "Makassar", "Bekasi", "Tangerang", "Depok", "Batam", "Pekanbaru",
  "Bandar Lampung", "Denpasar", "Padang", "Malang", "Banjarmasin",
  "Pontianak", "Manado", "Balikpapan"
), 200, replace = TRUE)

# customer tier (Ordinal)
CustomerTier <- sample(c("Bronze", "Silver", "Gold", "Platinum"), 200,
                       replace = TRUE, prob = c(0.3, 0.4, 0.2, 0.1))

# quantity of products (Discrete)
Quantity <- sample(1:10, 200, replace = TRUE)

# unit price (Continuous)
UnitPrice <- round(runif(200, 20000, 3000000), 0)

# total price (Continuous)
TotalPrice <- Quantity * UnitPrice

# payment method (Nominal)
PaymentMethod <- sample(c("Cash", "Credit Card", "Debit Card", "E-Wallet"),
                        200, replace = TRUE)

# --- 2. Combine into a data frame ---
sales_data <- data.frame(
  TransactionID,
  TransactionDate,
  ProductCategory,
  City,
  CustomerTier,
  Quantity,
  UnitPrice,
  TotalPrice,
  PaymentMethod
)

# Display the data frame as a neat table
datatable(sales_data,
          caption = "Table of Dataset",
          rownames = FALSE)  # hides the index column
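Before plotting, it can help to check how R actually stores each column, because the storage class does not always match the statistical type listed in the overview table. The following is a minimal sketch on a hypothetical three-row sample (the `toy` data frame and its values are illustrative assumptions, not rows from the generated dataset):

```r
# Hypothetical three-row sample mirroring some columns of the dataset
toy <- data.frame(
  TransactionID   = c("T0001", "T0002", "T0003"),
  TransactionDate = as.Date(c("2024-01-15", "2024-05-14", "2024-11-02")),
  CustomerTier    = c("Gold", "Bronze", "Silver"),
  Quantity        = c(3L, 1L, 7L),
  UnitPrice       = c(1200000, 45000, 310500),
  stringsAsFactors = FALSE
)

# Storage class R assigns to each column
col_classes <- sapply(toy, class)
col_classes

# Discrete check: quantities should be whole numbers
qty_is_whole <- all(toy$Quantity == round(toy$Quantity))

# Nominal identifier check: every TransactionID should be unique
ids_unique <- anyDuplicated(toy$TransactionID) == 0
```

Note that CustomerTier is stored as plain character here, so R knows nothing about the order Bronze < Silver < Gold < Platinum unless the column is converted to an ordered factor.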
3.2 Line Chart
A Line Chart is a data visualization tool that illustrates how values change over a sequence, typically over time. It connects data points with a continuous line, making it ideal for displaying trends and patterns in time-series data Wickham2016?. Line charts are particularly useful for:
- Identifying Seasonal Patterns: Recognizing recurring fluctuations at regular intervals, such as increased sales during holidays [4].
- Detecting Growth or Decline Trends: Observing upward or downward movements in data over time [5].
- Spotting Peaks or Dips: Highlighting significant increases or decreases in activity, such as sales spikes during promotions [6].
In this dataset, we can use a line chart to show how total sales or the number of transactions change across dates during the year 2024.
3.2.1 Basic Line Chart
# Step 1: Make sure TransactionDate is in Date format
sales_data$TransactionDate <- as.Date(sales_data$TransactionDate, format = "%Y-%m-%d")

# Step 2: Compute total sales per month
sales_trend <- aggregate(TotalPrice ~ format(sales_data$TransactionDate, "%Y-%m"),
                         data = sales_data, sum)

# Step 3: Rename the columns for clarity
names(sales_trend) <- c("MonthStr", "TotalSales")

# Step 4: Append "-01" to form a complete date
sales_trend$Month <- as.Date(paste0(sales_trend$MonthStr, "-01"), format = "%Y-%m-%d")

# Step 5: Plot line chart
plot(
  sales_trend$Month,
  sales_trend$TotalSales,
  type = "o",
  col = "steelblue",
  pch = 16,
  lwd = 2,
  main = "Monthly Sales Trend (From sales_data)",
  xlab = "Month",
  ylab = "Total Sales (IDR)"
)
grid(col = "gray80", lty = "dotted")
3.2.2 Line Chart using ggplot2
# Load required packages
library(ggplot2)
library(dplyr)
library(lubridate)

# Summarize total sales by month
sales_trend <- sales_data %>%
  mutate(Month = floor_date(TransactionDate, "month")) %>%
  group_by(Month) %>%
  summarise(TotalSales = sum(TotalPrice))

# Create line chart
ggplot(sales_trend, aes(x = Month, y = TotalSales)) +
  geom_line(color = "steelblue", linewidth = 1.2) +  # updated aesthetic
  geom_point(color = "darkorange", size = 2) +
  labs(
    title = "Monthly Sales Trend in 2024",
    x = "Month",
    y = "Total Sales (IDR)"
  ) +
  theme_minimal()
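The same approach can be applied to the number of transactions rather than total sales. One detail worth watching is that grouping only produces rows for months that actually occur in the data, so a month with no transactions would silently disappear from the line. The sketch below uses base R on hypothetical toy dates (the seed, `toy_dates`, and the sample size of 50 are assumptions for illustration) and fixes the month levels in advance so that empty months are counted as zero:

```r
set.seed(42)  # hypothetical seed, independent of the dataset above

# Toy stand-in for sales_data$TransactionDate: 50 random days in 2024
toy_dates <- sample(seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day"),
                    50, replace = TRUE)

# Fix the levels to all 12 months so months without transactions count as 0
month_levels <- format(seq(as.Date("2024-01-01"), by = "month", length.out = 12),
                       "%Y-%m")
month_key   <- factor(format(toy_dates, "%Y-%m"), levels = month_levels)
n_per_month <- as.integer(table(month_key))

# Plot the counts against the first day of each month
month_start <- as.Date(paste0(month_levels, "-01"))
plot(month_start, n_per_month, type = "o", pch = 16, col = "steelblue",
     main = "Transactions per Month (toy data)",
     xlab = "Month", ylab = "Number of Transactions")
```

With the real data, replacing `toy_dates` with `sales_data$TransactionDate` should give the monthly transaction counts for the simulated year.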
3.3 Bar Chart
A Bar Chart is a type of data visualization used to represent categorical data with rectangular bars. Each bar’s height (or length) corresponds to the value or frequency of a category, making it easy to compare quantities across different groups Yi2023?.
Bar charts are especially suitable for:
- Discrete numeric data – numbers that can only take specific values (e.g., number of items purchased) [7].
- Ordinal categorical data – categories with a natural order (e.g., customer satisfaction levels: Low, Medium, High) [8].
In this dataset, the bar chart shows total sales by city. This allows us to quickly identify which cities contribute the most to total sales performance [9].
Insights:
- Taller bars indicate higher total sales.
- The chart helps compare city-level sales performance visually.
- It is ideal for comparing an aggregated numeric value such as TotalPrice across a categorical variable such as City.
- For ordinal data, bar charts make it easy to observe trends or patterns across ordered categories.
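For ordinal variables, the bars are only easy to read if they appear in their natural order. By default R sorts character values alphabetically, which would place Gold before Silver. A minimal sketch, assuming hypothetical tier labels in place of the CustomerTier column, encodes the order as an ordered factor before counting:

```r
# Hypothetical tier labels standing in for sales_data$CustomerTier
tier_chr <- c("Gold", "Bronze", "Silver", "Platinum", "Silver",
              "Bronze", "Silver", "Gold", "Bronze", "Silver")

# Encode the order from the dataset overview: Bronze < Silver < Gold < Platinum
tier_ord <- factor(tier_chr,
                   levels = c("Bronze", "Silver", "Gold", "Platinum"),
                   ordered = TRUE)

# table() reports counts in level order, not alphabetical order
tier_counts <- table(tier_ord)

barplot(tier_counts, col = "steelblue",
        main = "Transactions by Customer Tier (toy data)",
        xlab = "Customer Tier", ylab = "Number of Transactions")
```

The same conversion applied to the real column would make any later bar chart of customer tiers follow the business order automatically.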
3.3.1 Basic Bar Chart
# Step 1: Aggregate total sales per city
sales_city <- aggregate(TotalPrice ~ City, data = sales_data, sum)

# Step 2: Sort data by total sales (descending)
sales_city <- sales_city[order(sales_city$TotalPrice, decreasing = TRUE), ]

# Step 3: Set margins
par(mar = c(8, 5, 4, 2))  # c(bottom, left, top, right)

# Step 4: Create bar chart
barplot(
  height = sales_city$TotalPrice,
  names.arg = sales_city$City,
  col = "steelblue",
  las = 2,          # rotate city labels vertically
  cex.names = 0.8,  # reduce font size of city names
  main = "Total Sales by City",
  xlab = "",
  ylab = ""
)

# Optional: Add grid lines
grid(nx = NA, ny = NULL, col = "gray80", lty = "dotted")
3.3.2 Bar Chart using ggplot2
# Load ggplot2
library(ggplot2)

# Summarize total sales per city
sales_city <- aggregate(TotalPrice ~ City, data = sales_data, sum)

# Sort city by total sales (descending)
sales_city <- sales_city[order(sales_city$TotalPrice, decreasing = TRUE), ]

# Create bar chart
ggplot(sales_city, aes(x = reorder(City, -TotalPrice), y = TotalPrice)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = round(TotalPrice / 1e6, 1)),
            vjust = -0.5, size = 3, color = "black") +
  labs(
    title = "Total Sales by City",
    x = "City",
    y = "Total Sales (in Millions IDR)"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
    plot.title = element_text(size = 14, face = "bold")
  )
3.4 Histogram
A Histogram is a graphical representation of the distribution of numerical data. It divides the data into intervals, known as bins, and displays the frequency of data points within each bin. This visualization helps identify patterns such as the central tendency, spread, skewness, and the presence of multiple modes in the data Wickham2016?.
Histograms are particularly effective for:
- Visualizing the Distribution: They provide a clear picture of how data is distributed across different ranges, helping to identify the shape of the distribution (e.g., normal, skewed, bimodal) [9].
- Identifying Central Tendency and Spread: By observing the peak of the histogram, one can infer the central value of the data. The width of the histogram indicates the variability or spread of the data [10].
- Detecting Skewness: The asymmetry of the histogram can reveal whether the data is skewed to the left or right, which may reflect the nature of the variable itself or, in some cases, potential biases in the data collection process [11].
- Recognizing Multiple Modes: A histogram can show if the data has multiple peaks (modes), suggesting the presence of different subgroups within the dataset [12].
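The properties listed above can also be checked numerically, since a histogram is just a table of counts per bin. The sketch below uses deterministic, right-skewed toy values (exponential quantiles scaled by a hypothetical 500,000, chosen only for illustration) and asks `hist()` for the bins without drawing them:

```r
# Hypothetical right-skewed values (deterministic exponential quantiles),
# standing in for a numeric column such as TotalPrice
x <- qexp(ppoints(200), rate = 1 / 500000)

# Compute the bins without drawing: breaks are bin edges, counts are bar heights
h <- hist(x, breaks = 10, plot = FALSE)
h$breaks
h$counts

# Every observation falls into exactly one bin
all_binned <- sum(h$counts) == length(x)

# Right skew: the long right tail pulls the mean above the median
right_skewed <- mean(x) > median(x)

# Modal bin: the interval with the highest count
modal_bin <- which.max(h$counts)
c(h$breaks[modal_bin], h$breaks[modal_bin + 1])
```

Comparing the mean with the median is only a rough indicator of skewness, but it pairs well with the visual impression from the histogram.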
In this dataset, we can use histograms to explore the distribution of variables such as:
- Quantity (number of items purchased)
- UnitPrice (price per item)
- TotalPrice (total transaction value)
3.4.1 Basic Histogram
Suppose we want to show how many transactions occurred for each number of items purchased. Peaks (tall bars) indicate the most common purchase quantities.
hist(sales_data$Quantity,
main = "Histogram of Quantity",
xlab = "Number of Items Purchased",
ylab = "Frequency",
col = "skyblue",
border = "white",
breaks = 5)
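In base R, a single number passed to `breaks` is treated as a suggestion, and the actual bin edges are set to "pretty" values. For quantities from 1 to 10, `breaks = 5` is therefore likely to group two adjacent quantities into each bar. If one bar per quantity is wanted, the bin edges can be given explicitly. A minimal sketch on hypothetical toy quantities (the seed and `qty` vector are assumptions, not the dataset's values):

```r
set.seed(1)  # hypothetical seed for the toy data
qty <- sample(1:10, 200, replace = TRUE)  # stand-in for sales_data$Quantity

# Bin edges at 0.5, 1.5, ..., 10.5 so each bin contains exactly one integer
h <- hist(qty,
          breaks = seq(0.5, 10.5, by = 1),
          main = "Histogram of Quantity (one bin per value)",
          xlab = "Number of Items Purchased",
          ylab = "Frequency",
          col = "skyblue",
          border = "white")

# The bar heights now match a plain frequency table of the quantities
freq_table   <- as.integer(table(factor(qty, levels = 1:10)))
bars_match   <- all(h$counts == freq_table)
```

Placing the edges at half-integers avoids any ambiguity about which bin a whole-number value belongs to.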
3.4.2 Histogram using ggplot2
library(ggplot2)
ggplot(sales_data, aes(x = Quantity)) +
  geom_histogram(
    bins = 5,          # number of bins (adjust as needed)
    fill = "skyblue",  # fill color for the bars
    color = "white",   # border color for the bars
    alpha = 0.8        # transparency level
  ) +
  labs(
    title = "Histogram of Quantity",
    x = "Number of Items Purchased",
    y = "Frequency"
  ) +
  theme_minimal()