A Novel Way to Visualize Word2Vec Embeddings with Shiny in R

Cyrus Kurd
Oct 30, 2024


Towards Dynamic and Interactive 3D Word Embedding Visualizations

Imagine stepping into a room full of words. The words aren’t just standing around; they’re clustered together, some tightly packed in groups like “coffee”, “latte”, and “espresso”, while others, like “galaxy” and “nebula”, sit farther apart. You’re not just seeing words; you’re seeing relationships: relationships between meanings (‘semantics’) that reveal how words connect, interact, and group together.

That’s the kind of exploration we wanted to create with Word2Vec embeddings. But here’s the reality: most current word visualization tools are stuck in a cluttered, non-interactive 2D world, crowded with overlapping clusters that tell only half of the story (to be fair, the extra dimension offers perhaps a 5–15% boost in explained variance in the average case; see Figure 1). Language, like our thoughts, is complex; reducing it to two dimensions often fails to capture the full picture, and incorporating just one more dimension can have a significant impact.

The extra dimension also makes it easier to rotate, zoom, and focus on clusters that might otherwise stay hidden. We used Plotly for this, since it supports interactive 3D plots: we want to enable true data exploration, not just messier visualization. Using Shiny and Plotly in R, we created an interactive visualization that lets you explore word embeddings as if you were in the room with them, zoom in on clusters, and drill down into word-level details to uncover each word’s closest neighbors.

Figure 1: With PCA, one principal component explains 20% of the variance, two explain 31%, and three explain 41%. The extra dimension can be significant.

Why Do We Need a New Way to See Words?

Let’s face it: traditional visualizations of word embeddings can be very messy. If you’ve ever squinted at a t-SNE plot, trying to decipher which words cluster together, you know exactly what we mean (we’ve provided one in Figure 2 for those fortunate enough to have avoided them this long). Sure, such a plot may be helpful for a quick look, but it doesn’t invite any real exploration.

Figure 2: An antiquated, messy t-SNE plot of Word2Vec embeddings

Towards a Better Solution

We built the visualization in three main steps:

Data Cleaning: converting everything to lowercase and removing punctuation, numbers, and common stopwords. For example:

library(tm)
data$review <- tolower(data$review) # convert text to lowercase
data$review <- removeNumbers(data$review) # remove numbers
data$review <- removePunctuation(data$review) # remove punctuation
data$review <- removeWords(data$review, stopwords("en")) # remove English stopwords
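
The cleaned text then needs to become word vectors, and those vectors need a 3D projection before anything can be plotted. The post doesn’t show this step, so here is a minimal sketch of one way to do it, assuming the word2vec package and a PCA reduction to three components (the package choice and hyperparameters are our assumptions, not necessarily what was used):

# A minimal sketch of the embedding step (not necessarily the original code):
# train Word2Vec on the cleaned reviews, then project to 3D with PCA.
library(word2vec)

set.seed(42)
model <- word2vec(x = data$review, type = "skip-gram", dim = 100, iter = 10)
embeddings <- as.matrix(model) # one row per word, 100 columns

# reduce to three principal components for the 3D plot
pca <- prcomp(embeddings, scale. = TRUE)
coords_3d <- data.frame(pca$x[, 1:3])
colnames(coords_3d) <- c("PC1", "PC2", "PC3")
coords_3d$word <- rownames(embeddings)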

Clustering:

To make sense of the 3D embeddings, we used K-Means clustering. This algorithm groups similar word vectors into clusters, providing a visual way to identify semantic groups. For example, you might see terms like “heart”, “artery”, and “ventricle” forming a medical cluster, while “database”, “server”, and “query” cluster together under tech terms.
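
As a concrete sketch (k = 5 and the coords_3d data frame from the sketch above are illustrative assumptions), the clustering itself is a short call to base R’s kmeans():

# A sketch: k-means on the 3D PCA coordinates from the earlier sketch.
# k = 5 is an illustrative choice; the elbow plot below helps pick it.
set.seed(42) # k-means results depend on random initialization
km <- kmeans(coords_3d[, c("PC1", "PC2", "PC3")], centers = 5, nstart = 25)
coords_3d$cluster <- factor(km$cluster)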

Choosing the right number of clusters can be tricky, so we implemented an interactive Elbow Method plot. This method helps find the optimal number of clusters by plotting the total within-cluster sum of squares (WSS) against the number of clusters. The “elbow” in the plot indicates the point beyond which adding clusters doesn’t significantly improve the model or may lead to overfitting.
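
A minimal way to produce that elbow plot, again assuming the coords_3d coordinates from the sketches above:

# A sketch of the elbow computation: total within-cluster sum of squares
# (WSS) for k = 1..10; look for the bend in the resulting curve.
wss <- sapply(1:10, function(k) {
  kmeans(coords_3d[, c("PC1", "PC2", "PC3")], centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")

The factoextra package’s fviz_nbclust() produces a similar plot in a single call, if you prefer a ggplot2-styled version.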

Making it Interactive:

We used Shiny, an R package for building interactive web apps. The app consists of a simple UI with sliders and dropdowns that let users adjust the number of clusters, highlight specific clusters, and drill down into word-level analysis.

For example, if you select a cluster, the plot dims the other clusters and highlights the one you picked. You can then view the most frequent words in that cluster, along with their closest semantic neighbors.
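
The neighbor lookup itself can come straight from the trained model. A sketch, assuming the word2vec model object from the earlier sketch:

# A sketch: cosine-similarity nearest neighbors for a selected word,
# using the word2vec model trained earlier.
predict(model, newdata = "coffee", type = "nearest", top_n = 5)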

The Shiny app also lets you interactively adjust the number of clusters, which updates the 3D plot in real time. You can zoom in, rotate, and inspect it from any angle.
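
To make the structure concrete, here is a stripped-down sketch of such an app. The layout, input names, and marker settings are our assumptions, it reuses coords_3d from the sketches above, and the real app adds cluster highlighting and the word-level drill-down:

# A minimal sketch of the app’s skeleton, not the full application:
# a slider re-runs k-means and the Plotly 3D scatter updates reactively.
library(shiny)
library(plotly)

ui <- fluidPage(
  titlePanel("3D Word2Vec Embedding Explorer"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("k", "Number of clusters", min = 2, max = 10, value = 5)
    ),
    mainPanel(plotlyOutput("embedding_plot", height = "600px"))
  )
)

server <- function(input, output) {
  clustered <- reactive({
    df <- coords_3d[, c("PC1", "PC2", "PC3", "word")]
    km <- kmeans(df[, c("PC1", "PC2", "PC3")], centers = input$k, nstart = 25)
    df$cluster <- factor(km$cluster)
    df
  })

  output$embedding_plot <- renderPlotly({
    plot_ly(clustered(), x = ~PC1, y = ~PC2, z = ~PC3,
            color = ~cluster, text = ~word,
            type = "scatter3d", mode = "markers",
            marker = list(size = 3))
  })
}

shinyApp(ui, server)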

Takeaway: A New Way to Visualize Word Embeddings & NLP Data

1. Clarity: The 3D plot makes it much easier to see word relationships and clusters. No more overlapping points or crowded visuals.

2. Interactivity: By adjusting clusters and highlighting specific groups, users can focus on the words and relationships that matter most for their data.

3. Applications: This approach isn’t just for visualization; it can enhance document similarity, topic modeling, and even sentiment analysis by making clusters more identifiable.

Why This Matters: Better Tools for NLP

If you’re working in NLP, you know how essential it is to see your data in ways that make sense. Word2Vec is a fantastic tool, but it’s only as useful as the insights we can extract from it. This interactive visualization not only makes those insights clearer but also invites exploration — an essential step in any NLP project.

What’s Next? Try it Yourself!

Want to try building this visualization? We’ve made the code available on GitHub. Whether you’re a data scientist, an NLP enthusiast, or just curious about how words connect, this project is a great way to dive deeper into word embeddings.

Written & Coded By Cyrus Kurd & Rohan Mathur
October 30th, 2024

More Learning Resources:

See how additional principal components affect the variance explained in your own data with this script:

# Test this on your own data!
# load necessary libraries
library(ggplot2)

# 'embeddings' is your word embedding matrix (rows = words, columns = dimensions)
pca_result <- prcomp(embeddings, scale. = TRUE)

# calculate variance explained & cumulative variance
variance_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cumulative_variance <- cumsum(variance_explained)

# create a dataframe for visualization
plot_data <- data.frame(
Dim = 1:length(cumulative_variance),
Variance = variance_explained * 100, # convert to percentage
Cumulative = cumulative_variance * 100 # convert to percentage
)

# limit to the first 15 dimensions
plot_data <- plot_data[1:15, ]

# round values to whole numbers
plot_data$Variance <- round(plot_data$Variance)
plot_data$Cumulative <- round(plot_data$Cumulative)

# create the plot
plot <- ggplot(plot_data, aes(x = factor(Dim))) +
  # bar plot for individual variance explained
  geom_bar(aes(y = Variance), stat = "identity", fill = "steelblue", alpha = 0.8) +
  # line plot for cumulative variance
  geom_line(aes(y = Cumulative, group = 1), color = "darkred", linewidth = 1.2) +
  geom_point(aes(y = Cumulative), color = "darkred", size = 3) +
  # add labels for cumulative variance
  geom_text(aes(y = Cumulative, label = paste0(Cumulative, "%")),
            vjust = -0.5, color = "darkred", size = 3.5, fontface = "bold") +
  # add labels for individual variance
  geom_text(aes(y = Variance, label = paste0(Variance, "%")),
            vjust = -0.5, color = "black", size = 3.5, fontface = "bold") +
  labs(x = "Principal Components", y = "Variance Explained (%)",
       title = "Variance Explained by Principal Components") +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold"),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10))

plot # display the plot

Other Resources:

https://projector.tensorflow.org/

Raunak, V., Kumar, V., Gupta, V., & Metze, F. (2020). On Dimensional Linguistic Properties of the Word Embedding Space. arXiv:1910.02211 [cs.CL]. Available at: https://arxiv.org/abs/1910.02211

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs.CL]. Available at: https://arxiv.org/abs/1301.3781
