This blog is meant to serve as a resource for understanding the foundational technology upon which Dosu is built.

2013 - 2017: Word2Vec and the Birth of Embeddings
Before 2013, computers treated words like entries in a simple dictionary: each word was an isolated symbol. To a computer, "cat" was no more closely related to "kitten" than it was to "lava." The primary problem was that these representations captured no semantic similarity.
The creators of Word2Vec realized an intuitive truth about language that early computers couldn't capture: a word is known by the company it keeps. Words that appear in similar contexts (i.e., surrounded by similar neighboring words) tend to have similar meanings, and Word2Vec uses this regularity to capture the semantic similarity between words, and with it, each word's meaning.
Word2Vec does this by turning every word into a numerical vector called an embedding. These embeddings capture meaning based on usage: they quantify the semantics of the word, so related words (like cat and kitten) end up with vectors that are numerically close to each other in this multidimensional space. Word2Vec learns these vectors with one of two training objectives:
- Skip-gram: Given a single word (e.g., apple), the model tries to predict its surrounding context words (e.g., eats, red, pie).
- Continuous Bag of Words (CBOW): The opposite—given a set of context words (e.g., eats, red, pie), the model tries to predict the target word (apple).
The learning process adjusts the vector for "apple" until it gets its predictions right, encoding its meaning into the vector itself.
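To make this concrete, here's a minimal sketch of skip-gram training using the gensim library. The toy corpus, parameter values, and variable names below are purely illustrative; a real model needs millions of sentences before the similarities mean anything.

```python
# Minimal sketch of skip-gram Word2Vec training with gensim (toy corpus, illustrative only).
from gensim.models import Word2Vec

# Each "sentence" is a pre-tokenized list of words; real training needs a huge corpus.
corpus = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "kitten", "chased", "the", "mouse"],
    ["lava", "flowed", "down", "the", "volcano"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of each word embedding
    window=2,         # how many neighboring words count as "context"
    min_count=1,      # keep every word, even rare ones (only sensible for a toy corpus)
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

vector = model.wv["cat"]                     # the learned embedding for "cat"
print(model.wv.similarity("cat", "kitten"))  # cosine similarity between two embeddings
```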
Word2Vec unlocked the magic of semantic arithmetic. It allowed for simple analogies, such as the famous example: King - Man + Woman ≈ Queen. The vector math worked because the embedded meanings were consistent.
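With any reasonably large set of pretrained vectors, the analogy falls out of simple vector arithmetic. The sketch below uses gensim's downloader and the "glove-wiki-gigaword-50" vector set purely for convenience; Word2Vec vectors trained on a large corpus behave the same way.

```python
# Semantic arithmetic with pretrained vectors (sketch; downloads the vectors on first run).
import gensim.downloader as api

# "glove-wiki-gigaword-50" is one of the pretrained vector sets published via gensim-data.
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically something close to [('queen', 0.85)]
```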
However, Word2Vec had a fundamental limitation: it provided a single vector for each word, regardless of its role in a sentence. It therefore had difficulty understanding context-dependent words correctly.
For example, in Word2Vec, the word "bank" received a single vector regardless of the sentence. In "I went to the bank to deposit money" and "The river bank overflowed," the vector for "bank" is identical. The meaning the model encoded for "bank" was essentially a usage-weighted average of all its senses, rolled into a single representation. Regardless of the surrounding words, Word2Vec would serve up this averaged meaning, which failed to represent any of the nuanced senses of "bank" correctly. It might occasionally land on the right usage, but the meaning would often feel slightly off.
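A static embedding is ultimately just a table lookup, which a few lines make obvious. The toy table below is hypothetical, standing in for a trained Word2Vec vocabulary.

```python
# Static embeddings are a lookup table: the same word always maps to the same vector.
import numpy as np

# Hypothetical embedding table (random vectors standing in for trained ones).
embedding_table = {word: np.random.rand(50) for word in ["bank", "river", "money", "deposit"]}

sentence_1 = ["i", "went", "to", "the", "bank", "to", "deposit", "money"]
sentence_2 = ["the", "river", "bank", "overflowed"]

# The lookup ignores the neighbors entirely, so both sentences see an identical "bank" vector.
assert np.array_equal(embedding_table["bank"], embedding_table["bank"])
```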
This inability of early embeddings to adapt to context created a significant roadblock for sophisticated AI applications: the static embedding problem. AI models were unable to handle words with multiple meanings or nuance.
Key insight: The Need for Context
Real-world tasks require deep, dynamic understanding:
- Machine Translation: To translate "bank" correctly, the model must look at "river" or "money" elsewhere in the sentence.
- Question Answering (Q&A): Understanding a question requires linking the noun in the question to its descriptive elements, which may be many words away in the source text.
- The Challenge: AI models needed to perform sequence modeling—the ability to model long-range dependencies (connections) across an entire sentence or even a paragraph.

2017: Introducing the Transformer Architecture
Early attempts to solve the context problem used Recurrent Neural Networks (RNNs). These models read text sequentially and could carry context forward, but they had critical flaws. An RNN processed a sequence step by step: to process the 50th word, it had to wait for the 49th, which had to wait for the 48th, and so on.
- Speed (The Parallelization Problem): This made them incredibly slow to train, as the entire process couldn't be run in parallel on modern GPU hardware.
- Context Loss (The Vanishing Gradient Problem): Dependencies far back in the sequence (e.g., 100 words ago) often got "forgotten" by the time the model reached the end of the sentence.
The Transformer Solution
The Transformer architecture, introduced in 2017, solved the parallelization problem by abandoning recurrence entirely and instead processing the entire input sequence at once.
Whereas previous AI models, such as LSTMs and GRUs, read a book word-by-word, a Transformer scans the whole chapter instantly, simultaneously noting relationships between every word and every other word.
Transformers have been explained in-depth by many engineering leaders, and we recommend reviewing some of our favorite resources.
The secret ingredient that gives the Transformer its power is the Attention Mechanism.
Attention is a mechanism that allows the model to selectively focus on the most essential parts of the input when processing any given word. Its output is a Contextual Embedding: the vector a Transformer layer produces for a word in this particular sentence. Because each of these vectors is a weighted sum over the vectors of the entire input sequence, the meaning of a word is always dynamic and context-dependent.
Example: How Attention works
- When reading a complex research paper, you don't treat every sentence equally. You highlight key definitions, experimental results, and conclusion sentences, giving them more weight.
- When the Transformer processes the word "it" in "The animal didn't cross the street because it was too tired," Attention automatically and numerically highlights the word "animal" as the most relevant word for understanding "it."

The Q, K, V Analogy
Attention is a calculation performed using three components derived from the input vectors. Think of it like a search engine query:
| Component | Role | Analogy |
|---|---|---|
| Query (Q) | The word being processed. | What I'm looking for (the search term). |
| Key (K) | All other words in the sequence. | What I have (the index tags on all documents). |
| Value (V) | All other words' actual content. | What I get (the content of the documents). |
The Query is compared against all the Keys to calculate an Attention Score. This score is a weight. The higher the score, the more that word's Value (its context) contributes to the final, new, contextual vector for the Query word.
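As a rough sketch (simplified from the actual architecture, with toy shapes and no masking or learned projections), the whole calculation fits in a few lines of numpy:

```python
# Minimal numpy sketch of scaled dot-product attention for a single attention head.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (sequence_length, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # every query scored against every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax turns scores into weights
    return weights @ V                                    # each output is a weighted sum of values

# Toy example: a 4-word sequence with 8-dimensional Q/K/V vectors.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
contextual = scaled_dot_product_attention(Q, K, V)
print(contextual.shape)  # (4, 8): one context-aware vector per word
```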
Multi-Head Attention
Each "head" learns to look for a different kind of relationship. One head might focus on grammatical relationships (e.g., subject-verb agreement), while another focuses on semantic relationships (e.g., synonyms or related concepts).
By combining the outputs of these diverse "experts," the model creates a vibrant, nuanced representation.
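A simplified sketch of the idea, again in numpy: the model dimension is split across heads, each head attends independently, and the outputs are concatenated and projected back together. The weight matrices here are random placeholders for what a real model would learn.

```python
# Sketch of multi-head attention: split the model dimension into heads,
# attend within each head, then concatenate and project the results.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model); W_q/W_k/W_v/W_o: (d_model, d_model) learned projections."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    outputs = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)           # this head's slice of the dimensions
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ V[:, s])          # each head produces its own view
    return np.concatenate(outputs, axis=-1) @ W_o          # combine the heads' outputs

rng = np.random.default_rng(0)
d_model, seq_len, heads = 16, 5, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, heads).shape)  # (5, 16)
```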
Positional Encoding
Since the Transformer processes all words in parallel, it has no inherent sense of word order. It needs to know that "dog bites man" is different from "man bites dog." Positional Encoding is the simple, yet critical, solution: a pre-calculated vector, the same size as the word's embedding, that is added to it and numerically encodes the word's absolute position in the sentence.
To put it simply, after a raw text sequence is input, the Transformer converts each word into an embedding and determines each word's position in the input to understand the sentence's meaning properly. This combination of word embedding and positional encoding creates a vector that carries both the word's meaning and its order within the sentence.
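Below is a sketch of the sinusoidal encoding used in the original Transformer paper; the sequence length and embedding size are arbitrary toy values.

```python
# Sketch of the sinusoidal positional encoding from the original Transformer paper.
import numpy as np

def positional_encoding(seq_len, d_model):
    """Returns an array of shape (seq_len, d_model) that is added to the word embeddings."""
    positions = np.arange(seq_len)[:, None]     # 0, 1, 2, ... one row per word position
    dims = np.arange(d_model)[None, :]
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])       # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])       # odd dimensions use cosine
    return pe

# The input to the first Transformer layer is word embeddings + positional encodings.
word_embeddings = np.random.rand(10, 64)        # 10 words, 64-dimensional embeddings (toy values)
layer_input = word_embeddings + positional_encoding(10, 64)
```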
2022+: The Modern Emergence of LLMs
The entire structure of a layer within a Transformer (comprising Multi-Head Attention, a subsequent feed-forward network, and layer normalization) is now commonly referred to as an Attention Block (or Transformer Block).
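As a rough illustration, here's one such block sketched with PyTorch's built-in modules, following the original paper's post-layer-norm arrangement; the dimensions are the paper's defaults used as placeholders, not the configuration of any particular modern LLM.

```python
# Compact sketch of one Transformer/Attention block using PyTorch building blocks.
# Layer sizes are placeholders; real models stack dozens of these blocks.
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Multi-head self-attention with a residual connection and layer normalization.
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward network, again with residual + normalization.
        return self.norm2(x + self.feed_forward(x))

block = AttentionBlock()
tokens = torch.randn(1, 10, 512)   # batch of 1 sequence, 10 tokens, 512-dim embeddings
print(block(tokens).shape)         # torch.Size([1, 10, 512])
```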
Modern LLMs, like those from OpenAI, Anthropic, and Google, descend from the original Encoder-Decoder Transformer, but are typically composed of a massive stack of these Attention Blocks (usually only the Decoder-style half of the architecture) trained to predict the next word in a sequence.
The computational efficiency of the parallelized Attention Block enabled researchers to scale models to billions and even trillions of parameters. This massive scaling is a precondition for emergent abilities (such as complex reasoning, following multi-step instructions, or zero-shot prompting) that weren't explicitly programmed but emerge only after the model reaches a certain size threshold.
LLMs have gone through major transformations since 2022, and future blogs from Dosu will dive deeper into which technologies we are using.

