Transformers Architecture: One-Stop Detailed Guide: Part 2 — Self Attention
Self-attention is a foundational concept in modern Natural Language Processing (NLP), enabling models like Transformers to understand the context and relationships between words in a sequence. This blog, the second in our series on Transformers, will dive deep into self-attention, exploring its origins, mechanisms, and importance in overcoming the limitations of earlier word representation techniques.
Introduction to Vectorizing Words
At the core of any NLP application lies the fundamental need to convert words into numbers. Computers excel at processing numerical data but cannot directly understand words, necessitating a process called vectorization — the conversion of textual data into numerical representations.
Early Techniques for Converting Words into Numbers
Before advanced techniques like self-attention emerged, NLP relied on simpler methods to represent words numerically. These approaches laid the groundwork for subsequent innovations but had their own set of limitations.
1. One-Hot Encoding
In this technique, each unique word in a sentence is represented by a vector containing a single ‘1’ and the rest ‘0’s.
Example: For the sentence "mat cat mat cat rat rat", the unique words are "mat", "cat", and "rat". Their one-hot encoded vectors would be:
- "mat" → [1, 0, 0]
- "cat" → [0, 1, 0]
- "rat" → [0, 0, 1]
Limitations:
- High Dimensionality: The vector size grows with the vocabulary size, making it computationally expensive for large datasets.
- Lack of Semantic Understanding: Words that are semantically related, such as "king" and "queen", are represented as entirely different vectors with no relationship.
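The mapping above can be sketched in a few lines of Python (a toy example; the vocabulary order follows first appearance in the sentence):

```python
# Build one-hot vectors for the unique words of a toy sentence.
sentence = "mat cat mat cat rat rat"
vocab = list(dict.fromkeys(sentence.split()))  # ['mat', 'cat', 'rat']

one_hot = {
    word: [1 if i == idx else 0 for i in range(len(vocab))]
    for idx, word in enumerate(vocab)
}

print(one_hot["mat"])  # a single 1 at the word's index, 0 everywhere else
```

Note how the vector length equals the vocabulary size, which is exactly the high-dimensionality problem described above.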
2. Bag of Words (BoW)
This technique improves upon one-hot encoding by considering the frequency of words in a sentence.
Example: For the sentence "mat cat mat cat rat rat", the representation would be:
- "mat" → 2 (appears twice)
- "cat" → 2 (appears twice)
- "rat" → 2 (appears twice)
Limitations:
- Loss of Context: BoW does not capture the order of words in a sentence. For instance, "dog bites man" and "man bites dog" would have the same representation despite their different meanings.
- Sparse Representation: Like one-hot encoding, the resulting vectors are sparse, containing many zero values.
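A minimal BoW sketch using Python's standard library, including a check of the word-order limitation just described:

```python
from collections import Counter

sentence = "mat cat mat cat rat rat"
bow = Counter(sentence.split())  # word -> frequency

print(bow["mat"], bow["cat"], bow["rat"])  # 2 2 2

# BoW discards word order: these two very different sentences
# collapse to the same representation.
assert Counter("dog bites man".split()) == Counter("man bites dog".split())
```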
3. TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF further refines word representation by accounting for a word’s importance in the context of an entire document collection.
How it Works:
- Term Frequency (TF): Measures how frequently a word appears in a document.
- Inverse Document Frequency (IDF): Measures how unique a word is across all documents.
By combining these measures, TF-IDF assigns higher weights to words that are frequent in a document but rare across the corpus, making it more context-aware.
Limitations:
- Context Insensitivity: Like BoW, TF-IDF does not capture the order or relationships between words.
- Static Representation: Words are represented the same way, regardless of their context in a sentence.
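The TF and IDF definitions above can be hand-rolled on a tiny made-up corpus. This sketch uses the plain `log(N / df)` form of IDF; real libraries such as scikit-learn apply smoothing and normalization, so their exact numbers differ:

```python
import math

# A tiny illustrative corpus (three "documents").
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(word, doc):
    # Term frequency: share of the document's tokens that are `word`.
    return doc.count(word) / len(doc)

def idf(word):
    # Inverse document frequency: words rare across the corpus score higher.
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(n_docs / df)

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)

# "the" appears in 2 of 3 docs (low IDF); "cat" in only 1 (higher IDF),
# so "cat" gets the larger weight in the first document.
print(tf_idf("the", tokenized[0]), tf_idf("cat", tokenized[0]))
```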
The Evolution: Word Embeddings
Word embeddings marked a significant leap forward by capturing the semantic meaning of words. These dense vector representations enable models to understand the relationships between words.
Key Features of Word Embeddings
- Semantic Representation: Word embeddings place words with similar meanings close together in a high-dimensional vector space. Example: "king" and "queen" have vectors that are closer in space than "king" and "cricketer".
- Dimensionality: Word embeddings are represented as vectors with dimensions like 64, 256, or 512, depending on the application. Each dimension captures a specific aspect of the word's meaning, learned from the training data. Example: one dimension might represent "royalty," another "gender," and another "power."
- Training on Large Datasets: Word embeddings are typically trained on extensive datasets, such as all Wikipedia articles, using neural networks that learn the context of each word.
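The "closeness in vector space" idea can be illustrated with cosine similarity on invented 3-dimensional embeddings (the vectors and their dimension labels here are made up purely for illustration; real embeddings have hundreds of learned dimensions):

```python
import math

# Hypothetical 3-d embeddings: dimensions loosely read as
# [royalty, power, sportiness] -- the values are invented for this demo.
emb = {
    "king":      [0.90, 0.80, 0.10],
    "queen":     [0.85, 0.75, 0.15],
    "cricketer": [0.10, 0.20, 0.90],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# "king" sits much closer to "queen" than to "cricketer".
print(cosine(emb["king"], emb["queen"]))
print(cosine(emb["king"], emb["cricketer"]))
```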
The Limitation of Word Embeddings
Despite their power, word embeddings are static — meaning that a word’s vector representation remains the same regardless of its context.
Example: "apple" can mean a fruit or a technology company. A static word embedding trained on a dataset that primarily mentions "apple" as a fruit will emphasize the "taste" aspect while ignoring its "technology" meaning.
This lack of contextual understanding posed a significant challenge, leading to inaccuracies in tasks where context changes the meaning of words.
Self-Attention: The Contextual Solution
Self-attention emerged as a solution to the static nature of word embeddings, enabling the creation of contextual embeddings — dynamic word representations that adapt to the sentence’s context.
How Self-Attention Works
Self-attention begins with static word embeddings as input and generates new, contextual embeddings for each word by analyzing its relationships with other words in the sentence.
- Input: A sentence with static embeddings for each word.
- Output: Contextual embeddings that consider the meaning of each word in relation to the entire sentence.
Key Concepts in Self-Attention
- Query, Key, and Value Vectors: These are mathematical representations used to calculate relationships between words. Each word in the sentence is transformed into these vectors, enabling the model to focus on the most relevant words for generating contextual embeddings.
- Context Awareness: Words that are semantically or syntactically connected have stronger interactions, as reflected in their contextual embeddings.
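The mechanics described above can be sketched numerically: each word's static embedding is projected into query, key, and value vectors, and a softmax over the query–key dot products weights how much each word attends to every other word. This is an illustrative NumPy sketch with random (untrained) projection matrices, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # 4 words, 8-dimensional static embeddings

X = rng.normal(size=(seq_len, d_model))    # static embeddings, one row per word
W_q = rng.normal(size=(d_model, d_model))  # learned in a real model; random here
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # query, key, value vectors

# Scaled dot-product attention scores between every pair of words.
scores = Q @ K.T / np.sqrt(d_model)

# Softmax turns each row of scores into attention weights that sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each contextual embedding is a weighted mix of all the value vectors.
contextual = weights @ V
print(contextual.shape)  # one contextual embedding per word
```

The output keeps the input's shape: one vector per word, but each now encodes information from the whole sentence.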
Benefits of Self-Attention
- Contextual Understanding: Self-attention dynamically adjusts word representations based on context, allowing the model to distinguish between "apple" as a fruit and "Apple" as a company.
- Improved NLP Applications: By generating smarter, more accurate embeddings, self-attention enhances the performance of tasks like:
- Translation: Producing fluent and accurate translations by understanding word dependencies.
- Summarization: Identifying key ideas in lengthy texts.
In Simple Terms
Self-attention is like a word relationship calculator. It examines every word in a sentence, determines its connection to other words, and generates a unique vector for each word based on these relationships.
Conclusion
Self-attention has revolutionized NLP by addressing the limitations of static word embeddings. Through the generation of contextual embeddings, it has paved the way for models to better understand the nuanced meaning of words in different contexts.
As the cornerstone of the Transformer architecture, self-attention represents a crucial advancement, enabling breakthroughs in translation, summarization, and beyond. Stay tuned for the next blog in this series, where we’ll explore the mathematical foundation of self-attention, including the concepts of queries, keys, and values!
This blog is part of a series unraveling the mysteries of Transformers. Follow along to deepen your understanding of how these architectures are shaping the future of AI!