Word Embeddings as inputs for LLMs — Second Part

Vitomir Jovanović
7 min read · Jan 22, 2024

After recapping the basic concepts of vectors and basic vector similarity measures (cosine similarity), we can now dive deeper into how word embeddings are generated.

In the realm of Natural Language Processing (NLP), vector representations are crafted by training shallow neural networks with a single hidden layer (the Word2Vec approach).

Word Embeddings — Unveiling the Parameters:

Word embeddings, often referred to as word vectors, are the result of training these single-layer neural networks. The embeddings are stored directly in the layer’s parameters, i.e., the weights connecting its neurons. The primary task at hand is to create meaningful representations of words in a vector space.

Two fundamental tasks in this context are SKIP-gram and Continuous Bag of Words (CBOW) models.

  • SKIP-gram Model:

It involves a neural network that predicts the surrounding words (target context) based on a given word. The network aims to capture the relationships and meanings between words.

  • CBOW Model:

In contrast, the CBOW model employs a neural network to predict a specified word (the target word) based on its context, i.e., the words that surround it within a fixed window. Adjusting its parameters from these local word windows enhances its ability to understand and represent language nuances.

Key Element: The Embedding Matrix

Crucial to both models is the creation of an embedding matrix, a pivotal layer in the architecture of the neural network. This matrix is structured as the size of the vocabulary by the embedding dimension, serving as a foundational element in the representation of words within the NLP framework.
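As a minimal sketch (with an illustrative vocabulary size and embedding dimension, not values taken from any particular model), the embedding matrix is simply a lookup table with one row per word:

import numpy as np

# Illustrative sizes (assumptions for the sketch)
vocab_size = 10_000     # number of words in the vocabulary
embedding_dim = 300     # length of each word vector

# The embedding matrix: one row per word, one column per embedding dimension
embedding_matrix = np.random.randn(vocab_size, embedding_dim) * 0.01

# Looking up a word's embedding is simply selecting its row by index
word_index = 42
word_vector = embedding_matrix[word_index]
print(word_vector.shape)   # (300,)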

CBOW and SkipGram Word2Vec models

In the landscape of Natural Language Processing (NLP), envisioning the entirety of a corpus (text, such as from Wikipedia) is akin to a tableau of sentences. Each sentence becomes a row, analogous to a row in a tabular dataset.

To unravel the intricacies of language, a methodical process unfolds. Picture the entire corpus as a series of sentences, and at each step, a sliding window traverses these sentences. In this dynamic process:

  • Sliding Window Mechanism: A sliding window moves systematically over the sentences within the corpus. Think of it as a lens scrutinizing the linguistic landscape.
  • Central Word Selection: At each step, a pivotal decision is made: the choice of a central word. This word becomes the focal point for either predicting its surrounding words, encapsulating the context (as in the SKIP-gram model), or predicting the targeted word based on its contextual neighbors (as in the CBOW model).

This approach mimics the essence of understanding language through context, where each step in the sliding window unfolds a new perspective, capturing the intricate relationships and meanings embedded within the corpus.
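A minimal sketch of this sliding-window step, with a toy sentence and window size chosen purely for illustration:

sentence = "the cat sat on the mat".split()
window = 2   # number of context words on each side of the central word

for i, center in enumerate(sentence):
    # Collect the surrounding words within the window, excluding the central word
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    # SKIP-gram: predict each word in `context` from `center`
    # CBOW: predict `center` from `context`
    print(center, "->", context)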

In the more detailed picture of the CBOW approach, the matrix W’ is the word embedding matrix obtained once training is finished:

CBOW: creation of word embeddings (W’)

A technique known as negative sampling plays a crucial role in achieving a nuanced understanding of language. This method involves the deliberate inclusion of a small set of negative examples, typically around 5 to 20 sampled words per positive (center, context) pair, in the training process.

The key to negative sampling lies in strategically sampling words according to the frequency distribution observed across the entire corpus (in the original Word2Vec implementation, the unigram distribution raised to the 3/4 power). In other words, the choice of negative examples is guided by their prevalence in the overall dataset. This strategic approach ensures a balanced representation of both positive and negative instances, enriching the learning process.

In essence, negative sampling acts as a refined tuning mechanism, fostering a balanced and insightful training regimen for word embeddings.
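A minimal sketch of drawing negative examples, assuming the widely used Word2Vec trick of raising unigram frequencies to the 3/4 power (the toy counts and the number of negatives are illustrative assumptions):

import numpy as np

# Illustrative word counts from a toy corpus (assumption)
counts = {"the": 1000, "cat": 50, "sat": 40, "pizza": 30, "pyjamas": 10}
words = list(counts)

# Sampling distribution: unigram frequency raised to the 3/4 power, then normalized
freqs = np.array([counts[w] for w in words], dtype=float) ** 0.75
probs = freqs / freqs.sum()

# Draw a handful of negative words for one positive (center, context) pair
negatives = np.random.choice(words, size=5, replace=True, p=probs)
print(negatives)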

Through the back-propagation process, the structure of the language is effectively mapped onto the parameters of the neural network. The matrix aimed at capturing this structure has dimensions N (the number of words in the vocabulary) by the embedding dimension (commonly a few hundred, e.g., 300 or 512).

For instance:

  • text-embedding-ada-002 by OpenAI employs an embedding dimension of 1536.
  • This is eight times smaller than the preceding version, davinci-001.

With the next piece of code you can play with different embeddings and compare them:

import spacy
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

nlp = spacy.load('en_core_web_md')

# Compare two short documents via the similarity of their averaged word vectors
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(round(doc1.similarity(doc2), 2))

doc = nlp("Two ball goal in pyjamas socks")

# Get the vectors for the tokens "ball", "goal", "pyjamas" and "socks"
ball_vector = doc[1].vector
goal_vector = doc[2].vector
pidzama_vector = doc[4].vector   # "pyjamas"
carape_vector = doc[5].vector    # "socks"
embeddings = np.array([ball_vector, goal_vector, pidzama_vector, carape_vector])

# Visualize the four embedding vectors as rows of a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(embeddings, cmap="viridis",
            annot=False,
            xticklabels=False,
            yticklabels=['Ball', 'Goal', 'Pyjamas', 'Socks'])
plt.show()

With this part of the code you can use the heatmap to compare the embedding vectors of different words row by row. You will see that pyjamas and socks have very similar vectors, because they appear in very similar contexts.
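To put a number on that visual impression, a short follow-up sketch (reusing the vectors computed above) computes the cosine similarities directly:

from numpy.linalg import norm

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    return float(a @ b / (norm(a) * norm(b)))

print(round(cosine(pidzama_vector, carape_vector), 2))   # pyjamas vs. socks
print(round(cosine(ball_vector, goal_vector), 2))        # ball vs. goal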

Examples of Word Embedding Usage: Transformers and BERT

In the context of Transformer models, the intuitive application of word embeddings involves enhancing the original embedding with context vectors. By introducing context vectors, the embedding is effectively shifted in a specific direction within the multidimensional space.

This directional shift in an n-dimensional space holds semantic significance, representing a semantic displacement across the entire token space. In simpler terms, altering the embedding in a particular direction corresponds to a meaningful semantic shift in the space of all tokens. This mechanism contributes to a more nuanced and context-aware representation of words within the Transformer architecture, enabling a richer understanding of language semantics.
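As a purely conceptual sketch (not the actual Transformer computation), the directional shift can be pictured as adding an attention-weighted mixture of neighbouring embeddings to the word’s own embedding; the dimension and weights below are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
dim = 8   # illustrative embedding dimension

# Static embedding of the word and embeddings of three neighbouring words
word = rng.normal(size=dim)
neighbours = rng.normal(size=(3, dim))

# Toy attention weights (in a real Transformer they come from self-attention)
weights = np.array([0.6, 0.3, 0.1])
context_vector = weights @ neighbours

# The contextualized representation: the embedding shifted in a context-dependent direction
contextual_word = word + context_vector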

One of the key components contributing to the extraordinary performance of BERT (and of other LLMs) is its ability to undergo pre-training in a self-supervised manner, using word embeddings as input. At a high level, this form of training is invaluable because it can be conducted on raw, unlabeled text. Considering the widespread availability of such data online, sourced from digital book repositories or websites such as Wikipedia, a vast corpus of textual data can be amassed for pre-training. This allows BERT to learn from a diverse dataset that is orders of magnitude larger than most supervised/annotated datasets.

Creation of embeddings on narrow language corpus

Creating embeddings on a narrow language corpus involves generating vector representations for words in a dataset that is specific and limited in scope. Depending on your needs, you may fine-tune the embeddings on a task related to your domain; for instance, if you are working on a classification task, you might fine-tune them on a labeled dataset from that domain. A training sketch follows the list below.

Embeddings created on a local language corpus are:

  • More sensitive to specific meanings.
  • Suitable for simpler tasks (e.g., text classification).
  • Better suited for a narrower semantic context.
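As a sketch of what such training could look like with the gensim library (the toy domain corpus and the chosen parameters are assumptions, not a recipe from the article):

from gensim.models import Word2Vec

# Toy domain-specific corpus: each document is a list of tokens (assumption)
corpus = [
    ["patient", "reported", "mild", "headache"],
    ["patient", "reported", "severe", "headache", "and", "nausea"],
    ["treatment", "reduced", "headache", "symptoms"],
]

# sg=1 selects the SKIP-gram objective; sg=0 would select CBOW
model = Word2Vec(corpus, vector_size=100, window=2, min_count=1, sg=1, epochs=50)

print(model.wv.most_similar("headache", topn=3))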

This is close in spirit to Retrieval Augmented Generation (RAG), where a GenAI model is enriched with a narrow dataset and knowledge through supplemental retrieval. This is another important state-of-the-art use of word embeddings: averaging word vectors yields paragraph or document embeddings. Vector databases built on these embeddings serve as an additional source of knowledge, with search and extraction mechanisms based on vector similarity. The content retrieved from the vector database is used to augment the prompt sent to the LLM.

Applied to specialized domain tasks, this approach effectively reduces “hallucinations.”
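A minimal sketch of the retrieval step, assuming document embeddings are obtained by averaging word vectors (as spaCy does) and ranked by cosine similarity; the documents and query are illustrative:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

documents = [
    "The company refund policy allows returns within 30 days.",
    "Our support team is available on weekdays from 9 to 17.",
]
# Document embeddings: spaCy averages the word vectors of the tokens
doc_vectors = np.array([nlp(d).vector for d in documents])

query = "How long do I have to return a product?"
q = nlp(query).vector

# Rank documents by cosine similarity and retrieve the best match
sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
best = documents[int(np.argmax(sims))]

# Augment the prompt for the LLM with the retrieved content
prompt = f"Context: {best}\n\nQuestion: {query}"
print(prompt)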

Conclusions

In the realm of Natural Language Processing (NLP), word embedding serves as a foundational concept, acting as the “input” for various NLP tasks. Drawing parallels between BERT (Bidirectional Encoder Representations from Transformers) and the Continuous Bag of Words (CBOW) model reveals intriguing similarities:

Common Objective: Both BERT and CBOW are designed to learn vector representations of tokens (words) within a given context. BERT does this through masked language modelling, which is very similar to the CBOW objective; it also learns from the Next Sentence Prediction (NSP) task.

Masking Strategy: BERT employs a masking strategy where several words in a sentence are masked, and the training revolves around predicting these masked words. This mirrors the principle of the CBOW model.
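A quick way to see this masking strategy in action, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available:

from transformers import pipeline

# Fill-mask pipeline with a pre-trained BERT model
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked word from its full left and right context,
# much as CBOW predicts a central word from its surrounding window
print(unmasker("The cat [MASK] on the mat."))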

Interconnected Training: The output (word embedding) from CBOW serves as the input for BERT. However, in BERT, instead of using word indices, vector representations of words become the input, introducing a more sophisticated dimension.

Sophistication in Learning: BERT goes beyond CBOW by learning sophisticated representations enriched with context. This advancement is attributed to the utilization of self-attention mechanisms.

The CBOW model has left an imprint on the development of BERT. While BERT builds upon the foundations laid by CBOW, it elevates the game by incorporating self-attention mechanisms, allowing for a more nuanced understanding of contextual relationships within language.

This synergy between CBOW and BERT underscores the continuous evolution and enhancement of word embedding techniques in NLP, showcasing the iterative nature of advancements in the field.

While word vector representation approaches have significantly advanced natural language processing, they come with certain drawbacks:

Universal Context Representation: Words are often represented in a universal context derived from the entire linguistic corpus. This approach reduces the potential for capturing nuances in meaning, especially in cases of ambiguity or words with multiple layers of meaning.

Mitigation Attempts in LLM: Language models with mechanisms like self-attention, such as those found in Large Language Models (LLMs), attempt to address these limitations. They achieve this by forwarding context vectors alongside specific words during training.

Broad Training Corpus: The training process for many word representation models involves a broad linguistic corpus. This lack of sensitivity to specific and local meanings can be a drawback for applications requiring fine-tuned semantic understanding.

Expensive Training: Training models like ChatGPT, for instance, involves substantial costs, approximately 3.5 million dollars for both training and maintenance. The need for powerful hardware contributes to the expense.

Semantic Drift: Semantic drift, a phenomenon where the meaning of words changes over time or due to historical and domain-specific contexts, poses a challenge. Word vector representations may struggle to adapt to evolving meanings.

These limitations underscore the ongoing challenges in the development of word vector representations, prompting continuous refinement to address contextual nuances and ensure accurate semantic understanding.

This was a comprehensive overview of word embeddings.


Vitomir Jovanović

PhD in Psychometrics, Data Scientist, works in tech, from Belgrade, Serbia