Categories: AI/ML Research

The Journey of a Token: What Really Happens Inside a Transformer

Large language models (LLMs) are based on the transformer architecture, a complex deep neural network whose input is a sequence of token embeddings.
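To make "a sequence of token embeddings" concrete, here is a minimal sketch of how text might become that input. The toy vocabulary, the whitespace tokenizer, the width `d_model = 8`, and the `embed` helper are all assumptions made for illustration. Real models use subword tokenizers and learned embedding tables with many thousands of rows.

```python
import random

# Hypothetical toy vocabulary; real tokenizers use subword units.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
d_model = 8  # assumed embedding width, chosen only for illustration

random.seed(0)
# Embedding table: one vector per token ID.
# Randomly initialised here; in a real model these values are learned.
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(d_model)] for _ in vocab
]

def embed(text):
    """Map whitespace-separated words to token IDs, then to embedding vectors."""
    token_ids = [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]
    return token_ids, [embedding_table[i] for i in token_ids]

token_ids, embeddings = embed("The cat sat")
print(token_ids)                            # [1, 2, 3]
print(len(embeddings), len(embeddings[0]))  # 3 8
```

The result is one vector per token, and this list of vectors, rather than the raw text, is what the network receives.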

AI Generated Robotic Content