This post is divided into five parts; they are:

• Naive Tokenization
• Stemming and Lemmatization
• Byte-Pair Encoding (BPE)
• WordPiece
• SentencePiece and Unigram

The simplest form of tokenization splits text into tokens based on whitespace.
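To make this concrete, here is a minimal sketch of naive whitespace tokenization. It is not code from the post itself: it uses only Python's built-in `str.split`, and the sample sentence is invented for illustration.

```python
# Naive tokenization: split text wherever whitespace occurs.
text = "The quick brown fox, it jumped over the lazy dog."

tokens = text.split()  # with no argument, split() breaks on any run of whitespace
print(tokens)
# ['The', 'quick', 'brown', 'fox,', 'it', 'jumped', 'over', 'the', 'lazy', 'dog.']
```

Note how punctuation stays attached to its neighboring word ("fox,", "dog."); shortcomings like this are part of what motivates the stemming, lemmatization, and subword methods covered in the later parts.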
Language models, as incredibly useful as they are, are not perfect. They may fail or exhibit undesired behavior due to a variety of factors, such as data quality, tokenization constraints, or difficulties in correctly interpreting user prompts.
This post is divided into six parts; they are:

• Why Transformer is Better than Seq2Seq
• Data Preparation and Tokenization
• Design of a Transformer Model
• Building the Transformer Model
• Causal Mask and Padding Mask
• Training and Evaluation

Traditional seq2seq models with recurrent neural networks have…
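Since the outline above includes a part on the causal mask and padding mask, here is a minimal sketch of how the two are typically built and combined in PyTorch. This is an illustration under assumptions, not the post's actual code: the helper name `make_masks`, the pad token id of 0, and the convention that `True` means "may attend" are all invented for the example.

```python
import torch

def make_masks(input_ids: torch.Tensor, pad_token_id: int = 0) -> torch.Tensor:
    """Build a combined causal + padding attention mask (assumed helper).

    Returns a boolean tensor of shape (batch, seq_len, seq_len) where
    True means "query position i may attend to key position j".
    """
    seq_len = input_ids.size(1)

    # Causal mask: lower-triangular, so position i attends only to j <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    # Padding mask: True at real tokens, False at padding; shaped
    # (batch, 1, seq_len) so it broadcasts over the query dimension.
    padding = (input_ids != pad_token_id).unsqueeze(1)

    # A position is attendable only if both masks allow it.
    return causal & padding

# Two sequences of length 5; the first is padded with 0 at the end.
input_ids = torch.tensor([[5, 7, 9, 0, 0],
                          [3, 4, 6, 8, 2]])
mask = make_masks(input_ids, pad_token_id=0)
print(mask.shape)  # torch.Size([2, 5, 5])
print(mask[0])     # row i is True only up to column i, and never at padded columns
```

A boolean mask in this form can be passed to `torch.nn.functional.scaled_dot_product_attention` as `attn_mask`, or converted to an additive mask (0 where attending is allowed, -inf where it is not) before the softmax.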
We introduce Shape Tokens, a 3D representation that is continuous, compact, and easy to integrate into machine learning models. Shape Tokens serve as conditioning vectors, representing shape information within a 3D flow-matching model. This flow-matching model is trained to approximate probability density functions corresponding to delta functions concentrated on the…