Hi, this is Sen, a lifelong learner and explorer.

Passionate about AI/ML and other technologies. Join me on my journey as I share learnings and discoveries. Let’s explore together!

Transformer Part II: The implementation & experiments

In this episode, we go into some details of the causal transformer implementation. Some toy experiment results are shown to analyze the transformer, in an attempt to understand what drives its performance. We will not go through every single line of the implementation. The code used for illustration can be found at https://github.com/yuansen23aa/GPT-learning/blob/main/basic_gpt.ipynb, which largely follows Andrej Karpathy’s nanoGPT implementation with some modifications. So let’s dig in. We use the following terms interchangeably: block size = sequence length, causal attention = masked attention, causal transformer = decoder-only transformer. ...

Transformer Part I: The algorithmic details that actually matter

The Transformer is arguably the most influential AI innovation of the past decade, serving as the foundation for modern LLMs. It has also revolutionized adjacent fields such as computer vision and recommender systems. In this note, we revisit this revolutionary technology with a focus on a set of details that matter. Transformer Recap: The Transformer was first proposed as an encoder-decoder model to solve the machine translation problem. Take English-to-Chinese translation as our example. The encoder encodes text tokens in a way that captures both the meaning of individual tokens and their joint dependencies. Its output is the tensor representation of the English sentences. The decoder is responsible for text generation and for learning the relationship between English and Chinese. Text generation is essentially next-token prediction, which is achieved by masking unseen tokens so that attention is paid only to the preceding tokens in the sentence, namely causal masking. The relationship between English and Chinese is learned by cross attention, where the queries come from the Chinese tokens and attend most strongly to the encoded English tokens that receive high attention weights. ...
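To make the causal masking idea concrete, here is a minimal sketch in plain Python. It is not taken from the post or from nanoGPT; the function name `causal_attention_weights` and the toy scores are hypothetical, and the scores are assumed to be already-scaled query-key dot products. It only shows that each position's softmax is taken over itself and the positions before it, so future tokens get zero weight.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores with future positions masked out.

    scores[i][j] is the raw (already scaled) score of query i against key j.
    Hypothetical helper for illustration only.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only attend to positions 0..i (causal masking).
        visible = scores[i][: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# Toy 3-token example: the large score at [1][2] is ignored because
# token 1 is not allowed to look at token 2.
w = causal_attention_weights([
    [0.0, 1.0, 2.0],
    [0.0, 0.0, 5.0],
    [1.0, 1.0, 1.0],
])
```

In a tensor library the same effect is usually obtained by setting the masked scores to negative infinity before the softmax; the explicit slicing here is just easier to read.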

Why I Started Writing

2026 is shaping up to be another exciting year for AI and technology. One of my New Year’s resolutions is to start a personal website to document my learning journey and the “aha” moments I encounter along the way, both in technology and in life. So why write, and why now? Some might argue that large language models already explain technical concepts so well that writing blog posts no longer adds much value. I see it differently. Here are my reasons. ...