Dive into FlashAttention-3

The rise of large language models has made the “attention” mechanism a cornerstone of modern AI. But attention is both compute- and memory-intensive, and it often becomes a bottleneck. Enter FlashAttention, a groundbreaking algorithm designed to accelerate this crucial step. While the cutting-edge FlashAttention-4 for NVIDIA’s new Blackwell architecture is now emerging, understanding the leap forward made by FlashAttention-3 on the widely used Hopper (H100) platform is key to grasping modern GPU optimization. This post dissects the clever combination of techniques that makes it fast, from algorithmic innovations like the fused kernel to deep hardware co-design on Hopper, which uses specialized units like the Tensor Memory Accelerator (TMA) to power advanced scheduling patterns such as warp specialization and pingpong scheduling. ...
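
Before the full post, here is a minimal NumPy sketch of the core algorithmic idea: a fused, tiled attention loop with online softmax that never materializes the full score matrix. This illustrates the principle FlashAttention builds on, not the FlashAttention-3 Hopper kernel itself; the function name and block size are my own choices.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Illustrative tiled attention with online softmax.

    Processes K/V in blocks so the full (N x N) score matrix is never
    materialized; a sketch of the fused-kernel idea, not the real kernel.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(N, -np.inf)   # running row-wise max of scores
    l = np.zeros(N)           # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j:j+block], V[j:j+block]
        S = (Q @ Kj.T) * scale                 # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))   # updated running max
        p = np.exp(S - m_new[:, None])         # unnormalized tile probabilities
        correction = np.exp(m - m_new)         # rescale earlier partial results
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vj
        m = m_new
    return out / l[:, None]
```

The output matches the reference softmax(QKᵀ/√d)V computed all at once. The real kernels run this loop in CUDA, with TMA-driven tile loads and warp-specialized producer/consumer pipelines, but the math is the same rescale-and-accumulate pattern.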

September 18, 2025 · 8 min · 1500 words · Li Cao

Summer Wrap Up

Summer break officially ends this week. I spent the last several weeks of summer on a few things:
- Solving DSA problems (honestly, I’d prefer learning new skills, but they’re necessary for some job interviews)
- System Design (ByteByteGo is a great YouTube channel for this topic). I read several case studies, and it’s really the “big bang” of computer systems: it draws on knowledge from computer networks, databases, cloud-native platforms, DevOps, software engineering, backend development, and distributed systems
- Cloud Computing
- Large Language Models
- Computer Networks
While I dedicated most of my summer to studying and research (with the exception of a 3-day trip to Cuyahoga Valley National Park and Cleveland), there’s still much more to explore: both new topics and greater depth in areas I’ve already been working on. ...

August 21, 2025 · 1 min · 130 words · Li Cao

The 80-Year-Old Architecture Holding Back AI

I came across an IBM Research post recently (https://research.ibm.com/blog/why-von-neumann-architecture-is-impeding-the-power-of-ai-computing). It turns out one of the biggest things holding back AI computing isn’t some exotic new problem; it’s a design choice from 1945. The culprit? The von Neumann architecture that has powered nearly every computer since.
The Problem: A Traffic Jam Inside Your Computer
Picture a brilliant chef (the processor) and a massive pantry (memory) separated by a narrow hallway. For each step of a recipe, the chef must walk down the hallway, grab a single ingredient, and walk back. This round trip is repeated thousands of times. That’s essentially what happens when AI models run on traditional computers. ...
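
To put a rough number on that hallway, here is a back-of-the-envelope sketch comparing the time a chip spends computing a matrix-vector product (the core of an LLM decode step) with the time it spends fetching the weights from memory. The hardware figures are illustrative assumptions of mine, not numbers from the IBM post.

```python
# Arithmetic-intensity sketch: compute time vs. memory time for one
# matrix-vector multiply. Peak-FLOP and bandwidth numbers are assumed,
# round figures, not measurements.
FLOPS_PEAK = 50e12   # assumed peak compute, FLOP/s
BANDWIDTH = 1e12     # assumed memory bandwidth, bytes/s

n = 4096                    # square fp16 weight matrix
flops = 2 * n * n           # one multiply-add per weight
bytes_moved = 2 * n * n     # each fp16 weight (2 bytes) read once

print(f"compute time: {flops / FLOPS_PEAK * 1e6:.2f} us")
print(f"memory time:  {bytes_moved / BANDWIDTH * 1e6:.2f} us")
# Memory time dominates by ~50x: the chef spends the day in the
# hallway, not at the stove. That gap is the von Neumann bottleneck.
```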

August 6, 2025 · 4 min · 682 words · Li Cao

Three Ingredients of Happiness

Naval Ravikant once distilled happiness down to its essence: “Happiness is a choice. It’s a choice between three things: where you live, who you are with, and what you do.” It took me years to realize how profound this insight was. When I think about the times I’ve felt most fulfilled versus completely stuck, it almost always comes down to these three elements.
Where You Live
Environment shapes us more than we realize. It’s not just physical space, but the entire ecosystem we inhabit daily. ...

August 3, 2025 · 4 min · 747 words · Li Cao

What Is the “Full Stack” Equivalent in Systems Programming?

Web development has popularized the concept of a “full stack developer”: someone who is comfortable working on every part of an application, from the user-facing front end to the server-side back end and the database it connects to, with a holistic view of the entire web stack. But what is the equivalent of “full stack” in the world of systems programming?
Redefining the “Stack”
To answer this, we first need to identify what constitutes the “stack” in systems programming: ...

August 1, 2025 · 3 min · 568 words · Li Cao

Fine-tuning an LLM for Text-to-SQL Generation

I just completed a project that lets people ask database questions in plain English and get back proper SQL queries, using a fine-tuned large language model. For the base model, I chose Mistral-7B-v3 and fine-tuned it specifically for SQL generation. Using QLoRA for efficient training, I was able to train the 7-billion-parameter model on a single GPU (an NVIDIA Tesla P100) in around 3 hours. The resulting model performs well on common SQL patterns like filtering, joins, and aggregations, effectively handling the majority of real-world database queries. That said, it can be less accurate on complex subqueries or intricate nested queries due to the limitations of the Mistral-7B model; a larger model would handle these cases better, but this was a tradeoff between performance and computational requirements. ...
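
For the curious, here is a minimal sketch of what a QLoRA setup like this looks like with the Hugging Face transformers, peft, and bitsandbytes libraries. The model id, adapter rank, and target modules below are illustrative assumptions on my part, not the project’s exact configuration.

```python
# Minimal QLoRA setup sketch (transformers + peft + bitsandbytes).
# Hyperparameters and model id are illustrative guesses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-v0.3"  # assumed HF id for the base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # 4-bit base weights: the "Q" in QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,     # fp16 compute (P100 has no bf16)
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,                                     # adapter rank (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)    # only the adapters are trainable
model.print_trainable_parameters()
```

With the 4-bit base frozen and only the low-rank adapters trainable, the memory footprint drops enough for a 7B model to fit comfortably on a single 16 GB card.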

July 24, 2025 · 2 min · 235 words · Li Cao

Summer of Computer Systems and More

Because of the requirement to complete at least two consecutive semesters of study before interning in the US, I didn’t apply for any summer internships. On-campus employment is exempt, and I am fortunate to be an RA at the CMU Tepper BLA lab this summer, using graph theory to analyze financial statements. More details can be found here. I am also currently working with a PhD student at the Safe AI Lab on research in robotics. ...

July 20, 2025 · 4 min · 787 words · Li Cao

CMU Course Reviews - 1st Semester

Looking back on my first semester at CMU, I wanted to share my thoughts and experiences with the courses I took. This might be helpful for future students planning their schedules. This semester I took 4 courses. The workload was intense but manageable with good time management. Here’s my breakdown:
18-613: Introduction to Computer Systems
While the famous CSAPP course has been extensively reviewed and is taught at many universities, here’s my personal perspective on the CMU experience. ...

July 18, 2025 · 6 min · 1273 words · Li Cao

Curiosity is (Almost) All You Need

The landscape of learning has been fundamentally transformed. In an era where Large Language Models can generate code and explain complex concepts, the traditional barriers to learning have largely disappeared. What remains, and what has become more important than ever, is curiosity.
The Great Democratization
Not too long ago, learning new technologies or skills required:
- Access to expensive courses or textbooks
- Mentorship from experienced practitioners
- Trial and error through countless hours of debugging
- Physical presence in classrooms or labs
Today, anyone with internet access can have a conversation with an AI that knows more about programming, mathematics, science, and virtually any field than most human experts. The means of learning are no longer the bottleneck; curiosity and the drive to learn are. ...

June 15, 2025 · 4 min · 825 words · Li Cao