Building a Tokenizer from Scratch [part 2]
![Building a Tokenizer from Scratch [part 2]](https://media2.dev.to/dynamic/image/width=1200,height=627,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcv5i9ky1shh864vi2c8.png)
Source: DEV Community
Parser Theory: Q/A with Claude Opus

In part 1, we built a working FSM that recognizes `<div>text</div>` using just 7 primitives mapped 1:1 to assembly opcodes. But FSMs have a hard limit: they can't handle nested structures like `<div><div>hello</div></div>`.

In this post, we climb the Chomsky hierarchy from finite state machines to pushdown automata, build a PDA that recognizes nested `<div>` tags, and then turn it into a transducer that emits tokens. In other words, we are building the core of a lexer.

Q: Why can't FSMs handle nested structures?

Because an FSM has a fixed number of states, and that's all the memory it has. Consider nested divs:

`<div><div><div>hello</div></div></div>`

To correctly match closing tags, you need to count how many `<div>`s you've opened so you know how many `</div>`s to expect. An FSM with, say, 12 states can handle nesting up to some fixed depth, but someone can always write H