LLMs may speak in words, but under the hood they think in tokens: compact numeric IDs representing character sequences. If you grasp why tokens exist, how they are formed, and where the real-world costs arise, you can trim your invoices, slash latency, and squeeze higher throughput from any model, whether you rent a commercial endpoint or serve one in-house.
Why LLMs Don’t Generate Text One Character at a Time
Imagine predicting “language” character by character. When decoding the very last “e,” the network must still replay the entire hidden state for the preceding seven characters. Multiply that overhead by thousands of characters in a long prompt and you get eye-watering compute.