How ChatGPT Understands Your Questions
Every day, millions of people type questions into ChatGPT, Gemini, or Claude and receive instant, human-like answers. To the untrained eye, it feels like magic—as if there is a tiny, highly intelligent human sitting inside the computer who understands English, Spanish, or JavaScript.
But in reality, computers do not understand human language at all.
In this article, we will peel back the layers of Large Language Models (LLMs) to explore the fascinating mathematics, architectures, and data structures that allow ChatGPT to process your prompts and generate responses.
1. What is an LLM?
An LLM stands for Large Language Model. Let's break down these three words:
- Large: Refers to the scale of the neural network. Modern models contain billions (sometimes trillions) of adjustable settings called parameters and are trained on massive datasets comprising books, websites, articles, and code repositories.
- Language: Refers to the domain. These models are specifically built to understand, generate, translate, and reason about human or programming languages.
- Model: A mathematical representation of a system. An LLM is a very large mathematical function, defined by its parameters, that maps input sequences to output sequences.
What Problems Do LLMs Solve?
LLMs are versatile general-purpose text processors. They solve problems that were historically extremely difficult for classical code, including:
- Processing Unstructured Data: Extracting structured JSON data from messy, unstructured text.
- Reasoning & Planning: Following step-by-step logic to solve math, logic, or programming problems.
- Translation & Summarization: Translating between natural languages or summarizing a 100-page document into key bullet points.
- Code Generation: Writing, debugging, and explaining programming code across dozens of languages.
Popular Examples of LLMs
Today, several companies lead the frontier of LLM development:
| Creator | Model Series | Access Details |
|---|---|---|
| OpenAI | GPT-4o / o1 | Commercial, powers ChatGPT |
| Google | Gemini 1.5 Pro / Flash | Commercial, features massive context windows |
| Anthropic | Claude 3.5 Sonnet | Commercial, renowned for coding & reasoning |
| Meta | Llama 3 / 3.1 | Open-weights, can be downloaded and run locally |
2. What Happens When You Send a Message to ChatGPT?
When you type a prompt and press "Enter," a series of computational stages runs in quick succession:
The Generation Pipeline
- Typing a Prompt: You submit raw text (e.g., "Explain closures in JavaScript").
- Processing Your Message: The text is broken down into numbers (tokenization) and sent to a server farm hosting the model's neural network.
- Generating a Response: The model processes the input numbers and computes a probability for every possible next token, then picks one, usually from among the most likely. Once that token is chosen, it is appended to the input, and the model runs again to predict the next token. This repeating loop is called autoregressive generation.
- Displaying the Text: The numbers are translated back into readable text and streamed to your browser window.
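The predict-append-repeat loop can be sketched in a few lines. This is only an illustration: `toyModel` is a hypothetical hand-written lookup table standing in for the neural network, and it looks only at the last token, whereas a real LLM considers the whole context.

```javascript
// Toy "model": a hypothetical table from the last token to next-token probabilities.
// A real LLM computes these probabilities with a neural network over the full context.
const toyModel = {
  "<start>": { closures: 0.9, functions: 0.1 },
  closures: { capture: 0.8, are: 0.2 },
  capture: { variables: 0.7, scope: 0.3 },
  variables: { "<end>": 1.0 },
  scope: { "<end>": 1.0 },
  are: { functions: 1.0 },
  functions: { "<end>": 1.0 },
};

// Pick the highest-probability token (greedy decoding).
function pickMostLikely(probabilities) {
  return Object.entries(probabilities).reduce((best, entry) =>
    entry[1] > best[1] ? entry : best
  )[0];
}

// Autoregressive generation: predict one token, append it, repeat.
function generate(model, maxTokens = 10) {
  const tokens = ["<start>"];
  for (let step = 0; step < maxTokens; step++) {
    const lastToken = tokens[tokens.length - 1];
    const next = pickMostLikely(model[lastToken]);
    if (next === "<end>") break;
    tokens.push(next);
  }
  return tokens.slice(1); // drop the "<start>" marker
}

console.log(generate(toyModel).join(" ")); // closures capture variables
```

Each pass through the loop sees everything generated so far, which is why a response appears one token at a time.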
Why Responses Are Not Copied from the Internet
A common misconception is that ChatGPT acts like Google Search—scanning the internet, finding an article, and copy-pasting it.
This is not how it works.
LLMs do not have an active database of documents or search results inside their brains. Instead, during training, the model reads billions of pages and adjusts its internal connections (weights) to learn the patterns of language. When generating a response, the model is constructing a brand-new sequence of words, calculating which word makes the most sense next, based entirely on probability. It is a highly advanced, context-aware version of your smartphone's predictive keyboard.
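To make the predictive-keyboard comparison concrete, here is a deliberately tiny sketch. It assumes a made-up one-sentence corpus and simply counts which word follows which. Real LLMs learn far richer patterns through neural network weights, but the principle is the same: the model keeps statistics, not documents.

```javascript
// "Training": count which word follows which in a tiny example corpus.
function train(corpus) {
  const counts = {};
  const words = corpus.toLowerCase().split(/\s+/);
  for (let i = 0; i < words.length - 1; i++) {
    const current = words[i];
    const next = words[i + 1];
    counts[current] = counts[current] || {};
    counts[current][next] = (counts[current][next] || 0) + 1;
  }
  return counts;
}

// "Inference": turn the counts for one word into next-word probabilities.
function nextWordProbabilities(counts, word) {
  const followers = counts[word] || {};
  const total = Object.values(followers).reduce((sum, n) => sum + n, 0);
  return Object.fromEntries(
    Object.entries(followers).map(([next, n]) => [next, n / total])
  );
}

const counts = train("the cat sat on the mat the cat ate the fish");
console.log(nextWordProbabilities(counts, "the"));
// { cat: 0.5, mat: 0.25, fish: 0.25 }
```

After training, the original sentence is no longer needed. Only the counts remain, and new text is produced from them.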
3. Why Computers Don't Understand Human Language
Computers are machines built from silicon transistors that operate on binary states: 0 (off) and 1 (on). They excel at calculations—adding, multiplying, and dividing numbers at lightning speed.
However, a computer has no concept of what a word is. To a processor, the word "cat" is just a sequence of characters (c, a, t), which are represented under the hood by character codes (like ASCII/UTF-8: 99, 97, 116). The computer doesn't know that a cat is a small, furry, four-legged animal that meows.
"cat" ---> [ 99, 97, 116 ] ---> But what is the relationship between "cat" and "dog"?
To make relationships between words calculable, we must convert words into a format that supports math.
Word Embeddings: Multi-Dimensional Spaces
LLMs solve this using embeddings. Every word (or part of a word) is assigned a list of numbers called a vector. This vector represents a coordinate in a massive, multi-dimensional space (often spanning 1,536 to 4,096 dimensions).
In this vector space, words with similar meanings are positioned close together:
- "king" and "queen" will have vectors that point in very similar directions.
- We can perform vector arithmetic: king - man + woman ≈ queen.
By converting words into vectors, we translate human semantics into linear algebra.
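A small sketch shows what "close together" means in practice. The three-dimensional vectors below are hand-picked assumptions chosen to make the arithmetic easy to follow. Real embeddings are learned during training and have far more dimensions.

```javascript
// Hypothetical 3-dimensional embeddings: [royalty, gender, fruitiness].
// Real embeddings are learned, not hand-written, and have thousands of dimensions.
const embeddings = {
  king:  [0.9,  0.3, 0.0],
  queen: [0.9, -0.3, 0.0],
  man:   [0.1,  0.3, 0.0],
  woman: [0.1, -0.3, 0.0],
  apple: [0.0,  0.0, 1.0],
};

const dot = (a, b) => a.reduce((sum, value, i) => sum + value * b[i], 0);
const norm = (a) => Math.sqrt(dot(a, a));

// Cosine similarity: 1 means same direction, values near 0 mean unrelated directions.
const cosineSimilarity = (a, b) => dot(a, b) / (norm(a) * norm(b));

// Find the vocabulary word whose vector points closest to the target vector.
function nearestWord(target) {
  let best = null;
  for (const [word, vector] of Object.entries(embeddings)) {
    const score = cosineSimilarity(target, vector);
    if (best === null || score > best.score) best = { word, score };
  }
  return best.word;
}

// king - man + woman, computed element by element.
const result = embeddings.king.map(
  (value, i) => value - embeddings.man[i] + embeddings.woman[i]
);

console.log(cosineSimilarity(embeddings.king, embeddings.queen).toFixed(2)); // 0.80
console.log(nearestWord(result)); // queen
```

Cosine similarity compares direction rather than length, which is why it is a common choice for comparing embeddings.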
4. Tokenization: The Bridge Between Text and Numbers
Before text can be turned into vectors, it must first be chopped into manageable pieces. This process is called tokenization.
What is a Token?
A token is a textual chunk. It is the basic unit of currency for an LLM. A token is not necessarily a full word. It can be:
- A whole word (e.g., "apple")
- A sub-word chunk (e.g., "token" and "ization" for "tokenization")
- A single character or punctuation mark (e.g., "." or ",")
Why Tokenization is Needed
- Vocabulary Size Control: If we treated every single word as a unique token, our vocabulary would be millions of items long, and the model would still struggle to handle typos, plurals, or new words (like "generative").
- Handling Unknown Words: Sub-word tokenization allows the model to break down a completely new word into fragments it does recognize. For example, if it doesn't know "unbelievability", it can break it down into ["un", "believ", "ability"].
Words vs. Tokens
As a rule of thumb for English text:
- 1 Token ≈ 4 characters of text.
- 100 English Words ≈ 130 Tokens.
- 0.75 Words ≈ 1 Token.
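These rules of thumb are easy to turn into quick estimators, for example to guess the cost of a prompt before sending it. The function names are made up for this sketch, and the results are approximations only. An exact count requires running the model's actual tokenizer.

```javascript
// Rough token estimates for English text, based on the rules of thumb above.
// Approximations only: exact counts require the model's own tokenizer.
function estimateTokensFromChars(text) {
  return Math.ceil(text.length / 4); // 1 token ≈ 4 characters
}

function estimateTokensFromWords(wordCount) {
  return Math.ceil(wordCount / 0.75); // 0.75 words ≈ 1 token
}

console.log(estimateTokensFromChars("Explain closures in JavaScript")); // 8
console.log(estimateTokensFromWords(100)); // 134
```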
Tokenization Example
Consider the sentence: "ChatGPT is amazing!"
A tokenizer might break this down as follows (the IDs are illustrative; actual values depend on the tokenizer's vocabulary):
| Fragment | Token ID | Notes |
|---|---|---|
| Chat | 29437 | Common sub-word |
| G | 40 | Capital letter |
| PT | 9801 | Acronym part |
| is | 318 | Includes the preceding space |
| amazing | 4983 | Common word with space |
| ! | 0 | Punctuation token |
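To see how a word can be split into known fragments and mapped to IDs, here is a minimal sketch using greedy longest-match segmentation over a hypothetical six-entry vocabulary. This is a simplification: GPT-style tokenizers use byte pair encoding with merge rules learned from data, so treat this as an illustration of the idea rather than the real algorithm.

```javascript
// Hypothetical tiny vocabulary; the array index serves as the token ID.
// Real tokenizers learn tens of thousands of entries from data.
const vocabulary = ["un", "believ", "ability", "token", "ization", "able"];
const tokenIds = new Map(vocabulary.map((fragment, id) => [fragment, id]));

// Greedy longest-match segmentation: at each position, take the longest known fragment.
// Characters that match nothing fall back to single-character tokens.
function tokenize(word) {
  const tokens = [];
  let position = 0;
  while (position < word.length) {
    let end = word.length;
    while (end > position + 1 && !tokenIds.has(word.slice(position, end))) {
      end--;
    }
    tokens.push(word.slice(position, end));
    position = end;
  }
  return tokens;
}

// Map each fragment to its ID; unknown fragments get -1 in this sketch.
function encode(word) {
  return tokenize(word).map((fragment) => tokenIds.get(fragment) ?? -1);
}

console.log(tokenize("unbelievability")); // [ 'un', 'believ', 'ability' ]
console.log(encode("unbelievability"));   // [ 0, 1, 2 ]
console.log(encode("tokenization"));      // [ 3, 4 ]
```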
5. Transformers: The Engine of Modern AI
Almost every famous LLM (GPT, Gemini, Claude, Llama) is built on a specific neural network architecture called the Transformer.
Introduced by Google researchers in their landmark 2017 paper "Attention Is All You Need", the Transformer architecture completely revolutionized the field of Artificial Intelligence.
Why Transformers Changed AI
Before Transformers, AI models processed language sequentially (word-by-word) using recurrent architectures (RNNs and LSTMs). If a sentence had 50 words, the model had to process word 1, then word 2, all the way to word 50.
- The Problem: Sequential processing cannot be easily parallelized on GPU chips, making training slow. More importantly, by the time the model got to word 50, it had often "forgotten" the context of word 1.
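- The Transformer Solution: During training, Transformers process all the tokens of a sequence at once (in parallel) instead of one after another. This makes training far faster and allows models to scale to trillions of words. And because every token can look directly at every other token, the context of word 1 is still available when the model reaches word 50.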
The Self-Attention Mechanism
The magic ingredient of a Transformer is called Self-Attention.
In any language, the meaning of a word depends heavily on its context. Self-attention allows the model to calculate how much "attention" one token should pay to other tokens in the same sentence to resolve its meaning.
Consider these two sentences:
- "She deposited her money at the bank."
- "The children played on the river bank."
Sentence 1: bank ─── (pays attention to) ───> money
Sentence 2: bank ─── (pays attention to) ───> river
In the first sentence, the self-attention mechanism links the word bank to money, resulting in a vector representing a financial institution. In the second sentence, self-attention links bank to river, creating a vector representing land next to water.
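The core arithmetic of attention can be sketched with a few toy vectors. Everything here is an assumption made for illustration: the two-dimensional vectors are hand-picked, and the learned query, key, and value projections that a real Transformer applies are left out, so the raw vectors play all three roles.

```javascript
// Hypothetical 2-dimensional token vectors: [finance, nature].
const tokens = {
  the:   [0.0, 0.0], // carries no topical signal
  money: [1.0, 0.0],
  river: [0.0, 1.0],
  bank:  [0.5, 0.5], // ambiguous on its own
};

const dot = (a, b) => a.reduce((sum, value, i) => sum + value * b[i], 0);

// Softmax turns raw scores into weights that sum to 1.
function softmax(scores) {
  const max = Math.max(...scores);
  const exps = scores.map((score) => Math.exp(score - max));
  const total = exps.reduce((sum, value) => sum + value, 0);
  return exps.map((value) => value / total);
}

// Simplified attention for one query token: score every context token
// against the query, normalise the scores, then blend the context vectors.
function attend(query, context) {
  const scale = Math.sqrt(query.length);
  const weights = softmax(context.map((key) => dot(query, key) / scale));
  const output = query.map((_, dim) =>
    context.reduce((sum, vector, i) => sum + weights[i] * vector[dim], 0)
  );
  return { weights, output };
}

const finance = attend(tokens.bank, [tokens.the, tokens.money, tokens.bank]);
const river = attend(tokens.bank, [tokens.the, tokens.river, tokens.bank]);

console.log(finance.output); // finance component ends up larger than nature
console.log(river.output);   // nature component ends up larger than finance
```

The same input vector for "bank" produces two different output vectors depending on its neighbours, which is the effect described above.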
6. Key LLM Mechanics
To build applications with LLMs, developers must understand two critical concepts: Context Windows and Temperature.
Concept A: The Context Window
An LLM has a strict limit on how much text it can process at any given moment. This is called the Context Window.
Think of the context window as the model's active working memory. It must hold:
- The system instructions (e.g., "You are a helpful coding assistant").
- The conversation history (previous questions and answers).
- The new prompt you just typed.
- The generated response (as it is being written).
If your conversation exceeds the context window, the oldest messages in the chat history are typically dropped to make room for the new tokens, so the model effectively "forgets" them.
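One common way an application handles this is to trim the history before each request. The sketch below assumes a made-up message format and the rough "1 token ≈ 4 characters" estimate. A production app would count tokens with the model's real tokenizer and also reserve room for the response.

```javascript
// Rough token estimate (1 token ≈ 4 characters); an approximation only.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Keep the system message, then add messages from newest to oldest
// until the token budget is spent. Older messages are dropped.
function fitToContextWindow(messages, maxTokens) {
  const [system, ...history] = messages;
  let used = estimateTokens(system.content);
  const kept = [];
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return [system, ...kept];
}

// Hypothetical conversation for illustration.
const conversation = [
  { role: "system", content: "You are a helpful coding assistant." },
  { role: "user", content: "What is a closure?" },
  { role: "assistant", content: "A closure is a function that remembers its outer scope." },
  { role: "user", content: "Show me an example." },
];

const trimmed = fitToContextWindow(conversation, 30);
console.log(trimmed.map((message) => message.role));
// [ 'system', 'assistant', 'user' ]  (the oldest user question no longer fits)
```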
Concept B: Temperature Control
When predicting the next token, the model calculates probabilities for many words. The Temperature setting controls how the model samples from these probabilities:
- Low Temperature (e.g., 0.1-0.3): The model almost always selects the word with the highest probability. The output is logical, focused, and close to deterministic. Good for: Writing code, analyzing data, answering factual questions.
- High Temperature (e.g., 0.8-1.2): The model picks lower-probability options more often. The output becomes more creative and diverse, but also more prone to errors or "hallucinations". Good for: Brainstorming, writing stories, roleplay.
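Under the hood, temperature is usually applied by dividing the model's raw scores (logits) by the temperature before converting them into probabilities. The sketch below uses made-up logits for three candidate tokens to show how the distribution sharpens or flattens. The names and numbers are assumptions for illustration.

```javascript
// Convert raw model scores (logits) into probabilities, scaled by temperature.
// Temperature must be greater than 0 here, since we divide by it.
function softmaxWithTemperature(logits, temperature) {
  const scaled = logits.map((logit) => logit / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((value) => Math.exp(value - max));
  const total = exps.reduce((sum, value) => sum + value, 0);
  return exps.map((value) => value / total);
}

// Draw one index at random according to the probabilities.
function sample(probabilities, random = Math.random) {
  const threshold = random();
  let cumulative = 0;
  for (let i = 0; i < probabilities.length; i++) {
    cumulative += probabilities[i];
    if (threshold < cumulative) return i;
  }
  return probabilities.length - 1; // guard against rounding error
}

// Hypothetical logits for three candidate next tokens.
const candidates = ["function", "variable", "banana"];
const logits = [2.0, 1.0, 0.1];

const cold = softmaxWithTemperature(logits, 0.2);
const hot = softmaxWithTemperature(logits, 1.2);

console.log(cold.map((p) => p.toFixed(3))); // almost all probability on "function"
console.log(hot.map((p) => p.toFixed(3)));  // probability spread across all three
console.log(candidates[sample(hot)]);       // varies from run to run
```

At low temperature the top candidate wins almost every time, while at high temperature even "banana" gets a real chance of being chosen.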
7. The End-to-End LLM Workflow
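To summarize everything, here is how a raw text input travels through an LLM to return a response:
- Tokenization: The raw text is split into tokens, and each token is mapped to a numerical Token ID.
- Embedding: Each Token ID is mapped to a vector.
- Transformer: Self-attention updates each vector using the context of the surrounding tokens.
- Prediction: The model calculates a probability for every candidate next token, and one is sampled according to the temperature.
- Loop: The chosen token is appended to the input, and the process repeats until the response is complete.
- Decoding: The Token IDs are translated back into readable text and streamed to your browser.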
8. The JavaScript Perspective
As you begin your journey of GenAI with JavaScript, here are two examples demonstrating how to apply these concepts in code.
Example 1: Tokenization in Node.js
We can perform tokenization locally in JavaScript using different libraries, depending on which family of models we are working with.
Option A: Tokenizing with tiktoken (OpenAI GPT Models)
OpenAI uses the tiktoken library under the hood. In JavaScript, we can install the tiktoken package from npm and run the following code:
import { get_encoding } from "tiktoken";

// Get the encoder for the classic GPT-2 model (or use 'cl100k_base' for GPT-4/GPT-3.5)
const encoder = get_encoding("gpt2");

// Encode raw human text into token IDs (returned as a Uint32Array)
const encoded = encoder.encode("Hello i am Pratap Das");
console.log("Encoded Token IDs:", encoded);
// Example output: Uint32Array(6) [ 15496, 1318, 716, 33261, 4232, 299 ]

// Decode the token IDs back into UTF-8 bytes, then turn the bytes into a string
const decodedBytes = encoder.decode(encoded);
const decodedText = new TextDecoder().decode(decodedBytes);
console.log("Decoded Text:", decodedText);
// Output: "Hello i am Pratap Das"

// Free the encoder memory when done
encoder.free();

Option B: Tokenizing with @xenova/transformers (Llama/Hugging Face Models)
For open-weights models (like Llama or Gemma), we can use Transformers.js, Hugging Face's JavaScript library, published on npm as @xenova/transformers (newer releases are published as @huggingface/transformers):
import { AutoTokenizer } from '@xenova/transformers';

async function tokenizeText() {
  // Load the tokenizer for the Llama 3 model
  const tokenizer = await AutoTokenizer.from_pretrained('Xenova/llama-3-tokenizer');
  const text = "ChatGPT is amazing!";

  // Encode the text into token IDs
  const tokenIds = await tokenizer.encode(text);
  console.log("Token IDs:", tokenIds);
  // Output: [1294, 76, 2983, 310, 8943, 0] (Values vary by vocabulary)

  // Detokenize individual IDs back to text fragments
  for (const id of tokenIds) {
    const textFragment = await tokenizer.decode([id]);
    console.log(`ID ${id} -> "${textFragment}"`);
  }
}

tokenizeText().catch(console.error);

Example 2: Calling an LLM with Temperature Config (using @google/genai)
When interfacing with Google's Gemini models in JavaScript, you can adjust settings like temperature and maxOutputTokens directly in the request's config object:
import { GoogleGenAI } from '@google/genai';

// Initialize the Google Gen AI client with the key from the GEMINI_API_KEY environment variable
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function askAI() {
  try {
    const response = await ai.models.generateContent({
      model: 'gemini-1.5-flash',
      contents: 'Write a creative title for a JavaScript Generative AI tutorial.',
      config: {
        // Higher temperature (e.g., 0.9) generates more creative and unexpected results
        temperature: 0.9,
        // Caps the number of tokens in the generated response to save cost/latency
        maxOutputTokens: 100,
      }
    });
    console.log("Response:", response.text);
  } catch (error) {
    console.error("Error communicating with Gemini API:", error);
  }
}

askAI();

[!NOTE] Summary Checklist:
- LLMs predict the next token based on statistical patterns.
- Computers process text by mapping character fragments to numerical Token IDs and mapping those to embeddings.
- Transformers use Self-Attention to evaluate relationships between all words in a sentence at once.
- Temperature controls creativity, while the Context Window limits active working memory.