How ChatGPT Understands Your Questions

Every day, millions of people type questions into ChatGPT, Gemini, or Claude and receive instant, human-like answers. To the untrained eye, it feels like magic—as if there is a tiny, highly intelligent human sitting inside the computer who understands English, Spanish, or JavaScript.

But in reality, computers do not understand human language at all.

In this article, we will peel back the layers of Large Language Models (LLMs) to explore the fascinating mathematics, architectures, and data structures that allow ChatGPT to process your prompts and generate responses.

1. What is an LLM?

LLM stands for Large Language Model. Let's break down these three words:

Large: Refers to the scale of the neural network. Modern models contain billions (sometimes trillions) of adjustable settings called parameters and are trained on massive datasets comprising books, websites, articles, and code repositories.
Language: Refers to the domain. These models are specifically built to understand, generate, translate, and reason about human or programming languages.
Model: A mathematical representation of a system. An LLM is a very large mathematical function that maps input sequences to output sequences.

What Problems Do LLMs Solve?

LLMs are versatile general-purpose text processors. They solve problems that were historically extremely difficult for classical code, including:

Processing Unstructured Data: Extracting structured JSON data from messy, unstructured text.
Reasoning & Planning: Following step-by-step logic to solve math, logic, or programming problems.
Translation & Summarization: Translating between natural languages or summarizing a 100-page document into key bullet points.
Code Generation: Writing, debugging, and explaining programming code across dozens of languages.

Popular Examples of LLMs

Today, several companies lead the frontier of LLM development:

Creator	Model Series	Access Details
OpenAI	GPT-4o / o1	Commercial, powers ChatGPT
Google	Gemini 1.5 Pro / Flash	Commercial, features massive context windows
Anthropic	Claude 3.5 Sonnet	Commercial, renowned for coding & reasoning
Meta	Llama 3 / 3.1	Open-weights, can be downloaded and run locally

2. What Happens When You Send a Message to ChatGPT?

When you type a prompt and press "Enter," a series of computational stages takes place within moments:

The Generation Pipeline

Typing a Prompt: You submit raw text (e.g., "Explain closures in JavaScript").
Processing Your Message: The text is broken down into numbers (tokenization) and sent to a server farm hosting the model's neural network.
Generating a Response: The model processes the input numbers and scores every token in its vocabulary as a possible continuation, then selects one next token from the most probable candidates. Once that token is chosen, it is appended to the input, and the model runs again to predict the next token. This repeating loop is called autoregressive generation.
Displaying the Text: The numbers are translated back into readable text and streamed to your browser window.
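The generation loop described above can be sketched in a few lines. This is a toy illustration, not how any real model is implemented: `predictNextToken` and `nextTokenTable` are hypothetical stand-ins that look up a hard-coded table instead of running a neural network.

```javascript
// Hypothetical stand-in for the model: maps the last token to its "most likely" successor.
const nextTokenTable = {
  "<start>": "Closures",
  "Closures": "capture",
  "capture": "variables",
  "variables": "<end>",
};

function predictNextToken(tokens) {
  const last = tokens[tokens.length - 1];
  return nextTokenTable[last] ?? "<end>";
}

// Autoregressive generation: predict one token, append it, and run again.
function generate(promptTokens, maxNewTokens = 10) {
  const tokens = [...promptTokens];
  for (let i = 0; i < maxNewTokens; i++) {
    const next = predictNextToken(tokens);
    if (next === "<end>") break; // the "model" signals that it is done
    tokens.push(next);
  }
  return tokens;
}

console.log(generate(["<start>"]).join(" "));
// prints "<start> Closures capture variables"
```

The important detail is that each prediction sees everything generated so far, which is why the output is produced one token at a time.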

Why Responses Are Not Copied from the Internet

A common misconception is that ChatGPT acts like Google Search—scanning the internet, finding an article, and copy-pasting it.

This is not how it works.

LLMs do not have an active database of documents or search results inside their brains. Instead, during training, the model reads billions of pages and adjusts its internal connections (weights) to learn the patterns of language. When generating a response, the model is constructing a brand-new sequence of words, calculating which word makes the most sense next, based entirely on probability. It is a highly advanced, context-aware version of your smartphone's predictive keyboard.

3. Why Computers Don't Understand Human Language

Computers are machines built from silicon transistors that operate on binary states: 0 (off) and 1 (on). They excel at calculations—adding, multiplying, and dividing numbers at lightning speed.

However, a computer has no concept of what a word is. To a processor, the word "cat" is just a sequence of characters (c, a, t), which are represented under the hood by character codes (like ASCII/UTF-8: 99, 97, 116). The computer doesn't know that a cat is a small, furry, four-legged animal that meows.

"cat" ---> [ 99, 97, 116 ] ---> But what is the relationship between "cat" and "dog"?

To make relationships between words calculable, we must convert words into a format that supports math.
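To see why raw character codes are not that format, it helps to print them. The short sketch below uses only built-in JavaScript; the helper name `toCodes` is made up for this example.

```javascript
// Convert each character of a word to its Unicode code point.
const toCodes = (word) => Array.from(word, (ch) => ch.codePointAt(0));

console.log(toCodes("cat")); // [ 99, 97, 116 ]
console.log(toCodes("car")); // [ 99, 97, 114 ]
console.log(toCodes("dog")); // [ 100, 111, 103 ]
```

Numerically, "cat" is almost identical to "car" and nothing like "dog", even though a cat has far more in common with a dog than with a car. Character codes capture spelling, not meaning.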

Word Embeddings: Multi-Dimensional Spaces

LLMs solve this using embeddings. Every word (or part of a word) is assigned a list of numbers called a vector. This vector represents a coordinate in a massive, multi-dimensional space (often spanning 1,536 to 4,096 dimensions).

In this vector space, words with similar meanings are positioned close together:

"king" and "queen" will have vectors that point in very similar directions.
We can perform vector arithmetic: king - man + woman ≈ queen.

By converting words into vectors, we translate human semantics into linear algebra.
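The arithmetic above can be tried out with hand-made vectors. The sketch below is purely illustrative: the four-dimensional embeddings are invented for this example (real embeddings are learned during training and have thousands of dimensions), and the helper names are hypothetical.

```javascript
// Toy embeddings; the axes loosely mean [royalty, male, female, fruit].
const embeddings = {
  king:  [1, 1, 0, 0],
  queen: [1, 0, 1, 0],
  man:   [0, 1, 0, 0],
  woman: [0, 0, 1, 0],
  apple: [0, 0, 0, 1],
};

const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
const norm = (a) => Math.sqrt(dot(a, a));
// Cosine similarity: close to 1 = same direction, 0 = unrelated directions.
const cosineSimilarity = (a, b) => dot(a, b) / (norm(a) * norm(b));

// king - man + woman, computed element by element.
const target = embeddings.king.map(
  (x, i) => x - embeddings.man[i] + embeddings.woman[i]
);

// Return the word whose vector is most similar to `vector`,
// skipping the words that were used to build the query.
function nearestWord(vector, exclude = []) {
  let best = null;
  let bestScore = -Infinity;
  for (const [word, emb] of Object.entries(embeddings)) {
    if (exclude.includes(word)) continue;
    const score = cosineSimilarity(vector, emb);
    if (score > bestScore) {
      best = word;
      bestScore = score;
    }
  }
  return best;
}

console.log(cosineSimilarity(embeddings.king, embeddings.queen).toFixed(2)); // "0.50"
console.log(cosineSimilarity(embeddings.king, embeddings.apple).toFixed(2)); // "0.00"
console.log(nearestWord(target, ["king", "man", "woman"])); // "queen"
```

Even in this tiny space, "king" is closer to "queen" than to "apple", and the arithmetic lands on "queen".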

4. Tokenization: The Bridge Between Text and Numbers

Before text can be turned into vectors, it must first be chopped into manageable pieces. This process is called tokenization.

What is a Token?

A token is a textual chunk. It is the basic unit of currency for an LLM. A token is not necessarily a full word. It can be:

A whole word (e.g., "apple")
A sub-word chunk (e.g., "token" and "ization" for "tokenization")
A single character or punctuation mark (e.g., "." or ",")

Why Tokenization is Needed

Vocabulary Size Control: If we treated every single word as a unique token, the vocabulary would be millions of items long, and the model would still struggle with typos, plurals, and newly coined words.
Handling Unknown Words: Sub-word tokenization allows the model to break down a completely new word into fragments it does recognize. For example, if it doesn't know "unbelievability", it can break it down into ["un", "believ", "ability"].
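The idea of breaking an unknown word into known fragments can be sketched as follows. Note the substitution: production tokenizers usually learn their fragments with algorithms such as Byte-Pair Encoding, whereas this sketch uses a simpler greedy longest-match rule over a tiny made-up vocabulary. The function name `splitIntoSubwords` is hypothetical.

```javascript
// A tiny, hand-picked vocabulary (an assumption for this example).
const vocabulary = new Set(["un", "believ", "ability", "token", "ization"]);

// Greedy longest-match: at each position, take the longest fragment found in
// the vocabulary; fall back to a single character when nothing matches.
function splitIntoSubwords(word, vocab) {
  const pieces = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    while (end > start + 1 && !vocab.has(word.slice(start, end))) {
      end--;
    }
    pieces.push(word.slice(start, end));
    start = end;
  }
  return pieces;
}

console.log(splitIntoSubwords("unbelievability", vocabulary));
// [ 'un', 'believ', 'ability' ]
console.log(splitIntoSubwords("tokenization", vocabulary));
// [ 'token', 'ization' ]
console.log(splitIntoSubwords("xyz", vocabulary));
// [ 'x', 'y', 'z' ]
```

The single-character fallback is what guarantees that any input can be tokenized, even a word the vocabulary has never seen.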

Words vs. Tokens

As a rule of thumb for English text:

1 Token ≈ 4 characters of text.
100 English Words ≈ 130 Tokens.
0.75 Words ≈ 1 Token.
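These ratios are handy for quick budgeting before you have a real tokenizer at hand. The helpers below are a rough sketch with made-up names; they only encode the rules of thumb above and will be off for code, non-English text, or unusual formatting.

```javascript
// Rule of thumb: 1 token is roughly 4 characters of English text.
const estimateTokensFromChars = (text) => Math.ceil(text.length / 4);

// Rule of thumb: 1 token is roughly 0.75 English words.
const estimateTokensFromWords = (wordCount) => Math.round(wordCount / 0.75);

console.log(estimateTokensFromChars("ChatGPT is amazing!")); // 5 (19 characters / 4, rounded up)
console.log(estimateTokensFromWords(100)); // 133
```

For billing or hard limits, count with the model's actual tokenizer instead of an estimate.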

Tokenization Example

Consider the sentence: "ChatGPT is amazing!" A tokenizer might break this down as follows:

Fragment	Token ID	Notes
`Chat`	`29437`	Common sub-word
`G`	`40`	Capital letter
`PT`	`9801`	Acronym part
`is`	`318`	Includes the preceding space
`amazing`	`4983`	Common word with space
`!`	`0`	Punctuation token

5. Transformers: The Engine of Modern AI

Almost every famous LLM (GPT, Gemini, Claude, Llama) is built on a specific neural network architecture called the Transformer.

Introduced by Google researchers in their landmark 2017 paper "Attention Is All You Need", the Transformer architecture completely revolutionized the field of Artificial Intelligence.

Why Transformers Changed AI

Before Transformers, AI models processed language sequentially (word-by-word) using recurrent architectures (RNNs and LSTMs). If a sentence had 50 words, the model had to process word 1, then word 2, all the way to word 50.

The Problem: Sequential processing cannot be easily parallelized on GPU chips, making training slow. More importantly, by the time the model got to word 50, it had often "forgotten" the context of word 1.
The Transformer Solution: Transformers process the entire sentence all at once (in parallel). This makes training extremely fast and allows models to scale to trillions of words.

The Self-Attention Mechanism

The magic ingredient of a Transformer is called Self-Attention.

In any language, the meaning of a word depends heavily on its context. Self-attention allows the model to calculate how much "attention" one token should pay to other tokens in the same sentence to resolve its meaning.

Consider these two sentences:

"She deposited her money at the bank."
"The children played on the river bank."

Sentence 1: bank ─── (pays attention to) ───> money
Sentence 2: bank ─── (pays attention to) ───> river

In the first sentence, the self-attention mechanism links the word bank to money, resulting in a vector representing a financial institution. In the second sentence, self-attention links bank to river, creating a vector representing land next to water.
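The "bank" example can be reproduced numerically with scaled dot-product attention, the formula from the Transformer paper. The sketch below makes several simplifying assumptions: the two-dimensional vectors are invented, a single attention head is used, the learned query/key/value projection matrices are omitted (the raw vectors play all three roles), and "bank" does not attend to itself.

```javascript
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

// Turn raw scores into weights that are positive and sum to 1.
function softmax(scores) {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Scaled dot-product attention for a single query token.
function attend(query, keys, values) {
  const scale = Math.sqrt(query.length);
  const weights = softmax(keys.map((k) => dot(query, k) / scale));
  // The output is the weighted average of the value vectors.
  const output = values[0].map((_, d) =>
    values.reduce((sum, v, i) => sum + weights[i] * v[d], 0)
  );
  return { weights, output };
}

// Made-up vectors; axis 0 ~ "finance", axis 1 ~ "nature".
const bank = [1, 1]; // ambiguous on its own
const sentence1 = { she: [0.1, 0.1], deposited: [0.6, 0.0], money: [1.0, 0.0] };
const sentence2 = { children: [0.1, 0.1], played: [0.1, 0.2], river: [0.0, 1.0] };

const result1 = attend(bank, Object.values(sentence1), Object.values(sentence1));
const result2 = attend(bank, Object.values(sentence2), Object.values(sentence2));

console.log(Object.keys(sentence1), result1.weights.map((w) => w.toFixed(2)));
console.log("bank in sentence 1:", result1.output.map((x) => x.toFixed(2)));
console.log(Object.keys(sentence2), result2.weights.map((w) => w.toFixed(2)));
console.log("bank in sentence 2:", result2.output.map((x) => x.toFixed(2)));
```

The same starting vector for "bank" ends up leaning toward the finance axis in the first sentence and toward the nature axis in the second, because the largest attention weight goes to "money" and "river" respectively.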

How Transformers Process Text Internally

When a prompt enters a Transformer network, it travels through a stacked sequence of mathematical processing layers:

Input Encoding & Positional Embeddings: Tokens are mapped to dense vector coordinates. Since Transformers process all tokens simultaneously, they lose natural word order. To fix this, Positional Encodings (mathematical wave patterns) are added to the vectors to tag the exact order of each word.
Multi-Head Attention Layers: The vectors pass through multiple "attention heads". Each head focuses on a different aspect of relationships—such as subject-verb agreement, pronoun references (e.g., determining what "it" refers to), or emotional sentiment.
Feed-Forward Networks (FFN): After computing self-attention, each token vector is passed through a dense feed-forward neural network. This layer acts as the model's "factual database," where key facts, associations, and learned knowledge are recalled and integrated.
Normalization & Residual Connections: Around each attention and feed-forward sub-layer, a residual connection adds the sub-layer's input back to its output, and layer normalization keeps the values in a stable numeric range. Together these ensure that the model does not lose the initial user context as it goes deeper.
Output Probabilities (Logits): The final representation is projected back to match the vocabulary size, yielding raw scores (logits) that indicate which tokens are the most logical continuations.
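The "mathematical wave patterns" from the first step can be made concrete. The sketch below implements the sinusoidal positional encoding from "Attention Is All You Need"; many current models use other schemes (learned or rotary position embeddings), so treat this as one classic example. The embedding values and the function name are made up for the illustration.

```javascript
// Sinusoidal positional encoding: even indices use sine, odd indices use cosine,
// with wavelengths that grow geometrically across the vector.
function positionalEncoding(position, dModel) {
  const encoding = new Array(dModel).fill(0);
  for (let i = 0; i < dModel; i += 2) {
    const angle = position / Math.pow(10000, i / dModel);
    encoding[i] = Math.sin(angle);
    if (i + 1 < dModel) encoding[i + 1] = Math.cos(angle);
  }
  return encoding;
}

// The same token embedding (made-up values) at two different positions.
const tokenEmbedding = [0.2, -0.1, 0.4, 0.3];
const addPosition = (embedding, position) =>
  embedding.map((x, i) => x + positionalEncoding(position, embedding.length)[i]);

console.log(positionalEncoding(0, 4)); // [ 0, 1, 0, 1 ]
console.log(addPosition(tokenEmbedding, 0));
console.log(addPosition(tokenEmbedding, 5));
```

Because the added pattern differs for every position, the same word at position 0 and position 5 enters the attention layers as two different vectors, which is how word order survives parallel processing.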

6. Key LLM Mechanics

To build applications with LLMs, developers must understand three critical concepts: Context Windows, Temperature, and Top-P.

Concept A: The Context Window (And Exceeding Limits)

An LLM has a strict limit on the total number of tokens it can read and write in a single interaction. This boundary is called the Context Window.

Think of the context window as the model's active working memory. It must hold:

The System Instructions (e.g., "You are a helpful coding assistant").
The Conversation History (all previous questions and answers in this thread).
The New Prompt you just typed.
The Generated Response (which occupies slots as it is generated).

What Happens When You Exceed the Context Window?

Depending on how the LLM interface or application is designed, exceeding the token limit leads to three main outcomes:

The "Sliding Window" Effect (Forgetting): Chat applications such as ChatGPT typically drop or truncate the oldest message turns to make room for your new prompt. This is why an AI might suddenly forget your name or instructions given at the very start of a long conversation.
API Execution Error: If you call an LLM programmatically via an API (such as OpenAI or Google Gemini), a request that exceeds the model's context limit is rejected, typically with an HTTP 400 Bad Request error (OpenAI, for example, reports the error code context_length_exceeded), and your code must handle that failure.
Cost and Latency Spike: Processing more tokens increases the computational load. This raises your API usage billing and increases the delay (latency) before the model begins responding.
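The "sliding window" behavior is easy to reproduce in your own application. The sketch below is one possible approach, not how any particular product works: it always keeps the system message, then keeps the newest turns that fit the budget. The function names are hypothetical, and tokens are estimated with the rough "4 characters per token" rule instead of a real tokenizer.

```javascript
// Rough estimate (assumption): 1 token is about 4 characters.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Keep the system message plus as many of the most recent turns as fit.
function trimToContextWindow(messages, maxTokens) {
  const [system, ...turns] = messages;
  const kept = [];
  let used = estimateTokens(system.content);
  // Walk backwards so the newest turns are kept first.
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = estimateTokens(turns[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(turns[i]);
    used += cost;
  }
  return [system, ...kept];
}

const history = [
  { role: "system", content: "You are a helpful coding assistant" },
  { role: "user", content: "My name is Ada." },
  { role: "assistant", content: "Nice to meet you, Ada!" },
  { role: "user", content: "Explain closures in JavaScript" },
];

// With a tiny budget of 24 tokens, the oldest user turn no longer fits.
console.log(trimToContextWindow(history, 24).map((m) => m.content));
```

In this run the message containing the user's name is the one that gets dropped, which is exactly the "forgetting" effect described above. Remember to reserve part of the budget for the response as well.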

Concept B: Controlling Randomness (Temperature and Top-P)

When predicting the next token, the model calculates a probability score for thousands of words. Developers use two settings to control how the model chooses from these probabilities:

1. Temperature

Temperature rescales the model's raw scores (logits) before they are converted into probabilities: each logit is divided by the temperature value.

Low Temperature (e.g., 0.1 - 0.3): Sharpens the distribution, pushing low probabilities toward zero and inflating the top choice. The model becomes nearly deterministic and predictable. Good for: Writing code, mathematical logic, factual retrieval.
High Temperature (e.g., 0.8 - 1.2): Flattens the difference between choices, making lower-probability tokens much more likely to be selected. The output becomes more creative, diverse, and unpredictable, but also more prone to "hallucinations" (inventing false facts). Good for: Brainstorming, creative writing, roleplay.
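The effect of temperature is easiest to see on a small set of scores. The sketch below applies the standard temperature-scaled softmax to three made-up logits; the function name is hypothetical, and a temperature of 2.0 is used only to exaggerate the flattening.

```javascript
// Divide each logit by the temperature, then apply softmax.
function softmaxWithTemperature(logits, temperature) {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const logits = [2.0, 1.0, 0.1]; // made-up scores for three candidate tokens
const show = (t) => softmaxWithTemperature(logits, t).map((p) => p.toFixed(3));

console.log(show(0.2)); // [ '0.993', '0.007', '0.000' ]  almost always the top token
console.log(show(1.0)); // [ '0.659', '0.242', '0.099' ]  the unmodified distribution
console.log(show(2.0)); // [ '0.502', '0.304', '0.194' ]  much flatter
```

The ranking of the tokens never changes; temperature only changes how strongly the sampler prefers the top-ranked one.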

2. Top-P (Nucleus Sampling)

Top-P controls the size of the token pool from which the model can sample, based on cumulative probability. For example, if top_p is set to 0.9 (90%), the model ranks the candidates from most to least probable, keeps adding them to the pool until their combined probability reaches 90%, and discards everything else.

If Top-P = 0.1: The model only considers tokens in the top 10% probability mass. This behaves similarly to a low temperature.
If Top-P = 0.9: The model considers candidates making up the top 90%. It prunes the bottom 10% (the highly obscure or grammatically incorrect options), protecting the model from generating absolute gibberish.
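Nucleus filtering can be sketched in a few lines. The candidate tokens and probabilities below are invented for the example, and `topPFilter` is a hypothetical helper name; real inference engines apply the same idea to the full vocabulary.

```javascript
// Keep the most probable tokens until their cumulative probability reaches topP,
// then renormalize so the kept probabilities sum to 1 again.
function topPFilter(candidates, topP) {
  const sorted = [...candidates].sort((a, b) => b.prob - a.prob);
  const kept = [];
  let cumulative = 0;
  for (const candidate of sorted) {
    kept.push(candidate);
    cumulative += candidate.prob;
    if (cumulative >= topP) break;
  }
  return kept.map((c) => ({ token: c.token, prob: c.prob / cumulative }));
}

const candidates = [
  { token: "the", prob: 0.5 },
  { token: "a", prob: 0.3 },
  { token: "cat", prob: 0.15 },
  { token: "zzz", prob: 0.05 },
];

console.log(topPFilter(candidates, 0.9).map((c) => c.token)); // [ 'the', 'a', 'cat' ]
console.log(topPFilter(candidates, 0.1).map((c) => c.token)); // [ 'the' ]
```

With Top-P at 0.9 the obscure "zzz" option is pruned, and with Top-P at 0.1 only the single most likely token survives, which matches the two cases described above.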

Best Practice for Randomness Control

To fine-tune randomness:

[!IMPORTANT] Combine with Care: Developers usually adjust either Temperature or Top-P and leave the other at its default. If you do combine them, for example for creative writing with a safety guard, keep Top-P at 0.9 and raise Temperature to about 0.85.

7. The End-to-End LLM Workflow

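To summarize everything, here is how a raw text input travels through an LLM to return a response:

Raw text ---> Tokenization (token IDs) ---> Embeddings + positional encodings ---> Transformer layers (self-attention + feed-forward) ---> Logits ---> Sampling (Temperature / Top-P) ---> Next token ---> (append to the input and repeat) ---> Detokenized text streamed to you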

8. The JavaScript Perspective

As you begin your journey of GenAI with JavaScript, here are two examples demonstrating how to apply these concepts in code.

Example 1: Tokenization in Node.js

We can perform tokenization locally in JavaScript using different libraries, depending on which family of models we are working with.

Option A: Tokenizing with `tiktoken` (OpenAI GPT Models)

OpenAI uses the tiktoken library under the hood. In JavaScript, we can install the tiktoken package from npm and run the following code:

import { get_encoding } from "tiktoken";

// Get the encoder for the classic GPT-2 model (or use 'cl100k_base' for GPT-4/GPT-3.5)
const encoder = get_encoding("gpt2");

// Encode raw human text into token IDs (returns a Uint32Array)
const tokenIds = encoder.encode("Hello i am Pratap Das");
console.log("Encoded Token IDs:", tokenIds);
// Output: Uint32Array(6) [ 15496,  1318,   716, 33261,  4232,   299 ]

// decode() returns raw UTF-8 bytes (a Uint8Array), so convert them to a string with TextDecoder
const decodedBytes = encoder.decode(tokenIds);
const decodedText = new TextDecoder().decode(decodedBytes);
console.log("Decoded Text:", decodedText);
// Output: "Hello i am Pratap Das"

// Free the encoder's WebAssembly memory when done
encoder.free();

Option B: Tokenizing with `@xenova/transformers` (Llama/Hugging Face Models)

For open-weights models (like Llama or Gemma), we can use Hugging Face's Transformers.js library, published on npm as @xenova/transformers:

import { AutoTokenizer } from '@xenova/transformers';
 
async function tokenizeText() {
  // Load the tokenizer for the Llama 3 model
  const tokenizer = await AutoTokenizer.from_pretrained('Xenova/llama-3-tokenizer');
 
  const text = "ChatGPT is amazing!";
  
  // Encode the text into token IDs
  const tokenIds = await tokenizer.encode(text);
  console.log("Token IDs:", tokenIds); 
  // Output: [1294, 76, 2983, 310, 8943, 0] (Values vary by vocabulary)
 
  // Detokenize individual IDs back to words
  for (let id of tokenIds) {
    const textFragment = await tokenizer.decode([id]);
    console.log(`ID ${id} -> "${textFragment}"`);
  }
}
 
tokenizeText();

Example 2: Calling an LLM with Temperature Config (using `@google/genai`)

When interfacing with Google's Gemini models in JavaScript, you can adjust settings like temperature and maxOutputTokens directly in the request's config object:

import { GoogleGenAI } from '@google/genai';
 
// Initialize the Google Gen AI client with the API key from the GEMINI_API_KEY environment variable
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
 
async function askAI() {
  try {
    const response = await ai.models.generateContent({
      model: 'gemini-1.5-flash',
      contents: 'Write a creative title for a JavaScript Generative AI tutorial.',
      config: {
        // Higher temperature (e.g., 0.9) generates more creative and unexpected results
        temperature: 0.9,
        // Top-P (e.g., 0.95) limits the candidate token pool to top 95% cumulative probability
        topP: 0.95,
        // Caps the number of tokens the model may generate, which bounds cost and latency
        maxOutputTokens: 100,
      }
    });
 
    console.log("Response:", response.text);
  } catch (error) {
    console.error("Error communicating with Gemini API:", error);
  }
}
 
askAI();

[!NOTE] Summary Checklist:

LLMs predict the next token based on statistical patterns.

Computers process text by mapping character fragments to numerical Token IDs and mapping those to embeddings.

Transformers use Self-Attention to evaluate relationships between all words in a sentence at once.

Temperature controls creativity, while the Context Window limits active working memory.
