<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://giordanorogers.github.io//feed.xml" rel="self" type="application/atom+xml" /><link href="https://giordanorogers.github.io//" rel="alternate" type="text/html" /><updated>2026-04-08T01:29:18+00:00</updated><id>https://giordanorogers.github.io//feed.xml</id><title type="html">Giordano Rogers</title><subtitle>Researcher @ Northeastern</subtitle><author><name>Giordano Rogers</name><email>rogers.gi@northeastern.edu</email></author><entry><title type="html">AI Mediated Mediocrity</title><link href="https://giordanorogers.github.io//posts/2025/12/ai_mediated_mediocrity/" rel="alternate" type="text/html" title="AI Mediated Mediocrity" /><published>2025-12-14T00:00:00+00:00</published><updated>2025-12-14T00:00:00+00:00</updated><id>https://giordanorogers.github.io//posts/2025/12/ai_mediated_mediocrity</id><content type="html" xml:base="https://giordanorogers.github.io//posts/2025/12/ai_mediated_mediocrity/"><![CDATA[<p>The first time I talked to ChatGPT, it blew my mind. Talking to a computer. Getting real answers. Learning something. It was life-changing. It pushed me, a dropout, back into school for computer science.</p>

<p>More than that, it gave me a new dream when my old one died. The pursuit to understand intelligence. To improve the world.</p>

<p>On that journey, language models became mentors, teammates, friends. I spent hundreds of hours talking to them. I read textbooks, did research, even gave lectures in my school’s AI club. Now, I finally know enough to see limitations I was blind to at first.</p>

<p>The core problem: LLMs are excellent at removing struggle. And struggle is where growth happens.</p>

<p><strong>Take coding.</strong> LLMs make it easy to get unstuck. Too easy. I’m a decent coder. But I’d be better if I struggled more. Using LLMs helped me move fast, keep a 4.0, and still live my life. But it also robbed me of hard moments I should have sat with. Moments where understanding solidifies.</p>

<p>I don’t regret using them. But I wish I made more of a habit of using them as tutors rather than laborers.</p>

<p>So I’m changing my approach. I’m allocating more time for long, uninterrupted coding blocks. Sitting with confusion. Reading the docs. Using LLMs only for small, narrow questions. Moving slower. Growing faster.</p>

<p><strong>Take idea generation.</strong> LLMs are even more disabling here. Ideas don’t arrive clean. They start ugly. They take work. When I ask an LLM for good ideas, I circumvent that. I get something polished but not my own. Presentable but dead.</p>

<p>Your ideas have to come from you. If you want better ideas, focus on improving your mind, not your tools. Good tools can help shape ideas. They can’t have them for you. Think more. Write more. That’s what I’m resolving to do. Letting ideas surface on their own. No assistant. Just me and my text editor.</p>

<p><strong>Take research.</strong> Same pattern, worse consequences. I’ve been doing research for a year now. During that time, I’ve leaned a lot on LLMs to think with, to plan, to code. The code was often wrong. The ideas were often flat. I chased bad directions for months at a time. I blame a combination of poor prompting on my part and terrible sycophancy on the LLMs’.</p>

<p>LLMs haven’t made me a better researcher. But they did make me feel like one, at my own expense.</p>

<p>So I’m pulling back. I’m spending my winter break going AI-minimal. Me, my mind, and my keyboard. Nothing external until I’ve tried and failed alone.</p>

<p>I’m not anti-AI. I just want more agency. LLMs are good for particular things. They’re leverage amplifiers. But right now, I don’t need to expand my leverage. I need to expand my competence.</p>

<p>My prediction for these three weeks: less output, more learning, more clarity. That’s a trade I’m willing to make.</p>]]></content><author><name>Giordano Rogers</name><email>rogers.gi@northeastern.edu</email></author><category term="AI" /><category term="Productivity" /><category term="Research" /><summary type="html"><![CDATA[The first time I talked to ChatGPT, it blew my mind. Talking to a computer. Getting real answers. Learning something. It was life-changing. It pushed me, a dropout, back into school for computer science.]]></summary></entry><entry><title type="html">On creativity</title><link href="https://giordanorogers.github.io//posts/2025/08/what_is_creativity/" rel="alternate" type="text/html" title="On creativity" /><published>2025-08-18T00:00:00+00:00</published><updated>2025-08-18T00:00:00+00:00</updated><id>https://giordanorogers.github.io//posts/2025/08/what_is_creativity</id><content type="html" xml:base="https://giordanorogers.github.io//posts/2025/08/what_is_creativity/"><![CDATA[<p>I define the following variables:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">y</span> <span class="o">=</span> <span class="mi">6</span>
</code></pre></div></div>
<p>Does the information that x and y are both even exist?</p>

<p>In one sense, yes.
The fact that x is 2 and y is 6 logically entails that both are even.
This is <em>implicit information</em>.
It exists in the sense that any reasoning system with the rules of arithmetic could derive it.</p>

<p>But in another sense, no.
The computer itself doesn’t <em>know</em> they’re even unless we encode that knowledge.
All it stores are the numbers 2 and 6 in an address associated with the variables x and y.
The property “evenness” only exists when checked, represented, and stored.</p>

<p>This is the distinction between <strong>potential information</strong> and <strong>actualized information</strong>.</p>

<p>Now I write a function and check the condition:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">both_even</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">):</span>
	<span class="k">return</span> <span class="n">x</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="n">y</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span>
<span class="n">both_even_x_y</span> <span class="o">=</span> <span class="n">both_even</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
</code></pre></div></div>
<p>Now the information that x and y are both even exists.</p>

<p>I’ve taken an implicit relation and made it explicit by constructing a new variable that records the result.
This information is not <em>logically</em> new.
It is a <em>representation of latent information</em> in an accessible form.</p>

<p>But this example isn’t obviously relevant to real-world creativity.
Let’s get more abstract and show that this principle generalizes beyond numbers to attributes as well.</p>

<p>I define variables “Alice” and “Bob” with the following qualities:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Alice</span> <span class="o">=</span> <span class="p">{</span><span class="s">"tall"</span><span class="p">,</span> <span class="s">"rich"</span><span class="p">,</span> <span class="s">"fast"</span><span class="p">}</span>
<span class="n">Bob</span> <span class="o">=</span> <span class="p">{</span><span class="s">"short"</span><span class="p">,</span> <span class="s">"rich"</span><span class="p">,</span> <span class="s">"slow"</span><span class="p">}</span>
</code></pre></div></div>
<p>The information that <em>Alice and Bob are both rich</em> only exists implicitly.
To make it explicit, I have to check it and store it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">both_rich</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">):</span>
	<span class="k">return</span> <span class="s">"rich"</span> <span class="ow">in</span> <span class="n">x</span> <span class="ow">and</span> <span class="s">"rich"</span> <span class="ow">in</span> <span class="n">y</span>
<span class="n">both_rich_Alice_Bob</span> <span class="o">=</span> <span class="n">both_rich</span><span class="p">(</span><span class="n">Alice</span><span class="p">,</span> <span class="n">Bob</span><span class="p">)</span>
</code></pre></div></div>

<p>Is this checking and storing process creative?
Let’s define a creative act as one that is novel and valuable.
It’s a definition that evokes what we point to when we claim <em>creation</em> in both science and art.
In practice, creative acts, such as making a painting or a song, involve combining existing elements into something new, rather than generating wholly new objects.</p>

<p>The novelty of my <code class="language-plaintext highlighter-rouge">both_rich_Alice_Bob</code> representation is that I’ve written a new boolean value to an address in memory.
The value of it is that I’ve transformed an implicit structure into an explicit, accessible state.
I no longer have to run a function to check its truth.
I can simply look it up in memory.
So yes, this is creative.</p>
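<p>The value claim can be made concrete with a small cache (the <code>facts</code> dictionary and its keys are my own illustrative device, not from the argument above):</p>

```python
# Illustrative cache: actualize the implicit fact once,
# then read it from memory thereafter.
Alice = {"tall", "rich", "fast"}
Bob = {"short", "rich", "slow"}

def both_rich(x, y):
    return "rich" in x and "rich" in y

facts = {}
facts[("both_rich", "Alice", "Bob")] = both_rich(Alice, Bob)  # compute once

# Later: a direct memory lookup, no function call required.
is_rich_pair = facts[("both_rich", "Alice", "Bob")]
```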

<p>Everything possible is derivable from the properties of the universe.
Creativity doesn’t mean magically conjuring up something from nothing.
It’s the act of encoding implicit structure into an explicit representation.</p>]]></content><author><name>Giordano Rogers</name><email>rogers.gi@northeastern.edu</email></author><category term="philosophy" /><category term="creativity" /><category term="computation" /><summary type="html"><![CDATA[I define the following variables: x = 2 y = 6 Does the information that x and y are both even exist?]]></summary></entry><entry><title type="html">Activation Patching the Residual Stream</title><link href="https://giordanorogers.github.io//posts/2025/07/activation_patching_residual/" rel="alternate" type="text/html" title="Activation Patching the Residual Stream" /><published>2025-07-19T00:00:00+00:00</published><updated>2025-07-19T00:00:00+00:00</updated><id>https://giordanorogers.github.io//posts/2025/07/activation_patching_residual</id><content type="html" xml:base="https://giordanorogers.github.io//posts/2025/07/activation_patching_residual/"><![CDATA[<p>Activation patching is a technique that lets us identify which model components are involved in specific behaviors.</p>

<p>It is a core method of interpretability research, a subfield of AI focused on reverse-engineering the algorithms that Large Language Models (LLMs) learn.</p>

<p>The goal of this notebook is to explore activation patching through the lens of what we’ll call the similarity task, where we want the LLM to discern which person in a list has an attribute in common with a specific main person.</p>
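<p>Before setting anything up, the idea behind the method can be sketched in plain Python. This is a toy illustration only (the stand-in “model” and function names are mine; the real runs below use nnsight): we cache an intermediate value from a clean run and overwrite the corresponding value during a corrupt run.</p>

```python
# Toy sketch of the activation-patching recipe on a two-"layer" model.
def layer1(x):
    return x * 2          # stand-in for an early transformer layer

def layer2(h):
    return h + 1          # stand-in for the rest of the network

def run(x, patch_h=None):
    h = layer1(x)         # intermediate activation
    if patch_h is not None:
        h = patch_h       # patch: overwrite it with a cached activation
    return layer2(h)

clean_h = layer1(3)                     # 1) clean run: cache the activation
corrupt_out = run(5)                    # 2) corrupt run: baseline output
patched_out = run(5, patch_h=clean_h)   # 3) corrupt run + patched-in clean activation
```

<p>In this toy chain the patch fully restores the clean behavior; in a real network, the degree to which patching a site restores clean behavior tells us how much the information flowing through that site matters.</p>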

<h3 id="installs--imports">Installs &amp; Imports</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">q</span> <span class="n">torch</span> <span class="n">nnsight</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">nnsight</span> <span class="kn">import</span> <span class="n">LanguageModel</span>
</code></pre></div></div>

<h3 id="model-setup">Model Setup</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_name</span> <span class="o">=</span> <span class="s">"meta-llama/Llama-3.3-70B-Instruct"</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">LanguageModel</span><span class="p">(</span>
    <span class="n">model_name</span><span class="p">,</span>
    <span class="n">device_map</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span>
    <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float16</span><span class="p">,</span>
    <span class="n">trust_remote_code</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
<span class="n">model</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 8192)
    (layers): ModuleList(
      (0-79): 80 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=8192, out_features=8192, bias=False)
          (k_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (v_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (o_proj): Linear(in_features=8192, out_features=8192, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=8192, out_features=28672, bias=False)
          (up_proj): Linear(in_features=8192, out_features=28672, bias=False)
          (down_proj): Linear(in_features=28672, out_features=8192, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((8192,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((8192,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((8192,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=8192, out_features=128256, bias=False)
  (generator): Generator(
    (streamer): Streamer()
  )
)
</code></pre></div></div>

<p>Here we can see that Llama 70B is composed of 80 decoder layers, each of which includes an attention sub-block and an MLP sub-block. Each sub-block adds its output back into a skip-connection pathway called the <strong>residual stream</strong>, which carries information from one layer to the next. We can get a sense of the relation between the residual stream, the attention layers, and the MLP layers in the diagram below:</p>

<p><img src="/images/transformer_architecture.png" alt="Transformer Architecture Diagram" /></p>
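<p>The additive structure can be summarized in a few lines. This is a scalar toy sketch under my own naming, not the actual Llama implementation (real layers operate on 8192-dimensional vectors and use RMSNorm):</p>

```python
# Schematic of one decoder layer's additive residual-stream update.
def decoder_layer(x, attn, mlp, norm):
    x = x + attn(norm(x))  # attention sub-block adds into the stream
    x = x + mlp(norm(x))   # MLP sub-block adds into the stream
    return x

# If both sub-blocks contribute nothing, the skip connection still
# carries the input through unchanged:
out = decoder_layer(5.0, attn=lambda v: 0.0, mlp=lambda v: 0.0, norm=lambda v: v)
```

<p>This additive pathway is why patching the stream at layer <em>l</em> captures everything every earlier component has written into it.</p>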

<h2 id="the-selection-task--core-concepts">The Similarity Task &amp; Core Concepts</h2>

<h3 id="defining-the-experimental-setup">Defining the Experimental Setup</h3>

<p>The similarity task asks the question: How does the model make a connection between two entities that share a common attribute? For example, when asked “Which of the following entities has a profession in common with Albert Einstein? Brad Pitt, Isaac Newton, Michael Jackson.”, the model should respond with “Isaac Newton”, since Albert Einstein and Isaac Newton were both physicists.</p>

<p>Activation patching can help us understand the LLM components involved in making this connection.</p>

<h3 id="prompt-setup">Prompt Setup</h3>

<p>Now, our goal is to track which parts of the model are most necessary for making the correct prediction in this task. To do this, we’ll first need to define two prompts: one which we’ll call the <strong>clean prompt</strong> and another we’ll call the <strong>corrupt prompt</strong>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">PROMPT_TEMPLATE</span> <span class="o">=</span> <span class="s">"Q: Which of the following entities has a profession in common with {}? {}.</span><span class="se">\n</span><span class="s">A:"</span>

<span class="n">clean_subj</span> <span class="o">=</span> <span class="s">"Albert Einstein"</span>
<span class="n">corrupt_subj</span> <span class="o">=</span> <span class="s">"Taylor Swift"</span>
<span class="n">entity_list</span> <span class="o">=</span> <span class="p">[</span><span class="s">"Brad Pitt"</span><span class="p">,</span> <span class="s">"Isaac Newton"</span><span class="p">,</span> <span class="s">"Michael Jackson"</span><span class="p">]</span>

<span class="n">clean_prompt</span> <span class="o">=</span> <span class="n">PROMPT_TEMPLATE</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">clean_subj</span><span class="p">,</span> <span class="p">(</span><span class="s">", "</span><span class="p">).</span><span class="n">join</span><span class="p">(</span><span class="n">entity_list</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Clean Prompt:"</span><span class="p">,</span> <span class="n">clean_prompt</span><span class="p">,</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

<span class="n">corrupt_prompt</span> <span class="o">=</span> <span class="n">PROMPT_TEMPLATE</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">corrupt_subj</span><span class="p">,</span> <span class="p">(</span><span class="s">", "</span><span class="p">).</span><span class="n">join</span><span class="p">(</span><span class="n">entity_list</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Corrupt Prompt:"</span><span class="p">,</span> <span class="n">corrupt_prompt</span><span class="p">)</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Clean Prompt: Q: Which of the following entities has a profession in common with Albert Einstein? Brad Pitt, Isaac Newton, Michael Jackson.
A: 

Corrupt Prompt: Q: Which of the following entities has a profession in common with Taylor Swift? Brad Pitt, Isaac Newton, Michael Jackson.
A:
</code></pre></div></div>

<p>The corrupt prompt gives us a version where the correct answer changes: here we expect the model to respond with “ Michael Jackson”. Activation patching will then show which parts of the model matter most for flipping the response from Michael Jackson back to Isaac Newton when we patch in activations from the clean run, whose only difference is replacing “Taylor Swift” with “Albert Einstein”.</p>

<p>Let’s make sure that our two subjects are the same token length so we don’t have to do any position manipulation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">" Albert Einstein"</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">" Taylor Swift"</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[17971, 55152]
[16844, 24594]
</code></pre></div></div>

<h3 id="save-clean--corrupt-target-tokens">Save Clean &amp; Corrupt Target Tokens</h3>

<p>Since Isaac Newton gets split into “ Isaac” and “Newton”, if the model predicts “ Isaac” as the next token, we consider that a correct prediction for Isaac Newton. Likewise for the corrupt prediction of Michael Jackson.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">clean_target_token_id</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">" Isaac"</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">corrupt_target_token_id</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">" Michael"</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="activation-patching">Activation Patching</h2>

<p>Patching the residual stream captures everything accumulated up to that point: the previous layers plus the outputs of whichever of the current layer’s sub-blocks we’ve passed. It tells us: “Does the information present here matter?”</p>

<p>It’s useful as a first-pass layer-by-layer, token-by-token sweep to find where in the network the decisive information lives.</p>

<p>But a major caveat is that it doesn’t tell us which submodule (MLP layer, attention layer, attention head) created the useful information. It just shows that the information is present at the aggregated site.</p>

<h3 id="clean-run">Clean Run</h3>

<p>Run the model on the clean prompt and cache the residual-stream output of every layer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">clean_activations</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">with</span> <span class="n">model</span><span class="p">.</span><span class="n">trace</span><span class="p">(</span><span class="n">clean_prompt</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_hidden_layers</span><span class="p">):</span>
        <span class="n">residual_output</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="n">l</span><span class="p">].</span><span class="n">output</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">save</span><span class="p">()</span>
        <span class="n">clean_activations</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">residual_output</span><span class="p">)</span>
    <span class="n">clean_target_logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">logits</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">clean_target_token_id</span><span class="p">].</span><span class="n">item</span><span class="p">().</span><span class="n">save</span><span class="p">()</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">clean_target_logits</span><span class="o">=</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">clean_activations</span><span class="p">)</span><span class="o">=</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clean_target_logits=20.0625
len(clean_activations)=80
</code></pre></div></div>

<h3 id="corrupt-run">Corrupt Run</h3>

<p>Run the model on the corrupt prompt and record its outputs.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">model</span><span class="p">.</span><span class="n">trace</span><span class="p">(</span><span class="n">corrupt_prompt</span><span class="p">):</span>
    <span class="c1"># Save the output logits
</span>    <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">logits</span><span class="p">.</span><span class="n">save</span><span class="p">()</span>
    
    <span class="c1"># Save the logits for predicting the clean_target
</span>    <span class="n">clean_target_logits_corrupt</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">logits</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">clean_target_token_id</span><span class="p">].</span><span class="n">item</span><span class="p">().</span><span class="n">save</span><span class="p">()</span>

<span class="c1"># Show the corrupt prediction and logits
</span><span class="n">corrupt_target_logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">corrupt_target_token_id</span><span class="p">].</span><span class="n">item</span><span class="p">()</span>
<span class="n">corrupt_prediction</span> <span class="o">=</span> <span class="n">logits</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)[:,</span> <span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">corrupt_token_prediction</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">corrupt_prediction</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">corrupt_token_prediction</span><span class="o">=</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">corrupt_target_logits</span><span class="o">=</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># Show the logits for the clean target on the corrupt run
</span><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">clean_target_logits_corrupt</span><span class="o">=</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>corrupt_token_prediction=' Michael'
corrupt_target_logits=20.03125
clean_target_logits_corrupt=16.375
</code></pre></div></div>

<h3 id="baseline-logit-difference">Baseline Logit Difference</h3>

<p>This gives us a score of the difference between the model’s confidence in predicting “ Isaac” when we run the clean prompt versus when we run the corrupt prompt.</p>

<p>Note that there are only three possible options, so it is reasonable for this score to remain low even though Isaac would be an incorrect answer in the corrupt prompt.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">total_logit_diff</span> <span class="o">=</span> <span class="n">clean_target_logits</span> <span class="o">-</span> <span class="n">clean_target_logits_corrupt</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Baseline logit difference (clean - corrupt): </span><span class="si">{</span><span class="n">total_logit_diff</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Baseline logit difference (clean - corrupt): 3.6875
</code></pre></div></div>

<h3 id="storing-token-ids">Storing Token IDs</h3>

<p>The clean input IDs will be used to loop through the prompt token by token, and both will be used later to decode back into tokens for the activation plots.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">clean_input_ids</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">clean_prompt</span><span class="p">)</span>
<span class="n">corrupt_input_ids</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">corrupt_prompt</span><span class="p">)</span>

<span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">clean_input_ids</span><span class="p">)</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">corrupt_input_ids</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="patched-run">Patched Run</h3>

<p>Run the model on the corrupt input with residual stream activations restored from the clean run.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">trange</span>

<span class="n">patched_activations</span> <span class="o">=</span> <span class="p">[]</span>

<span class="c1"># Loop through the model layers
</span><span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">trange</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_hidden_layers</span><span class="p">):</span>
    <span class="n">patch_at_layer</span> <span class="o">=</span> <span class="p">[]</span>
    
    <span class="c1"># For each token position
</span>    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">clean_input_ids</span><span class="p">)):</span>
        <span class="c1"># We run the model on the corrupt prompt
</span>        <span class="k">with</span> <span class="n">model</span><span class="p">.</span><span class="n">trace</span><span class="p">(</span><span class="n">corrupt_prompt</span><span class="p">):</span>
            <span class="c1"># Replace the residual stream output with the clean_activations at that token
</span>            <span class="n">model</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="n">l</span><span class="p">].</span><span class="n">output</span><span class="p">[</span><span class="mi">0</span><span class="p">][:,</span> <span class="n">t</span><span class="p">,</span> <span class="p">:]</span> <span class="o">=</span> <span class="n">clean_activations</span><span class="p">[</span><span class="n">l</span><span class="p">][:,</span> <span class="n">t</span><span class="p">,</span> <span class="p">:]</span>
            
            <span class="c1"># Get the model output logits
</span>            <span class="n">patched_logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">logits</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">save</span><span class="p">()</span>
            
            <span class="c1"># Get the logits for the clean target
</span>            <span class="n">patched_target_logits</span> <span class="o">=</span> <span class="n">patched_logits</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">clean_target_token_id</span><span class="p">].</span><span class="n">item</span><span class="p">().</span><span class="n">save</span><span class="p">()</span>
            
            <span class="c1"># Calculate the logit difference
</span>            <span class="n">logit_diff</span> <span class="o">=</span> <span class="n">patched_target_logits</span> <span class="o">-</span> <span class="n">clean_target_logits_corrupt</span>
            
            <span class="c1"># Normalize the logit difference
</span>            <span class="k">if</span> <span class="n">total_logit_diff</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">normalized_score</span> <span class="o">=</span> <span class="mf">0.0</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">normalized_score</span> <span class="o">=</span> <span class="n">logit_diff</span> <span class="o">/</span> <span class="n">total_logit_diff</span>
                
            <span class="n">normalized_score</span> <span class="o">=</span> <span class="n">normalized_score</span><span class="p">.</span><span class="n">save</span><span class="p">()</span>
            
            <span class="c1"># Print the layer, token, and logit difference
</span>            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"L</span><span class="si">{</span><span class="n">l</span><span class="si">}</span><span class="s">, T</span><span class="si">{</span><span class="n">t</span><span class="si">}</span><span class="s">, LD=</span><span class="si">{</span><span class="n">normalized_score</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            
            <span class="c1"># Append the token to our list of logit differences at the current layer
</span>            <span class="n">patch_at_layer</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">normalized_score</span><span class="p">)</span>
    
    <span class="c1"># Append each layer's logit differences list to our overall list of lists
</span>    <span class="n">patched_activations</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">patch_at_layer</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Patched activations shape: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">patched_activations</span><span class="p">)</span><span class="si">}</span><span class="s"> layers x </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">patched_activations</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span><span class="si">}</span><span class="s"> tokens"</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="plotting-logic">Plotting Logic</h2>

<p>We want a heatmap of the indirect effect of patching from clean to corrupt at each token position and layer. This will show us which positions in the network carry the information most necessary for predicting the correct answer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="c1"># Create a 2D array: layers as rows, tokens as columns, then transpose
</span><span class="n">scores</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">patched_activations</span><span class="p">).</span><span class="n">T</span>  <span class="c1"># Transpose so tokens are rows, layers are columns
</span><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Scores shape after transpose: </span><span class="si">{</span><span class="n">scores</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># Decode tokens to actual text
</span><span class="n">corrupt_tokens</span> <span class="o">=</span> <span class="p">[</span><span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">decode</span><span class="p">([</span><span class="n">token_id</span><span class="p">])</span> <span class="k">for</span> <span class="n">token_id</span> <span class="ow">in</span> <span class="n">corrupt_input_ids</span><span class="p">]</span>
<span class="n">clean_tokens</span> <span class="o">=</span> <span class="p">[</span><span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">decode</span><span class="p">([</span><span class="n">token_id</span><span class="p">])</span> <span class="k">for</span> <span class="n">token_id</span> <span class="ow">in</span> <span class="n">clean_input_ids</span><span class="p">]</span>

<span class="c1"># Create token labels from the clean tokens
</span><span class="n">tokens</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s">'"</span><span class="si">{</span><span class="n">tok</span><span class="si">}</span><span class="s">"'</span> <span class="k">for</span> <span class="n">tok</span> <span class="ow">in</span> <span class="n">clean_tokens</span><span class="p">]</span>

<span class="c1"># The color scale below is clipped to [0, 1] to emphasize positive effects,
# but we print the full actual range for reference.
</span><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Data range: min=</span><span class="si">{</span><span class="n">scores</span><span class="p">.</span><span class="nb">min</span><span class="p">()</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">, max=</span><span class="si">{</span><span class="n">scores</span><span class="p">.</span><span class="nb">max</span><span class="p">()</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">rcdefaults</span><span class="p">()</span>
<span class="k">with</span> <span class="n">plt</span><span class="p">.</span><span class="n">rc_context</span><span class="p">(</span>
    <span class="n">rc</span><span class="o">=</span><span class="p">{</span>
        <span class="s">"font.family"</span><span class="p">:</span> <span class="s">"Times New Roman"</span><span class="p">,</span>
        <span class="s">"font.size"</span><span class="p">:</span> <span class="mi">6</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">):</span>
    <span class="c1"># Set figure size
</span>    <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span>
        <span class="n">figsize</span><span class="o">=</span><span class="p">(</span>
            <span class="mi">6</span><span class="p">,</span>
            <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.08</span> <span class="o">+</span> <span class="mf">1.8</span>
        <span class="p">),</span>
        <span class="n">dpi</span><span class="o">=</span><span class="mi">200</span>
    <span class="p">)</span>
    
    <span class="c1"># Scale range
</span>    <span class="n">scale_kwargs</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"vmin"</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
        <span class="s">"vmax"</span><span class="p">:</span> <span class="mi">1</span>
    <span class="p">}</span>
    
    <span class="n">heatmap</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">pcolor</span><span class="p">(</span>
        <span class="n">scores</span><span class="p">,</span>
        <span class="n">cmap</span><span class="o">=</span><span class="s">"Purples"</span><span class="p">,</span>
        <span class="o">**</span><span class="n">scale_kwargs</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">invert_yaxis</span><span class="p">()</span>
    
    <span class="c1"># Y-axis: token labels (rows)
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_yticks</span><span class="p">([</span><span class="mf">0.5</span> <span class="o">+</span> <span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">scores</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])])</span>  <span class="c1"># Number of tokens
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
    
    <span class="c1"># X-axis: layer labels (columns)
</span>    <span class="n">num_layers</span> <span class="o">=</span> <span class="n">scores</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">tick_indices</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">num_layers</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">tick_indices</span> <span class="o">+</span> <span class="mf">0.5</span><span class="p">)</span>  <span class="c1"># Number of layers
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">tick_indices</span><span class="p">)</span>
    
    <span class="n">title</span> <span class="o">=</span> <span class="s">"Indirect Effects of Residual Layers"</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">title</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Layer"</span><span class="p">)</span>
    <span class="c1">#ax.set_ylabel("Tokens")
</span>    
    <span class="n">color_scale</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">colorbar</span><span class="p">(</span><span class="n">heatmap</span><span class="p">)</span>
    <span class="n">color_scale</span><span class="p">.</span><span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span>
        <span class="s">"Normalized Score"</span><span class="p">,</span>
        <span class="n">y</span><span class="o">=-</span><span class="mf">0.12</span><span class="p">,</span>
        <span class="n">fontsize</span><span class="o">=</span><span class="mi">8</span>
    <span class="p">)</span>
    
    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Data shape: </span><span class="si">{</span><span class="n">scores</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Number of tokens: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Scores shape after transpose: (28, 80)
Data range: min=-0.2458, max=1.1144
Data shape: (28, 80)
Number of tokens: 28
</code></pre></div></div>

<p><img src="/images/indirect_effects_heatmap.png" alt="Indirect Effects of Residual Layers Heatmap" /></p>

<p>Based on the above plot, we can make the following suppositions:</p>

<ul>
  <li>
    <p>The strong early-layer effects at the “ Einstein” token suggest that the model relies heavily on its representation of Albert Einstein to make the correct prediction. Perhaps it uses these early layers to resolve who Albert Einstein is: what attributes he possesses, what his profession is.</p>
  </li>
  <li>
    <p>The mid-layer effects at the “ Newton” and “,” tokens suggest this is where the model retrieves information about Isaac Newton to inform its prediction.</p>
  </li>
  <li>
    <p>By around layer 34, the strongest effects shift to the last token position, suggesting that the information relevant to the correct prediction has already been extracted and moved there by this point.</p>
  </li>
</ul>
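<p>These readings can also be spot-checked numerically rather than by eye. Below is a minimal sketch that finds the layer of peak indirect effect for each token position; the random matrix is a stand-in for the real <code>scores</code> array, since the actual values come from the run above.</p>

```python
import numpy as np

# Stand-in for the (tokens x layers) scores matrix computed above;
# replace with the real `scores` array from the patched run.
rng = np.random.default_rng(0)
scores = rng.random((28, 80))

# For each token position, the layer with the strongest indirect effect.
peak_layers = scores.argmax(axis=1)

# Token positions whose peak falls in the early vs. late third of the network.
num_layers = scores.shape[1]
early = np.flatnonzero(peak_layers < num_layers // 3)
late = np.flatnonzero(peak_layers >= 2 * num_layers // 3)
print(f"peaks early: {early.tolist()}")
print(f"peaks late: {late.tolist()}")
```

<p>On the real data, a token like “ Einstein” peaking in the early third and the final token peaking in the late third would line up with the qualitative reading of the heatmap.</p>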

<h2 id="final-thoughts">Final Thoughts</h2>

<p>Patching at the residual stream level is informative, but it is the most coarse-grained form of patching we can do. For a finer-grained analysis, we will want to patch the MLP and attention sublayers individually, which we will do in the next notebook.</p>]]></content><author><name>Giordano Rogers</name><email>rogers.gi@northeastern.edu</email></author><category term="mech interp" /><category term="activation patching" /><category term="residual stream" /><summary type="html"><![CDATA[Activation patching is a technique that lets us identify which model components are involved in specific behaviors.]]></summary></entry></feed>