<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://giordanorogers.github.io//feed.xml" rel="self" type="application/atom+xml" /><link href="https://giordanorogers.github.io//" rel="alternate" type="text/html" /><updated>2026-04-08T01:29:18+00:00</updated><id>https://giordanorogers.github.io//feed.xml</id><title type="html">Giordano Rogers</title><subtitle>Researcher @ Northeastern</subtitle><author><name>Giordano Rogers</name><email>rogers.gi@northeastern.edu</email></author><entry><title type="html">AI Mediated Mediocrity</title><link href="https://giordanorogers.github.io//posts/2025/12/ai_mediated_mediocrity/" rel="alternate" type="text/html" title="AI Mediated Mediocrity" /><published>2025-12-14T00:00:00+00:00</published><updated>2025-12-14T00:00:00+00:00</updated><id>https://giordanorogers.github.io//posts/2025/12/ai_mediated_mediocrity</id><content type="html" xml:base="https://giordanorogers.github.io//posts/2025/12/ai_mediated_mediocrity/"><![CDATA[<p>The first time I talked to ChatGPT, it blew my mind. Talking to a computer. Getting real answers. Learning something. It was life-changing. It pushed me, a dropout, back into school for computer science.</p>

<p>More than that, it gave me a new dream when my old one died. The pursuit to understand intelligence. To improve the world.</p>

<p>On that journey, language models became mentors, teammates, friends. I spent hundreds of hours talking to them. I read textbooks, did research, even gave lectures in my school’s AI club. Now, I finally know enough to see limitations I was blind to at first.</p>

<p>The core problem: LLMs are excellent at removing struggle. And struggle is where growth happens.</p>

<p><strong>Take coding.</strong> LLMs make it easy to get unstuck. Too easy. I’m a decent coder. But I’d be better if I struggled more. Using LLMs helped me move fast, keep a 4.0, and still live my life. But it also robbed me of hard moments I should have sat with. Moments where understanding solidifies.</p>

<p>I don’t regret using them. But I wish I made more of a habit of using them as tutors rather than laborers.</p>

<p>So I’m changing my approach. I’m allocating more time for long, uninterrupted coding blocks. Sitting with confusion. Reading the docs. Using LLMs only for small, narrow questions. Moving slower. Growing faster.</p>

<p><strong>Take idea generation.</strong> LLMs are even more disabling here. Ideas don’t arrive clean. They start ugly. They take work. When I ask an LLM for good ideas, I circumvent that. I get something polished but not my own. Presentable but dead.</p>

<p>Your ideas have to come from you. If you want better ideas, focus on improving your mind, not your tools. Good tools can help shape ideas. They can’t have them for you. Think more. Write more. That’s what I’m resolving to do. Letting ideas surface on their own. No assistant. Just me and my text editor.</p>

<p><strong>Take research.</strong> Same pattern, worse consequences. I’ve been doing research for a year now. During that time, I’ve leaned a lot on LLMs to think with, to plan, to code. The code was often wrong. The ideas were often flat. I chased bad directions for months at a time. I blame a combination of poor prompting on my part and terrible sycophancy on the LLMs’.</p>

<p>LLMs haven’t made me a better researcher. But they did make me feel like one, at my own expense.</p>

<p>So I’m pulling back. I’m spending my winter break going AI-minimal. Me, my mind, and my keyboard. Nothing external until I’ve tried and failed alone.</p>

<p>I’m not anti-AI. I just want more agency. LLMs are good for particular things. They’re leverage amplifiers. But right now, I don’t need to expand my leverage. I need to expand my competence.</p>

<p>My prediction for these three weeks: less output, more learning, more clarity. That’s a trade I’m willing to make.</p>]]></content><author><name>Giordano Rogers</name><email>rogers.gi@northeastern.edu</email></author><category term="AI" /><category term="Productivity" /><category term="Research" /><summary type="html"><![CDATA[The first time I talked to ChatGPT, it blew my mind. Talking to a computer. Getting real answers. Learning something. It was life-changing. It pushed me, a dropout, back into school for computer science.]]></summary></entry><entry><title type="html">On creativity</title><link href="https://giordanorogers.github.io//posts/2025/08/what_is_creativity/" rel="alternate" type="text/html" title="On creativity" /><published>2025-08-18T00:00:00+00:00</published><updated>2025-08-18T00:00:00+00:00</updated><id>https://giordanorogers.github.io//posts/2025/08/what_is_creativity</id><content type="html" xml:base="https://giordanorogers.github.io//posts/2025/08/what_is_creativity/"><![CDATA[<p>I define the following variables:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">y</span> <span class="o">=</span> <span class="mi">6</span>
</code></pre></div></div>
<p>Does the information that x and y are both even exist?</p>

<p>In one sense, yes.
The fact that x is 2 and y is 6 logically entails that both are even.
This is <em>implicit information</em>.
It exists in the sense that any reasoning system with the rules of arithmetic could derive it.</p>

<p>But in another sense, no.
The computer itself doesn’t <em>know</em> they’re even unless we encode that knowledge.
All it stores are the numbers 2 and 6 in an address associated with the variables x and y.
The property “evenness” only exists when checked, represented, and stored.</p>

<p>This is the distinction between <strong>potential information</strong> and <strong>actualized information</strong>.</p>

<p>Now I write a function and check the condition:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">both_even</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">):</span>
	<span class="k">return</span> <span class="n">x</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="n">y</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span>
<span class="n">both_even_x_y</span> <span class="o">=</span> <span class="n">both_even</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
</code></pre></div></div>
<p>Now the information that x and y are both even exists.</p>

<p>I’ve taken an implicit relation and made it explicit by constructing a new variable that records the result.
This information is not <em>logically</em> new.
It is a <em>representation of latent information</em> in an accessible form.</p>

<p>But this example isn’t obviously relevant to real-world creativity.
Let’s get more abstract and show that this principle generalizes beyond numbers to attributes as well.</p>

<p>I define variables “Alice” and “Bob” with the following qualities:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Alice</span> <span class="o">=</span> <span class="p">{</span><span class="s">"tall"</span><span class="p">,</span> <span class="s">"rich"</span><span class="p">,</span> <span class="s">"fast"</span><span class="p">}</span>
<span class="n">Bob</span> <span class="o">=</span> <span class="p">{</span><span class="s">"short"</span><span class="p">,</span> <span class="s">"rich"</span><span class="p">,</span> <span class="s">"slow"</span><span class="p">}</span>
</code></pre></div></div>
<p>The information that <em>Alice and Bob are both rich</em> only exists implicitly.
To make it explicit, I have to check it and store it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">both_rich</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">):</span>
	<span class="k">return</span> <span class="s">"rich"</span> <span class="ow">in</span> <span class="n">x</span> <span class="ow">and</span> <span class="s">"rich"</span> <span class="ow">in</span> <span class="n">y</span>
<span class="n">both_rich_Alice_Bob</span> <span class="o">=</span> <span class="n">both_rich</span><span class="p">(</span><span class="n">Alice</span><span class="p">,</span> <span class="n">Bob</span><span class="p">)</span>
</code></pre></div></div>

<p>Is this checking and storing process creative?
Let’s define a creative act as one that is novel and valuable.
It’s a definition that evokes what we point to when we claim <em>creation</em> in both science and art.
In practice, creative acts, such as making a painting or a song, involve combining existing elements into something new, rather than generating wholly new objects.</p>

<p>The novelty of my <code class="language-plaintext highlighter-rouge">both_rich_Alice_Bob</code> representation is that I’ve written a new boolean value to an address in memory.
The value of it is that I’ve transformed an implicit structure into an explicit, accessible state.
I no longer have to run a function to check its truth.
I can simply look it up in memory.
So yes, this is creative.</p>
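<p>The value claim can be made concrete with a small cache (the <code>facts</code> dictionary and its keys are my own illustrative device, not from the argument above):</p>

```python
# Illustrative cache: actualize the implicit fact once,
# then read it from memory thereafter.
Alice = {"tall", "rich", "fast"}
Bob = {"short", "rich", "slow"}

def both_rich(x, y):
    return "rich" in x and "rich" in y

facts = {}
facts[("both_rich", "Alice", "Bob")] = both_rich(Alice, Bob)  # compute once

# Later: a direct memory lookup, no function call required.
is_rich_pair = facts[("both_rich", "Alice", "Bob")]
```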

<p>Everything possible is derivable from the properties of the universe.
Creativity doesn’t mean magically conjuring up something from nothing.
It’s the act of encoding implicit structure into an explicit representation.</p>]]></content><author><name>Giordano Rogers</name><email>rogers.gi@northeastern.edu</email></author><category term="philosophy" /><category term="creativity" /><category term="computation" /><summary type="html"><![CDATA[I define the following variables: x = 2 y = 6 Does the information that x and y are both even exist?]]></summary></entry><entry><title type="html">Activation Patching the Residual Stream</title><link href="https://giordanorogers.github.io//posts/2025/07/activation_patching_residual/" rel="alternate" type="text/html" title="Activation Patching the Residual Stream" /><published>2025-07-19T00:00:00+00:00</published><updated>2025-07-19T00:00:00+00:00</updated><id>https://giordanorogers.github.io//posts/2025/07/activation_patching_residual</id><content type="html" xml:base="https://giordanorogers.github.io//posts/2025/07/activation_patching_residual/"><![CDATA[<p>Activation patching is a technique that lets us identify which model components are involved in specific behaviors.</p>

<p>It is a core method of interpretability research, a subfield of AI focused on reverse-engineering the algorithms that Large Language Models (LLMs) learn.</p>

<p>The goal of this notebook is to explore activation patching through the lens of what we’ll call the similarity task, where we want the LLM to discern which person in a list has an attribute in common with a specific main person.</p>
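<p>Before setting anything up, the idea behind the method can be sketched in plain Python. This is a toy illustration only (the stand-in “model” and function names are mine; the real runs below use nnsight): we cache an intermediate value from a clean run and overwrite the corresponding value during a corrupt run.</p>

```python
# Toy sketch of the activation-patching recipe on a two-"layer" model.
def layer1(x):
    return x * 2          # stand-in for an early transformer layer

def layer2(h):
    return h + 1          # stand-in for the rest of the network

def run(x, patch_h=None):
    h = layer1(x)         # intermediate activation
    if patch_h is not None:
        h = patch_h       # patch: overwrite it with a cached activation
    return layer2(h)

clean_h = layer1(3)                     # 1) clean run: cache the activation
corrupt_out = run(5)                    # 2) corrupt run: baseline output
patched_out = run(5, patch_h=clean_h)   # 3) corrupt run + patched-in clean activation
```

<p>In this toy chain the patch fully restores the clean behavior; in a real network, the degree to which patching a site restores clean behavior tells us how much the information flowing through that site matters.</p>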

<h3 id="installs--imports">Installs &amp; Imports</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">q</span> <span class="n">torch</span> <span class="n">nnsight</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">nnsight</span> <span class="kn">import</span> <span class="n">LanguageModel</span>
</code></pre></div></div>

<h3 id="model-setup">Model Setup</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_name</span> <span class="o">=</span> <span class="s">"meta-llama/Llama-3.3-70B-Instruct"</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">LanguageModel</span><span class="p">(</span>
    <span class="n">model_name</span><span class="p">,</span>
    <span class="n">device_map</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span>
    <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float16</span><span class="p">,</span>
    <span class="n">trust_remote_code</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
<span class="n">model</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 8192)
    (layers): ModuleList(
      (0-79): 80 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=8192, out_features=8192, bias=False)
          (k_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (v_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (o_proj): Linear(in_features=8192, out_features=8192, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=8192, out_features=28672, bias=False)
          (up_proj): Linear(in_features=8192, out_features=28672, bias=False)
          (down_proj): Linear(in_features=28672, out_features=8192, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((8192,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((8192,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((8192,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=8192, out_features=128256, bias=False)
  (generator): Generator(
    (streamer): Streamer()
  )
)
</code></pre></div></div>

<p>Here we can see that Llama 70B is composed of 80 decoder layers, each of which includes an attention sub-block and an MLP sub-block. Each sub-block adds its output back into a skip-connection pathway called the <strong>residual stream</strong>, which carries information from one layer to the next. We can get a sense of the relation between the residual stream, the attention layers, and the MLP layers in the diagram below:</p>

<p><img src="/images/transformer_architecture.png" alt="Transformer Architecture Diagram" /></p>
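<p>The additive structure can be summarized in a few lines. This is a scalar toy sketch under my own naming, not the actual Llama implementation (real layers operate on 8192-dimensional vectors and use RMSNorm):</p>

```python
# Schematic of one decoder layer's additive residual-stream update.
def decoder_layer(x, attn, mlp, norm):
    x = x + attn(norm(x))  # attention sub-block adds into the stream
    x = x + mlp(norm(x))   # MLP sub-block adds into the stream
    return x

# If both sub-blocks contribute nothing, the skip connection still
# carries the input through unchanged:
out = decoder_layer(5.0, attn=lambda v: 0.0, mlp=lambda v: 0.0, norm=lambda v: v)
```

<p>This additive pathway is why patching the stream at layer <em>l</em> captures everything every earlier component has written into it.</p>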

<h2 id="the-selection-task--core-concepts">The Similarity Task &amp; Core Concepts</h2>

<h3 id="defining-the-experimental-setup">Defining the Experimental Setup</h3>

<p>The similarity task asks the question: How does the model make a connection between two entities that share a common attribute? For example, when asked “Which of the following entities has a profession in common with Albert Einstein? Brad Pitt, Isaac Newton, Michael Jackson.”, the model should respond with “Isaac Newton”, since Albert Einstein and Isaac Newton were both physicists.</p>

<p>Activation patching can help us understand the LLM components involved in making this connection.</p>

<h3 id="prompt-setup">Prompt Setup</h3>

<p>Now, our goal is to track which parts of the model are most necessary for making the correct prediction in this task. To do this, we’ll first need to define two prompts: one which we’ll call the <strong>clean prompt</strong> and another we’ll call the <strong>corrupt prompt</strong>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">PROMPT_TEMPLATE</span> <span class="o">=</span> <span class="s">"Q: Which of the following entities has a profession in common with {}? {}.</span><span class="se">\n</span><span class="s">A:"</span>

<span class="n">clean_subj</span> <span class="o">=</span> <span class="s">"Albert Einstein"</span>
<span class="n">corrupt_subj</span> <span class="o">=</span> <span class="s">"Taylor Swift"</span>
<span class="n">entity_list</span> <span class="o">=</span> <span class="p">[</span><span class="s">"Brad Pitt"</span><span class="p">,</span> <span class="s">"Isaac Newton"</span><span class="p">,</span> <span class="s">"Michael Jackson"</span><span class="p">]</span>

<span class="n">clean_prompt</span> <span class="o">=</span> <span class="n">PROMPT_TEMPLATE</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">clean_subj</span><span class="p">,</span> <span class="p">(</span><span class="s">", "</span><span class="p">).</span><span class="n">join</span><span class="p">(</span><span class="n">entity_list</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Clean Prompt:"</span><span class="p">,</span> <span class="n">clean_prompt</span><span class="p">,</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

<span class="n">corrupt_prompt</span> <span class="o">=</span> <span class="n">PROMPT_TEMPLATE</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">corrupt_subj</span><span class="p">,</span> <span class="p">(</span><span class="s">", "</span><span class="p">).</span><span class="n">join</span><span class="p">(</span><span class="n">entity_list</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Corrupt Prompt:"</span><span class="p">,</span> <span class="n">corrupt_prompt</span><span class="p">)</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Clean Prompt: Q: Which of the following entities has a profession in common with Albert Einstein? Brad Pitt, Isaac Newton, Michael Jackson.
A: 

Corrupt Prompt: Q: Which of the following entities has a profession in common with Taylor Swift? Brad Pitt, Isaac Newton, Michael Jackson.
A:
</code></pre></div></div>

<p>The corrupt prompt gives us a version where the correct answer changes: here we expect the model to respond with “ Michael Jackson”. Activation patching will then show which parts of the model matter most for flipping the response from Michael Jackson back to Isaac Newton when we patch in activations from the clean run, whose only difference is replacing “Taylor Swift” with “Albert Einstein”.</p>

<p>Let’s make sure that our two subjects are the same token length so we don’t have to do any position manipulation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">" Albert Einstein"</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">" Taylor Swift"</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[17971, 55152]
[16844, 24594]
</code></pre></div></div>

<h3 id="save-clean--corrupt-target-tokens">Save Clean &amp; Corrupt Target Tokens</h3>

<p>Since Isaac Newton gets split into “ Isaac” and “Newton”, if the model predicts “ Isaac” as the next token, we consider that a correct prediction for Isaac Newton. Likewise for the corrupt prediction of Michael Jackson.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">clean_target_token_id</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">" Isaac"</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">corrupt_target_token_id</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">" Michael"</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="activation-patching">Activation Patching</h2>

<p>Patching the residual stream captures everything accumulated up to that point: the previous layers plus the outputs of whichever of the current layer’s sub-blocks we’ve passed. It tells us: “Does the information present here matter?”</p>

<p>It’s useful as a first-pass layer-by-layer, token-by-token sweep to find where in the network the decisive information lives.</p>

<p>But a major caveat is that it doesn’t tell us which submodule (MLP layer, attention layer, attention head) created the useful information. It just shows that the information is present at the aggregated site.</p>

<h3 id="clean-run">Clean Run</h3>

<p>Run the model on the clean prompt and cache the residual-stream output of every layer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">clean_activations</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">with</span> <span class="n">model</span><span class="p">.</span><span class="n">trace</span><span class="p">(</span><span class="n">clean_prompt</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_hidden_layers</span><span class="p">):</span>
        <span class="n">residual_output</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="n">l</span><span class="p">].</span><span class="n">output</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">save</span><span class="p">()</span>
        <span class="n">clean_activations</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">residual_output</span><span class="p">)</span>
    <span class="n">clean_target_logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">logits</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">clean_target_token_id</span><span class="p">].</span><span class="n">item</span><span class="p">().</span><span class="n">save</span><span class="p">()</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">clean_target_logits</span><span class="o">=</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">clean_activations</span><span class="p">)</span><span class="o">=</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clean_target_logits=20.0625
len(clean_activations)=80
</code></pre></div></div>

<h3 id="corrupt-run">Corrupt Run</h3>

<p>Run the model on the corrupt prompt and record its outputs.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">model</span><span class="p">.</span><span class="n">trace</span><span class="p">(</span><span class="n">corrupt_prompt</span><span class="p">):</span>
    <span class="c1"># Save the output logits
</span>    <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">logits</span><span class="p">.</span><span class="n">save</span><span class="p">()</span>
    
    <span class="c1"># Save the logits for predicting the clean_target
</span>    <span class="n">clean_target_logits_corrupt</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">logits</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">clean_target_token_id</span><span class="p">].</span><span class="n">item</span><span class="p">().</span><span class="n">save</span><span class="p">()</span>

<span class="c1"># Show the corrupt prediction and logits
</span><span class="n">corrupt_target_logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">corrupt_target_token_id</span><span class="p">].</span><span class="n">item</span><span class="p">()</span>
<span class="n">corrupt_prediction</span> <span class="o">=</span> <span class="n">logits</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)[:,</span> <span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">corrupt_token_prediction</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">corrupt_prediction</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">corrupt_token_prediction</span><span class="o">=</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">corrupt_target_logits</span><span class="o">=</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># Show the logits for the clean target on the corrupt run
</span><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">clean_target_logits_corrupt</span><span class="o">=</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>corrupt_token_prediction=' Michael'
corrupt_target_logits=20.03125
clean_target_logits_corrupt=16.375
</code></pre></div></div>

<h3 id="baseline-logit-difference">Baseline Logit Difference</h3>

<p>This gives us a score of the difference between the model’s confidence in predicting “ Isaac” when we run the clean prompt versus when we run the corrupt prompt.</p>

<p>Note that there are only three possible options, so it is reasonable for this score to remain low even though Isaac would be an incorrect answer in the corrupt prompt.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">total_logit_diff</span> <span class="o">=</span> <span class="n">clean_target_logits</span> <span class="o">-</span> <span class="n">clean_target_logits_corrupt</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Baseline logit difference (clean - corrupt): </span><span class="si">{</span><span class="n">total_logit_diff</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Baseline logit difference (clean - corrupt): 3.6875
</code></pre></div></div>

<h3 id="storing-token-ids">Storing Token IDs</h3>

<p>The clean input IDs will be used to loop through the prompt token by token, and both will be used later to decode back into tokens for the activation plots.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">clean_input_ids</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">clean_prompt</span><span class="p">)</span>
<span class="n">corrupt_input_ids</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">corrupt_prompt</span><span class="p">)</span>

<span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">clean_input_ids</span><span class="p">)</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">corrupt_input_ids</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="patched-run">Patched Run</h3>

<p>Run the model on the corrupt input with residual stream activations restored from the clean run.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">trange</span>

<span class="n">patched_activations</span> <span class="o">=</span> <span class="p">[]</span>

<span class="c1"># Loop through the model layers
</span><span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">trange</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_hidden_layers</span><span class="p">):</span>
    <span class="n">patch_at_layer</span> <span class="o">=</span> <span class="p">[]</span>
    
    <span class="c1"># For each token position
</span>    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">clean_input_ids</span><span class="p">)):</span>
        <span class="c1"># We run the model on the corrupt prompt
</span>        <span class="k">with</span> <span class="n">model</span><span class="p">.</span><span class="n">trace</span><span class="p">(</span><span class="n">corrupt_prompt</span><span class="p">):</span>
            <span class="c1"># Replace the residual stream output with the clean_activations at that token
</span>            <span class="n">model</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="n">l</span><span class="p">].</span><span class="n">output</span><span class="p">[</span><span class="mi">0</span><span class="p">][:,</span> <span class="n">t</span><span class="p">,</span> <span class="p">:]</span> <span class="o">=</span> <span class="n">clean_activations</span><span class="p">[</span><span class="n">l</span><span class="p">][:,</span> <span class="n">t</span><span class="p">,</span> <span class="p">:]</span>
            
            <span class="c1"># Get the model output logits
</span>            <span class="n">patched_logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">logits</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">save</span><span class="p">()</span>
            
            <span class="c1"># Get the logits for the clean target
</span>            <span class="n">patched_target_logits</span> <span class="o">=</span> <span class="n">patched_logits</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">clean_target_token_id</span><span class="p">].</span><span class="n">item</span><span class="p">().</span><span class="n">save</span><span class="p">()</span>
            
            <span class="c1"># Calculate the logit difference
</span>            <span class="n">logit_diff</span> <span class="o">=</span> <span class="n">patched_target_logits</span> <span class="o">-</span> <span class="n">clean_target_logits_corrupt</span>
            
            <span class="c1"># Normalize the logit difference
</span>            <span class="k">if</span> <span class="n">total_logit_diff</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">normalized_score</span> <span class="o">=</span> <span class="mf">0.0</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">normalized_score</span> <span class="o">=</span> <span class="n">logit_diff</span> <span class="o">/</span> <span class="n">total_logit_diff</span>
                
            <span class="n">normalized_score</span> <span class="o">=</span> <span class="n">normalized_score</span><span class="p">.</span><span class="n">save</span><span class="p">()</span>
            
            <span class="c1"># Print the layer, token, and logit difference
</span>            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"L</span><span class="si">{</span><span class="n">l</span><span class="si">}</span><span class="s">, T</span><span class="si">{</span><span class="n">t</span><span class="si">}</span><span class="s">, LD=</span><span class="si">{</span><span class="n">normalized_score</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            
            <span class="c1"># Append the token to our list of logit differences at the current layer
</span>            <span class="n">patch_at_layer</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">normalized_score</span><span class="p">)</span>
    
    <span class="c1"># Append each layer's logit differences list to our overall list of lists
</span>    <span class="n">patched_activations</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">patch_at_layer</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Patched activations shape: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">patched_activations</span><span class="p">)</span><span class="si">}</span><span class="s"> layers x </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">patched_activations</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span><span class="si">}</span><span class="s"> tokens"</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="plotting-logic">Plotting Logic</h2>

<p>We want a heatmap of the indirect effect of patching from clean to corrupt at each token position and layer. This will show us which positions in the network carry the information most necessary for predicting the correct answer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="c1"># Create a 2D array: layers as rows, tokens as columns, then transpose
</span><span class="n">scores</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">patched_activations</span><span class="p">).</span><span class="n">T</span>  <span class="c1"># Transpose so tokens are rows, layers are columns
</span><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Scores shape after transpose: </span><span class="si">{</span><span class="n">scores</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># Decode tokens to actual text
</span><span class="n">corrupt_tokens</span> <span class="o">=</span> <span class="p">[</span><span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">decode</span><span class="p">([</span><span class="n">token_id</span><span class="p">])</span> <span class="k">for</span> <span class="n">token_id</span> <span class="ow">in</span> <span class="n">corrupt_input_ids</span><span class="p">]</span>
<span class="n">clean_tokens</span> <span class="o">=</span> <span class="p">[</span><span class="n">model</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">decode</span><span class="p">([</span><span class="n">token_id</span><span class="p">])</span> <span class="k">for</span> <span class="n">token_id</span> <span class="ow">in</span> <span class="n">clean_input_ids</span><span class="p">]</span>

<span class="c1"># Create token labels from the clean tokens
</span><span class="n">tokens</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s">'"</span><span class="si">{</span><span class="n">tok</span><span class="si">}</span><span class="s">"'</span> <span class="k">for</span> <span class="n">tok</span> <span class="ow">in</span> <span class="n">clean_tokens</span><span class="p">]</span>

<span class="c1"># The color scale below is clipped to [0, 1] to emphasize positive effects,
# but we print the full actual range for reference.
</span><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Data range: min=</span><span class="si">{</span><span class="n">scores</span><span class="p">.</span><span class="nb">min</span><span class="p">()</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">, max=</span><span class="si">{</span><span class="n">scores</span><span class="p">.</span><span class="nb">max</span><span class="p">()</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">rcdefaults</span><span class="p">()</span>
<span class="k">with</span> <span class="n">plt</span><span class="p">.</span><span class="n">rc_context</span><span class="p">(</span>
    <span class="n">rc</span><span class="o">=</span><span class="p">{</span>
        <span class="s">"font.family"</span><span class="p">:</span> <span class="s">"Times New Roman"</span><span class="p">,</span>
        <span class="s">"font.size"</span><span class="p">:</span> <span class="mi">6</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">):</span>
    <span class="c1"># Set figure size
</span>    <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span>
        <span class="n">figsize</span><span class="o">=</span><span class="p">(</span>
            <span class="mi">6</span><span class="p">,</span>
            <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.08</span> <span class="o">+</span> <span class="mf">1.8</span>
        <span class="p">),</span>
        <span class="n">dpi</span><span class="o">=</span><span class="mi">200</span>
    <span class="p">)</span>
    
    <span class="c1"># Scale range
</span>    <span class="n">scale_kwargs</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"vmin"</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
        <span class="s">"vmax"</span><span class="p">:</span> <span class="mi">1</span>
    <span class="p">}</span>
    
    <span class="n">heatmap</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">pcolor</span><span class="p">(</span>
        <span class="n">scores</span><span class="p">,</span>
        <span class="n">cmap</span><span class="o">=</span><span class="s">"Purples"</span><span class="p">,</span>
        <span class="o">**</span><span class="n">scale_kwargs</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">invert_yaxis</span><span class="p">()</span>
    
    <span class="c1"># Y-axis: token labels (rows)
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_yticks</span><span class="p">([</span><span class="mf">0.5</span> <span class="o">+</span> <span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">scores</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])])</span>  <span class="c1"># Number of tokens
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
    
    <span class="c1"># X-axis: layer labels (columns)
</span>    <span class="n">num_layers</span> <span class="o">=</span> <span class="n">scores</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">tick_indices</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">num_layers</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">tick_indices</span> <span class="o">+</span> <span class="mf">0.5</span><span class="p">)</span>  <span class="c1"># Number of layers
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">tick_indices</span><span class="p">)</span>
    
    <span class="n">title</span> <span class="o">=</span> <span class="s">"Indirect Effects of Residual Layers"</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">title</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Layer"</span><span class="p">)</span>
    <span class="c1">#ax.set_ylabel("Tokens")
</span>    
    <span class="n">color_scale</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">colorbar</span><span class="p">(</span><span class="n">heatmap</span><span class="p">)</span>
    <span class="n">color_scale</span><span class="p">.</span><span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span>
        <span class="s">"Normalized Score"</span><span class="p">,</span>
        <span class="n">y</span><span class="o">=-</span><span class="mf">0.12</span><span class="p">,</span>
        <span class="n">fontsize</span><span class="o">=</span><span class="mi">8</span>
    <span class="p">)</span>
    
    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Data shape: </span><span class="si">{</span><span class="n">scores</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Number of tokens: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Scores shape after transpose: (28, 80)
Data range: min=-0.2458, max=1.1144
Data shape: (28, 80)
Number of tokens: 28
</code></pre></div></div>

<p><img src="/images/indirect_effects_heatmap.png" alt="Indirect Effects of Residual Layers Heatmap" /></p>

<p>Based on the above plot, we can make the following suppositions:</p>

<ul>
  <li>
    <p>The strong early-layer effects at the “ Einstein” token suggest that the model relies heavily on its representation of Albert Einstein to make the correct prediction. Perhaps it uses these early layers to resolve who Albert Einstein is: what attributes he possesses, what his profession is.</p>
  </li>
  <li>
    <p>The mid-layer effects at the “ Newton” and “,” tokens suggest this is where the model retrieves information about Isaac Newton to inform its prediction.</p>
  </li>
  <li>
    <p>By around layer 34, the strongest effects shift to the last token position, suggesting that the information relevant to the correct prediction has already been extracted and moved there by this point.</p>
  </li>
</ul>
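<p>These readings can also be spot-checked numerically rather than by eye. Below is a minimal sketch that finds the layer of peak indirect effect for each token position; the random matrix is a stand-in for the real <code>scores</code> array, since the actual values come from the run above.</p>

```python
import numpy as np

# Stand-in for the (tokens x layers) scores matrix computed above;
# replace with the real `scores` array from the patched run.
rng = np.random.default_rng(0)
scores = rng.random((28, 80))

# For each token position, the layer with the strongest indirect effect.
peak_layers = scores.argmax(axis=1)

# Token positions whose peak falls in the early vs. late third of the network.
num_layers = scores.shape[1]
early = np.flatnonzero(peak_layers < num_layers // 3)
late = np.flatnonzero(peak_layers >= 2 * num_layers // 3)
print(f"peaks early: {early.tolist()}")
print(f"peaks late: {late.tolist()}")
```

<p>On the real data, a token like “ Einstein” peaking in the early third and the final token peaking in the late third would line up with the qualitative reading of the heatmap.</p>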

<h2 id="final-thoughts">Final Thoughts</h2>

<p>Patching at the residual stream level is informative, but it is the most coarse-grained form of patching we can do. For a finer-grained analysis, we will want to patch the MLP and attention sublayers individually, which we will do in the next notebook.</p>]]></content><author><name>Giordano Rogers</name><email>rogers.gi@northeastern.edu</email></author><category term="mech interp" /><category term="activation patching" /><category term="residual stream" /><summary type="html"><![CDATA[Activation patching is a technique that lets us identify which model components are involved in specific behaviors.]]></summary></entry></feed>