Tim Etler

Sequentially Injecting Streams Within Streams: Enabling Parallel Recursive Streaming AI Swarms

August 23, 2025

The ability to sequentially inject a stream within another stream enables some really impressive optimizations. Each injected stream can perform its own work in parallel, while consumers of the output stream can still get the contents in order without having to worry about how the stream is implemented underneath.

When combined with streaming LLM calls, this enables us to split complex monolithic prompts into child agents, each with its own simpler, more focused prompt. This solves prompt complexity issues while also allowing the later parts of the stream to be constructed in parallel, letting us provide better answers at ludicrous speed.

Parallel recursive AI agent spawning is already possible on the server, but existing methods break streaming outputs and often block on the output of the parent agents before spawning children.

In this contrived example I created a franchise series summary agent swarm with three levels of agents: a franchise agent that lists out the movies, an installment agent that lists out the acts, and an act synopsis agent that provides a detailed multi-paragraph synopsis of each act (be sure to watch until the end):

This produces a lot of tokens very fast. So many, in fact, that this run on Claude 4 Sonnet cost me 30¢ (a single run on 4 Opus costs $1.50)! The estimated token count is about 8.8k tokens across the 50 agents it spawned, which finished in about 30 seconds; if it were generated in a single agent call it would have taken 6½ minutes!

This is a parallel recursive streaming AI swarm with broken-down prompts and recursive parallel streams. In this example it’s summarizing Star Wars. The output is sequential, but the child agents are not blocked. After the main franchise agent spawns a child agent, it keeps generating tokens, and while that first child agent is still running, it can spawn another child agent. Those child agents can also spawn their own child agents, and so on, and they all run at the same time, while the client gets the tokens the instant they are ready. It allows streams to produce other streams that can run in parallel while maintaining order.

Watching the demo closely, you will notice that it begins by outputting movie titles, then there’s a delay while it swaps to the child agent stream writing out the acts (this can be reduced by pre-warming the HTTP call or by using cluster-local LLMs with a lower round-trip time). The first act agent streams the first act of “The Phantom Menace”, and by the time it’s done the second act is already almost finished, so it jumps ahead. After act 3 finishes we jump past act 4 and are almost done streaming the final act of the movie. Blink and you’ll miss it, but once the first movie completes we suddenly jump ahead to the middle of “A New Hope”, completely skipping the rest of the prequels! They were all generated in the background! Once the next act of “A New Hope” completes we jump ahead again to “Return of the Jedi”, and before you know it the entire franchise summary is done, with the sequel trilogy already complete (and a big API bill along with it). It’s so fast the screen can’t even keep up. Using this in production will necessitate artificial throttling.

It’s all there, purple and white, sequentially ordered. You can see how much was generated so quickly. The speed is insane, but it enables even more power by letting you simplify your prompts by splitting up your agents.

If you want to play with this yourself, you can find the proof-of-concept repo here.

Prompt Complexity and Attention Dilution

I’ve been exploring new ways to build UI into AI agent responses at my company Vetted, and recently created a way to add JSX component tag support to our markdown responses. It is enabling new UI experiences within our AI answers that I’m very excited about, but getting the agents to output the JSX component tags with the correct format has been pushing the latest LLM models to their limits, and we’ve hit a complexity wall. The current model limits are preventing us from adding even more advanced component composition to our answers.

When a prompt gets too large and complicated on the backend, our standard approach to fixing it has been to simply break the prompt down into smaller tasks and interpolate that output into an answer, or feed that data into another higher-level agent. We already implement parallel recursive agent swarms in our backend to distribute our research and keep our agents accurate with tightly scoped tasks. But the big problem with that approach for producing the final response to the frontend is that the client has an extra requirement: it needs to support streaming.

On the backend this isn’t an issue. We can run the agents in parallel and have agents spawn sub agents recursively and wait for each block of work to complete before processing it, but on the client blocking is not an option. Our answers take a while to produce, and while our users are actually willing to wait a while for high quality research, they are not willing to wait before they start seeing content streaming in. It’s the same time-to-first-paint problem. You don’t need to have the entire page ready and interactive immediately, but you do need to show something is happening as soon as possible. It’s why so much time and effort has been invested in streaming rendering technology.

Breaking down the task of a monolithic agent into simpler tasks would enable several very important capabilities. It would solve “attention dilution” problems, where an overloaded prompt is unable to follow complex instructions accurately because attention on the instructions is spread too thin. It would also solve a model choice problem. With a monolithic prompt, only one model can be used at a time, meaning tradeoffs must be made between speed, cost, and specialization. By breaking up a prompt, it becomes possible to optimize each subtask to do one thing well, while also optimizing the model used for each subtask, unlocking optimizations that were not previously possible.

Injecting new streams also opens up the opportunity to interpolate live running agents with data from external sources like databases, caches, APIs, or other data sources. An injected stream doesn’t need to be an AI stream; it could be any data stream, and being able to inject newly discovered work while data is still streaming in could lead to big streaming orchestration and optimization improvements.

To sum up, we needed a way to provide ordered consumption of parallel production over streams.

Injecting and Hot Swapping Streams

So what was the solution? What would I need to get multiple prompts to work without breaking streaming?

I figured that first, you’d need a way to swap between streams. When a parent agent spawns a child agent, the child agent needs to be able to take over the duties of streaming. Second, you’d need a way to split the stream in the middle of its response. Not only do you need a way to swap between streams, you need a way to inject the child stream into the parent stream, swap to the child stream, then swap back to the parent stream. Third, you’d need to be able to do this on the fly without knowing when the parent agent would invoke a child agent.

Conceptually you could do this with a transform stream that can process the input stream data as it is received and parse for directives to spawn a child stream. You could then orchestrate those streams to swap from the main stream to the child stream and back again.

I searched high and low for an existing solution to accomplish this, but I came up with nothing. As far as I could tell, there are no existing reusable solutions to this problem.

A Simple Solution to an Incredibly Hard Problem

When I first started working on the solution, it seemed like I had my work cut out for me. At first I thought I would need some complex central orchestration layer that kept track of all the child agents, their children, and so on, and that it would come with tremendous buffering memory overhead.

Then I started breaking the problem down more. I wanted to use every tool I had available to me, so I started reading the web streams spec front to back. While doing that I was reminded of something: streams in JavaScript are also async iterables. I realized that opened up a new angle I wasn’t considering. I didn’t need to orchestrate streams, I needed to orchestrate async iterables. What if, instead of needing a complex central orchestration layer, I could combine async iterables together?

Refreshing my memory further on the capabilities of async iterables, I found another key component. The generator pattern allows you to delegate to other iterables using yield*. This is what would enable you to swap from one stream to another, but you also need to be able to add new work dynamically without knowing up front what needs to be done. The generator would start with an empty chain of iterators, meaning once you consumed it, it would immediately end before any work could be added to it. I needed the generator to be able to stop and wait for a new iterable when it ran out of iterators to delegate to. The solution was to combine async iterables, generator delegation, and an unresolved promise that would eventually resolve with the next iterable or a termination signal.

Introducing the Relay Pattern

I’ve looked everywhere I can and have not been able to find any mention of this pattern anywhere, so I’ve taken the liberty of naming it. I call it the “Relay Pattern” because, like a relay race, each iterable runs its leg of the race. When it completes its leg, it waits for the next iterable to be ready and then delegates the main iterator to it. At its core, it lets you chain new iterables on the fly before you know what those iterables will be, and a consumer of the main async iterable can consume it transparently without needing to be aware that a swap even happened.

AsyncIterableSequencer

Putting this together, I’ve created a library, async-iterable-sequencer, that implements this pattern. As the name suggests, it is an async iterable that sequences other async iterables. The code is so simple I can paste it here. Excluding types, it’s only 25 lines of code:

export type AnyIterable<T> = AsyncIterable<T> | Iterable<T>;
export type Chainable<T> = AnyIterable<T> | null;
export type Chain<T> = (iterator: Chainable<T>) => void;
export interface AsyncIterableSequencerReturn<T> {
  sequence: AsyncGenerator<T>;
  chain: Chain<T>;
}
export function asyncIterableSequencer<T>(): AsyncIterableSequencerReturn<T> {
  let resolver: Chain<T>;
  const next = (): AsyncGenerator<T> => {
    const { promise, resolve } = Promise.withResolvers<AnyIterable<T>>();
    resolver = (nextIterator) => {
      resolve(nextIterator ? flatten(nextIterator, next()) : empty());
    };
    const generator = async function* () {
      yield* await promise;
    };
    return generator();
  };
  return {
    sequence: next(),
    chain: (iterator) => {
      resolver(iterator);
    },
  };
}
async function* flatten<T>(...iterators: AnyIterable<T>[]): AsyncGenerator<T> {
  for (const iterator of iterators) {
    yield* iterator;
  }
}
function* empty() {}

The factory function returns a sequence iterator and a bound chain function for adding new iterables. Because chain fulfills a promise, it can be called asynchronously at any point in time.

This foundational pattern lets us create a chain of async iterables, and because streams are also async iterables, it lets us enqueue entire streams, which is one of the biggest requirements for the stream injection we needed. Since async iterables do not need to be complete at the time they are chained, we can connect multiple batches of async work that complete in parallel, in any order, while consumers still receive the output in the order the work was chained. Because this uses iterable delegation, there’s no additional buffering overhead for orchestrating the streams.
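
To make the mechanics concrete, here’s a minimal usage sketch of my own (not from the library’s documentation; the slowly helper is made up for illustration) showing work being chained on the fly and consumed strictly in chained order:

import { asyncIterableSequencer } from "async-iterable-sequencer";

// Hypothetical helper for illustration: yields each item after a delay.
async function* slowly<T>(items: T[], delayMs: number): AsyncGenerator<T> {
  for (const item of items) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    yield item;
  }
}

const { sequence, chain } = asyncIterableSequencer<string>();

// Chain some initial work before consumption starts.
chain(["intro"]); // plain sync iterables work too
chain(slowly(["chapter 1", "chapter 2"], 50));

// More work can be chained later, from anywhere, even while the sequence is
// already being consumed. The consumer simply waits when the chain runs dry.
setTimeout(() => {
  chain(slowly(["epilogue"], 10));
  chain(null); // signal that no more iterables will be added
}, 200);

for await (const chunk of sequence) {
  console.log(chunk); // prints: intro, chapter 1, chapter 2, epilogue (always in chained order)
}

In the AI swarm case, the chained items are LLM response streams whose producers run independently of consumption, so later streams can already be generating in the background while an earlier one is still being read out.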

AsyncIterableSequencers enable the parallel streaming part of parallel recursive streaming AI swarms, but we still need to solve the recursive part.

ConductorStream

That brings us to the next library I created: conductor-stream. It takes the AsyncIterableSequencer and combines it with a TransformStream interface. A transform stream lets you intercept a stream and process chunks as they pass through, optionally transforming them and enqueuing them onto an output stream whose output can differ from what was consumed from the input stream. Where a ConductorStream differs is that it doesn’t enqueue individual chunks like a transform stream; it chains entire async iterables, which includes entire other streams.

This means that an in-progress stream can be parsed for a directive to generate a new stream; from that directive it can create a brand new stream and inject it into the middle of the output stream by chaining it before passing along the rest of the input stream. Since those injected streams can themselves be conductor streams, conductor streams can recursively inject other conductor streams, fulfilling the recursive streaming part of parallel recursive streaming AI swarms.

It’s a “Conductor” in the sense that it doesn’t manage the streams directly with buffering and queues, but rather orchestrates the streams together by chaining them. While the chaining is recursive, it doesn’t build up a call stack, as the iterable delegation flattens the chain, allowing the “Conductor” to guide the chain instead of manipulating it directly.

Since async iterable sequencers are so simple, so are conductor streams:

import { asyncIterableSequencer, Chain } from "async-iterable-sequencer";

export interface ConductorStreamOptions<I, O> {
  start?: (chain: Chain<O>) => void;
  transform: (chunk: I, chain: Chain<O>) => void;
  finish?: (chain: Chain<O>) => void;
}

export class ConductorStream<I, O> {
  public readable: ReadableStream<O>;
  public writable: WritableStream<I>;

  constructor({ start, transform, finish }: ConductorStreamOptions<I, O>) {
    const { sequence, chain } = asyncIterableSequencer<O>();
    this.readable = ReadableStream.from<O>(sequence);
    this.writable = new WritableStream<I>({
      write: (chunk) => {
        transform(chunk, chain);
      },
      close: () => {
        finish?.(chain);
      },
    });
    start?.(chain);
  }
}

It’s little more than a class wrapper that implements the transform stream interface and exposes some lifecycle callbacks. All the magic happens within the async iterable sequencer; the conductor stream simply lets you interface with it in the middle of another stream.
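
To show how that looks in practice, here’s a simplified sketch of my own; spawnChildAgentStream is a hypothetical stand-in for kicking off a child agent, and real directive parsing has to handle tags that arrive split across chunks. The conductor passes text through and injects an entire child stream whenever it sees a directive:

import { ConductorStream } from "conductor-stream";

// Hypothetical: start a child agent for the directive and return its streaming
// response (ReadableStreams are async iterable, so they can be chained directly).
declare function spawnChildAgentStream(directive: string): ReadableStream<string>;

const conductor = new ConductorStream<string, string>({
  transform: (chunk, chain) => {
    if (chunk.startsWith("<installment")) {
      // Chain the whole child stream in place; the parent keeps streaming
      // while the child generates in the background.
      chain(spawnChildAgentStream(chunk));
    } else {
      // Plain content is chained as a single-item iterable.
      chain([chunk]);
    }
  },
  finish: (chain) => {
    chain(null); // terminate the output once the input stream closes
  },
});

// parentAgentStream.pipeThrough(conductor).pipeTo(outputSink);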

With all the building blocks in place we can now build the engine to power parallel recursive streaming AI swarms.

Stream Weaver Frameworks

I call the frameworks that coordinate conductor streams and async iterable sequencers together “Stream Weaver Frameworks” in the sense that they allow you to pull streams from multiple layers and “weave” them into a greater pipeline, taking individual strands and creating a continuous flow of data.

For the AI swarm proof of concept demo I’ve created an example weaver framework called swarm-weaver. It has two kinds of conductors: one for prompts and one for agent execution. The PromptConductor takes a stream of content that it interpolates into a prompt, and once it has compiled the stream and is ready to send it to an LLM provider for content generation, it chains an AgentConductor onto it. The output of that agent conductor is then parsed for child agent directives, which in the proof of concept are represented as XML tags.
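
As a rough sketch of the prompt side (my own approximation, not the actual swarm-weaver classes; streamCompletion and the {input} placeholder are hypothetical), a prompt conductor can buffer its input stream into a compiled prompt and then chain the LLM call’s stream in its place:

import { ConductorStream } from "conductor-stream";

// Hypothetical: send a compiled prompt to an LLM provider and stream back tokens.
declare function streamCompletion(prompt: string): ReadableStream<string>;

function promptConductor(template: string): ConductorStream<string, string> {
  let input = "";
  return new ConductorStream<string, string>({
    // Accumulate the content stream that will be interpolated into the prompt.
    transform: (chunk) => {
      input += chunk;
    },
    // Once the input stream closes, compile the prompt and chain the agent's
    // response stream in its place, then terminate the chain.
    finish: (chain) => {
      chain(streamCompletion(template.replace("{input}", input)));
      chain(null);
    },
  });
}

In swarm-weaver, the chained response would itself pass through an agent conductor that parses for child directives, along the lines of the directive-parsing sketch above, which is what makes the recursion possible.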

Franchise Example In Depth

In the example video, it generates a synopsis of a media franchise via multiple layers of child agents. The entrypoint main prompt is used to generate a text output that contains XML directives to spawn child agents. Here is an example of what those prompts look like (for brevity the prompts are simplified):

main prompt

You are an agent that outlines the installments of book series, movie franchises, and similar large-scale series. Your task is to take a user's request about a series and output a specific template format listing all installments.

**Input**


**Output Format**
\`\`\`
# [series title]

## [installment title]
<installment series="[series name]" title="[installment title]"/>

## [installment title]
<installment series="[series name]" title="[installment title]"/>
\`\`\`

The value under **Input** is a special interpolation value that is replaced with the prompt’s input stream. For the root prompt, this stream is stdin.

When the prompt generates XML tags, those tags spawn a child prompt. In this example, the root main agent output spawns child agents for the installment prompt:

installment prompt

Generate chronological act summaries for the specified media content. Each act should be a single, concise sentence that captures the essential narrative progression and provides sufficient context for further elaboration.

**Input**
Series: 
Title: 

**Output Format**
\`\`\`
### Act [number]
<act series="[series name]" title="[title]">
[Single sentence act summary]
</act>
\`\`\`

Here, instead of receiving the prompt’s input stream, it receives attribute interpolation values for the series and the title. These come from the main agent’s output of <installment series="[series name]" title="[installment title]"/>. Again, the installment agent output spawns another child agent and feeds into the act prompt:

act prompt

You are tasked with creating an engaging 1-3 paragraph synopsis for the following media based on the provided act outline.

**Media Information:**
- Series: 
- Title: 

**Act Outline:**

The act prompt receives the series and title passed down again, as well as the special input interpolation value, which in this case receives the act summary content generated within the <act> tag.

When this runs, the main agent executes first. It outputs something like:

# Star Wars

## Episode I: The Phantom Menace
<installment series="Star Wars" title="Episode I: The Phantom Menace"/>
...
## Episode IX: The Rise of Skywalker
<installment series="Star Wars" title="Episode IX: The Rise of Skywalker"/>

Non-XML content is chained onto the output stream as a string, and each XML tag encountered triggers a new prompt conductor on the chain. The chain conducted by the main agent ends up looking like this:

stdout < string < installment_prompt < string < installment_prompt ...

Once the content sent to the installment prompts is complete, those prompt conductors chain agent conductors in their place. At this point the chain looks like this:

stdout < string < installment_agent < string < installment_agent ...

Those installment agents generate their own content. Now the output looks like this:

# Star Wars

## Episode I: The Phantom Menace
### Act 1
<act series="Star Wars" title="Episode I: The Phantom Menace">
Jedi Master Qui-Gon Jinn and his apprentice Obi-Wan Kenobi are dispatched to negotiate a trade dispute with the Trade Federation, but discover a sinister plot involving an invasion of the peaceful planet Naboo and encounter a young slave named Anakin Skywalker on Tatooine.
</act>
...

The chain has been expanded with newly injected streams:

stdout < string < act_agent < string < act_agent < ... < string < installment_agent ...

Each conductor expands its work in place within the chain, maintaining order but creating new subtasks in its position.

This approach solves the issue we had with the overly complicated monolithic prompt. Since problems can be broken down without breaking streaming, we can create specific, focused prompts that do fewer things better. Not only that, breaking up the prompts unlocks ludicrous speed: you get the time-to-first-token benefits of streaming plus a swarm of agents building out and expanding content in place, in parallel, with no downside.

The biggest issue I’ve encountered is that the output is too fast. Presenting it reasonably to the user without huge chunks of content appearing will actually require smoothing and throttling.
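
One way to handle that (a minimal sketch of my own, not what the demo does) is a pacing TransformStream between the swarm output and the renderer:

// Re-emit text in small pieces with a short delay so content appears at a
// readable pace instead of in huge jumps.
function smoother(charsPerTick = 8, tickMs = 16): TransformStream<string, string> {
  return new TransformStream<string, string>({
    async transform(chunk, controller) {
      for (let i = 0; i < chunk.length; i += charsPerTick) {
        controller.enqueue(chunk.slice(i, i + charsPerTick));
        await new Promise((resolve) => setTimeout(resolve, tickMs));
      }
    },
  });
}

// swarmOutput.pipeThrough(smoother()).pipeTo(renderSink);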

Meta Prompting

Since I decided to represent child agent directives as XML tags, it got me thinking: what happens if I put an agent inside of an agent? Since text inside a child agent tag is interpolated into that child agent’s prompt, it would make sense that if an agent tag is nested inside another agent tag, the output of the inner agent should be sent to the prompt of the containing agent.

Sending the output of one agent to another enables meta prompting: having agents prompt agents. To do this, I implemented parsing the contents of an agent tag as a stream chained to the containing agent tag’s prompt conductor. That content can itself be another agent tag that spawns its own meta-prompting child agent, whose output gets chained to the input stream of the containing prompt’s special input interpolation value.
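
For instance (hypothetical tag names of my own, not taken from the demo prompts), nesting one agent tag inside another streams the inner agent’s output, along with any literal text, into the outer agent’s prompt input:

<outer_agent>
  Some literal context for the outer agent.
  <inner_agent/>
</outer_agent>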

Using this, you can use sub-agents as idea generators or context generators for the containing agent. The child tags don’t even have to spawn agents; they could be standard subroutines that stream in crawl data, a database call, an API call, or anything you want!

I believe this can enable incredibly powerful patterns, and with the relay pattern powering all of this, everything is unblocked and processed and streamed as soon as the content is resolved and available.

Beyond AI

What this all allows for is ordered consumption of parallel production over streams. The stream weaver framework pattern enables unblocking generative AI responses, but this pattern is broadly applicable to other areas as well. I’m exploring these ideas further in an experimental repo with an (incredibly dense) technical paper. In particular, I believe these patterns could enable new ways to orchestrate and power up streaming web rendering frameworks. I believe it can also help simplify the orchestration of data pipelines and cross execution context communication. If anyone is interested in exploring these ideas, please don’t hesitate to share your thoughts.