
Introduction
This project builds on the earlier work titled "Real-Time Coding Adventure of the Bio-Cybernetic System." The original project revolved around creating a "digital creature" that interacted with user input and environmental changes. Inspired by Physarum polycephalum, it used mouse and webcam data to track light and motion, generating sequences that shifted the interface’s behavior. By occasionally obstructing the display and then gradually returning to normal after a period of stillness, the system encouraged users to rethink their relationship with the screen. It highlighted how feedback loops can disrupt the usual one-way interaction we have with interfaces, pushing for a more reflective and dynamic experience. (Figure 1)
The project was grounded in ideas like performative idiom and circular causality. Performative idiom challenges the notion of fixed, pre-defined systems, focusing instead on processes that adapt and evolve in unpredictable ways, much like Ashby’s Homeostat or Pask’s Colloquy of Mobiles. Circular causality, on the other hand, explores how feedback loops create systems that can self-regulate and respond to their environment. Together, these concepts shaped the project’s goal of questioning sedentary, mouse-driven interactions and reimagining them as active, multi-directional exchanges. That said, the original system was limited to what could happen within the boundaries of the computer screen.
The new project takes this further by using Stable Diffusion and Projection Mapping to create a dynamic ambient environment that responds to user interactions in real time. It tracks subtle inputs like mouse location, movement speed, typing rhythm, and posture, and projects them into the physical space around the user. This creates a feedback loop that extends beyond the screen, allowing not only the user but also others nearby to engage with and influence the system.
Some key changes in the new system include:
- Going beyond the screen: The projection mapping takes everyday interactions, like mouse and keyboard movements, and translates them into ambient visuals that fill the surrounding environment.
- Using generative AI: By integrating StreamDiffusion, the system combines both measurable data (like movement speeds and posture) and qualitative inputs (like text prompts or screen captures) to generate visuals that feel imaginative and alive.
- Interactive projections: The projected environment becomes part of the interaction, giving users something new to react to and creating a back-and-forth exchange between the digital and physical spaces.
This project pushes the boundaries of human-computer interaction by moving the focus away from the computer itself and into the surrounding space. It highlights how small, often overlooked actions like typing and mouse movements can shape the environment, inviting users to not only see their interactions differently but also share the experience with others. By breaking out of the screen’s confines, it transforms a closed-loop system into something more open, interactive, and thought-provoking.
Equipment and Computational Pipeline Setup
The workflow begins with the user deciding to interact with the computer, and this interaction is captured through inputs such as mouse and keyboard usage, as well as the user’s physical movements. These interactions generate a range of system inputs, including cursor XY coordinates, cursor movement speed, keyboard typing speed, live PC screen captures, and human contour segmentation. Each type of input is processed and integrated into a computational pipeline, culminating in real-time generative visuals projected back into the user’s environment. (Figure 2, 3)
Input Data Capture and Processing
- Cursor and Keyboard Data
- Screen Capture and Image Segmentation
- Stream Diffusion Pipeline
Cursor position and movement speed, along with keyboard typing speed, are captured using Python libraries such as PyQt5 and pynput. This data is processed in a standalone Python script and sent over OSC to a TouchDesigner project file, where it is received by TouchDesigner's OSC input operators.
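As an illustration, the sketch below shows one way such a capture script could be structured, using pynput for input listening and the python-osc package to forward values to TouchDesigner. The sender library, OSC addresses, port, and speed calculations are assumptions for illustration, not the project's exact code.

```python
# Minimal sketch (not the project's exact script): capture cursor position/speed
# and typing speed with pynput and forward them to TouchDesigner over OSC.
# The OSC addresses, port, and timing constants below are assumptions.
import math
import time

from pynput import keyboard, mouse
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 7000)  # TouchDesigner OSC In on port 7000 (assumed)

last_pos = None    # previous cursor position
last_time = None   # timestamp of the previous cursor event
key_times = []     # timestamps of recent key presses

def on_move(x, y):
    """Send cursor XY and an instantaneous speed estimate (pixels per second)."""
    global last_pos, last_time
    now = time.time()
    speed = 0.0
    if last_pos is not None and now > last_time:
        dist = math.hypot(x - last_pos[0], y - last_pos[1])
        speed = dist / (now - last_time)
    last_pos, last_time = (x, y), now
    client.send_message("/cursor/xy", [float(x), float(y)])
    client.send_message("/cursor/speed", float(speed))

def on_press(key):
    """Send typing speed as keystrokes per second over a 2-second window."""
    now = time.time()
    key_times.append(now)
    while key_times and now - key_times[0] > 2.0:
        key_times.pop(0)
    client.send_message("/keyboard/speed", len(key_times) / 2.0)

mouse_listener = mouse.Listener(on_move=on_move)
key_listener = keyboard.Listener(on_press=on_press)
mouse_listener.start()
key_listener.start()
key_listener.join()  # keep the script alive
```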
The PC’s live screen is captured, and human contour segmentation is processed directly in TouchDesigner. These visual inputs provide additional parameters for the system to analyze and integrate into the generative output. (Figure 4)
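For reference, the standalone sketch below approximates these two operations in plain Python, using the mss package for screen capture and MediaPipe's selfie-segmentation model for the human contour. In the project itself both steps run natively inside TouchDesigner (with the MediaPipe integration listed in the references), so the package choices here are assumptions for illustration only.

```python
# Illustrative stand-in only: the project performs screen capture and human
# contour segmentation inside TouchDesigner. This sketch approximates the same
# two operations with mss (screen grab) and MediaPipe selfie segmentation.
import cv2
import mediapipe as mp
import mss
import numpy as np

segmenter = mp.solutions.selfie_segmentation.SelfieSegmentation(model_selection=1)

# Grab the primary monitor as a BGRA frame and drop the alpha channel.
with mss.mss() as sct:
    shot = sct.grab(sct.monitors[1])
    screen_bgr = np.array(shot)[:, :, :3]

# Segment the person in a single webcam frame (a live loop would do this per frame).
cap = cv2.VideoCapture(0)
ok, frame_bgr = cap.read()
cap.release()
if ok:
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = segmenter.process(rgb)
    # segmentation_mask is a float map in [0, 1]; thresholding yields a binary
    # human-contour mask comparable to the TouchDesigner input layer.
    contour_mask = (result.segmentation_mask > 0.5).astype(np.uint8) * 255
```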
After pre-processing, the data is fed into the StreamDiffusion pipeline, which leverages Stable Diffusion for real-time image generation. With a predetermined text prompt, StreamDiffusion synthesizes all the inputs to create a sequence of images, generating around 16 frames per second. The result is not simply displayed on the screen but instead projected back into the physical workspace, bridging the digital and ambient environments.
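For readers unfamiliar with StreamDiffusion, the sketch below follows the usage pattern documented in the StreamDiffusion repository (see References) to show roughly what a real-time img2img loop looks like. The model, t_index_list values, and prompt are placeholders and may differ from the configuration actually used alongside TouchDesigner.

```python
# Rough sketch of a StreamDiffusion img2img loop, following the usage pattern
# documented in the upstream repository. Model names, t_index_list values, and
# the prompt are placeholders rather than the project's actual configuration.
import torch
from diffusers import AutoencoderTiny, StableDiffusionPipeline
from streamdiffusion import StreamDiffusion
from streamdiffusion.image_utils import postprocess_image

pipe = StableDiffusionPipeline.from_pretrained("KBlueLeaf/kohaku-v2.1").to(
    device=torch.device("cuda"), dtype=torch.float16
)

# Wrap the pipeline for few-step streaming generation.
stream = StreamDiffusion(pipe, t_index_list=[32, 45], torch_dtype=torch.float16)
stream.load_lcm_lora()  # merge LCM-LoRA so only a few denoising steps are needed
stream.fuse_lora()
stream.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd").to(
    device=pipe.device, dtype=pipe.dtype
)

stream.prepare("urban plan")  # fixed text prompt, as in the trials

def generate_frame(input_image):
    """Run one img2img step on a composited input frame (PIL image)."""
    x_output = stream(input_image)
    return postprocess_image(x_output, output_type="pil")[0]

# In the installation, this would be called continuously on frames streamed from
# TouchDesigner (after a short warmup of a few calls), yielding roughly 16
# generated frames per second on suitable hardware.
```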
Projection and Ambient Feedback
The generative visuals are projected using a carefully configured system:
- Projection Mapping: The output is mapped to the user’s surroundings. Using the CamSchnappr component, the human contour is projected onto the user’s back, while the remaining visuals are cast onto the workspace.
- Input Integration:
  - The PC screen defines the overall composition, with features like window positions and icons influencing the image layout.
  - The cursor generates a dynamic colored square whose size corresponds to cursor speed; this square acts as a standout object in the final image.
  - The human contour creates a large white area in the projection, contrasting with other colors to establish a distinct visual zone that can influence the topology or foreground of the image.
  - The keyboard typing speed adds a crystallized texture layer to the projection, emphasizing dynamic user input.
The system ensures each input contributes in a visually distinct way, making the relationships between the input parameters and the generated visuals clear and observable.
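To make these relationships concrete, the fragment below sketches how the raw measurements could be normalized into control parameters for the layers described above. The constants and parameter names are purely illustrative assumptions, not values from the project file.

```python
# Hypothetical normalization of raw inputs into 0-1 control parameters for the
# visual layers described above. All constants and names are illustrative.
def normalize(value, max_value):
    """Clamp a raw measurement into the 0-1 range."""
    return max(0.0, min(1.0, value / max_value))

def map_inputs(cursor_speed_px_s, typing_keys_s, contour_coverage):
    return {
        # Faster cursor movement -> larger colored square in the composition.
        "square_scale": normalize(cursor_speed_px_s, 3000.0),
        # Faster typing -> stronger crystallized texture layer.
        "texture_weight": normalize(typing_keys_s, 8.0),
        # Fraction of the frame covered by the human contour (already 0-1).
        "contour_weight": contour_coverage,
    }
```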
Equipment Setup
To achieve this setup, the following equipment was used:
- Projector: For projecting the generated visuals onto the user and the workspace.
- Laptops: To align and present processed input parameters.
- DJI Osmo Video Camera: To capture the user’s movements and generate human contour data.
- Main Computer: The central system for user interaction, running the computational pipeline and handling generative processes.
This setup creates a seamless loop where user actions influence the generated visuals, which are then projected back into the environment, forming a dynamic feedback system that blends the physical and digital worlds. (Figure 5, 6)
Trials and Generation Results
Iteration 1 - Plants
(video link: https://youtu.be/_m71EAxqBdY)
For the first trial, I used the simple prompt “plant.” What stood out was how the system interpreted the straight lines of my computer’s window edges as architectural or interior elements, creating spaces to host the generated plants. The human contour often transformed into a table or light surface in the foreground, while the cursor input became a large, colorful plant. (Figure 7)
It was fascinating to see how the cursor, as it moved across the screen, morphed into different plant forms or elements. (Figure 8,9,10)
Since the projections are situated in real-world workspaces, some elements, like a lamp rendered by the cursor, seamlessly blended into the physical environment, making them feel like natural extensions of the workspace. (Figure 11)
Iteration 2 - Urban Plan
(video link: https://youtu.be/_m71EAxqBdY)
For the second trial, I used “urban plan” as the prompt. Here, the human contour consistently served as “negative” space, shaping the urban fabric that the StreamDiffusion model generated. Interestingly, the organic forms of human silhouettes were often transformed into rectangular sub-parts to align with the overall composition. (Figure 12)
Although “urban plan” typically implies a flat, 2D output, the system occasionally produced perspective-style drawings. This may have been influenced by the layout of the computer windows combined with the user’s movement and contour position. Subtle changes in these compositions often led to strikingly different results. (Figure 13, 14, 15)
Iteration 3 - Physarum Screen Interaction with Urban Plan
(video link: https://youtu.be/OAg5alXv8xY)
In the third trial, I combined the earlier Physarum simulation with the “urban plan” prompt to create a layered result. The idea stemmed from previous studies linking Physarum’s efficient route-finding behaviors to urban layout design. I was curious to see how merging these two systems would affect the output.
While the correlation wasn’t immediately clear, the StreamDiffusion model did reconstruct Physarum trails as green spaces within the urban fabric. Untouched areas of the screen often became urban blocks, suggesting an emergent relationship between the simulation and the generated cityscape. (Figure 16)
Reflection and Further Development
Project Highlights and Discoveries
This project demonstrated the potential to extend human-computer interaction beyond the confines of the computer screen, integrating the digital and physical realms. By projecting the generated visuals onto surrounding surfaces, such as white walls, the system transformed typically private digital interactions into shared, ambient experiences. This shift revealed how even subtle parameters, such as cursor movement or typing speed, could influence the generated output and contribute to a layered, interactive environment.
Through this process, I became more mindful of the nuances in how I interact with the computer. Small changes in behavior, such as varying the speed of my cursor or the rhythm of my typing, created noticeable differences in the generated visuals. This heightened awareness emphasized the significance of seemingly minor user actions, offering new perspectives on how interaction can shape outcomes in computational systems.
The project also aligns conceptually with world-making, as described in Nelson Goodman’s essay “Words, Works, Worlds.” Goodman’s methods—composition, decomposition, weighting, and framing—closely parallel how this system processes inputs. Interactions with the computer were decomposed into measurable parameters, reweighted, and then reconstructed into cohesive visuals through the diffusion model. This framing allowed for an exploration of how everyday digital interactions could be reframed as creative, generative acts.
Additionally, this project prompted a reconsideration of generative AI’s role in design. Previously, I viewed AI as a tool that often operates beyond human control, leading to an over-reliance on automated outputs. However, this framework allowed for a more collaborative interaction. The system wasn’t merely producing static results but facilitating a continuous feedback loop, where my inputs shaped the visuals in real time. This iterative process introduced a dynamic relationship between human agency and AI, reframing AI as a responsive partner rather than a deterministic tool.
Technical Limitations
Despite its strengths, the project faced several technical constraints that limited its execution. One challenge involved the projection setup. I initially planned to project visuals onto my body by wearing white clothing to enhance visibility. However, the projector’s fixed focus, calibrated for the white wall, caused the image on my back to appear blurry. Similarly, when I used a white hat to extend projections to my head, the DJI Osmo video camera failed to recognize me as a human figure, disrupting the system’s ability to process human contours as input parameters. These issues highlight the challenges of integrating dynamic human movements with projection-based systems.
Another limitation was the resolution of the StreamDiffusion output, which was restricted to 512 x 512 pixels due to performance considerations. While sufficient for smaller-scale displays, this resolution was inadequate for large-area projections, resulting in a loss of visual clarity and detail. Enhancing the resolution would greatly improve the system’s ability to produce impactful, high-quality visuals.
Further Development Potential
The project offers several directions for further exploration and refinement:
- Expanding Input Modalities: Incorporating additional environmental inputs, such as lighting conditions, temperature, or sound, could create a richer, multisensory interaction. Outputs could also extend beyond visuals to include auditory or tactile elements, enhancing the system’s ability to engage users on multiple levels.
- Extended User Studies: Longer-term observation would provide a deeper understanding of how the projections influence user behavior and perception. For example, how might ambient projections alter a user’s default interactions with their computer over time? Additionally, observing how others in the environment respond to or interact with the projections could yield insights into the system’s broader social implications.
- Applications in Design: The framework has potential for use in creative workflows, particularly in design processes. Designers could engage with the system using their body movements, mouse inputs, and keyboard actions to generate design outputs in real time. This approach could shift the act of designing from a purely cognitive task to a more embodied, interactive practice, merging physical gestures with computational creativity.
In conclusion, this project serves as an initial exploration of how human-computer interaction can extend beyond the screen, integrating physical spaces and enabling dynamic feedback loops. While the current implementation has notable technical limitations and requires further refinement, it provides a foundation for rethinking the relationship between humans, machines, and creative systems. The framework shows potential for applications in design and other fields, particularly in promoting a more interactive and embodied approach to digital tools.
References
Ashby, W. Ross. Design for a Brain: The Origin of Adaptive Behaviour. Springer Science & Business Media, 1952.
Beer, Stafford. Brain of the Firm. 2nd ed., Wiley, 1981.
Cumulo-Autumn. StreamDiffusion. GitHub Repository. Accessed December 14, 2024. https://github.com/cumulo-autumn/StreamDiffusion.
Goodman, Nelson. Ways of Worldmaking. Hackett Publishing Company, 1978.
Mediapipe-TouchDesigner. Mediapipe Integration for TouchDesigner. GitHub Repository. Accessed December 14, 2024. https://github.com/torinmb/mediapipe-touchdesigner.
Pask, Gordon. Conversation Theory: Applications in Education and Epistemology. Elsevier, 1976.
Suchman, Lucy. Human-Machine Reconfigurations: Plans and Situated Actions. 2nd ed., Cambridge University Press, 2007.