Real-time · Codeless · Hyperscaler-validated

Real-time GPU reengineering.
Zero code changes.

Chameleon generates hardware-tuned machine instructions at runtime — eliminating wasted execution and memory pressure. 20–114× faster. 90–99% less energy. Validated on AWS, OCI, GCP, and the Rowan University Digital Engineering Hub.

  • 20–114× faster execution
  • 90–99% less energy
  • 30+ Linux distros validated
  • 0 code changes required

The problem

Why GPU utilization sits at 10–35%.

GPU underutilization is not a hardware problem. It is a structural limitation of how modern software defines and executes work. Today's systems tell machines how to do work in advance — code is written, compiled, scheduled, and orchestrated long before it reaches the GPU. Once execution begins, those decisions are largely frozen.

Flaw 01

The abstraction tax nobody talks about

Every layer in the software stack — hardware to OS to runtime to framework to application — was designed to make programming easier for humans. Every layer also extracts a performance cost. The machine does work the programmer never asked for, executing instructions that exist only to manage the abstractions below.

Evidence:
  • NVIDIA T4 theoretical: ~65 TFLOPS (FP16).
  • Typical training utilization: 30–45%.
  • Four abstraction layers between intent and silicon.

Flaw 02

Compilers freeze decisions too early

Compilers optimize execution once, offline. Middleware introduces coordination overhead and idle gaps. AI-generated code inherits the same static execution limits. Execution cannot adapt once workloads are running — even when real runtime conditions have changed.

Evidence: Compiler overhead in a standard C++ build pipeline can account for 15–40% of runtime inefficiency through missed optimization opportunities at abstraction boundaries.

Flaw 03

Memory is the bottleneck — not compute

GPUs spend much of their time waiting — on memory, synchronization, orchestration layers, or execution paths that no longer match real runtime conditions. As memory becomes the bottleneck, execution efficiency becomes the advantage. Buying more hardware doesn't fix an architecture problem.

Evidence: The data center spend trajectory is more GPUs, larger clusters, and more power. What nobody says: fix the software.

Flaw 04

AI accelerates the problem, not the fix

AI-generated code ships faster than any team can audit it — and inherits the same static execution limits as hand-written code. AI agents decide what to do. They can't deterministically restructure how work runs on the hardware below them.

Evidence: AI code output velocity now exceeds review capacity by orders of magnitude. The performance ceiling is the abstraction stack, not the model.

The core issue is not inefficiency at the margins. It is that software locks execution decisions too early, while GPUs are inherently dynamic systems.

See also: The chip is a factory of factories

How Chameleon works

Describe what you want. Chameleon handles the rest.

Chameleon replaces static execution models with Composite Job Designs — runtime definitions of what the work is, not how to do it. Execution is designed at runtime, not frozen at compile time. Optimization occurs continuously, not once.

Step 01 · Input

Provide intent

Describe the behavior in natural language, or submit existing GLSL. No refactoring. No framework lock-in. No CUDA, ROCm, oneAPI, or PyTorch dependency.

What you give us:
Natural language, kernels, or a workload harness with a stable execution boundary.

Step 02 · Generate

Generate optimized execution

Chameleon analyzes the input and generates optimized SPIR-V for the target hardware as a Composite Job Design. Continuous adaptation based on observed hardware behavior — not compiler assumptions.

What happens inside:
Instruction-level realization, kernel fusion, scheduling and synchronization rebalancing, memory-access reshaping.
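
For readers who want to inspect a generated artifact: a SPIR-V module is a stream of 32-bit words that starts with a five-word header (magic number, version, generator, ID bound, schema). The sketch below reads that header. It assumes the artifact is a standard SPIR-V binary, which this page does not confirm; the function name `read_spirv_header` and the sample bytes are illustrative and not part of Chameleon.

```python
import struct

SPIRV_MAGIC = 0x07230203  # first word of every SPIR-V module


def read_spirv_header(blob: bytes) -> dict:
    """Parse the five-word SPIR-V header, accepting either byte order."""
    if len(blob) < 20:
        raise ValueError("too short to hold a SPIR-V header")
    for fmt in ("<5I", ">5I"):  # try little-endian first, then big-endian
        magic, version, generator, bound, schema = struct.unpack(fmt, blob[:20])
        if magic == SPIRV_MAGIC:
            return {
                # version word layout: 0 | major | minor | 0 (one byte each)
                "version": f"{(version >> 16) & 0xFF}.{(version >> 8) & 0xFF}",
                "generator": generator,
                "id_bound": bound,
                "schema": schema,
            }
    raise ValueError("not a SPIR-V module (bad magic number)")


# Hypothetical header for a SPIR-V 1.5 module, built here only for demonstration.
sample = struct.pack("<5I", SPIRV_MAGIC, 0x00010500, 0, 8, 0)
print(read_spirv_header(sample))
```

A check like this only confirms that a file is shaped like SPIR-V; it says nothing about how the module was optimized.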

Step 03 · Execute

Integrate and run

Drop the result into your application to optimize workloads. Runs across NVIDIA, AMD, Intel, and other accelerators via the SPIR-V execution path. Same intent executes across devices with no recompilation or retraining.

What you get back:
Execution-ready artifacts, packaging guidance, validation notes. Drop-in, no rewrite.

Validated results

The numbers aren't projections. They're measured.

Independently validated across Amazon Web Services, Oracle Cloud Infrastructure, Google Cloud, and the Rowan University Digital Engineering Hub (RU DEHub). One executable ran on three clouds: Nvidia Tesla T4, A10G, A100-SXM4-40GB, and A10 GPUs on AWS, OCI, and GCP, plus AMD Radeon 8060S integrated graphics at RU DEHub. Zero changes to customer workloads. Every point below is a measured result.

Validated on
  • AWS
  • Oracle Cloud
  • Google Cloud
  • NVIDIA
  • AMD
  • Rowan University
  • 30+ Linux distros

20–114× faster runtimes

Sustained across Nvidia Tesla T4, A10G, A10, A100, and AMD Radeon 8060S instances. Portable across GPU types and vendors with identical intent and no recompilation.

90–99% less energy per workload

Gains come from eliminating wasted execution and memory pressure, not from increasing raw compute. A direct translation into sustainability wins.

90–97% GPU utilization sustained

Measured as a share of theoretical peak during live composite workloads. Well above the 10–35% industry baseline on the same hardware.

Results · 10 workloads

10 featured workloads · Rowan numbers from the signed Phase 1 evaluation

SCE Lightwires | 114× | Rowan University | Light & energy propagation field
Chip time: 125.5 ms → 1.1 ms
Power draw: 63 W → 26 W
Energy reduction: 99.6%
Platform: Rowan University Digital Engineering Hub · AMD Radeon 8060S · Ubuntu 24.04

Light and energy propagation across a field — stresses memory traversal and sustained-load stability. Chip time dropped from 125.5 ms to 1.1 ms (114× faster) on an integrated AMD Radeon 8060S edge GPU. Represents the upper end of the 20–114× range Rowan reported in its Phase 1 evaluation.
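
The headline figures in a row like this follow from its chip time and power draw. A minimal sketch, assuming energy per frame is average power multiplied by chip time (the page does not state how energy was measured, so both the formula and the function names are illustrative assumptions):

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of baseline chip time to optimized chip time."""
    return before_ms / after_ms


def energy_reduction_pct(before_ms: float, before_w: float,
                         after_ms: float, after_w: float) -> float:
    """Percent drop in energy per frame, assuming energy = power x time."""
    before_mj = before_w * before_ms  # watts x milliseconds = millijoules
    after_mj = after_w * after_ms
    return 100.0 * (1.0 - after_mj / before_mj)


# SCE Lightwires row: 125.5 ms -> 1.1 ms, 63 W -> 26 W
print(f"{speedup(125.5, 1.1):.0f}x faster")
print(f"{energy_reduction_pct(125.5, 63, 1.1, 26):.1f}% less energy per frame")
```

Under that assumption the sketch reproduces the 114× and 99.6% figures for this row. The same two functions can be applied to any other row in the list.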

SCE Cell Sym | 109× | RU DEHub | Cellular composition & symmetry
Chip time: 109.1 ms → 1.0 ms
Power draw: 60 W → 23 W
Energy reduction: 99.6%
Platform: RU DEHub · AMD Radeon 8060S · Ubuntu 24.04

Composite scene workload mixing geometry, lighting, and math. Chip time fell from 109.1 ms to 1.0 ms (109× faster) while power draw dropped 62%. Energy-per-frame reduction of 99.6% on the same integrated edge GPU.

SCE Rust Mandala | 108× | RU DEHub | Recursive fractal geometry
Chip time: 107.8 ms → 1.0 ms
Power draw: 59 W → 22 W
Energy reduction: 99.7%
Platform: RU DEHub · AMD Radeon 8060S · Ubuntu 24.04

Recursive mathematical pattern with very high arithmetic density. Chip time dropped from 107.8 ms to 1.0 ms (108× faster), demonstrating that compute-bound workloads benefit as much as memory-bound ones from execution-design restructuring.

SCE_SDF | 84× | RU DEHub | Signed distance field lighting · AMD
Chip time: 67.5 ms → 0.8 ms
Power draw: 72 W → 18 W
Energy reduction: 99.7%
Platform: RU DEHub · AMD Radeon 8060S · Ubuntu 24.04

Signed distance field lighting — the same workload appears on all four platforms in the chart above and now in the bench rows below. Here on AMD integrated graphics: 67.5 ms to 0.8 ms (84× faster), power down 75%. Compare to AWS Nvidia Tesla T4 (54×), GCP Nvidia Tesla T4 (49×), and OCI Nvidia A10 (41×) below — same executable, same intent, four different hardware-and-cloud combinations.

Red Nebula | 65× | RU DEHub | Volumetric color blending & fade
Chip time: 150.0 ms → 2.3 ms
Power draw: 67 W → 39 W
Energy reduction: 99.1%
Platform: RU DEHub · AMD Radeon 8060S · Ubuntu 24.04

Volumetric field with dense 2D data arrays — memory-bound, representative of simulation and scientific grids. Chip time fell from 150.0 ms to 2.3 ms (65× faster) with 99.1% less energy per frame.

SCE_SDF | 54× | AWS · T4 / A10G / A100 | Same workload on Nvidia discrete GPUs
Chip time: 375 ms → 7 ms
Power draw: 63 W → 60 W (72–46 W range)
Energy reduction: 98.2%
Utilization: 100% → 50%
Platform: AWS · Nvidia Tesla T4 · Ubuntu 24.04

Same SCE_SDF executable that hit 84× on AMD hardware at Rowan runs here on Nvidia Tesla T4 at AWS — 54× speedup, 98% less energy per frame, utilization cut in half. No recompile, no framework swap, no vendor-specific SDK. Performance-per-watt gains above 25×.
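
Performance per watt can be sketched from the same row. Assuming it is defined as frames per second divided by average power draw (a common definition, but an assumption here, as are the function names), the gain is the speedup scaled by the ratio of the two power draws:

```python
def perf_per_watt(chip_time_ms: float, power_w: float) -> float:
    """Frames per second per watt for a workload taking chip_time_ms per frame."""
    frames_per_second = 1000.0 / chip_time_ms
    return frames_per_second / power_w


def perf_per_watt_gain(before_ms: float, before_w: float,
                       after_ms: float, after_w: float) -> float:
    """How many times more work per watt the optimized run delivers."""
    return perf_per_watt(after_ms, after_w) / perf_per_watt(before_ms, before_w)


# SCE_SDF on AWS T4: 375 ms -> 7 ms, 63 W -> 60 W
gain = perf_per_watt_gain(375, 63, 7, 60)
print(f"{gain:.2f}x performance per watt")
```

With the averages listed in this row the sketch gives roughly 56×, which is consistent with the stated "above 25×". The page does not say how its own figure was derived, so the two numbers are not directly comparable.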

SCE_SDF | 49× | GCP · Tesla T4 | Same workload, Google Cloud hardware
Chip time: 469 ms → 9.5 ms
Power draw: 68 W → 64 W
Energy reduction: 98.1%
Platform: GCP · Nvidia Tesla T4 · Ubuntu 24.04

The same SCE_SDF binary on Google Cloud's Nvidia Tesla T4: 469 ms to 9.5 ms (49× faster), 98% less energy per frame. The T4 is the same chip family as the AWS row above; the convergence shape is identical across hyperscalers, isolating the speedup to the substrate rather than to any cloud-specific tuning.

SCE_SDF | 41× | OCI · Nvidia A10 | Same workload, Oracle Cloud hardware
Chip time: 190.3 ms → 4.6 ms
Power draw: 150 W → 125 W
Energy reduction: 98.0%
Utilization: 100% → 44%
Platform: OCI · Nvidia A10 · Ubuntu 24.04

The fourth platform for the same SCE_SDF workload — OCI Nvidia A10: 190.3 ms to 4.6 ms (41× faster), 98% energy reduction. Convergence behavior remains stable across AMD integrated (Rowan), Nvidia T4 (AWS), Nvidia T4 (GCP), and Nvidia A10 (OCI) — confirming that Chameleon's execution-design pipeline delivers consistent cross-vendor, cross-cloud optimization independent of hardware vendor, driver stack, or hyperscaler environment.

Mandelbrot | 49× | AWS · Tesla T4 | Arithmetic-intensive compute
Chip time: 146 ms → 3 ms
Power draw: 57 W → 49 W
Energy reduction: 98.2%
Utilization: 93% → 37%
Platform: AWS · Nvidia Tesla T4 · Ubuntu 24.04

Compute-heavy workload that stresses arithmetic intensity more than memory bandwidth. Chip time fell from 146 ms to 3 ms (≈49× faster) with ≈98% less energy per frame. Utilization dropped from 93% to 37%: the same hardware finishing the same work in a fraction of the time, with a fraction of the duty cycle.

Mandelbrot | 32× | GCP · Tesla T4 | Same compute workload, Google Cloud hardware
Chip time: 151.6 ms → 4.7 ms
Power draw: 67 W → 38 W
Energy reduction: 98.7%
Platform: GCP · Nvidia Tesla T4 · Ubuntu 24.04

The same arithmetic-intensive Mandelbrot kernel that ran at 49× on AWS T4 lands at 32× on GCP T4 — same chip family, same workload, same executable, two different cloud operators. The 49×/32× spread is normal hyperscaler variance from network, scheduler, and host-VM differences, not from anything Chameleon does. Energy reduction stays at 98–99% in both.

What we guarantee

Process, not multipliers.

Chameleon guarantees systematic, real-time exploration of the viable execution space available on the underlying hardware — within declared intent and constraints. This is what expert performance engineers do over days or weeks, performed continuously, automatically, and at hardware speed.

A specific multiplier (20×, 35×, 55×) is not guaranteed for every workload. Results depend on memory bandwidth, compute-vs-memory balance, control-flow irregularity, and tensor topology. The numbers above are what we've measured. Your numbers come from your workload on your hardware, with a repro guide for independent verification.

See it in action

Chameleon, optimizing in real time.

Two short demos of Chameleon's instruction-regeneration pipeline running on live GPU hardware. Each video plays inline — we don't load anything from YouTube until you press play.

Video #1 · Real-time workload optimization

Watch how Chameleon dynamically re-synthesizes GPU instructions as workloads run — no recompile, no code changes.

Video #2 · Breakthrough across heterogeneous compute

See why Chameleon is a breakthrough across GPUs, CPUs, TPUs, and heterogeneous compute.

Where Chameleon runs

Public cloud. Data center. Silicon. Edge. Everywhere.

Chameleon is environment-agnostic. Linux is fully supported today across 30+ validated distributions and designed for 50+. Public cloud, data center, edge, and heterogeneous silicon are all on the map.

Supported today
On the roadmap
01 · Cloud

Public cloud

  • AWS
  • Oracle Cloud (OCI)
  • Google Cloud
  • Microsoft Azure
02 · Infrastructure

Data center

  • Kubernetes
  • Docker
  • VMware
  • Red Hat
03 · Hardware

Silicon & accelerators

  • NVIDIA
  • AMD
  • Intel
  • Arm · Apple · Broadcom
  • Qualcomm · Imagination
04 · Environments

Edge & mission

  • ISS / Space edge
  • Satellite / austere
  • 5G / MEC
  • Rugged / tactical
05 · Platforms

Operating systems

  • Ubuntu 24.04 (certified)
  • Debian · Fedora · Arch · SUSE
  • 30+ Linux distros
  • Windows · macOS · Android · iOS
Pilot program

Three ways to evaluate.

Pick the path that matches your environment. Cloud pilots are the fastest way to hands-on results. On-prem suits secure networks, labs, and data centers. Hosted pilots give you a guided demo with no setup required.

Pilot scope

The current pilot runs a set of built-in workloads and measures performance and power on a single GPU at a time. Phase 1 is a validation path — not a production deployment.

Multi-GPU optimization is planned for Q3 2026. Over 60 composite workloads have been optimized to date across AI, media, simulation, and signal processing.

What we need to scope your run

Target hardware: GPU model and driver version.
OS: Linux distribution and version.
Objective: performance, energy, throughput, latency, or policy constraints.

Share those three things and we'll confirm the fastest pilot path — usually within a day.
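
Those three items can be collected with a short script. The sketch below is one hypothetical way to do it, not an official intake format: it reads the distribution from `/etc/os-release` and asks `nvidia-smi` for GPU model and driver version. It covers NVIDIA GPUs only (AMD or Intel hardware would need a different tool), and the `pilot_summary` name and the `"energy"` objective string are made up for illustration.

```python
import platform
import subprocess


def describe_os() -> str:
    """Return the Linux distribution name and version, if available."""
    try:
        info = platform.freedesktop_os_release()  # reads /etc/os-release (Python 3.10+)
        return info.get("PRETTY_NAME", info.get("NAME", "unknown"))
    except (OSError, AttributeError):
        # Not a freedesktop-style system, or an older Python: fall back to the kernel.
        return f"{platform.system()} {platform.release()}"


def describe_nvidia_gpus() -> list[str]:
    """Return 'model, driver version' per NVIDIA GPU, or [] if nvidia-smi is unavailable."""
    cmd = ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return []
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]


def pilot_summary(objective: str) -> dict:
    """Bundle the three scoping items into one dictionary."""
    return {"gpus": describe_nvidia_gpus(), "os": describe_os(), "objective": objective}


print(pilot_summary("energy"))
```

On a machine without `nvidia-smi` the GPU list comes back empty, so the model and driver version would have to be filled in by hand.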

Looking for the technical detail? Our Wantware whitepaper covers the execution-layer thesis, infrastructure economics, validated results, and pilot structure in 10 pages — written for hyperscaler and datacenter teams.

Ready to start

Same GPUs. Dramatically more work.

Teams spending $100K/month on GPUs can often reduce costs to $10K–$25K/month while increasing throughput. No refactoring. No framework lock-in. No code changes.

Pilot binaries are for evaluation only, not production deployment.