
The blueprint problem
Getting an AI agent running on your laptop is one thing. Getting it running on your own infrastructure - securely, repeatably, at scale - is something else entirely.
It is the difference between sketching a house on a napkin and actually building one. The sketch gives you the shape. It tells you where the rooms go. But when you start laying bricks, you discover that the sketch left out the plumbing, the electrical, and the fact that the lot slopes fifteen degrees.
My team needed to deploy LangSmith Hybrid on AWS. LangSmith Hybrid splits the work: LangChain manages the control plane - the UI, tracing, deployment management - while the data plane runs in your own Kubernetes cluster. Your agent pods, your Redis, your databases. Their orchestration.
We started with a sketch. It was a good sketch. But when we tried to build from it, we learned a few things.
The first draft
LangChain provides official Terraform modules for standing up the infrastructure: VPC, EKS, PostgreSQL, Redis, S3. Five modules, five resources. Clean and organized.
We cloned the repo, filled in our variables, and ran terraform apply. It worked - mostly. The Terraform modules taught us the shape of the infrastructure. We could see what we needed: subnets here, a managed node group there. That understanding was invaluable.
But as we iterated, the cracks started to show.
LangChain's modules used different variable names than the standard Terraform AWS modules. instance_class in one place, instance_type in another. Small differences, but they meant inspecting every module's variables.tf. The root-level Kubernetes and Helm providers created a circular dependency with the EKS module, because the EKS module manages its own providers internally. You have to know to let it.
Failed deployments left behind KMS aliases, CloudWatch log groups, RDS subnet groups, ElastiCache subnet groups - orphaned resources that blocked fresh runs. We wrote shell scripts to manage targeted applies and dependency-ordered destroys. deploy.sh, list-aws-resources.sh, destroy-aws-resources.sh. The scripts worked, but maintaining them felt like building scaffolding for our scaffolding.
And then there was the clock. EKS deployments take 15-20 minutes. Our AWS SSO sessions frequently expired mid-deploy. You would kick off a run, go get coffee, come back, and find a timeout error staring at you from the terminal.
None of this made Terraform the wrong tool. It was the right tool for learning what we needed. But we were no longer learning. We were building. And for building, we needed something with more grip.
The turning point
We moved to Pulumi - specifically, Pulumi with Python. The motivation was simple: we wanted our infrastructure code to live in the same language and the same monorepo as the rest of our tooling. No more context-switching between HCL and Python. No more duplicating values between dev.tfvars and Python config files.
The first thing we did was extract all of those scattered Terraform variables into a single typed dataclass in config.py:
```python
from dataclasses import dataclass

import pulumi


@dataclass(frozen=True)
class LangSmithConfig:
    environment: str
    vpc_cidr: str
    eks_cluster_name: str
    eks_cluster_version: str
    eks_node_instance_type: str
    postgres_instance_class: str
    redis_node_type: str
    s3_bucket_prefix: str
    langsmith_api_key: pulumi.Output[str]
    langsmith_workspace_id: str
    # ... and more
```
Most fields have a sensible default defined as a module-level constant:
```python
_VPC_CIDR = "10.0.0.0/16"
_EKS_CLUSTER_VERSION = "1.31"
_EKS_NODE_INSTANCE_TYPE = "m5.xlarge"
_POSTGRES_INSTANCE_CLASS = "db.t3.medium"
_REDIS_NODE_TYPE = "cache.t3.micro"
```
The stack config file (Pulumi.dev.yaml) needs only five keys: environment, eksClusterName, s3BucketPrefix, langsmithWorkspaceId, and langsmithApiKey. Everything else uses the Python defaults unless you explicitly override it. One source of truth.
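The merge itself is ordinary Python. Here is a rough sketch of the pattern under stated assumptions: the names StackConfig and load_config are illustrative, only a few fields are shown, and the real code reads its values through Pulumi's config API rather than a plain dict.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class StackConfig:
    # Required per-stack values: these are what actually vary.
    environment: str
    eks_cluster_name: str
    # Everything else defaults in code, not in YAML.
    vpc_cidr: str = "10.0.0.0/16"
    eks_cluster_version: str = "1.31"
    eks_node_instance_type: str = "m5.xlarge"


def load_config(stack_values: dict) -> StackConfig:
    """Merge sparse per-stack values over the typed defaults.

    Rejecting unknown keys catches typos in the YAML early, instead of
    silently ignoring an override that never takes effect.
    """
    known = {f.name for f in fields(StackConfig)}
    unknown = set(stack_values) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return StackConfig(**stack_values)
```

A stack file that sets only the required keys gets every default; any key it does set wins.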
But the real win was not configuration. It was dependencies.
The listener problem
LangSmith Hybrid requires a listener - a registration with the control plane API that tells LangChain "this cluster exists, send deployments here." In the Terraform era, managing this listener was a separate step. We ran a standalone Python script, manage_listeners.py, outside of any infrastructure pipeline. If you forgot to run it, or ran it against the wrong environment, things drifted.
In Pulumi, we turned the listener into a dynamic resource - a first-class citizen in the infrastructure graph:
```python
import pulumi
from pulumi import dynamic


class LangSmithListener(dynamic.Resource):
    listener_id: pulumi.Output[str]

    def __init__(self, name, *, api_key, workspace_id,
                 compute_id, namespaces, opts=None):
        super().__init__(
            _ListenerProvider(), name,
            {"api_key": api_key, "workspace_id": workspace_id,
             "compute_id": compute_id, "namespaces": namespaces,
             "listener_id": None},
            opts,
        )
```
The provider behind it handles the full lifecycle. On pulumi up, it calls the LangSmith API to create the listener (or adopts an existing one). On pulumi destroy, it deletes it. On refresh, it verifies the listener still exists. No more manual scripts. No more drift.
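The shape of that provider can be sketched without Pulumi itself. The class below mirrors the create/read/delete hooks a Pulumi dynamic resource provider implements; the ListenerLifecycle name, the endpoint paths, and the injected http callable are all illustrative, not the real LangSmith API.

```python
class ListenerLifecycle:
    """Sketch of the lifecycle behind a dynamic resource provider.

    `http` is an injected callable (method, path, payload) -> dict | None,
    so the logic stays testable without network access.
    """

    def __init__(self, http):
        self._http = http

    def create(self, props: dict) -> str:
        """pulumi up: register the listener, or adopt an existing one."""
        existing = self._http(
            "GET", f"/listeners?compute_id={props['compute_id']}", None
        )
        if existing:
            return existing["id"]  # adopt rather than duplicate
        created = self._http("POST", "/listeners", props)
        return created["id"]

    def read(self, listener_id: str) -> bool:
        """pulumi refresh: verify the listener still exists."""
        return self._http("GET", f"/listeners/{listener_id}", None) is not None

    def delete(self, listener_id: str) -> None:
        """pulumi destroy: remove the registration."""
        self._http("DELETE", f"/listeners/{listener_id}", None)
```

Because the HTTP call is injected, the adoption logic - reuse an existing listener instead of registering a duplicate - can be unit-tested against a fake control plane.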
The data plane Helm chart then references the listener's output directly:
```python
import pulumi
import pulumi_kubernetes as k8s

k8s.helm.v3.Release(
    f"{cluster_name}-dataplane",
    name="dataplane",
    chart="langgraph-dataplane",
    values={
        "config": {
            "langsmithApiKey": langsmith_api_key,
            "langgraphListenerId": listener.listener_id,
        },
    },
    opts=pulumi.ResourceOptions(depends_on=[keda, listener]),
)
```
One command. Everything in order. Everything connected.
Stale images
With the infrastructure stable, we turned to deploying actual agents. This is where Kubernetes image caching caught us off guard.
With a static tag like hello-world-graph:0.1.0, the default imagePullPolicy: IfNotPresent meant nodes would serve a cached image instead of pulling the latest build. The deployment would "succeed," the pod would be Running, but it was running old code.
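The trap is easiest to see in the pod spec itself. The field names below are standard Kubernetes; the registry and image name are placeholders:

```yaml
spec:
  containers:
    - name: agent
      # Same tag as last week's build - the node has no reason to re-pull.
      image: registry.example.com/hello-world-graph:0.1.0
      imagePullPolicy: IfNotPresent  # Kubernetes default for non-:latest tags
```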
The solution was appending the Git SHA to every image tag. We built this into langsmith-client, a CLI package that wraps the build-push-deploy cycle:
```python
import subprocess


def _get_git_sha() -> str | None:
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, timeout=5,
        )
        if result.returncode == 0:
            return result.stdout.strip()
    except Exception:
        pass
    return None
```
Now every build produces a tag like hello-world-graph:0.1.0-ff95149. Unique tags force a fresh pull on every deployment.
Greater context
The fixes above are specific to our stack. But the principles behind them apply to any complex deployment.
Defaults belong in code, not config files
YAML configuration files should contain only what actually varies between environments. Everything else - the VPC CIDR, the node instance type, the Postgres engine version - belongs in typed constants with clear names. When someone new joins the project, they can read the config file and immediately understand what makes this environment different - not what makes it the same as every other one.
Automate lifecycle, not just provisioning
Infrastructure tools are great at creating things. They are less great at managing the lifecycle of things that live outside the infrastructure graph - API registrations, firewall rules, image tags. Every manual step you leave outside your pipeline is a step that will eventually drift.
The listener was the most dramatic example. But we applied the same principle to image tagging (automated via the Git SHA) and to external API firewall rules (automated via a separate Pulumi package that accepts NAT gateway IPs as input).
Conclusion
Our Terraform files were a good first draft. They taught us what we were building. Pulumi let us build it with the structure and dependency management the project needed as it matured.
If you are standing up LangSmith Hybrid - or any complex Kubernetes deployment - and you find yourself writing shell scripts to manage your infrastructure tool, that is a signal. Not that your tool is wrong, but that your project has outgrown its scaffolding.
Graduate from the napkin sketch. Build the house.
