Project Bluebear is a conversational Gen AI solution that lets business users analyze datasets ranging from megabytes to petabytes using Amazon Bedrock Agents and Apache Spark. Submit natural language queries and get PySpark code that is automatically generated, validated, and executed.
Two execution backends work together:
- Spark on AWS Lambda (SoAL) -- lightweight, real-time processing for datasets up to 500 MB with sub-second validation latency.
- Amazon EMR Serverless -- scalable execution for larger datasets (MBs to PBs) with production-grade reliability.
Natural language is the interface. No ETL frameworks, no deployment pipelines -- just ask a question and get results.
- Authenticate -- The user authenticates via Amazon Cognito and receives a JWT token with the `spark-api/spark.execute` scope.
- Natural Language Prompt -- The user asks a question via the React UI: "Show me total sales by region over the last 12 months."
- Gateway Authorization -- The AgentCore Gateway validates the JWT and evaluates Cedar authorization policies to confirm the caller is permitted to invoke the agent and access the requested data sources.
- Bedrock Code Generation -- Amazon Bedrock (Claude) generates a PySpark script based on the prompt, dataset schema, and historical context.
- Cedar Policy Check -- Before execution, Cedar policies verify the generated code does not contain destructive operations (DROP, DELETE, TRUNCATE) and that the target S3 buckets are project-tagged.
- Fast Validation Loop -- Generated code runs on SoAL to validate syntax and logic (~550 ms). Errors are fed back to the model for repair, iterating until success.
- Production Execution -- Once validated, the same PySpark script executes on EMR Serverless against the full dataset. The agent authenticates to downstream services (S3, Glue, Lambda, EMR) using scoped IAM execution roles -- no long-lived credentials.
- Results & Visualization -- Results are returned to the React UI as tables and charts.
- Audit & Monitoring -- Every step is captured by CloudTrail (agent invocations, model calls, data access). EventBridge rules fire alerts on failures, throttling, or unauthorized access attempts.
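The generate → validate → repair cycle in the steps above can be sketched as a small loop. This is an illustrative sketch, not the repo's actual implementation: the `generate` and `run` callables stand in for the Bedrock code-generation call and the Spark-on-Lambda execution.

```python
def validate_with_repair(generate, run, prompt, schema, max_attempts=3):
    """Sketch of the SoAL fast-validation loop: generate PySpark for the
    prompt, run it on sample data, and feed any error back for repair.

    generate(prompt, schema, previous_code=..., error=...) -> code string
    run(code) -> (ok: bool, error: str | None)
    Both callables are hypothetical stand-ins for the Bedrock and SoAL calls.
    """
    code = generate(prompt, schema, previous_code=None, error=None)
    for _ in range(max_attempts):
        ok, error = run(code)
        if ok:
            return code  # validated -- eligible for the EMR Serverless run
        # Repair pass: the model sees its own failing code plus the error.
        code = generate(prompt, schema, previous_code=code, error=error)
    raise RuntimeError(f"validation failed after {max_attempts} repair attempts")
```

Only code that exits this loop successfully moves on to production execution on the full dataset.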
```
┌─────────────────────────────────────────────┐
│ Monitoring & Audit │
│ CloudTrail ─── EventBridge ─── SNS/Ops │
└──────────────────┬──────────────────────────┘
│ logs every call
┌──────────┐ JWT ┌──────────────────▼──────────────────┐
│ Cognito │──────────▶│ AgentCore Gateway │
│ User Pool │ (scope: │ ┌────────────────────────────┐ │
└──────────┘ spark. │ │ Cedar Authorization │ │
execute) │ │ Policies │ │
│ │ • Session-scoped access │ │
│ │ • Deny destructive SQL │ │
│ │ • S3 bucket tag checks │ │
│ │ • Validated-code-only EMR │ │
│ └────────────────────────────┘ │
└──────────────────┬──────────────────┘
│
┌──────────────────▼──────────────────┐
│ Spark Supervisor Agent │
│ (IAM execution role) │
└──┬────────┬────────┬────────┬───────┘
│ │ │ │
┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌─▼────┐
│ S3 │ │Glue │ │ SoAL│ │ EMR │
│(IAM)│ │(IAM)│ │(IAM)│ │(IAM) │
└─────┘ └─────┘ └─────┘ └──────┘
Scoped to bucket Scoped Scoped to
+ prefix to DB application
```
| Component | Purpose | Details |
|---|---|---|
| Amazon Cognito | Inbound authentication | User pool with JWT tokens and spark-api/spark.execute OAuth scope. Rejects unauthenticated requests at the gateway. |
| Cedar Policies | Agent authorization | Declarative rules evaluated before tool execution: session scoping, destructive-operation denial, S3 tag checks, validated-code gating. |
| AgentCore Runtime | Agent + tool hosting | Runs the Spark orchestrator agent and its tools inside AgentCore with scoped IAM execution roles. |
| Spark Orchestrator Agent | End-to-end workflow | Orchestrates: read data, generate PySpark, execute via Spark-code-interpreter, format results. Authenticates to downstream services via IAM roles. |
| Data Read Tool | Dataset access | Reads from S3 / Glue catalog (extensible to Snowflake, Databricks via MCP). IAM-scoped to tagged resources. |
| Code Generation Tool | PySpark generation | Generates or refines PySpark based on user request and schema metadata. |
| Spark-code-interpreter Tool | Code validation | Interprets generated code, iteratively fixes errors. Cedar policies block destructive operations. |
| Result Generation Tool | User-friendly output | Aggregates Spark results into tables, charts, and natural-language summaries. |
| User Interface (React + FastAPI) | Frontend & API | React frontend and FastAPI backend that collect queries, validate JWT, invoke AgentCore, and render results. |
| CloudTrail | Audit trail | Captures all API calls across AgentCore, Bedrock, Lambda, EMR, S3, and Secrets Manager. Insights detect anomalies. |
| EventBridge | Real-time alerts | Rules trigger on agent failures, EMR job failures, Lambda throttling, and unauthorized access. Routes to SNS for ops notifications. |
| AWS Services | Infrastructure | S3, EMR Serverless, Lambda, CloudWatch, VPC for storage, compute, and network isolation. |
- Natural language to PySpark code generation using Amazon Bedrock Claude
- Iterative validation loop (SoAL) -- error detection, model repair, re-validation
- Dual execution backends: SoAL (fast, <500 MB) + EMR Serverless (scalable, MBs to PBs)
- React + FastAPI UI with Cloudscape Design components:
  - Glue Data Catalog browsing and table selection
  - PostgreSQL connection management and table selection
  - CSV upload to S3
  - Code editor with syntax highlighting (Monaco)
  - Tabular result visualization
- Security & governance:
  - Inbound authentication via Amazon Cognito JWT with custom OAuth scopes (`spark-api/spark.execute`)
  - Cedar authorization policies on AgentCore -- session scoping, destructive-operation denial, S3 tag checks, validated-code gating for EMR
  - Least-privilege IAM execution roles per agent and downstream service (S3, Glue, Lambda, EMR)
  - Outbound auth to all downstream services via short-lived IAM role credentials -- no stored secrets
  - VPC-enabled execution with private S3 access
  - Code review in UI before execution; Cedar blocks DROP/DELETE/TRUNCATE at the agent level
- Observability & alerting:
  - CloudTrail audit trail across AgentCore, Bedrock, Lambda, EMR, S3, and Secrets Manager
  - CloudTrail Insights for anomaly detection (API call rate and error rate spikes)
  - EventBridge rules for real-time alerts: agent failures, EMR job failures, Lambda throttling, unauthorized access
  - SNS notifications to ops teams on critical events
- Cost-effective: pay only for compute time; reuse PySpark scripts across SoAL, EMR, and Glue
Use SoAL when:
- Dataset size < 500 MB
- Need < 1 second latency (iterative code validation)
- Ad-hoc or development queries
Use EMR Serverless when:
- Dataset size > 500 MB up to PBs
- Complex multi-step Spark jobs (joins, aggregations)
- Production analytics with SLA requirements
This solution uses SoAL for validation and EMR Serverless for production execution, so code never needs to be rewritten for scale.
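The routing rule above can be expressed as a small helper. This is a sketch built from the documented thresholds, not the repo's actual code:

```python
SOAL_LIMIT_BYTES = 500 * 1024 * 1024  # 500 MB threshold from the guidance above

def choose_backend(dataset_bytes: int, validated: bool = False) -> str:
    """Pick an execution backend: SoAL for small or iterative work,
    EMR Serverless for scale. Also mirrors the Cedar rule that only
    validated code may run on EMR."""
    if dataset_bytes < SOAL_LIMIT_BYTES:
        return "soal"
    if not validated:
        raise PermissionError("EMR Serverless accepts validated code only")
    return "emr-serverless"
```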
- AWS account with access to:
  - Amazon Bedrock (Claude model access, AgentCore)
  - AWS Lambda (functions, ECR container images)
  - Amazon EMR Serverless (applications, job execution)
  - Amazon S3 (data buckets)
  - AWS IAM (roles, policies)
  - AWS CloudFormation (stack management)
  - Amazon VPC (optional, for private connectivity)
  - Amazon Cognito (JWT authentication)
- AWS CLI v2 configured with appropriate credentials
- Docker with buildx support
- Python 3.11+
- Node.js 18+ and npm
- bedrock-agentcore-starter-toolkit (`pip install bedrock-agentcore-starter-toolkit`)
Deploy in a region supporting Bedrock + Lambda + EMR Serverless (e.g., us-east-1, us-west-2).
```bash
git clone https://github.com/aws-samples/spark-code-interpreter.git
cd spark-code-interpreter
```

Ensure AWS credentials are configured:
```bash
aws configure
# or
export AWS_PROFILE=your-profile-name
```

Run the deployment script:

```bash
./scripts/deploy-all.sh
```

This deploys everything: Bedrock agents, the Spark Lambda Docker image, and the CloudFormation stack. See DEPLOYMENT_GUIDE.md for step-by-step details.
```bash
./start-ui.sh
```

This starts the FastAPI backend on port 8000 and the React frontend on port 3000.
Or start them manually:
```bash
# Terminal 1: Backend
cd backend && ./run.sh

# Terminal 2: Frontend
cd frontend && npm install && npm run dev
```

```bash
# Via test script
./scripts/test-calculation.sh "what is 7*10"

# Or via the UI at http://localhost:3000
```
```
├── frontend/              # React + Cloudscape UI
├── backend/               # FastAPI backend
├── agent-code/            # Bedrock AgentCore agents
│   ├── spark-supervisor-agent/
│   └── code-generation-agent/
├── agent-wrapper/         # Wrapper Lambda code
├── cloudformation/        # Infrastructure templates
├── Docker/                # Spark Lambda Docker image
├── scripts/               # Deployment & test scripts
├── config/                # Configuration files
├── images/                # Architecture diagrams
├── start-ui.sh            # Launch both frontend + backend
├── README.md              # This file
└── DEPLOYMENT_GUIDE.md    # Detailed deployment instructions
```
Configuration is layered: config/deployment-config.json (agent ARNs from deploy) → backend/settings.json (runtime overrides) → environment variables (container overrides).
| Component | Default Value |
|---|---|
| Bedrock Model | us.anthropic.claude-sonnet-4-5-20250929-v1:0 |
| Wrapper Lambda Timeout | 900s (15 min) |
| Spark Lambda Timeout | 300s (5 min) |
| S3 Structure | s3://spark-data-{account}-{region}/{session-id}/ |
| Frontend Port | 3000 |
| Backend Port | 8000 |
Environment variable overrides: SPARK_SUPERVISOR_ARN, CODE_GEN_AGENT_ARN, BEDROCK_MODEL, SPARK_S3_BUCKET.
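A minimal sketch of that precedence (deployment file → settings file → environment variable), assuming the JSON files are keyed by the same names as the environment variables — not the backend's actual loader:

```python
import json
import os

def load_setting(key, default=None,
                 deployment_file="config/deployment-config.json",
                 settings_file="backend/settings.json"):
    """Resolve a setting through the three layers: deployment config,
    then runtime settings, then environment variables. Later layers win."""
    value = default
    for path in (deployment_file, settings_file):
        try:
            with open(path) as f:
                value = json.load(f).get(key, value)
        except FileNotFoundError:
            pass  # a missing layer simply doesn't override anything
    return os.environ.get(key, value)  # env var (container override) wins
```

For example, `load_setting("BEDROCK_MODEL")` returns the container's `BEDROCK_MODEL` environment variable when set, otherwise the value from the settings files.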
"Analyze sales trends by region for Q4 2024.
Show total revenue, top 3 products, and year-over-year growth."
- Bedrock Claude generates PySpark code:

  ```python
  from pyspark.sql.functions import sum as sum_, desc

  df = spark.read.parquet("s3a://my-bucket/sales_data/*.parquet")
  df_q4 = df.filter((df.date >= "2024-10-01") & (df.date < "2025-01-01"))
  revenue_by_region = df_q4.groupBy("region").agg(sum_("sales"))
  top_products = df_q4.groupBy("product").agg(sum_("sales")).sort(desc("sum(sales)")).limit(3)
  revenue_by_region.show()
  top_products.show()
  ```
- SoAL Validation (on sample data): syntax valid, schema matches, executes in 520 ms.
- EMR Serverless Production Run: auto-scales, executes on the full dataset, returns results.
- React UI: displays code, tables, and a result summary.
Every component runs with the minimum permissions required. No shared admin roles.
Spark Lambda execution role -- scoped to the specific S3 bucket and prefix:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3DataAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::spark-data-${AccountId}-${Region}",
        "arn:aws:s3:::spark-data-${AccountId}-${Region}/*"
      ]
    }
  ]
}
```

AgentCore runtime role -- may only invoke specific agents and access specific models:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "InvokeAgents",
      "Effect": "Allow",
      "Action": ["bedrock-agentcore:InvokeAgentRuntime"],
      "Resource": [
        "arn:aws:bedrock-agentcore:${Region}:${AccountId}:runtime/spark_supervisor_agent-*",
        "arn:aws:bedrock-agentcore:${Region}:${AccountId}:runtime/code_generation_agent-*"
      ]
    },
    {
      "Sid": "InvokeModels",
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
      "Resource": "arn:aws:bedrock:${Region}::foundation-model/anthropic.claude-*"
    }
  ]
}
```

EMR Serverless execution role -- scoped to the specific application and S3 paths:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EMRJobExecution",
      "Effect": "Allow",
      "Action": [
        "emr-serverless:StartJobRun",
        "emr-serverless:GetJobRun",
        "emr-serverless:CancelJobRun"
      ],
      "Resource": "arn:aws:emr-serverless:${Region}:${AccountId}:/applications/${ApplicationId}/*"
    },
    {
      "Sid": "GlueCatalogRead",
      "Effect": "Allow",
      "Action": ["glue:GetDatabase*", "glue:GetTable*", "glue:GetPartition*"],
      "Resource": "*",
      "Condition": {
        "StringEquals": {"glue:ResourceTag/Project": "bluebear"}
      }
    }
  ]
}
```

All inbound requests to the AgentCore gateway are authenticated via Amazon Cognito JWT tokens with custom OAuth scopes.
Cognito configuration (defined in cloudformation/spark-complete-stack.yml):
- User Pool with password policies and MFA support
- Resource Server defining the `spark-api` scope namespace
- Custom scope `spark-api/spark.execute` -- only users granted this scope can invoke the agent
- App Client with client credentials grant for service-to-service calls
Request flow:
```
Client → Cognito (authenticate) → JWT token with spark-api/spark.execute scope
       → AgentCore Gateway (validates JWT, checks scope)
       → Spark Supervisor Agent
```
The FastAPI backend validates the JWT before forwarding to AgentCore. Unauthenticated requests are rejected at the gateway level.
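Once the token's signature and expiry have been verified against the Cognito JWKS, the scope check itself is small. A sketch, not the backend's actual code:

```python
REQUIRED_SCOPE = "spark-api/spark.execute"

def check_scope(claims: dict) -> None:
    """Reject a request whose (already signature-verified) JWT payload
    lacks the custom OAuth scope. Cognito access tokens carry scopes
    as a space-separated `scope` claim."""
    scopes = claims.get("scope", "").split()
    if REQUIRED_SCOPE not in scopes:
        raise PermissionError(f"token missing required scope {REQUIRED_SCOPE}")
```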
AgentCore agents authenticate to downstream AWS services using IAM execution roles attached to the agent runtime. No long-lived credentials are stored.
| Agent → Service | Auth Method |
|---|---|
| Supervisor Agent → Code Gen Agent | IAM role (bedrock-agentcore:InvokeAgentRuntime) |
| Supervisor Agent → Spark Lambda | IAM role (lambda:InvokeFunction) |
| Supervisor Agent → EMR Serverless | IAM role (emr-serverless:StartJobRun) |
| Supervisor Agent → S3 | IAM role (s3:GetObject, s3:PutObject) |
| Supervisor Agent → Glue Catalog | IAM role (glue:GetTable, glue:GetDatabase) |
| Supervisor Agent → Secrets Manager | IAM role (secretsmanager:GetSecretValue) for PostgreSQL credentials |
| Supervisor Agent → Bedrock Models | IAM role (bedrock:InvokeModel) |
Each agent runtime has its own IAM role. The Spark Supervisor Agent role cannot access resources outside its designated S3 bucket, Glue databases, or specific Lambda functions.
Use Cedar policies to define fine-grained authorization rules for what the agent can and cannot do. Cedar policies are evaluated by AgentCore before tool execution, providing a declarative authorization layer beyond IAM.
Policy: Allow Spark code execution only for authorized sessions
```
permit (
  principal == AgentCore::Agent::"spark_supervisor_agent",
  action == AgentCore::Action::"InvokeTool",
  resource == AgentCore::Tool::"spark_code_interpreter"
) when {
  context.session_id.isValid() &&
  context.execution_platform in ["lambda", "emr"]
};
```
Policy: Deny access to production EMR for non-validated code
```
forbid (
  principal == AgentCore::Agent::"spark_supervisor_agent",
  action == AgentCore::Action::"InvokeTool",
  resource == AgentCore::Tool::"emr_execution"
) when {
  context.code_validated == false
};
```
Policy: Restrict S3 data access to project-tagged buckets
```
permit (
  principal == AgentCore::Agent::"spark_supervisor_agent",
  action == AgentCore::Action::"InvokeTool",
  resource == AgentCore::Tool::"data_read"
) when {
  resource.s3_bucket.hasTag("Project", "bluebear")
};
```
Policy: Deny destructive operations
```
forbid (
  principal is AgentCore::Agent,
  action == AgentCore::Action::"InvokeTool",
  resource == AgentCore::Tool::"spark_code_interpreter"
) when {
  context.spark_code.contains("DROP TABLE") ||
  context.spark_code.contains("DELETE FROM") ||
  context.spark_code.contains("TRUNCATE")
};
```
Cedar policies are stored alongside the agent configuration and evaluated at runtime. They provide deterministic, auditable authorization that complements IAM roles.
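The destructive-operation deny rule can also be mirrored client-side as a defense-in-depth pre-check before code is ever submitted. An illustrative sketch (the gateway-side Cedar evaluation remains the authoritative control):

```python
import re

# Mirrors the Cedar deny rule: DROP TABLE, DELETE FROM, TRUNCATE.
DESTRUCTIVE_PATTERNS = [r"\bDROP\s+TABLE\b", r"\bDELETE\s+FROM\b", r"\bTRUNCATE\b"]

def is_destructive(spark_code: str) -> bool:
    """Case-insensitive scan for the destructive SQL operations the
    Cedar policy forbids. A convenience pre-check only."""
    return any(re.search(p, spark_code, re.IGNORECASE)
               for p in DESTRUCTIVE_PATTERNS)
```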
Enable AWS CloudTrail to capture all API calls made by the agents and infrastructure. This provides a complete audit trail of who invoked what, when, and with what parameters.
Enable CloudTrail for AgentCore, Lambda, and EMR:
```bash
aws cloudtrail create-trail \
  --name spark-code-interpreter-trail \
  --s3-bucket-name my-cloudtrail-bucket \
  --is-multi-region-trail \
  --enable-log-file-validation

aws cloudtrail start-logging --name spark-code-interpreter-trail
```

Key events to monitor:
| Event Source | Event Name | What It Captures |
|---|---|---|
| `bedrock-agentcore.amazonaws.com` | `InvokeAgentRuntime` | Every agent invocation (prompt, session, duration) |
| `bedrock.amazonaws.com` | `InvokeModel` | Every model call (model ID, token count) |
| `lambda.amazonaws.com` | `Invoke` | Every Spark Lambda execution |
| `emr-serverless.amazonaws.com` | `StartJobRun` | Every EMR job submission |
| `s3.amazonaws.com` | `PutObject`, `GetObject` | Data access patterns |
| `secretsmanager.amazonaws.com` | `GetSecretValue` | Credential access for PostgreSQL |
CloudTrail Insights -- enable anomaly detection to flag unusual patterns:
```bash
aws cloudtrail put-insight-selectors \
  --trail-name spark-code-interpreter-trail \
  --insight-selectors '[{"InsightType": "ApiCallRateInsight"}, {"InsightType": "ApiErrorRateInsight"}]'
```

This automatically detects spikes in agent invocations or error rates.
Configure Amazon EventBridge rules to trigger alerts when critical events occur. This enables real-time monitoring of agent health, failures, and security anomalies.
Rule 1: Agent invocation failures
```json
{
  "source": ["aws.bedrock-agentcore"],
  "detail-type": ["AgentCore Agent Runtime Invocation"],
  "detail": {
    "errorCode": [{"exists": true}]
  }
}
```

```bash
aws events put-rule \
  --name spark-agent-invocation-failures \
  --event-pattern '{
    "source": ["aws.bedrock-agentcore"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
      "eventSource": ["bedrock-agentcore.amazonaws.com"],
      "eventName": ["InvokeAgentRuntime"],
      "errorCode": [{"exists": true}]
    }
  }'
```

Rule 2: EMR Serverless job failures
```bash
aws events put-rule \
  --name spark-emr-job-failures \
  --event-pattern '{
    "source": ["aws.emr-serverless"],
    "detail-type": ["EMR Serverless Job Run State Change"],
    "detail": {
      "state": ["FAILED", "CANCELLED"]
    }
  }'
```

Rule 3: Lambda throttling or errors
```bash
aws events put-rule \
  --name spark-lambda-errors \
  --event-pattern '{
    "source": ["aws.lambda"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
      "eventSource": ["lambda.amazonaws.com"],
      "eventName": ["Invoke"],
      "errorCode": ["TooManyRequestsException", "ServiceException"]
    }
  }'
```

Rule 4: Unauthorized access attempts
```bash
aws events put-rule \
  --name spark-unauthorized-access \
  --event-pattern '{
    "source": ["aws.bedrock-agentcore"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
      "errorCode": ["AccessDeniedException", "UnauthorizedException"]
    }
  }'
```

Route alerts to SNS for notifications:
```bash
# Create SNS topic
aws sns create-topic --name spark-agent-alerts
aws sns subscribe --topic-arn arn:aws:sns:us-east-1:${ACCOUNT_ID}:spark-agent-alerts \
  --protocol email --notification-endpoint ops-team@example.com

# Attach to all EventBridge rules
for rule in spark-agent-invocation-failures spark-emr-job-failures spark-lambda-errors spark-unauthorized-access; do
  aws events put-targets --rule $rule \
    --targets "Id=sns-target,Arn=arn:aws:sns:us-east-1:${ACCOUNT_ID}:spark-agent-alerts"
done
```

Summary of alert coverage:
| Alert | Trigger | Severity |
|---|---|---|
| Agent invocation failure | Any AgentCore error | High |
| EMR job failure | Job state = FAILED or CANCELLED | High |
| Lambda throttling | TooManyRequestsException | Medium |
| Unauthorized access | AccessDeniedException on any agent endpoint | Critical |
| CloudTrail anomaly | Unusual API call rate spike (via Insights) | Medium |
Deploy SoAL and EMR Serverless in a VPC for private S3 access. The CloudFormation stack supports VPC parameters (VpcId, PrivateSubnetIds).
The React UI allows users to review generated PySpark before execution, approve or reject operations, and export code for audit.
All data in transit uses TLS. Enable S3 default encryption and EMRFS encryption for data at rest.
- Pay per 100 ms request + data transfer
- ~$0.02-$0.10 per validation iteration
- Best for: small datasets (<500 MB), iterative testing
- Pay per DPU-hour
- $0.35/DPU-hour; typical 1 TB query = $5-$20
- Scales to zero when idle
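As a back-of-envelope check on the figures above (the $0.35/DPU-hour rate is the example rate quoted here; actual EMR Serverless pricing varies by region and is billed on vCPU and memory consumed):

```python
def emr_serverless_cost(dpu_hours: float, rate_per_dpu_hour: float = 0.35) -> float:
    """Rough EMR Serverless cost estimate using the example rate above."""
    return round(dpu_hours * rate_per_dpu_hour, 2)
```

At that rate, a job consuming 40 DPU-hours lands around $14, within the quoted $5-$20 band for a typical 1 TB query.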
- Filter data early in generated code to reduce processing
- Use Parquet over CSV for better compression and columnar performance
- Partition S3 data by date (`s3://bucket/year=2024/month=01/`)
- EMR Serverless auto-scales down to 0 when idle
- Set CloudWatch budget alerts on Lambda + EMR costs
```bash
aws logs tail /aws/lambda/sparkOnLambda-spark-code-interpreter --follow
```

Increase the timeout in the Settings UI or in backend/settings.json.
```bash
aws emr-serverless get-job-run \
  --application-id <app-id> \
  --job-run-id <job-id> \
  --region us-east-1
```

Request a quota increase in the AWS Console → Service Quotas. The backend includes exponential backoff for throttled requests.
Check Lambda logs for JAR classpath errors. The Docker image includes Hadoop-AWS JARs for S3A support.
The MCP gateway may time out at ~30s while the Lambda continues. Check S3 for results.
```bash
./scripts/cleanup.sh
```

Or manually:

```bash
aws cloudformation delete-stack --stack-name spark-code-interpreter --region us-east-1
```

- YouTube: Spark Code Interpreter - Big Data for Business Users
- AWS Blog: Spark on AWS Lambda (SoAL)
- AWS Docs: Amazon EMR Serverless
- GitHub: spark-code-interpreter
Version: 3.0.0 | Model: Claude Sonnet 4.5 | UI: React + FastAPI

