Dear applicants, please note that applications without salary expectations and an active LinkedIn profile will not be considered. Thank you for your understanding.
We are hiring a Senior Applied AI Engineer to own the reliability, evaluation, and production stability of advanced multi-agent AI systems operating at real production scale. This role is focused on transforming LLM-powered workflows from “demo-ready” prototypes into resilient, observable, production-grade systems capable of handling non-deterministic model behavior, complex routing logic, and human-in-the-loop escalation flows. You will work closely with technical leadership and product stakeholders to design, evaluate, optimize, and maintain agentic AI systems across multiple communication channels and workflows. This is a highly hands-on engineering role for someone who thrives in production environments and understands the realities of deploying AI systems under live traffic conditions.
Details
Location: LATAM
Work Model: Fully Remote
Employment Type: Full-time
Seniority Level: Senior
Industry: AI / Agentic Systems / SaaS
Start Date: ASAP
English Level: Fluent English Required
Time Zone: LATAM-friendly collaboration preferred
About the Role
This position is dedicated to AI agent reliability, evaluation pipelines, observability, and continuous optimization of production LLM systems. The ideal candidate combines strong backend engineering expertise with deep practical experience operating AI products in real-world environments. You will take ownership of evaluation frameworks, scoring systems, tracing infrastructure, production debugging, and the iterative optimization loop between prompts, architecture decisions, and system behavior. The role requires both technical depth and product intuition, especially around how evaluation systems directly impact product quality and user experience.
Key Responsibilities
- Design, build, and maintain evaluation pipelines for production AI agent systems
- Instrument multi-agent workflows with tracing and observability tooling
- Build evaluation datasets using real production traffic and interaction logs
- Develop quality scoring and robustness scoring systems for LLM outputs
- Improve reliability of AI systems handling non-deterministic model behavior
- Implement and optimize HITL (Human-in-the-Loop) escalation workflows
- Analyze production failures and drive architectural improvements
- Own the full feedback loop between evaluations, prompt optimization, architecture updates, and re-testing
- Contribute to prompt engineering and model optimization strategies
- Collaborate on multi-agent orchestration and workflow reliability decisions
- Work across backend systems, deployment pipelines, monitoring, and operational sustainment
- Participate in production support and on-call responsibilities
- Maintain high engineering standards around scalability, observability, and maintainability
- Operate independently across development, testing, deployment, and production ownership
Requirements
- 5+ years of backend or AI engineering experience in production environments
- Strong hands-on experience with production LLM or agentic AI systems
- Proven experience debugging and maintaining non-deterministic AI workflows under live traffic
- Experience building or operating evaluation (evals) pipelines for AI systems
- Strong understanding of scorer design, feedback loops, and AI system evaluation methodologies
- Excellent Python backend engineering skills
- Production experience with:
  - FastAPI
  - Django
  - Celery
  - LangGraph or similar orchestration frameworks
- Experience with observability and tracing tools such as:
  - Langfuse
  - Grafana
  - Loki
  - OpenTelemetry or equivalent
- Experience deploying and operating distributed backend systems
- Strong understanding of AI reliability, prompt behavior, and model failure handling
- Ability to independently own projects end-to-end
- Experience working in asynchronous remote teams
- Strong written communication skills in English
Nice to Have
- Experience with:
  - DSPy
  - DPO
  - RLHF-related optimization workflows
- Experience with multi-agent orchestration systems
- Production experience with:
  - GPT-4.x
  - Claude
  - Whisper
  - Multi-model AI stacks
- Experience building AI tooling for communication or workflow automation
- Background in high-growth startups or product-focused engineering teams
- Experience with distributed systems and event-driven architectures
- Familiarity with AI observability and experiment tracking frameworks
- Exposure to vector databases, retrieval systems, or memory architectures
- Experience scaling AI products with real customer usage
Tech Stack
- Python
- FastAPI
- Django
- Celery
- LangGraph
- Langfuse
- Grafana
- Loki
- LLM APIs (OpenAI / Anthropic / multi-model stacks)
What Success Looks Like
- AI agents reliably handle real production traffic with measurable quality improvements
- Evaluation pipelines provide actionable scoring and monitoring insights
- Observability systems surface failures before they impact users
- Human escalation triggers operate accurately and consistently
- Prompt and architecture iterations measurably improve production outcomes
- AI systems become resilient, scalable, and maintainable over time
Interview Process
- HR / Introductory Call
- Technical Deep Dive
- Take-Home Technical Assessment
- Final Team & Culture Interview
- Offer Stage