Available for 10–20 hrs/week retainers and fixed-scope sprints. Get in touch →

Services

What I offer

Four focused areas where senior ML engineering makes a direct, measurable difference. Every engagement delivers working software, not slide decks.

Agentic Automation

LLM workflows that actually ship

Most LLM demos fail in production because they were never designed for production. I design, build, and deploy agentic pipelines that handle edge cases gracefully, remain observable, and stay within your cost budget.

Typical timeline: 3–8 weeks for a production-ready pipeline; 1–2 weeks for a scoped prototype

Outcomes

  • Production-grade document extraction, routing, and classification pipelines
  • Multi-step agent orchestration with structured outputs and retry logic
  • Human-in-the-loop review interfaces with audit trails
  • Cost-aware model selection, caching, and prompt optimization
  • Evaluation frameworks to measure accuracy before and after changes

What you get

  • Working pipeline code with full documentation
  • Prompt library with documented decision rationale
  • Evaluation harness with golden test set
  • Deployment configuration (Docker / cloud functions)
  • Runbook for operations and monitoring
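The structured-output and retry pattern behind these pipelines can be sketched in a few lines. This is a minimal, illustrative version: `call_model` stands in for whatever LLM client you use, and the required keys are placeholders, not a real schema.

```python
import json
import time
from typing import Callable

# Illustrative schema — real pipelines validate against a full JSON schema.
REQUIRED_KEYS = {"category", "confidence"}

def parse_structured(raw: str) -> dict:
    """Check that the model returned the JSON shape we asked for."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on bad JSON
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

def classify_with_retry(call_model: Callable[[str], str], doc: str,
                        max_attempts: int = 3, backoff_s: float = 1.0) -> dict:
    """Call the model, validate the structured output, retry on failure."""
    last_err = None
    for attempt in range(max_attempts):
        try:
            return parse_structured(call_model(doc))
        except ValueError as err:
            last_err = err
            time.sleep(backoff_s * 2 ** attempt)  # exponential backoff
    raise RuntimeError(f"gave up after {max_attempts} attempts: {last_err}")
```

The point of validating before returning is that malformed output fails loudly and retries immediately, instead of corrupting downstream state.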

Recommenders & Ranking

Retrieval and ranking built for real traffic

Recommender systems are among the highest-leverage investments in consumer and B2B products. I build two-stage retrieval + ranking architectures that scale, and I integrate them with your experimentation stack so improvements are measurable.

Typical timeline: 6–12 weeks for full two-stage system; 2–4 weeks for targeted retrieval or ranking upgrade

Outcomes

  • Significant uplift in engagement, click-through, or revenue metrics
  • Sub-50ms retrieval at thousands of queries per second
  • Graceful handling of cold-start for new users and items
  • Measurable lift in A/B tests against existing baselines
  • Reduced offline-to-online model performance gap

What you get

  • Feature store design and implementation (or integration with existing)
  • Candidate generation service with vector search integration
  • Ranking model training pipeline
  • Online serving API with logging for feedback loops
  • A/B testing integration and metric dashboard
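The two-stage shape itself is simple enough to sketch. In this toy version the retrieval stage is a brute-force inner-product search over item embeddings (in production that would be a vector index, not a full matrix multiply), and `rank_score` stands in for whatever learned ranker you deploy.

```python
import numpy as np

def retrieve(user_vec: np.ndarray, item_vecs: np.ndarray, k: int) -> np.ndarray:
    """Stage 1: cheap pass — top-k item indices by inner-product similarity."""
    scores = item_vecs @ user_vec
    return np.argpartition(scores, -k)[-k:]  # top-k indices, unordered

def rank(user_vec, item_vecs, candidates, rank_score):
    """Stage 2: expensive pass — rerank only the k candidates."""
    scored = [(idx, rank_score(user_vec, item_vecs[idx])) for idx in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [idx for idx, _ in scored]
```

The economics are the whole point: the heavy model scores k items instead of the full catalog, which is what makes sub-50ms serving feasible.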

MLOps & Productionization

From notebook to reliable production system

A research model that never reaches production is a cost, not an asset. I build the training infrastructure, serving layer, and observability tooling that convert ML experiments into reliable, maintainable systems.

Typical timeline: 4–10 weeks depending on complexity; audits of existing infrastructure in 1–2 weeks

Outcomes

  • Reproducible, parameterized training pipelines with lineage tracking
  • Low-latency model serving with autoscaling (AWS, GCP, or Kubernetes)
  • Model registry with staged rollout and rollback capability
  • Drift detection, alerting, and automated retraining triggers
  • Significant reduction in time-to-deploy for new model versions

What you get

  • Reproducible training pipeline
  • Experiment tracking and artifact management
  • Serving infrastructure with CI/CD and deployment automation
  • Monitoring dashboards for model performance and data drift
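As one concrete example of the drift-detection piece, a common signal is the Population Stability Index (PSI) between a feature's training-time distribution and live traffic. This is a generic sketch rather than any specific tool's implementation; the usual rules of thumb treat PSI below 0.1 as stable and above 0.2 as worth an alert.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI = sum over bins of (p_actual - p_expected) * ln(p_actual / p_expected)."""
    # Bin edges come from quantiles of the reference (training) distribution.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Fold out-of-range live values into the edge bins.
    actual = np.clip(actual, edges[0], edges[-1])
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Clip proportions away from zero so the log is defined for empty bins.
    e_pct = np.clip(e_counts / len(expected), 1e-6, None)
    a_pct = np.clip(a_counts / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A check like this runs per feature on a schedule; a sustained breach is what triggers alerting or automated retraining.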

Measurement & Experimentation

Know what's actually working

Bad measurement is expensive. Instrumented A/B tests and causal analyses replace intuition with evidence, so product teams can ship changes confidently and ML teams can claim credit for real improvements.

Typical timeline: 2–6 weeks for platform build or upgrade; ongoing advisory as needed

Outcomes

  • Correctly powered experiments that answer the right question
  • Reduced time-to-decision on product and model changes
  • Reliable guardrail metrics that prevent regressions
  • Causal estimates of impact in non-randomized settings
  • Shared statistical language between data, product, and engineering

What you get

  • Experiment platform design (or audit and improvement of existing)
  • Statistical testing framework: frequentist, Bayesian, or sequential
  • Power analysis and sample size calculator
  • Metric taxonomy with primary, secondary, and guardrail metrics
  • Documentation and team enablement guide
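"Power analysis" here means standard calculations like the one below: users needed per arm to detect a given absolute lift in a conversion rate, via the normal approximation to the two-proportion z-test. A generic standard-library sketch, not a substitute for a calculator tuned to your metric.

```python
from math import ceil
from statistics import NormalDist

def samples_per_arm(p_base: float, mde_abs: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Users per arm to detect an absolute lift of `mde_abs` over `p_base`.

    alpha is the two-sided false-positive rate; power = 1 - beta.
    """
    p_test = p_base + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Variance of the difference in sample proportions (per unit n).
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)
```

The useful intuition it encodes: halving the minimum detectable effect roughly quadruples the required sample size, which is why "just run the test a bit longer" is rarely a free lunch.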

Good fit / Not a fit

Clarity upfront saves everyone time. Here's an honest read on where I add the most value.

Good fit

  • You have a production ML system that isn't performing or isn't shipping fast enough.
  • You need senior ML capacity for a defined period without a full-time hire.
  • You're building a new AI-powered feature and want to get the architecture right from the start.
  • Your team has strong software engineers but limited ML depth.
  • You want rigorous measurement to validate — or invalidate — an ML investment.
  • You're dealing with compute costs that have grown faster than the business.

Not a fit

  • You want a data science generalist who will own analytics, dashboards, and ML.
  • The engagement requires more than 20 hrs/week of dedicated capacity.
  • You're in a regulated industry and need compliance-specific guidance (HIPAA, FedRAMP, etc.).
  • You need a full-time team lead or people manager.
  • The project is primarily about business intelligence or BI tooling.

Sounds like a fit?

Send a message with your goals and constraints. Short is fine — I'll ask follow-up questions.