Question 1

What does requests per second (RPS) mean for LLM APIs?

Accepted Answer

RPS measures how many inference requests your system sends to the model per second. For a chatbot serving 1,000 users simultaneously, each sending a message every 20 seconds, that's 1,000 / 20 = 50 RPS. Throughput directly affects infrastructure cost, queue depth, and perceived latency.
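The arithmetic in the chatbot example can be sketched in a few lines of Python. The function names are hypothetical, and the 2-second average latency is an assumed figure for illustration only. The second function applies Little's law (in-flight requests = arrival rate × average time in system) to show how RPS translates into concurrent load.

```python
def required_rps(concurrent_users: int, seconds_between_messages: float) -> float:
    """Average requests per second generated by a pool of active users."""
    if seconds_between_messages <= 0:
        raise ValueError("seconds_between_messages must be positive")
    return concurrent_users / seconds_between_messages


def in_flight_requests(rps: float, avg_latency_s: float) -> float:
    """Little's law: average number of requests being processed at once."""
    return rps * avg_latency_s


# The example from the answer: 1,000 users, one message every 20 seconds.
rps = required_rps(1000, 20)
print(rps)  # 50.0

# Assumed average latency of 2 seconds per request (illustrative only).
print(in_flight_requests(rps, 2.0))  # 100.0
```

Note that this gives an average rate. Real traffic is bursty, so peak RPS will typically be higher than this figure.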

Question 2

What is the difference between p50, p95, and p99 latency?

Accepted Answer

Percentile latencies describe how your response times are distributed, including the slow tail. p50 (the median) is the time within which 50% of requests complete. p95 is the time within which 95% of requests complete; it matters more for user experience because it reflects what the slowest 1 in 20 requests feel like. p99 captures near-worst-case behavior. For interactive AI applications, design to p95 rather than to the median.
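As a minimal sketch of how these numbers are computed, the snippet below uses the nearest-rank method on a list of measured latencies. The sample values are made up for illustration, and monitoring tools may use a different interpolation method, so their results can differ slightly.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [
    210, 220, 230, 240, 250, 260, 270, 280, 290, 300,
    310, 320, 330, 340, 350, 360, 370, 380, 900, 2500,
]

print(percentile(latencies_ms, 50))  # 300
print(percentile(latencies_ms, 95))  # 900
print(percentile(latencies_ms, 99))  # 2500
```

In this sample the median looks healthy at 300 ms, while p95 and p99 expose the slow tail that some users actually experience.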

Question 3

When should I self-host instead of using the OpenAI or Anthropic API?

Accepted Answer

Self-hosting makes financial sense when your spend on managed APIs exceeds roughly $5,000 to $10,000 per month. It also makes sense when you need data privacy, custom fine-tuning, predictable latency, or freedom from rate limits.
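A rough break-even check can be sketched as follows. Every number here is a hypothetical placeholder (token volume, per-token price, GPU hourly rate, and operations overhead), not a quote from any provider, and the function names are illustrative. Substitute your own figures.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Managed API spend, assuming a single blended price per million tokens."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def monthly_self_host_cost(gpu_count: int, usd_per_gpu_hour: float,
                           fixed_ops_usd: float) -> float:
    """Self-hosting spend: always-on GPUs plus a fixed monthly overhead
    (engineering time, monitoring, storage)."""
    return gpu_count * usd_per_gpu_hour * HOURS_PER_MONTH + fixed_ops_usd


# Placeholder inputs, for illustration only.
api = monthly_api_cost(tokens_per_month=2_000_000_000, usd_per_million_tokens=5.0)
hosted = monthly_self_host_cost(gpu_count=2, usd_per_gpu_hour=2.50, fixed_ops_usd=3000.0)

print(api)     # 10000.0
print(hosted)  # 6650.0
print("self-host" if hosted < api else "managed API")  # self-host
```

This comparison only holds if the self-hosted GPUs can actually serve your peak load at acceptable latency, so it is worth checking throughput requirements before trusting the cost figure.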

Question 4

How accurate are these infrastructure estimates?

Accepted Answer

These are directional estimates based on published GPU throughput benchmarks and public cloud pricing (May 2026). Real-world performance varies based on batch size, prompt length, system prompt reuse, KV-cache hit rate, and network conditions. Use these for planning, not SLA commitments.

AI Latency & Throughput Calculator
