Routers
Routers are a core concept in the Glide Gateway. Each router operates on its own pool of models and decides which model in the pool will handle the current request based on additional context such as model latency over time, model health, which models handled previous requests, and so on.
There are four routing strategies available:
Routing Strategies
Priority
The priority routing strategy (also known as failover routing) picks the first healthy model in the pool in the order the models are defined in the Glide config.
This is the default routing strategy.
Here is an example of a priority router:
routers:
  language:
    - id: default
      strategy: priority
      models:
        - id: openai
          openai:
            api_key: "${env:OPENAI_API_KEY}"
        - id: cohere
          cohere:
            api_key: "${env:COHERE_API_KEY}"
The router will always try to serve requests with the openai model first. If it's not available, the router will pick the cohere model. Once the openai model is healthy again, the router will switch back to it.
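To make the failover order concrete, here is a minimal sketch in Go of first-healthy selection (an illustration only, not Glide's actual implementation; the Model type and its Healthy method are made up for this example):

package main

import (
	"errors"
	"fmt"
)

// Model is a hypothetical stand-in for a configured model in the pool.
type Model struct {
	ID      string
	healthy bool
}

func (m Model) Healthy() bool { return m.healthy }

// pickPriority returns the first healthy model in configuration order.
func pickPriority(pool []Model) (Model, error) {
	for _, m := range pool {
		if m.Healthy() {
			return m, nil
		}
	}
	return Model{}, errors.New("no healthy models in the pool")
}

func main() {
	pool := []Model{
		{ID: "openai", healthy: false}, // openai is temporarily down
		{ID: "cohere", healthy: true},
	}
	m, err := pickPriority(pool)
	if err != nil {
		panic(err)
	}
	fmt.Println("serving with:", m.ID) // serving with: cohere
}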
Least Latency
The least latency routing strategy selects the model with the lowest average latency over time. If the least latency model becomes unhealthy, it picks the second best, and so on.
Since we don't know the true distribution of model latencies (and it's a fairly dynamic thing), we attempt to estimate it and keep the estimate updated over time.
The estimation process has two stages:
- Warmup. In this stage, Glide picks each model in a round-robin manner and collects latency samples until all models are warmed up.
- Serving. Once all models are warmed up, Glide starts serving requests with the least latency model. To be fair to models that happen to have higher latency, Glide updates their latencies periodically.
This is how it looks in the configuration:
routers:
  language:
    - id: default
      strategy: least_latency
      models:
        - id: openai
          latency:
            decay: 0.06
            warmup_samples: 3
            update_interval: 30s
          openai:
            api_key: "${env:OPENAI_API_KEY}"
        - id: cohere
          latency:
            decay: 0.06
            warmup_samples: 3
            update_interval: 30s
          cohere:
            api_key: "${env:COHERE_API_KEY}"
The latency configuration (located on the model item level):
- routers.language[].models[].latency.decay (default: 0.06) - the decay rate of the exponential moving average of the model latency
- routers.language[].models[].latency.warmup_samples (default: 3) - the number of samples to collect before starting to serve the least latency model. The higher the number of samples, the more precise the latency estimation, but the longer the warmup takes
- routers.language[].models[].latency.update_interval (default: 30s) - the interval at which the model latency is updated
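To illustrate what decay and warmup_samples roughly mean, here is a small sketch in Go of an exponential moving average latency tracker (a simplified illustration under assumed semantics, not Glide's internals; update_interval would then control how often such samples are refreshed for models that are not currently serving traffic):

package main

import "fmt"

// latencyTracker keeps an exponential moving average (EMA) of observed latencies.
// decay and warmup mirror the latency.decay and latency.warmup_samples options.
type latencyTracker struct {
	decay   float64 // e.g. 0.06
	warmup  int     // e.g. 3 samples before the estimate is trusted
	count   int
	average float64 // EMA of latency in milliseconds
}

// Observe folds a new latency sample into the moving average.
func (t *latencyTracker) Observe(latencyMs float64) {
	t.count++
	if t.count == 1 {
		t.average = latencyMs // the first sample seeds the average
		return
	}
	// Standard EMA update: the newest sample gets `decay` weight, history keeps the rest.
	t.average = t.decay*latencyMs + (1-t.decay)*t.average
}

// WarmedUp reports whether enough samples were collected to trust the estimate.
func (t *latencyTracker) WarmedUp() bool { return t.count >= t.warmup }

func main() {
	t := &latencyTracker{decay: 0.06, warmup: 3}
	for _, sample := range []float64{120, 180, 90, 300} {
		t.Observe(sample)
	}
	fmt.Printf("warmed up: %v, average latency: %.1f ms\n", t.WarmedUp(), t.average)
}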
Round-Robin
The round-robin routing strategy sends API traffic to each model in the pool in a circular order.
Let's take the following configuration as an example:
routers:
  language:
    - id: default
      strategy: round_robin
      models:
        - id: openai
          openai:
            api_key: "${env:OPENAI_API_KEY}"
        - id: cohere
          cohere:
            api_key: "${env:COHERE_API_KEY}"
Given that both models are healthy, requests are going to be processed like this:
- 1st request: openai
- 2nd request: cohere
- 3rd request: openai
- 4th request: cohere
- etc.
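The circular order can be pictured as a counter indexed into the pool modulo its size. Here is a minimal sketch in Go (illustrative only, not Glide's implementation):

package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobin cycles through the pool in a fixed circular order.
type roundRobin struct {
	pool []string
	next atomic.Uint64
}

// Pick returns the next model in the rotation.
func (r *roundRobin) Pick() string {
	// Add(1) - 1 yields 0, 1, 2, ... even under concurrent callers.
	idx := (r.next.Add(1) - 1) % uint64(len(r.pool))
	return r.pool[idx]
}

func main() {
	r := &roundRobin{pool: []string{"openai", "cohere"}}
	for i := 1; i <= 4; i++ {
		fmt.Printf("request %d -> %s\n", i, r.Pick()) // openai, cohere, openai, cohere
	}
}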
Weighted Round-Robin
The weighted round-robin routing strategy sends requests proportionally to the weights assigned to the models in the pool.
For example, let's have a look at this configuration:
routers:
  language:
    - id: default
      strategy: weighted_round_robin
      models:
        - id: openai
          weight: 0.8 # 80% of traffic
          openai:
            api_key: "${env:OPENAI_API_KEY}"
        - id: cohere
          weight: 0.1 # 10% of traffic
          cohere:
            api_key: "${env:COHERE_API_KEY}"
        - id: anthropic
          weight: 0.1 # 10% of traffic
          anthropic:
            api_key: "${env:ANTHROPIC_API_KEY}"
In this case, traffic will be split as follows if all models are healthy:
- 80% of traffic will be sent to the openai model
- 10% of traffic will be sent to the cohere model
- 10% of traffic will be sent to the anthropic model
Now, if one of the models falls out of the pool, the remaining models are reweighed according to the weights of the healthy models.
In this example, if the openai model goes down, traffic will be split equally between the cohere and anthropic models because they were configured with equal weights:
- openai: 0% (unhealthy)
- cohere: 50% of traffic
- anthropic: 50% of traffic
The weight configuration (located on the model item level):
- routers.language[].models[].weight (default: 1) - the weight of the model in the pool. Weights don't have to be in the 0-1 interval or add up to 1.
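Because the weights don't have to add up to 1, each healthy model's effective share is its weight divided by the sum of the weights of all healthy models. The following sketch in Go (illustrative only, not Glide's implementation) shows how the reweighing from the example above falls out of that rule:

package main

import "fmt"

type weightedModel struct {
	id      string
	weight  float64
	healthy bool
}

// trafficShares computes each healthy model's share of traffic as
// weight / sum(weights of all healthy models); unhealthy models get 0.
func trafficShares(pool []weightedModel) map[string]float64 {
	total := 0.0
	for _, m := range pool {
		if m.healthy {
			total += m.weight
		}
	}
	shares := make(map[string]float64, len(pool))
	for _, m := range pool {
		if m.healthy && total > 0 {
			shares[m.id] = m.weight / total
		} else {
			shares[m.id] = 0
		}
	}
	return shares
}

func main() {
	pool := []weightedModel{
		{id: "openai", weight: 0.8, healthy: false}, // openai is down
		{id: "cohere", weight: 0.1, healthy: true},
		{id: "anthropic", weight: 0.1, healthy: true},
	}
	for id, share := range trafficShares(pool) {
		fmt.Printf("%s: %.0f%% of traffic\n", id, share*100)
	}
	// openai: 0%, cohere: 50%, anthropic: 50%
}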