Routers
Routers are a core concept in the Glide Gateway. Each router operates on its own pool of models and decides which model in the pool will handle the current request based on additional context such as model latency over time, model health, which models handled previous requests, and so on.
There are four routing strategies available:
Routing Strategies
Priority
The priority routing strategy (also known as failover routing) picks the first healthy model in the pool in the order the models are defined in the Glide config.
This is the default routing strategy.
Here is an example of a priority router:
routers:
  language:
    - id: default
      strategy: priority
      models:
        - id: openai
          openai:
            api_key: "${env:OPENAI_API_KEY}"
        - id: cohere
          cohere:
            api_key: "${env:COHERE_API_KEY}"
The router will always try to serve requests with the openai model first. If it's not available, the router will pick the cohere model. Once the openai model is healthy again, the router will switch back to it.
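To make the failover order concrete, here is a minimal sketch in Go of first-healthy selection (an illustration only, not Glide's actual implementation; the Model type and its Healthy method are made up for this example):

package main

import (
	"errors"
	"fmt"
)

// Model is a hypothetical stand-in for a configured model in the pool.
type Model struct {
	ID      string
	healthy bool
}

func (m Model) Healthy() bool { return m.healthy }

// pickPriority returns the first healthy model in configuration order.
func pickPriority(pool []Model) (Model, error) {
	for _, m := range pool {
		if m.Healthy() {
			return m, nil
		}
	}
	return Model{}, errors.New("no healthy models in the pool")
}

func main() {
	pool := []Model{
		{ID: "openai", healthy: false}, // openai is temporarily down
		{ID: "cohere", healthy: true},
	}
	m, err := pickPriority(pool)
	if err != nil {
		panic(err)
	}
	fmt.Println("serving with:", m.ID) // serving with: cohere
}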
Least Latency
The least latency routing strategy selects the model with the lowest average latency over time. If the least latency model becomes unhealthy, it picks the second best, and so on.
Since we don't know the true distribution of model latencies (and it's a fairly dynamic thing), we attempt to estimate it and keep the estimate updated over time.
The estimation process has two stages:
- Warmup. In this stage, Glide picks each model in a round-robin manner and collects latency samples until all models are warmed up.
- Serving. Once all models are warmed up, Glide starts serving requests with the least latency model. To be fair to models that happen to have higher latency, Glide updates their latencies periodically.
This is how it looks in the configuration:
routers:
  language:
    - id: default
      strategy: least_latency
      models:
        - id: openai
          latency:
            decay: 0.06
            warmup_samples: 3
            update_interval: 30s
          openai:
            api_key: "${env:OPENAI_API_KEY}"
        - id: cohere
          latency:
            decay: 0.06
            warmup_samples: 3
            update_interval: 30s
          cohere:
            api_key: "${env:COHERE_API_KEY}"
The latency configuration (located on the model item level):
- routers.language[].models[].latency.decay (default: 0.06) - the decay rate of the exponential moving average of the model latency
- routers.language[].models[].latency.warmup_samples (default: 3) - the number of samples to collect before starting to serve the least latency model. The higher the number of samples, the more precise the latency estimation, but the longer the warmup takes
- routers.language[].models[].latency.update_interval (default: 30s) - the interval at which the model latency is updated
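To illustrate what decay and warmup_samples roughly mean, here is a small sketch in Go of an exponential moving average latency tracker (a simplified illustration under assumed semantics, not Glide's internals; update_interval would then control how often such samples are refreshed for models that are not currently serving traffic):

package main

import "fmt"

// latencyTracker keeps an exponential moving average (EMA) of observed latencies.
// decay and warmup mirror the latency.decay and latency.warmup_samples options.
type latencyTracker struct {
	decay   float64 // e.g. 0.06
	warmup  int     // e.g. 3 samples before the estimate is trusted
	count   int
	average float64 // EMA of latency in milliseconds
}

// Observe folds a new latency sample into the moving average.
func (t *latencyTracker) Observe(latencyMs float64) {
	t.count++
	if t.count == 1 {
		t.average = latencyMs // the first sample seeds the average
		return
	}
	// Standard EMA update: the newest sample gets `decay` weight, history keeps the rest.
	t.average = t.decay*latencyMs + (1-t.decay)*t.average
}

// WarmedUp reports whether enough samples were collected to trust the estimate.
func (t *latencyTracker) WarmedUp() bool { return t.count >= t.warmup }

func main() {
	t := &latencyTracker{decay: 0.06, warmup: 3}
	for _, sample := range []float64{120, 180, 90, 300} {
		t.Observe(sample)
	}
	fmt.Printf("warmed up: %v, average latency: %.1f ms\n", t.WarmedUp(), t.average)
}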
Round-Robin
The round-robin routing strategy sends API traffic to each model in the pool in a circular order.
Let's take the following configuration as an example:
routers:
  language:
    - id: default
      strategy: round_robin
      models:
        - id: openai
          openai:
            api_key: "${env:OPENAI_API_KEY}"
        - id: cohere
          cohere:
            api_key: "${env:COHERE_API_KEY}"
Given that both models are healthy, requests are going to be processed like this:
- 1st request: openai
- 2nd request: cohere
- 3rd request: openai
- 4th request: cohere
- etc.
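The circular order can be pictured as a counter indexed into the pool modulo its size. Here is a minimal sketch in Go (illustrative only, not Glide's implementation):

package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobin cycles through the pool in a fixed circular order.
type roundRobin struct {
	pool []string
	next atomic.Uint64
}

// Pick returns the next model in the rotation.
func (r *roundRobin) Pick() string {
	// Add(1) - 1 yields 0, 1, 2, ... even under concurrent callers.
	idx := (r.next.Add(1) - 1) % uint64(len(r.pool))
	return r.pool[idx]
}

func main() {
	r := &roundRobin{pool: []string{"openai", "cohere"}}
	for i := 1; i <= 4; i++ {
		fmt.Printf("request %d -> %s\n", i, r.Pick()) // openai, cohere, openai, cohere
	}
}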
Weighted Round-Robin
The weighted round-robin routing strategy sends requests proportionally to the weights assigned to the models in the pool.
For example, let's have a look at this configuration:
routers:
  language:
    - id: default
      strategy: weighted_round_robin
      models:
        - id: openai
          weight: 0.8 # 80% of traffic
          openai:
            api_key: "${env:OPENAI_API_KEY}"
        - id: cohere
          weight: 0.1 # 10% of traffic
          cohere:
            api_key: "${env:COHERE_API_KEY}"
        - id: anthropic
          weight: 0.1 # 10% of traffic
          anthropic:
            api_key: "${env:ANTHROPIC_API_KEY}"
In this case, traffic will be split as follows if all models are healthy:
- 80% of traffic will be sent to the openai model
- 10% of traffic will be sent to the cohere model
- 10% of traffic will be sent to the anthropic model
Now, if one of the models falls out of the pool, the remaining models are reweighed according to the weights of the healthy models.
In this example, if the openai model goes down, traffic will be split equally between the cohere and anthropic models because they were configured with equal weights:
- openai: 0% (unhealthy)
- cohere: 50% of traffic
- anthropic: 50% of traffic
The weight configuration (located on the model item level):
- routers.language[].models[].weight (default: 1) - the weight of the model in the pool. Weights don't have to be in the 0-1 interval or add up to 1.
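Because the weights don't have to add up to 1, each healthy model's effective share is its weight divided by the sum of the weights of all healthy models. The following sketch in Go (illustrative only, not Glide's implementation) shows how the reweighing from the example above falls out of that rule:

package main

import "fmt"

type weightedModel struct {
	id      string
	weight  float64
	healthy bool
}

// trafficShares computes each healthy model's share of traffic as
// weight / sum(weights of all healthy models); unhealthy models get 0.
func trafficShares(pool []weightedModel) map[string]float64 {
	total := 0.0
	for _, m := range pool {
		if m.healthy {
			total += m.weight
		}
	}
	shares := make(map[string]float64, len(pool))
	for _, m := range pool {
		if m.healthy && total > 0 {
			shares[m.id] = m.weight / total
		} else {
			shares[m.id] = 0
		}
	}
	return shares
}

func main() {
	pool := []weightedModel{
		{id: "openai", weight: 0.8, healthy: false}, // openai is down
		{id: "cohere", weight: 0.1, healthy: true},
		{id: "anthropic", weight: 0.1, healthy: true},
	}
	for id, share := range trafficShares(pool) {
		fmt.Printf("%s: %.0f%% of traffic\n", id, share*100)
	}
	// openai: 0%, cohere: 50%, anthropic: 50%
}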