What this pattern does:

Serve a large language model (LLM) with GPUs on Google Kubernetes Engine (GKE) in Standard mode. Create a GKE Standard cluster that uses multiple L4 GPUs, and prepare the GKE infrastructure to serve either of the following models: Falcon 40b or Llama 2 70b.

Caveats and Considerations:

The number of GPUs required varies with the data format of the model, since the format determines how much GPU memory the model weights occupy. In this design, each model uses two L4 GPUs.
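
As a rough illustration of why the data format affects the GPU count, the sketch below estimates the memory needed for model weights alone and the smallest number of L4 GPUs that could hold them. It is not part of this pattern. The function names, the bytes-per-parameter table, and the nominal parameter counts (40 and 70 billion, taken from the model names) are assumptions for illustration, and the 24 GB figure is the L4's advertised memory. The estimate ignores the KV cache, activations, and runtime overhead, so treat the result as a lower bound rather than a sizing recommendation.

```python
import math

# Advertised memory of one NVIDIA L4 GPU, in GB.
L4_MEMORY_GB = 24

# Assumed storage cost per parameter for some common data formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, data_format: str) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[data_format]


def min_gpus_for_weights(
    params_billions: float,
    data_format: str,
    gpu_memory_gb: float = L4_MEMORY_GB,
) -> int:
    """Smallest GPU count whose combined memory holds the weights.

    Lower bound only: serving also needs room for the KV cache,
    activations, and framework overhead.
    """
    return math.ceil(weight_memory_gb(params_billions, data_format) / gpu_memory_gb)


if __name__ == "__main__":
    # Nominal parameter counts, in billions, inferred from the model names.
    models = {"Falcon 40b": 40, "Llama 2 70b": 70}
    for name, size in models.items():
        for fmt in BYTES_PER_PARAM:
            gb = weight_memory_gb(size, fmt)
            gpus = min_gpus_for_weights(size, fmt)
            print(f"{name:12s} {fmt:5s} ~{gb:6.1f} GB weights -> at least {gpus} x L4")
```

Under these assumptions, a 70-billion-parameter model in a 16-bit format needs about 140 GB for weights (at least six L4 GPUs), while a 4-bit format needs about 35 GB, which fits within the combined 48 GB of two L4 GPUs.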


