What you’ll learn
- How to configure a Serverless endpoint with a cached model.
- How to programmatically locate a cached model in your handler function.
- How to create a custom handler function for text generation.
- How to integrate the Phi-3 model with the Hugging Face Transformers library.
Requirements
Before starting this tutorial, make sure:
- You have a Runpod account with sufficient credits.
- You have a Runpod API key.
- You have a GitHub account.
- Your Runpod account is connected to GitHub.
Step 1: Create your handler function
Create a file named handler.py that processes inference requests using the cached model. This handler enforces offline mode to ensure it only uses cached models and includes a helper function to resolve the correct snapshot path.
handler.py
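Here is a minimal sketch of what this handler could look like, following the walkthrough below. The HF_CACHE_ROOT default path and the exact input/output field names (prompt, max_tokens, temperature, generated_text) are assumptions; adjust them to match your endpoint configuration.

```python
import os

# Force offline mode before importing transformers so the worker never tries
# to download weights at runtime.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

import runpod
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Model to load, configurable via environment variable or the endpoint's Model setting.
MODEL_ID = os.environ.get("MODEL_ID", "microsoft/Phi-3-mini-4k-instruct")
# Root of the model cache. This default is a placeholder -- point it at the
# directory where your endpoint's cached models are mounted.
HF_CACHE_ROOT = os.environ.get("HF_CACHE_ROOT", "/runpod/cache/huggingface/hub")


def resolve_snapshot_path(model_id: str, cache_root: str) -> str:
    """Find the cached snapshot directory for a model in the Hugging Face hub cache layout."""
    model_dir = os.path.join(cache_root, "models--" + model_id.replace("/", "--"))
    snapshots_dir = os.path.join(model_dir, "snapshots")
    ref_path = os.path.join(model_dir, "refs", "main")

    if os.path.isfile(ref_path):
        # refs/main holds the commit hash that the "main" branch points to.
        with open(ref_path) as f:
            commit = f.read().strip()
        return os.path.join(snapshots_dir, commit)

    # Older cache formats may lack refs/main; fall back to the first snapshot.
    snapshots = sorted(os.listdir(snapshots_dir))
    if not snapshots:
        raise FileNotFoundError(f"No cached snapshot found for {model_id} in {cache_root}")
    return os.path.join(snapshots_dir, snapshots[0])


# Load the model once at module level so it stays in GPU memory across requests.
model_path = resolve_snapshot_path(MODEL_ID, HF_CACHE_ROOT)
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    local_files_only=True,
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)


def handler(job):
    """Generate text for a single job pulled from the endpoint queue."""
    job_input = job.get("input", {})
    prompt = job_input.get("prompt", "")
    max_tokens = job_input.get("max_tokens", 256)
    temperature = job_input.get("temperature", 0.7)

    try:
        outputs = pipe(
            prompt,
            max_new_tokens=max_tokens,
            temperature=temperature,
            do_sample=True,
        )
        return {"generated_text": outputs[0]["generated_text"]}
    except Exception as err:  # e.g. out-of-memory or invalid inputs
        return {"error": str(err)}


# Register the handler and start the worker's event loop.
runpod.serverless.start({"handler": handler})
```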
Understanding the handler
The handler is divided into four main sections: configuration, path resolution, model loading, and request handling. Let's examine each part.
Configuration and offline mode
MODEL_ID specifies which Hugging Face model to load (configurable via environment variable, or using the "Model" endpoint setting), and HF_CACHE_ROOT points to where Runpod stores cached models. When you enable model caching on your endpoint, Runpod automatically downloads the model to this location before your worker starts.
Setting HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE to "1" forces the Hugging Face libraries into offline mode. This is a safety measure that prevents the worker from accidentally downloading models at runtime, which would defeat the purpose of caching. If the cached model isn't found, the worker fails immediately with a clear error rather than silently downloading gigabytes of data.
Path resolution
Cached models use a specific directory structure. A model like microsoft/Phi-3-mini-4k-instruct is stored under a models--microsoft--Phi-3-mini-4k-instruct directory, with the actual files in a snapshots subdirectory keyed by commit hash.
The resolve_snapshot_path() function navigates this structure to find the actual model files. It first tries to read the refs/main file, which contains the commit hash that the "main" branch points to. This is the most reliable method because it matches exactly what Hugging Face would load if you called from_pretrained() with network access.
If refs/main doesn't exist (which can happen with older cache formats), the function falls back to listing the snapshots directory and using the first available snapshot. This ensures compatibility with different caching scenarios.
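For reference, here's a sketch of that cache layout. The cache root shown is a placeholder; the structure beneath it follows the standard Hugging Face hub cache format.

```text
<HF_CACHE_ROOT>/
└── models--microsoft--Phi-3-mini-4k-instruct/
    ├── refs/
    │   └── main              # text file containing the commit hash for "main"
    └── snapshots/
        └── <commit hash>/    # config.json, tokenizer files, and model weights
```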
Model loading
Model loading happens at the module level, outside any function. This means it runs once when the worker starts, not on every request. The model stays in GPU memory and gets reused across all incoming jobs, which is essential for performance.
The local_files_only=True parameter provides an additional layer of safety alongside offline mode. The device_map="auto" setting lets the Accelerate library automatically place model layers across available GPUs, and torch_dtype="auto" uses the model's native precision (typically float16 or bfloat16) to minimize memory usage.
Finally, wrapping the model and tokenizer in a pipeline provides a convenient high-level interface for text generation that handles tokenization, generation, and decoding in a single call.
Request handling
The handler function is what your worker uses to process each incoming request. The job parameter is a dictionary containing the request data, with user inputs nested under the "input" key. The handler extracts parameters with sensible defaults: if a user doesn't specify max_tokens, they get 256; if they don't specify temperature, they get 0.7.
The pipeline outputs a list of dictionaries (one per input sequence). Since we're processing a single prompt, we take outputs[0]["generated_text"] to get the generated string. The handler returns a dictionary that becomes the output field in the API response.
The try/except block around generation catches any errors (out of memory, invalid inputs, etc.) and returns them in a structured format rather than crashing the worker.
The final line registers the handler function with the Runpod SDK and starts the worker's event loop, which polls for jobs and dispatches them to your handler.
Step 2: Create the requirements file
Create a requirements.txt file to specify the Python dependencies for your worker.
requirements.txt
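A minimal set of dependencies for the handler above might look like the following sketch; pin versions as needed, and note that torch may already be provided by your base image.

```text
runpod
transformers
accelerate
torch
```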
Step 3: Create a Dockerfile
Create a Dockerfile to package your handler into a container image.
Dockerfile
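A sketch of a Dockerfile that could work for this setup. The base image tag is an example; pick one that matches your CUDA and PyTorch requirements.

```dockerfile
# Example base image -- swap in the tag that matches your CUDA/PyTorch needs.
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Install Python dependencies first to take advantage of Docker layer caching.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the handler and start the Runpod worker.
COPY handler.py .
CMD ["python", "-u", "handler.py"]
```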
Step 4: Set up your GitHub repository
Create a GitHub repository with your handler, requirements, and Dockerfile.
- Create a new repository on GitHub (for example, phi3-cached-worker).
- Add your files to the repository and push them to GitHub, replacing YOUR_USERNAME with your GitHub username (see the sketch below).
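A typical sequence might look like this sketch, assuming the repository name above and that your files live in the current directory:

```bash
git init
git add handler.py requirements.txt Dockerfile
git commit -m "Add Phi-3 cached worker"
git branch -M main
git remote add origin https://github.com/YOUR_USERNAME/phi3-cached-worker.git
git push -u origin main
```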
Step 5: Deploy from GitHub
Deploy your worker directly from GitHub.
- Navigate to the Serverless section and select New Endpoint.
- Under Import Git Repository, select your phi3-cached-worker repository.
- Configure deployment options:
  - Branch: Select main (or your preferred branch).
  - Dockerfile Path: Leave as default if the Dockerfile is in the root.
  - Select Next.
- Configure endpoint settings:
  - Endpoint Name: Choose a descriptive name (for example, "phi3-cached-inference").
  - Endpoint Type: Make sure it's set to Queue.
  - GPU Configuration: Select one or more GPU types with at least 16GB VRAM.
  - Workers: Leave the defaults in place (minimum: 0, maximum: 3).
  - Container Disk: Allocate at least 20 GB (or more if you're using a larger model).
- Enable cached models:
  - Scroll to the Model section.
  - Enter the model name microsoft/Phi-3-mini-4k-instruct, or your preferred model that's available on Hugging Face.
  - (Optional) If using a gated model, add your Hugging Face token.
- Select Deploy Endpoint.
Step 6: Test your endpoint
Once deployed, send requests to your endpoint using the Runpod API. Replace YOUR_ENDPOINT_ID with your actual endpoint ID.
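Below is a minimal sketch of a request in Python and with cURL. The input fields (prompt, max_tokens, temperature) match the handler above, and YOUR_API_KEY is a placeholder for your Runpod API key.

Python

```python
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"  # your endpoint ID
API_KEY = "YOUR_API_KEY"          # your Runpod API key

# Send a synchronous request to the endpoint's queue and wait for the result.
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "input": {
            "prompt": "Explain serverless GPU inference in one paragraph.",
            "max_tokens": 256,
            "temperature": 0.7,
        }
    },
)

print(response.json())
```

cURL

```bash
curl -X POST "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "Explain serverless GPU inference in one paragraph.", "max_tokens": 256, "temperature": 0.7}}'
```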
Benefits of using cached models
By using Runpod’s cached model feature in this tutorial, you gain several advantages:
- Faster cold starts: Workers start in seconds instead of minutes.
- Cost savings: No billing during model download time.
- Simplified deployment: Models are automatically available to all workers.
- Better scalability: Quick worker scaling without waiting for downloads.
Next steps
Now that you have a working Phi-3 endpoint with cached models, you can:
- Experiment with different Phi model variants (Phi-3-medium, Phi-3.5, etc.).
- Add more sophisticated prompt templates and chat formatting.
- Implement streaming responses for real-time generation.
- Integrate with existing applications using the Runpod SDK.