Quick Start
This article demonstrates how to deploy a large language model with vLLM on HyperAI through a practical example. We will deploy DeepSeek-R1-Distill-Qwen-1.5B, a lightweight model distilled from DeepSeek-R1 onto the Qwen architecture.
Model Introduction
DeepSeek-R1-Distill-Qwen-1.5B is a lightweight Chinese-English bilingual dialogue model:
- 1.5B parameters; can be deployed on a single GPU
- Minimum VRAM requirement: 3GB
- Recommended VRAM configuration: 4GB and above
Development and Testing in Model Training
Create a New Model Training
- Select RTX 5090 compute power
- Select vLLM 0.16.0-2204-gpu image
- In data binding, select the DeepSeek-R1-Distill-Qwen-1.5B model and bind it to /openbayes/input/input0
Prepare the Startup Script start.sh
After the container starts, create the following start.sh script. This script automatically detects the number of GPUs and dynamically configures the service port based on the runtime environment (model training or model deployment).
#!/bin/bash
# Get GPU count
GPU_COUNT=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
# Set port: model deployment exposes port 80 by default, while model training exposes port 8080
PORT=8080
if [ -n "$OPENBAYES_SERVING_PRODUCTION" ]; then
  PORT=80
fi
# Start vLLM service
echo "Starting vLLM service..."
vllm serve /openbayes/input/input0 \
  --served-model-name DeepSeek-R1-Distill-Qwen-1.5B \
  --disable-log-requests \
  --trust-remote-code \
  --host 0.0.0.0 --port $PORT \
  --gpu-memory-utilization 0.98 \
  --max-model-len 8192 --enable-prefix-caching \
  --tensor-parallel-size $GPU_COUNT
Test the Service in the Container
Run the following command to start the vLLM service:
bash start.sh
After the service starts, you can use the following curl command to test the model inference functionality:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
      {"role": "user", "content": "Please explain what a large language model is"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
Open a new terminal in JupyterLab and run the curl command above to test.
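The same request can also be issued from Python. Below is a minimal sketch using only the standard library; the helper names (build_chat_request, send) and the BASE_URL constant are illustrative, not part of the platform or vLLM:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: model training environment port

def build_chat_request(prompt: str,
                       model: str = "DeepSeek-R1-Distill-Qwen-1.5B",
                       temperature: float = 0.7,
                       max_tokens: int = 100) -> dict:
    """Build an OpenAI-compatible chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def send(body: dict) -> dict:
    """POST the body to the vLLM chat completions endpoint and parse the JSON reply."""
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires the service to be running):
# reply = send(build_chat_request("Please explain what a large language model is"))
# print(reply["choices"][0]["message"]["content"])
```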
Note
Port 8080 is used when testing in the model training environment, while the model deployment environment automatically switches to port 80. This is a standard requirement for HyperAI model deployment services: all deployment services must provide external access through port 80.
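If you write your own launcher instead of start.sh, the same environment check can be reproduced by reading the OPENBAYES_SERVING_PRODUCTION variable. A minimal Python sketch (the function name is illustrative):

```python
import os

def select_port(env=os.environ) -> int:
    """Return 80 in the model deployment environment, 8080 during model training.

    Mirrors the start.sh logic: the deployment environment is detected by
    OPENBAYES_SERVING_PRODUCTION being set to a non-empty value.
    """
    return 80 if env.get("OPENBAYES_SERVING_PRODUCTION") else 8080
```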
Deploy Model Service
After completing development and testing, you can convert the model into a production-ready deployment service through the following two methods:
Method 1: One-Click Deployment (Recommended)
HyperAI provides a "One-Click Deployment" feature that directly converts model training into a model deployment service without the need for repeated configuration.
Using the One-Click Deployment Feature
- On the model training details page, click the dropdown menu next to the "Start" button in the upper-right corner and select "Create New Deployment"
- Confirm the deployment configuration information (the system will automatically inherit the training container's configuration)
- Click "Confirm Deployment"; the system will automatically create the corresponding model deployment service
Deployment Configuration Confirmation
The system automatically inherits the following configurations:
- Computing resources
- Base image
- Workspace data
- Data binding relationships
You can adjust configurations as needed on the confirmation page.
Deployment Success
After submission, the system automatically creates the model deployment and starts the service. Upon success, it redirects to the deployment details page where you can immediately use the online testing tool to verify the interface.
Method 2: Manually Create Model Deployment
If you need to configure the deployment environment more flexibly, or create a new model deployment from scratch, follow these steps:
Configure Computing Power, Image, and Data Binding
- Select RTX 5090 computing power
- Select vLLM 0.16.0-2204-gpu image
- In data binding, select the DeepSeek-R1-Distill-Qwen-1.5B model and bind it to /hyperai/input/input0
- Bind the training container's workspace to /hyperai/home
Launch Deployment
Click the "Deploy" button and wait for the model deployment status to change to "Running".
Click on the running model deployment version to view the deployment details and runtime logs.
Online Testing
The model deployment details page provides an online testing tool that lets you compose and send HTTP requests visually in the web interface, so you can quickly verify the model's endpoints without a local command line or third-party tools.
Key features:
- Select request method (GET, POST, etc.)
- Fill in interface path and parameters
- Customize request headers and request body (supports JSON and other formats)
- View response content and response headers in real-time
- Streaming output support, for observing the model's incremental generation
GET Request Example
Used to retrieve model information or perform health checks. Select the GET method, fill in the interface path (e.g., /v1/models), and click "Send" to view the model list or status.
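The /v1/models response follows the OpenAI list format. Below is a small sketch of extracting the served model IDs from such a response; the sample payload is illustrative, not captured from a live server:

```python
def list_model_ids(response: dict) -> list:
    """Return the id of every model in an OpenAI-style /v1/models list response."""
    return [m["id"] for m in response.get("data", [])]

# Illustrative sample of the response shape:
sample = {
    "object": "list",
    "data": [{"id": "DeepSeek-R1-Distill-Qwen-1.5B", "object": "model"}],
}
print(list_model_ids(sample))  # ['DeepSeek-R1-Distill-Qwen-1.5B']
```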
POST Request Example
Used to have conversations with large language models. Select the POST method, fill in the path /v1/chat/completions, enter the conversation content in the request body (as shown below), and click "Send" to experience model inference.
{
  "model": "qwen3-32b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather like in Beijing?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g., San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
Streaming Call Example
Used to experience large model streaming inference. Add the "stream": true field to the POST request body, and after sending the request, you can view the model's progressive output in real-time, suitable for scenarios that require consuming results while generating.
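When "stream": true is set, the OpenAI-compatible server returns Server-Sent Events: each line is prefixed with "data: " and the stream ends with "data: [DONE]". If you consume the stream from code rather than the online tool, the deltas can be assembled like this (a minimal sketch assuming the standard chunk format; collect_stream is an illustrative helper):

```python
import json

def collect_stream(lines) -> str:
    """Concatenate the content deltas from a streamed chat completion.

    Expects OpenAI-style SSE lines: 'data: {json chunk}' terminated by
    'data: [DONE]'. Chunks without a content delta (e.g. the initial
    role-only chunk) are skipped.
    """
    text = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Illustrative chunks, not actual server output:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # Hello
```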
{
  "model": "qwen3-32b",
  "stream": true,
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How is the weather in Beijing?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g., San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
Command Line Testing
If you prefer using command line tools (such as curl) for interface testing, refer to the following method:
Obtain the service URL generated by HyperAI from the model deployment page, then use the following command to test model availability:
curl -X POST http://<model deployment url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
      {"role": "user", "content": "Hello, please introduce yourself"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
Next Steps
- Learn more about managing model deployments
- Check out the vLLM official documentation