
Quick Start

This article will demonstrate how to deploy a large language model using vLLM on HyperAI through a practical example. We will deploy the DeepSeek-R1-Distill-Qwen-1.5B model, which is a lightweight model based on Qwen.

Model Introduction

DeepSeek-R1-Distill-Qwen-1.5B is a lightweight Chinese-English bilingual dialogue model:

  • 1.5B parameters, deployable on a single GPU
  • Minimum VRAM requirement: 3 GB
  • Recommended VRAM: 4 GB or more
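The 3 GB minimum roughly follows from the weight size alone: about 1.5 billion parameters at 2 bytes each (FP16/BF16), before KV-cache and activation overhead. A quick back-of-the-envelope check:

```shell
# 1.5e9 parameters * 2 bytes per parameter (FP16/BF16) -> weight size in GB
PARAMS=1500000000
BYTES_PER_PARAM=2
echo "$((PARAMS * BYTES_PER_PARAM / 1000000000)) GB"   # prints "3 GB"
```

The recommended 4 GB leaves headroom for the KV cache and runtime buffers.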

Development and Testing in Model Training

Create a New Model Training

  • Select RTX 5090 compute power
  • Select vLLM 0.16.0-2204-gpu image
  • In data binding, select the DeepSeek-R1-Distill-Qwen-1.5B model and bind it to /openbayes/input/input0

Prepare the Startup Script start.sh

After the container starts, create the following start.sh script. This script automatically detects the number of GPUs and dynamically configures the service port based on the runtime environment (model training or model deployment).

start.sh
#!/bin/bash

# Get GPU count
GPU_COUNT=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)

# Set the port: model deployment exposes port 80, model training exposes port 8080
PORT=8080
if [ -n "$OPENBAYES_SERVING_PRODUCTION" ]; then
    PORT=80
fi

# Start vLLM service
echo "Starting vLLM service..."
vllm serve /openbayes/input/input0 \
    --served-model-name DeepSeek-R1-Distill-Qwen-1.5B \
    --disable-log-requests \
    --trust-remote-code \
    --host 0.0.0.0 --port $PORT \
    --gpu-memory-utilization 0.98 \
    --max-model-len 8192 --enable-prefix-caching \
    --tensor-parallel-size $GPU_COUNT

Test the Service in the Container

Run the following command to start the vLLM service:

bash start.sh

After the service starts, you can use the following curl command to test the model inference functionality:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
        {
            "role": "user",
            "content": "请用中文解释什么是大语言模型"
        }
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Open a new terminal in JupyterLab and execute the above curl command to test the service.
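The endpoint returns an OpenAI-compatible JSON response. As a quick sketch, using a hypothetical sample payload, the assistant's reply can be extracted from the shell without extra dependencies; in practice, pipe the curl output into the same one-liner:

```shell
# Hypothetical sample of the OpenAI-compatible response shape returned by vLLM
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"A large language model is ..."}}]}'

# Extract the assistant reply with python3 (present in most images);
# in practice: curl ... | python3 -c '...'
echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```

If `jq` is installed, `curl ... | jq -r '.choices[0].message.content'` achieves the same result.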

Note

Port 8080 is used when testing in the model training environment; the model deployment environment automatically switches to port 80. This is a standard requirement for HyperAI model deployment services: every deployment service must provide external access through port 80.

Deploy Model Service

After completing development and testing, you can convert the model into a production-ready deployment service through the following two methods:

HyperAI provides a "One-Click Deployment" feature that directly converts model training into a model deployment service without the need for repeated configuration.

Method 1: Use the One-Click Deployment Feature

  1. On the model training details page, click the dropdown menu next to the "Start" button in the upper-right corner and select "Create New Deployment"
  2. Confirm the deployment configuration information (the system will automatically inherit the training container's configuration)
  3. Click "Confirm Deployment", and the system will automatically create the corresponding model deployment service

Deployment Configuration Confirmation

The system automatically inherits the following configurations:

  • Computing resources
  • Base image
  • Workspace data
  • Data binding relationships

You can adjust configurations as needed on the confirmation page.

Deployment Success

After submission, the system automatically creates the model deployment and starts the service. Upon success, it redirects to the deployment details page where you can immediately use the online testing tool to verify the interface.

Method 2: Manually Create Model Deployment

If you need to configure the deployment environment more flexibly, or create a new model deployment from scratch, follow these steps:

Configure Computing Power, Image, and Data Binding

  • Select RTX 5090 computing power
  • Select vLLM 0.16.0-2204-gpu image
  • In data binding, select the DeepSeek-R1-Distill-Qwen-1.5B model and bind it to /hyperai/input/input0
  • Bind the training container's workspace to /hyperai/home

Launch Deployment

Click the "Deploy" button and wait for the model deployment status to change to "Running".

Click on the running model deployment version to view the deployment details and runtime logs.

Online Testing

The model deployment details page provides an online testing tool for composing and sending HTTP requests directly in the web interface, so you can quickly verify the model API without a local command line or third-party tools.

Key features:

  • Select request method (GET, POST, etc.)
  • Fill in interface path and parameters
  • Customize request headers and request body (supports JSON and other formats)
  • View response content and response headers in real-time
  • Support streaming output to experience large model streaming inference

GET Request Example

Used to retrieve model information or perform health checks. Select the GET method, fill in the interface path (e.g., /v1/models), and click "Send" to view the model list or status.
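As a command-line equivalent of the same check, the model list can also be fetched with curl; replace the placeholder with the service URL shown on the deployment details page:

```shell
# List the models served by the deployment
curl http://<model deployment url>/v1/models
```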

POST Request Example

Used to have conversations with large language models. Select the POST method, fill in the path /v1/chat/completions, enter the conversation content in the request body (as shown below), and click "Send" to experience model inference.

{
  "model": "DeepSeek-R1-Distill-Qwen-1.5B",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What's the weather like in Beijing?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g., San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

Streaming Call Example

Used to experience streaming inference. Add the "stream": true field to the POST request body; after sending the request, the model's output appears incrementally in real time, which suits scenarios that consume results while they are being generated.

{
  "model": "DeepSeek-R1-Distill-Qwen-1.5B",
  "stream": true,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "How is the weather in Beijing?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g., San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
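The same streaming request can be sketched from the command line with curl's `-N` flag, which disables response buffering. The server replies with Server-Sent Events, one `data:` line per chunk, ending with `data: [DONE]`; the URL placeholder is the one from the deployment page:

```shell
# Stream a chat completion; -N prints each SSE chunk as it arrives
curl -N -X POST http://<model deployment url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "stream": true,
    "messages": [{"role": "user", "content": "How is the weather in Beijing?"}]
  }'
```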

Command Line Testing

If you prefer using command line tools (such as curl) for interface testing, refer to the following method:

Obtain the service URL generated by HyperAI from the model deployment page, then use the following command to test model availability:

curl -X POST http://<model deployment url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
        {
            "role": "user",
            "content": "你好,请介绍一下自己"
        }
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
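As a small convenience sketch, availability can also be polled before running the test above. This assumes the vLLM server's `/health` endpoint, which returns HTTP 200 once the model has loaded:

```shell
URL="http://<model deployment url>"

# Poll the health endpoint until the service reports ready
until [ "$(curl -s -o /dev/null -w '%{http_code}' "$URL/health")" = "200" ]; do
    echo "Waiting for the vLLM service to come up..."
    sleep 5
done
echo "Service is ready."
```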

Next Steps