Quick Start
This article demonstrates how to deploy a large language model with vLLM on HyperAI through a practical example. We will deploy DeepSeek-R1-Distill-Qwen-1.5B, a lightweight model distilled from DeepSeek-R1 onto the Qwen architecture.
Model Introduction
DeepSeek-R1-Distill-Qwen-1.5B is a lightweight Chinese-English bilingual dialogue model:
- 1.5B parameters; can be deployed on a single GPU
- Minimum VRAM requirement: 3GB
- Recommended VRAM configuration: 4GB and above
Development and Testing in Model Training
Create a New Model Training
- Select RTX 5090 compute power
- Select vLLM 0.16.0-2204-gpu image
- In data binding, select the DeepSeek-R1-Distill-Qwen-1.5B model and bind it to /openbayes/input/input0
Prepare the Startup Script start.sh
After the container starts, create the following start.sh script. This script automatically detects the number of GPUs and dynamically configures the service port based on the runtime environment (model training or model deployment).
#!/bin/bash
# Get GPU count
GPU_COUNT=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
# Set port: model deployment exposes port 80 by default, while model training exposes port 8080
PORT=8080
if [ -n "$OPENBAYES_SERVING_PRODUCTION" ]; then
  PORT=80
fi
# Start vLLM service
echo "Starting vLLM service..."
vllm serve /openbayes/input/input0 \
  --served-model-name DeepSeek-R1-Distill-Qwen-1.5B \
  --disable-log-requests \
  --trust-remote-code \
  --host 0.0.0.0 --port $PORT \
  --gpu-memory-utilization 0.98 \
  --max-model-len 8192 --enable-prefix-caching \
  --tensor-parallel-size $GPU_COUNT
Test the Service in the Container
Run the following command to start the vLLM service:
bash start.sh
After the service starts, you can use the following curl command to test the model inference functionality:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
      {"role": "user", "content": "Please explain what a large language model is"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
Open a new terminal in JupyterLab and run the curl command above to test.
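The same request can also be issued from Python. Below is a minimal sketch using only the standard library; the helper names (build_chat_request, send) and the BASE_URL constant are illustrative, not part of the platform or vLLM:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: model training environment port

def build_chat_request(prompt: str,
                       model: str = "DeepSeek-R1-Distill-Qwen-1.5B",
                       temperature: float = 0.7,
                       max_tokens: int = 100) -> dict:
    """Build an OpenAI-compatible chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def send(body: dict) -> dict:
    """POST the body to the vLLM chat completions endpoint and parse the JSON reply."""
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires the service to be running):
# reply = send(build_chat_request("Please explain what a large language model is"))
# print(reply["choices"][0]["message"]["content"])
```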
Note
Port 8080 is used when testing in the model training environment, while the model deployment environment automatically switches to port 80. This is a standard requirement for HyperAI model deployment services: all deployment services must provide external access through port 80.
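If you write your own launcher instead of start.sh, the same environment check can be reproduced by reading the OPENBAYES_SERVING_PRODUCTION variable. A minimal Python sketch (the function name is illustrative):

```python
import os

def select_port(env=os.environ) -> int:
    """Return 80 in the model deployment environment, 8080 during model training.

    Mirrors the start.sh logic: the deployment environment is detected by
    OPENBAYES_SERVING_PRODUCTION being set to a non-empty value.
    """
    return 80 if env.get("OPENBAYES_SERVING_PRODUCTION") else 8080
```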
Deploy Model Service
After completing development and testing, you can convert the model into a production-ready deployment service through the following two methods:
Method 1: One-Click Deployment (Recommended)
HyperAI provides a "One-Click Deployment" feature that directly converts model training into a model deployment service without the need for repeated configuration.
Using the One-Click Deployment Feature
- On the model training details page, click the dropdown menu next to the "Start" button in the upper-right corner and select "Create New Deployment"
- Confirm the deployment configuration information (the system will automatically inherit the training container's configuration)
- Click "Confirm Deployment"; the system will automatically create the corresponding model deployment service
Deployment Configuration Confirmation
The system automatically inherits the following configurations:
- Computing resources
- Base image
- Workspace data
- Data binding relationships
You can adjust configurations as needed on the confirmation page.
Deployment Success
After submission, the system automatically creates the model deployment and starts the service. Upon success, it redirects to the deployment details page where you can immediately use the online testing tool to verify the interface.
Method 2: Manually Create Model Deployment
If you need to configure the deployment environment more flexibly, or create a new model deployment from scratch, follow these steps:
Configure Computing Power, Image, and Data Binding
- Select RTX 5090 computing power
- Select vLLM 0.16.0-2204-gpu image
- In data binding, select the DeepSeek-R1-Distill-Qwen-1.5B model and bind it to /hyperai/input/input0
- Bind the training container's workspace to /hyperai/home
Launch Deployment
Click the "Deploy" button and wait for the model deployment status to change to "Running".
Click on the running model deployment version to view the deployment details and runtime logs.
Online Testing
The model deployment details page provides an online testing tool that lets you compose and send HTTP requests visually in the web interface, so you can quickly verify the model's endpoints without a local command line or third-party tools.
Key features:
- Select request method (GET, POST, etc.)
- Fill in interface path and parameters
- Customize request headers and request body (supports JSON and other formats)
- View response content and response headers in real-time
- Streaming output support, for observing the model's incremental generation
GET Request Example
Used to retrieve model information or perform health checks. Select the GET method, fill in the interface path (e.g., /v1/models), and click "Send" to view the model list or status.
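The /v1/models response follows the OpenAI list format. Below is a small sketch of extracting the served model IDs from such a response; the sample payload is illustrative, not captured from a live server:

```python
def list_model_ids(response: dict) -> list:
    """Return the id of every model in an OpenAI-style /v1/models list response."""
    return [m["id"] for m in response.get("data", [])]

# Illustrative sample of the response shape:
sample = {
    "object": "list",
    "data": [{"id": "DeepSeek-R1-Distill-Qwen-1.5B", "object": "model"}],
}
print(list_model_ids(sample))  # ['DeepSeek-R1-Distill-Qwen-1.5B']
```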
POST Request Example
Used to have conversations with large language models. Select the POST method, fill in the path /v1/chat/completions, enter the conversation content in the request body (as shown below), and click "Send" to experience model inference.
{
  "model": "qwen3-32b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather like in Beijing?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g., San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
Streaming Call Example
Used to experience large model streaming inference. Add the "stream": true field to the POST request body, and after sending the request, you can view the model's progressive output in real-time, suitable for scenarios that require consuming results while generating.
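When "stream": true is set, the OpenAI-compatible server returns Server-Sent Events: each line is prefixed with "data: " and the stream ends with "data: [DONE]". If you consume the stream from code rather than the online tool, the deltas can be assembled like this (a minimal sketch assuming the standard chunk format; collect_stream is an illustrative helper):

```python
import json

def collect_stream(lines) -> str:
    """Concatenate the content deltas from a streamed chat completion.

    Expects OpenAI-style SSE lines: 'data: {json chunk}' terminated by
    'data: [DONE]'. Chunks without a content delta (e.g. the initial
    role-only chunk) are skipped.
    """
    text = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Illustrative chunks, not actual server output:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # Hello
```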
{
  "model": "qwen3-32b",
  "stream": true,
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How is the weather in Beijing?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g., San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
Command Line Testing
If you prefer using command line tools (such as curl) for interface testing, refer to the following method:
Obtain the service URL generated by HyperAI from the model deployment page, then use the following command to test model availability:
curl -X POST http://<model deployment url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
      {"role": "user", "content": "Hello, please introduce yourself"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
Next Steps
- Learn more about managing model deployments
- Check out the vLLM official documentation