
Godot Volcengine TTS

An asset by HuLunTunTao

Quick Information


Godot Volcengine TTS is an unofficial third-party Godot 4.4+ SDK for Volcengine Doubao TTS. It wraps Volcengine's public speech synthesis APIs and provides a high-level playback node, plus lower-level clients for bidirectional WebSocket streaming, unidirectional streaming, and HTTP synthesis. It can be used for live character dialogue, LLM-generated speech, UI prompts, cutscenes, and cached voice lines.

The GitHub repository includes a local test scene for trying the API directly and saving synthesized audio to local files. This is useful for pre-generating fixed voice lines during development, which avoids repeated synthesis at runtime. Note: the Godot Asset Library package only includes the addons/ directory, so it does not include the test scene, screenshots, or repository-level documentation. Clone the full GitHub repository if you want to use the test scene.

Documentation and source code: https://github.com/HuLunTunTao/godot_volcengine_tts

If you find this plugin useful, please consider starring the repository.

Supported Engine Version: 4.4
Version String: 0.2.0
License Version: MIT
Support Level: community
Modified Date: 1 month ago

Godot Volcengine TTS

An unofficial third-party Godot 4 client SDK for Volcengine Doubao TTS. It was originally extracted from a tactics RPG project that needed character dialogue, low-latency streamed speech, and cached voice lines.

Compliance notes, service terms, and disclaimers are kept in the repository root README.

Chinese documentation: README.zh-CN.md

Supported Endpoints

| Endpoint | Class | Use case | SSML | Output |
| --- | --- | --- | --- | --- |
| wss://openspeech.bytedance.com/api/v3/tts/bidirection | VolcengineTTSBidirectionalClient | LLM token streaming and high-level playback | No | PCM streaming |
| wss://openspeech.bytedance.com/api/v3/tts/unidirectional/stream | VolcengineTTSUnidirectionalClient | One full text request, streamed chunks | Yes | PCM/MP3/WAV/Opus chunks |
| https://openspeech.bytedance.com/api/v3/tts/unidirectional | VolcengineTTSHttpClient | Pre-generate or cache complete audio | Yes | Complete audio bytes |

VolcengineStreamingVoicePlayer owns all three clients and adds Godot playback on top of them.

Important naming note: in this addon, "one-shot streaming playback" means the high-level speak() helper. It currently uses the official bidirectional WebSocket endpoint, not /tts/unidirectional/stream. It starts a bidirectional session, feeds the full text once, sends FinishSession, and streams returned PCM bytes into AudioStreamGenerator. The real official unidirectional streaming endpoint is implemented separately as voice.uni_client.synthesize_streaming(...).

This is a deliberate maintenance tradeoff. speak() and true LLM token-streaming can share the same bidirectional session lifecycle, session-id filtering, stale-signal protection, cancellation semantics, PCM queue, and backpressure playback path. Keeping one high-level playback pipeline reduces the chance that stop/reentry/playback behavior diverges between "one-shot" and "token-streaming" modes. The unidirectional client remains available as a lower-level API for SSML streaming and custom chunk handling.

Volcengine updates available voices over time. Check the official voice list before choosing a voice_type:

https://www.volcengine.com/docs/6561/1257544

Installation

Copy this folder into your project:

addons/godot_volcengine_tts/

Then enable Godot Volcengine TTS in Project Settings > Plugins. The editor plugin only exists to satisfy Godot's addon format; runtime code uses the class_name scripts directly.

Quick Start

extends Node

func _ready() -> void:
    var voice := VolcengineStreamingVoicePlayer.new()
    voice.audio_bus = &"Master"
    add_child(voice)

    for client in [voice.bidi_client, voice.uni_client, voice.http_client]:
        client.api_key = "your-volcengine-api-key"
        client.resource_id = "seed-tts-2.0"
        client.default_model = "seed-tts-2.0-expressive"

    await voice.speak("Hello from Godot.", "zh_male_dayi_uranus_bigtts")

Configuration

Most projects only need to set api_key, resource_id, and optionally default_model on the clients they use:

for client in [voice.bidi_client, voice.uni_client, voice.http_client]:
    client.api_key = "your-volcengine-api-key"
    client.resource_id = "seed-tts-2.0"
    client.user_uid = "player-or-device-id"
    client.default_model = "seed-tts-2.0-expressive"

You can also override the host or API path. This is useful for private gateways, reverse proxies, compatible endpoints, or future Volcengine path changes:

voice.bidi_client.base_url = "openspeech.bytedance.com"
voice.bidi_client.path = "/api/v3/tts/bidirection"

voice.uni_client.base_url = "your-gateway.example.com"
voice.uni_client.path = "/api/v3/tts/unidirectional/stream"

voice.http_client.base_url = "your-gateway.example.com"
voice.http_client.path = "/api/v3/tts/unidirectional"

base_url is the host only. Do not include https://, wss://, or a trailing path. The clients add the protocol themselves: WebSocket clients use wss://, and the HTTP client connects with TLS on port 443.
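If the host comes from a config file or user input, it may arrive as a full URL. The helper below is a minimal sketch of trimming such a value down to the bare host before assigning it to base_url. The function name to_bare_host is hypothetical and not part of the addon.

```gdscript
# Hypothetical helper (not part of the addon): reduces a URL-like string
# to the bare host that base_url expects.
static func to_bare_host(url: String) -> String:
    var host := url.strip_edges()
    # Drop a scheme such as "https://" or "wss://" if present.
    var scheme_end := host.find("://")
    if scheme_end != -1:
        host = host.substr(scheme_end + 3)
    # Drop any path that follows the host.
    var slash := host.find("/")
    if slash != -1:
        host = host.substr(0, slash)
    return host
```

For example, voice.uni_client.base_url = to_bare_host("wss://your-gateway.example.com/api/v3/tts/unidirectional/stream") would assign "your-gateway.example.com".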

Other useful runtime knobs:

| Property | Owner | Default | Notes |
| --- | --- | --- | --- |
| audio_bus | VolcengineStreamingVoicePlayer | &"Master" | Godot audio bus for high-level playback |
| sample_rate | VolcengineStreamingVoicePlayer | 24000 | Default PCM playback sample rate for speak() and start_streaming() |
| buffer_length | VolcengineStreamingVoicePlayer | 0.5 | AudioStreamGenerator buffer length |
| auto_context_chain | VolcengineStreamingVoicePlayer | false | Reuses the previous session id as section_id |
| connect_timeout_msec | all clients | 8000 | WebSocket/HTTP connection timeout |
| session_timeout_msec | WS clients | 20000 | Inactivity timeout while waiting for WS packets |
| read_timeout_msec | HTTP client | 30000 | Inactivity timeout while reading HTTP chunks |

API Quick Reference

| API | Endpoint | Result |
| --- | --- | --- |
| voice.speak(text, voice_id, opts) | Bidirectional WS | Plays streamed PCM; returns true on natural completion |
| voice.start_streaming(voice_id, opts) / feed_text(chunk) / finish_streaming() | Bidirectional WS | Plays streamed PCM from incremental text |
| voice.fetch_audio(text, voice_id, opts) | HTTP chunked | Returns complete audio bytes |
| voice.uni_client.synthesize_streaming(text, voice_id, on_chunk, opts, out_session) | Unidirectional WS | Calls on_chunk for every streamed audio chunk |
| voice.stop() | Active high-level playback | Cancels playback and wakes waiters with false |
| voice.current_session_id() | High-level player | Returns the last completed bidirectional session id |

Signals:

  • voice.speak_finished: emitted when high-level playback ends, fails, or is stopped.
  • voice.bidi_client.audio_chunk_received(session_id, chunk): lower-level bidirectional audio event.
  • voice.bidi_client.session_finished(session_id) and voice.bidi_client.session_failed(session_id, reason): lower-level bidirectional completion events.
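A minimal sketch of wiring these signals up is shown below. The handler names are illustrative, and it assumes speak_finished carries no arguments, which this page does not confirm. Handler parameters are left untyped because the argument types are not documented here.

```gdscript
extends Node

var voice: VolcengineStreamingVoicePlayer

func _ready() -> void:
    voice = VolcengineStreamingVoicePlayer.new()
    add_child(voice)
    # Assumption: speak_finished emits no arguments.
    voice.speak_finished.connect(_on_speak_finished)
    voice.bidi_client.audio_chunk_received.connect(_on_audio_chunk)
    voice.bidi_client.session_failed.connect(_on_session_failed)

func _on_speak_finished() -> void:
    print("Playback ended (completed, failed, or stopped).")

func _on_audio_chunk(session_id, chunk) -> void:
    print("Session %s delivered %d bytes." % [session_id, chunk.size()])

func _on_session_failed(session_id, reason) -> void:
    push_warning("Session %s failed: %s" % [session_id, reason])
```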

Use Case A: One-Shot Streaming Playback

var voice := VolcengineStreamingVoicePlayer.new()
add_child(voice)

voice.bidi_client.api_key = "..."
voice.bidi_client.resource_id = "seed-tts-2.0"

var ok := await voice.speak("Hold the bridge.", "zh_male_dayi_uranus_bigtts", {
    "emotion": "happy",
    "emotion_scale": 4,
    "speech_rate": 10,
})

speak() returns true when playback completes naturally and false when it fails or is interrupted by stop() or a later request.

Implementation details:

  • Uses VolcengineTTSBidirectionalClient.
  • Forces format = "pcm" because AudioStreamGenerator consumes PCM frames.
  • Uses sample_rate = voice.sample_rate unless overridden in opts.
  • Runs start_session(voice, opts), feed_text(text), then finish_session().
  • Queues PCM chunks and pushes them with backpressure into AudioStreamGeneratorPlayback.

This path does not support SSML because the underlying bidirectional protocol does not support SSML.

Use Case B: True Bidirectional Streaming

await voice.start_streaming("zh_male_dayi_uranus_bigtts")

for chunk in ["The bridge is ready. ", "Hold the line."]:
    voice.feed_text(chunk)
    await get_tree().create_timer(0.3).timeout

voice.finish_streaming()
await voice.speak_finished

Use this when text arrives incrementally from an LLM or other generator. All chunks share one server session, so prosody can remain more continuous than separate requests.

Use Case C: Official Unidirectional Streaming

var rate := 24000
var generator := AudioStreamGenerator.new()
generator.mix_rate = float(rate)
generator.buffer_length = 0.5

var player := AudioStreamPlayer.new()
add_child(player)
player.stream = generator
player.play()
var playback := player.get_stream_playback() as AudioStreamGeneratorPlayback

var out_session := {}
var ok := await voice.uni_client.synthesize_streaming(
    "Hello from the unidirectional endpoint.",
    "zh_male_dayi_uranus_bigtts",
    func(chunk: PackedByteArray) -> void:
        # Convert signed 16-bit little-endian PCM into Vector2 frames here.
        pass,
    {"format": "pcm", "sample_rate": rate},
    out_session,
)

This is the official /api/v3/tts/unidirectional/stream path. It sends one SendText packet containing the full request JSON, then receives TTSResponse audio frames until SessionFinished.

Use this path directly when you need SSML with streamed audio or when you want to own playback, buffering, or byte handling yourself.
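When you own playback yourself, the PCM bytes have to be turned into AudioStreamGenerator frames. The sketch below shows one way to do that conversion. It assumes mono, signed 16-bit little-endian samples, and it copies each sample to both stereo channels. The names _carry and pcm16le_to_frames are illustrative, not part of the addon.

```gdscript
# Holds a dangling byte when a 16-bit sample is split across two chunks.
var _carry := PackedByteArray()

# Converts mono signed 16-bit little-endian PCM into stereo frames in [-1, 1).
func pcm16le_to_frames(chunk: PackedByteArray) -> PackedVector2Array:
    var data := _carry.duplicate()
    data.append_array(chunk)
    @warning_ignore("integer_division")
    var sample_count := data.size() / 2
    var frames := PackedVector2Array()
    frames.resize(sample_count)
    for i in sample_count:
        # decode_s16() reads a little-endian signed 16-bit integer.
        var s := data.decode_s16(i * 2) / 32768.0
        frames[i] = Vector2(s, s)
    # Keep an odd trailing byte for the next chunk.
    _carry = data.slice(sample_count * 2)
    return frames
```

Inside on_chunk, the result can be handed to playback.push_buffer(frames) once playback.can_push_buffer(frames.size()) returns true. Frames that do not fit yet need to be queued by the caller.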

Use Case D: Complete Audio Bytes

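var mp3 := await voice.fetch_audio("Welcome back, commander.", "zh_male_dayi_uranus_bigtts", {
    "format": "mp3",
})

# FileAccess.open() returns null on failure, so check before writing.
var file := FileAccess.open("user://welcome.mp3", FileAccess.WRITE)
if file == null:
    push_error("Cannot open user://welcome.mp3: %s" % error_string(FileAccess.get_open_error()))
else:
    file.store_buffer(mp3)
    file.close()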

fetch_audio() calls the HTTP chunked endpoint and returns one PackedByteArray. The HTTP path is useful for pre-generated dialogue, persistent caches, cutscenes, and UI prompts.
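For a persistent cache, one option is to key each file by a hash of the voice and text so a line is only synthesized once. The sketch below is one possible shape, not the addon's own caching. CACHE_DIR, cache_path, and fetch_cached are hypothetical names, and it assumes fetch_audio() returns an empty array on failure.

```gdscript
extends Node

const CACHE_DIR := "user://tts_cache"  # hypothetical cache location

# Builds a stable file path from the voice and the text.
static func cache_path(text: String, voice_id: String, ext: String = "mp3") -> String:
    var key := (voice_id + "|" + text).sha256_text()
    return "%s/%s.%s" % [CACHE_DIR, key, ext]

# Returns cached bytes when present, otherwise synthesizes and stores them.
func fetch_cached(voice: VolcengineStreamingVoicePlayer, text: String, voice_id: String) -> PackedByteArray:
    var path := cache_path(text, voice_id)
    if FileAccess.file_exists(path):
        return FileAccess.get_file_as_bytes(path)
    var bytes: PackedByteArray = await voice.fetch_audio(text, voice_id, {"format": "mp3"})
    if bytes.is_empty():
        return bytes  # assumption: an empty array signals failure
    DirAccess.make_dir_recursive_absolute(CACHE_DIR)
    var file := FileAccess.open(path, FileAccess.WRITE)
    if file != null:
        file.store_buffer(bytes)
        file.close()
    return bytes
```

Callers use it as var bytes := await fetch_cached(voice, "Welcome back, commander.", voice_id). If options such as emotion or speech_rate vary between calls, they would need to be part of the hash key as well.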

Context Chaining

voice.auto_context_chain = true
await voice.speak("First line.", voice_id)
await voice.speak("Second line.", voice_id)
voice.reset_context_chain()

When auto_context_chain is enabled, consecutive speak() or finish_streaming() completions save the returned bidirectional session_id. The next request passes it as section_id unless you already provided one. This is intended for TTS 2.0 continuity. Call reset_context_chain() when the speaker, scene, or dialogue context changes.

You can also pass context manually:

await voice.speak("Second line.", voice_id, {
    "context_texts": ["Speak with a proud but restrained tone."],
    "section_id": previous_session_id,
})

Options

opts is a flat Dictionary. TtsOptions.build_req_params() moves each key into the Volcengine req_params structure.

| Key | Type | Notes |
| --- | --- | --- |
| format | String | "mp3" / "pcm" / "wav" / "ogg_opus" |
| sample_rate | int | Common values include 16000, 24000, 44100, 48000 |
| bit_rate | int | MP3 only |
| emotion | String | Emotion for supported voices |
| emotion_scale | int | Commonly 1-5 |
| speech_rate | int | Commonly -50 to 100 |
| loudness_rate | int | Commonly -50 to 100 |
| model | String | For example "seed-tts-2.0-expressive" |
| ssml | String | Full <speak>...</speak>, uni/HTTP only |
| context_texts | Array[String] | TTS 2.0 context hints |
| section_id | String | Previous session id for context continuation |
| silence_duration | int | Tail silence in milliseconds |
| disable_markdown_filter | bool | Passes through Volcengine additions |
| explicit_language | String | For example "zh-cn" / "en" / "ja" |
| enable_subtitle | bool | Server may send subtitle/timestamp frames, but this addon does not expose callbacks yet |

Escape hatches:

await voice.speak("Line.", voice_id, {
    "audio_params_extra": {"future_audio_field": "value"},
    "additions_extra": {"with_frontend_text": true},
    "raw_req_params": {
        "speaker": voice_id,
        "audio_params": {"format": "pcm", "sample_rate": 24000},
    },
})

raw_req_params bypasses all merging. Use it only when Volcengine adds fields before this addon has first-class names for them. For bidirectional speak(), text is still sent later through feed_text(), so the start-session raw_req_params normally should not include text.

Protocol Notes

Common request headers:

  • X-Api-Key: <api_key>
  • X-Api-Resource-Id: <resource_id>
  • X-Api-Connect-Id: <uuid> for WebSocket connections
  • X-Api-Request-Id: <uuid> for HTTP requests
  • X-Control-Require-Usage-Tokens-Return: *

Bidirectional WebSocket flow:

  1. Connect to wss://<base_url>/api/v3/tts/bidirection.
  2. Send StartConnection event 1; wait for ConnectionStarted event 50.
  3. Send StartSession event 100 with namespace = "BidirectionalTTS" and req_params.
  4. Send one or more TaskRequest event 200 packets with text chunks.
  5. Send FinishSession event 102.
  6. Receive audio frames, then SessionFinished event 152, or SessionFailed event 153.
  7. Send FinishConnection event 2 best-effort and close.

Bidirectional client packets use a binary frame header [0x11, 0x14, 0x10, 0x00], followed by an event number, optional session id, and JSON payload.
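The frame layout described above can be sketched as follows. The field widths are assumptions, not taken from this page: a 32-bit big-endian event number, then length-prefixed (32-bit big-endian) session id and JSON payload. The function name build_event_frame is illustrative and is not the addon's API.

```gdscript
# Sketch of a client frame: header, event number, optional session id, JSON payload.
# Assumption: all integers are 32-bit big-endian and strings are length-prefixed.
static func build_event_frame(event: int, payload: Dictionary, session_id: String = "") -> PackedByteArray:
    var buf := StreamPeerBuffer.new()
    buf.big_endian = true
    buf.put_data(PackedByteArray([0x11, 0x14, 0x10, 0x00]))
    buf.put_32(event)
    # Connection-level events (for example StartConnection) carry no session id.
    if not session_id.is_empty():
        var sid := session_id.to_utf8_buffer()
        buf.put_u32(sid.size())
        buf.put_data(sid)
    var body := JSON.stringify(payload).to_utf8_buffer()
    buf.put_u32(body.size())
    buf.put_data(body)
    return buf.data_array
```

Under these assumptions, build_event_frame(1, {}) yields a 14-byte StartConnection frame: the 4-byte header, the event number 1, a payload length of 2, and the bytes of "{}".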

Unidirectional WebSocket flow:

  1. Connect to wss://<base_url>/api/v3/tts/unidirectional/stream.
  2. Send one SendText packet with header [0x11, 0x10, 0x10, 0x00]. This packet has no event number and contains the full request JSON.
  3. Receive sentence start 350, audio response 352, sentence end 351, and session finished 152 events.
  4. Send FinishConnection event 2 with header [0x11, 0x14, 0x10, 0x00] and close.

HTTP flow:

  1. POST JSON to https://<base_url>/api/v3/tts/unidirectional.
  2. Read a chunked response where each line is JSON.
  3. Append base64-decoded data from code == 0 lines.
  4. Stop when code == 20000000.
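The HTTP steps above can be sketched as a small line-buffered parser. This is an illustration of the technique, not the addon's implementation. The variable and function names are made up, and only the code and data fields mentioned above are used.

```gdscript
var audio := PackedByteArray()  # decoded audio collected so far
var done := false               # true once the end-of-stream code arrives
var _pending := ""              # text received but not yet terminated by a newline

# Feed each decoded text chunk of the response body as it arrives.
func feed(text: String) -> void:
    _pending += text
    var newline := _pending.find("\n")
    while newline != -1 and not done:
        var line := _pending.substr(0, newline).strip_edges()
        _pending = _pending.substr(newline + 1)
        _handle_line(line)
        newline = _pending.find("\n")

func _handle_line(line: String) -> void:
    if line.is_empty():
        return
    var parsed = JSON.parse_string(line)
    if typeof(parsed) != TYPE_DICTIONARY:
        return
    # JSON numbers arrive as floats, so convert before comparing.
    var code := int(parsed.get("code", -1))
    if code == 0 and parsed.has("data"):
        audio.append_array(Marshalls.base64_to_raw(str(parsed["data"])))
    elif code == 20000000:
        done = true
```

Buffering until a newline matters because a network chunk can end in the middle of a JSON line. The partial line stays in _pending until the rest arrives.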

Interruptions And Concurrency

VolcengineStreamingVoicePlayer runs one playback task at a time. A new speak() or start_streaming() interrupts the previous playback task before starting.

voice.speak("First line.", voice_id)  # not awaited
voice.speak("Second line.", voice_id) # interrupts the first line

The interrupted await voice.speak(...) wakes up and returns false.

Explicit stop:

voice.stop()

stop() closes active WebSockets, clears pending PCM chunks, stops the AudioStreamPlayer, emits speak_finished if something was active, and makes the waiting speak() return false. Calling stop() while idle is safe.

speak_finished means "no more audio will continue" for success, failure, or interruption. Use the speak() return value to distinguish natural completion from failure/interruption.

Gotchas

  • For live Godot playback, prefer PCM. Streaming MP3 chunks are not directly pushed into AudioStreamGenerator.
  • speak() forces PCM even if you pass "format": "mp3". Use fetch_audio() for MP3 bytes.
  • Bidirectional streaming does not support SSML. Use uni_client or fetch_audio() for SSML.
  • Some saturn_ and _saturn_bigtts voices do not support SSML; the addon warns but lets the server decide.
  • default_model is only injected automatically for saturn_ voices. Passing model values to incompatible voices can trigger resource/speaker mismatch errors on stricter endpoints.
  • Timeouts are inactivity timeouts. Receiving any packet extends the deadline.
  • HTTP response chunks are not guaranteed to align with JSON line boundaries; the HTTP client buffers text until newline before decoding.
  • Subtitle/timestamp frames may be returned by the server, but this addon currently ignores them.

Limitations

  • SSE endpoint support is not implemented.
  • Subtitle/timestamp callbacks are not exposed yet.
  • Voice lists are not bundled. The caller owns the voice_type string.
  • System TTS fallback is not built in. Handle failures in your own game code.

License

MIT. See LICENSE.

