Audio

Authentication

See the BytePower integration guide.

Pronunciation Assessment

Purpose: assesses spoken pronunciation across several dimensions, including accuracy, fluency, and completeness.

Method & Path

POST {domain}/bp/ai/audio/pronunciation
POST {domain}/bp/server/user/{user_id}/ai/audio/pronunciation

Request

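Notes:

The request body is in chunk format and must be sent in segments.
The request body uses chunked transfer encoding, but the Transfer-Encoding header must not be added to the request.
The first chunk carries the pronunciation parameters.
Subsequent chunks carry the audio stream.
The audio stream must be base64-encoded, without the data:audio/mp3;base64, prefix.
Every audio chunk except the last must be base64-encoded without padding.
Only one audio format is currently supported: PCM, 16 kHz, mono, 16-bit.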
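Base64 emits `=` padding only when the input length is not a multiple of 3, so one way to satisfy the no-padding rule is to slice the raw PCM into pieces whose size is a multiple of 3 before encoding. The sketch below illustrates that idea; the function name and the default slice size are illustrative choices, not part of the API.

```python
import base64
from typing import Iterator


def iter_base64_audio_chunks(pcm: bytes, raw_chunk_size: int = 3072) -> Iterator[bytes]:
    """Yield base64-encoded slices of raw PCM audio.

    raw_chunk_size must be a multiple of 3 so that every slice except
    possibly the last one encodes without '=' padding.
    """
    if raw_chunk_size <= 0 or raw_chunk_size % 3 != 0:
        raise ValueError("raw_chunk_size must be a positive multiple of 3")
    for start in range(0, len(pcm), raw_chunk_size):
        # Only the final slice can be shorter than raw_chunk_size,
        # so only the final slice can end with '=' padding.
        yield base64.b64encode(pcm[start:start + raw_chunk_size])
```

Because every non-final slice is free of padding, the encoded slices can be joined and decoded as one base64 string.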

Parameters of the first chunk:

Parameters	Type	Required	Desc
strategy	string	true	Assessment strategy; `azure_pronunciation` is currently supported
data	object	true	Assessment configuration parameters

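Parameters of the data object:

Parameters	Type	Required	Default	Desc
language	string	false	en-US	Language of the speech; defaults to en-US
reference_text	string	true	-	Reference text that the speech is assessed against
grading_system	string	false	HundredMark	Scoring scale: FivePoint (0-5) or HundredMark (0-100)
granularity	string	false	Phoneme	Assessment granularity: Phoneme, Word, or FullText
enable_miscue	boolean	false	false	Enables miscue detection (Omission/Insertion)
enable_prosody_assessment	boolean	false	false	Enables prosody assessment (stress, intonation, speaking rate, rhythm)
nbest_phoneme_count	string	false	-	Number of NBest phoneme candidates, used for candidate phoneme assessment
content_topic	string	false	-	Content topic, used for content assessment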
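As a rough illustration of the two tables above, the following sketch builds the JSON payload for the first chunk and rejects values outside the documented enumerations. The helper name and the client-side validation are assumptions made for the example; `nbest_phoneme_count` and `content_topic` are left out for brevity.

```python
import json

GRADING_SYSTEMS = {"FivePoint", "HundredMark"}
GRANULARITIES = {"Phoneme", "Word", "FullText"}


def build_params_chunk(
    reference_text: str,
    *,
    language: str = "en-US",
    grading_system: str = "HundredMark",
    granularity: str = "Phoneme",
    enable_miscue: bool = False,
    enable_prosody_assessment: bool = False,
) -> bytes:
    """Return the UTF-8 JSON bytes for the first (parameter) chunk."""
    if not reference_text:
        raise ValueError("reference_text is required")
    if grading_system not in GRADING_SYSTEMS:
        raise ValueError(f"unsupported grading_system: {grading_system}")
    if granularity not in GRANULARITIES:
        raise ValueError(f"unsupported granularity: {granularity}")
    payload = {
        "strategy": "azure_pronunciation",
        "data": {
            "language": language,
            "reference_text": reference_text,
            "grading_system": grading_system,
            "granularity": granularity,
            "enable_miscue": enable_miscue,
            "enable_prosody_assessment": enable_prosody_assessment,
        },
    }
    # Compact separators keep the chunk small; the byte length of this
    # result is what the chunk-size line must announce.
    return json.dumps(payload, separators=(",", ":"), ensure_ascii=False).encode("utf-8")
```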

example:

POST /bp/ai/audio/pronunciation HTTP/1.1
Content-Type: text/plain

B7\r\n{"strategy":"azure_pronunciation","data":{"reference_text":"Hello world","grading_system":"HundredMark","granularity":"Phoneme","enable_miscue":true,"enable_prosody_assessment":true}}\r\nx\r\n{x_byte_audio}\r\n0\r\n\r\n
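In the example above, `B7` is the hexadecimal byte length of the JSON parameter chunk (183 bytes), and `0\r\n\r\n` terminates the body. The sketch below shows only this wire framing, under the assumption that the caller supplies the parameter chunk followed by the base64 audio chunks. The function name is illustrative, and how the framed bytes are actually sent depends on the HTTP client in use.

```python
from typing import Iterable


def frame_chunked_body(chunks: Iterable[bytes]) -> bytes:
    """Frame byte chunks as an HTTP/1.1 chunked body.

    Each chunk becomes: <hex length>\r\n<data>\r\n
    The body ends with the zero-length chunk: 0\r\n\r\n
    """
    out = bytearray()
    for chunk in chunks:
        if not chunk:
            # A zero-length chunk would terminate the body early, so skip it.
            continue
        out += f"{len(chunk):X}".encode("ascii") + b"\r\n" + chunk + b"\r\n"
    out += b"0\r\n\r\n"
    return bytes(out)
```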

Response

json

{
  "text": "Hello.",
  "strategy": "azure_audio_pronunciation",
  "current_count": 2,
  "current_total_count": 2,
  "audio_url": "https://xxx.wav",
  "unit_price": {
    "input_per_price": 0.0277,
    "output_per_price": 0,
    "input_token": 0,
    "output_token": 0
  },
  "asset": [
    {
      "name": "ai_free_asset",
      "type": "consumable",
      "quantity": 36,
      "recoverable": false,
      "valid_seconds": 0
    },
    {
      "name": "test_tmpe",
      "type": "consumable",
      "quantity": 4,
      "recoverable": true,
      "last_recovery_time": "2026-02-28T00:00:00Z",
      "valid_seconds": 0
    }
  ],
  "raw_data": {
    "Channel": 0,
    "DisplayText": "Hello.",
    "Duration": 4500000,
    "Id": "a287451405db4e3a892a379191ce52d1",
    "Offset": 32500000,
    "RecognitionStatus": "Success",
    "SNR": 48.073566,
    "NBest": [
      {
        "Confidence": 0.90666413,
        "Display": "Hello.",
        "ITN": "hello",
        "Lexical": "hello",
        "MaskedITN": "hello",
        "PronunciationAssessment": {
          "AccuracyScore": 81,
          "CompletenessScore": 100,
          "FluencyScore": 100,
          "PronScore": 80.4,
          "ProsodyScore": 60.4
        },
        "Words": [
          {
            "Word": "hello",
            "Offset": 32500000,
            "Duration": 4500000,
            "PronunciationAssessment": {
              "AccuracyScore": 81,
              "ErrorType": "None",
              "Feedback": {
                "Prosody": {
                  "Break": {
                    "BreakLength": 0,
                    "ErrorTypes": ["None"]
                  },
                  "Intonation": {
                    "ErrorTypes": ["Monotone"],
                    "Monotone": {
                      "SyllablePitchDeltaConfidence": 0.16252346
                    }
                  }
                }
              }
            },
            "Syllables": [
              {
                "Syllable": "hə",
                "Grapheme": "hel",
                "Offset": 28800000,
                "Duration": 1900000,
                "PronunciationAssessment": { "AccuracyScore": 86 }
              },
              {
                "Syllable": "loʊ",
                "Grapheme": "lo",
                "Offset": 30800000,
                "Duration": 2500000,
                "PronunciationAssessment": { "AccuracyScore": 67 }
              }
            ],
            "Phonemes": [
              {
                "Phoneme": "h",
                "Offset": 28800000,
                "Duration": 1100000,
                "PronunciationAssessment": { "AccuracyScore": 76 }
              },
              {
                "Phoneme": "ə",
                "Offset": 30000000,
                "Duration": 700000,
                "PronunciationAssessment": { "AccuracyScore": 100 }
              },
              {
                "Phoneme": "l",
                "Offset": 30800000,
                "Duration": 900000,
                "PronunciationAssessment": { "AccuracyScore": 100 }
              },
              {
                "Phoneme": "oʊ",
                "Offset": 31800000,
                "Duration": 1500000,
                "PronunciationAssessment": { "AccuracyScore": 46 }
              }
            ]
          }
        ]
      }
    ]
  }
}

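Response fields:

Field	Type	Desc
text	string	Text recognized from the audio
strategy	string	Assessment strategy that was used
current_count	int	Cumulative usage count for the current strategy
current_total_count	int	All-time total usage count for the current strategy
audio_url	string	Audio file URL (returned only when audio storage is enabled)
unit_price	object	Billing information: input and output unit prices and token usage
asset	array	List of the user's assets
raw_data	object	Raw data returned by the Azure speech assessment service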

unit_price fields:

Field	Type	Desc
input_per_price	number	Input unit price
output_per_price	number	Output unit price
input_token	int	Number of input tokens
output_token	int	Number of output tokens

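Top-level raw_data fields:

Field	Type	Desc
RecognitionStatus	string	Recognition status; `Success` on success
DisplayText	string	Recognized text in display form
Duration	int	Audio duration (in 100 ns units)
Offset	int	Start offset of the audio (in 100 ns units)
SNR	number	Signal-to-noise ratio
NBest	array	List of candidate recognition results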
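Since Duration and Offset are expressed in 100 ns units, one second corresponds to 10,000,000 units. A small conversion helper (the name is illustrative) makes the values in the sample response easier to read:

```python
TICKS_PER_SECOND = 10_000_000  # one tick is 100 ns


def ticks_to_seconds(ticks: int) -> float:
    """Convert a Duration or Offset value in 100 ns units to seconds."""
    return ticks / TICKS_PER_SECOND
```

With the sample response, a Duration of 4500000 is 0.45 s and an Offset of 32500000 is 3.25 s.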

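PronunciationAssessment scores in an NBest element:

Score	Desc
AccuracyScore	Accuracy score: how closely the pronunciation matches the reference text
CompletenessScore	Completeness score: how much of the reference text was spoken
FluencyScore	Fluency score: how natural the speech sounds
PronScore	Overall pronunciation score
ProsodyScore	Prosody score covering stress, intonation, speaking rate, and rhythm (requires prosody assessment to be enabled)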

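Fields of a Words element:

Field	Type	Desc
Word	string	The word
Offset	int	Start offset of the word (in 100 ns units)
Duration	int	Duration of the word (in 100 ns units)
PronunciationAssessment	object	Word-level assessment, containing AccuracyScore, ErrorType, and Feedback
Syllables	array	List of syllables, each containing Syllable, Grapheme, Offset, Duration, and PronunciationAssessment
Phonemes	array	List of phonemes, each containing Phoneme, Offset, Duration, and PronunciationAssessment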
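To show how the nested fields fit together, here is a minimal sketch that pulls the overall scores and the per-word and per-phoneme accuracy out of a parsed response. It assumes the first NBest element is the candidate of interest, which the text above does not state; the function name and the shape of the returned summary are also illustrative.

```python
from typing import Any


def summarize_assessment(response: dict[str, Any]) -> dict[str, Any]:
    """Collect overall scores and per-word accuracy from a parsed response."""
    nbest = response.get("raw_data", {}).get("NBest", [])
    if not nbest:
        return {"overall": {}, "words": []}
    best = nbest[0]  # assumption: the first candidate is the one to report
    words = []
    for word in best.get("Words", []):
        assessment = word.get("PronunciationAssessment", {})
        phonemes = [
            (p.get("Phoneme"), p.get("PronunciationAssessment", {}).get("AccuracyScore"))
            for p in word.get("Phonemes", [])
        ]
        words.append({
            "word": word.get("Word"),
            "accuracy": assessment.get("AccuracyScore"),
            "error_type": assessment.get("ErrorType"),
            "phonemes": phonemes,
        })
    return {"overall": best.get("PronunciationAssessment", {}), "words": words}
```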

Error

Invalid parameter:

json

{
  "error": {
    "error_type": "invalid_parameter",
    "message": "invalid_parameter: reference_text is required"
  }
}

Third-party request failed:

json

{
  "error": {
    "error_type": "backend_unavailable",
    "message": "XXX"
  }
}
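Both error bodies share the same envelope: an `error` object with `error_type` and `message`. A client might detect it along these lines; the helper name and the choice to return `None` for non-error bodies are assumptions made for the example.

```python
import json
from typing import Optional, Tuple


def parse_error(body: str) -> Optional[Tuple[str, str]]:
    """Return (error_type, message) if body is an error envelope, else None."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return None
    error = doc.get("error") if isinstance(doc, dict) else None
    if not isinstance(error, dict):
        return None
    return error.get("error_type", ""), error.get("message", "")
```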

Text-to-Speech

Converts text to speech.

Method & Path

POST {domain}/bp/ai/audio/tts
POST {domain}/bp/server/user/{user_id}/ai/audio/tts

Request

Content-Type: application/json

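Parameters	Type	Required	Desc
text	string	true	Text to convert to speech
strategies	string[]	true	List of speech synthesis strategies; supports multi-strategy fallback, trying each in order until one succeeds
platform_params	object[]	false	Array of platform parameters that override a strategy's default configuration (voice, prompt, and so on)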

platform_params

Each platform_params object dynamically overrides a strategy's default configuration. Common fields include platform (platform identifier), voice, and prompt.

Supported platforms:

azure_audio - Azure speech synthesis service
google_audio - Google Cloud Text-to-Speech
gemini - Google Gemini speech synthesis
openai - OpenAI TTS
aws_audio - Amazon Polly

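Strategy fallback

The TTS endpoint supports multi-strategy fallback: when one strategy fails, the next one is tried automatically.

Order: strategies are tried in the order they appear in the strategies array
Failure conditions:
- Configuration error (for example, a required parameter is missing)
- The API call failed and no audio data has been sent
- Empty audio was returned
Success: the result is returned as soon as any strategy succeeds
All failed: if every strategy fails, an all_strategies_failed error is returned

⚠️ Note:

Once a strategy has started sending audio_chunk events, no other strategy is tried.
This prevents the client from receiving incomplete or garbled audio.
For this reason, put the more stable strategies first.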
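The fallback rules can be modelled with a short generator. This is only an illustration of the documented behavior, not the service's actual implementation: `synthesize` is a hypothetical callable that returns an iterable of audio chunks for a strategy name and raises on failure.

```python
from typing import Callable, Iterable, Iterator, Sequence


class AllStrategiesFailed(Exception):
    """Raised when no strategy produced any audio."""


def synthesize_with_fallback(
    strategies: Sequence[str],
    synthesize: Callable[[str], Iterable[bytes]],
) -> Iterator[bytes]:
    """Yield audio chunks from the first strategy that produces audio."""
    for name in strategies:
        sent_any = False
        try:
            for chunk in synthesize(name):
                if not chunk:
                    continue
                sent_any = True
                yield chunk
        except Exception:
            if sent_any:
                # Audio has already reached the client: do not fall back.
                raise
            continue  # failed before any audio was sent: try the next strategy
        if sent_any:
            return  # success
        # No audio at all counts as a failure: try the next strategy.
    raise AllStrategiesFailed("all strategies failed")
```

A strategy that fails midway through its audio surfaces the error directly, which is why ordering the strategies by stability matters.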

Request example

json

{
  "text": "Hello, how are you today?",
  "strategies": [
    "tts_azure",
    "tts_openai",
    "tts_gemini"
  ],
  "platform_params": [
    {
      "platform": "azure_audio",
      "voice": "en-US-SerenaMultilingualNeural",
      "speed": 1.2
    },
    {
      "platform": "openai",
      "voice": "coral"
    },
    {
      "platform": "gemini",
      "language_code": "en-US",
      "voice": "Orus",
      "need_viseme": false
    }
  ]
}
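A request body like the one above could be assembled programmatically as follows. This is a sketch: the helper name and the client-side checks are assumptions, and authentication and the actual HTTP call are left out because they depend on the integration.

```python
import json
from typing import Any, Optional, Sequence


def build_tts_request(
    text: str,
    strategies: Sequence[str],
    platform_params: Optional[Sequence[dict[str, Any]]] = None,
) -> str:
    """Return the JSON request body for the TTS endpoint."""
    if not text:
        raise ValueError("text is required")
    if not strategies:
        raise ValueError("strategies must contain at least one strategy")
    body: dict[str, Any] = {"text": text, "strategies": list(strategies)}
    if platform_params:
        # Optional overrides of each strategy's default configuration.
        body["platform_params"] = list(platform_params)
    return json.dumps(body, ensure_ascii=False)
```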

Response

Response format: streamed as Server-Sent Events (SSE)

Content-Type: text/event-stream

Full response example

event: start
data: {"content":"start","timestamp":1768468269361}

event: audio_chunk
data: {"content":{"audio":"SUQzBAAAAAAAI1RTU0U..."},"timestamp":1768468272256}

event: audio_chunk
data: {"content":{"audio":"//uQxAAAAAAAAAAAAAA..."},"timestamp":1768468272257}

event: audio
data: {"content":{"audio_url":"https://xxx.mp3","viseme_url":"https://xxx.json","unit_price":{"input_per_price":0,"input_token":8,"output_per_price":0,"output_token":34},"current_count":1,"current_total_count":1,"asset":[{"name":"ai_free_asset","quantity":38,"recoverable":false,"type":"consumable","valid_seconds":0},{"name":"test_tmpe","quantity":4,"recoverable":true,"type":"consumable","last_recovery_time":"2026-02-28T00:00:00Z","valid_seconds":0}]},"timestamp":1768468274046}

event: end
data: {"content":"end","timestamp":1768468274046}

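Session control events

Event Name	Desc
start	Marks the start of synthesis; includes a timestamp
end	Marks the end of synthesis; includes a timestamp
error	Error information; when an error occurs it is returned immediately and the stream is interrupted

Audio transfer events

Event Name	Desc
audio_chunk	A slice of audio data (base64-encoded), pushed multiple times; the client receives the slices in order and concatenates them into the complete audio, which allows streaming playback
audio	Complete audio information: audio URL, billing information (unit_price), usage counts, and asset information (viseme_url is returned only when need_viseme is true)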
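As a rough sketch of client-side handling, the function below walks the lines of the event stream, decodes each audio_chunk separately and appends it to the audio, keeps the content of the audio event, and raises on an error event. It assumes one `data:` line per event, as in the example response, and it re-pads base64 defensively because the padding of individual chunks is not specified here. All names are illustrative.

```python
import base64
import json
from typing import Any, Iterable


class TtsStreamError(Exception):
    """Raised when the stream delivers an error event."""


def collect_tts_stream(lines: Iterable[str]) -> tuple[bytes, dict[str, Any]]:
    """Return (audio bytes, content of the audio event) from SSE lines."""
    audio = bytearray()
    summary: dict[str, Any] = {}
    event = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            payload = json.loads(line[len("data:"):].strip())
            if event == "error":
                raise TtsStreamError(payload.get("error", payload))
            if event == "audio_chunk":
                b64 = payload["content"]["audio"]
                # Restore any missing '=' padding before decoding.
                audio += base64.b64decode(b64 + "=" * (-len(b64) % 4))
            elif event == "audio":
                summary = payload["content"]
            elif event == "end":
                break
        elif line == "":
            event = None  # a blank line ends the current event
    return bytes(audio), summary
```

A streaming player would hand each decoded slice to the audio device as it arrives instead of buffering everything first.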

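Fields of the audio event's content:

Field	Type	Desc
audio_url	string	Audio file URL
viseme_url	string	Viseme (mouth shape) data URL (returned only when need_viseme is true)
unit_price	object	Billing information
current_count	int	Cumulative usage count for the current strategy
current_total_count	int	All-time total usage count for the current strategy
asset	array	List of the user's assets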

unit_price fields:

Field	Type	Desc
input_per_price	number	Input unit price
output_per_price	number	Output unit price
input_token	int	Number of input tokens
output_token	int	Number of output tokens

Error

Invalid parameter:

json

{
   "error": {
      "error_type": "invalid_parameter",
      "message": "invalid_parameter: text exceeds maximum length limit"
   }
}

All strategies failed:

json

{
   "error": {
      "error_type": "all_strategies_failed",
      "message": "all strategies failed"
   }
}

Backend service unavailable:

event: error
data: {"error":{"error_type":"backend_unavailable","message":"backend unavailable"}}
