Audio
Authentication
Pronunciation Assessment
Purpose: assesses spoken pronunciation across several dimensions, including accuracy, fluency, and completeness.
Method & Path
- POST {domain}/bp/ai/audio/pronunciation
- POST {domain}/bp/server/user/{user_id}/ai/audio/pronunciation
Request
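Notes:
- The request body is sent in chunks and must be split into segments.
- The body is framed with chunked transfer encoding, but the Transfer-Encoding header must not be added to the request.
- The first chunk carries the pronunciation parameters.
- The following chunks carry the audio stream.
- The audio stream must be base64-encoded, without the "data:audio/mp3;base64," prefix.
- Every audio chunk except the last must be base64-encoded without padding.
- Only PCM audio at 16 kHz, mono, 16-bit is currently supported.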
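The framing rules above can be sketched in Python as follows. This is a minimal illustration, not the official client: `frame_chunk`, `build_body`, and `audio_chunk_bytes` are hypothetical names, and slicing the audio on 3-byte boundaries is just one way to make sure only the last chunk can carry base64 padding. How to send the body without a Transfer-Encoding header depends on the HTTP client and is not shown.

```python
import base64
import json


def frame_chunk(payload: bytes) -> bytes:
    """Frame one chunk as: hex size, CRLF, payload, CRLF."""
    return f"{len(payload):X}".encode("ascii") + b"\r\n" + payload + b"\r\n"


def build_body(params: dict, pcm: bytes, audio_chunk_bytes: int = 3 * 1024) -> bytes:
    """Build the body: parameters first, then base64 audio, then the zero-size terminator."""
    if audio_chunk_bytes <= 0 or audio_chunk_bytes % 3 != 0:
        raise ValueError("audio_chunk_bytes must be a positive multiple of 3")
    # Compact separators keep the JSON free of extra whitespace.
    first = json.dumps(params, separators=(",", ":"), ensure_ascii=False).encode("utf-8")
    body = frame_chunk(first)
    for start in range(0, len(pcm), audio_chunk_bytes):
        piece = pcm[start:start + audio_chunk_bytes]
        # A slice whose length is a multiple of 3 encodes without "=",
        # so only the final (possibly shorter) slice can be padded.
        body += frame_chunk(base64.b64encode(piece))
    return body + b"0\r\n\r\n"
```

With the parameters from the documented example, the first chunk is 183 bytes long, which is the `B7` size prefix shown there.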
Parameters of the first chunk:
| Parameters | Type | Required | Desc |
|---|---|---|---|
| strategy | string | true | Assessment strategy. Currently only azure_pronunciation is supported |
| data | object | true | Assessment configuration |
Fields of the data object:
| Parameters | Type | Required | Default | Desc |
|---|---|---|---|---|
| language | string | false | en-US | Language of the audio |
| reference_text | string | true | - | Reference text the speech is assessed against |
| grading_system | string | false | HundredMark | Scoring scale: FivePoint (0-5) or HundredMark (0-100) |
| granularity | string | false | Phoneme | Assessment granularity: Phoneme, Word, or FullText |
| enable_miscue | boolean | false | false | Enables miscue detection (Omission/Insertion) |
| enable_prosody_assessment | boolean | false | false | Enables prosody assessment (stress, intonation, speaking rate, rhythm) |
| nbest_phoneme_count | string | false | - | Number of NBest phoneme candidates, used for candidate phoneme assessment |
| content_topic | string | false | - | Content topic, used for content assessment |
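As an illustration of the two tables above, the first-chunk payload could be assembled with a small helper that applies the documented defaults and rejects values outside the listed options. `AssessmentData` and `first_chunk` are hypothetical names, and the validation is an assumption about what a client might check locally, not a description of server behavior.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

GRADING_SYSTEMS = {"FivePoint", "HundredMark"}
GRANULARITIES = {"Phoneme", "Word", "FullText"}


@dataclass
class AssessmentData:
    reference_text: str
    language: str = "en-US"
    grading_system: str = "HundredMark"
    granularity: str = "Phoneme"
    enable_miscue: bool = False
    enable_prosody_assessment: bool = False
    nbest_phoneme_count: Optional[str] = None  # documented as a string
    content_topic: Optional[str] = None

    def to_dict(self) -> dict:
        if not self.reference_text:
            raise ValueError("reference_text is required")
        if self.grading_system not in GRADING_SYSTEMS:
            raise ValueError(f"unsupported grading_system: {self.grading_system}")
        if self.granularity not in GRANULARITIES:
            raise ValueError(f"unsupported granularity: {self.granularity}")
        # Optional fields that were never set are left out of the payload.
        return {k: v for k, v in asdict(self).items() if v is not None}


def first_chunk(data: AssessmentData, strategy: str = "azure_pronunciation") -> str:
    """Serialize the first chunk as compact JSON."""
    return json.dumps({"strategy": strategy, "data": data.to_dict()},
                      separators=(",", ":"), ensure_ascii=False)
```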
Example:
POST /bp/ai/audio/pronunciation HTTP/1.1
Content-Type: text/plain
B7\r\n{"strategy":"azure_pronunciation","data":{"reference_text":"Hello world","grading_system":"HundredMark","granularity":"Phoneme","enable_miscue":true,"enable_prosody_assessment":true}}\r\nx\r\n{x_byte_audio}\r\n0\r\n\r\n
Response
json
{
"text": "Hello.",
"strategy": "azure_audio_pronunciation",
"current_count": 2,
"current_total_count": 2,
"audio_url": "https://xxx.wav",
"unit_price": {
"input_per_price": 0.0277,
"output_per_price": 0,
"input_token": 0,
"output_token": 0
},
"asset": [
{
"name": "ai_free_asset",
"type": "consumable",
"quantity": 36,
"recoverable": false,
"valid_seconds": 0
},
{
"name": "test_tmpe",
"type": "consumable",
"quantity": 4,
"recoverable": true,
"last_recovery_time": "2026-02-28T00:00:00Z",
"valid_seconds": 0
}
],
"raw_data": {
"Channel": 0,
"DisplayText": "Hello.",
"Duration": 4500000,
"Id": "a287451405db4e3a892a379191ce52d1",
"Offset": 32500000,
"RecognitionStatus": "Success",
"SNR": 48.073566,
"NBest": [
{
"Confidence": 0.90666413,
"Display": "Hello.",
"ITN": "hello",
"Lexical": "hello",
"MaskedITN": "hello",
"PronunciationAssessment": {
"AccuracyScore": 81,
"CompletenessScore": 100,
"FluencyScore": 100,
"PronScore": 80.4,
"ProsodyScore": 60.4
},
"Words": [
{
"Word": "hello",
"Offset": 32500000,
"Duration": 4500000,
"PronunciationAssessment": {
"AccuracyScore": 81,
"ErrorType": "None",
"Feedback": {
"Prosody": {
"Break": {
"BreakLength": 0,
"ErrorTypes": ["None"]
},
"Intonation": {
"ErrorTypes": ["Monotone"],
"Monotone": {
"SyllablePitchDeltaConfidence": 0.16252346
}
}
}
}
},
"Syllables": [
{
"Syllable": "hə",
"Grapheme": "hel",
"Offset": 28800000,
"Duration": 1900000,
"PronunciationAssessment": { "AccuracyScore": 86 }
},
{
"Syllable": "loʊ",
"Grapheme": "lo",
"Offset": 30800000,
"Duration": 2500000,
"PronunciationAssessment": { "AccuracyScore": 67 }
}
],
"Phonemes": [
{
"Phoneme": "h",
"Offset": 28800000,
"Duration": 1100000,
"PronunciationAssessment": { "AccuracyScore": 76 }
},
{
"Phoneme": "ə",
"Offset": 30000000,
"Duration": 700000,
"PronunciationAssessment": { "AccuracyScore": 100 }
},
{
"Phoneme": "l",
"Offset": 30800000,
"Duration": 900000,
"PronunciationAssessment": { "AccuracyScore": 100 }
},
{
"Phoneme": "oʊ",
"Offset": 31800000,
"Duration": 1500000,
"PronunciationAssessment": { "AccuracyScore": 46 }
}
]
}
]
}
]
}
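}
Response fields:
| Field | Type | Desc |
|---|---|---|
| text | string | Text recognized from the audio |
| strategy | string | Assessment strategy that was used |
| current_count | int | Cumulative usage count for the current strategy |
| current_total_count | int | All-time total usage count for the current strategy |
| audio_url | string | URL of the audio file (returned only when audio storage is enabled) |
| unit_price | object | Billing information: input and output unit prices and token usage |
| asset | array | List of the user's assets |
| raw_data | object | Raw data returned by the Azure speech assessment service |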
unit_price fields:
| Field | Type | Desc |
|---|---|---|
| input_per_price | number | Input unit price |
| output_per_price | number | Output unit price |
| input_token | int | Number of input tokens |
| output_token | int | Number of output tokens |
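Top-level raw_data fields:
| Field | Type | Desc |
|---|---|---|
| RecognitionStatus | string | Recognition status; Success when recognition succeeds |
| DisplayText | string | Recognized text in display form |
| Duration | int | Audio duration, in 100 ns units |
| Offset | int | Start offset of the audio, in 100 ns units |
| SNR | number | Signal-to-noise ratio |
| NBest | array | List of candidate recognition results |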
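PronunciationAssessment scores in each NBest element:
| Score | Desc |
|---|---|
| AccuracyScore | Accuracy score: how closely the spoken pronunciation matches the reference text |
| CompletenessScore | Completeness score: how much of the reference text was spoken |
| FluencyScore | Fluency score: how natural the speech sounds |
| PronScore | Overall pronunciation score |
| ProsodyScore | Prosody score covering stress, intonation, speaking rate, and rhythm (requires prosody assessment to be enabled) |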
Fields of each Words element:
| Field | Type | Desc |
|---|---|---|
| Word | string | The word |
| Offset | int | Start offset of the word, in 100 ns units |
| Duration | int | Duration of the word, in 100 ns units |
| PronunciationAssessment | object | Word-level assessment, containing AccuracyScore, ErrorType, and Feedback |
| Syllables | array | List of syllables, each with Syllable, Grapheme, Offset, Duration, and PronunciationAssessment |
| Phonemes | array | List of phonemes, each with Phoneme, Offset, Duration, and PronunciationAssessment |
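To show how the nested raw_data structure might be consumed, the sketch below walks the first NBest candidate and lists phonemes whose AccuracyScore falls below a threshold, converting Offset and Duration from 100 ns units to seconds. `weak_phonemes` and the default threshold of 60 are hypothetical choices for illustration, and using only the first NBest entry is an assumption.

```python
from typing import Dict, List

TICKS_PER_SECOND = 10_000_000  # Offset and Duration are expressed in 100 ns units


def weak_phonemes(raw_data: Dict, threshold: float = 60.0) -> List[Dict]:
    """Return phonemes in the first NBest candidate scoring below the threshold."""
    results: List[Dict] = []
    nbest = raw_data.get("NBest") or []
    if not nbest:
        return results
    for word in nbest[0].get("Words", []):
        for phoneme in word.get("Phonemes", []):
            score = phoneme["PronunciationAssessment"]["AccuracyScore"]
            if score < threshold:
                results.append({
                    "word": word["Word"],
                    "phoneme": phoneme["Phoneme"],
                    "score": score,
                    "start_s": phoneme["Offset"] / TICKS_PER_SECOND,
                    "duration_s": phoneme["Duration"] / TICKS_PER_SECOND,
                })
    return results
```

Applied to the response example above, only the final phoneme of "hello" (score 46) falls below 60; it starts at 3.18 s and lasts 0.15 s.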
Error
Invalid parameter:
json
{
"error": {
"error_type": "invalid_parameter",
"message": "invalid_parameter: reference_text is required"
}
}
Third-party request failed:
json
{
"error": {
"error_type": "backend_unavailable",
"message": "XXX"
}
}
Text-to-Speech
Converts text to speech.
Method & Path
- POST {domain}/bp/ai/audio/tts
- POST {domain}/bp/server/user/{user_id}/ai/audio/tts
Request
Content-Type: application/json
| Parameters | Type | Required | Desc |
|---|---|---|---|
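| text | string | true | Text to convert to speech |
| strategies | string[] | true | List of synthesis strategies. Supports multi-strategy fallback: strategies are tried in order until one succeeds |
| platform_params | object[] | false | Array of platform parameters that override a strategy's default configuration (voice, prompt, etc.) |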
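platform_params
Each platform_params object dynamically overrides the default configuration of a strategy. Common fields include platform (platform identifier), voice, and prompt.
Supported platforms:
- azure_audio: Azure speech synthesis service
- google_audio: Google Cloud Text-to-Speech
- gemini: Google Gemini speech synthesis
- openai: OpenAI TTS
- aws_audio: Amazon Polly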
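Strategy fallback
The TTS endpoint supports multi-strategy fallback: when one strategy fails, the next one is tried automatically.
- Order: strategies are tried in the order given in the strategies array.
- A strategy counts as failed when:
  - its configuration is invalid (for example, a required parameter is missing)
  - the API call fails before any audio data has been sent
  - it returns empty audio
- Success: the result is returned as soon as any strategy succeeds.
- Total failure: if every strategy fails, the all_strategies_failed error is returned.
⚠️ Note:
- Once a strategy has started sending audio_chunk events, no other strategy is retried.
- This prevents the client from receiving incomplete or mixed audio data.
- For this reason, list the more stable strategies first.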
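The fallback rules above can be modeled with a short generator. This is only a sketch of the documented behavior, not the server's implementation: `synthesize_with_fallback`, `AllStrategiesFailed`, and the `backends` mapping (strategy name to a callable that yields audio chunks) are hypothetical.

```python
from typing import Callable, Dict, Iterator, List


class AllStrategiesFailed(Exception):
    """No strategy produced audio (corresponds to all_strategies_failed)."""


def synthesize_with_fallback(
    strategies: List[str],
    backends: Dict[str, Callable[[], Iterator[bytes]]],
) -> Iterator[bytes]:
    for name in strategies:
        sent_any = False
        try:
            # A missing backend raises KeyError here and is treated
            # like a configuration error: move on to the next strategy.
            for chunk in backends[name]():
                if not chunk:
                    continue
                sent_any = True
                yield chunk
        except Exception:
            if sent_any:
                # Audio already reached the client, so no retry is allowed.
                raise
            continue
        if sent_any:
            return
        # Empty audio counts as a failure, so the next strategy is tried.
    raise AllStrategiesFailed("all strategies failed")
```

The `sent_any` flag is the key design point: it separates failures that are safe to retry from failures that happen after the client has already received part of the audio.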
Request example
json
{
"text": "Hello, how are you today?",
"strategies": [
"tts_azure",
"tts_openai",
"tts_gemini"
],
"platform_params": [
{
"platform": "azure_audio",
"voice": "en-US-SerenaMultilingualNeural",
"speed": 1.2
},
{
"platform": "openai",
"voice": "coral"
},
{
"platform": "gemini",
"language_code": "en-US",
"voice": "Orus",
"need_viseme": false
}
]
}
Response
Response format: streamed as Server-Sent Events (SSE)
Content-Type: text/event-stream
Full response example
event: start
data: {"content":"start","timestamp":1768468269361}
event: audio_chunk
data: {"content":{"audio":"SUQzBAAAAAAAI1RTU0U..."},"timestamp":1768468272256}
event: audio_chunk
data: {"content":{"audio":"//uQxAAAAAAAAAAAAAA..."},"timestamp":1768468272257}
event: audio
data: {"content":{"audio_url":"https://xxx.mp3","viseme_url":"https://xxx.json","unit_price":{"input_per_price":0,"input_token":8,"output_per_price":0,"output_token":34},"current_count":1,"current_total_count":1,"asset":[{"name":"ai_free_asset","quantity":38,"recoverable":false,"type":"consumable","valid_seconds":0},{"name":"test_tmpe","quantity":4,"recoverable":true,"type":"consumable","last_recovery_time":"2026-02-28T00:00:00Z","valid_seconds":0}]},"timestamp":1768468274046}
event: end
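data: {"content":"end","timestamp":1768468274046}
Session control events
| Event Name | Desc |
|---|---|
| start | Marks the start of synthesis; includes a timestamp |
| end | Marks the end of synthesis; includes a timestamp |
| error | Error information. When an error occurs it is returned immediately and the stream is terminated |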
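Audio transfer events
| Event Name | Desc |
|---|---|
| audio_chunk | A base64-encoded slice of audio data, pushed multiple times. The client receives the slices in order and concatenates them into the complete audio; streaming playback is supported |
| audio | Final audio information: audio URL, billing information (unit_price), usage counts, and asset information (viseme_url is returned only when need_viseme is true) |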
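Fields of the audio event's content:
| Field | Type | Desc |
|---|---|---|
| audio_url | string | URL of the audio file |
| viseme_url | string | URL of the viseme (mouth shape) data, returned only when need_viseme is true |
| unit_price | object | Billing information |
| current_count | int | Cumulative usage count for the current strategy |
| current_total_count | int | All-time total usage count for the current strategy |
| asset | array | List of the user's assets |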
unit_price fields:
| Field | Type | Desc |
|---|---|---|
| input_per_price | number | Input unit price |
| output_per_price | number | Output unit price |
| input_token | int | Number of input tokens |
| output_token | int | Number of output tokens |
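A client for the event stream described above might look like the following sketch. It assumes the shape shown in the response example (one `event:` line followed by one `data:` line holding JSON) and that each audio_chunk can be base64-decoded on its own; neither assumption is confirmed by this document. `iter_sse` and `collect_audio` are hypothetical names.

```python
import base64
import json
from typing import Dict, Iterable, Iterator, Tuple


def iter_sse(lines: Iterable[str]) -> Iterator[Tuple[str, Dict]]:
    """Yield (event name, parsed data) pairs from SSE text lines."""
    event = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:") and event is not None:
            yield event, json.loads(line[len("data:"):].strip())
            event = None


def collect_audio(lines: Iterable[str]) -> Tuple[bytes, Dict]:
    """Concatenate audio_chunk payloads in order and return them with the audio summary."""
    audio = bytearray()
    summary: Dict = {}
    for event, data in iter_sse(lines):
        if event == "audio_chunk":
            audio += base64.b64decode(data["content"]["audio"])
        elif event == "audio":
            summary = data["content"]
        elif event == "error":
            error = data.get("error", {})
            raise RuntimeError(f"{error.get('error_type')}: {error.get('message')}")
    return bytes(audio), summary
```

For streaming playback, the decoded bytes of each audio_chunk could be handed to a player as they arrive instead of being buffered.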
Error
Invalid parameter:
json
{
"error": {
"error_type": "invalid_parameter",
"message": "invalid_parameter: text exceeds maximum length limit"
}
}
All strategies failed:
json
{
"error": {
"error_type": "all_strategies_failed",
"message": "all strategies failed"
}
}
Backend service unavailable:
event: error
data: {"error":{"error_type":"backend_unavailable","message":"backend unavailable"}}
Appendix
Client audio data format
Supported audio format
- PCM, 16 kHz, mono, 16-bit
- Recommended recording length: 1-60 seconds
- Maximum file size: 10MB
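The limits above translate into simple byte arithmetic: 16,000 samples per second at 2 bytes per sample on one channel is 32,000 bytes per second. The sketch below applies that arithmetic as a client-side pre-check. `check_recording` is a hypothetical helper, and treating "10MB" as 10 MiB of raw PCM (rather than of base64 text) is an assumption.

```python
from typing import List

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit samples
CHANNELS = 1          # mono
BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS  # 32,000
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" read as 10 MiB of raw PCM


def pcm_duration_seconds(num_bytes: int) -> float:
    """Duration of a raw PCM buffer in the supported format."""
    return num_bytes / BYTES_PER_SECOND


def check_recording(num_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the recording fits the limits."""
    problems: List[str] = []
    if num_bytes % (BYTES_PER_SAMPLE * CHANNELS) != 0:
        problems.append("length is not a whole number of 16-bit samples")
    seconds = pcm_duration_seconds(num_bytes)
    if not 1 <= seconds <= 60:
        problems.append(f"duration {seconds:.2f}s is outside the recommended 1-60 s")
    if num_bytes > MAX_BYTES:
        problems.append("recording exceeds the 10MB limit")
    return problems
```

A 60-second recording is 1,920,000 bytes, so in this format the recommended duration is reached well before the size limit.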
