Audio

Authentication

See "Integrating BytePower".

Pronunciation Assessment

Purpose: assesses spoken pronunciation across multiple dimensions, including accuracy, fluency, and completeness.

Method & Path

  • POST {domain}/bp/ai/audio/pronunciation
  • POST {domain}/bp/server/user/{user_id}/ai/audio/pronunciation

Request

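Notes:

  • The request body uses chunk format and must be sent in segments.
  • The body is sent with chunked transfer encoding, but do not add a Transfer-Encoding header to the request.
  • The first chunk carries the pronunciation parameters.
  • Subsequent chunks carry the audio stream.
  • The audio stream must be base64-encoded, without a data:audio/mp3;base64, prefix.
  • Every audio chunk except the last must be base64-encoded without padding.
  • Only PCM, 16 kHz, mono (1 channel), 16-bit audio is currently supported.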
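One way to satisfy the no-padding rule is to slice the raw PCM into pieces whose length is a multiple of 3 bytes, because base64 only emits `=` padding when the input length is not divisible by 3. The following is a minimal sketch under that assumption; the helper name `encode_audio_chunks` and the default slice size are illustrative and not part of the API.

```python
import base64
from typing import Iterator


def encode_audio_chunks(pcm: bytes, raw_chunk_size: int = 3072) -> Iterator[str]:
    """Yield base64 text for successive slices of raw PCM audio.

    raw_chunk_size is a hypothetical tuning knob. Keeping it a multiple of 3
    means every non-final slice encodes to base64 without '=' padding.
    """
    if raw_chunk_size <= 0 or raw_chunk_size % 3 != 0:
        raise ValueError("raw_chunk_size must be a positive multiple of 3")
    for start in range(0, len(pcm), raw_chunk_size):
        piece = pcm[start:start + raw_chunk_size]
        # Only the bare base64 text is produced; no data: URI prefix is added.
        yield base64.b64encode(piece).decode("ascii")
```

Only the final slice can have a length that is not a multiple of 3, so it is the only one that may end in `=`.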

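Parameters of the first chunk:

| Parameter | Type | Required | Description |
|---|---|---|---|
| strategy | string | true | Assessment strategy; currently supports azure_pronunciation |
| data | object | true | Assessment configuration parameters |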

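Fields of the data object:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| language | string | false | en-US | Language; defaults to en-US |
| reference_text | string | true | - | Reference text to assess the speech against |
| grading_system | string | false | HundredMark | Scoring system: FivePoint (0-5) or HundredMark (0-100) |
| granularity | string | false | Phoneme | Assessment granularity: Phoneme, Word, or FullText |
| enable_miscue | boolean | false | false | Enable miscue detection (Omission/Insertion) |
| enable_prosody_assessment | boolean | false | false | Enable prosody assessment (stress, intonation, speaking rate, rhythm) |
| nbest_phoneme_count | string | false | - | Number of NBest phoneme candidates, used for candidate phoneme assessment |
| content_topic | string | false | - | Content topic, used for content assessment |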
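The parameter tables above can be turned into a small builder that rejects values the tables do not list. This is a sketch only: `build_first_chunk` is a hypothetical helper, and the client-side validation is an assumption about what is convenient, not something the service requires.

```python
import json

GRADING_SYSTEMS = {"FivePoint", "HundredMark"}
GRANULARITIES = {"Phoneme", "Word", "FullText"}


def build_first_chunk(reference_text: str, **options) -> str:
    """Return the JSON text for the first chunk (hypothetical helper)."""
    if not reference_text:
        raise ValueError("reference_text is required")
    grading = options.get("grading_system", "HundredMark")
    granularity = options.get("granularity", "Phoneme")
    if grading not in GRADING_SYSTEMS:
        raise ValueError(f"unsupported grading_system: {grading}")
    if granularity not in GRANULARITIES:
        raise ValueError(f"unsupported granularity: {granularity}")
    # Optional fields are passed through as given; omitted ones use server defaults.
    data = {"reference_text": reference_text, **options}
    payload = {"strategy": "azure_pronunciation", "data": data}
    # Compact separators avoid extra whitespace in the chunk.
    return json.dumps(payload, separators=(",", ":"))
```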

Example:

POST /bp/ai/audio/pronunciation HTTP/1.1
Content-Type: text/plain

B7\r\n{"strategy":"azure_pronunciation","data":{"reference_text":"Hello world","grading_system":"HundredMark","granularity":"Phoneme","enable_miscue":true,"enable_prosody_assessment":true}}\r\nx\r\n{x_byte_audio}\r\n0\r\n\r\n
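In the example, each chunk is framed as its payload length in hexadecimal, CRLF, the payload, and CRLF, with a zero-length chunk closing the body; B7 is 183, the byte length of the JSON parameters. The sketch below only assembles those bytes. The names `frame_chunk` and `build_body` are hypothetical, and whether your HTTP client should send this framing itself or produce it for you is not stated in this document, so treat the transport side as an open assumption.

```python
import json
from typing import List


def frame_chunk(payload: bytes) -> bytes:
    """Frame one chunk as: hex length, CRLF, payload, CRLF."""
    return b"%X\r\n%b\r\n" % (len(payload), payload)


def build_body(params: dict, audio_b64_chunks: List[str]) -> bytes:
    """Assemble the parameter chunk, the audio chunks, and the terminator."""
    first = json.dumps(params, separators=(",", ":")).encode("utf-8")
    parts = [frame_chunk(first)]
    # Audio chunks are already base64 text, so ASCII encoding is sufficient.
    parts += [frame_chunk(chunk.encode("ascii")) for chunk in audio_b64_chunks]
    parts.append(b"0\r\n\r\n")  # zero-length chunk terminates the body
    return b"".join(parts)
```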

Response

json
{
  "text": "Hello.",
  "strategy": "azure_audio_pronunciation",
  "current_count": 2,
  "current_total_count": 2,
  "audio_url": "https://xxx.wav",
  "unit_price": {
    "input_per_price": 0.0277,
    "output_per_price": 0,
    "input_token": 0,
    "output_token": 0
  },
  "asset": [
    {
      "name": "ai_free_asset",
      "type": "consumable",
      "quantity": 36,
      "recoverable": false,
      "valid_seconds": 0
    },
    {
      "name": "test_tmpe",
      "type": "consumable",
      "quantity": 4,
      "recoverable": true,
      "last_recovery_time": "2026-02-28T00:00:00Z",
      "valid_seconds": 0
    }
  ],
  "raw_data": {
    "Channel": 0,
    "DisplayText": "Hello.",
    "Duration": 4500000,
    "Id": "a287451405db4e3a892a379191ce52d1",
    "Offset": 32500000,
    "RecognitionStatus": "Success",
    "SNR": 48.073566,
    "NBest": [
      {
        "Confidence": 0.90666413,
        "Display": "Hello.",
        "ITN": "hello",
        "Lexical": "hello",
        "MaskedITN": "hello",
        "PronunciationAssessment": {
          "AccuracyScore": 81,
          "CompletenessScore": 100,
          "FluencyScore": 100,
          "PronScore": 80.4,
          "ProsodyScore": 60.4
        },
        "Words": [
          {
            "Word": "hello",
            "Offset": 32500000,
            "Duration": 4500000,
            "PronunciationAssessment": {
              "AccuracyScore": 81,
              "ErrorType": "None",
              "Feedback": {
                "Prosody": {
                  "Break": {
                    "BreakLength": 0,
                    "ErrorTypes": ["None"]
                  },
                  "Intonation": {
                    "ErrorTypes": ["Monotone"],
                    "Monotone": {
                      "SyllablePitchDeltaConfidence": 0.16252346
                    }
                  }
                }
              }
            },
            "Syllables": [
              {
                "Syllable": "hə",
                "Grapheme": "hel",
                "Offset": 28800000,
                "Duration": 1900000,
                "PronunciationAssessment": { "AccuracyScore": 86 }
              },
              {
                "Syllable": "loʊ",
                "Grapheme": "lo",
                "Offset": 30800000,
                "Duration": 2500000,
                "PronunciationAssessment": { "AccuracyScore": 67 }
              }
            ],
            "Phonemes": [
              {
                "Phoneme": "h",
                "Offset": 28800000,
                "Duration": 1100000,
                "PronunciationAssessment": { "AccuracyScore": 76 }
              },
              {
                "Phoneme": "ə",
                "Offset": 30000000,
                "Duration": 700000,
                "PronunciationAssessment": { "AccuracyScore": 100 }
              },
              {
                "Phoneme": "l",
                "Offset": 30800000,
                "Duration": 900000,
                "PronunciationAssessment": { "AccuracyScore": 100 }
              },
              {
                "Phoneme": "oʊ",
                "Offset": 31800000,
                "Duration": 1500000,
                "PronunciationAssessment": { "AccuracyScore": 46 }
              }
            ]
          }
        ]
      }
    ]
  }
}

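Response fields:

| Field | Type | Description |
|---|---|---|
| text | string | Text recognized from the audio |
| strategy | string | Assessment strategy used |
| current_count | int | Cumulative usage count for the current strategy |
| current_total_count | int | All-time total usage count for the current strategy |
| audio_url | string | Audio file URL (returned only when audio storage is enabled) |
| unit_price | object | Billing information, including input and output unit prices and token usage |
| asset | array | List of user assets |
| raw_data | object | Raw data returned by the Azure speech assessment service |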

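unit_price fields:

| Field | Type | Description |
|---|---|---|
| input_per_price | number | Input unit price |
| output_per_price | number | Output unit price |
| input_token | int | Number of input tokens |
| output_token | int | Number of output tokens |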

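Top-level raw_data fields:

| Field | Type | Description |
|---|---|---|
| RecognitionStatus | string | Recognition status; Success when recognition succeeds |
| DisplayText | string | Recognized display text |
| Duration | int | Audio duration (in 100 ns units) |
| Offset | int | Audio start offset (in 100 ns units) |
| SNR | number | Signal-to-noise ratio |
| NBest | array | List of candidate recognition results |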
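Because Offset and Duration are expressed in 100 ns units, there are 10,000,000 units per second. A short conversion sketch follows; the function names are illustrative only.

```python
from typing import Tuple

TICKS_PER_SECOND = 10_000_000  # one tick is 100 ns


def ticks_to_seconds(ticks: int) -> float:
    """Convert a value in 100 ns units to seconds."""
    return ticks / TICKS_PER_SECOND


def span_seconds(item: dict) -> Tuple[float, float]:
    """Return (start, end) in seconds for any object with Offset and Duration."""
    start = item["Offset"]
    end = start + item["Duration"]  # add as integers before converting
    return ticks_to_seconds(start), ticks_to_seconds(end)
```

With the sample response above, a Duration of 4500000 corresponds to 0.45 seconds and an Offset of 32500000 to 3.25 seconds.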

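PronunciationAssessment scores in each NBest element:

| Score | Description |
|---|---|
| AccuracyScore | Accuracy score: how closely the spoken pronunciation matches the reference text |
| CompletenessScore | Completeness score: how much of the reference text was spoken |
| FluencyScore | Fluency score: how natural the speech sounds |
| PronScore | Overall pronunciation score |
| ProsodyScore | Prosody score covering stress, intonation, speaking rate, and rhythm (requires prosody assessment to be enabled) |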

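Fields of each Words element:

| Field | Type | Description |
|---|---|---|
| Word | string | The word |
| Offset | int | Word start offset (in 100 ns units) |
| Duration | int | Word duration (in 100 ns units) |
| PronunciationAssessment | object | Word-level assessment, containing AccuracyScore, ErrorType, and Feedback |
| Syllables | array | List of syllables, each with Syllable, Grapheme, Offset, Duration, and PronunciationAssessment |
| Phonemes | array | List of phonemes, each with Phoneme, Offset, Duration, and PronunciationAssessment |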
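A common use of the Words and Phonemes arrays is to surface the sounds a speaker struggled with. The sketch below walks the first NBest candidate and collects phonemes whose AccuracyScore falls below a threshold. Using the first candidate and a threshold of 60 are assumptions made for illustration; neither is prescribed by the API.

```python
from typing import List, Tuple


def weak_phonemes(response: dict, threshold: float = 60.0) -> List[Tuple[str, str, float]]:
    """Return (word, phoneme, score) for phonemes scoring below threshold."""
    nbest = response.get("raw_data", {}).get("NBest", [])
    if not nbest:
        return []
    found = []
    for word in nbest[0].get("Words", []):  # assumption: use the first candidate
        for phoneme in word.get("Phonemes", []):
            score = phoneme["PronunciationAssessment"]["AccuracyScore"]
            if score < threshold:
                found.append((word["Word"], phoneme["Phoneme"], score))
    return found
```

Applied to the sample response, only the final phoneme of "hello" (score 46) falls below 60.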

Error

Invalid parameter:

json
{
  "error": {
    "error_type": "invalid_parameter",
    "message": "invalid_parameter: reference_text is required"
  }
}

Third-party request failed:

json
{
  "error": {
    "error_type": "backend_unavailable",
    "message": "XXX"
  }
}
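Both error bodies share the same envelope: an error object with error_type and message. A client might convert that envelope into an exception, as sketched below; `ApiError` and `raise_for_error` are hypothetical names, not part of any SDK.

```python
import json


class ApiError(Exception):
    """Carries the error_type and message from the error envelope."""

    def __init__(self, error_type: str, message: str):
        super().__init__(f"{error_type}: {message}")
        self.error_type = error_type
        self.message = message


def raise_for_error(body: str) -> dict:
    """Parse a JSON response body and raise ApiError if it is an error envelope."""
    parsed = json.loads(body)
    err = parsed.get("error") if isinstance(parsed, dict) else None
    if err:
        raise ApiError(err.get("error_type", "unknown"), err.get("message", ""))
    return parsed
```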

Text-to-Speech

Converts text to speech.

Method & Path

  • POST {domain}/bp/ai/audio/tts
  • POST {domain}/bp/server/user/{user_id}/ai/audio/tts

Request

Content-Type: application/json

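| Parameter | Type | Required | Description |
|---|---|---|---|
| text | string | true | Text to convert |
| strategies | string[] | true | List of speech synthesis strategies; supports multi-strategy fallback, tried in order until one succeeds |
| platform_params | object[] | false | Array of platform parameters that override a strategy's default configuration (voice, prompt, etc.) |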

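platform_params

Each platform_params object dynamically overrides a strategy's default configuration. Common fields include platform (platform identifier), voice (voice name), and prompt (prompt text).

Supported platforms:

  • azure_audio - Azure speech synthesis service
  • google_audio - Google Cloud Text-to-Speech
  • gemini - Google Gemini speech synthesis
  • openai - OpenAI TTS
  • aws_audio - Amazon Polly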

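Strategy fallback

The TTS endpoint supports multi-strategy fallback: when one strategy fails, the next one is tried automatically:

  1. Try in order: strategies are attempted in the order of the strategies array.
  2. Failure conditions:
    • Configuration error (for example, a required parameter is missing)
    • The API call fails before any audio data has been sent
    • Empty audio is returned
  3. Success: the result is returned as soon as any strategy succeeds.
  4. Total failure: if every strategy fails, an all_strategies_failed error is returned.

⚠️ Note:

  • Once a strategy has started sending audio_chunk events, no other strategy is retried.
  • This prevents the client from receiving incomplete or mixed audio data.
  • It is therefore advisable to list the more stable strategies first.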
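The fallback rules can be modelled as a generator that moves to the next strategy only while nothing has been emitted. This is a simulation of the described behaviour, not the service's implementation: the `backends` mapping, the `AllStrategiesFailed` exception, and the use of a missing key to stand in for a configuration error are all assumptions made for the sketch.

```python
from typing import Callable, Dict, Iterator, List


class AllStrategiesFailed(Exception):
    """Mirrors the all_strategies_failed error type."""


def synthesize_with_fallback(
    strategies: List[str],
    backends: Dict[str, Callable[[str], Iterator[bytes]]],
    text: str,
) -> Iterator[bytes]:
    """Yield audio chunks from the first strategy that produces any."""
    for name in strategies:
        sent_any = False
        try:
            # A missing key raises KeyError, standing in for a config error.
            for chunk in backends[name](text):
                sent_any = True
                yield chunk
        except Exception:
            if sent_any:
                raise  # audio already streamed: no retry with another strategy
            continue  # nothing sent yet: try the next strategy
        if sent_any:
            return  # this strategy succeeded
        # Empty audio counts as a failure, so the loop moves on.
    raise AllStrategiesFailed("all strategies failed")
```

The `sent_any` flag is what enforces the note above: a failure after the first chunk is propagated rather than retried, so the consumer never receives audio from two different strategies.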

Request example

json
{
  "text": "Hello, how are you today?",
  "strategies": [
    "tts_azure",
    "tts_openai",
    "tts_gemini"
  ],
  "platform_params": [
    {
      "platform": "azure_audio",
      "voice": "en-US-SerenaMultilingualNeural",
      "speed": 1.2
    },
    {
      "platform": "openai",
      "voice": "coral"
    },
    {
      "platform": "gemini",
      "language_code": "en-US",
      "voice": "Orus",
      "need_viseme": false
    }
  ]
}

Response

Response format: streamed as Server-Sent Events (SSE)

Content-Type: text/event-stream

Full response example

event: start
data: {"content":"start","timestamp":1768468269361}

event: audio_chunk
data: {"content":{"audio":"SUQzBAAAAAAAI1RTU0U..."},"timestamp":1768468272256}

event: audio_chunk
data: {"content":{"audio":"//uQxAAAAAAAAAAAAAA..."},"timestamp":1768468272257}

event: audio
data: {"content":{"audio_url":"https://xxx.mp3","viseme_url":"https://xxx.json","unit_price":{"input_per_price":0,"input_token":8,"output_per_price":0,"output_token":34},"current_count":1,"current_total_count":1,"asset":[{"name":"ai_free_asset","quantity":38,"recoverable":false,"type":"consumable","valid_seconds":0},{"name":"test_tmpe","quantity":4,"recoverable":true,"type":"consumable","last_recovery_time":"2026-02-28T00:00:00Z","valid_seconds":0}]},"timestamp":1768468274046}

event: end
data: {"content":"end","timestamp":1768468274046}

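Session control events

| Event Name | Description |
|---|---|
| start | Marks the start of synthesis; includes a timestamp |
| end | Marks the end of synthesis; includes a timestamp |
| error | Error information; when an error occurs it is returned immediately and the stream is terminated |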

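Audio transfer events

| Event Name | Description |
|---|---|
| audio_chunk | A fragment of audio data (base64-encoded). Pushed multiple times; the client receives the fragments in order and concatenates them into the complete audio. Supports streaming playback. |
| audio | Complete audio information: audio URL, billing information (unit_price), usage counts, and asset information (viseme_url is returned only when need_viseme is true) |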
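A client can consume the stream by grouping event and data lines into events, decoding each audio_chunk, and appending the bytes in arrival order. The sketch below works on an iterable of text lines so it stays independent of any HTTP library. It assumes each audio_chunk payload is independently decodable base64, which this document does not state explicitly; the function names are illustrative.

```python
import base64
import json
from typing import Iterable, Iterator, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, parsed_data) pairs from SSE text lines."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and data:  # a blank line ends the current event
            yield event or "message", json.loads("\n".join(data))
            event, data = None, []
    if data:  # stream ended without a trailing blank line
        yield event or "message", json.loads("\n".join(data))


def collect_audio(lines: Iterable[str]) -> Tuple[bytes, dict]:
    """Return (concatenated audio bytes, content of the audio event)."""
    audio, summary = bytearray(), {}
    for event, payload in iter_sse_events(lines):
        if event == "audio_chunk":
            audio += base64.b64decode(payload["content"]["audio"])
        elif event == "audio":
            summary = payload["content"]
        elif event == "error":
            raise RuntimeError(payload["error"]["error_type"])
        elif event == "end":
            break
    return bytes(audio), summary
```

For streaming playback, the bytes decoded in the audio_chunk branch could be handed to a player as they arrive instead of being buffered.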

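Fields of the audio event's content:

| Field | Type | Description |
|---|---|---|
| audio_url | string | Audio file URL |
| viseme_url | string | Viseme (mouth shape) data URL (returned only when need_viseme is true) |
| unit_price | object | Billing information |
| current_count | int | Cumulative usage count for the current strategy |
| current_total_count | int | All-time total usage count for the current strategy |
| asset | array | List of user assets |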

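unit_price fields:

| Field | Type | Description |
|---|---|---|
| input_per_price | number | Input unit price |
| output_per_price | number | Output unit price |
| input_token | int | Number of input tokens |
| output_token | int | Number of output tokens |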

Error

Invalid parameter:

json
{
   "error": {
      "error_type": "invalid_parameter",
      "message": "invalid_parameter: text exceeds maximum length limit"
   }
}

All strategies failed:

json
{
   "error": {
      "error_type": "all_strategies_failed",
      "message": "all strategies failed"
   }
}

Backend service unavailable:

event: error
data: {"error":{"error_type":"backend_unavailable","message":"backend unavailable"}}

Appendix

Client data transfer format

Transfer-Encoding

Supported audio formats

  • PCM, 16 kHz, mono, 16-bit
  • Recommended recording length: 1-60 seconds
  • Maximum file size: 10 MB
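For 16 kHz, mono, 16-bit PCM, one second of audio is 16000 × 2 × 1 = 32000 bytes, so the duration can be derived from the byte length alone. The sketch below checks a recording against the limits listed above before upload. Reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption, and the helper names are illustrative.

```python
from typing import List

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit samples
CHANNELS = 1
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB


def pcm_duration_seconds(pcm: bytes) -> float:
    """Duration of raw PCM audio in the supported format."""
    return len(pcm) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def check_recording(pcm: bytes) -> List[str]:
    """Return a list of problems; an empty list means the recording looks fine."""
    problems = []
    if len(pcm) % (BYTES_PER_SAMPLE * CHANNELS) != 0:
        problems.append("length is not a whole number of 16-bit samples")
    if len(pcm) > MAX_BYTES:
        problems.append("larger than the maximum file size")
    duration = pcm_duration_seconds(pcm)
    if not 1.0 <= duration <= 60.0:
        problems.append(f"duration {duration:.2f}s is outside the recommended 1-60 s")
    return problems
```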
