Putting Words to Video: Did Someone Say “Autumn Aided Cap Shins”?

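Not long after the invention of the modern computer, a clearly mistaken assumption took hold: that computers would soon be up to the job of processing natural language data. People usually communicate quite well by about age 3, so this didn’t seem to be an unreasonable expectation, since everyone knew that computers were solving problems beyond the abilities of the brightest 3-year-old. But language understanding, the ability to perceive the complex sound-pressure changes produced by another person speaking and then assign them symbolic interpretations based on local culture and the current context, is very hard to teach a computer.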

With the explosive growth of streaming media in the past decade, a great deal of effort has gone into meeting the challenge of automatically captioning that video. Happily, some improvement has been made. The task of captioning is essentially this: Identify candidate speech sounds the speaker might be making; identify candidate words that fit the sequence of plausible sounds; choose the most probable sequence of candidate words; add appropriate punctuation; and segment the resulting text so it appears on screen in a way that can be easily and fluently read as it is spoken. Each of those tasks is difficult in its own right, and different automated captioning tools are better at some of them than at others.
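
The last of those steps, segmentation, is the easiest to illustrate. What follows is only a minimal Python sketch, not how any particular captioning tool works. It assumes a limit of 32 characters per line and two lines per caption block; both limits, and the function name segment_captions, are assumptions made for this illustration. It also ignores timing and phrase boundaries, which real captioning has to respect.

```python
import textwrap

def segment_captions(text, max_chars=32, max_lines=2):
    """Split text into caption blocks.

    Each block has at most max_lines lines, and each line has at most
    max_chars characters. Both limits are illustrative assumptions.
    """
    # Greedy word wrap: fill each line with as many whole words as fit.
    lines = textwrap.wrap(text, width=max_chars)
    # Group consecutive lines into blocks shown on screen together.
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]

transcript = "automated captions are easier to read when lines are short and balanced"
for block in segment_captions(transcript):
    print("\n".join(block))
    print()  # blank line between caption blocks
```

Greedy wrapping keeps the sketch short, but it can strand a single word on its own line; a production tool would more likely break at clause boundaries.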

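One of those tasks that has improved recently is recognizing phonemes, the vowel and consonant sounds of speech. This is a famously hard problem: Because every voice is unique, speech recognizers need to be trained to learn each user’s idiosyncrasies. Improvement has come from two directions. On the client side, most of us carry small but powerful computers that have poor keyboards but decent microphones. Mobile and desktop operating systems now have voice assistants that continually adapt to recognize your unique voice and the way you use it to make sounds. On the server side, we have classifiers, software that decides whether input data belongs to one category of similar, previously encountered data or rather to another category. Server-side platforms can compare a speech signal with huge datasets of phonemic patterns and classify candidate sounds more accurately than earlier systems could.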
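
To make the idea of a classifier concrete, here is a minimal Python sketch of a nearest-centroid classifier, a far simpler technique than anything a production speech platform uses. The two-number “features” per sound and the labels "ee" and "ah" are made-up values for illustration only, not real acoustic measurements from any dataset.

```python
import math

# Hypothetical feature pairs for previously encountered sounds.
# The numbers are illustrative assumptions, not measured data.
TRAINING = {
    "ee": [(270.0, 2290.0), (300.0, 2200.0)],
    "ah": [(730.0, 1090.0), (700.0, 1150.0)],
}

def centroid(points):
    """Average each feature across the example points."""
    return tuple(sum(values) / len(values) for values in zip(*points))

# One representative point per category.
CENTROIDS = {label: centroid(points) for label, points in TRAINING.items()}

def classify(sample):
    """Return the label whose centroid is closest to the sample."""
    return min(CENTROIDS, key=lambda label: math.dist(sample, CENTROIDS[label]))

print(classify((290.0, 2250.0)))  # closest to the "ee" examples
```

The sketch decides which category of previously encountered data a new input most resembles, which is the job the paragraph above describes.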

Another of those tasks that has improved, and will continue to improve, is choosing the most probable sequence of words from the available candidates. This is traditionally done with a language model; in its simplest form, a statistical analysis of how often different words occur together. The words “automated” and “captions” are more likely to occur together than “autumn aided cap shins.” That likelihood is what language models capture.
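
A language model in that simplest form can be sketched in a few lines of Python. The example below is a toy bigram model with add-one smoothing, trained on four invented sentences; the corpus, the function name log_prob, and the candidate list are all assumptions for illustration, and real recognizers rely on vastly larger models.

```python
import math
from collections import Counter

# Toy training text (invented for this sketch; real corpora are enormous).
CORPUS = (
    "automated captions help viewers . "
    "automated captions need punctuation . "
    "good captions help everyone . "
    "autumn leaves fall ."
).split()

unigrams = Counter(CORPUS)                  # how often each word occurs
bigrams = Counter(zip(CORPUS, CORPUS[1:]))  # how often each adjacent pair occurs
VOCAB_SIZE = len(unigrams)

def log_prob(words):
    """Log-probability of a word sequence under an add-one-smoothed bigram model."""
    total = 0.0
    for prev, word in zip(words, words[1:]):
        # Add-one smoothing keeps unseen pairs from scoring zero.
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + VOCAB_SIZE))
    return total

candidates = [
    "automated captions".split(),
    "autumn aided cap shins".split(),
]
best = max(candidates, key=log_prob)
print(" ".join(best))  # prints "automated captions"
```

Because the pair “automated captions” appears in the training text and none of the pairs in “autumn aided cap shins” does, the model scores the first candidate higher.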

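Captioning educational video is an especially ripe area for driving speech recognition research forward. A school is a fairly closed ecosystem. We can easily identify which teacher is speaking in a lecture, and we can easily have that teacher train a custom speech model to be reused whenever she appears in a video to be captioned. Teachers at large research universities are the brightest minds recruited from all over the world, and so their linguistic diversity is extreme; these custom-tuned speech models are critical for accurate captioning when your speakers are from such varied linguistic backgrounds.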

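Educational videos often contain technical vocabulary and jargon that standard recognizers struggle to identify. However, we can use the visual aids the teacher uses in the video (usually slides), which can be mined for context-relevant vocabulary. This is exactly what Microsoft Garage’s Presentation Translator does. Accurate captioning of these atypical terms is critical, since the captions would be misleadingly bad otherwise.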
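
As a rough illustration of mining slides for vocabulary, the Python sketch below collects the words in slide text that a recognizer’s general vocabulary lacks. It is not how Presentation Translator is implemented; the base vocabulary, the slide text, and the function name mine_slide_terms are all hypothetical.

```python
import re

# A tiny stand-in for a recognizer's general vocabulary (hypothetical).
BASE_VOCABULARY = {"the", "of", "a", "is", "in", "and", "cell", "energy", "makes"}

def mine_slide_terms(slide_texts, base_vocabulary=BASE_VOCABULARY):
    """Return the sorted words from slide text that the base vocabulary lacks."""
    terms = set()
    for text in slide_texts:
        # Match alphabetic words of two or more characters.
        for word in re.findall(r"[A-Za-z][A-Za-z'-]+", text):
            if word.lower() not in base_vocabulary:
                terms.add(word.lower())
    return sorted(terms)

slides = ["Mitochondria and the cell", "Oxidative phosphorylation makes energy"]
print(mine_slide_terms(slides))
# prints ['mitochondria', 'oxidative', 'phosphorylation']
```

The resulting list is the kind of input a recognizer could use to favor those terms when it chooses among candidate words.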

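Universities are where many of the top researchers in speech recognition work, and they are also where the need for accurate automated captions is urgent. This is a perfect example of how the three-part mission of a university (to educate, to research, and to provide public service) demands cooperative action.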

[This article appears in the June 2018 issue of Streaming Media magazine as "Autumn Aided Cap Shins."]

Related Articles

Backward Design for Educational Video Production

An Impending Accessibility Backlash

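Software developers are trained in accessibility issues and basic concepts of front-end development, such as labeling control elements and reporting state changes to assistive technology (screen readers), and these are part of a professional developer’s code-testing process. Despite this progress, two very different forces are gathering that threaten to stall the trend toward technology that better includes people with disabilities.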


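The editing features in YouTube’s back end can’t compete with nonlinear editors such as Adobe Premiere Pro, but there are some powerful and unique tools that make simple editing projects even simpler.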

New FCC Caption Requirements: What You Need to Know

New caption requirements take effect July 1 for live, near-live, and prerecorded broadcast video that is put online.

How to Caption Live Online Video

We are still a few years away from a standard for captioning live video, and the available solutions are anything but plug-and-play. But that doesn't mean it can't be done. It just takes a little effort.