Twelve Labs: Video Understanding Company Secures $10 Million

By Annie Baker • Oct 30, 2023

Video understanding company Twelve Labs recently announced the debut of its latest technology along with the release of its public beta. Alongside these advancements, Twelve Labs disclosed a $10 million strategic investment. Investors including NVentures (NVIDIA's venture capital arm), Intel, Samsung Next, and others see Twelve Labs' technology as driving the future of video understanding. Working in alignment with the company, these investors aim to create new opportunities and product integrations that will change the video landscape.

Twelve Labs is the first in its industry to commercially release video-to-text generative APIs, powered by its latest video-language foundation model, Pegasus-1. With the release of its public beta, the model enables novel capabilities like Summaries, Chapters, Video Titles, and Captioning from videos, even those without audio or text, extending the boundaries of what is possible.
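
To make the idea concrete, here is a minimal sketch of how a video-to-text generation call against such an API might look. The endpoint path, header, field names, and response shape below are illustrative assumptions, not the documented Twelve Labs interface; the official API reference is the authority on the real calls.

```python
import requests

# Hypothetical sketch of a video-to-text generation request.
# Endpoint path, auth header, field names, and response shape are
# assumptions for illustration, not the documented Twelve Labs API.
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.twelvelabs.io/v1.2"  # assumed base URL

def summarize_video(video_id: str, kind: str = "summary") -> str:
    """Request generated text (summary, chapters, etc.) for an indexed video."""
    resp = requests.post(
        f"{BASE_URL}/summarize",  # assumed endpoint name
        headers={"x-api-key": API_KEY},
        json={"video_id": video_id, "type": kind},  # assumed: "summary" or "chapter"
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json().get("summary", "")

if __name__ == "__main__":
    print(summarize_video("my-video-id", kind="summary"))
```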

This release comes at a time when language models, whose training objective had simply been to guess the most probable next word, have surfaced new possibilities: planning a set of actions to solve a complex problem, effectively summarizing a 1,000-page text, even passing the bar exam. Mapping visual and audio content to language may be viewed similarly, but solving video-language alignment, as Twelve Labs has with this release, is incredibly difficult. In doing so, Twelve Labs' latest functionality solves many problems no one else has been able to overcome.

The company uniquely trained its multimodal AI model to solve complex video-language alignment problems. Twelve Labs' proprietary model, evolved, tested, and refined for its public beta, leverages all of the components present in videos, such as actions, objects, and background sounds, and it learns to map human language to what is happening inside a video. This goes beyond the capabilities available in the existing market, and its APIs arrive just as OpenAI rolls out voice and image capabilities for ChatGPT, signaling that a shift from unimodal to multimodal AI is underway.

Twelve Labs enables video not only to tell a holistic story but also endows models with powerful capabilities, so users can find the best video to meet their needs, whether pulling a highlight reel or generating a custom report. Twelve Labs users can now extract topics and create summaries and chapters of videos by leveraging multimodal data. These features not only save users substantial amounts of time but also help uncover new insights, suggest marketing content such as catchy headlines or SEO-friendly tags, and unlock new possibilities for video through simple-to-use APIs, as sketched below.
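
A prompt-driven variant of the same idea might look like the following sketch, which asks the model for open-ended marketing text about a video. Again, the endpoint name, the `prompt` parameter, and the response field are hypothetical stand-ins for whatever the production API actually exposes.

```python
import requests

# Hypothetical sketch of prompt-driven text generation about a video
# (e.g., headlines or tags). Endpoint and field names are assumptions.
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.twelvelabs.io/v1.2"  # assumed base URL

def generate_text(video_id: str, prompt: str) -> str:
    """Prompt the model for free-form text grounded in an indexed video."""
    resp = requests.post(
        f"{BASE_URL}/generate",  # assumed endpoint name
        headers={"x-api-key": API_KEY},
        json={"video_id": video_id, "prompt": prompt},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json().get("data", "")

if __name__ == "__main__":
    print(generate_text(
        "my-video-id",
        "Write a catchy headline and five SEO-friendly tags for this video.",
    ))
```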

Twelve Labs set out to build the go-to video understanding infrastructure for developers and enterprises innovating on the video experience in their respective areas, making video as easy and useful to work with as text. In essence, Twelve Labs provides the video intelligence layer on top of which customers build their dream features.

For the first time, organizations and developers can retrieve an exact moment within hundreds of thousands of hours of footage by describing that scene in text, or generate the relevant body text (titles, chapters, summaries, reports, or even tags) from videos, incorporating both the visual and the audio, just by prompting the model. With these groundbreaking capabilities, Twelve Labs pushes boundaries to provide a text-based interface for video-related downstream tasks, ranging from low-level perception to high-level video understanding.
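
The moment-retrieval capability described above amounts to natural-language search over indexed footage. The sketch below shows roughly what such a call could look like; the endpoint name, the `search_options` values, and the assumption that each hit carries a video ID plus start/end timestamps are all illustrative, not confirmed details of the Twelve Labs API.

```python
import requests

# Hypothetical sketch of natural-language moment search over an index.
# Endpoint, option names, and result fields are assumptions.
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.twelvelabs.io/v1.2"  # assumed base URL

def search_moments(index_id: str, query: str) -> list:
    """Find moments in indexed footage matching a text description."""
    resp = requests.post(
        f"{BASE_URL}/search",  # assumed endpoint name
        headers={"x-api-key": API_KEY},
        json={
            "index_id": index_id,
            "query": query,
            "search_options": ["visual", "conversation"],  # assumed options
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

if __name__ == "__main__":
    # Each hit is assumed to carry a video_id and start/end times in seconds.
    for hit in search_moments("my-index-id", "quarterback throws a touchdown pass"):
        print(hit["video_id"], hit["start"], hit["end"])
```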

During its highly successful closed beta, in which more than 17,000 developers tested the platform, Twelve Labs worked to ensure a highly scalable, fast, and reliable experience, and the company saw an explosion of use cases.

KEY QUOTES:

“The Twelve Labs team has consistently pushed the envelope and broken new ground in video understanding since our founding in 2021. Our latest features represent this tireless work. Based on the remarkable feedback we have received, and the breadth of test cases we’ve seen, we are incredibly excited to welcome a broader audience to our platform so that anyone can use best-in-class AI to understand video content without manually watching thousands of hours to find what they are looking for. We believe this is the best, most efficient way to make use of video.”

— Jae Lee, co-founder and CEO of Twelve Labs

“What Twelve Labs has accomplished technically is impressive. Anyone who understands the complexities associated with summarizing video will appreciate this leap forward. We believe Twelve Labs is an exciting AI company and look forward to working with the team on numerous projects in the future.”

— Mohamed (Sid) Siddeek, head of NVentures at NVIDIA

“It’s essential for our business to access exact moments, angles, or events within a game in order to package the best content to our fans, so we prioritize video search tools for our content creators. It’s exciting to see the shift from traditional video labeling and tagging towards contextual video search using natural language. The emergence of multi-modal AI and natural language search can be a game-changer in opening up access to a media library and surfacing the best content you have available.”

— Brad Boim, Senior Director of Asset Management and Post-Production, NFL Media