Led by a founder who sold a video startup to Apple, Panjaya uses deepfake techniques to bite into video dubbing

There’s a big opportunity for generative AI in the world of translation, and a startup called Panjaya is taking the concept to the next level: a hyperrealistic, gen AI-based dubbing tool for videos that re-creates a person’s original voice speaking the new language, with the video and the speaker’s physical movements automatically modified to match the new speech patterns naturally.

After being in stealth for the last three years, the startup is unveiling BodyTalk, the first version of its product, alongside its first outside funding of $9.5 million.

Panjaya is the brainchild of Hilik Shani and Ariel Shalom, two deep learning specialists who have spent the majority of their professional lives quietly working on deep learning technology for the Israeli government and are now respectively the startup’s general manager and CTO. They hung up their G-man hats in 2021 with the startup itch, and 1.5 years ago, they were joined by Guy Piekarz as CEO.

Piekarz is not a founder at Panjaya, but he is a notable name to have onboard: Back in 2013, he sold a startup that he did found to Apple. Matcha, as the startup was called, was an early, buzzy player in streaming video discovery and recommendation, and it was acquired during the very early days of Apple’s TV and streaming strategy, when these were more rumors than actual products. Matcha was bootstrapped and sold for a song: $10 million to $15 million, modest considering the significant push Apple eventually made into streaming media.

Piekarz stayed with Apple for nearly a decade building Apple TV and then its sports vertical. Then, he was introduced to Panjaya through Viola Ventures, one of its backers (others include R-Squared Ventures, JFrog co-founder and CEO Shlomi Ben Haim, Chris Rice, Guy Schory, Ryan Floyd of Storm Ventures, Ali Behnam of Riviera Partners, and Oded Vardi).

“I had left Apple by then and was planning to do something completely different,” Piekarz said. “However, seeing a demo of the tech blew my mind, and the rest is history.”

BodyTalk is interesting for how it simultaneously brings several pieces of technology that play on different aspects of synthetic media into the frame.

It starts with audio-based translation that currently can offer translations in 29 languages. The translation is then spoken in a voice that mimics the original speaker, which in turn is set to a version of the original video where the speaker’s lips and other movements get modified to fit the new words and phrasing. All this is created automatically on videos after users upload them to the platform, which also comes with a dashboard that includes further editing tools. Future plans include an API, as well as getting closer to real-time processing. (Right now, BodyTalk is “near real-time,” taking minutes to process videos, Piekarz said.)

“We’re using best of breed where we need to,” Piekarz said of the company’s use of third-party large language models and other tools. “And we’re building our own AI models where the market doesn’t really have a solution.”

An example of that is the company’s lip syncing, he continued. “Our whole lip sync engine is homegrown by our AI research team, because we haven’t found anything that gets to that level and quality of multiple speakers, angles, and all the business use cases we want to support.”

Its focus for the moment is just on B2B; clients include JFrog and the TED media organization. The company has plans to expand further in media, specifically in areas like sports, education, marketing, healthcare, and medicine.

The resulting translation videos are very uncanny, not unlike what you get with deepfakes, although Piekarz winces at that term, which has picked up negative connotations over the years that are the exact opposite of the market the startup is targeting.

“‘Deepfake’ is not something that we’re interested in,” he said. “We’re looking to avoid that whole name.” Instead, he said, think of Panjaya as part of the “deep real category.”

By aiming just for the B2B market, and controlling who gets to access its tools, the company is creating “guardrails” around the technology to protect from misuse, he added. He also thinks that longer term there will be more tools built, including watermarking, to help detect when any videos have been modified to create synthetic media, both legit and nefarious. “We definitely want to be a part of that and not allow misinformation,” he said.

The not-so-fine print

There are a number of startups that compete with Panjaya in the wider area of AI-based translation for videos, including big names like Vimeo and Eleven Labs, as well as smaller players like Speechify and Synthesis. For all of them, building ways to improve how dubbing works feels a little like swimming against a strong tide. That is because captions have become a very standard part of how video is consumed these days.

On TV, it’s for a litany of reasons like poor speakers, background noise in our busy lives, mumbling actors, limited production budgets, and more sound effects. CBS found in a poll of American TV viewers that more than half of them kept subtitles on “some (21%) or all (34%) of the time.”

But some love captions just because they are entertaining to read, and there’s been a whole cult built around that.

On social media and other apps, subtitles are simply baked into the experience. TikTok, as one example, began turning on captions by default on all videos in November 2023.

All the same, there remains a huge market internationally for dubbed content, and even if English is often thought of as the lingua franca of the internet, there is evidence from research groups like CSA that content delivered in native languages gets better engagement, especially in the B2B context. Panjaya’s pitch is that more natural native-language content could do even better.

Some of its customers appear to support that theory. TED says that Talks dubbed using Panjaya’s tooling have seen views increase by 115%, with completion rates doubling for those translated videos.

This article was originally published by Techcrunch.com.