EMO: Alibaba’s Diffusion Model-Based Talking Portrait Generator

Alibaba’s EMO (or Emote Portrait Alive) framework is a recent entry in a series of attempts to generate a talking head using existing audio (spoken word or vocal audio) and a reference portrait image as inputs. At its core it uses a diffusion model that is trained on 250 hours of video footage and over 150 million images. But unlike previous attempts, it adds what the researchers call a speed controller and a face region controller. These serve to stabilize the generated frames, alongside an additional module that stops the diffusion model from outputting frames which stray too far from the reference image used as input.

In the related paper, [Linrui Tian] and colleagues show a number of comparisons between EMO and other frameworks, claiming significant improvements over them. The researchers also provide a number of examples of talking and singing heads generated using the framework, which gives some idea of what are probably the ‘best case’ outputs. In some examples, like [Leslie Cheung Kwok Wing] singing ‘Unconditional’, big glitches are obvious and there’s a definite mismatch between the vocal track and the facial motions. Despite this, it’s quite impressive, especially with the fairly realistic movement of the head, including blinking of the eyes.

Meanwhile some seem extremely impressed, such as [Matthew Berman] in a recent video on EMO, in which he states that Alibaba releasing this framework to the public might be ‘too dangerous’. The level-headed folks over at PetaPixel, however, also note the obvious visual imperfections that are a dead giveaway for this kind of generative technology. Much like other diffusion model-based generators, EMO would seem to be still very much stuck in the uncanny valley, with no clear path to becoming a real human yet.

Continue reading “EMO: Alibaba’s Diffusion Model-Based Talking Portrait Generator”

A Wireless Monitor Without Breaking The Bank

The quality of available video production equipment has increased hugely as digital video and then high-definition equipment have entered the market. But some components are still expensive, one of them being a decent-quality HD wireless monitor. Along comes [FuzzyLogic] with a solution, in the form of an external monitor for a laptop, driven by a wireless HDMI extender.

In one sense this project involves little more than plugging in a series of components and using them for their intended purpose. But it goes further than that: some rather useful 3D-printed parts turn it into a truly portable wireless monitor, and it saves the rest of us the gamble of buying a wireless HDMI extender without knowing whether it would deliver.

He initially tried an HDMI-to-USB dongle and a streaming Raspberry Pi, but the latency was far too high to be useful. The extender does have a small delay, but not so much as to be unusable. The whole setup, including the monitor, can be powered from a large USB power bank, answering one of our questions. All the files can be downloaded from Printables should you wish to follow the same path, and meanwhile there’s a video with the details below the break.

Continue reading “A Wireless Monitor Without Breaking The Bank”

What If The Matrix Was Made In The 1950s?

We’ve noticed a recent YouTube trend of producing trailers for shows and movies as if they had been made in the 1950s, even when they weren’t. The results are impressive and, as you might expect, leverage AI generation tools. While we enjoy watching them, we were especially interested in [Patrick Gibney’s] peek behind the curtain of how he makes them, as you can see below. If you want to see an example of the result first, check out the second video, showing a 1950s-era The Matrix.

Of course, you could do some of it yourself, but if you want the full AI experience, [Patrick] suggests using ChatGPT to produce a script, though he admits that if he did that, he would tweak the results. Other AI tools create the pictures used and the announcer-style narration. Another tool produces cinematographic shots that include the motion of the “actors” and other things in the scene. More tools create the background music.

Continue reading “What If The Matrix Was Made In The 1950s?”

Interfacing A Cheap HDMI Switch With Home Assistant

You know the feeling of having just created a perfect setup for your hacker lab? Sometimes, there’s just this one missing piece in the puzzle that requires you to do a small hack, and those are the most tempting. [maxime borges] has such a perfect setup involving an HDMI 4:2 switch, and he brings us a write-up on integrating that HDMI switch into Home Assistant by emulating an infrared receiver’s signals.

The HDMI switch is equipped with an infrared sensor as its only means of control, so naturally, that was the path chosen for interfacing with the ESP32 placed inside the switch. Fortunately, Home Assistant provides the means to both receive and output IR signals, so after capturing all the codes produced by the IR remote, parsing their meaning, and turning them into a Home Assistant configuration, [maxime] got HDMI input switching to happen from the comfort of his phone.
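As a rough illustration of the approach (and not [maxime]’s actual configuration), here’s a minimal ESPHome-style sketch of the idea: the ESP32 exposes template buttons to Home Assistant, and each button replays a captured NEC code. The board, GPIO pin, and IR codes below are placeholders you’d swap for whatever your own remote turns out to send.

```yaml
# Hypothetical ESPHome config; board, GPIO, and NEC codes are placeholders.
esphome:
  name: hdmi-switch

esp32:
  board: esp32dev

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password

api:  # exposes the buttons below to Home Assistant

remote_transmitter:
  pin: GPIO4                  # wired into the switch, where the IR receiver's output goes
  carrier_duty_percent: 100%  # no 38 kHz carrier needed when injecting after the receiver

button:
  - platform: template
    name: "HDMI Input 1"
    on_press:
      - remote_transmitter.transmit_nec:
          address: 0xFF00     # placeholder: use the address captured from the remote
          command: 0xFB04     # placeholder: use the command captured for input 1
  - platform: template
    name: "HDMI Input 2"
    on_press:
      - remote_transmitter.transmit_nec:
          address: 0xFF00
          command: 0xFA05
```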

We get the Home Assistant config snippets right there in the blog post — if you’ve been looking for an HDMI switch for your hacker lair, now you have one model to look out for in particular. Of course, you could roll your own HDMI switch, and if you’re looking for references, we’ve covered a good few hacks doing that as part of building a KVM.

Unraveling The Secrets Of Apple’s Mysterious Fisheye Format

Apple has developed a proprietary — even mysterious — “fisheye” projection format used for their immersive videos, such as those played back by the Apple Vision Pro. What’s the mystery? The fact that they stream their immersive content in this format but have provided no elaboration, no details, and no method for anyone else to produce or play back this format. It’s a completely undocumented format and Apple’s silence is deafening when it comes to requests for, well, anything to do with it whatsoever.

Those details are probably forthcoming eventually, but [Mike Swanson] isn’t satisfied to wait. He’s done his own digging into the format, and while he hasn’t figured it out completely, he has learned quite a bit and written it all up in a blog post. Apple’s immersive videos have a lot in common with VR180-type videos, but under the hood there is more going on. Apple’s stream is DRM-protected, but there’s an intro clip with the Apple logo that is streamed in the clear, and that’s what [Mike] has been focusing on.

Most “fisheye” formats are mapped onto square frames in a way similar to what’s seen here, but this is not what Apple is doing.

[Mike] has been able to determine that the format definitely differs from existing fisheye formats recorded by immersive cameras. First of all, the content is rotated 45 degrees. This spreads the horizon of the video across the diagonal, maximizing the number of pixels available in that direction (a trick that calls to mind the tilted heads in home video recorders, which increase the area of tape they can “see” beyond the physical width of the tape itself). Doing this also spreads the center-vertical axis of the content across the other diagonal, with the same effect.

There’s more to it than just a 45-degree rotation, however. The rest most closely resembles radial stretching, a form of disc-to-square mapping. It’s close, but [Mike] can’t quite find a complete match for exactly what Apple is doing. We’ll probably all learn more soon, but for now Apple isn’t saying much.
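To make the “rotate 45 degrees, then radially stretch” idea concrete, here’s a small Python sketch of what a decoder built on that hypothesis might look like: it builds a normalized sampling grid that maps each pixel of a reconstructed fisheye disc back into the encoded square frame. To be clear, this is just one plausible reading of [Mike]’s findings, not Apple’s actual mapping, which remains unknown.

```python
import numpy as np

def disc_to_square(u, v):
    """Radial stretch: push points of the unit disc out to the unit square,
    so a circular fisheye image fills the whole square frame."""
    r = np.hypot(u, v)
    m = np.maximum(np.abs(u), np.abs(v))
    scale = np.where(m > 0, r / m, 1.0)  # guard the centre pixel against 0/0
    return u * scale, v * scale

def sample_grid(size, rotate_deg=45.0):
    """For each pixel of a reconstructed fisheye disc, return normalized
    [-1, 1] coordinates of where to sample in the encoded square frame,
    assuming a 'rotate 45 degrees, then radially stretch' encoding."""
    lin = np.linspace(-1.0, 1.0, size)
    u, v = np.meshgrid(lin, lin)          # target (un-rotated) disc coordinates
    a = np.deg2rad(rotate_deg)
    ur = u * np.cos(a) - v * np.sin(a)    # re-apply the 45 degree rotation...
    vr = u * np.sin(a) + v * np.cos(a)
    return disc_to_square(ur, vr)         # ...then the disc-to-square stretch

# Rescale the returned grid to pixel indices and feed it to e.g. cv2.remap();
# coordinates outside [-1, 1] fall outside the source disc and should be masked.
```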

Both VR180 video and Apple’s immersive format display stereoscopic video that allows a user to look around naturally within a scene. But really delivering a deeper sense of presence and depth takes light fields.

Analyzing The Code From The Terminator’s HUD

The T-800, also known as the Terminator, was like some kind of non-giving-up robot guy. The robot assassin viewed the world through a tinted display, with lines of code scrolling by all the while. It was cinematic shorthand to tell the audience they were looking through the eyes of a machine. Now, a YouTuber called [Open Source] has analyzed that code.

The video highlights some interesting finds concerning the graphics seen in the T-800’s vision. They appear to match the output of various code listings and articles in Nibble Magazine, specifically its September 1984 issue. One example spotted was a compass rose, spawned from an Apple BASIC listing; it was a simple quiz to help teach children to understand the compass. Another graphic appears to be cribbed from the MacPaint Patterns section of the same issue.

The weird thing is that the original film came out in October 1984 — just a month after that issue would have hit the newsstands. It suggests that someone involved with the movie was perhaps also involved with the magazine or had access to an early copy of Nibble — or that the examples in the magazine were simply rehashed from some other, earlier source.

Code that regularly flickers on the left of the T-800’s vision is just 6502 machine code. It’s apparently a random hexdump from an Apple II’s memory. At other times, there’s also 6502 assembly code on screen, with various programmer comments still intact. There’s even some code cribbed from the Apple II DOS 3.3 RAM Disk driver.

It’s neat to see someone actually track down the background of these classic graphics. Hacking and computers are usually portrayed in a fairly unrealistic way in movies, and it’s no different in The Terminator (1984). Still, that doesn’t mean the movies aren’t fun!

Continue reading “Analyzing The Code From The Terminator’s HUD”

Bye Bye Green Screen, Hello Monochromatic Screen

It’s not uncommon in 2024 to have some form of green background cloth for easy background effects when on a Zoom call or similar. This is a technology TV and film studios have used for decades, and it’s responsible for many of the visual effects we see every day on our screens. But it’s not perfect — its use precludes wearing anything green, and it copes very badly with anything transparent.

The Disney filmmakers of the 1960s seemingly had no problem with this, as anyone who has seen Mary Poppins will tell you, so how did they manage to overlay actors wearing diaphanous accessories onto animation? The answer lies in an innovative process which has largely faded from view, and which [Corridor Crew] have rebuilt.

Green screen, or chroma key, to give the effect its proper name, relies on the background being a colour not present in the main subject of the shot. That colour can then be detected electronically or in software, and a switch made between the shot and the inserted background. It’s good at picking out clean edges between the green background and the subject, but poor at handling transparency such as a veil or a bottle of water. The Disney process instead used a background illuminated with monochromatic sodium light behind a subject lit with white light, allowing separate background and foreground images to be filmed simultaneously using two cameras and a dichroic beam splitter. The background image, with its black silhouette of the subject, could then be used as a photographic stencil when overlaying the foreground footage onto a new background.
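On the software side, a bare-bones chroma key is little more than a per-pixel colour-distance test against the key colour, as in the Python sketch below (the key colour and threshold are illustrative placeholders rather than values from the video). The hard, binary matte it produces is exactly why veils and bottles of water give green screens so much trouble.

```python
import numpy as np

def chroma_key(frame, background, key_rgb=(0, 177, 64), threshold=80.0):
    """Naive chroma key: pixels close to the key colour are swapped for the
    inserted background, everything else keeps the foreground.

    frame, background: (H, W, 3) arrays of the same shape.
    key_rgb, threshold: placeholder values; real keyers work in a chroma
    space and produce a soft matte rather than this hard cut.
    """
    diff = frame.astype(np.float32) - np.array(key_rgb, dtype=np.float32)
    distance = np.linalg.norm(diff, axis=-1)            # per-pixel colour distance
    matte = (distance > threshold).astype(np.float32)   # 1 = keep foreground
    matte = matte[..., None]                            # broadcast over RGB
    return frame * matte + background * (1.0 - matte)
```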

Sadly, even Disney found it very difficult to make more than a few of the dichroic prisms, so the much cheaper green screen won the day. But in the video below the break, they manage to replicate the process with a standard beam splitter and a pair of filters, successfully filming a colourful clown wearing a veil, and one of them waving their hair around while drinking from a bottle of water. It may not find its way back into blockbuster films just yet, but it’s definitely impressive to see in action.

Continue reading “Bye Bye Green Screen, Hello Monochromatic Screen”