The progress of a neural network that is learning how to generate Jimmy Fallon and John Oliver’s faces.

Exploring DeepFakes

Gaurav Oberoi
Mar 5, 2018


In December 2017, a user named “DeepFakes” posted realistic-looking explicit videos of famous celebrities on Reddit. He generated these fake videos using deep learning, the latest in AI, to insert celebrities’ faces into adult movies.

In the following weeks, the internet exploded with articles about the dangers of face swapping technology: harassing innocents, propagating fake news, and hurting the credibility of video evidence forever.

It’s true that bad actors will use this technology for harm; but given that the genie is out of the bottle, shouldn’t we pause to consider what else DeepFakes could be used for?

In this post, I explore the capabilities of this tech, describe how it works, and discuss potential applications.

But first: what is DeepFakes and why does it matter?

DeepFakes offers the ability to swap one face for another in an image or a video. Face swapping has been done in films for years, but it required skilled video editors and CGI experts to spend many hours to achieve decent results.

The new breakthrough is that, using deep learning techniques, anybody with a powerful GPU and training data can create believable fake videos.

This is so remarkable that I’m going to repeat it: anyone with hundreds of sample images of person A and person B can feed them into an algorithm and produce high-quality face swaps; no video editing skills are needed.

This also means that it can be done at scale, and given that so many of us have our faces online, it’s trivially easy to insert almost anyone into fake videos. Scary, but hopefully it’s not all doom and gloom; after all, we as a society have already come to accept that photos can easily be faked.

Generating faces is not new: in Star Wars: Rogue One, an actress, along with some CGI magic, was used to recreate a young Carrie Fisher. The dots helped map her face accurately (learn more here). This is not DeepFakes, but it is the sort of thing DeepFakes lets you do without any skill.

Exploring what DeepFakes is capable of

Before dreaming up how to use this tech, I wanted to get a handle on how it works and how well it performs.

I picked two popular late night TV hosts, Jimmy Fallon and John Oliver, because I can find lots of videos of them with similar poses and lighting — and also enough variation (like lip sync battles) to keep it interesting.

Luckily for me, there’s an active GitHub repo that contains the original DeepFakes code and many more improvements. It’s fairly straightforward to use, but the onus is still on the user to collect and prepare training data.

To make experimentation easy, I wrote a script to work directly with YouTube videos. This makes collecting and preprocessing training data painless, and converting videos a one-step process. Click here to view my GitHub repo, and see how easily I generated the videos below (I also share my model weights).

Easy peasy: my script faceit.py lets you specify a list of YouTube videos for each person. Then run commands to preprocess (download videos from YouTube, extract frames, find faces), train, and convert videos with audio and options to resize, show side-by-side, etc.
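To make the preprocessing step concrete, here is a minimal sketch of what “download a video, sample frames, and keep only the frames that contain a face” can look like. This is not the actual faceit.py implementation; it assumes the pytube and opencv-python packages, uses a basic Haar cascade as a stand-in face detector, and the sampling rate and file layout are purely illustrative.

```python
# Minimal sketch of the preprocessing step: download a YouTube video,
# sample frames, and keep only cropped faces. Not the real faceit.py code;
# detector, sampling rate, and file layout are illustrative assumptions.
import cv2
from pytube import YouTube

def extract_faces(youtube_url, out_dir, every_n_frames=5):
    # Download the video to a local mp4 file.
    stream = YouTube(youtube_url).streams.filter(
        progressive=True, file_extension="mp4").first()
    video_path = stream.download(output_path=out_dir)

    # A simple Haar cascade face detector (a stand-in for whatever
    # detector the DeepFakes tooling actually uses).
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    capture = cv2.VideoCapture(video_path)
    frame_idx, saved = 0, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_idx % every_n_frames == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, 1.3, 5)
            # Save a crop for each detected face; frames with no face are skipped.
            for (x, y, w, h) in faces:
                crop = cv2.resize(frame[y:y + h, x:x + w], (256, 256))
                cv2.imwrite(f"{out_dir}/face_{saved:06d}.jpg", crop)
                saved += 1
        frame_idx += 1
    capture.release()
    return saved
```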

Results: Jimmy Fallon to John Oliver

The following videos were generated by training a model on about 15k images of each person’s face (30k images total). I got faces for each celebrity from 6–8 YouTube videos of 3–5 minutes each, sampling 20 frames per second from each video and filtering out frames that didn’t have their faces present. All of this was done automatically; all I did was specify a list of YouTube video URLs.

The total training time was about 72 hours on an NVIDIA GTX 1080 Ti GPU. Training is primarily constrained by GPU compute, but downloading videos and chopping them into frames is I/O bound and can be parallelized.

Note that while I had thousands of images of each person, decent face swaps can be achieved with as few as 300 images. I went this route because I pulled face images from videos, and it’s far easier to pick a handful of videos as training data than to find hundreds of images.

The images below are low resolution to keep the size of the animated GIF file small. There’s a YouTube video below with higher resolution and sound.

Oliver performing Iggy Azalea’s song Fancy. The algorithm does ok even with the microphone in the way.
Here’s Oliver hosting Fallon’s show. Note that the face gets glasses, but hair and face shape are untouched.
Nice pep talk for the Olympics coach. The algorithm has learned enough to make Oliver’s “DO IT!” look real.
Please don’t watch this unless you want to see John Oliver dancing to bumpin’ music.

While not perfect, the results above are quite convincing. The key thing to remember is: the algorithm learned how to do this by seeing lots of examples; I didn’t modify the videos in any way. Magical? Let’s look under the covers.

How does it work?

At the core of the Deepfakes code is an autoencoder, a deep neural network that learns how to take an input, compress it down into a small representation or encoding, and then regenerate the original input from this encoding.

In this standard autoencoder setup, the network is trying to learn how to create an encoding (the bits in the middle), from which it can regenerate the original image. With enough data, it will learn how to do this.

Putting a bottleneck in the middle prevents the network from simply passing through what it sees and forces it to genuinely recreate the image from a compact encoding. The encodings help it capture broader patterns, hypothetically, like how and where to draw Jimmy Fallon’s eyebrow.
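As a rough illustration, here is what a bare-bones autoencoder looks like in code. This is a PyTorch sketch with made-up layer sizes, not the architecture from the DeepFakes repo (which is a larger convolutional model); it just shows the encoder, the bottleneck, and the decoder that is trained to reconstruct the input.

```python
# A bare-bones autoencoder in PyTorch. The real DeepFakes model is a larger
# convolutional network; the layer sizes here are illustrative only.
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, image_dim=64 * 64 * 3, bottleneck=512):
        super().__init__()
        # Encoder: squeeze the image down to a small encoding.
        self.encoder = nn.Sequential(
            nn.Linear(image_dim, 2048), nn.ReLU(),
            nn.Linear(2048, bottleneck),
        )
        # Decoder: regenerate the image from that encoding.
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 2048), nn.ReLU(),
            nn.Linear(2048, image_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        # Training minimizes the difference between x and this reconstruction.
        return self.decoder(self.encoder(x))
```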

Deepfakes goes further by having one encoder to compress a face into an encoding, and two decoders, one to turn it back into person A (Fallon), and the other to person B (Oliver). It’s easier to understand with a diagram:

There is only one encoder that is shared between the Fallon and Oliver cases, but the decoders are different. During training, the input faces are warped, to simulate the notion of “we want a face kind of like this”.

In the above, we’re showing how these three components get trained (a minimal code sketch follows the list):

  1. We pass in a warped image of Fallon to the encoder and try to reconstruct Fallon’s face with Decoder A. This forces Decoder A to learn how to create Fallon’s face from a noisy input.
  2. Then, using the same encoder, we encode a warped version of Oliver’s face and try to reconstruct it using Decoder B.
  3. We keep doing this over and over again until the two decoders can create their respective faces, and the encoder has learned how to “capture the essence of a face” whether it be Fallon’s or Oliver’s.
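Here is a hedged sketch of that training loop, again in PyTorch rather than the original code. The layer sizes, `warp`, `num_steps`, and the batch iterators are hypothetical stand-ins; the point is the structure: one shared encoder, two decoders, and two reconstruction losses.

```python
# Sketch of the DeepFakes training setup: one shared encoder, two decoders.
# Illustrative PyTorch, not the original code; warp(), num_steps, and the
# batch iterators are hypothetical stand-ins.
import torch
import torch.nn as nn

def make_encoder(image_dim=64 * 64 * 3, bottleneck=512):
    return nn.Sequential(nn.Linear(image_dim, 2048), nn.ReLU(),
                         nn.Linear(2048, bottleneck))

def make_decoder(image_dim=64 * 64 * 3, bottleneck=512):
    return nn.Sequential(nn.Linear(bottleneck, 2048), nn.ReLU(),
                         nn.Linear(2048, image_dim), nn.Sigmoid())

encoder = make_encoder()     # shared between both people
decoder_a = make_decoder()   # learns to draw Fallon
decoder_b = make_decoder()   # learns to draw Oliver

loss_fn = nn.L1Loss()
params = (list(encoder.parameters()) + list(decoder_a.parameters())
          + list(decoder_b.parameters()))
opt = torch.optim.Adam(params, lr=5e-5)

for step in range(num_steps):                 # num_steps: hypothetical
    fallon = next(fallon_batches)             # batch of Fallon face crops
    oliver = next(oliver_batches)             # batch of Oliver face crops

    # 1. Warped Fallon in -> Decoder A must recover the clean Fallon face.
    loss_a = loss_fn(decoder_a(encoder(warp(fallon))), fallon)
    # 2. Warped Oliver in, same encoder -> Decoder B recovers clean Oliver.
    loss_b = loss_fn(decoder_b(encoder(warp(oliver))), oliver)

    # 3. Repeat until both decoders draw their person well and the shared
    #    encoder has learned to "capture the essence of a face".
    opt.zero_grad()
    (loss_a + loss_b).backward()
    opt.step()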

Once training is complete, we can perform a clever trick: pass in an image of Fallon into the encoder, and then instead of trying to reconstruct Fallon from the encoding, we now pass it to Decoder B to reconstruct Oliver.

This is how we run the model to generate images. The encoder captures the essence of Fallon’s face and gives it to Decoder B, which says “ah, another noisy input, but I’ve learned how to turn this into Oliver… voila!”
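Continuing the same hypothetical sketch, the conversion step is just a re-wiring of those pieces:

```python
# The swap: encode a Fallon face, then hand the encoding to Decoder B,
# which only knows how to draw Oliver. Reuses the hypothetical encoder and
# decoder_b from the training sketch; load_face is a stand-in loader.
import torch

with torch.no_grad():
    fallon_face = load_face("fallon_frame.jpg")     # hypothetical helper
    fake_oliver = decoder_b(encoder(fallon_face))   # "...voila, Oliver!"
```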

It’s remarkable to think that the algorithm can learn how to generate these images just by seeing thousands of examples, but that’s exactly what has happened here, and with fairly decent results.

Limitations and learnings

While the results are exciting, there are clear limitations to what we can achieve with this technology today:

  • It only works if there are lots of images of the target: to put a person into a video, on the order of 300–2000 images of their face are needed so the network can learn how to recreate it. The number depends on how varied the faces are, and how closely they match the original video.
    This works fine for celebs, or anyone with lots of photos online. But clearly, this won’t let you build a product that can swap anybody’s face.
  • You also need training data that is representative of the goal: the algorithm isn’t good at generating profile shots of Oliver, simply because it didn’t have many examples of Oliver looking to the side. In general, training images of your target need to approximate the orientation, facial expressions, and lighting in the videos you want to stick them into.
    So if you’re building a face swap tool for the average person, given that most photos of them will be front-facing (e.g., selfies on Instagram), limit face swaps to mostly forward-facing videos. If you’re working with a celeb, it’s easier to get a diverse set of images. And if your target is helping you create data, go for quantity and variety so you can insert them into anything.
There just weren’t that many images of Oliver looking to the side, so the network was never able to learn by example and generate profile poses.
  • Building models can get expensive: it took 48 hours to get ok results, and 72 to get the fair ones above. At about $0.50 a GPU hour, it will cost $36 to build a model just for swapping person A to B and vice versa, and that doesn’t include all the bandwidth needed to get training data, and CPU and I/O to preprocess it. The biggest issue: you need one model for every pair of people, so work invested in a single model doesn’t scale to different faces. (Note: I used a single GPU machine; training could be parallelized across GPUs.) → This high model creation cost makes it hard to create a free or cheap app in the hopes it will go viral. Of course, it’s not an issue if customers are willing to pay to generate models.
  • Running models is fairly cheap, but not free: it takes about 5–20X the duration of a video to create a swap when running on a GPU, e.g. a 1 minute 1080p video takes about 18 minutes to generate. A GPU helps speed up the core model, and also the face detection code (i.e. is there a face in this frame that I even need to swap?). I haven’t tried conversion on just a CPU, but I’d venture it would be much slower. → There are a lot of inefficiencies in the current conversion process: waiting on I/O when reading and writing frames, no batching of frames sent to the GPU, no parallelism, etc. (see the batching sketch after this list). This can be made to work much faster, and possibly with a time and cost tradeoff that allows CPUs to be just fine for conversion.
  • Reusing models reduces training time, and therefore cost: if you use the Fallon-to-Oliver model to convert, say, Jimmy Kimmel’s face to John Oliver, the results are typically quite poor. However, if you train a new model with images of Kimmel and Oliver, but start with the Fallon-to-Oliver trained model, it can learn how to perform reasonable conversions in 20–50% of the time (12–36 hours instead of 72+ hours). → If one were to keep a stable of models for different face shapes, and use them as starting points, there may be considerable reductions in time and cost.
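On the batching point above, here is a rough sketch of what grouping frames into GPU batches could look like. swap_model, crop_face, and paste_face are hypothetical helpers, not functions from the DeepFakes code; the idea is simply to amortize GPU calls over many frames.

```python
# Sketch of batched conversion: rather than sending frames to the GPU one
# at a time, group them and run the swap on a whole batch per GPU call.
# swap_model, crop_face, and paste_face are hypothetical helpers.
import torch

def convert_frames(frames, batch_size=32):
    swapped = []
    for i in range(0, len(frames), batch_size):
        batch = frames[i:i + batch_size]
        crops = torch.stack([crop_face(f) for f in batch]).cuda()
        with torch.no_grad():
            new_faces = swap_model(crops)        # one GPU call per batch
        swapped.extend(paste_face(f, face)       # put each swapped face back
                       for f, face in zip(batch, new_faces.cpu()))
    return swapped
```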

These are tractable problems, to be sure: tools can be built to collect images from online channels en masse; algorithms can help flag when there is insufficient or mismatched training data; clever optimizations or model reuse can help reduce training time; and a well-engineered system can be built to make the entire process automatic.

But ultimately, the question is: why? Is there enough of a business model to make doing all this worth it?

So what potential applications are there?

Given what we’ve now learned about what’s possible, let’s talk about ways in which this could be useful:

Video Content Production

Hollywood has had this technology at its fingertips, but not at this low cost. If studios can create great-looking videos with this technique, it will change the demand for skilled editors over time.

But it could also open up new opportunities: for instance, making movies with unknown actors, and then superimposing famous celebrities onto them. This could work for YouTube videos or even news channels filmed by regular folks.

In more out-there scenarios, studios could change actors based on their target market (more Schwarzenegger for the Austrians), or Netflix could allow viewers to pick actors before hitting play. More likely, this tech could generate revenue for the estates of long-dead actors by bringing them back to life.

Deepfakes is the tip of the iceberg where methods of video generation are concerned. The video above demonstrates an approach to constructing a controllable 3D model of a person from a photo collection (2015 paper). The lead author of that paper has other exciting work worth looking at; currently he’s at Google Brain.

Social Apps

Some of the comment threads on DeepFakes videos on YouTube are abuzz about what a great meme generator this technology could be. JibJab is a company that has been selling video greeting cards with simple face swapping for years (they are hilarious). But the big opportunity is to create the next big viral hit; after all, photo filters attracted masses of people to Instagram and Snapchat, and face swapping apps have done well before.

Given how fun the results can be, there’s likely room for a hit viral app if you can get the costs low enough to generate these models.

StarGAN is a research paper that demonstrates generating different hair colors, gender, age, and even emotional expression using just an algorithm. I bet an app that lets you generate pouty faces could go viral.

Licensing Celebrity Faces

Imagine if Target could have a celebrity showcase their clothes for a month, just by paying their agent a fee, grabbing some existing headshots, and clicking a button. This would create a new revenue stream for celebrities, social media influencers, or anyone who happens to be in the spotlight at the moment. And it would give businesses another tool to promote brands and drive conversion. It also raises interesting legal questions about ownership of likeness, and business model questions about how to partition and price the rights to use it.

Licensing faces is already happening. Looklet is a business that allows apparel companies to: (1) photograph their apparel on a mannequin, (2) pick accompanying clothes, (3) pick a model’s face and pose, and voila, have a glossy image ready for marketing. What’s more: they can restyle at will without models or photographers.

Personalized Advertising

Imagine a world where the ads you see as you surf the web include you, your friends, and your family. While this may come across as creepy today, is it so far-fetched to think that this could be the norm in a few years?

After all, we are visual creatures, and advertisers have been trying to elicit emotional responses from us for years, e.g. Coke may want to convey joy by putting your friends in a hip music video, or Allstate may tug at your fears by showing your family in an insurance ad. Or the approach may be more direct: Banana Republic could superimpose your face on a body type that matches yours, and convince you that it’s worth trying out their new leather jackets.

Conclusion

Whoever the original Deepfakes user is, they opened a Pandora’s box of difficult questions about how fake video generation will affect society. I hope that in the same way we have come to accept that images can easily be faked, we will adapt to video uncertainty too, though not everyone shares this hope.

What Deepfakes also did is shine a light on how interesting this technology is. Deep generative models, like the autoencoder that Deepfakes uses, allow us to create synthetic but realistic-looking data (including images or videos) simply by showing an algorithm lots of examples. This means that once these algorithms are turned into products, regular folks will have access to powerful tools that will make them more creative, hopefully towards positive ends.

There have already been some interesting applications of this technique, like style transfer apps that make your photos look like famous paintings, but given the high volume and exciting nature of the research that is being published in this space, there’s clearly a lot more to come.

I’m interested in exploring how to build value from the latest in AI research; if you have an interest in taking this technology to market to solve a real problem, please drop me a note.

Examples of results from the well-known CycleGAN paper (ICCV 2017), which show an algorithm learning to turn zebras into horses, winter to summer, and regular photos into famous impressionist paintings.

Appendix

A few fun tidbits for the curious:


I’ve been a product manager, engineer, and founder for over a decade in Seattle and Silicon Valley. Currently exploring new ideas at the Allen Institute for AI.