Pix2Pix: How the Cat Generator Works. The Pix2pix Neural Network Realistically Colorizes Pencil Sketches and Black-and-White Photos

Neural City

Another experiment from Mario was training the generator on YouTube videos of Francoise Hardy, then tracking the face of KellyAnne Conway while she explained “alternative facts,” and generating images of Francoise Hardy from those found landmarks, effectively making Francoise Hardy pantomime the facial gestures of KellyAnne Conway.

Riffing on this technique, I made a webcam-enabled pix2pix trained on pictures of Donald Trump’s face giving a speech. The real-time application was run live for a workshop.

Person-to-person pix2pix

Using pix2pix

The following is a tutorial for how to use the tensorflow version of pix2pix. If you wish, you can also use the original torch-based version or a newer pytorch version, which also contains a CycleGAN implementation. Although these instructions are for the tensorflow version, they should carry over to the others with only minor changes in syntax. You should be able to use any of the versions and get similar results.

In using pix2pix, there are two modes. The first is training a model from a dataset of known samples, and the second is testing the model by generating new transformations from previously unseen samples.

Training pix2pix means creating, learning the parameters for, and saving the neural network which will convert an image of type X into an image of type Y. In most of the examples that we talked about, we assumed Y to be some “real” image of dense content and X to be a symbolic representation of it. An example of this would be converting images of lines into satellite photographs. This is useful because it allows us to generate sophisticated and detailed imagery from quick and minimal representations. The reverse is possible as well: we can train a network to convert the real imagery into its corresponding symbolic form. This can be useful for many practical tasks, for example automatically finding and labeling roads and infrastructure in satellite images.

Once a model has been trained, we use testing mode to output new samples. For example, if we trained X->Y where X is a symbolic form and Y is the real form, then we generate Y images from previously unseen symbolic X images.

Installation

In order to run the software on your machine, you need an NVIDIA GPU which is supported by CUDA; NVIDIA publishes a list of supported devices. At least 2GB of VRAM is recommended, although realistically, with less than, say, 4GB, you may have to produce smaller-sized samples. If you have an older laptop, consider using a cloud-based platform instead (todo: make a guide about cloud platforms).

Install CUDA

Once you have successfully run the installer for CUDA, you can find it on your system in the following locations:

Mac/Linux: /usr/local/cuda/
Windows: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0

In order for your system to find CUDA, its directories have to be listed in your PATH variable, and in LD_LIBRARY_PATH on Mac/Linux. The installer should do this automatically.

Install cuDNN

cuDNN is an add-on for CUDA which specifically implements operations for deep neural nets. It is not required to run Tensorflow but is highly recommended, as it makes many of the programs much more resource-efficient, and is probably required for pix2pix.

You have to register first with NVIDIA (easy) to get access to the download. You should download the latest version for your platform unless you have an older version of CUDA. At time of this writing, cuDNN 5.1 works with CUDA 8.0+, although cuDNN 6.0+ should work as well.

Once you’ve downloaded cuDNN, you will see it contains several files in folders called “include”, “lib” or “lib64”, and “bin” (if on Windows). Copy these files into the folders of the same names inside your CUDA installation (see step 1).

Install tensorflow

Follow tensorflow’s instructions for installation for your system. In most cases this can be done with pip.

Install pix2pix-tensorflow

Clone or download the above library. It is possible to do all of this with the original torch-based pix2pix (in which case you have to install torch instead of tensorflow for step 3). These instructions will assume the tensorflow version.

Training pix2pix

First we need to prepare our dataset. This means creating a folder of images, each of which contains an X/Y pair. The X and Y images each occupy half of the full image, so they are the same size. It does not matter which one goes on which side, as you will define the direction in the training command (just remember which way you chose, because you need to be consistent).

Additionally, by default the images are assumed to be square (if they are not, they will be squashed into square input and output pairs). It is possible to use rectangular images by setting the aspect_ratio argument (see the optional arguments below for more info).
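A minimal sketch of how such paired images might be assembled, assuming each X and Y image has already been loaded as a NumPy array of identical shape (height, width, channels). The helper name `make_pair` is hypothetical and is not part of pix2pix-tensorflow; reading and writing the actual files is left to whatever image library you prefer.

```python
import numpy as np

def make_pair(a, b):
    """Place image a on the left and image b on the right of one combined image."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        raise ValueError("both halves must have the same shape, got %s and %s"
                         % (a.shape, b.shape))
    # Concatenate along the width axis (axis 1 for height x width x channels arrays).
    return np.concatenate([a, b], axis=1)

# Two dummy 256x256 RGB images standing in for a real X/Y pair.
x = np.zeros((256, 256, 3), dtype=np.uint8)
y = np.full((256, 256, 3), 255, dtype=np.uint8)
pair = make_pair(x, y)
print(pair.shape)  # (256, 512, 3)
```

With this layout, A is the left half and B is the right half, so training with --which_direction AtoB would learn to turn x into y.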

Then we need to open up a terminal or command prompt (possibly in administrator mode or using sudo if on linux/mac), navigate (cd) to the directory which contains pix2pix-tensorflow, and run the following command:

python pix2pix.py --mode train --input_dir /Users/Gene/myImages/ --output_dir /Users/Gene/myModel --which_direction AtoB --max_epochs 200

replacing the arguments for your own task. An explanation of the arguments:

--mode : this must be “train” for training mode. In testing mode, we will use “test” instead.

--input_dir : directory from which to get the training images

--output_dir : directory to save the model (aka checkpoint)

--which_direction : AtoB means train the model such that the left half of your training images is the input and the right half is the output (generated). BtoA is the reverse.

--max_epochs : number of epochs (passes over the data) to train, i.e. how many times each image in your training set is passed through the network in training. In practice, more is usually better, but pix2pix may stop learning after a small number of epochs, in which case training runs longer than you need. It is also sometimes possible to train it too much, as if to overcook it, and get distorted generations. The loss function does not necessarily correspond well to the quality of the generated images (although there is recent research which does create a better equivalency between them).

Note also that the order in which parameters are written in the command does not actually matter. Additionally, there are optional parameters which may be useful:

--checkpoint : a previous checkpoint to start from; it must be specified as the path which contains that model (so it is equivalent to --output_dir). Initially you won’t have one, but if your training is ever interrupted prematurely, or you wish to train for longer, you can initialize from a previous checkpoint instead of starting from scratch. This can be useful for running for a while and checking the quality, and then resuming training for longer if you are unsatisfied.

--aspect_ratio : this is 1 by default (square), but can be used if your images have a different aspect ratio. If for example your images are 450x300 (width is 450), then you can use an aspect_ratio 1.5.

--output_filetype : png or jpg

There are more advanced options, which you can see in the arguments list in pix2pix.py. The adventurous may wish to experiment with these as well.
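As a rough sketch, the training command above can also be assembled from Python, which makes it harder to mistype a flag. The function `build_train_command` is a hypothetical helper, not part of pix2pix-tensorflow; it only uses the flags described above, and the paths are the example paths from this tutorial.

```python
def build_train_command(input_dir, output_dir, direction="AtoB", max_epochs=200,
                        checkpoint=None, aspect_ratio=None):
    """Return the pix2pix.py training command as a list of arguments."""
    if direction not in ("AtoB", "BtoA"):
        raise ValueError("direction must be 'AtoB' or 'BtoA'")
    cmd = ["python", "pix2pix.py",
           "--mode", "train",
           "--input_dir", input_dir,
           "--output_dir", output_dir,
           "--which_direction", direction,
           "--max_epochs", str(max_epochs)]
    # Optional arguments are only appended when they are given.
    if checkpoint is not None:
        cmd += ["--checkpoint", checkpoint]
    if aspect_ratio is not None:
        cmd += ["--aspect_ratio", str(aspect_ratio)]
    return cmd

# Images that are 450 wide and 300 high have an aspect ratio of 450 / 300 = 1.5.
width, height = 450, 300
cmd = build_train_command("/Users/Gene/myImages/", "/Users/Gene/myModel",
                          aspect_ratio=width / height)
print(" ".join(cmd))
# To actually launch it from the pix2pix-tensorflow directory, one could use:
#   import subprocess; subprocess.run(cmd, check=True)
```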

Unfortunately, pix2pix-tensorflow does not currently allow you to change the actual size of the produced samples and is hardcoded to have a height of 256px. Simply changing it in pix2pix.py will result in a shape mismatch. If you wish to generate bigger samples, you can do so using the original torch-based pix2pix, which does have it as a command line parameter, or more adventurously adapt the tensorflow code to arbitrarily sized samples. This will require changing the architecture of the network slightly, perhaps as a function of a --sample_height parameter, which is a good exercise left to the intrepid artist.

In practice, trying to generate larger samples, say 512px, does not always lead to improved results, and may instead look not much better than an upsampled version of the 256px versions, at the cost of requiring significantly more system resources/memory and taking longer to finish training. This will definitely be true if your original images are smaller than the desired size, because subpixel details are not available in the training data, but even if your data is sized sufficiently, it may still occur. Worth trying out, but your results may vary.

Once you run the command, it will begin training and report its progress periodically. It will consume most or all of your system’s resources, so it’s often worth running overnight.

Testing or generating images

The second operation of pix2pix is generating new samples (called “test” mode). If you trained AtoB for example, it means providing new images of A and getting out hallucinated versions of it in B style.

In pix2pix, testing mode is still set up to take image pairs like in training mode, where there is an X and a Y. This is because a useful thing to do is to hold out a test or validation set of images from training, then generate from those so you can compare the constructed Y to the known Y, as a way to visually evaluate the model. pix2pix helpfully creates an HTML page with a row for each sample containing the input, the output (constructed Y) and the target (known/original Y). Several of the downloadable datasets (e.g. facades) are packaged this way. In our case, since we may not have the corresponding Y (after all, that’s the whole point!), a quick workaround is to take each X and convert it into an image twice the width, where one half is the image to use as the input (X) and the other is some blank decoy image. Make sure you are consistent with how it was trained: if you trained with --which_direction AtoB, the blank image is on the right, and with BtoA it is on the left. If the generated HTML page shows the decoy as the “input” and the output is a nondescript “Rothko-like” image, then you accidentally put them in the wrong order.
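The decoy workaround might look roughly like this, assuming the input X is already a NumPy array. The helper `make_test_pair` is hypothetical and not part of pix2pix-tensorflow, and the all-black decoy is just one possible choice of blank image.

```python
import numpy as np

def make_test_pair(x, direction="AtoB"):
    """Pad input x with a blank decoy half so it looks like a training pair."""
    x = np.asarray(x)
    decoy = np.zeros_like(x)  # blank placeholder for the unknown Y
    if direction == "AtoB":
        halves = [x, decoy]   # input on the left, decoy on the right
    elif direction == "BtoA":
        halves = [decoy, x]   # decoy on the left, input on the right
    else:
        raise ValueError("direction must be 'AtoB' or 'BtoA'")
    return np.concatenate(halves, axis=1)

# A dummy 4x4 RGB input standing in for a real X image.
x = np.ones((4, 4, 3), dtype=np.uint8)
print(make_test_pair(x, "AtoB").shape)  # (4, 8, 3)
```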

Once your testing data has been prepared, run the following command (again from the root folder of pix2pix-tensorflow):

python pix2pix.py --mode test --input_dir /Users/Gene/myImages/ --output_dir /Users/Gene/myGeneratedImages --checkpoint /Users/Gene/myModel

An explanation of the arguments:

--mode : this must be “test” for testing mode.

--input_dir : directory from which to get the images to generate

--output_dir : directory to save the generated images and index page

--checkpoint : directory which contains the trained model to use. In most cases, this should simply be the same as the --output_dir argument in the original training command.

todo: example images through training and testing process

todo: make a CycleGAN guide

Four examples of the program at work; its code is publicly available. The source images are on the left, the automatically processed results on the right.

Many tasks in image processing, computer graphics, and computer vision can be reduced to "translating" one image (the input) into another (the output). Just as the same text can be rendered in English or Russian, an image can be represented in RGB colors, as gradients, as an edge map, as a semantic label map, and so on. Following the model of automatic text translation systems, developers at the Berkeley AI Research (BAIR) lab at the University of California, Berkeley built an application that automatically translates images from one representation into another, for example from a black-and-white sketch into a full-color picture.

To the uninitiated, such a program looks like magic, but underneath it is a model called conditional generative adversarial networks (cGAN), a variant of the well-known generative adversarial networks (GAN).

The authors of the paper write that most image translation problems are either "many-to-one" (computer vision: translating photographs into semantic maps, segments, object edges, and so on) or "one-to-many" (computer graphics: translating labels or user input into realistic images). Traditionally, each of these tasks is handled by a separate specialized application. In their work the authors set out to build a single general-purpose framework for all such problems, and they succeeded.

Convolutional neural networks trained to minimize a loss function, that is, a measure of the discrepancy between the true value of the estimated parameter and its estimate, are very well suited to image translation. Although the training itself is automatic, designing a loss that can be minimized effectively still takes considerable manual work. In other words, we still have to tell the network what exactly to minimize. There are many pitfalls here that hurt the result if we work with a low-level loss function such as "minimize the Euclidean distance between predicted and ground-truth pixels": this produces blurry images.

The effect of different loss functions on the result

It would be much simpler to give the network a high-level goal such as "generate an image indistinguishable from reality" and then automatically learn a loss function that best serves that goal. This is exactly how generative adversarial networks (GANs) work, one of the most promising directions in neural network research today. A GAN learns a loss function whose job is to classify an image as "real" or "fake", while simultaneously training a generative model to minimize that loss. Blurry outputs are not tolerated here, because they would not pass as "real".

For this task the developers used conditional generative adversarial networks (cGANs), that is, GANs with a conditioning input. Just as a GAN learns a generative model of the data, a cGAN learns a conditional generative model, which makes it suitable for image-to-image translation.

Translating Cityscapes labels into realistic photographs. Labels on the left, the original in the center, and the generated image on the right.

Over the past two years many applications of GANs have been described and the theory behind them has been studied well. But in all of those works GANs were used only for specialized tasks. It was not entirely clear how well a GAN would work for general image-to-image translation. The main goal of this work is to demonstrate that such a network can handle a wide range of diverse tasks with quite acceptable results.

For example, colorizing black-and-white pencil sketches (left column) looks quite good: from them the network generates photorealistic images (right column). In some cases the network's output seems even more realistic than the real photograph (center column, for comparison).

Translating pencil sketches into realistic photographs. The pencil drawing is on the left, the original in the center, and the generated image on the right.

Translating pencil sketches into realistic photographs

As in other generative adversarial networks, in this GAN two networks fight each other. One of them (the generator) tries to create a fake image in order to fool the other (the discriminator). Over time the generator learns to fool the discriminator better and better, that is, to generate more realistic images. Unlike in ordinary GANs, in Pix2Pix both the discriminator and the generator have access to the source image.

Training a cGAN to predict aerial photographs from maps

Examples of a cGAN translating aerial photographs into maps and vice versa

The paper is openly available, and the Pix2pix source code is on GitHub. The authors invite everyone to try the program.

Disclaimer: this post is based on edited chat logs from closedcircles.com, hence the style of presentation and the clarifying questions.

All of this is an implementation of the paper Image-to-Image Translation with Conditional Adversarial Networks from Berkeley AI Research.

So how does it all work?

In the paper, people solve the task of transforming one picture into another in such a way that a human does not have to come up with a loss function.

One of the main problems with neural networks in image generation is that if you simply use the average per-pixel difference as the loss, for example L1 or L2 (the latter also known as mean squared error), the network tends to average over all possible variants. If the final picture has some uncertainty, say an edge could be at different positions or a color could lie within some range, then the optimal result from the point of view of L2 loss is something in between all the possible cases rather than any specific one of them.

So the pictures come out as very blurry blobs.
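A toy illustration of this averaging effect, under made-up assumptions: a one-dimensional 8-pixel "image" whose single bright edge pixel is equally likely to sit at one of two positions. The arrays and the helper `expected_l2` are invented for this sketch, and it only demonstrates the L2 case.

```python
import numpy as np

# Two equally plausible ground truths: the same edge at two different positions.
target_a = np.array([0, 0, 1, 0, 0, 0, 0, 0], dtype=float)
target_b = np.array([0, 0, 0, 0, 0, 1, 0, 0], dtype=float)

def expected_l2(pred):
    """Mean squared error averaged over both equally likely targets."""
    return (0.5 * np.mean((pred - target_a) ** 2)
            + 0.5 * np.mean((pred - target_b) ** 2))

# The pixel-wise average of both answers: two half-bright pixels, i.e. a smear.
blurred = 0.5 * (target_a + target_b)

print(expected_l2(target_a))  # 0.125, committing to one sharp answer
print(expected_l2(blurred))   # 0.0625, hedging between the two
```

The smeared prediction matches neither real picture, yet it gets the lower expected loss, so a network trained on L2 alone is pushed toward it.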

For various individual tasks, people came up with other loss functions to express some structure that the resulting picture should have (for segmentation, for example, they tried adding Conditional Random Fields, and so on), but all of that helps only incrementally and depends heavily on the task.

And so, following the new trends, the paper plugs in a GAN (Generative Adversarial Network) as an additional loss on top of L1. (You can read about GANs on Habr.)

Their overall scheme is this:

The generator receives an input image, which serves as an additional condition on what must be generated. Based on it, the generator has to produce an output picture.

The discriminator receives both the input image and what the generator produced (or, for positive examples, a real pair from the training dataset), and it has to say whether the output picture is real or generated. So if the generator produces a picture unrelated to the input one, the discriminator should detect this and reject it.

The final generator is the result of iteratively training this pair of networks.

Overall, this is the standard Conditional GAN approach, a GAN variant in which the model has to generate pictures that match an additional input class vector.

Only here the input "class vector" is a picture, and the overall loss is GAN loss + L1.

What do you mean by "plugs in a GAN" in the context of losses? As in, they add a generator and solve a minimax problem?
Well, yes.
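A minimal numeric sketch of the two competing objectives, assuming the discriminator outputs a probability in (0, 1) that a pair is real. The function names are hypothetical, and the L1 weight of 100 is not stated in the text above, so treat it as an assumption. No networks are trained here; the functions only show how the loss terms combine.

```python
import numpy as np

LAMBDA_L1 = 100.0  # weight of the L1 term; an assumed value, not given in the text

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """d_real = D(input, target), d_fake = D(input, G(input)), both in (0, 1)."""
    # The discriminator wants d_real close to 1 and d_fake close to 0.
    return float(np.mean(-(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))))

def generator_loss(d_fake, generated, target, eps=1e-12):
    """GAN term (fool the discriminator) plus weighted L1 term (stay near the target)."""
    gan_term = np.mean(-np.log(d_fake + eps))
    l1_term = np.mean(np.abs(target - generated))
    return float(gan_term + LAMBDA_L1 * l1_term)

# Dummy values: an undecided discriminator and an output that is off by 0.1 everywhere.
target = np.ones((4, 4))
generated = np.full((4, 4), 0.9)
d_fake = np.array([0.5])
d_real = np.array([0.5])
print(discriminator_loss(d_real, d_fake))
print(generator_loss(d_fake, generated, target))
```

In training, the two losses would be minimized in alternation: one step for the discriminator, one for the generator.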

At a high level, that's it!

Some interesting details

Unlike the classic approach to GANs, the generator is not given any noise vector at all.
All the variety comes only from the dropout in the network, which they do not switch off after training.
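A small sketch of why leaving dropout on makes the output vary: each forward pass draws a fresh random mask, so the same input yields different activations. The `dropout` function below is a generic inverted-dropout stand-in, not the code from the paper, and the rate of 0.5 is an assumption.

```python
import numpy as np

def dropout(x, rate, rng):
    """Zero each element with probability `rate` and rescale the survivors."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
features = np.ones(1000)           # the same "input" for both passes

out1 = dropout(features, 0.5, rng)
out2 = dropout(features, 0.5, rng)

# Same input, two passes, two different random masks.
print(np.array_equal(out1, out2))
```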

The network architecture is U-Net, a fairly new segmentation architecture with many skip connections from the encoder to the decoder.

Here is a picture showing that both the GAN loss and U-Net help.

It also clearly shows the original problem with using only L1 loss: even a powerful model generates blurry blobs in order to minimize the average error.

The discriminator judges 70x70 patches and is then applied across larger pictures fully convolutionally. Funnily enough, 70x70 gives better results on average than judging the whole 256x256 picture at once.
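To see where a number like 70 can come from, here is the usual receptive-field calculation for a stack of convolutions. The layer list is an assumption for this sketch (three 4x4 stride-2 convolutions followed by two 4x4 stride-1 convolutions); the text above does not spell out the layers.

```python
def receptive_field(layers):
    """Receptive field of one output unit; layers are (kernel, stride) from input to output."""
    rf = 1
    # Walk backwards from the output: each layer widens the field seen by the one after it.
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

# Assumed patch discriminator: three 4x4 stride-2 convs, then two 4x4 stride-1 convs.
patch_discriminator = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(patch_discriminator))  # 70
```

Each output value of such a discriminator therefore only sees a 70x70 window of the image, which is what lets it be slid over pictures of any size.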

But where are the cats?!

After that, you have a system that can be trained on arbitrary inputs and outputs, even if they come from completely different tasks.

From segmentation to photo, from a daytime photo to a nighttime one, from black-and-white to color, and so on.

And the last example is from edges to a picture. The edges are generated from the picture by a standard computer vision algorithm.

This means you can simply take a set of pictures, run edge detection on them, and train on the resulting pairs. It works on cats too:

After that, the model can generate something for any sketch that people draw.
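A rough sketch of building such edge/picture training pairs. The edge detector here is a plain gradient-magnitude threshold, used only as a simple stand-in for whatever "standard algorithm" the authors ran; the function names and the threshold are made up for the example.

```python
import numpy as np

def simple_edges(gray, threshold=0.2):
    """Binary edge map from a grayscale image with values in [0, 1]."""
    gy, gx = np.gradient(gray.astype(float))   # derivatives along rows and columns
    magnitude = np.hypot(gx, gy)
    return (magnitude > threshold).astype(float)

def edges_to_photo_pair(gray):
    """Edge map on the left (input A), original image on the right (target B)."""
    return np.concatenate([simple_edges(gray), gray], axis=1)

# A dummy 8x8 "photo": dark left half, bright right half.
photo = np.zeros((8, 8))
photo[:, 4:] = 1.0
pair = edges_to_photo_pair(photo)
print(pair.shape)  # (8, 16)
```

Running this over a folder of cat photos would give pairs where the left half is the edge sketch and the right half is the cat.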

(By the way, send in the ones you remember.)

And thus humanity's shortage of bread-shaped cats was eliminated!

Overall, this work is yet another example of how GANs have taken off since last year. They turn out to be a very powerful and flexible tool that expresses "I want it to be indistinguishable from the real thing, even though I don't know exactly what that means" as an optimization objective.
I hope someone writes a full review of everything else going on in the field! It's all very cool there.

Thanks for your attention.