We have seen how stage I generates a basic version of the image. In stage II, we fix the defects in that image and generate a more realistic version of it. To do so, we condition the network on both the image generated in stage I and the text embeddings.
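To make the idea of conditioning on two inputs concrete, here is a minimal NumPy sketch of one common way to combine them: shrink the stage I image to a small spatial grid, copy the text embedding to every cell of that grid, and concatenate the two along the channel axis. The function name `build_stage2_input`, the `factor` parameter, the shapes, and the use of average pooling (as a stand-in for learned downsampling layers) are all assumptions made for illustration, not details taken from the text above.

```python
import numpy as np


def build_stage2_input(stage1_image, text_embedding, factor=4):
    """Combine a stage I image and a text embedding into one tensor.

    stage1_image:   array of shape (H, W, C), hypothetical stage I output
    text_embedding: array of shape (D,), hypothetical text embedding
    factor:         assumed downsampling factor; must divide H and W
    """
    h, w, c = stage1_image.shape
    if h % factor or w % factor:
        raise ValueError("factor must divide the image height and width")

    # Average pooling stands in for the learned downsampling layers
    # that a real network would use (an assumption of this sketch).
    pooled = stage1_image.reshape(
        h // factor, factor, w // factor, factor, c
    ).mean(axis=(1, 3))

    # Replicate the text embedding at every spatial location.
    tiled = np.broadcast_to(
        text_embedding, pooled.shape[:2] + text_embedding.shape
    )

    # Channel-wise concatenation: each location now carries both
    # image features and the full text embedding.
    return np.concatenate([pooled, tiled], axis=-1)


rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))   # stand-in for a stage I image
embedding = rng.random(128)       # stand-in for a text embedding
combined = build_stage2_input(image, embedding)
print(combined.shape)  # (16, 16, 131)
```

The combined tensor is what the rest of the stage II generator would then refine into the higher-quality image; because the text embedding is present at every spatial location, the network can use the description again while correcting the stage I result.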