The newly announced 'VoiceCraft' is a neural codec language model, inspired by multimodal models of text and images, that supports speech editing and zero-shot text-to-speech synthesis.