Multimodal AI models are a type of deep learning model that can process several types of input, such as text, images, and audio, and can generate several types of output as well. Unimodal models, by contrast, handle only one type of input (for example, only text or only images) and generate output of that same type (text in, text out; images in, images out).
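To make the distinction concrete, here is a minimal Python sketch that labels a model as unimodal or multimodal from the sets of modalities it accepts and produces. The function name `classify_model` and the modality strings are hypothetical, chosen only for illustration. The sketch also assumes that any model that is not strictly unimodal (one modality in, the same modality out) counts as multimodal, which is a simplification of the definition above.

```python
from typing import Set


def classify_model(input_modalities: Set[str], output_modalities: Set[str]) -> str:
    """Label a model by the modalities it accepts and produces.

    Hypothetical helper for illustration only. Assumption: every model
    that is not strictly unimodal is treated as multimodal.
    """
    if not input_modalities or not output_modalities:
        raise ValueError("a model needs at least one input and one output modality")

    # Unimodal: exactly one input modality, and the output is that same modality.
    if len(input_modalities) == 1 and output_modalities == input_modalities:
        return "unimodal"

    # Everything else mixes modalities on the input side, the output side, or both.
    return "multimodal"


# A text-only language model: text in, text out.
text_only = classify_model({"text"}, {"text"})

# A model that reads text and images and answers with text and audio.
mixed = classify_model({"text", "image"}, {"text", "audio"})
```

With these inputs, `text_only` is `"unimodal"` and `mixed` is `"multimodal"`. Real systems are usually described in more detail than this binary label, for example by listing the supported input and output modalities separately.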