MAiVAR (Multimodal Audio-Image and Video Action Recognizer) is a CNN-based software package for multimodal action recognition. It investigates whether CNN representation learning can fuse image-based representations of audio with video representations to improve action recognition accuracy. MAiVAR extracts meaningful image representations from audio and fuses them with video representations, an approach that is beneficial on large-scale action recognition datasets. By leveraging multiple modalities, the package offers researchers in multimodal action recognition a tool for improving recognition performance. It is implemented in Python and can be used across different operating systems.
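To make the idea of an image-based audio representation fused with a video representation concrete, the following is a minimal NumPy sketch. It is not the MAiVAR API: the function names `audio_to_spectrogram_image` and `fuse_features`, the log-magnitude spectrogram as the audio image, the concatenation-based fusion, and the stand-in feature vectors (used here in place of real CNN embeddings) are all assumptions made for illustration.

```python
import numpy as np


def audio_to_spectrogram_image(waveform, frame_len=256, hop=128):
    """Hypothetical helper: turn a 1-D waveform into a 2-D log-magnitude
    spectrogram scaled to [0, 1], i.e. an image-like view of the audio."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        waveform[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len // 2 + 1)
    log_mag = np.log1p(magnitude)
    span = log_mag.max() - log_mag.min()
    image = (log_mag - log_mag.min()) / (span if span > 0 else 1.0)
    return image.T  # (freq_bins, n_frames)


def fuse_features(audio_feat, video_feat):
    """Hypothetical fusion step: L2-normalize each modality, then concatenate."""
    a = audio_feat / (np.linalg.norm(audio_feat) + 1e-8)
    v = video_feat / (np.linalg.norm(video_feat) + 1e-8)
    return np.concatenate([a, v])


# One second of a 440 Hz tone sampled at 16 kHz as toy audio input.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
waveform = np.sin(2 * np.pi * 440.0 * t)

image = audio_to_spectrogram_image(waveform)

# Stand-ins for CNN embeddings (assumption): a real pipeline would pass the
# audio image and the video frames through trained CNN backbones instead.
rng = np.random.default_rng(0)
audio_feat = image.mean(axis=1)        # pooled audio-image feature
video_feat = rng.standard_normal(512)  # placeholder video feature

fused = fuse_features(audio_feat, video_feat)
```

In this sketch the fused vector would be the input to a classifier over action labels. Concatenation is only one simple fusion choice, shown here because it keeps both modalities' features intact.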