Large Multi-modal Model for Video Captioning