Multimodal Reasoning with Fine-grained Knowledge Representation