[Objective] This paper addresses data sparsity and the dynamic drift of user interests by combining multimodal feature fusion with deep reinforcement learning. [Methods] First, we used pre-trained models and an attention mechanism to learn intra-modal representations and fuse three modalities. Then, we built a model of user-item interactions. Finally, we applied a deep reinforcement learning algorithm to capture user interest drift in real time and to balance long- and short-term rewards, producing personalized recommendations. [Results] Compared with the best-performing baseline, the proposed model improved Precision@5 by 11.8%, 16.5%, and 11.4%, and NDCG@5 by 5.3%, 8.0%, and 6.4% on the MovieLens-1M, MovieLens-100K, and Douban datasets, respectively. [Limitations] User interaction histories in the Douban dataset are relatively short, so the model cannot learn user preferences as accurately during training, and its recommendation results there are weaker than in the MovieLens experiments. [Conclusions] The proposed model integrates multimodal information to reconstruct the state representation network of deep reinforcement learning, improving recommendation performance.
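To make the fusion step concrete, the following is a minimal sketch of attention-based fusion over three modality embeddings, assuming pre-extracted per-modality features of a shared dimension (e.g., from pre-trained text and image encoders); the class name, dimensions, and single-layer scoring function are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse three modality embeddings with learned attention weights.

    Hypothetical sketch: the paper's actual fusion network, modality
    encoders, and embedding sizes are not given in the abstract.
    """

    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scores each modality embedding

    def forward(self, text: torch.Tensor, image: torch.Tensor,
                meta: torch.Tensor) -> torch.Tensor:
        # Stack the per-modality embeddings: (batch, 3, dim)
        stacked = torch.stack([text, image, meta], dim=1)
        # Attention weights over the three modalities: (batch, 3, 1)
        weights = torch.softmax(self.score(stacked), dim=1)
        # Weighted sum yields the fused item representation: (batch, dim)
        return (weights * stacked).sum(dim=1)

# Usage with dummy features standing in for pre-trained-model outputs
fusion = AttentionFusion(dim=128)
t, i, m = (torch.randn(4, 128) for _ in range(3))
fused = fusion(t, i, m)  # (4, 128) fused item embeddings
```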
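Likewise, a minimal sketch of the reinforcement learning step, assuming a DQN-style value network over a GRU-encoded interaction history built from the fused multimodal embeddings; the algorithm choice, network shapes, and the reward-mixing weight `beta` are assumptions, since the abstract does not specify them.

```python
import torch
import torch.nn as nn

class DRLRecommender(nn.Module):
    """Hypothetical DQN-style recommender over fused multimodal states.

    The abstract only states that the state representation network is
    rebuilt from multimodal features; the GRU encoder and Q-network
    here are illustrative choices.
    """

    def __init__(self, dim: int = 128):
        super().__init__()
        # Encodes the user's recent interaction history into a state vector
        self.state_net = nn.GRU(dim, dim, batch_first=True)
        # Scores a (state, candidate item) pair with a Q-value
        self.q_net = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, history: torch.Tensor,
                candidates: torch.Tensor) -> torch.Tensor:
        # history: (batch, seq, dim) fused embeddings of past items
        _, h = self.state_net(history)            # (1, batch, dim)
        state = h.squeeze(0)                      # (batch, dim)
        n = candidates.size(1)                    # candidates: (batch, n, dim)
        s = state.unsqueeze(1).expand(-1, n, -1)  # (batch, n, dim)
        q = self.q_net(torch.cat([s, candidates], dim=-1))
        return q.squeeze(-1)                      # (batch, n) Q-values

def reward(short_term: float, long_term: float, beta: float = 0.5) -> float:
    # Blend immediate feedback (e.g., a click) with delayed engagement;
    # the mixing weight `beta` is an assumption, not from the paper.
    return short_term + beta * long_term
```

At serving time such an agent would recommend the candidate with the highest Q-value and update the state as new interactions arrive, which is how interest drift would be tracked in real time under this sketch.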