The implicit hypothesis behind benchmarking on the gold-standard QM9 dataset is that model improvement on small molecules with a concentrated size distribution implies improved generalization as quantum chemical property (QCP) predictors. This extrapolation ability of deep learning (DL) models is highly useful for various real-world applications, yet the related investigation remains quite limited. The goal of this paper is to promote the development of DL models that can extrapolate beyond the in-domain dataset and handle larger molecules than those in the training data. To achieve this goal, we propose a cross-dataset benchmark that trains models on the QM9 dataset and tests them on ALchemy datasets with Larger molecular size (QMALL). Experimental results using recent DL methods are provided to investigate their out-of-distribution (OOD) behavior. Analyses of the overall performance drop, model ranking inconsistency, aggregation method selection, and error patterns yield new insights into this OOD extrapolation problem, highlighting the challenge it poses for the research community.