Functional near-infrared spectroscopy (fNIRS) promises a non-intrusive way to measure real-time brain activity and build responsive brain-computer interfaces. A primary barrier to realizing this technology's potential has been that observed fNIRS signals vary significantly across human users. Building models that generalize well to never-before-seen users has been difficult; a large amount of subject-specific data has been needed to train effective models. To help overcome this barrier, we introduce the largest open-access dataset of its kind, containing multivariate fNIRS recordings from 68 participants, each with labeled segments indicating four possible mental workload intensity levels. Labels were collected via a controlled setting in which subjects performed standard n-back tasks to induce desired working memory levels. We propose a benchmark analysis of this dataset with a standardized training and evaluation protocol, which allows future researchers to report comparable numbers and fairly assess generalization potential while avoiding any overlap or leakage between train and test data. Using this dataset and benchmark, we show how models trained using abundant fNIRS data from many other participants can effectively classify a new target subject's data, thus reducing calibration and setup time for new subjects. We further show how performance improves as the size of the available dataset grows, while also analyzing error rates across key subpopulations to audit equity concerns. We share our open-access Tufts fNIRS to Mental Workload (fNIRS2MW) dataset and open-source code as a step toward advancing brain computer interfaces.