Unveiling the Complexity of Language Model Datasets with WIMBD