📁 Dataset

The OpenBioLink2021 dataset is a highly challenging benchmark containing about 4.5 million high-quality biomedical facts drawn from various renowned biomedical knowledge bases. The dataset was split randomly into training, validation, and test sets with a ratio of 90-5-5.

# Train	# Valid	# Test	# Entities	# Relations
4,192,002	186,301	180,964	180,992	28

The dataset can be downloaded from Zenodo (KGID_HQ_DIR.zip) or loaded with the provided Python dataloader module, which is further documented here. Please make sure that you get the dataset from one of these two sources, as other versions of OpenBioLink may differ.

from openbiolink.obl2021 import OBL2021Dataset

dl = OBL2021Dataset()

train = dl.training  # torch.Tensor of shape (num_train, 3)
valid = dl.validation  # torch.Tensor of shape (num_val, 3)
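
Since the splits are plain tensors with one triple per row, they can be fed to standard PyTorch tooling. The following is a minimal sketch of mini-batching such a tensor, not part of the OpenBioLink API: it uses a small random stand-in tensor so it runs without downloading the dataset, and it assumes the three columns are ordered head, relation, tail, which should be checked against the dataloader documentation.

```python
import torch
from torch.utils.data import DataLoader

# Stand-in for dl.training: 10 random triples of integer IDs, shape (10, 3).
# With the openbiolink package installed, use OBL2021Dataset().training instead.
triples = torch.randint(low=0, high=100, size=(10, 3))

# A 2-D tensor supports len() and integer indexing, so DataLoader can
# batch its rows directly without a custom Dataset class.
loader = DataLoader(triples, batch_size=4, shuffle=True)

batch_shapes = []
for batch in loader:
    # Assumed column order (hypothetical, verify before use): head, relation, tail.
    heads, relations, tails = batch[:, 0], batch[:, 1], batch[:, 2]
    batch_shapes.append(tuple(batch.shape))

print(batch_shapes)  # [(4, 3), (4, 3), (2, 3)]
```

The last batch is smaller because 10 rows do not divide evenly by the batch size; passing drop_last=True to DataLoader would discard it instead.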