📁 Dataset

The OpenBioLink2021 dataset is a highly challenging benchmark containing about 4.5 million high-quality biomedical facts from various renowned biomedical knowledge bases. The dataset was split randomly into training, validation, and test sets with a ratio of 90-5-5.

# Train      # Valid    # Test     # Entities    # Relations
4,192,002    186,301    180,964    180,992       28
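
As a quick sanity check, the realized split fractions can be computed from the counts in the table above and compared against the nominal 90-5-5 ratio. This is only an arithmetic sketch; the dictionary and variable names are illustrative and not part of the OpenBioLink package.

```python
# Split sizes copied from the table above (illustrative names, not a library API).
splits = {"train": 4_192_002, "valid": 186_301, "test": 180_964}

# Total number of facts across all three splits.
total = sum(splits.values())

# Fraction of the total that each split accounts for.
fractions = {name: count / total for name, count in splits.items()}

print(f"total facts: {total:,}")
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```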

The dataset can be downloaded from Zenodo (KGID_HQ_DIR.zip) or loaded with the provided Python dataloader module, which is further documented here. Please obtain the dataset from one of these two sources, as other versions of OpenBioLink may differ.

from openbiolink.obl2021 import OBL2021Dataset

dl = OBL2021Dataset()

train = dl.training    # torch.Tensor of shape (num_train, 3)
valid = dl.validation  # torch.Tensor of shape (num_val, 3)
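
Once loaded, a tensor of shape `(n, 3)` can be sliced and batched like any other PyTorch tensor. The sketch below uses a small made-up tensor as a stand-in for `dl.training` so that it runs without downloading the dataset. The column order (head, relation, tail) is an assumption here and should be checked against the dataloader documentation.

```python
import torch
from torch.utils.data import DataLoader

# Stand-in for dl.training: six made-up triples of integer IDs, shape (6, 3).
triples = torch.tensor([
    [0, 0, 1],
    [1, 1, 2],
    [2, 0, 3],
    [3, 2, 0],
    [4, 1, 1],
    [0, 2, 4],
])

# Assumed column order: head, relation, tail (verify against the docs).
heads, relations, tails = triples.unbind(dim=1)

# A tensor supports len() and indexing, so it can serve directly as a dataset.
loader = DataLoader(triples, batch_size=4, shuffle=False)
batch_shapes = [tuple(batch.shape) for batch in loader]
print(batch_shapes)
```

With real data, replacing `triples` with `dl.training` should be the only change needed, provided the tensor layout matches the assumption above.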