Authors: Jeffrey Sardina, John D. Kelleher, Declan O'Sullivan
Paper Content:
Page 1:
Extending TWIG: Zero-Shot Predictive Hyperparameter
Selection for KGEs based on Graph Structure
Jeffrey Sardina1,†, John D. Kelleher1and Declan O’Sullivan1
1Trinity College Dublin, the University of Dublin, College Green Dublin, Ireland
Abstract
Knowledge Graphs (KGs) have seen increasing use across various domains – from biomedicine and linguistics
to general knowledge modelling. In order to facilitate the analysis of knowledge graphs, Knowledge Graph
Embeddings (KGEs) have been developed to automatically analyse KGs and predict new facts based on the
information in a KG, a task called “link prediction". Many existing studies have documented that the structure of
a KG, KGE model components, and KGE hyperparameters can significantly change how well KGEs perform and
what relationships they are able to learn. Recently, the Topologically-Weighted Intelligence Generation (TWIG)
model has been proposed as a solution to modelling how each of these elements relate. In this work, we extend
the previous research on TWIG and evaluate its ability to simulate the output of the KGE model ComplEx in
the cross-KG setting. Our results are twofold. First, TWIG is able to summarise KGE performance on a wide
range of hyperparameter settings and KGs being learned, suggesting that it represents a general knowledge
of how to predict KGE performance from KG structure. Second, we show that TWIG can successfully predict
hyperparameter performance on unseen KGs in the zero-shot setting. This second observation leads us to propose
that, with additional research, optimal hyperparameter selection for KGE models could be determined in a pre-hoc
manner using TWIG-like methods, rather than by using a full hyperparameter search.
Keywords
Knowledge Graphs, Knowledge Graph Embeddings, Relational Learning, Link Prediction, Simulation
1. Introduction and Preliminaries
Knowledge Graphs (KGs) are graph-based databases that model information as a set of nodes, which
represent concepts, and edges, which represent the relationships between them [ 1]. Knowledge Graph
Embedding (KGE) models learn to predict new facts based on the information contained in a knowledge
graph – formally, this is called the link prediction task. [ 2,3,4]. As a result of their success in link
prediction, KGE models have become increasingly used in a large variety of domains – from modelling
health sciences data [5, 6, 7, 8, 9] to general knowledge [10, 11].
While previous studies provided detailed benchmarking of various KGE models [ 10,11,4,12,13],
explored the effects of specific KGE model components [ 11,10,14,15], and explored the effects of
KG structure on learning [ 5,16,17], no study known to the authors has attempted to create a system
in which KGE models, model components, graph structure, and link prediction performance can be
understood as part of a common analytic framework. In each of these areas, analysis of KGEs remain
incompletely characterised in terms of the others. For example, the manner by which different KGE
model components affect the learnability of various graph (sub-)structures has not been explored in
detail in the literature known to the authors.
However, recent developments in the Topologically-Weighted Intelligence Generation (TWIG) ap-
proach for analysing KGE models have opened the door to characterising KGE models, KG structure,
and link prediction performance in a common framework [ 18]. In this work, we extend the TWIG
model to simulate KGEM output on multiple KGs at the same time. We provide an empirical analysis of
the accuracy this new TWIG model and show that it can accurately predict the overall performance
of KGE models even on previously unseen KGE hyperparameter settings. Finally, we show that it can
AICS-24: 32nd Irish Conference on Artificial Intelligence and Cognitive Science, December 9-10, 2024, Dublin, Ireland
*Corresponding author.
/envel⌢pe-⌢penJeffrey.Sardina@gmail.com (J. Sardina); John.Kelleher@tcd.ie (J. D. Kelleher); Declan.OSullivan@tcd.ie (D. O’Sullivan)
/orcid0000-0003-0654-2938 (J. Sardina); 0000-0001-6462-3248 (J. D. Kelleher); 0000-0003-1090-3548 (D. O’Sullivan)
©2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).arXiv:2412.14801v1 [cs.LG] 19 Dec 2024
Page 2:
further predict hyperparameter preference and KGE model performance on entirely unseen knowledge
graphs; i.e. in the zero shot setting.
The following sections provide a background and motivation for this work. All code can be found
at https://anonymous.4open.science/r/TWM-4D1F/README.md, and data files can be found at https:
//figshare.com/s/13dc93087c97cbf7cca1.
1.1. Knowledge Graphs and Knowledge Graph Embeddings
Knowledge Graphs represent data as atomic facts (also called “triples") consisting of labelled nodes and
the directed, labelled relations that occur between them. Triples in a KG are denoted as (𝑠, 𝑝, 𝑜 ), where
𝑠represents the subject (or “head") node, 𝑜represents the object (or “tail") node, and 𝑝is the “predicate"
that describes the relationship between 𝑠and𝑜[1]. The intrinsically networked nature of Knowledge
Graphs leads them to very naturally represent a variety of real world data, from biological pathways
and biomedical data [19, 20, 21] to linguistics [22] and general knowledge [23, 22].
Knowledge Graph Embeddings are the machine learning counterpart to KGs – they aim to automati-
cally learn to represent all of the knowledge in a KG as latent vector embeddings of each node and edge
[1,2,3]. These embeddings can then be used to predict new statements that should be present in a KG,
allowing the inference of new knowledge from the knowledge already present in a KG – a task called
“link prediction". The link prediction task is formally defined as answering a “link prediction query" in
the form (𝑠, 𝑝,?)or(?, 𝑝, 𝑜), where ?represents the subject or object entity that should be predicted
such that the triple would be true.
To solve this task, KGE models learn to calculate a plausibility score for all triples. This scoring
function takes the form:
𝑓(𝑒𝑠, 𝑒𝑝, 𝑒𝑜)→𝑠𝑐𝑜𝑟𝑒 𝑠,𝑝,𝑜
where 𝑒𝑠represents the embedding of the subject node, 𝑒𝑝represents the embedding of the predicate,
𝑒𝑜represents the embedding of the object node, and 𝑠𝑐𝑜𝑟𝑒 𝑠,𝑝,𝑜 is a scalar-valued plausibility score
output by the function for the given triple.
An example Knowledge Graph is given in Figure 1a; in this graph, all nodes represent people and
all relations indicate if the people consider others to be friends or enemies. In such a graph, the link
prediction task represents asking if a certain person is the friend of, or the enemy of, another. An
example of link prediction in this KG is shown in Figure 1b. In this case, the query triple can be
represented as (𝑃𝑖𝑝𝑝𝑖𝑛, 𝐹𝑟𝑖𝑒𝑛𝑑 -𝑜𝑓,?). The remainder of the KG is not part of the query, but is used
as training examples for KGE models – in other words, it represents the background knowledge that is
used by KGE models to answer the posed query.
Evaluation of KGE models is based on how well the model is able to assign higher plausibility scores
to known-true triples (in a KG’s hold-out test set) and to assign lower plausibility scores to all other
triples (not observed in the KG). A schematic overview of this procedure is shown in Figure 2, and an
algorithmic specification follows.
At an algorithmic level, in order to evaluate the performance of a KGE model on a given triple (𝑠, 𝑝, 𝑜 ),
two “link prediction queries" are posed based on that triple: (𝑠, 𝑝,?)and(?, 𝑝, 𝑜). For each query, all
possible entities in the KG are substituted for the unknown entity ?, and all resultant triples (𝑠, 𝑝, 𝑜 𝑖^ )
and(𝑠𝑖^, 𝑝, 𝑜)are scored. The scores of all link prediction queries are then sorted into two ranked lists:
the list of answers to (𝑠, 𝑝,?), and the list of answers to (?, 𝑝, 𝑜); both lists are sorted by score such that
triples with higher plausibility scores come first in the list. The rank of the correct answers (𝑠*, 𝑝, 𝑜)
and(𝑠, 𝑝, 𝑜*)in the sorted list is then calculated. Lower ranks (closer to 1) indicate that the correct
answer to the link prediction query is predicted to be more plausible than its incorrect alternatives,
whereas higher ranks (further from 1) indicate that the correct answer is predicted to be less plausible
than its incorrect alternatives.
This procedure is done for all triples in the KG’s hold-out test set. Once all ranks are obtained for
all link prediction queries, overall performance is measured using the standard Mean Reciprocal Rank
Page 3:
(a)An example Knowledge Graph where nodes rep-
resent people and edges represent whether they
are friends or enemies.
(b)An example of a query for link prediction in the
knowledge graph (in black) based on training
data (in grey).
Figure 1: A sample Knowledge graph (left) and an example of link prediction (right).
Figure 2: An schematic overview of how rank-based evaluation of KGE models is performed.
(MRR) metric [ 10,11]. Specifically, MRR calculates the mean of the reciprocal of the ranks assigned to
the correct answers of all link prediction queries. Mathematically, this is expressed as:
𝑀𝑅𝑅 =∑︀𝑛
𝑖=11
𝑟𝑎𝑛𝑘 𝑖
𝑛
where 𝑛is the total number of link prediction queries posed and 𝑟𝑎𝑛𝑘 𝑖is the rank of the correct
answer to the 𝑖𝑡ℎlink prediction query. MRR values are bounded on the interval (0,1]. Values closer to
1 indicate better performance, and values closer to 0 indicate worse performance.
Page 4:
1.2. Properties of KGE Models
KGE models have three main components
•ascoring function that uses embeddings to assign a plausibility score to a triple,
•anegative sampler that produces counter-examples (in the form of fake triples) during training,
and
•aloss function that forces the KGE model to assign higher scores to true triples and lower scores
to “negative" / fake triples.
On top of these, every KGE model has various hyperparameters, such as its learning rate, the size of
the embeddings, its regularisation coefficients, and the number of negatives triples to generate during
training. All of these hyperparameters affect how KGE learning proceeds.
Existing analysis of KGE-based link predictors has shown that the optimal choices for various KGE
model components [ 11,10,18], such as the hyperparameters [ 18,24], the loss function [ 15,11,10,18],
and the negative sampler [ 14,11,10,18] can be understood in terms of the KGE model in use and of the
structure of the knowledge graph that is being learned.
Kotnis et al. show that the optimal negative sampler choice depends on KG structure, and suggest
that this structural dependence is a significant source of hyperparameter preference in KGE models
[14]. However, they do note that the KGE model used also has an impact on the optimal negative
sampler, citing that less expressive KGE models such as TransE cannot always benefit from more
robust negative sampling protocols [ 14]. In a similar vein, Sameh et al. show that optimal loss function
choice varies both by the KGE model used and by the KG being learned [ 15]. Ruffinelli et al. and Ali
et al. both conducted mass KGE benchmark studies, performing ablations of a wide variety of KGE
model components on various KGs of differing structures [ 10,11]. Overall, Ruffinelli et al. and Ali
et al. both found evidence of complex systems of preference for various KGE components and KGE
hyperparameters when learning different KGs, but did relatively little analysis on those systems or
what gave rise to them [10, 11].
Finally, a distinct body of works has explored the effectiveness of KGE models directly as a function
of graph structure [ 5,4,18,24]. Bonner et al. showed that so-called “super-hub" nodes of extremely high
degree could substantially impact learning, often leading to worse performance in link prediction by
being over-represented in model predictions [ 5]. They also highlight that nodes with very low degree
(near 1, meaning that the node only connects to a select few other nodes in the graph) are much harder
to learn during link prediction [5].
Rossi et al. showed that KGE performance can be modelled as a function of the properties of nodes
and relationships – such as the how often certain nodes and relationships co-occur in the graph [4].
Finally, Sardina et al. showed that the rank assigned to link prediction queries during evaluation, as
well as the overall performance of KGEs, can be accurately modelled using a simulation method called
Topologically-Weighted Intelligence Generation (TWIG) [ 18]. While these results are only evaluated
on a single dataset, they show a notable ability to summarise hyperparameter preference and KG
learnability using a simple neural network model [18].
Taken together, these results suggest that the performance of KGEs can be understood largely as
a function of their hyperparameters / components and the structural properties of the knowledge
graphs that they learn. The result of this is the theoretical possibility of predicting hyperparameter
preference and KGEM performance not only in the supervised setting, but also in the few-shot and
zero-shot settings – in other words, to perform predictive hyperparameter optimisation using TWIG as
a substitute to a hyperparameter search.
2. Methodology
The methodology of our work can be divided into two parts – training and evaluating TWIG in a
multi-KG setting, and how we define the few-shot and zero-shot evaluations for TWIG.
Page 5:
Figure 3: An overview of link prediction (using KGEs) and of KGE simulation (using TWIG).
2.1. The TWIG Model
The job of TWIG is to simulate a single knowledge graph embedding model (such as ComplEx [ 25]) by
predicting the ranks that the KGE model would assign to link prediction queries in the form (𝑠, 𝑝,?)or
(?, 𝑝, 𝑜). To do this, we use the TWIG neural network as published in the original TWIG paper [ 18]. The
TWIG neural network has three major learning components – a hyperparameter learning component,
a graph structure learning component, and an integration component. The first two components learn
to implicitly represent KGE hyperparameters and graph structure respectively, and the integration
component combines their information to produce the final output of predicted ranks. An overview of
the TWIG simulation pipeline is given in Figure 3, and the architecture of the TWIG neural network is
given in Figure 4.
In terms of input, TWIG gathers the values of all hyperparameters, as well as the specific negative
sampler and loss function being used, directly as input. It further collects fine-grained structural
information on every triple in the KG, as well as aggregate statistics on each triple’s local neighbourhood.
Full details on all hyperparameter and structural features used by TWIG can be found in Table 1; note
that these are the same features used in the original TWIG paper [18].
For simplicity, a visual depiction of the fine-grained and coarse-grained structural information that
TWIG collects is shown in Figure 5. As a final note, we highlight that all structural features are calculated
based only on the relations present in the KG’s training set to avoid data leakage.
2.2. Data and Datasets
We selected five KGs to use – CoDExSmall, DBpedia50, Kinships, OpenEA, and UMLS [ 26,27,28,29,19],
all taken from the publicly-available PyKEEN repository [ 30]. We choose these graphs for two reasons.
First, they represent a diverse set of domains – notably, biology (UMLS [ 19]), family trees (Kinships
[28]), and general knowledge (CoDExSmall, DBpedia50, and OpenEA [ 27,26,29]). Second, all datasets
Page 6:
Figure 4: A visualisation of the TWIG neural network architecture. KG structure (in magenta) and hyperparam-
eter influence (in yellow) are first learned in separate blocks, then combined in an integration component (in
red) before predicted ranks are output.
are relatively small, meaning that performing large hyperparameter benchmarking experiments to
produce ground-truth data for TWIG could be done in a feasible amount of time.
In this work, we choose to simulate the state-of-the-art KGE model ComplEx, as it is among the
strongest KGE models and has consistent use in many applications [ 25,13,10,11]. To gather information
on ComplEx’s performance for TWIG, we train ComplEx on a large hyperparameter grid on all five
KGs, and record the ranks it assigns to all link prediction queries, as well as the overall MRR ComplEx
achieves on each KG, for all hyperparameter settings. This is done a total of four times, to produce 4
replicates (differing only by random initialisations) of ranked lists to simulate for each KG and each
hyperparameter combination.
We use the same hyperparameter grid as used in the original TWIG paper for training ComplEx; that
grid is shown in Table 2. The meaning of each hyperparameter can be found in Table1.
2.3. Evaluation of TWIG
We evaluate TWIG in two settings: simulation of KGE performance on unseen hyperparameters, and
simulation of KGE performance on unseen KGs and on unseen hyperparameters simultaneously.
2.3.1. Evaluation of TWIG on Unseen Hyperparameters
When evaluating TWIG on unseen hyperparameters, we define our hold-out test set as a random 10%
of all hyperparameter combinations for each KG TWIG is trained on. This is done such that the exact
same hyperparameter combinations are held out on each different KG TWIG is trained on, so that when
Page 7:
Feature Meaning
Hyperparameter Fts
Negative Sampler The negative sampling strategy used
#Negatives per Positive The number of negatives sampled for each triple in training
Loss Function The loss function used
Margin (if applicable) The margin value used in loss calculation (if applicable)
Learning Rate The learning rate for the Adam optimiser
Embedding dimension The dimension of KGE embeddings
Regularisation Coefficient The coefficient of the regulariser
KG Structural Fts
is_head True if the link prediction query is (?, 𝑝, 𝑜); else false
s_deg The degree of the subject node
o_deg The degree of the object node
p_freq The frequency of the predicate
s-p cofreq The number of times the given subject and predicate co-occur
o-p cofreq The number of times the given object and predicate co-occur
s-o cofreq The number of times the given subject and object co-occur
s min deg neighbour The degree of the lowest-degree neighbour of the subject
s max deg neighbour The degree of the highest-degree neighbour of the subject
s mean deg neighbour The degree of the mean-degree neighbour of the subject
o min deg neighbour The degree of the lowest-degree neighbour of the object
o max deg neighbour The degree of the highest-degree neighbour of the object
o mean deg neighbour The degree of the mean-degree neighbour of the object
s num neighbours The total number of neighbours the subject has
o num neighbours The total number of neighbours the object has
s min freq edge The frequency of the least-frequent edge linked to the subject
s max freq edge The frequency of the most-frequent edge linked to the subject
s mean freq edge The mean frequency of edges linked to the subject
o min freq edge The frequency of the least-frequent edge linked to the object
o max freq edge The frequency of the most-frequent edge linked to the object
o mean freq edge The mean frequency of edges linked to the object
s num edges The total number of edges incident on the subject
o num edges The total number of edges incident on the object
Table 1
A description of all features used in the TWIG model.
Hyperparameter Values Searched
Negative Sampler Basic, Bernoulli, Pseudo-Typed
#Negatives per Positive 5, 25, 125
Loss Function Margin Ranking, Binary Cross Entropy, Cross Entropy
Margin (if applicable) 0.5, 1, 2
Learning Rate 1e-2, 1e-4, 1e-6
Embedding dimension 50, 100, 250
Regularisation Coefficient 1e-2, 1e-4, 1e-6
Epochs 100 (constant)
Table 2
The grid of hyperparameters used to train the KGE model ComplEx simulated by TWIG.
it is tested, it is tested on hyperparameter combinations it has never seen before. The remaining 90% of
hyperparameter combinations are used as the training set.
TWIG is then trained to simulate the output of all hyperparameter combinations on all five knowledge
graphs for the KGE model ComplEx. This training is run exactly as in the original TWIG paper; that is,
training is done in two phases, with 5 epochs for the first phase and 10 epochs for the second phase
[18]. In the first phase TWIG learns using KL-divergence loss alone to teach it to match the expected
distribution of ranks in its output; in the second phase TWIG learns using both KL-divergence loss and
Page 8:
Figure 5: A visualisation of the region around a triple that TWIG uses to calculate structural features. Details
statistics are gathered for the main triple (shown in red), and aggregate statistics are gathered for all triples
connecting to that triple (shown in blue). Other triples (shown in black) are ignored.
Mean Squared Error (MSE) loss to teach it to more exactly match the values of the ranks it is meant to
predict while also maintaining the expected distribution of ranks [ 18]. Full details of this training setup,
and the reasons behind 2-phase training, can be found in the TWIG paper [18].
After training TWIG, we evaluate performance by its ability to correctly predict the Mean Reciprocal
Rank (MRR) value that the KGE model ComplEx would achieve on each KG under each hyperparameter
setting in the hold-out test set. To do this, we first use all of TWIG’s predicted ranks for each hyper-
parameter combination to produce a predicted MRR score. We then calculate the 𝑅2metric between
ground-truth MRR values and TWIG’s predicted MRR values. Note that the 𝑅2metric is bounded on
(−∞,1], with higher values (closer to 1) indicating better performance and lower values indicating
worse performance.
2.3.2. Evaluation of TWIG on Unseen KGs and Unseen Hyperparameters
When evaluating TWIG on unseen KGs, we define one of the five KGs as a hold-out set test, such that
TWIG never sees the output of ComplEx on that KG during training. This is used as the first hold-out
test set. For the remaining 4 KGs, we again remove a random 10% of the hyperparameter combinations
as a second hold-out test set.
We then train on the training set of the remaining 4 KGs, which consists of 90% of the hyperparameter
combinations for each KG. In order to evaluate how well TWIG can simulate the performance of ComplEx
on unseen KGs, we evaluate TWIG on allhyperparameter combinations of the hold-out KG – which
includes hyperparameter combinations it has previously seen in training, as well as hyperparameter
combinations never seen during training. Once again, evaluation is done in terms of the 𝑅2metric, this
time between predicted MRR for each hyperparameter setting on the hold-out KG and on the true MRR
value achieved by ComplEx on the same hyperparameter settings for the hold-out KG. This evaluation
setting, in which an entire KG is held out, is referred to as the “0-shot" evaluation setting.
As a generalisation of 0-shot evaluation, we also explore how TWIG performs when it is trained on a
small percent of hyperparameter combinations in the unseen KG. We phrase this as a transfer learning
problem, where we use the TWIG model trained on four of the five KGs as a pretrained model, and then
finetune that model onto the hold-out KG. We do this in two cases – where TWIG can see a random
5% of the hold-out KG during finetuning, and where TWIG can see a random 25% of the hold-out KG
during finetuning. In both cases, the remaining 95% or 75% of the hyperparameters are reserved as
a hold-out test set. We refer to these finetuning experiments as the "5%-shot and 25%-shot" settings
respectively.
Page 9:
Finally, we also evaluate TWIG’s ability to simulate ComplEx on each of the four seen KGs on
their 25% hold-out hyperparameter combinations, exactly as done when testing TWIG on unseen
hyperparameters only. The results of all of these tests – on seen KGs and on and unseen KGs in the
0-shot, 5%-shot, and 25%-shot evaluation settings, are given in the Results section.
3. Results and Discussion
In this section, we first demonstrate that the TWIG model we use simulates the KGE model ComplEx
with high accuracy on both unseen hyperparameter combinations and on unseen KGs. We further show
that TWIG can be effectively finetuned, allowing it to be re-purposed to new KGs with minimal effort
and very little data for finetuning.
3.1. Evaluation on Unseen Hyperparameters
Table 3 gives the results of TWIG when trained on 90% of all hyperparameter combinations of all 5 KGs,
and tested on the remaining 10% of hyperparameter combinations on the same KGs.
CoDExSmall DBpedia50 Kinships OpenEA UMLS
TWIG 𝑅20.95 0.72 0.98 0.8 0.96
Table 3
𝑅2values achieved by TWIG for each KG when trained jointly on the training sets of all KGs. 𝑅2is calculated
between observed and predicted MRR values for each hyperparameter combination on each KG.
These results indicate that TWIG can generalise across various KGs and effectively simulate the
outputs of KGEs on many different hyperparameter settings and many different KGs. The 𝑅2values
of TWIG on each dataset all lie between 0.72 at the lowest (on DBpedia50) and 0.98 at the highest
(on Kinships), indicating that it can accurately predict the performance of ComplEx on all KGs under
various hyperparameter settings. We further highlight that as this evaluation is done on hold-out hyper-
parameters, that it indicates that TWIG is able to predict the efficacy of hyperparameter combinations
it has never seen.
3.2. Evaluation on Unseen KGs
Table 4 gives all results of TWIG when trained on 90% of all hyperparameter combinations of 4 KGs,
and tested:
•on the remaining 10% of hyperparameter combinations on the 4 KGs seen during training (shown
on the left)
•on all hyperparameter combinations on a hold-out KG never seen during training (shown on the
right)
Table 4 further shows the results of finetuning TWIG with either a random 5% (for the 5%-shot
setting) or a random 25% (for the 25%-shot setting) of all hyperparameter combinations on the hold-out
KG before evaluating on that KG. All combinations of 4 training KGs and one hold-out KG are shown.
The results indicate that TWIG is able to simulate ComplEx not only on unseen hyperparameters, but
also on entirely unseen KGs. Particularly notably, TWIG achieves an 𝑅2of between 0.54 (at the lowest)
and 0.73 (at the best) in the 0-shot setting where it is not trained on any information about the hold-out
KG. This improves to 0.77 (at the worst) to 0.96 (at the best) in the 5%-shot setting and 0.86 to 0.99 in
the 25%-shot setting, where 5% or 25% of the hold-out KG is used to finetune TWIG before evaluation.
We highlight that these results have two immediate, major implications:
•that TWIG is able to use a general knowledge of KG structure and hyperparameter effects to
simulate KGEs on datasets and hyperparameter combinations it has never seen in training, and
Page 10:
Training KGs Testing KG
0-shot 5%-shot 25%-shot
CodExSmall DBpedia50 Kinships OpenEA UMLS
0.95 0.81 0.98 0.83 0.64 0.91 0.97
CodExSmall DBpedia50 Kinships UMLS OpenEA
0.95 0.85 0.94 0.96 0.54 0.77 0.97
CodExSmall DBpedia50 OpenEA UMLS Kinships
0.97 0.83 0.83 0.86 0.65 0.90 0.99
CodExSmall UMLS Kinships OpenEA DBpedia50
0.95 0.93 0.95 0.88 0.73 0.81 0.86
UMLS DBpedia50 Kinships OpenEA CoDExSmall
0.98 0.88 0.98 0.89 0.73 0.96 0.98
Table 4
𝑅2values achieved by TWIG in the on unseen hyperparameters, as well as in the 0-shot, 5%-shot, and
25%-shot settings on all combinations of 4 training KGs and 1 unseen KG. 𝑅2is calculated between
ground truth and predicted MRR values for each hyperparameter combination on each KG.
•that TWIG is highly receptive to finetuning, and that finetuning even on small amounts of data
can lead to substantial increases in its ability to simulate KGEs.
We further highlight that the high performance of the 5%-shot and 25%-shot evaluation setting is
not unexpected – since TWIG is finetuned on only one dataset, rather than having to simulate KGEs
on all datasets, it has a much higher ability to learn the details of that KG in particular. When trained
on multiple KGs, TWIG’s performance drops somewhat due to the need to simulate KGEs in so much
more diverse of an environment. In fact, a weaker version of this effect is seen in the slight per-KG
increase in performance on unseen hyperparameters observed when TWIG is trained 4 KGs only (as in
Table 4), versus when it is trained on all 5 (as in Table 3).
Finally, it is important to highlight that zero-shot and few-show prediction both work regardless of
the domain of the unseen KG . For example, when predicting hyperparameter preference for Kinships
(the only KG containing family-tree data) and UMLS (the only KG containing biological data), R2 values
of 0.65 and 0.64 are obtained respectively. In both cases, no family tree data, nor biological data, was
present in the KGs TWIG was trained on. This suggests that structural impact on hyperparameters may
be domain-agnostic, allowing it to generalise across KGs even from different knowledge domains and
with distinct information content.
4. Conclusion
In this paper, we extend TWIG, a novel system that can predict KGE model performance and hyper-
parameter preference for unseen KGs based on graph structure. Our results indicate that TWIG can
accurately predict and represent the overall performance of the KGE model ComplEx on both unseen
hyperparameters and on unseen KGs. This last ability, its ability to predict overall KGE performance
for various hyperparameter sets on unseen KGs, is very significant for several reasons.
First, it suggests that hyperparameter preference is a function of KG structure. Since TWIG is able
to make this prediction across KGs from different domains, it further highlights that structural are
fundamental to KGE-based KG learning. It is therefore possible that the information content of the
KG, paradoxically, may be less important than KG structure for determining how well the KG can be
learned by KGE models.
Second, the substantively increased performance on formerly unseen KGs in the few-shot setting
suggests that TWIG could possibly serve as a replacement for large-scale hyperparameter searches.
Since TWIG can predict hyperparameter preference with high accuracy when trained on only 5% or
25% of a hyperparameter grid, and since this effect persists across various KGs from different domains,
there is strong initial evidence that it could be used in lieu of a traditional hyperparameter search.
Page 11:
Finally, the ability of TWIG to model hyperparameter preference and KGE model performance across
multiple KGs (seen and unseen) using their structural characteristics suggests that some elements of
KG structure are common among different KGs. For example, Bonner et al. [ 5] have shown that, in the
presence of strongly skewed distributions of node degrees, KGE learning can be heavily biased. It is
very possible that a variety of effect such as this underpin TWIG’s ability to predict KGE performance.
Determining the exact nature of such structural relations, and to what extent TWIG uses them in its
predictions, is left for future work.
We must finally acknowledge some limitations of this study. Most notably, only one KGE model
(ComplEx) has been tested here. While our results clearly show that TWIG can generalise across
KGs, it remains unknown if it can generalise across different KGE models. We leave testing other
literature-standard models, such as DistMult and TransE, to future directions.
Further, all KGs examined in this work are relatively small, especially compared to the standard
benchmark KGs FB15k-237 and WN18RR. It is, as such, unclear how TWIG reacts to KGs of different
size but otherwise similar structure, or if TWIG can effectively generalise across KGs of much more
variable size. Evaluating TWIG in these settings is also left as a future direction.
Acknowledgements
This research was conducted with the financial support of Science Foundation Ireland D-REAL CRT
under Grant Agreement No. 18/CRT6225 at the ADAPT SFI Research Centre at Trinity College Dublin,
together with sponsorship of Sonas Innovation Ireland. The ADAPT SFI Centre for Digital Content
Technology is funded by Science Foundation Ireland through the SFI Research Centres Programme and
is co-funded under the European Regional Development Fund (ERDF) through Grant # 13/RC/2106_P2.
References
[1]A. Hogan, E. Blomqvist, M. Cochez, C. D’amato, G. D. Melo, C. Gutierrez, S. Kirrane, J. E. L. Gayo,
R. Navigli, S. Neumaier, A.-C. N. Ngomo, A. Polleres, S. M. Rashid, A. Rula, L. Schmelzeisen,
J. Sequeda, S. Staab, A. Zimmermann, Knowledge graphs, ACM Comput. Surv. 54 (2021). URL:
https://doi.org/10.1145/3447772. doi: 10.1145/3447772 .
[2]Q. Wang, Z. Mao, B. Wang, L. Guo, Knowledge graph embedding: A survey of approaches
and applications, IEEE Transactions on Knowledge and Data Engineering 29 (2017) 2724–2743.
doi:10.1109/TKDE.2017.2754499 .
[3]M. Nickel, K. Murphy, V. Tresp, E. Gabrilovich, A review of relational machine learning for knowl-
edge graphs, Proceedings of the IEEE 104 (2016) 11–33. doi: 10.1109/JPROC.2015.2483592 .
[4] A. Rossi, D. Barbosa, D. Firmani, A. Matinata, P. Merialdo, Knowledge graph embedding for link
prediction: A comparative analysis, ACM Transactions on Knowledge Discovery from Data 15
(2021) 1–49. doi: 10.1145/3424672 .
[5]S. Bonner, U. Kirik, O. Engkvist, J. Tang, I. P. Barrett, Implications of topological imbalance for
representation learning on biomedical knowledge graphs, Briefings in bioinformatics 23 (2022)
bbac279.
[6]S. K. Mohamed, V. Nováček, A. Nounu, Discovering protein drug targets using knowledge graph
embeddings, Bioinformatics 36 (2020) 603–610.
[7]R. Zhang, D. Hristovski, D. Schutte, A. Kastrin, M. Fiszman, H. Kilicoglu, Drug repurposing for
covid-19 via knowledge graph completion, Journal of biomedical informatics 115 (2021) 103696.
[8]S. K. Mohamed, A. Nounu, V. Nováček, Drug target discovery using knowledge graph embeddings,
in: Proceedings of the 34th ACM/SIGAPP symposium on applied computing, 2019, pp. 11–18.
[9]R. Celebi, H. Uyar, E. Yasar, O. Gumus, O. Dikenelli, M. Dumontier, Evaluation of knowledge
graph embedding approaches for drug-drug interaction prediction in realistic settings, BMC
bioinformatics 20 (2019) 1–14.
Page 12:
[10] M. Ali, M. Berrendorf, C. T. Hoyt, L. Vermue, M. Galkin, S. Sharifzadeh, A. Fischer, V. Tresp,
J. Lehmann, Bringing light into the dark: A large-scale evaluation of knowledge graph embedding
models under a unified framework, IEEE Transactions on Pattern Analysis and Machine Intelligence
44 (2021) 8825–8845.
[11] D. Ruffinelli, S. Broscheit, R. Gemulla, You can teach an old dog new tricks! on training knowledge
graph embeddings, in: ICLR, 2020.
[12] R. Kadlec, O. Bajgar, J. Kleindienst, Knowledge base completion: Baselines strike back, arXiv
preprint arXiv:1705.10744 (2017).
[13] P. Jain, S. Rathi, S. Chakrabarti, et al., Knowledge base completion: Baseline strikes back (again),
arXiv preprint arXiv:2005.00804 (2020).
[14] B. Kotnis, V. Nastase, Analysis of the impact of negative sampling on link prediction in knowledge
graphs, arXiv preprint arXiv:1708.06816 (2017).
[15] S. K. Mohamed, V. Novácek, P.-Y. Vandenbussche, E. Muñoz, Loss functions in knowledge graph
embedding models, in: DL4KG@ESWC, 2019.
[16] H. Zhang, T. Zheng, J. Gao, C. Miao, L. Su, Y. Li, K. Ren, Data poisoning attack against knowledge
graph embedding, arXiv preprint arXiv:1904.12052 (2019).
[17] P. Bhardwaj, J. Kelleher, L. Costabello, D. O’Sullivan, Adversarial attacks on knowledge graph
embeddings via instance attribution methods, arXiv preprint arXiv:2111.03120 (2021).
[18] J. Sardina, J. D. Kelleher, D. O’Sullivan, Twig: Towards pre-hoc hyperparameter optimisation and
cross-graph generalisation via simulated kge models, in: 2024 IEEE 18th International Conference
on Semantic Computing (ICSC), 2024, pp. 122–129. doi: 10.1109/ICSC59802.2024.00025 .
[19] A. T. McCray, A. Burgun, O. Bodenreider, Aggregating umls semantic types for reducing conceptual
complexity, Studies in health technology and informatics 84 (2001) 216.
[20] P. Chandak, K. Huang, M. Zitnik, Building a knowledge graph to enable precision medicine,
Scientific Data 10 (2023) 67.
[21] M. Dumontier, A. Callahan, J. Cruz-Toledo, P. Ansell, V. Emonet, F. Belleau, A. Droit, Bio2rdf
release 3: a larger connected network of linked data for the life sciences, in: Proceedings of the
2014 international conference on posters & demonstrations track, volume 1272, Citeseer, 2014, pp.
401–404.
[22] K. Toutanova, D. Chen, Observed versus latent features for knowledge base and text inference,
in: A. Allauzen, E. Grefenstette, K. M. Hermann, H. Larochelle, S. W.-t. Yih (Eds.), Proceedings of
the 3rd Workshop on Continuous Vector Space Models and their Compositionality, Association
for Computational Linguistics, Beijing, China, 2015, pp. 57–66. URL: https://aclanthology.org/
W15-4007. doi: 10.18653/v1/W15-4007 .
[23] F. Mahdisoltani, J. Biega, F. M. Suchanek, A knowledge base from multilingual wikipedias–yago3,
Technical report, Telecom ParisTech (2014).
[24] J. Sardina, D. O’Sullivan, Structural characteristics of knowledge graphs determine the quality of
knowledge graph embeddings across model and hyperparameter choices (2020).
[25] T. Lacroix, N. Usunier, G. Obozinski, Canonical tensor decomposition for knowledge base comple-
tion, in: International Conference on Machine Learning, PMLR, 2018, pp. 2863–2872.
[26] T. Safavi, D. Koutra, Codex: A comprehensive knowledge graph completion benchmark, arXiv
preprint arXiv:2009.07810 (2020).
[27] B. Shi, T. Weninger, Open-world knowledge graph completion, in: Proceedings of the AAAI
conference on artificial intelligence, volume 32, 2018.
[28] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, N. Ueda, Learning systems of concepts with
an infinite relational model, in: AAAI, volume 3, 2006, p. 5.
[29] Z. Sun, Q. Zhang, W. Hu, C. Wang, M. Chen, F. Akrami, C. Li, A benchmarking study of embedding-
based entity alignment for knowledge graphs, arXiv preprint arXiv:2003.07743 (2020).
[30] M. Ali, M. Berrendorf, C. T. Hoyt, L. Vermue, S. Sharifzadeh, V. Tresp, J. Lehmann, PyKEEN 1.0: A
Python Library for Training and Evaluating Knowledge Graph Embeddings, Journal of Machine
Learning Research 22 (2021) 1–6. URL: http://jmlr.org/papers/v22/20-825.html.