RDF2vec.org

The ultimate guide to RDF2vec.

About RDF2vec

RDF2vec is a tool for creating vector representations of RDF graphs. In essence, RDF2vec creates a numeric vector for each node in an RDF graph.

RDF2vec was developed by Petar Ristoski as a key contribution of his PhD thesis Exploiting Semantic Web Knowledge Graphs in Data Mining [Ristoski, 2019], which he defended in January 2018 at the Data and Web Science Group at the University of Mannheim, supervised by Heiko Paulheim. In 2019, he was awarded the SWSA Distinguished Dissertation Award for this outstanding contribution to the field.

RDF2vec was inspired by the word2vec approach [Mikolov et al., 2013] for representing words in a numeric vector space. word2vec takes a set of sentences as input and trains a neural network using one of two variants: predicting a word given its context words (continuous bag of words, or CBOW), or predicting the context words given a word (skip-gram, or SG).
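To make the difference between the two variants concrete, the following minimal sketch lists the training examples each variant would derive from one sentence. It only illustrates how inputs and targets are paired; it does not train a network, and the function names, the toy sentence, and the window size are assumptions made for this illustration.

```python
from typing import List, Tuple


def skipgram_pairs(sentence: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """(center, context) pairs: SG predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


def cbow_examples(sentence: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """(context_words, center) examples: CBOW predicts the center word from its context."""
    examples = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        context = [sentence[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


# Hypothetical toy sentence, not taken from any real corpus.
sentence = ["the", "cat", "sat", "down"]
sg = skipgram_pairs(sentence, window=1)
cbow = cbow_examples(sentence, window=1)

print(sg[:3])   # first skip-gram pairs
print(cbow[1])  # CBOW example for the word "cat"
```

With a window of 1, the word "cat" yields the two skip-gram pairs ("cat", "the") and ("cat", "sat"), but a single CBOW example with the context ["the", "sat"].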

This approach can be applied to RDF graphs as well. In the original version presented at ISWC 2016 [Ristoski and Paulheim, 2016], random walks on the RDF graph are used to create sequences of RDF nodes, which are then used as input for the word2vec algorithm. It has been shown that such a representation can be utilized in many application scenarios, such as using knowledge graphs as background knowledge in data mining tasks, or for building content-based recommender systems [Ristoski et al., 2019].
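The walk generation step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the code of any of the implementations listed on this page: the graph is a hypothetical list of (subject, predicate, object) triples, walks follow outgoing edges only, and the names `build_adjacency` and `random_walks` are invented for this example.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def build_adjacency(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each subject to its outgoing (predicate, object) edges."""
    adjacency = defaultdict(list)
    for s, p, o in triples:
        adjacency[s].append((p, o))
    return adjacency


def random_walks(triples: List[Triple], start: str, num_walks: int = 4,
                 depth: int = 2, seed: int = 42) -> List[List[str]]:
    """Generate uniform random walks of at most `depth` hops from `start`."""
    rng = random.Random(seed)
    adjacency = build_adjacency(triples)
    walks = []
    for _ in range(num_walks):
        walk, node = [start], start
        for _ in range(depth):
            edges = adjacency.get(node)
            if not edges:  # dead end: the node has no outgoing edges
                break
            predicate, node = rng.choice(edges)
            walk.extend([predicate, node])
        walks.append(walk)
    return walks


# Hypothetical toy graph; the "ex:" identifiers are made up for illustration.
toy_triples = [
    ("ex:Mannheim", "ex:locatedIn", "ex:Germany"),
    ("ex:Mannheim", "ex:hasUniversity", "ex:UniMannheim"),
    ("ex:Germany", "ex:partOf", "ex:Europe"),
    ("ex:UniMannheim", "ex:locatedIn", "ex:Mannheim"),
]

walks = random_walks(toy_triples, "ex:Mannheim", num_walks=5, depth=2)
for walk in walks:
    print(" -> ".join(walk))
```

Each walk is a sequence that alternates nodes and predicates. A collection of such sequences plays the role of the sentences that a word2vec implementation expects as input.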

The resulting vectors have properties similar to those of word2vec embeddings. In particular, similar entities are closer in the vector space than dissimilar ones, which makes those representations well suited for learning patterns about those entities. In the example below, showing embeddings for DBpedia and Wikidata, countries and cities are grouped together, and European and Asian cities and countries form clusters:

The two figures above indicate that classes (in the example: countries and cities) can be separated well in the projected vector space, as indicated by the dashed lines. Zouaq and Martel have compared how well different knowledge graph embedding methods separate classes in a knowledge graph [Zouaq and Martel, 2020]. They have shown that RDF2vec outperforms other embedding methods such as TransE, TransH, TransD, ComplEx, and DistMult, in particular on smaller classes.
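Closeness in the vector space is commonly measured with cosine similarity. The sketch below shows that computation and a simple nearest-neighbour lookup. The three-dimensional vectors are made-up toy values chosen for illustration, not real RDF2vec embeddings (which typically have a few hundred dimensions), and the helper names are assumptions.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(entity: str, vectors: Dict[str, List[float]], n: int = 1) -> List[str]:
    """Return the n entities most similar to `entity`, excluding the entity itself."""
    others = [e for e in vectors if e != entity]
    others.sort(key=lambda e: cosine_similarity(vectors[entity], vectors[e]), reverse=True)
    return others[:n]


# Made-up toy vectors, NOT real embeddings.
toy_vectors = {
    "ex:Berlin": [0.9, 0.1, 0.8],
    "ex:Paris": [0.8, 0.2, 0.9],
    "ex:Japan": [0.1, 0.9, 0.2],
}

print(nearest("ex:Berlin", toy_vectors))
```

In this toy setup, the two cities point in nearly the same direction, so "ex:Paris" is returned as the nearest neighbour of "ex:Berlin".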

Implementations

There are a few different implementations of RDF2vec out there:

  • The original implementation from the 2016 paper. Not well documented. Uses Java for walk generation, and Python/gensim for the embedding training.
  • jRDF2vec is a more versatile and better performing Java-based implementation. Like the original one, it uses Java to generate the walks, and Python/gensim for training the embedding. A Docker image is also available.
  • pyRDF2vec is a pure Python-based implementation. It implements multiple strategies to generate the walks, not only random walks.
  • ataweel55's implementation is another pure Python-based implementation. It includes all strategies for biasing the walks described in [Cochez et al., 2017a] and [Al Taweel and Paulheim, 2020].

Models and Services

Training RDF2vec from scratch can take quite a bit of time. Here is a list of pre-trained models we know of:

There is also an alternative to downloading and processing an entire knowledge graph embedding (which may consume several GB):

  • KGvec2go provides a REST API for retrieving pre-computed embedding vectors for selected entities one by one, as well as further functions, such as computing the vector space similarity of two concepts, and retrieving the n closest concepts [Portisch et al., 2020].

Extensions and Variants

There are quite a few variants of RDF2vec which have been examined in the past.

  • Wembedder is a simplified version of RDF2vec which uses the raw triples of a knowledge graph as input to the word2vec implementation, instead of random walks. It serves pre-computed vectors for Wikidata. [Nielsen, 2017]
  • KG2vec follows the same idea of using triples as input to a Skip-Gram algorithm. [Soru et al., 2018]
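The triple-based variants differ from walk-based RDF2vec only in how the input sequences are formed. As a rough sketch (not the actual code of Wembedder or KG2vec; the function name and toy triples are assumptions), every triple simply becomes one three-token sentence:

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def triples_to_sentences(triples: List[Triple]) -> List[List[str]]:
    """Turn each (subject, predicate, object) triple into a three-token sentence."""
    return [[s, p, o] for s, p, o in triples]


# Hypothetical toy triples for illustration.
toy_triples = [
    ("ex:Mannheim", "ex:locatedIn", "ex:Germany"),
    ("ex:Germany", "ex:partOf", "ex:Europe"),
]

sentences = triples_to_sentences(toy_triples)
print(sentences)
```

Compared to random walks, such sentences never span more than one hop, so each entity only sees its direct neighbours as context.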

One area which has undergone extensive research is the creation of walks for the RDF2vec algorithm. While the original implementation uses random walks, alternatives that have been explored include:

  • The use of different heuristics for biasing the walks, e.g., preferring edges with more/less frequent predicates, preferring links to nodes with higher/lower PageRank, etc. An extensive study is available in [Cochez et al., 2017a].
  • A similar approach is analyzed in [Al Taweel and Paulheim, 2020], where embeddings for DBpedia are trained with external edge weights derived from page transition probabilities in Wikipedia.
  • In [Saeed and Prasanna, 2018], the identification of specific properties for groups of entities is discussed as a means to find task-specific edge weights.
  • Mukherjee et al. also observe that biasing the walks with prior knowledge on relevant properties and classes for a domain can improve the results obtained with RDF2vec [Mukherjee et al., 2019].
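One of the heuristics above, preferring edges with more or less frequent predicates, can be sketched as a weighted choice at each step of the walk. This is a simplified illustration and not the exact weighting scheme of any cited paper; the function name, the `prefer_frequent` flag, and the toy graph are assumptions.

```python
import random
from collections import Counter, defaultdict
from typing import List, Tuple

Triple = Tuple[str, str, str]


def biased_walk(triples: List[Triple], start: str, depth: int = 2,
                prefer_frequent: bool = True, seed: int = 0) -> List[str]:
    """Walk the graph, weighting each edge by its predicate's global frequency."""
    rng = random.Random(seed)
    predicate_counts = Counter(p for _, p, _ in triples)
    adjacency = defaultdict(list)
    for s, p, o in triples:
        adjacency[s].append((p, o))

    walk, node = [start], start
    for _ in range(depth):
        edges = adjacency.get(node)
        if not edges:  # dead end: no outgoing edges
            break
        if prefer_frequent:
            weights = [predicate_counts[p] for p, _ in edges]
        else:
            weights = [1.0 / predicate_counts[p] for p, _ in edges]
        # Sample one edge with probability proportional to its weight.
        predicate, node = rng.choices(edges, weights=weights, k=1)[0]
        walk.extend([predicate, node])
    return walk


# Hypothetical toy graph: "ex:type" occurs three times, "ex:rare" once.
toy_triples = [
    ("ex:A", "ex:type", "ex:C1"),
    ("ex:B", "ex:type", "ex:C1"),
    ("ex:D", "ex:type", "ex:C1"),
    ("ex:A", "ex:rare", "ex:X"),
]

print(biased_walk(toy_triples, "ex:A", depth=1, prefer_frequent=True))
print(biased_walk(toy_triples, "ex:A", depth=1, prefer_frequent=False))
```

In this toy graph, a walk starting at "ex:A" follows "ex:type" with probability 3/4 when frequent predicates are preferred, and with probability 1/4 when rare predicates are preferred. Other biases, such as PageRank or externally derived edge weights, would only change how `weights` is computed.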

RDF2vec relies on the word2vec embedding mechanism once the sequences are created. This is not the only choice:

While the original RDF2vec approach is agnostic to the type of knowledge encoded in RDF, it is also possible to extend the approach to specific types of datasets.

Other Resources

Other useful resources for working with RDF2vec:

Applications

RDF2vec has been used, among others, in the following applications:

  • TREC CAR is a benchmark for complex answer retrieval. The authors use pre-trained RDF2vec embeddings as one means to represent queries and answers, and for matching them onto each other. [Nanni et al., 2017a]
  • Nanni et al. describe a system for harvesting event collections from Wikipedia, where RDF2vec is used internally for entity ranking. [Nanni et al., 2017b]
  • TIEmb is an approach for learning subsumption relations using RDF2vec embeddings. [Ristoski et al., 2017]
  • MERGILO is a tool for merging structured knowledge extracted from text. A refinement of MERGILO using RDF2vec embeddings on FrameNet is discussed in [Alam et al., 2017].
  • REMES is an entity summarization approach which uses RDF2vec to select a suitable subset of statements for describing an entity. [Gunaratna et al., 2017]
  • Kejriwal and Szekely discuss the use of RDF2vec embeddings for entity type prediction in knowledge graphs. [Kejriwal and Szekely, 2017] Another approach in that direction is proposed by Sofronova et al., who contrast supervised and unsupervised methods for exploiting RDF2vec embeddings for type prediction. [Sofronova et al., 2020]
  • Inan and Dikenelli demonstrate the usage of RDF2vec embeddings in named entity disambiguation in the entity disambiguation frameworks DoSeR and AGDISTIS. [Inan and Dikenelli, 2017]
  • Wang et al. have used RDF2vec embeddings for analyzing entity co-occurrence in tweets [Wang et al., 2017].
  • Hascoet et al. show how to use RDF2vec for image classification, especially for classes of images for which no training data is available, i.e., zero-shot-learning [Hascoet et al., 2017].
  • EARL is a named entity linking tool which uses pre-trained RDF2vec embeddings. [Dubey et al., 2018]
  • GraphEmbeddings4DDI utilizes RDF2vec for predicting drug-drug interactions [Çelebi et al., 2018]. A similar system is introduced by Karim et al., using a complex LSTM on top of the entity embeddings generated with RDF2vec [Karim et al., 2019].
  • Ad Hoc Table Retrieval using Semantic Similarity describes the use of pre-trained RDF2vec embeddings for retrieving Wikipedia tables. [Zhang and Balog, 2018]
  • Nanni et al. showcase the use of RDF2vec embeddings for entity aspect linking in [Nanni et al., 2018].
  • Jurisch and Igler demonstrate the utilization of RDF2vec embeddings for detecting changes in ontologies in [Jurisch and Igler, 2018].
  • KGA-CGM is a system for describing images with captions. It uses RDF2vec embeddings for handling out-of-training entities [Mogadala et al., 2018].
  • ALOD2vec Matcher is an ontology matching system which uses pre-trained embeddings on the WebIsALOD knowledge graph to determine the similarity of two concepts. [Portisch and Paulheim, 2018]
  • Biswas et al. discuss the use of RDF2vec as a signal for predicting infobox types in Wikipedia articles [Biswas et al., 2018].
  • Egami et al. show the use case of geospatial data analytics in urban spaces by constructing a geospatial knowledge graph and computing RDF2vec embeddings thereon [Egami et al., 2018].
  • Hees discusses the use of pre-trained RDF2vec models for predicting human associations of terms [Hees, 2018].
  • The utilization of RDF2vec for content-based recommender systems is discussed in [Saeed and Prasanna, 2018] and [Ristoski et al., 2019].
  • Ammar and Celebi showcase the use of RDF2vec embeddings for the fact validation task at the 2019 edition of the Semantic Web Challenge [Ammar and Celebi, 2019]. A similar approach is pursued by Pister and Atemezing [Pister and Atemezing, 2019].
  • Cyber-all-intel is an application in the computer security domain. It uses RDF2vec vectors for retrieving information on security alerts [Mittal et al., 2019].
  • AnyGraphMatcher is another ontology matching system which leverages RDF2vec embeddings trained on the two input ontologies to match [Lütke, 2019].
  • Azmy et al. use RDF2vec for entity matching across knowledge graphs, and show a large-scale study for matching DBpedia and Wikidata [Azmy et al., 2019].
  • Jurgovsky demonstrates the use of RDF2vec for data augmentation on the task of credit card fraud detection [Jurgovsky, 2019].
  • Türker discusses the use of RDF2vec for text categorization by embedding both texts and categories [Türker, 2019].
  • Vakulenko demonstrates the use of RDF2vec in dialogue systems [Vakulenko, 2019].
  • G-Rex is a tool for relation extraction from text which leverages RDF2vec entity embeddings [Ristoski et al., 2020].
  • Chen et al. show that RDF2vec embeddings can be used for relation prediction and yield results competitive with TransE and DistMult [Chen et al., 2020].

References

These are the core publications of RDF2vec:

  1. Petar Ristoski, Heiko Paulheim: RDF2Vec: RDF Graph Embeddings for Data Mining. International Semantic Web Conference, 2016
  2. Petar Ristoski, Jessica Rosati, Tommaso Di Noia, Renato De Leone, Heiko Paulheim: RDF2Vec: RDF Graph Embeddings and Their Applications. Semantic Web Journal 10(4), 2019

Further references used above:

  1. Mehwish Alam, Diego Reforgiato Recupero, Misael Mongiovi, Aldo Gangemi, Petar Ristoski: Event-based knowledge reconciliation using frame embeddings and frame similarity. Knowledge-based Systems (135), 2017
  2. Faisal Alshargi, Saeedeh Shekarpour, Tommaso Soru, Amit Sheth: Concept2vec: Metrics for Evaluating Quality of Embeddings for Ontological Concepts. Spring Symposium on Combining Machine Learning with Knowledge Engineering, 2019
  3. Ahmad Al Taweel, Heiko Paulheim: Towards Exploiting Implicit Human Feedback for Improving RDF2vec Embeddings. Deep Learning for Knowledge Graphs Workshop, 2020
  4. Ammar Ammar, Remzi Celebi: Fact Validation with Knowledge Graph Embeddings. International Semantic Web Conference, 2019
  5. Michael Azmy, Peng Shi, Jimmy Lin, Ihab F. Ilyas: Matching Entities Across Different Knowledge Graphs with Graph Embeddings. arxiv.org, 2019
  6. Remzi Çelebi, Erkan Yaşar, Hüseyin Uyar, Özgür Gümüş, Oguz Dikenelli, Michel Dumontier: Evaluation of knowledge graph embedding approaches for drug-drug interaction prediction using Linked Open Data. International Conference Semantic Web Applications and Tools for Life Sciences, 2018
  7. Russa Biswas, Rima Türker, Farshad Bakhshandegan-Moghaddam, Maria Koutraki, Harald Sack: Wikipedia Infobox Type Prediction Using Embeddings. Workshop on Deep Learning for Knowledge Graphs and Semantic Technologies, 2018
  8. Jiaoyan Chen, Xi Chen, Ian Horrocks, Erik B. Myklebust, Ernesto Jiménez-Ruiz: Correcting Knowledge Base Assertions. The Web Conference, 2020
  9. Michael Cochez, Petar Ristoski, Simone Paolo Ponzetto, Heiko Paulheim: Biased Graph Walks for RDF Graph Embeddings. International Conference on Web Intelligence, Mining, and Semantics, 2017
  10. Michael Cochez, Petar Ristoski, Simone Paolo Ponzetto, Heiko Paulheim: Global RDF Vector Space Embeddings. International Semantic Web Conference, 2017
  11. Mohnish Dubey, Debayan Banerjee, Debanjan Chaudhuri, Jens Lehmann: EARL: Joint Entity and Relation Linking for Question Answering over Knowledge Graphs. International Semantic Web Conference, 2018
  12. Shusaku Egami, Takahiro Kawamura, Akihiko Ohsuga: Predicting Urban Problems: A Comparison of Graph-based and Image-based Methods. Joint International Semantic Technology Conference, 2018
  13. Michael Färber: The Microsoft Academic Knowledge Graph: A Linked Data Source with 8 Billion Triples of Scholarly Data. International Semantic Web Conference, 2019
  14. Kalpa Gunaratna, Amir Hossein Yazdavar, Krishnaprasad Thirunarayan, Amit Sheth, Gong Cheng: Relatedness-based Multi-Entity Summarization. International Joint Conference on Artificial Intelligence, 2017
  15. Tristan Hascoet, Yasuo Ariki, Tetsuya Takiguchi: Semantic Web and Zero-Shot Learning of Large Scale Visual Classes. International Workshop on Symbolic-Neural Learning, 2017
  16. Jörn Hees: Simulating Human Associations with Linked Data. University of Kaiserslautern, 2018
  17. Ole Magnus Holter, Erik B. Myklebust, Jiaoyan Chen, Ernesto Jimenez-Ruiz: Embedding OWL Ontologies with OWL2Vec. International Semantic Web Conference, 2019
  18. Emrah Inan, Oguz Dikenelli: Effect of Enriched Ontology Structures on RDF Embedding-Based Entity Linking. Metadata and Semantic Research, 2017
  19. Johannes Jurgovsky: Context-Aware Credit Card Fraud Detection. University of Passau, 2019
  20. Md Rezaul Karim, Michael Cochez, Joao Bosco Jares, Mamtaz Uddin, Oya Beyan, Stefan Decker: Drug-drug interaction prediction based on knowledge graph embeddings and convolutional-LSTM network. ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 2019
  21. Matthias Jurisch, Bodo Igler: RDF2Vec-based Classification of Ontology Alignment Changes. Workshop on Deep Learning for Knowledge Graphs and Semantic Technologies, 2018
  22. Mayank Kejriwal, Pedro Szekely: Supervised Typing of Big Graphs using Semantic Embeddings. International Workshop on Semantic Big Data, 2017
  23. Alexander Lütke: AnyGraphMatcher Submission to the OAEI Knowledge Graph Challenge 2019. International Workshop on Ontology Matching, 2019
  24. Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean: Efficient Estimation of Word Representations in Vector Space. International Conference on Learning Representations, 2013
  25. Sudip Mittal, Anupam Joshi, Tim Finin: Cyber-All-Intel: An AI for Security related Threat Intelligence. arxiv.org, 2019
  26. Aditya Mogadala, Umanga Bista, Lexing Xie, Achim Rettinger: Knowledge Guided Attention and Inference for Describing Images Containing Unseen Objects. Extended Semantic Web Conference, 2018
  27. Sourav Mukherjee, Tim Oates, Ryan Wright: Graph Node Embeddings using Domain-Aware Biased Random Walks. arxiv.org, 2019
  28. Federico Nanni, Bhaskar Mitra, Matt Magnusson, Laura Dietz: Benchmark for Complex Answer Retrieval. ACM International Conference on the Theory of Information Retrieval, 2017
  29. Federico Nanni, Simone Paolo Ponzetto, Laura Dietz: Building Entity-Centric Event Collections. ACM/IEEE Joint Conference on Digital Libraries, 2017
  30. Federico Nanni, Simone Paolo Ponzetto, Laura Dietz: Entity-aspect linking: providing fine-grained semantics of entities in context. International Joint Conference on Digital Libraries, 2018
  31. Finn Årup Nielsen: Wembedder: Wikidata entity embedding web service. arxiv.org, 2017
  32. Maria Angela Pellegrino, Michael Cochez, Martina Garofalo, Petar Ristoski: A Configurable Evaluation Framework for Node Embedding Techniques. Extended Semantic Web Conference, 2019
  33. Maria Angela Pellegrino, Abdulrahman Altabba, Martina Garofalo, Petar Ristoski, Michael Cochez: GEval: A Modular and Extensible Evaluation Framework for Graph Embedding Techniques. Extended Semantic Web Conference, 2020
  34. Jeffrey Pennington, Richard Socher, Christopher D. Manning: GloVe: Global Vectors for Word Representation. Empirical Methods in Natural Language Processing, 2014
  35. Alexis Pister, Ghislain Atemezing: Knowledge Graph Embedding for Triples Fact Validation. International Semantic Web Conference, 2019
  36. Jan Portisch and Heiko Paulheim: ALOD2vec Matcher. International Workshop on Ontology Matching, 2018
  37. Jan Portisch, Michael Hladik, Heiko Paulheim: KGvec2go - Knowledge Graph Embeddings as a Service. International Conference on Language Resources and Evaluation, 2020
  38. Petar Ristoski, Stefano Faralli, Simone Paolo Ponzetto, Heiko Paulheim: Large-scale taxonomy induction using entity and word embeddings. International Conference on Web Intelligence, 2017
  39. Petar Ristoski: Exploiting Semantic Web Knowledge Graphs in Data Mining. IOS Press, Studies on the Semantic Web (38), 2019
  40. Petar Ristoski, Anna Lisa Gentile, Alfredo Alba, Daniel Gruhl, Steven Welch: Large-scale relation extraction from web documents and knowledge graphs with human-in-the-loop. Semantic Web Journal (60), 2020
  41. Muhammad Rizwan Saeed, Viktor K. Prasanna: Extracting Entity-Specific Substructures for RDF Graph Embedding. IEEE International Conference on Information Reuse and Integration, 2018
  42. Radina Sofronova, Russa Biswas, Mehwish Alam, Harald Sack: Entity Typing based on RDF2Vec using Supervised and Unsupervised Methods. Extended Semantic Web Conference, 2020
  43. Tommaso Soru, Stefano Ruberto, Diego Moussallem, Edgard Marx, Diego Esteves, Axel-Cyrille Ngonga Ngomo: Expeditious Generation of Knowledge Graph Embeddings. European Conference on Data Analysis, 2018
  44. Rima Türker: Knowledge-Based Dataless Text Categorization. Extended Semantic Web Conference, 2019
  45. Svitlana Vakulenko: Knowledge-based Conversational Search. TU Wien, 2019
  46. Yiwei Wang, Mark James Carman, Yuan Fang Li: Using knowledge graphs to explain entity co-occurrence in Twitter. ACM Conference on Knowledge and Information Management, 2017
  47. Shuo Zhang and Krisztian Balog: Ad Hoc Table Retrieval using Semantic Similarity. The Web Conference, 2018
  48. Amal Zouaq and Felix Martel: What is the schema of your knowledge graph?: leveraging knowledge graph embeddings and clustering for expressive taxonomy learning. International Workshop on Semantic Big Data, 2020

Acknowledgements

The original development of RDF2vec was funded in the project Mine@LOD by the Deutsche Forschungsgemeinschaft (DFG) under grant number PA 2373/1-1 from 2013 to 2018.

Contact

If you are aware of any implementations, extensions, pre-trained models, or applications of RDF2vec not listed on this Web page, please get in touch with Heiko Paulheim.