The World of Data: A Podcast on Knowledge Graphs and Databases

Shownotes

+++Unsere 23. Podcast-Episode: Die Welt der Daten: Ein Podcast über Wissensgraphen und Datenbanken+++

(English Version below)

In der aktuellen Folge unseres Datenwissenschafts-Podcasts widmen wir uns einem Thema, das für viele im Bereich der künstlichen Intelligenz und Datenanalyse von zentraler Bedeutung ist. Gemeinsam mit Sruthi Radhakrishnan, einer erfahrenen AI Consultant bei itemis, beleuchten wir die Grundlagen und fortgeschrittenen Konzepte, die diese Technologien so kraftvoll und unverzichtbar machen.

Sruthi teilt mit uns ihr fundiertes Wissen und ihre praktischen Erfahrungen, um einen detaillierten Überblick über die Welt der Wissensgraphen zu geben. Wir diskutieren, was Wissensgraphen sind, wie sie funktionieren und warum sie ein unverzichtbares Werkzeug in der modernen Datenwissenschaft darstellen. Besonderes Augenmerk legen wir auf die verschiedenen Arten von Datenmodellierungen und Einbettungen, die für die effektive Nutzung von Wissensgraphen entscheidend sind.

Ebenso tauchen wir tief in die Thematik der Graphdatenbanken ein und vergleichen sie mit Vektordatenbanken. Dabei betrachten wir die spezifischen Anwendungsfälle und Vorteile, die Graphdatenbanken bieten.

Sruthi bietet nicht nur Einblicke in die technischen Aspekte, sondern gibt auch wertvolle Ratschläge für diejenigen, die ihre Kenntnisse in Data Science erweitern möchten. Ob du neu in diesem Feld bist oder deine Fähigkeiten vertiefen möchtest, diese Episode bietet praktische Tipps und Einblicke.

Diese Episode inspiriert dazu, auszuprobieren, wie diese Technologien genutzt werden können, um die Grenzen dessen, was mit Daten möglich ist, neu zu definieren. Bereite dich darauf vor, die Tiefe und Breite der Datenwissenschaft zu erkunden und entdecke, wie Wissensgraphen und Datenbanken die Landschaft der Informationstechnologie umgestalten.

+++++ English Version +++++

In the latest episode of our Data Science Podcast, we delve into a topic that is of paramount importance to many in the field of artificial intelligence and data analysis. Alongside Sruthi Radhakrishnan, an experienced AI Consultant at itemis, we illuminate the fundamental and advanced concepts that make these technologies so powerful and indispensable.

Sruthi shares her in-depth knowledge and practical experience to provide a detailed overview of the world of knowledge graphs. We discuss what knowledge graphs are, how they function, and why they are an essential tool in modern data science. Special attention is paid to the various types of data modeling and embeddings that are crucial for the effective use of knowledge graphs.

We also dive deeply into the subject of graph databases and compare them with vector databases, considering the specific use cases and benefits that graph databases offer.

Sruthi provides not just insights into the technical aspects, but also valuable advice for those looking to expand their knowledge in data science. Whether you are new to this field or looking to deepen your skills, this episode offers practical tips and insights.

This episode inspires listeners to explore how these technologies can be utilized to redefine the boundaries of what is possible with data. Prepare to explore the depth and breadth of data science and discover how knowledge graphs and databases are transforming the landscape of information technology.

For more information on data science, feel free to visit our website: https://datamics.com/

Or join one of our many online courses: https://www.udemy.com/user/datamics/

You can find Sruthi Radhakrishnan's LinkedIn profile at the following link: https://www.linkedin.com/in/sruthi--radhakrishnan/

Transcript

00:01.49

datamics

Hi and welcome to the next episode of our data science podcast with milk and sugar. Today we again have an English episode, and I'm very happy to welcome Sruthi Radhakrishnan.

00:45.10

Sruthi

Same here, thank you for having me on this podcast. Oh, so I thought that was my cue to start. Sorry.

00:50.00

datamics

No, wait, I just want to pronounce it correctly, if you still have the page there. Yeah, let's do it. Hi and welcome to the podcast Data Science with Milk and Sugar. Today we have an English episode, and I'm very happy to welcome Sruthi Radhakrishnan. She is an AI consultant at itemis and has experience with knowledge graphs. We will dive into this topic today: we will look at automatically predicted links, we will go into graph databases for LLMs, the hot topic right now, and the graph embeddings which are used there. We will also go a little bit into the differences between graph and vector databases, as she has knowledge of both topics, which will be very interesting. So welcome, Sruthi.

01:52.63

Sruthi

Thank you, Renee, for having me on this podcast. It's a pleasure to be here. A short introduction, and thank you for the introduction: as Renee already mentioned, I work as an AI consultant, especially in the domain of systems engineering. I also have a lot of publicly funded research projects and am slowly transitioning into customer projects in the company. I have a master's in information technology from the University of Stuttgart, and I did my master's thesis on this topic of knowledge graphs, basically systems engineering knowledge graphs, that is, traceability graphs and link prediction. As Renee already mentioned, that is automatically connecting two components based on textual similarity, structural similarity, etc., and I can't wait to dig deeper into this topic with you all.

02:44.68

datamics

So thanks for the introduction, and I'm very curious about the topics today. As you mentioned already, knowledge graphs are one of your core topics, and you also did research in this area. So, what are knowledge graphs?

02:58.30

Sruthi

So if you Google what a knowledge graph is, you basically get a ton of information, because knowledge graphs are almost everywhere. To give you an example: if you are looking to buy something for your home on Amazon, it uses a knowledge graph in the backend, which is a recommendation engine. So you can think of a knowledge graph as a network of elements coming together; it gives an overview of different entities that are put together in a connected way. This is my understanding of a knowledge graph, and an example, as I mentioned, is Amazon's recommendation engine. A knowledge graph has three main components, which are the nodes, the edges, and the labels that are given to these nodes and edges; this forms a semantic network. Whether you are from a background of systems engineering, healthcare, or whatever, you can also think of a knowledge graph as an ontology, where it helps us to view things more clearly in a connected-entity world.
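To make the nodes, edges, and labels concrete, here is a minimal sketch of a tiny product knowledge graph in Python with networkx. The entities, labels, and relationship types are invented for illustration; a real recommendation backend would of course be far larger and more sophisticated.

```python
import networkx as nx

# A tiny knowledge graph: nodes carry a label, edges carry a relationship type.
kg = nx.MultiDiGraph()
kg.add_node("EspressoMachine", label="Product")
kg.add_node("CoffeeGrinder", label="Product")
kg.add_node("Kitchen", label="Category")
kg.add_node("Alice", label="Customer")

kg.add_edge("EspressoMachine", "Kitchen", relation="BELONGS_TO")
kg.add_edge("CoffeeGrinder", "Kitchen", relation="BELONGS_TO")
kg.add_edge("Alice", "EspressoMachine", relation="BOUGHT")

# Naive content-based recommendation: suggest other products in the same category.
def recommend(customer):
    bought = [t for _, t, d in kg.out_edges(customer, data=True) if d["relation"] == "BOUGHT"]
    categories = {t for b in bought for _, t, d in kg.out_edges(b, data=True)
                  if d["relation"] == "BELONGS_TO"}
    return {p for c in categories for p, _, d in kg.in_edges(c, data=True)
            if d["relation"] == "BELONGS_TO" and p not in bought}

print(recommend("Alice"))  # -> {'CoffeeGrinder'}
```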

04:03.88

datamics

All right. And if you dive into the recommendation engine, I think a recommendation engine has two sides: one is user-based and one is content-based. So here, would you go rather on the content-based side or rather on the user-based side?

04:18.22

Sruthi

Mostly on the content-based side, basically, because the field of systems engineering uses a lot of data from different aspects. This is mostly concerned with the content: how these entities are linked together, what kind of topology exists between these graphs, how we can model this data into graphs, basically. So it's more focused on the content-based side. But for evaluation purposes, if you have to evaluate these knowledge graphs based on domain knowledge or expertise, then the user is involved. But I would say that on a day-to-day basis I'm working more on the content-based side and only partially on the user-based side; I could say it's 80/20, basically.

05:06.20

datamics

All right. And then, probably, on the content-based side, if you look at a recommendation engine and you search for some term, it can distinguish between, say, the Gulf of Mexico, the Golf as in the car, and playing golf. Because if you do a normal search on Elasticsearch, it cannot distinguish the meaning and the differences.

05:29.52

Sruthi

Yeah, as you mentioned, semantics is also something which is very important when it comes to understanding how graphs have been laid out, how graphs have been modeled, and especially the meaning of a word. The context of a word also depends on the connected entities, basically. This is where we do a lot of analysis to identify how we could group these embeddings together. So if you are working with a lot of recommendation engines, then semantics is just as important as the structure, and the semantics in a graph are then determined by the connected components, as in your example. So if it is a golf ball, then the other entity which might be related to this is a ball, or you have a set of properties that are associated with this particular graph. For example, if you're using something like a label property graph, then you have a variety of labels. And one of the questions that pops up in the knowledge graph and ontology world is: for which task are you building this graph? Is it a recommendation task, is it a task where you have to identify entities, is it a task where you have to do transactions, etc.? So it really depends on which task you're modeling this data for. Based on the task you're solving, or the use case you're confronted with, this can be efficiently remodeled and extracted as one layer out of the existing semantic knowledge graph, which has different semantics associated with the connected entities.

06:57.22

datamics

All right. And what are the challenges? When you have been working with knowledge graphs, what are the main challenges in this field, and how can you solve them?

07:05.58

Sruthi

So one of the main challenges that you're confronted with, especially in my field, is the ability to model data as a graph. When you look into the data, you can already see that it is a graph; you can see with a human brain that this is a graph. But modeling that into a structure that is well suited to a machine learning algorithm or a graph neural network, that's the tricky part. So I would say modeling the data is one of the very tricky parts, especially when the data is from a domain that doesn't have much prior work or much information about how it is structured and what has been captured. So this is one of the major problems, modeling the data. With respect to datasets, I feel the graph community has provided a lot of datasets to try out, especially for different domains. One of the groups that is actively involved is SNAP, which is from Stanford; they do a lot of research and give out a lot of datasets for working with graph databases. There are also a couple of open-source projects, Miomi for example, that give you a lot of graph-based datasets to work with. So if you want to get started with graphs, I would say there is an abundance of data. But when you apply that to a domain, then modeling the data with respect to the domain knowledge, with respect to the connections, bringing all the things together, having an architecture,

08:36.36

Sruthi

etc., is very important. So, to answer your second question, the solution is basically what I mentioned just now. In IT companies, or in any domain, there are different concepts for storing data. Initially we had relational databases, which are basically a way of putting foreign keys and values, or several relational tables, together; then document-based databases like MongoDB came into the picture; and now we have graph databases where we can store data from multiple sources. So I feel that having a data architecture that helps us bring all this data together in a centralized way is very important, and one efficient technique that has been introduced is the data mesh, which has been widely used in different companies. There are also different strategies for building data lakes, for example, if you don't know what to extract from your knowledge because your domain is very widespread. All those database technologies and databases come in handy to solve this problem of bringing all the data together.

09:42.47

datamics

So is this data modeling a standard challenge? Because I see a lot of people: they start with data science, then they want to train the cool models, and then they see that 80% of the work is cleaning the data, getting the data, preparing the data. And if you come now to knowledge graphs, is this similar, that you have things like removing null values and checking where the data is, or are there specific preparations? Do you need a specific format, do you need to reduce the features, or are there some specific constraints you have to deal with?

10:18.52

Sruthi

So you're 100% right: 80% of a data scientist's life is spent cleaning the data. The regular preprocessing techniques are also applicable to graph data, but on top of this we also need to extract features which are different from those in traditional machine learning or deep learning on sequential data. Because if you have a look at graphs, graphs live in a non-Euclidean space, while machine learning and deep learning tasks on text, audio, images, and videos are in a Euclidean space, basically. So capturing the structural aspects of all the nodes and edges in the graph needs a little bit of additional processing. For example, if you are solving a link prediction task, which is basically like a recommendation system, then you might also need to create something called edge embeddings, which need to be computed at a later step. So while working with graph data, you need a set of derived properties in addition to the input data you have, based on the task that you're solving. And after you have created the recommendation, it is also equally important to have some post-processing methods to show the user what they really want to see in the particular task they're trying to accomplish. So in addition to the traditional preprocessing, we also have some extra

11:43.80

Sruthi

processing, which is basically based on graph theory, the features that we extract, and post-processing based on the task that we're trying to solve.

11:48.35

datamics

All right. Then, the edge embeddings, I think this is a little bit more specific; can you dive a little bit deeper into what this is exactly, what it means?

11:58.10

Sruthi

So if you are from the NLP world, or in general if you have worked with a lot of transformers and all those things, or even without them, embeddings are something that we use on a day-to-day basis. An embedding is nothing but a vector representation of your data. We all know that the model you're training sees everything as numbers, which are basically vectors, with all these matrices coming into the picture, and you have functions that help you create what you want to create, classify what you want to classify, or predict what you want to predict. In graphs we have three types of embeddings: graph-level embeddings, node-level embeddings, and edge-level embeddings. Which one you use depends on the task that you're trying to solve, especially the graph-based task. For example, if we want to do something like document classification, that is where graph embeddings come into the picture. You have documents from different domains; let's consider, for example, medicine, where you have documents about skin diseases, documents about brain diseases, documents about different types of patient history, etc. Everything is modeled as a graph, and a graph embedding is something that captures all this information about the different graphs: for example, a single embedding for the whole skin-disease graph, a single embedding for the whole heart-disease graph. This is where graph embeddings come into the picture.

13:30.27

Sruthi

And then, when you have to cluster node-wise, that's where we dive deeper into node embeddings. In general these embeddings also have different loss functions; it's mostly transductive and inductive training that happens in these so-called graph neural networks. And then we have the node2vec algorithm or the DeepWalk algorithm, which help us create these node embeddings for clustering the nodes. Next is the edge embedding, which is basically for the edges between two nodes. So you have two node embeddings already computed, and then you combine them to capture the relationship between the nodes; you can use average, Hadamard, L1, or L2 operators to create this edge embedding. So these are the three major types of embeddings that are used in graphs, and there are several algorithms to do this, based on the task that you're trying to solve, as I said.
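As a rough illustration of the three embedding levels and the edge operators just mentioned, here is a small numpy sketch. It assumes node embeddings have already been computed by something like node2vec or DeepWalk; the vectors below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume these node embeddings came from node2vec / DeepWalk; random stand-ins here.
node_emb = {name: rng.normal(size=8) for name in ["A", "B", "C", "D"]}

# Edge embedding operators (as in the node2vec paper): combine two node vectors.
def edge_embedding(u, v, op="hadamard"):
    a, b = node_emb[u], node_emb[v]
    if op == "average":
        return (a + b) / 2
    if op == "hadamard":
        return a * b
    if op == "l1":
        return np.abs(a - b)
    if op == "l2":
        return (a - b) ** 2
    raise ValueError(op)

# A crude graph-level embedding: mean-pool all node embeddings of one graph.
graph_emb = np.mean(list(node_emb.values()), axis=0)

print(edge_embedding("A", "B", op="l2").shape, graph_emb.shape)  # (8,) (8,)
```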

14:31.33

datamics

Thanks for the very nice introduction and explanation of the graphs and the different embedding types; I think it was a really great overview, and it's also a good basis to dive in a little bit into the hot topics around LLMs, because there you often hear about graph embeddings. What is the relation to LLMs, so ChatGPT and so on? How can this be used in that context?

14:51.98

Sruthi

That is still very exciting research that's going on. If you're looking into graph-based LLMs, I think there is only one on Hugging Face; I forgot the name, but if you go into graph machine learning on Hugging Face, that's the only large language model you find which is trained on graphs, basically. What interests people when it comes to LLMs and combining them with knowledge graphs is this new field, I wouldn't say it's new, but it's a new trend which is evolving, which is retrieval augmented generation. That is, combining the knowledge that you have in your domain, it could be a PDF file, it could be a document, it could be Excel files, etc., with an LLM, and then you ask questions, and the LLM knows your knowledge and uses how it was trained to give you the answer. So this is retrieval augmented generation. What people are trying to focus on more is: what if the data is not unstructured, basically a document or a PDF file, and they want to have it as a graph? The advantage of having graph embeddings in combination with an LLM is that you can capture these transitive dependencies, because, as you mentioned in your example, there is a golf ball, which could be a ball, which could be a game, etc. So by using graphs, you can uncover this hidden information.

16:22.18

Sruthi

You don't have to model the whole unstructured data; rather, you can use pieces of the data which are connected with other entities, and that helps us to capture even the undirected relationships that can exist in the graph. These transitive dependencies are something that people are very much interested in; they need to understand how the relationships between different components look. To give you an example, one of the things that has been done by IBM Watson is that they took a collection of different cancer medicines that have been given around the world, and they started looking into different prescriptions for the same kinds of treatments, for the same kinds of patients, and the algorithm they built, which is basically a set of connections, entities, etc., started suggesting medicines that showed similar patterns in a different country, suggested by a different doctor, with better results than the prescription they would suggest now. So basically, all those patterns that emerge in these graphs are very important for making LLMs less prone to hallucination. One of the main key reasons for using knowledge graph embeddings with LLMs is that it reduces the hallucination problem. And I think there are three different ways of combining LLMs and knowledge graphs, one of which I gave as an example; and the other thing is that if you don't have

17:52.87

Sruthi

an LLM that gives you question answering based on your knowledge, then that's also something that people are very much interested in: answering through knowledge graphs, creating queries using knowledge graphs, etc. So these kinds of relationships are very important, because the LLM still has the problem of hallucination, etc., and that could be solved using knowledge graphs, basically.
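As a rough sketch of what retrieval augmented generation over a graph can look like, the snippet below pulls the neighborhood of a matched entity out of a small networkx graph and turns it into triples for an LLM prompt. The ask_llm function, the entity-matching step, and the example facts are placeholders for illustration, not any particular library's API.

```python
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Aspirin", "Headache", relation="TREATS")
kg.add_edge("Aspirin", "Stomach irritation", relation="MAY_CAUSE")
kg.add_edge("Ibuprofen", "Headache", relation="TREATS")

def neighborhood_triples(entity, hops=1):
    """Collect (subject, relation, object) triples around an entity."""
    nodes = nx.ego_graph(kg.to_undirected(), entity, radius=hops).nodes
    return [(u, d["relation"], v) for u, v, d in kg.edges(data=True)
            if u in nodes and v in nodes]

def ask_llm(prompt: str) -> str:
    # Placeholder: call whichever LLM API you use here.
    return "..."

question = "What treats a headache?"
entity = "Headache"  # in a real system: entity linking / embedding search over the question
context = "\n".join(f"{s} {r} {o}" for s, r, o in neighborhood_triples(entity))
answer = ask_llm(f"Answer using only these facts:\n{context}\n\nQuestion: {question}")
```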

18:18.33

datamics

Yeah, that's a cool feature, and that's why it's also very hyped, or widely used, to use graphs with LLMs, because of the hallucination, the biggest weak point of LLMs that a lot of people are talking about. And if this can solve some of these hallucinations, then

18:23.89

Sruthi

Yeah.

18:36.48

datamics

it will be even much better. So now we can also jump over, because we just talked about graph databases for LLMs, but I also know of practical use cases which work with vector databases, which can also be a good approach. We also had a different podcast episode some time ago about these vector databases, which are used with LLMs, also to prevent hallucination. So what is the difference between a graph and a vector database?

19:09.31

Sruthi

So in general, graph databases are used to store these entities. Traditionally, they were built to solve the problem of the many join operations you have to do when you want to extract data from relational databases and then do deeper analysis. For example, if you have to find the shortest path in a document or SQL database, then the query is very long and the time it takes to traverse the graph is very high, but graph databases are meant to solve this problem by storing connected entities together, so that you can do deeper analysis, write shorter queries, and have lower querying time with high scalability, etc. And there is a specific concept in graphs which helps us do this more efficiently, called indexing; these indexes are assigned to the entities that we put into the graph database. After the advent of LLMs, or after people started asking what if we can store this whole embedding as a vector rather than as connected entities, that changed the whole story. When it comes to vector databases, a vector database is a place where all the vectors are stored for the particular entities you want to vectorize. Let's say you're doing something like a clustering operation, for example; then the

20:40.72

Sruthi

text that you're clustering is embedded using an embedding model, for example one of the embedding models by OpenAI, text-embedding-ada-002, which helps you to embed all the vectors and store them in a vector DB. The advantage of converting the text into a vector and storing it in a vector DB is that you can run computations more efficiently. If you want to do a similarity calculation, or if you have to do something like retrieval augmented generation, then you remove this step of converting the data into vectors each time; rather, you store the vector once and perform all the operations on these vectors. There are also several vector tools that have started appearing, like FAISS, for example, which is from Meta, and which also helps you to store and search very large collections of vectors. They are meant to do very quick transactions: if you are in an operation, for example, you're placing an order on Amazon, or you're doing a financial transaction like paying your university fees, etc., you need more efficiency, and that's where all these vector stores really come into play; you can natively run operations on the vectors rather than converting them each time. To come back to graph databases, I think Neo4j now has a vector index, but I'm not very sure; I think there was a blog post about having a vector index in Neo4j.
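As a small illustration of the embed-once-then-search idea, here is a minimal FAISS sketch in Python. The vectors are random placeholders standing in for the output of an embedding model such as text-embedding-ada-002, and the index is the simplest exact-search variant.

```python
import faiss
import numpy as np

d = 1536                      # dimensionality, e.g. of text-embedding-ada-002 vectors
rng = np.random.default_rng(42)
doc_vectors = rng.random((1000, d), dtype=np.float32)   # placeholder document embeddings

index = faiss.IndexFlatL2(d)  # exact L2 search; ANN indexes like IndexHNSWFlat also exist
index.add(doc_vectors)        # store the vectors once

query = rng.random((1, d), dtype=np.float32)            # placeholder query embedding
distances, ids = index.search(query, k=5)               # 5 nearest stored vectors
print(ids[0])
```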

22:12.75

Sruthi

They are trying to combine these things, so you can search more efficiently, you can search more effectively, and, according to the task, you can do computations quicker than you would in a graph database. When it comes to vector databases, I feel there are a whole lot of them out there, so you have to choose what is right for your particular task, especially with regard to the infrastructure and the approximate nearest neighbor algorithms, the algorithms that run behind the recommendation or similarity search of vector databases. If a vector database is very open about which approximate nearest neighbor algorithm it deploys, for example HNSW, it is easier to decide whether a graph database or a vector database would be ideal for your use case. For some of my use cases, vector databases were the best; for some of my other use cases, where there is an ontology involved, where there are multiple connected entities and multiple stakeholders, I feel graph databases are still better than vector databases.

23:30.30

datamics

What do you mean exactly with graph databases and multiple vendors? What does that mean exactly, or can you give an example for this?

23:41.83

Sruthi

So, do you mean... yeah, graph databases also have multiple vendors, basically. We have Neo4j, which is one of the most prominently used graph databases; at least 60% of graph enthusiasts use Neo4j. Then there is TigerGraph, one that we use internally, and then there is ArangoDB, which is basically a document-cum-graph database, a mixture of both things together, and there are a few more, there is Dgraph, for example. Comparing the different vendors is a matter of what you need. If it is about licenses and all those things, then you have to really think about how to commercialize this; if you're using it for research purposes, then Neo4j has an open-source version, and there is also TigerGraph. If you're working in a very, let's say, speed-intensive setting and need a very efficient way of doing transactions and all those things, then TigerGraph is a good choice because they run the queries natively. Coming back to Neo4j: Neo4j has something called Cypher, which is a graph query language, and on top of that there is APOC, Awesome Procedures on Cypher, to run all these algorithms. But it really depends on which use case you're trying to solve, how much speed you want to achieve, what the commercialization aspects of the project you're working on are, and, yeah,

25:04.86

Sruthi

there are so many other factors that you need to consider when you have to choose between the different vendors, whether for vector databases or graph databases. So it really depends on the use case, I would say.
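To give a feel for Cypher, here is a minimal sketch using the official neo4j Python driver to run the kind of shortest-path query mentioned earlier. The connection details and the Component label with a name property are assumptions made for illustration, not a fixed schema.

```python
from neo4j import GraphDatabase

# Assumed local instance and credentials; adjust to your setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (a:Component {name: $start}), (b:Component {name: $end}),
      p = shortestPath((a)-[*..6]-(b))
RETURN [n IN nodes(p) | n.name] AS path
"""

with driver.session() as session:
    record = session.run(query, start="Sensor", end="Display").single()
    print(record["path"] if record else "no path found")

driver.close()
```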

25:14.85

datamics

Yeah, this sounds very cool, and thanks a lot for the explanation of everything. We went through a lot of topics: what a knowledge graph is, the different types of embeddings, graph databases, vector databases, and even the vendor overview for graph databases, which is very nice. So for the listeners: if you want to choose a graph database, I think there are some ideas out there among the common graph databases we mentioned. So we covered a large topic today, and we're also reaching the end. At the end we always ask our guests for their main tips for the listeners. So, one to three tips: what would you suggest to people, and which challenges did you have, when they start with data science or with knowledge graphs?

26:08.24

Sruthi

So the main challenge for me was that you need a lot of background knowledge for knowledge graphs. If you want to get started with knowledge graphs, I would suggest having a brief look into graph theory. I know we have gone through this stuff, I know we have studied this stuff in grammar school, but still, having this refresher, having understood how all of this works, is the first thing that I would suggest to the audience, because that forms the basics even before you get into data science, especially when you want to work with graphs. The second thing I would suggest is to look at the experts who work on knowledge graphs; of course, there are only a handful of experts you can pick from when you choose the line of knowledge graphs. One of the lectures that I would suggest even today is the graph machine learning course by Stanford; that sets the base for doing everything that you want to do with graphs, basically. And the third thing is to network. The graph community is always very welcoming, that's what I have seen, at least in my experience, whether it's about graph databases or about having some doubt with graph neural networks and all those things. There is a whole set of people on LinkedIn who are sharing their knowledge all the time. So I would say networking with people who are working with graphs also gives you a perspective on what problems they are trying to solve, etc. And finally, I would like to say:

27:37.83

Sruthi

when you're starting out with data science and all those things, the one thing that I would suggest is to have a strong grip on which domain you want to go into. Especially in AI; AI is technically the hype, but when I got into AI I did a lot of computer vision, I did a lot of work on deformation in different crystals and everything. So having a domain of your choice, whether it's medicine, whether it's education, whether it's finance or whatever, having a domain that interests you, that pushes you forward into doing things and applying AI to solve a problem in that domain, would be the most interesting thing to do.

28:20.70

datamics

Yeah, that's a nice sentence to end on: to solve the problem and to help others, to be useful. Very nice. So thanks, Sruthi, for joining today and also for your really insightful explanations and,

28:37.25

datamics

yeah, all the examples which you gave today. So thanks for joining.

28:39.64

Sruthi

It's a pleasure. You're very welcome, and thank you so much for inviting me to this podcast. I hope the audience gets something out of this and applies it to solve the problems they face on a day-to-day basis.
