Could it be possible to bring someone “back to life” using their emails, text messages and other records of conversations from their lifetime? This scenario is no longer confined to episodes of the science fiction series Black Mirror: it is now within the capabilities of language models. In the less esoteric context of everyday life, language models have not replaced humans, but they can greatly simplify a multitude of tasks, such as note-taking, summarizing texts, writing emails and programming.
Large language models (LLMs), which have been in the spotlight since the rise of ChatGPT, are machine learning algorithms that process natural language. In practical terms, they are trained on massive data sets and statistically model the distribution of linguistic units – letters, phonemes, words – in a natural language. This allows them to recognize, translate, predict and generate text in a fluent, context-aware manner.
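As a purely illustrative sketch – with made-up probabilities rather than anything drawn from a real model – the idea can be reduced to a few lines of Python: a context is mapped to a probability distribution over candidate next words, and generation amounts to sampling from that distribution.

```python
import random

# Toy illustration (not a real LLM): a language model maps a context
# to a probability distribution over possible next linguistic units,
# then samples from that distribution to generate text.
next_word_probs = {
    "The cat sat on the": {"mat": 0.55, "sofa": 0.30, "keyboard": 0.15},
}

def generate_next(context: str) -> str:
    dist = next_word_probs[context]
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

print(generate_next("The cat sat on the"))  # e.g. "mat"
```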
“Let’s resist the temptation of generalization.” (Alain Decaux)
Yet LLMs also have their limitations, especially when it comes to managing and storing precise, non-probabilistic information. This is because the machine learning that drives the models can generalize but cannot memorize information. For example, a system trained on images of cats can, by generalization, recognize a new cat in an image it sees for the first time. “However, when it comes to specific hardware items, like recognizing the screws used to manufacture an aircraft, this generalization is irrelevant and even entirely undesirable,” says Fabian Suchanek, a researcher in natural language processing at Telecom Paris. “If we show the system a screw to determine if it is used in this same aircraft, we don’t want it to confirm simply because the screw is similar to the ones used.” In fact, even when LLMs are trained on a manual listing all the aircraft’s screws, they are unable to list the type, number and weight of those screws.
Updating this information is also very complicated: “If one of the screws used in the aircraft is replaced, it is difficult to teach the language model to forget the old one, and there is no guarantee that it will take the change into account,” the researcher adds. Finally, LLMs are massive models with a significant carbon footprint. To respond to a simple query – such as which parts can connect to a given cable – an LLM relies on graphics processing units and enough computing power to run billions of parameters, whereas the same query on an ordinary database takes just a few nanoseconds on a standard computer.
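To make the comparison concrete, here is a minimal sketch of how such a question could be answered from an ordinary relational database. The cable and part identifiers, and the table itself, are invented for the example; the point is that the answer comes from a direct lookup rather than from running billions of parameters.

```python
import sqlite3

# Hypothetical schema for illustration: a "connections" table linking
# cable identifiers to compatible part identifiers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE connections (cable_id TEXT, part_id TEXT);
    INSERT INTO connections VALUES ('CABLE-42', 'PART-A'), ('CABLE-42', 'PART-B');
""")

# "Which parts can connect to this cable?" is a simple indexed lookup.
rows = conn.execute(
    "SELECT part_id FROM connections WHERE cable_id = ?", ("CABLE-42",)
).fetchall()
print([part for (part,) in rows])  # ['PART-A', 'PART-B']
```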
The perfect balance between “what” and “how”
Although LLMs have a good understanding of human language, structured data – such as databases (e.g. in XML or JSON format) and knowledge bases – is better at storing accurate information. On the one hand, databases are well suited to storing an exhaustive set of information, like a list of aircraft parts. Knowledge bases, on the other hand, are ideal for incomplete data, such as information on historical figures, which can never be exhaustive. In short, structured data knows “what” to answer, while language models are better at knowing “how” to answer. Each approach has its own fields of application.
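A small, invented example illustrates the contrast: the database-style record aims to be exhaustive and rigidly structured, while the knowledge-base-style facts are open-ended triples that remain useful even when incomplete.

```python
# Invented data, for illustration only.

# A database-style record: exhaustive, with a fixed structure (JSON-like).
aircraft_parts = {
    "aircraft": "MODEL-X",
    "screws": [
        {"type": "M6", "count": 1200, "weight_g": 8.5},
        {"type": "M8", "count": 430,  "weight_g": 14.0},
    ],
}
print(sum(s["count"] for s in aircraft_parts["screws"]))  # exact answer: 1630

# Knowledge-base-style facts: open-ended triples that need not be exhaustive.
facts = [
    ("Ada Lovelace", "born_in", "London"),
    ("Ada Lovelace", "field", "mathematics"),
    # Other facts may simply be unknown; the base remains usable.
]
```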
The combination of these two approaches could therefore yield interesting results. By relying on structured data, LLMs would overcome most of the limitations mentioned above: they would gain accurate information, easier updates and a smaller size (since the models would store far less data), which in turn would reduce their carbon footprint. “Smaller models are also easier to manage locally, which would help respond to digital sovereignty issues,” Fabian Suchanek adds. This is how the YAGO knowledge base came into being, with Suchanek as its main creator.
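The division of labour could look something like the following sketch, in which the exact facts come from a structured store and the language model only phrases the answer. The `knowledge_base` dictionary and the `ask_llm` placeholder are assumptions made for illustration, not a description of any particular system.

```python
# Illustrative sketch of the combination: the precise fact comes from a
# structured store, the fluent wording from a language model.
# `ask_llm` is a placeholder for whatever model is available; not a real API.

knowledge_base = {("CABLE-42", "connects_to"): ["PART-A", "PART-B"]}

def answer(cable_id: str, ask_llm) -> str:
    parts = knowledge_base[(cable_id, "connects_to")]   # exact, easily updated facts
    prompt = (
        f"Using only these facts, answer in one sentence: "
        f"cable {cable_id} connects to {', '.join(parts)}."
    )
    return ask_llm(prompt)                               # the model only phrases the answer

# Example with a trivial stand-in for the model:
print(answer("CABLE-42", ask_llm=lambda p: p))
```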
Birth of another great knowledge base
In 2008, while still working on his PhD at the Max Planck Institute, Fabian Suchanek, together with his co-authors, pioneered the automatic construction of a knowledge base from data on the internet. The YAGO (Yet Another Great Ontology) project was born. Since then, a very large number of knowledge bases have emerged, most of them specific ones, limited to representing the concepts of a given field (e.g., geography, medicine, chemistry) or dedicated to a precise scope of application within a field.
Amid this landscape, YAGO stands out as a very broad, generalist base because it extracts data from the web – initially from Wikipedia in particular. As the project grew, YAGO became multilingual and its authors turned their attention to Wikidata, another large knowledge base. Established in 2012, Wikidata met with great success from the second half of the 2010s onwards, eventually convincing the creators of YAGO that it was a resource worth using.
Challenges in structuring data
Knowledge bases present information as a graph: the nodes represent concepts or entities (e.g., people, places), and the edges show the relationships between these nodes. This structure is organized according to a schema, which establishes a taxonomy – a class hierarchy (e.g., a capital is a city, a city is a place) – and a formal definition of the types of relationship between nodes: “was-born-in”, “works-at”, etc.
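As a toy illustration – with invented entities and a hand-written schema – such a graph can be pictured as a set of triples together with a class hierarchy under which the entities are organized:

```python
# Toy knowledge graph in the spirit described above (invented entities and schema).

# Edges: (subject, relation, object) triples between entities.
edges = [
    ("Marie Curie", "was-born-in", "Warsaw"),
    ("Marie Curie", "works-at", "Sorbonne"),
    ("Warsaw", "is-capital-of", "Poland"),
]

# Taxonomy: a class hierarchy ("a capital is a city, a city is a place").
subclass_of = {"capital": "city", "city": "place"}
instance_of = {"Warsaw": "capital", "Marie Curie": "person", "Sorbonne": "organization"}

def classes_of(entity: str):
    """Walk up the hierarchy: Warsaw -> capital -> city -> place."""
    cls = instance_of.get(entity)
    while cls is not None:
        yield cls
        cls = subclass_of.get(cls)

print(list(classes_of("Warsaw")))  # ['capital', 'city', 'place']
```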
Like its sister project Wikipedia, Wikidata benefits from a large community consisting of tens of thousands of Internet users who actively contribute to its structure. And while 20,000 people can easily agree on a simple fact such as a date of birth, it is much more difficult to convince them to all comply with taxonomy or data organization requirements. “This is where our strength lies, because there are only six of us who are authors,” says Fabian Suchanek. “And our secret was to integrate the information from the Wiki databases into our own schema and taxonomy, based on resources like WordNet initially and later Schema.org.”