Farewell address Barend Mons
After almost 40 years of scientific research on malaria and later on FAIR Data principles and Open Science, Prof. dr. Barend Mons will officially retire as Professor of Biosemantics at the Leiden University Medical Centre (LUMC) and the Leiden Academic Centre for Drug Research (LACDR) at Leiden University.
Below is the official Farewell Address of Prof. dr. Mons, titled 'FAIR-well', which was given on Friday, September 6, 2024, in the Groot Auditorium of the Academiegebouw of Leiden University in Leiden, the Netherlands.
FAIR-well
On the relationship
between Biology, Data, Computers and Pianos
"First, a brief overview of my journey from biology to data science and back.
From my early childhood, I was intrigued by biology and its unbelievable complexity, sparking a sense of beauty and wonder. At high school ‘even’ my biology teachers advised me to ‘go for my hobby’, so I gave up my long-term dream of becoming a vet and started studying biology in Leiden in 1976.
Following my PhD in 1986 on the cell biology and genetics of malaria parasites, Chris Janse and I built up the malaria group in the department of parasitology.
We worked for years on trying to understand the most basic decision-making points in this unicellular organism. That was a humbling experience. Little as we knew at the time, and with only embryonic genetic technologies at our disposal, we already stumbled on layer after layer of complexity, even in such a relatively simple organism. After 15 malaria years, including a short intermezzo at the European Commission and NWO, I decided to move on to explore ‘bioinformatics’.
I ‘predicted’ that it would take decades to create a recombinant malaria vaccine, if ever.
Today, I am proud to tell you all (easy because it is not my achievement) that the Leiden malaria group, almost 40 years after Chris and I started with ‘one culture vessel’, is testing the first live-attenuated malaria vaccine in humans!
But, for myself, I came to the conclusion that supporting ‘the scientific discourse in general’ would be an even better contribution than just focusing on one disease, no matter how intriguing and devastating.
So, my Leiden professorship had little to do with malaria; it was on ‘Biosemantics’.
Fast back- and forward to the year 2000, when I entered the world of ‘computers’. Not that I ever had the ambition to become a programmer. In fact, Erik van Mulligen strongly advised me against that, probably consciously protecting his professional group…
I always continued to take a ‘biological perspective’ when trying to address the looming explosion of data in the life sciences and the associated challenges caused by new high-throughput technologies.
Soon, we realised that the current scientific communication system was not at all ready to support the revolutions needed to serve increasingly machine-assisted science.
We also found the ‘Publish text or Perish’ cartels in our line of fire.
But... why is ‘machine assistance’ so important for current science and innovation?
Well, the major societal and planetary challenges we currently face, while crossing Planetary Boundaries, are all of global proportions. They are also extremely complex systems problems. Nothing is ‘just’ a matter of better health, better food, energy transition or environmental measures. Therefore all the Sustainable Development Goals require cross-disciplinary research.
The information science challenge is that the problem of ‘interoperability’ of relevant research data increases immensely when we want to reuse data from other disciplines.
In addition, the sheer volume and complexity of current data mean that the patterns we might discover are way beyond even the most brilliant human minds. Thus, machines are needed to recognise these patterns.
The problem, however, is that ‘machines do not understand anything’: they just recognise patterns, based on the complex algorithms and machine-learning processes that we have fed them with.
I cannot resist saying at least one thing about the hype around Large Language Models, which are not Artificial Intelligence at all. In essence, everything an LLM coughs up is a ‘hallucination’, in the sense that it is just a statistically sound output based on an enormous training set. Only when a hallucination ‘makes sense’ to us do we project ‘intelligence’ onto the system.
In my humble opinion, the danger is not so much that machines will get ‘intelligent’, let alone (self-)conscious. The real danger is that we, consciously or not, become so reliant on machines that we move in the direction of becoming more like robots and lose our sense of nuance, wonder and irrational things like ‘love’ and ‘faith’. I think we all agree that we feel the ‘loss of nuance’ all around us in society already… and nuance is what I will end with.
Because human language is full of nuance and ambiguities, an entire discipline, called ‘text mining’, developed over the years, which uses complex computational approaches to ‘recover’ precise, unambiguous scientific assertions from text. As a biologist entering this field, I thought: ‘This is the world upside down’. Imagine if cells in our body first scrambled messages to other cells to make them less precise?
So I asked the question in 2005: ‘Text mining? Why bury it first and then mine it again?’, making the argument that computational biology needs machine-readable information. That was not very popular in 2005 and is still met with some skepticism today, especially by the people making money by burying information in prosaic language, and by those that make a living by mining it back into precise, verifiable and machine-actionable statements.
In 2009, Jan Velterop and I coined the term ‘Nanopublication’ for the concept of an irreducible, minimal, yet meaningful assertion in science, with its provenance in machine-actionable format.
Figure 1.
My classical example still being
"malaria - is transmitted by - mosquitoes".
In 2010, Paul Groth and Tobias Kuhn took the concept to a first technical implementation as tiny knowledge graphs of triples (see Figure 1).
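For the technically inclined: below is a minimal sketch in Python, using the rdflib library, of the classic example as a tiny RDF graph with separate, machine-actionable provenance. The namespace, identifiers and provenance values are purely illustrative, and this is not the official nanopublication schema (which packages assertion, provenance and publication info as named graphs).

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import PROV, XSD

EX = Namespace("https://example.org/")  # illustrative namespace, not an official one

# The assertion: one irreducible, meaningful statement as a triple.
assertion = Graph()
assertion.add((EX.malaria, EX.isTransmittedBy, EX.mosquitoes))

# The provenance: who made the assertion and when, also as triples.
provenance = Graph()
provenance.add((EX.assertion1, PROV.wasAttributedTo, EX.BarendMons))
provenance.add((EX.assertion1, PROV.generatedAtTime, Literal("2009-01-01", datatype=XSD.date)))

print(assertion.serialize(format="turtle"))
print(provenance.serialize(format="turtle"))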
With Andy Gibson, we designed a way to address the redundancy of repeated, identical assertions, which can be found multiple times in the literature and databases. Such identical statements were collapsed, with all their provenance, into what we called ‘cardinal assertions’.
We now use the term ‘cardinal assertion’ for any assertion that is found multiple times in published sources. This approach was later perfected by Arie Baak and Aram Krol in the company Euretos, which now has a graph of over 2 billion Cardinal Assertions.
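A hedged sketch of this ‘zipping’ step in plain Python; the records and source identifiers are invented for illustration only:

from collections import defaultdict

# Toy assertion records; the PMIDs and field names are made up.
nanopubs = [
    {"s": "malaria", "p": "is_transmitted_by", "o": "mosquitoes", "source": "PMID:0001"},
    {"s": "malaria", "p": "is_transmitted_by", "o": "mosquitoes", "source": "PMID:0002"},
    {"s": "malaria", "p": "is_caused_by", "o": "Plasmodium", "source": "PMID:0003"},
]

# Collapse identical subject-predicate-object statements into one cardinal
# assertion; the accumulated list of sources is its combined provenance.
cardinal_assertions = defaultdict(list)
for record in nanopubs:
    cardinal_assertions[(record["s"], record["p"], record["o"])].append(record["source"])

for triple, sources in cardinal_assertions.items():
    print(triple, "->", len(sources), "supporting sources:", sources)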
I will come back to this ‘zipping’ approach later, as it is also crucial for the current developments in Large Language Models, ambiguously referred to as ‘Artificial Intelligence.’
Obviously, there is always a first time for each assertion to be made. Some assertions will be contested and actually found to be false.
In our era of misinformation and deep-fake, wouldn’t it be nice if machines could be informed about the ‘level of truth’ for each assertion they encounter?
Meanwhile, the ‘data tsunami’ rapidly became so overwhelming that today even the most skeptical scientists are coming around to see computers as an indispensable assistant in research.
We were criticized by some people stating that the ‘nuances’ of human knowledge can never be properly represented in something as simple as a Nanopublication.
Now, first of all, we never proposed that, and secondly, it addresses the exact problem in which I am still most interested, namely: how to communicate science as precisely as possible, also to machines, including the nuances and ambiguities that are inherent to progressing human knowledge.
We know that computers are ‘stupid’ and most ambiguities and nuances are lost on them.
So, machines need precise and unambiguous statements and the necessity to use computers for science forces us to feed them with precise information.
This will also benefit human to human communication. We are not in science to write poetry after all…
However, describing important nuances (‘near sameness’) to machines appeared to be a serious challenge, which will extend way beyond my retirement.
In 2008, Frank van Harmelen ‘guesstimated’ the total number of ‘Nanopublications’ that the Life Sciences would have produced at 10^14...
To our relief, we could also argue that the average ‘redundancy’ of a statement in science might be somewhere in the order of 1000, so the body of ‘zipped’ cardinal assertions would be ‘only’ 10^11 (100 billion). Even today, that is a daunting amount of information for any computer to deal with.
But.... that is not where the story ended. We soon enough came up with the concept of a ‘Knowlet’, which is a cluster of all cardinal assertions with the same ‘subject’, for example a drug, a disease or a person.
We are now using Knowlets as self-contained units, without fixing them in a static ontology or knowledge graph.
Today, more than 15 years later, we also present the Knowlet as a FAIR Digital Twin of the concept it denotes.​
Figure 2.
My daily office (ceiling) view: Knowlets slowly moving around my router… representing Knowlets as free-floating concept representations on a grid of conceptual restrictions…
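As a rough illustration (not the production implementation), a Knowlet can be thought of as simply the collection of all cardinal assertions that share one subject concept:

# Minimal sketch: a Knowlet as the set of (predicate, object) pairs of all
# cardinal assertions about one subject; the example triples are illustrative.
def build_knowlet(subject, cardinal_assertions):
    return {(p, o) for (s, p, o) in cardinal_assertions if s == subject}

assertions = [
    ("malaria", "is_transmitted_by", "mosquitoes"),
    ("malaria", "is_caused_by", "Plasmodium"),
    ("malaria", "is_treated_with", "artemisinin"),
]
print(build_knowlet("malaria", assertions))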
I will come back to the Knowlet later in its current and predicted future role. For now, I believe it is a first step towards enabling machines to independently deal at least at some level with ‘near sameness’ (slight but important nuances and ambiguities).
Meanwhile, in 2014, here in Leiden at the Lorentz Center, this all culminated in the formulation of the FAIR guiding principles, in order to render data and information Findable, Accessible, Interoperable and Reusable by machines as well as their masters (for now still Homo sapiens).
For a pure data science audience, I could stop here and say ‘The rest is history’.
But as we are here to look both back and forwards, I will still take you on a brief ‘first decade of FAIR’ tour before we try to make sense of it all.
The FAIR principles got immediate global attention and acceptance, but mainly by policy makers (EC, G7, G20, etc.) and increasingly also by science funders. They realised that much of their current investment in creating scientific data was going to waste because the data were NOT FAIR and thus either lost altogether or, if found, not accessible or interoperable; in short, they were what we coined ‘Re-useless’.
However, the principles inadvertently exposed traditional data malpractice and were seen by some scientists as a distraction and an extra burden in the dominant ‘publish or perish’ world of science. The costs associated with making data FAIR were also seen as ‘diverting’ from the core business of creating yet more data and publications. I have addressed this before as one of the Seven Capital Sins in Open Science.
Then, in 2015, I was appointed chair of the EC High Level Expert Group for the European Open Science Cloud (EOSC). I could fill the entire remaining lecture here with stories about that period alone, but suffice it to say that it at least yielded GO FAIR and two friends: Jean-Claude Burgelman, my director at the EC in that period, and Karel Luyben, who is currently the president of the EOSC Association. Most important for today’s topic is that FAIR data and services will be a key requirement for the EOSC.
My mounting frustration about the slow progress in such a major international endeavor, with many conflicting vested interests, really bothered me. Until my trusted friend and adviser George Strawn told me that, on a geological time scale, FAIR was progressing with lightning speed!…
It has also taught me that I should first of all be more patient and secondly, stay away from politics as much as I can.
The resistance against FAIR is now rapidly imploding, with more and more examples where FAIR data enables data-intensive science, and funders as well as publishers increasingly simply demand FAIR outputs as a prerequisite for further funding and publication.
This year, exactly ten years after we conceived of the principles, another Lorentz conference reviewed their impact and the road forward. With more than 14,000 citations across many recognised subdisciplines, and ranking as number 38 in the list of most cited articles of similar age ever tracked globally, I feel the period where we had to ‘defend’ FAIR is over.
So I can retire.... Not yet....
There is a downside to any ‘hype term’, and that is that people take it and water it down to suggest they are compliant. One of our partners, Roche, even coined the term ‘fake-FAIR’, and we also hear about FAIRwashing. Also, FAIR is not a goal in itself, but should lead to better science.
So what have we learned in the past 25 years?
Many of you know I am from a ‘Protestant Reformed background’ and I want to honor that by now approaching the remainder of my lecture as a solid ‘sermon’ with ‘three attention points’:
01 Biology as teacher for data science
02 Understanding human Knowledge discovery
03 Near sameness opened up for machines
I have to disappoint some of you: there will be no singing in between…
Point 1: Biology as teacher for data science
When Luiz Bonino, here in the room but hiding now, joined our team, we decided that, as a hard-core computer scientist, he should first follow a personal ‘crash course’ in basic biology. Just to make sure that his eyes would not glaze over when he heard the term ‘gene’, like, every day, in our group. His reaction after two sessions was: ‘Computer Scientists should first study biology before they are allowed to design any artificial information system’. (I still agree, by the way).
Organisms are able to deal with incredible amounts of sometimes ambiguous and changing information, all encoded in increasingly complicated networks of simple molecules. They have an impressive ability to deal with these ambiguities, changing environments, and selective pressures. Biology even supports the enigma of self consciousness of organisms like ourselves.
During my studies in the seventies, DNA was sequenced beyond short strands and single genes (Sanger 1977), and this sparked rapid developments in our understanding of how biology deals with information. When DNA was first described and appeared to be ‘just’ composed of four base pairs (the famous letters ATCG), it was hard for biologists to believe that this ‘simple’ molecule could carry all the information to ultimately create complex life forms, including humans.
Of course we know by now that our genome is quite a bit more complex than a simple strand of ‘letters’ but even without all its secondary and tertiary additions it is still more efficient for storage of simple ‘data’ than any of the advanced systems humans have developed for that purpose.
Let’s only look at the lowest level of complexity, namely simple ‘storage’ of data.
Even that is a mounting problem: genomics data alone doubles about every 6 months, and we will soon hit the limits of our current technical approaches to storing data.
Computer scientists try to make us feel good by telling us that storage capacity doubles every 18 months… but if data growth is at three times that speed, we can see exponential problems coming.
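To make that mismatch concrete, here is a back-of-the-envelope sketch that takes the doubling times mentioned above at face value (6 months for genomics data, 18 months for storage capacity); the real figures vary, so read it as an illustration of the trend rather than a forecast:

# Illustrative only: data doubling every 0.5 years versus storage doubling every 1.5 years.
DATA_DOUBLING_YEARS = 0.5
STORAGE_DOUBLING_YEARS = 1.5

for years in (3, 6, 9):
    data_factor = 2 ** (years / DATA_DOUBLING_YEARS)
    storage_factor = 2 ** (years / STORAGE_DOUBLING_YEARS)
    print(f"after {years} years: data x{data_factor:,.0f}, "
          f"storage x{storage_factor:,.0f}, gap x{data_factor / storage_factor:,.0f}")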
Now, how does biology deal with storage of ‘raw’ data? Precisely: DNA….
My little finger contains about 8 billion cells. Each cell contains my entire genome (about 3.4 billion letters).
Now, assume that I had one unique, different human genome in each of these cells (rather than exact copies of my own)… I could store a copy of the genome of every human being on Planet Earth in my little finger. This would be the equivalent of 16 billion DVDs, which would take up more than 100 times the space of this entire auditorium.
Obviously, the real complexity challenge is not in storing the ‘data’ but in how to build information and ultimately actionable knowledge from them.
So again, how do organisms do that?
In the early nineties I worked with chimpanzees, and learned about the 98+% similarity of our genes. I started to be more and more intrigued by the 95% ‘junk DNA’, which I never believed to be ‘junk’ in the first place. I made some enemies by saying that many of the significant differences between us and chimpanzees may not be in the genes at all.
By now, we know that there are probably around 4 million ‘functional sequences’ in the DNA that do not code for proteins but regulate a host of other functions, which determine to a large extent how the processing of protein-coding RNA will be handled, resulting in a much broader range of proteins than we ever conceived back in the day.
When I first explained those ‘regulatory sequences’ to my son Bas, he classified them immediately as ‘Readme files’, which I liked, and then used again to try and talk to computer science colleagues about how provenance and instructions should be connected to data and information.
Nowadays, at the risk of oversimplification, I usually try to explain the function of the ‘genome’ and emerging ‘variation’ to non-biologists as a ‘Pianola’ with about 20,000 keys (the genes) and about 4 million ‘pins’ in the roll that determine which of the endless number of possible melodies will be played.
If a key is ‘mutated’, it plays a false note and sounds ‘off key’ (and always off key).
However, when in a complicated musical work, played by a pianola-type instrument or music box, only some pins are changed, minuscule variations in the melody or the instrumentation can occur that (initially) may go unnoticed, even by experts. At some point, though, there could be a sudden shift to an entirely new melody, which, if ‘beautiful’, becomes a ‘hit’ and, if ugly, probably soon goes extinct.
I still discuss with Marco Roos and the Biosemantics group that my ‘educated guess’ is that the majority of variants leading to rare genetic diseases are not to be found in the genes, but in the ‘readme’ files, created by regulatory sequence variation. So we had better study the roll, next to the keys….
One more lesson from biology
All life forms are ultimately built from atoms. As such, isolated atoms carry little information; however, once multiple atoms compose a ‘molecule’, this molecule carries information and has all kinds of inherent ‘features’ that make it react with other molecules and its environment in certain ways. Nanopublications are like those molecules.
In life, molecules come together and make more complex biological structures (like Knowlets) that can react much more diversely in different situations and contexts than simple molecules.
Once molecular structures start to form even more complex systems, like for instance ‘organelles’ (such as mitochondria), we create ‘executable’ units, or little ‘factories’, that are able to really perform a repetitive function. This higher level of complexity is comparable in data science with, for instance, Carole’s RO-Crates, where data, algorithms and instructions are all ‘packaged’ to be independently executable.
But these can all be decomposed again into their individual molecules and even atoms, though not without losing their emergent added functionality. We still don’t have a clue what ‘life’ really is, but all levels of complexity need to be interoperable for the molecular ‘algorithms’ of living cells, tissues, organs and ultimately organisms.
Following this line of thinking, all ‘Knowlecules’ in data-intensive science should be FAIR (machine-actionable) units of information, building ever larger and more complex structures with functions, ultimately creating whole ‘organisms’ and even ecosystems of knowledge.
So the first basic lesson here: Accept biology as a teacher for data science
Point 2: Understanding human Knowledge discovery
Luiz also had to endure deep and wild discussions for years between Erik Schultes and myself about biological evolution and complexity science. We all agree that the combination of these diverse skills and interests in GO FAIR heralded four of the most productive years of our careers. I feel we developed a number of novel information science approaches that still grow in importance today.
But first we went back (also with a major role for Herman van Haagen) to the very basic model of how we as humans discover knowledge.
Don’t worry, I am not going to explain all the details of the model they came up with many years ago, but in essence it showed that knowledge discovery is very much like other physical processes, and follows the famous ‘sawtooth’ dynamics. If patterns become more and more complex (the x-axis) and our ‘understanding’ (and compute power) also increases, we can still understand the patterns.
What the model made visible is that in case our ‘ability of understanding’ has an upper limit, and patterns become even a little bit more complex, we suddenly are completely lost and perceive almost everything beyond that point as ‘chaos’.
It is the progressively deeper understanding of reproducible patterns in what we perceive as ‘reality’. The more we discover, the more the model predicts vast ‘implicit knowledge’.
Why is this relevant?
Well, because, as explained in the introduction, we have reached the point where the data relevant for the scientific problems we need to address today are so complex that computers see many patterns in them, but humans are no longer able to discern those without the help of these machines.
But, as said before, machines just discover patterns but ‘have no clue’ what they mean.
However, there is again also good news: we as humans are getting better and better at letting machines do this hard work of revealing patterns in massive amounts of data. We then compare what we already know (established knowledge) with real-world observations.
For scientists the challenge remains to unravel the complex patterns and create actionable knowledge in our reality one little step at a time.
In the current ‘AI’ hype discussions it all comes down once more to: unambiguous, machine-actionable FAIR Digital Objects, accumulated into more and more complex FAIR objects that machines can operate on with minimal mistakes. So FAIR (Fully AI Ready) will massively reduce meaningless hallucinations. In addition, we can restrict machines at the ‘output level’ with so-called ‘conceptual models’ that prevent meaningless outputs.
So lesson two: Try to fundamentally understand how we discover knowledge
Now comes a crux: how do we present ‘established knowledge’ together with ‘real-world observations’ to machines in such a way that they do not go completely rogue, creating millions of hypotheses while misinterpreting most of the data that are ambiguous or even totally unintelligible to machines? This leads to my final point...
Point 3: Near sameness opened up for machines
As a reminder: machines are stupid. One of the challenges with machine-readable, fixed ontologies is that machines cannot deal effectively with subtle differences between concepts in a particular context.
People can… they have, in fact, the equivalent of ‘Knowlets’ of associated concepts in their minds for each concept they communicate about. These ‘Knowlets’ are personal and slightly (or vastly) different, depending on prior knowledge and on the cultural or religious background of the person. This sometimes leads to what we call ‘false agreements’ or ‘false disagreements’, and can even cause wars. So people are not perfect at it either.
But that is not our issue today: In general terms, humans have developed a pretty effective way to communicate across ambiguity barriers.
Machines, however, have ‘no clue’ about subtle differences and can only deal with very precise instructions and information.
Imagine a ‘triple’ that states that [concept 1] [is ‘nearly identical’ to] [concept 2].
What does that mean? Not much for a computer; at the very least it would need a ‘value’ for how similar these things are, and in what context.
This is a ‘hobby’ I will pick up again after my retirement, and we will have regular brainstorms about it with whoever is interested.
Here I pull out the ‘Knowlet’ models again. Imagine that two concepts (for instance proteins) with all their features are ‘identical’, except that one was produced in mice and the other one in humans. As molecular structures they are actually identical. Still, they are different concepts, with different unique identifiers. In their respective ‘Knowlets’ maybe 10,000 Cardinal Assertions are identical, but at least two of the assertions, ‘produced in Homo sapiens’ (versus Mus musculus) and ‘gene located on chromosome x’, would be different. So, by comparing Knowlets (machines are much better at that than humans), computers can now ‘decide’ a number of things independently of precise human instruction: (a) these two proteins are different (identifiers) but ‘very similar’, and (b) they only differ on ‘objects’ of the semantic types ‘species’ and ‘chromosomal location’.
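A minimal sketch of such a comparison, assuming each Knowlet is represented as a set of (semantic type, predicate, object) assertions; the protein features and counts are of course illustrative, not real annotation data:

# Two toy Knowlets for the 'same' protein produced in human versus mouse.
shared = {
    ("structure", "has_sequence", "identical amino acid sequence"),
    ("function", "binds_to", "receptor X"),
}
protein_human = shared | {
    ("species", "produced_in", "Homo sapiens"),
    ("chromosomal location", "gene_located_on", "chromosome X"),
}
protein_mouse = shared | {
    ("species", "produced_in", "Mus musculus"),
    ("chromosomal location", "gene_located_on", "chromosome 5"),
}

overlap = protein_human & protein_mouse
differences = protein_human ^ protein_mouse
similarity = len(overlap) / len(protein_human | protein_mouse)  # Jaccard-style score
print(f"similarity: {similarity:.2f}")
print("differing only on semantic types:", {t for (t, _, _) in differences})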
We have published about this issue quite recently, and I will spend most of my ‘free time’ now to deal with this matter in more detail.
One of the people introduced to me a few years ago via Luiz is Giancarlo Guizzardi, and we ‘hit it off’ on this subject, as he has worked on it from a computer science perspective for many years.
He introduced the concept of looking at a concept from a particular perspective as ‘Qua’. We agreed that humans (mostly unconsciously) ignore certain features of concepts they ‘have in mind’ and about which they communicate in order to minimize the risk of false agreements and false disagreements. Let me try to explain this with a last ‘piano example’.
First of all, the word ‘piano’ is ambiguous in itself: it does not only refer to the instrument, but also to a ‘floor’ of a building in Italian and to ‘soft’ in a musical context. Obviously, the ‘Knowlets’ of these three very different ‘meanings’ will be vastly different, and machines will easily separate the Knowlets (not the isolated word, of course).
But here comes the real challenge. When you and I tell each other that we ‘love the piano’, first of all we will usually not think that we literally ‘love the instrument as a thing’ but we love playing or listening to piano music.
So, it really does not matter much whether you have a ‘grand piano’ in your mind (a different Knowlet) and I an electric piano. However, when we agree to ‘move the piano to a different room’, the difference between a grand and an electric piano (its actual size and weight) matters a lot! So, do I look at the concept ‘piano’ ‘qua’ creator of music, or ‘qua’ piece of furniture?
We soon enough found out that combining the ‘Knowlet’ concept with the ‘Qua’ concept could at least approximate a systematic approach for binary machines to tackle ‘near sameness’.
Coming back to the example of the two proteins: if we simply filter out the semantic types ‘species’ and ‘chromosomal location’ from both Knowlets (a simple instruction for machines), computers would now treat the two different concepts (represented by their Knowlets) as identical. However, when we reintroduce the cardinal assertions that were filtered out before (a different ‘Qua’), the Knowlets would separate again in conceptual space.
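In the same illustrative style, the ‘Qua’ step then becomes nothing more than a filter on semantic types applied before the comparison (a sketch under the same assumptions, not the actual implementation):

# 'Qua' as a filter: ignore the semantic types that are irrelevant for this perspective.
def view_qua(knowlet, ignored_types):
    return {(t, p, o) for (t, p, o) in knowlet if t not in ignored_types}

protein_human = {
    ("structure", "has_sequence", "identical amino acid sequence"),
    ("species", "produced_in", "Homo sapiens"),
    ("chromosomal location", "gene_located_on", "chromosome X"),
}
protein_mouse = {
    ("structure", "has_sequence", "identical amino acid sequence"),
    ("species", "produced_in", "Mus musculus"),
    ("chromosomal location", "gene_located_on", "chromosome 5"),
}

qua_molecule = {"species", "chromosomal location"}
print(view_qua(protein_human, qua_molecule) == view_qua(protein_mouse, qua_molecule))  # True: identical 'qua' molecule
print(protein_human == protein_mouse)  # False: distinct again once the filtered assertions return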
This is extremely important, now that the complex global challenges force us to cross disciplines, where very different perceptions may occur even on generic and universally used concepts such as ‘water’.
So lesson three: We need to teach machines about ‘near sameness’ beyond ontologies.​
Regardless of whether you have meanwhile decided to rather spend your time appreciating this beautiful auditorium, or whether you silently disagree with me as a fellow scientist, I think at least we can all agree that this single issue could easily keep me busy (and engaged, and bothering you) for years to come.
Let me round off by saying a few words about the new institute we just started. Ten founding members (all present here, in person or online) have started the Leiden Institute for FAIR and Equitable Science (LIFES). This is an association of public and private members addressing FAIR data and the need for its distributed (and equitable) reuse in the global setting where the data were created, stored and controlled. The overlap with the title of the Lorentz conference in January is not entirely coincidental.
On top of the complexity and the volume of the data we deal with, they are also in many cases highly sensitive and thus, cannot leave the safe environment in which they were created and stored. This makes it all the more important that visiting machines (algorithms) can properly, and automatically, deal with the data locally and thus that the data are FAIR, and the algorithms are FAIR as well. That is the core business of the new institute and will prevent me from just quietly retiring and fading into the background.
Even my sustained feelings for the Global South that stem from my old ‘malaria days’ all come back into focus, as the ‘Equitable’ part of LIFES is strongly related to the distributed character of the approach. The algorithms that visit data are infinitely smaller than the datasets themselves and can, in essence, be launched from a smartphone. So, ‘unless we prevent this by deliberate, unjust legislation’, a ‘scientist from Africa’ should be just as able to interrogate data from around the world as a scientist from Leiden or Harvard.
Also, data from the Global South should no longer be exploited for purposes not approved by the custodians of these data, as they stay where they are and the local custodian decides on reuse and for what purpose.
The LIFES institute will also support highly ambitious projects such as the Human Immunome Project where I co-lead the Data Governance Committee. This project aims to understand the dauntingly complex human ‘immunome’ (everything that has to do with our bodies’ interface to the outside world). This can only be based on massive data, spread around the world in different genetic groups, regions and circumstances. These data are inherently sensitive and will have to stay in the safe spaces where they have been generated and under the jurisdiction and legal structures of the countries in which they were collected. It is good that I am a pathological optimist….
So, in conclusion, my atypical scientific career, starting in malaria, switching briefly to ‘policy’ then to data science and all the way back to biology informing data science, now culminates in the LIFES partnership and I thank you all for that because at different levels, you have all donated to my hobby for over 40 years from your taxes. I hope you do not consider it wasted.
Thank you. Ik heb gezegd (‘I have spoken’)."