If the various catastrophes now brewing, beginning with global warming, allow humanity to survive for several more centuries, what will remain of contemporary computing, apart perhaps from mountains of waste? It is likely that in that distant future our descendants will sometimes try to get this or that artifact of the twentieth and twenty-first centuries working. But what can be done with software that has long been obsolete? This is where one of the most remarkable digital projects, a Wikipedia-style utopia made real, will come into play: Software Heritage.

A universal catalog

This huge project to collect and archive source code, the “Internet Archive of source code”, was launched in November 2016 by Inria (the National Institute for Research in Digital Science and Technology) after more than two years of preparatory work.

It has a threefold objective, according to its founder and CEO Roberto Di Cosmo: to build a universal catalog of the source code of all software, a universal archive to preserve it, and a research infrastructure that allows it to be analyzed.

To mark its first five years, Software Heritage is organizing an event at Unesco headquarters in Paris on Tuesday, November 30. This will be an opportunity to take stock and to highlight three themes: open science (Unesco has just approved a recommendation on open science), the role of preservation for culture and for digital skills education, and innovation for industry and public administration.

The event, on Tuesday morning, will be streamed live on the Unesco website and relayed on social media with the hashtag #SWH5YEARS. The organization is also launching, for the first time, a call for donations for #GivingTuesday.

At the origin: the observation of a dangerous gap

In February 2019, on the Libre à vous program (a transcript is available), its creator explained the concept of Software Heritage, drawing a parallel with the Internet Archive. “As a computer scientist, I always think about the future, we always project ourselves towards the future; it is difficult for us to deal with the idea of losing, of disappearance, of death, of failure, of loss of information. So we don’t think about it too much, we are always caught up in building new things, but later, if we pause for a moment, we realize that a lot of this knowledge, a lot of the software we have built, of the source code we have written, is actually at risk; nobody really takes care of preserving it, protecting it, indexing it, making it available.”

“What was missing from the landscape was a platform that would actually archive this software. That was really what was missing and, looking a little closer, we realized that there is indeed a lot of software, a lot of software source code available; that said, we don’t have a real catalog. It is scattered across many different platforms, be they development platforms or distribution platforms, and we don’t know where to look. So the best approach, in general, is to use a search engine or ask a friend at the coffee machine: where can I find this or that library to build such and such an application? In the end, we discovered that there was no archive.”

Regarding the risk of source code disappearing, Roberto Di Cosmo recalled the closures of two code-hosting platforms announced in 2015: Gitorious, after its takeover by GitLab, and Google Code. These two shocks showed, by counterexample, how important it is to be able to rely on a long-term archive.

“The third thing that we observe is that today, software is not only at the center of the whole digital transformation of our society, but free software is at the center of the software that is transforming our society. Almost all companies use free software on a large scale today, so it has become extremely important, for example, to have a platform that allows you to systematically analyze the source code of this software to try to identify errors and vulnerabilities, make code analysis easier, help developers better reuse their code, and so on. For that, we need a common platform that we have never been able to build before.”

Microsoft, an early supporter…

Software Heritage has grown strongly over these first five years: in September 2021, the organization, which now employs about fifteen people, reported that it held more than 11 billion source files from 160 million software projects, which makes it the largest source code collection ever assembled.

The organization has several sponsors, both public bodies (CNRS, University of Paris…) and companies, among them… Microsoft, which was one of its first industrial partners. Roberto Di Cosmo, who in 1998 published “Trap in cyberspace” and then, with journalist Dominique Nora, “The planetary robbery: the hidden face of Microsoft”, a virulent essay against the company then headed by Bill Gates, recalled this in February 2019:

“We contacted, I will not give names, a certain number of players, both large companies that use free software on a massive scale and major industrial players in free software, but, big surprise, in June, when it had to be made public, none of them had answered the call. Perhaps they consider that developing free software is enough and that it is not worth worrying about preserving it in the long term. And, to my surprise, it was Microsoft that responded.”

“That was funny, because 20 years ago I would never have said that I would find myself in Redmond, in Seattle, at Microsoft’s headquarters, with all of Microsoft’s top management, telling them why it was worthwhile to support a project of this type. But it was still an interesting experience, because there I discovered a Microsoft that is not exactly the one I had known 20 years earlier. There has been a complete change of direction.”

Open source software infrastructure

Regarding the physical protection of the collections, the founder explained that “the basis of our strategy is: one, to make sure that all the infrastructure we build ourselves is made entirely of free software, so that it is easier for others to replicate it elsewhere; two, to have a worldwide network of mirrors across which all the data we collect is replicated and distributed. And here we use a terminology that is somewhat specific to our project; we have not formalized it, but we can share it as of today: we use the term ‘copy’ for a complete copy of all the data in the archive that remains under our responsibility.

So, for example, today, Software Heritage has three copies of the archive: two that are in the Inria facilities, with us, and one that is on an Azure platform sponsored by Microsoft.”

Note that since then, Software Heritage has also started storing source code in the Arctic, in partnership with GitHub, in an archive designed to last for thousands of years.

