Focus on “Big Science”, the collaborative project for the development of an efficient open source language model

With the aim of developing an efficient, open-source multilingual language model in one year, several laboratories, large companies and start-ups have joined forces. They will use the French supercomputer Jean Zay to carry out the “Big Science” project. The main objective is to design a giant neural network capable of “speaking” eight languages, including French, English and several African languages. The kick-off workshop took place at the end of April, and here is an overview of this very interesting participatory project.

A project involving around 100 institutions

The “Summer of Language Models 21”, or “Big Science”, is a one-year research project focused on the language models used and studied in the field of natural language processing (NLP). More than 250 researchers from around a hundred institutions, including CNRS, Inria, Airbus, Ubisoft, Facebook, Systran and OVH, as well as several French and foreign universities, are contributing to it.

The project was born from discussions initiated in early 2021 between Thomas Wolf (Hugging Face), Stéphane Requena and Pierre-François Lavallée (of GENCI and IDRIS respectively). Very quickly, several experts from the Hugging Face scientific team (notably Victor Sanh and Yacine Jernite), as well as members of the French academic and industrial research community in AI and NLP, joined the discussions to flesh out the project.

Big Science is thus defined as a one-year research workshop in which a set of collaborative tasks will be carried out around the creation of a large dataset covering a wide variety of languages, and of an effective multilingual language model.

The use of the French supercomputer Jean Zay in a collaborative project

GENCI and IDRIS wished to take part in the project by offering the use of the Jean Zay supercomputer, installed in Orsay. The two institutions have made available 5 million computing hours (around 208 days), which corresponds to a quarter of the machine’s capacity.

In parallel, a public workshop will be held online on May 21 and 22, during which collaborative tasks will be carried out to create, share and evaluate a huge multilingual database, in order to begin designing the model. Discussions will also be held to identify the challenges posed by large language models and to better understand how they work.

If successful, this workshop may be repeated and updated depending on the progress of the project, which is intended to be participatory.

How the “Big Science” project works

This research program will consist of:

  • A steering committee, which will provide scientific and general guidance.
  • An organizing committee, divided into several working groups responsible for defining and carrying out collaborative tasks, as well as organizing the workshops and other events leading to the creation of the NLP tool.

Several roles are defined within the framework of this project; three are reserved for researchers and experts, while the last involves the participation of the public:

  • Scientific adviser and functional organizer: a role requiring only a light commitment, namely reading a newsletter every two weeks and offering comments within a working group.
  • Active member of one of the project’s working groups: designing and implementing collaborative tasks, and organizing live events.
  • Chair or co-chair of a working group: a role requiring a much greater commitment, coordinating the group’s efforts and organizing its decision-making process.
  • Participant in the workshop or in a public event: taking part in a collective task in a guided manner, following the directives set by the working groups.

The solution developed within this project aims to be more successful and less “biased” than those developed by OpenAI and Google. OpenAI’s GPT-3 generates 4.5 billion words per day for approximately 300 clients; it was trained on 570 GB of text (745 GB for Switch-C, Google’s model) and has 175 billion parameters (around ten times more for Google’s model).
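To give a sense of what 175 billion parameters means in practice, here is a back-of-the-envelope sketch (not from the article) of the memory needed just to store a model’s weights at common numeric precisions; the function name and the choice of precisions are illustrative assumptions.

```python
# Illustrative sketch: gigabytes required to hold a model's weights,
# ignoring optimizer state, activations and any runtime overhead.

def weight_storage_gb(n_params: float, bytes_per_param: int) -> float:
    """Storage in GB for n_params weights at the given bytes per parameter."""
    return n_params * bytes_per_param / 1e9

GPT3_PARAMS = 175e9  # parameter count reported for GPT-3

# 32-bit floats use 4 bytes per parameter, 16-bit floats use 2.
print(f"fp32: {weight_storage_gb(GPT3_PARAMS, 4):.0f} GB")  # → fp32: 700 GB
print(f"fp16: {weight_storage_gb(GPT3_PARAMS, 2):.0f} GB")  # → fp16: 350 GB
```

Even in half precision, the weights alone far exceed the memory of a single GPU, which is why models of this scale are trained and served on supercomputers such as Jean Zay.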
