Open-Source NLP Research Projects

I'm a Natural Language Processing (NLP) Researcher passionate about increasing the representation of the linguistic and cultural diversity of the world in language models. Do you want to collaborate?

Last update: December 2024 | For up-to-date information, check my Hugging Face profile!

SomosNLP

My 💛 project
Did you know that there are 600 million of us Spanish speakers around the world? SomosNLP.org is an international community that aims to represent the linguistic and cultural diversity of LATAM, the Caribbean and Spain in AI by creating open-source resources. Join us!

Current Projects


Open Leaderboard for the languages of Spain and LATAM

LLM Evaluation
NLP in Spanish
Open-source leaderboard to evaluate LLM memorization, reasoning and linguistic capabilities in the languages of Spain, LATAM and the Caribbean. Developed as part of the #Somos600M Project thanks to the donation of 60 high-quality datasets by 10 research groups. La Leaderboard is live!

The INCLUDE Multicultural Benchmark

Cultural Bias
Multilingual NLP
The INCLUDE project aims to create a multicultural and multilingual benchmark. As part of my Master's thesis, I'm leading the effort to extend v1 to the languages of LATAM, the Caribbean and Spain. We're looking for exams in EVERY language and welcome active participation. Share an exam in your language!

FineWeb-2 Annotation Campaign

Annotation
NLP
Hugging Face
FineWeb-2 is an annotation campaign by Hugging Face aimed at creating a high-quality multilingual dataset. I'm the Language Lead for Spanish and am also responsible for the other languages of LATAM, the Caribbean and Spain. The annotation consists of rating the educational value of texts. It's really easy and makes a huge impact. Join the effort!

Validation of machine-translated evaluation datasets

Translation
Bias
NLP in Spanish
Community effort to validate the machine-translated Spanish versions of three widely used evaluation datasets (MMLU, ARC-C, and HellaSwag) and the prompt dataset from the Data Is Better Together (DIBT) initiative. Efforts co-organized by SomosNLP, Hugging Face & Argilla. Join us!

Dataset collection campaign

Data
NLP in Spanish
At SomosNLP we are collecting datasets in the languages spoken in LATAM, the Caribbean and Spain. Collaborate and help us collect diverse data!

Spanish NLP Initiatives

NLP in Spanish
Discover the initiatives driving NLP advancements in Spanish and other low-resource languages spoken in LATAM and Spain.

Previous Projects


Hackathon SomosNLP 2024: #Somos600M

Instruction-tuned LLMs
NLP in Spanish
Third edition of the largest open-source hackathon of NLP in Spanish. This year's edition had 600+ participants and 12 amazing speakers.
Check the recorded talks and keynotes!

Transparency Self-Assessment

Responsible AI
This tool allows you to self-assess the transparency of your model development based on the Foundation Model Transparency Index (FMTI) published by the Center for Research on Foundation Models.

Hackathon SomosNLP 2023: Los LLMs hablan español

LLMs
NLP in Spanish
Second edition of the largest open-source hackathon of NLP in Spanish. This year's edition had 500+ participants, 17 speakers, and 7 mentors.
Check the awarded projects and the recorded talks and keynotes!

Somos Mujeres NLP

Women in AI
NLP
Organized two initiatives to promote the work and research of women in NLP, as well as projects that apply NLP to fight sexism.

NLP Course by Hugging Face

NLP
Education
Contributing to the translation of the NLP Course by Hugging Face into Spanish.

BigCode Project: LLMs for Code

NLP
Research
Contributing to BigCode. Project in progress.

EleutherAI: Polyglot Romance

NLP
Research
BERTIN Project
Contributing to EleutherAI's research project "Polyglot Romance". Project in progress.

Hackathon SomosNLP 2022: NLP en Español

NLP in Spanish
With more than 500 participants from 39 countries, it is the largest open-source hackathon of NLP in Spanish. The recorded talks and workshops already have more than 5k views! Organized by SomosNLP and sponsored by Hugging Face, Platzi and Paperspace. Check the awarded projects!

BigScience Research Workshop

NLP
Hugging Face
Research
A one-year-long international research workshop on large multilingual models and datasets. We created, among other cool things, ROOTS: A 1.6TB Composite Multilingual Dataset, which was then used to train BLOOM: A 176B-Parameter Open-Access Multilingual Language Model.

BERTIN Project: Perplexity Sampling

NLP
Hugging Face
Research
BERTIN is a series of RoBERTa-based models in Spanish trained using a novel sampling technique that we call "perplexity sampling". More detailed info can be found in the model card and the paper BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling.
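The model card and the paper remain the authoritative description of the technique. As a rough illustration only, the general idea of sampling a corpus by perplexity can be sketched as below. The sketch assumes that every document already has a perplexity score from some reference language model and that documents near the median perplexity should be favoured through a Gaussian weight. The function names, the Gaussian shape and the width heuristic are assumptions made for this example, not the project's code.

```python
import math
import random
import statistics


def gaussian_weight(perplexity, center, width):
    """Weight that peaks at `center` and decays for very low or very high perplexity."""
    return math.exp(-((perplexity - center) ** 2) / (2 * width ** 2))


def perplexity_sample(docs, perplexities, sample_size, seed=0):
    """Draw `sample_size` documents without replacement, favouring mid-range perplexity.

    Hypothetical helper: the weighting scheme is an assumption for illustration.
    """
    center = statistics.median(perplexities)
    q1, _, q3 = statistics.quantiles(perplexities, n=4)
    width = max(q3 - q1, 1e-9)  # interquartile range as an assumed width

    rng = random.Random(seed)
    keyed = []
    for doc, ppl in zip(docs, perplexities):
        weight = gaussian_weight(ppl, center, width)
        # Weighted sampling without replacement: each document gets the key
        # u ** (1 / weight), and the documents with the largest keys are kept.
        key = rng.random() ** (1.0 / max(weight, 1e-12))
        keyed.append((key, doc))

    keyed.sort(reverse=True)
    return [doc for _, doc in keyed[:sample_size]]


if __name__ == "__main__":
    docs = [f"doc{i}" for i in range(10)]
    scores = [10, 50, 90, 100, 110, 120, 130, 200, 500, 1000]
    print(perplexity_sample(docs, scores, sample_size=4))
```

Documents with extreme perplexity (for example boilerplate at the low end or noisy text at the high end) receive weights close to zero under this assumed scheme, so they are rarely kept in the subsample.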

Course: NLP de 0 a 100 con Hugging Face

NLP
Education
The first NLP course from zero to hero in Spanish. It's open-source and was organized by SomosNLP with the support of Spain AI. I taught the classes on sequential models and the Transformer architecture.

Pre-training GPT-2, T5 & Wav2Vec2 models in Spanish

NLP
Hugging Face
HF Hackathon
A series of Spanish language models trained with Flax/JAX on TPUs sponsored by Google during the Flax/JAX Community Week organized by Hugging Face in June 2021. Here are the model cards: GPT-2 model, T5 model and Wav2Vec2 model.

WaiACCELERATE Program

Entrepreneurship
Women in AI & Robotics
A program where we provide women entrepreneurs with the tools, knowledge, mentoring and network to successfully realize their startup/business idea in the AI sector.

Making Spanish NLP datasets available in the HF Hub

NLP
Hugging Face
HF Hackathon
Addition of three datasets in Spanish to the huggingface/datasets library during the open sprint organized by Hugging Face in Dec 2020. The datasets are HEAD-QA (a multi-choice HEAlthcare Dataset), the dataset of the eHealth-KD Challenge at IberLEF 2020, and the Spanish Billion Words Corpus.

Quality Analysis of ML Models

Python Package
AI Performance
AI Robustness
PyPI package to perform quality analyses on ML models. It focuses on the three quality pillars: functionality, robustness and explainability.

Chatbot COVID-19

Conversational AI
Backend
Frontend
DevOps
Math Thesis
Chatbot that understands and answers questions about COVID-19: symptoms, prevention, regulation, and the situation in Spain. Don't hesitate to chat with AURORA!
The chatbot correctly understands 92% of requests on the first attempt and helped 1500+ people during the first months of the pandemic. Collaboration with Accenture’s Gijón office.

Neural Network for the study of the Higgs Boson with data from the LHC (CERN)

Machine Learning
Physics Thesis
Implementation of a neural network that predicts, with a correlation coefficient of 0.778, characteristics of the Higgs boson produced in the particle collider. Collaboration with the university's high-energy particle research team.