`glosario` is an open source glossary of terms used in data science
that is available online and also as a library in both R and Python.
By adding glossary keys to a lesson's metadata,
authors can indicate what the lesson teaches,
what learners ought to know before they start,
and where they can go to find that knowledge.
Authors can also use the library's functions
to insert consistent hyperlinks for terms and definitions in their lessons
in any of several (human) languages.
To advance data science knowledge and accessibility for our diverse community, we developed Glosario. You do not need to know any programming language to contribute to Glosario: anyone with a basic familiarity with the GitHub web interface can get involved! We have prepared a detailed and accessible guide for contributing, which has been translated into several languages. Contributions are welcome in any language, not only those represented in that document. If you need help with your contribution, feel free to ask questions on the #glosario Slack channel (if you are not a member of The Carpentries Slack, you can join by filling out this form).
R Markdown and Jupyter Notebooks allow authors to place structured metadata in files. We propose the following metadata (written as YAML):
glossary:
  sources:
    - http://some_glossary.org/something/
  language: fr
  requires:
    - aggregation_function
    - call_stack
  defines:
    - closure
    - name_collision
- The `sources` key is required.
  - It must introduce a list containing at least one URL.
  - Those URLs must resolve to glossaries as described in the next section.
  - Those glossaries are searched in order from first to last to find definitions.
- The `language` key is required and must be a single ISO 639 language code (e.g., `fr` for French).
- The keys `requires` and `defines` are optional.
  - Either may introduce an empty list.
  - The values under these keys are keys into a shared glossary (discussed in the next section).
  - We expect the terms identified under `requires` to be used without being defined in this lesson (i.e., the lesson author assumes users already know them).
  - All of the terms identified under `defines` must be hyperlinked in the lesson.
    - The target of the hyperlink for the term's definition must be `GLOSSARY_SITE#glossary_key`, where `GLOSSARY_SITE` is one of the sites listed under the `sources` key and `glossary_key` is an exact match for one of the `defines` keys.
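The rules above can be checked mechanically once a lesson's metadata has been parsed. The following is a minimal Python sketch, not part of the `glosario` package: the function name `validate_lesson_metadata` is hypothetical, the metadata is assumed to have already been loaded into a dictionary by a YAML parser, and the language test only checks that the code is shaped like an ISO 639 code (two or three lower-case letters) rather than looking it up in the standard.

```python
from urllib.parse import urlparse


def validate_lesson_metadata(meta):
    """Return a list of problems found in a lesson's `glossary` metadata.

    `meta` is the lesson's parsed YAML header (a dictionary).
    An empty list means that no problems were found.
    """
    glossary = meta.get("glossary")
    if not isinstance(glossary, dict):
        return ["missing 'glossary' mapping"]

    problems = []

    # `sources` is required and must list at least one URL.
    sources = glossary.get("sources")
    if not isinstance(sources, list) or not sources:
        problems.append("'sources' must be a list with at least one URL")
    else:
        for url in sources:
            parts = urlparse(str(url))
            if parts.scheme not in ("http", "https") or not parts.netloc:
                problems.append(f"not a URL: {url!r}")

    # `language` is required and must be a single language code.
    # Assumption: we only check the shape of the code, not the ISO 639 registry.
    language = glossary.get("language")
    looks_like_code = (
        isinstance(language, str)
        and 2 <= len(language) <= 3
        and language.isascii()
        and language.isalpha()
        and language.islower()
    )
    if not looks_like_code:
        problems.append("'language' must be a single ISO 639 code such as 'fr'")

    # `requires` and `defines` are optional; either may be an empty list.
    for key in ("requires", "defines"):
        if not isinstance(glossary.get(key, []), list):
            problems.append(f"{key!r} must be a list")

    return problems
```

Checking that each URL actually resolves to a glossary, and that every `defines` key is hyperlinked in the lesson body, would need network access and the lesson text, so both are left out of this sketch.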
We will provide simple tools so that all of the terms listed in a lesson's metadata are linked correctly in its body. We will also provide shortcuts to make it easy to create correctly formatted links so that authors can write things like:
The computer uses a `r link('call stack', 'call_stack')` to keep track of function calls.
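To make the shortcut concrete, here is a rough Python sketch of what such a helper might do. It is an illustration only: the Python `link` function, the `glossary_url` helper, and the choice of Markdown output are assumptions, and the default site is the placeholder URL from the metadata example rather than a real glossary.

```python
def glossary_url(site, key):
    """Build a hyperlink target of the form GLOSSARY_SITE#glossary_key."""
    # Drop any fragment already present on the site URL, then attach the key unchanged.
    base = site.split("#", 1)[0]
    return f"{base}#{key}"


def link(text, key, site="http://some_glossary.org/something/"):
    """Return a Markdown link whose text is `text` and whose target is the glossary entry `key`."""
    return f"[{text}]({glossary_url(site, key)})"


print(link("call stack", "call_stack"))
# [call stack](http://some_glossary.org/something/#call_stack)
```

A real implementation would take the site from the lesson's `sources` list instead of a default argument.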
Any site where glossary URLs resolve can be used as a glossary. As a working model, this project implements a glossary of terms used in data science and data engineering.
- The master copy of the glossary lives in `glossary.yml`. Its format is described below.
- This file is turned into a single-page GitHub Pages site using Jekyll.
- It is also turned into a Python package called `glosario` and an R package with the same name.
A glossary entry is structured like this:
- slug: cran
  ref:
    - base_r
    - tidyverse
  en:
    term: "Comprehensive R Archive Network"
    acronym: "CRAN"
    def: >
      A public repository of R [packages](#package).
- The value associated with the `slug` key identifies the entry.
  - It must be unique within the glossary.
  - It must be in lower case and use only letters, digits, and the underscore (to be compatible with Jekyll's automatic slug creation).
  - It becomes the fragment identifier in the online version of the glossary.
- The entry may have a `ref` key. If it is present, its value must be a list of identifiers of related terms in this glossary.
- Every other top-level key must be an ISO 639 language code such as `en` or `fr`.
  - Every entry must have at least one such language section.
- Within each language section for each term:
  - The value of `term` is the term being defined. This key must be present.
  - The key `acronym` is optional. If present, its value is the acronym for this term.
  - The value of `def` is the definition. This key must be present, and the value may contain local links to other terms in this glossary (i.e., links starting with `#`) and/or links to outside sources.
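The entry rules above lend themselves to an automated check. The sketch below is one possible way to write it in Python and is not the project's own validator: `validate_entry` and `duplicate_slugs` are hypothetical names, entries are assumed to be dictionaries already loaded from `glossary.yml`, "letters" is read as ASCII letters, and language keys are only checked for shape (two or three lower-case letters).

```python
import re

# Assumption: slugs are limited to ASCII lower-case letters, digits, and underscores.
SLUG_PATTERN = re.compile(r"[a-z0-9_]+")
# Assumption: a shape check only, not a lookup in the ISO 639 registry.
LANGUAGE_PATTERN = re.compile(r"[a-z]{2,3}")


def validate_entry(entry, known_slugs=None):
    """Return a list of problems found in one glossary entry (a dictionary)."""
    problems = []

    slug = entry.get("slug")
    if not isinstance(slug, str) or not SLUG_PATTERN.fullmatch(slug):
        problems.append("'slug' must use only lower-case letters, digits, and underscores")

    # `ref` is optional; if present it must list identifiers of other entries.
    refs = entry.get("ref", [])
    if not isinstance(refs, list):
        problems.append("'ref' must be a list")
    elif known_slugs is not None:
        for ref in refs:
            if ref not in known_slugs:
                problems.append(f"unknown ref: {ref}")

    # Every other top-level key is treated as a language section.
    languages = [key for key in entry if key not in ("slug", "ref")]
    if not languages:
        problems.append("entry needs at least one language section")
    for lang in languages:
        if not LANGUAGE_PATTERN.fullmatch(str(lang)):
            problems.append(f"{lang!r} is not shaped like an ISO 639 code")
            continue
        section = entry[lang]
        for required in ("term", "def"):
            if not isinstance(section, dict) or not section.get(required):
                problems.append(f"{lang}: missing {required!r}")

    return problems


def duplicate_slugs(entries):
    """Return the sorted slugs that appear more than once in a list of entries."""
    seen, repeated = set(), set()
    for entry in entries:
        slug = entry.get("slug")
        if slug in seen:
            repeated.add(slug)
        seen.add(slug)
    return sorted(repeated)
```

Passing the set of all slugs as `known_slugs` also checks that every `ref` points at an entry that exists; leaving it as `None` skips that check.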
- Should we provide one function for interactive definition lookup that searches keys and terms, a separate function for each, or some kind of keyword arguments to control the scope of search?
- Should we integrate definition lookup with existing help systems? For example, should `define('something')` in RStudio put the definition in the help pane (and if so, should it hyperlink to terms that the definition depends on)?
- Linking to a definition.
  - Amari writes a lesson in R Markdown that introduces some new terms.
  - She has defined the language to be Spanish using the `glossary/language` key in the YAML header, but has not changed any other settings.
  - She adds an inline code block `r gdef('linear-model', 'Linear models')` to her lesson.
  - When she knits her document, the inline code block produces the HTML `<a href="http://carpentries.org/glossary/es/#linear-model" class="glossary-definition">Linear models</a>`.
- Checking a lesson.
  - Beatriz has made some changes to a lesson she inherited from Amari, and wants to check that it is still consistent.
  - She runs a command-line script that:
    - Reads the R Markdown file.
    - Extracts the terms under the `glossary/defines` key.
    - Searches the document's body for calls to `gdef(...)`.
    - Checks that every term listed in `glossary/defines` is referenced in the document body, and that every term referenced in the document body is mentioned in `glossary/defines`.
- Finding lessons.
  - Amari writes a lesson in R Markdown. She adds the `glossary` key to its YAML metadata and indicates that the lesson requires the term `correlation` and defines the term `regression`.
  - Beatriz is writing a lesson on linear models. She adds YAML metadata indicating that the lesson requires the term `regression`.
  - To find prerequisite lessons she can recommend to her students, Beatriz runs a command-line script that:
    - Uses `rmarkdown::yaml_front_matter(filename)` to read metadata from all of the lessons she has archived.
    - Lists all of the lessons that state they define the term `regression`.
- Summarizing a lesson.
  - Amari has written a lesson in R Markdown that includes YAML metadata stating that it defines `correlation` and `causation`.
  - She adds a code chunk to the end of her lesson that includes a call to `glosario::summarize_terms()`.
  - When she knits the document to HTML, this code chunk inserts a definition list `dl` at that point. Its entries are the definitions of all of the terms listed under the `glossary/defines` key in the page's YAML header, in alphabetical order by term according to the rules for `glossary/language`.
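The "Checking a lesson" and "Finding lessons" use cases can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the planned command-line script: the function names are hypothetical, the YAML header is assumed to have been parsed into a dictionary elsewhere, and `gdef` calls are found with a simple regular expression that reads only the first quoted argument.

```python
import re

# Matches the first quoted argument of calls such as gdef('linear-model', 'Linear models').
GDEF_CALL = re.compile(r"""gdef\(\s*(['"])(.+?)\1""")


def referenced_terms(body):
    """Return the set of glossary keys passed to gdef(...) in a lesson body."""
    return {match.group(2) for match in GDEF_CALL.finditer(body)}


def check_lesson(defines, body):
    """Compare the keys under glossary/defines with the keys referenced in the body."""
    declared = set(defines)
    used = referenced_terms(body)
    return {
        "defined_but_not_referenced": sorted(declared - used),
        "referenced_but_not_defined": sorted(used - declared),
    }


def lessons_defining(term, lessons):
    """List the lessons whose metadata says they define `term`.

    `lessons` maps a filename to that lesson's parsed YAML header.
    """
    return sorted(
        name
        for name, meta in lessons.items()
        if term in ((meta.get("glossary") or {}).get("defines") or [])
    )
```

A lesson is consistent when both lists returned by `check_lesson` are empty. Because the regular expression is a rough approximation, a production script would more likely parse the R code instead.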
- Why not just link to Wikipedia? We expect that many glossary definitions will do so. However, Wikipedia articles provide explanations, not definitions.
- YAML is hard for people to edit, so why not use something else for the glossary file? Because other formats are just as hard to edit (e.g., JSON) or make one-to-many relationships hard to express (e.g., CSV).
- Why use Jekyll for the online version? It is the default for GitHub Pages.
SADiLaR is one of the collaborators in the finalisation and expansion of the Glosario Project to African Languages. SADiLaR is a research infrastructure established by the Department of Science and Innovation of the South African government as part of the South African Research Infrastructure Roadmap (SARIR).
We are pleased to share that the Andrew W. Mellon Foundation approved a grant for use over 12 months (November 2023 through October 2024) to support an upgrade to Glosario.
- Parrot logo by restocktheshelves.