TODOs from Barcelona #1809

natefoo · 2020-03-02T10:03:09Z

An issue for collecting things we notice during the 2020 Galaxy Admin Training in Barcelona that need to be fixed

Decide how to handle variables (set custom vs role defined vs we're defining this because we'll re-use it later in vars like job_conf location)
In Ansible tutorial
- Slides: include files/ on dir tree on "roles" slide
- Use "hello, universe" or "hello, galaxy 🚀" instead of "hello, world" 😜
- When speaking of templates, explain that the .j2 suffix is used to indicate that this is a template file in Jinja 2 format. After filling the template with the variable values, we copy the file to its remote destination without the .j2 suffix
- "Other stuff": we should use a list of packages under the package module's name option rather than with_items
- In the same place as above, we should give another example for looping that uses loop instead of with_items
- Next to where we recommend geerlingguy Ansible Galaxy roles, we should also recommend galaxyproject and usegalaxy_eu.
- when: service_conf.changed should be when: service_conf is changed
- YAMLize fenced code blocks in "Notifying Handlers"
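
A minimal sketch of how the three Ansible points above (a package list instead of `with_items`, `loop`, and `is changed`) could look together in the tutorial. The package names, paths, template file and service name are hypothetical placeholders, not taken from the tutorial:

```yaml
- hosts: all
  become: true
  tasks:
    # Pass a list to the package module's name option instead of looping
    - name: Install several packages in one task
      package:
        name:
          - git
          - make
        state: present

    # loop replaces with_items for simple lists
    - name: Create several directories
      file:
        path: "/tmp/{{ item }}"
        state: directory
      loop:
        - alpha
        - beta

    # Hypothetical template; it is rendered and copied without the .j2 suffix
    - name: Deploy a config file from a Jinja2 template
      template:
        src: service.conf.j2
        dest: /etc/service.conf
      register: service_conf

    # "is changed" is the test form of the older service_conf.changed
    - name: Restart the (hypothetical) service only when the config changed
      service:
        name: service
        state: restarted
      when: service_conf is changed
```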
In Galaxy Installation with Ansible tutorial
- "This role is found in Ansible Galaxy (no relation - it is Ansible’s ) as galaxyproject.galaxy."
- "The official recommendation is that you should have a variables file such as a group_vars/galaxy.yml for storing all of the Galaxy configuration." - should be group_vars/galaxyservers.yml
- galaxy_config_style should default to yaml in the role
- Remove Question from point 3 of "Hands-on: Minimal Galaxy Playbook" (same question is in point 4)
- Hands-on: (Optional) Launching uWSGI by hand is duplicated
- galaxy.yml should not be world readable (but to change this, the config needs to be readable by group galaxy)
- Add restart handler directly to the usegalaxy_eu.galaxy_systemd role.
- Not in the tutorial but we tried making admin_users a list, which doesn't work (I thought we ran the value of this through util.listify() but maybe not?)
- We should run a job and have students look in /data
- We should consider having students set cleanup_job: never and looking in /srv/galaxy/jobs (or maybe set it to onsuccess and run a job that would surely fail, e.g. due to missing dependencies)
- Fix for CentOS 7 post-Py3 (Install miniconda and make virtualenv from conda ansible-galaxy#110)
- Does galaxyproject.postgresql properly handle the state of the Ubuntu systemd-instanceized postgresql service? A couple of people running the playbook on their own VMs had issues where PostgreSQL was down due to misconfiguration and the "make sure it's running" task apparently passed ok even though it was down? Yeah, pretty sure it doesn't: "Service name is incorrect on recent Debian-based systems" (ansible-postgresql#23)
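
For the `cleanup_job` idea above, a sketch of the group vars change, assuming the option sits under `galaxy_config.galaxy` like the other Galaxy settings in group_vars/galaxyservers.yml:

```yaml
# group_vars/galaxyservers.yml (excerpt)
galaxy_config:
  galaxy:
    # Keep the job working directory only when a job fails;
    # "never" would keep it for every job
    cleanup_job: onsuccess
```

After running a job that fails, students could then inspect what is left under /srv/galaxy/jobs.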
In ephemeris slides:
- Remove "Dependency resolvers" slide (covered in Tool Shed slide deck)
- Move "Suites - more repos in one" to Tool Shed slide deck
- Move "Example config entry" slide for integrated_tool_panel.xml after or into "Toolpanel management" slide
Merge Tool Shed slide deck with the bloated one in the dev topic (see also below)
User, Groups, Quotas
- Maybe merge some/all "Production" slides (library_import_dir is here)
- Could definitely use a Data Library tutorial - link to data, etc.
- An example showing how groups/roles/permissions and associations work would be good
Object store:
- Remove database table details if unnecessary?
Do we really need to quote galaxy_config booleans as strings (i.e. "True", "False")??
In cluster:
- Dependency resolvers slides are rehash of earlier stuff
- Slurm should use --ntasks=1 --cpus-per-task=4 rather than --ntasks=4
- Make slurm part of galaxy playbook? (maybe show tags?)
- vars to group_vars
- Stop re-using IDs between sections (aka don't use the same values for runner IDs, destination IDs, job resource IDs, etc.)
- Make unselected resources more robust by using section instead of conditional galaxy#9485
- Update map_resources.py: https://gist.github.com/natefoo/bbcfc162fad83cbc31bc98d82dbfd1c8
- Use standalone vars for DTD config and job resource param file paths (as is done with job config file path) and rearrange these copy boxes so they're in the same order as the job config file one (actually - fenced diff block here is probably preferable so you can see that you're adding to existing vars - should do this across tutorials modifying group vars, including interactive tools, CVMFS) see "Decide how to handle..."
- I don't think we ever actually explain what dirs/paths need to be cluster accessible(!!!) I believe the full list (in galaxyproject.galaxy vars) is galaxy_shed_tools_dir, galaxy_tool_dependency_dir, galaxy_file_path, galaxy_job_working_directory, galaxy_server_dir, galaxy_venv_dir. We should probably update the Installing tutorial to put these all on some distinct path (e.g. /data, but rename to /clusterFS or something). And maybe there should be a layout in galaxyproject.galaxy that does this.
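
A rough sketch of what putting the cluster-accessible directories on one distinct path could look like. The variable names are the ones listed above; `cluster_fs_root` and every subdirectory name are hypothetical and would need to match the real shared mount:

```yaml
# group_vars/galaxyservers.yml (excerpt)
# Hypothetical helper variable: a filesystem mounted at the same path
# on the Galaxy server and on every compute node
cluster_fs_root: /clusterFS

galaxy_server_dir: "{{ cluster_fs_root }}/galaxy/server"
galaxy_venv_dir: "{{ cluster_fs_root }}/galaxy/venv"
galaxy_shed_tools_dir: "{{ cluster_fs_root }}/galaxy/shed_tools"
galaxy_tool_dependency_dir: "{{ cluster_fs_root }}/galaxy/tool_dependencies"
galaxy_file_path: "{{ cluster_fs_root }}/galaxy/datasets"
galaxy_job_working_directory: "{{ cluster_fs_root }}/galaxy/jobs"
```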
CVMFS/Ref data
- Make proper tutorial of this
BioBlend:
- ~~[ ] Move the Jupyter notebook from https://github.com/nsoranzo/bioblend-tutorial/ to a files directory~~ Given that Binder seem to clone the entire GitHub repository, it seems better to keep the notebooks in a separate small repo.
- Write a small tutorial with links to run the notebooks on Binder
Object store
- Many slides are duplicates with Maintaining and others, the remainder are fairly junk, only the last 2 are about object store.
- Use "dot notation" for dictionary access in template vars (a few other tutorials as well)
- Document object store max_percent_full
Pulsar
- We should probably set transport_timeout (a PulsarRESTJobRunner plugin param) so that it is more resilient to connection timeouts. Also document this if it's not in job_conf.xml.sample_advanced
General
- Diff the client directory and only rebuild whenever the client directory changes: "Rebuild client only when needed" (ansible-galaxy#107)
Monitoring w/ gxadmin
- Needs the gxadmin group vars (they are defined in a different tutorial (Grafana)).
Troubleshooting
- Make a split_logging config var that automatically sets up filename_template logging as described in advanced logging configuration
- Replace job_runner_name with (or add column) handler in gxadmin query job-info
- Add pgcleanup support to galaxyproject.galaxy
- Add tmpwatch support to galaxyproject.galaxy
- tmpwatch on other caches (object store cache for instance)
- Add backup managed configs support to galaxyproject.galaxy
TIaaS
- Create some short intro slides (@shiltemann will do)
Jenkins
- proxy_pass use variable

Not admin-related:

In the dev topic, create a new slide deck for publishing Galaxy tools on the Tool Shed moving the corresponding slides from tool-integration and toolshed decks


- Move explanation of suites from TM to TS - Remove "Dependency resolvers" from TM (already in TS) - Remove conda recipe example from TS (off topic) - Remove duplicated "Key points" slide from dev TS xref. galaxyproject#1809

hexylena · 2020-03-04T10:04:48Z

Writing in my own comment, lest any updates conflict or be overwritten

Connect to compute
- validate job xml etc against the XML DTDs when possible
Pulsar
- Switch to MQ from http for py3 issues. also more 'real'. Don't need to secure the MQ since that's painful, but this would be enough.
- Vault for secrets.
Other
- Rename https://github.com/galaxyproject/dagobah-training/ to https://github.com/galaxyproject/gat/
- Rewrite the job_conf to use template from the start. Maybe everything should just go in templates? In case?
gxadmin part 3
- influxdb-client error?
- move monitoring to group_vars/monitoring.yml (Switch to monitoring group where it should apply #1827)
- fix this Switch to monitoring group where it should apply #1827 (comment)
- missing begin/endraw in monitoring in one part.

ondrejme · 2020-03-05T13:48:02Z

Hands-on: Enabling Interactive Tools in Galaxy
Step3:
I would suggest changing the order of "id" and "destination" in the tag, as it is with the other tool-destination mappings

Step4:
interactivetools_enable: "True"
remove quotation marks and make the capital letter small
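
The suggested Step 4 change would look roughly like this, assuming the option lives under `galaxy_config.galaxy` in the group vars as in the tutorial:

```yaml
galaxy_config:
  galaxy:
    # Before: interactivetools_enable: "True"
    # A plain YAML boolean, unquoted and lowercase
    interactivetools_enable: true
```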

lldelisle · 2020-03-05T13:50:53Z

in https://training.galaxyproject.org/training-material/topics/admin/tutorials/ansible-galaxy/tutorial.html
If you want not to use ssl, I guess you also need to change the templates/nginx/galaxy.j2 because:

    # Listen on port 443
    listen        *:443 ssl default_server;

Will not work, right?

natefoo · 2020-03-05T14:36:46Z

@lldelisle If you changed this to listen *:80 default_server;, you should also move this template from nginx_ssl_servers to nginx_servers, remove redirect-ssl from nginx_servers, and comment nginx_ssl_role. You would also need to remove /etc/nginx/sites-enabled/redirect-ssl. You could do this with a pre_task like:

    - name: Remove redirect-ssl config
      file:
        path: /etc/nginx/sites-enabled/redirect-ssl
        state: absent

lldelisle · 2020-03-05T16:05:46Z

Many thanks... So the only thing which is missing in the training material is:
change

    # Listen on port 443
    listen        *:443 ssl default_server;

to

    # Listen on port 80
    listen        *:80 default_server;

If you ran the playbook once with redirect-ssl before deciding not to use SSL, remove the file /etc/nginx/sites-enabled/redirect-ssl.
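
The group vars side of the same change, sketched from natefoo's description above. The variable names come from that comment; the value shown for `nginx_ssl_role` is an assumption, so comment out whatever your file actually sets:

```yaml
# group_vars/galaxyservers.yml (excerpt), plain HTTP variant
nginx_servers:
  - galaxy   # moved here from nginx_ssl_servers; redirect-ssl removed
# nginx_ssl_servers:
#   - galaxy
# nginx_ssl_role: usegalaxy_eu.certbot   # assumed value
```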

lldelisle · 2020-03-05T17:14:28Z

In https://training.galaxyproject.org/training-material/topics/admin/tutorials/connect-to-compute-cluster/tutorial.html:
You wrote: Add a post_task to your playbook to install slurm-drmaa1 (Debian/Ubuntu) or slurm-drmaa (RedHat/CentOS), and additionally include the galaxyproject.repos role
Then maybe you could use:

  post_tasks:
    - name: Install slurm-drmaa1 if Debian
      package:
        name: slurm-drmaa1
      when: ansible_os_family == "Debian"
    - name: Install slurm-drmaa if RedHat
      package:
        name: slurm-drmaa
      when: ansible_os_family == "RedHat"

(If I understood well...)
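
A more compact variant of the same idea, as a sketch: a single task that picks the package name from `ansible_os_family`. It assumes only Debian-family and RedHat-family hosts are in play, so any other family would wrongly get `slurm-drmaa`:

```yaml
post_tasks:
  - name: Install the Slurm DRMAA library
    package:
      # slurm-drmaa1 on Debian/Ubuntu, slurm-drmaa otherwise (assumed RedHat/CentOS)
      name: "{{ 'slurm-drmaa1' if ansible_os_family == 'Debian' else 'slurm-drmaa' }}"
      state: present
```

The two-task version above is more explicit and skips unknown OS families instead of guessing.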

lldelisle · 2020-03-06T12:29:18Z

To myself: ansible_python.version.major

hexylena · 2020-03-12T12:48:18Z

combination of statements and opinions from @natefoo @Slugger70 @mvdbeek @nsoranzo @hexylena and @shiltemann, synthesized into one summary/todo list.

Barcelona

This training was fantastic! And incredibly strange, things worked! Like flawlessly nearly. We got through 5 days of content in 3. We had to come up with an extra 2 days.

A notable difference this time was how many students tried to run the playbooks immediately on their own infrastructure, either from the start on their own VMs, or after class on their own infra. Despite asking everyone to run it on the VM, we also had a couple of people brave enough to run from their own laptop, mostly without issue.

All around great set of participants! But it led us to focus on areas we need to improve the materials

Seeing the Effects

From @natefoo:

an idea I had: two column design on the tutorials where one column is the things you do in ansible and the other column is the effects it has on the system

this latest training went well but at times it felt very black-boxish, "just run these things and voila!"

For something like the ansible tutorial we could show a

    $ cat /tmp/test.txt
    some contents

In something like the galaxy tutorial we'd show all the changes to the system that each step makes. I'd say something like the latest commit on the release_XX.YY branch has been cloned to /srv/galaxy/server

In order to reduce how much it needs to be updated, we will just use this in the first two trainings where we need to show this effect (ansible, ansible-galaxy).

The students can then see the changes Ansible is making and gain the understanding they need to troubleshoot. Things never always "just work", especially when running on varied or outsourced hardware, with the large variety in tool quality, etc.
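
As a possible "what you do in Ansible" column for the /tmp/test.txt example above, a minimal sketch (the play and task names are made up for illustration):

```yaml
- hosts: all
  tasks:
    # Writes the file whose contents the "effects" column would display
    - name: Write a small test file
      copy:
        content: "some contents\n"
        dest: /tmp/test.txt
```

The effects column next to it would then be the `cat /tmp/test.txt` output.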

"Real exercises"

We noted that a few students had issues with how ansible really works, variables being set in different places, which changes have which effects. So we're considering adding "real" exercises or hide a bit more the answers for some of the ones we already have.

It's a tough balance to strike. For most of the questions & answers in ansible-galaxy, they're awful, they ask "how does your final config look" and everyone just copies that. Maybe we should rewrite them as "Here is the config." and ask better questions??? "what does this do?" "what effect will that have?"

We should show the students Ansible Best Practices at some point? Before the training? Or after the 1st day? https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html

And we should consider developing "Ansible - advanced" or an ansible "exam" (CTF?) for the students, saying "ok, now that you know ansible, accomplish these tasks"

I also think that sometimes "just re-run the playbook" isn't enough. Figuring out why something has changed can sometimes be more important for the big picture than how to do it. (If that makes sense.)

Continuum

I think there's a continuum: at one end is "galaxy of a few years ago, where people needed to be programmers/tool devs/admins together, and we needed to teach everything in detail so they could debug", and at the other end is "galaxy (of the current/future) where things mostly just work, and they can just deploy it and not care too much, since the documentation/tutorials cover all of the main points and they don't resort to low level debugging"

If we're really moving to the "just works" end, maybe we remove that detail from the curriculum because it doesn't benefit students vs a higher level picture.

I think if they're gonna go back and not use ansible it's good to show "here's what this production deployment looks like" so they can adapt it for their own purposes

We sympathise with "ought to get an in-depth understanding", but:

some things aren't learned without real life experiences
it's difficult to synthesise useful lessons from our myriad experiences. see the troubleshooting slides, we can't summarise in some bullet points, admin is a huge topic which requires understanding across a large number of fields (linux, networking, kernel, python, c code, etc.) and that isn't something taught in a day or even a week
the people there want to solve their problem, and "their problem" seems to mostly be "get galaxy running on weird hpc no.2345"

It's two sides of a coin... people coming to a week long training probably ought to come away with a pretty low level understanding - but we've also found that it's really difficult to teach that low level understanding, especially to folks who mostly aren't sysadmins.

Which leads us to the next question:

What is "A Galaxy Admin"

What should students come away from GAT knowing how to do?

I hope when their Galaxy breaks they know what to do, or when they need to set up a fancier Galaxy server (like ours) they know where to start.
If they've got Pulsar, that's a lot of it
they should be able to resolve systems crashing (nginx, galaxy; "check the logs") and know where to look and which things could be at fault.
If a tool is crashing they should know how to handle this and where to look (dependencies, inputs, check stderr, etc.)
Should be able to set up a cluster (+find docs for other clusters)
adding storage
setting up monitoring
Setting up data and sharing it

everything else is less important?

Splitting

We should include more on the splitting of roles amongst machines, and write them in a way they can be used as-is. E.g. transitioning from ident auth to network auth is complex (see next aside). A number of participants tried deploying the playbook on their own systems toward the end of the week and some struggled with getting the proper DB configuration.

So db on separate server as an example and how to setup the ansible to do things like that. And talk about production setups for a large user base in detail. The benefits of automation for larger setups and some examples of tool maintenance etc.

There are now, I think, two different places in the tutorials where we say "if we were really doing best practices we'd create a new group and put vars in a different group vars file"; maybe we should just do that.

I'd see the following splitting for the whole week:

db
galaxy (+proxy +slurm submit +tiaas)
compute-central manager
compute exec
pulsar
monitoring (influx/grafana)

In ansible-galaxy, only one split, db + galaxy that sounds manageable. And it is a good place to introduce this concept of "here is where you can divide your infrastructure"
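
A sketch of what that single db + galaxy split could look like as a playbook with one play per host group. The `dbservers` group name is hypothetical, and the role lists are cut down to the two roles named in this issue; a real playbook would carry the remaining roles too:

```yaml
# Play 1: database host(s), hypothetical "dbservers" inventory group
- hosts: dbservers
  become: true
  roles:
    - galaxyproject.postgresql

# Play 2: Galaxy host(s), vars come from group_vars/galaxyservers.yml
- hosts: galaxyservers
  become: true
  roles:
    - galaxyproject.galaxy
```

With both groups pointing at the same machine in the inventory this behaves like the current single-host setup, so students can split the hosts later without restructuring the playbook.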

DB Auth

let's bind to 127, and use md5, and make everyone use passwords. I think that would be a positive change over ident magic. (I mean, I love ident, but, it's difficult to switch / not obvious for students)
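
Roughly what that could mean in group vars, as a sketch only. It assumes `postgresql_conf` and `postgresql_pg_hba_conf` are the galaxyproject.postgresql variables for these settings (check the role README), and `vault_galaxy_db_password` is a hypothetical vaulted variable:

```yaml
# Database host vars (excerpt)
postgresql_conf:
  # Listen on loopback only
  - listen_addresses: "'127.0.0.1'"
postgresql_pg_hba_conf:
  # Password (md5) auth for local TCP connections
  - host all all 127.0.0.1/32 md5

# Galaxy host vars (excerpt)
galaxy_config:
  galaxy:
    database_connection: "postgresql://galaxy:{{ vault_galaxy_db_password }}@127.0.0.1:5432/galaxy"
```

The galaxy database user would also need that password set when it is created.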

Conclusion

Setting splitting of input/output
Split playbooks better
Fix DB auth
Ansible advanced exercise
Libraries exercise

hexylena · 2020-05-28T09:18:39Z

WIP implementation of the side-by-side discussed during admin debriefing

hexylena · 2020-05-28T09:19:57Z

@annefou this might be interesting for you, too! Do you have any feedback on this? Authors have the choice of

side-by-side (which automatically becomes a single column when screen becomes too narrow)
or always vertical (second set of in/out)

natefoo · 2022-12-06T19:30:00Z

CVMFS/ref data

Make proper tutorial of this

#3778

hexylena · 2023-03-29T08:09:41Z

In general I think enough of this is done to finally close it out.