Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor (library installation fixes and new static site generator) #247

Merged
merged 26 commits into from
Jun 22, 2023

Conversation

myedibleenso
Copy link
Collaborator

@myedibleenso myedibleenso commented Jun 20, 2023

This PR is a larger than originally intended. It started as a switch to a different static site generator that led to a series of other changes to correct broken imports and streamline installation.

Summary of changes

  • [documentation]: Move from Haskell-based doc generation to mkdocs (Python-based static site generator)
    • new site will include API docs for python and rust modules (scala components are not included for the time being)
  • [documentation]: Publish code coverage reports for Python module (Scala and Rust TBD)
  • [misc]: Refactor module and corresponding pyproject.toml to account for missing project deps and broken references
  • [misc] fixed link in skema_py/README.md
  • [misc] Added .unused suffix to program_analysis files that are unused and soon to be removed
  • [integration / @adarshp @jastier ]: Update Dockerfile to include all skema python module components
    • This removes the need for a separate dockerfile for img2mml; for the time being, we still publish a lumai/askem-skema-img2mml image (which now mirrors skema-py)
  • [img2mml]: img2mml model (cnn_xfmer_OMML-90K_best_model_RPimage.pt from https://kraken.sista.arizona.edu/skema/img2mml/models/ is now retrieved at image build time and included in the published docker image (NOTE: we'll probably want to revisit this in the future)
  • [text reading / @enoriega ]: text_reading python utilities are included as another sub-module (at the cost of some additional dependencies noted in pyproject.toml)
  • docker-compose files are updated with new invocations reflecting new package structure (ex. uvicorn skema.skema_py.server:app --host 0.0.0.0 --port 8000)
  • [code2fn / @vincentraymond-ua ]: corrected a bug related to naming of tree-sitter-fortran grammar in build_tree_sitter_fortran.py (build, my-languages.so -> build/ts-fortran.so)

Notes

  • ISA's inclusion means the skema-py Docker image will now include a Rust installation. We may want to consider combining this with the skema-rs components to generate a single image. This might simplify calling things like the code comment extractor from Python for a unified pipeline.

myedibleenso and others added 5 commits June 19, 2023 13:58
- update pyproject.toml with subproject dependencies
- fix broken module imports
- switch to Python-based mkdocs in place of Haskell ss
- skema/text_reading/text_reading -> skema/text_reading
@myedibleenso myedibleenso added this to the [DARPA] Milestone 9 milestone Jun 20, 2023
@myedibleenso
Copy link
Collaborator Author

myedibleenso commented Jun 20, 2023

@adarshp @Free-Quarks @cl4yton @vincentraymond-ua , example refactoring in #233 quietly broke Rust tests by removing gromet JSON from the repository. I suspect this has to do with the triggers for the test_rust_components Drone CI step:

steps:
- name: test_rust_components
  image: rust:1.67
  commands:
  - apt update && apt install -y cmake openssl libclang-13-dev
  - cd skema/skema-rs
  - cargo test --verbose --all
  when:
    paths:
    - .drone.yml
    - skema/skema-rs/**

This should probably include - data/gromet/**:

steps:
- name: test_rust_components
 image: rust:1.67
 commands:
 - apt update && apt install -y cmake openssl libclang-13-dev
 - cd skema/skema-rs
 - cargo test --verbose --all
 when:
   paths:
   - .drone.yml
   - skema/skema-rs/**
   - data/gromet/**

The motivation for removing this was to minimize committed gromet JSON, as updates to gromet will cause large diffs and bloat the repo's history.

Perhaps we can consider generating the gromet JSON at test time?

Maybe we can discuss this after the team meeting on 6/20.

@adarshp
Copy link
Contributor

adarshp commented Jun 20, 2023

Thanks @myedibleenso for taking on this large refactoring! A few comments/questions:

  • I noticed that we now have separate .dockerignore files for the skema-py and skema-rs Dockerfiles. Does Docker automatically pick up the right one based on the filename while building the image?
  • After we migrate to using tree-sitter for Python files, we should relax the requires-python==3.8.* constraint in pyproject.toml since we won't be relying on the specific form of the Python 3.8 AST anymore.
  • Does graspologic really need Rust? I don't see it mentioned in the documentation. There does seem to be a companion package, graspologic-native that does depend on Rust (at least to build from source), but I don't think we are using graspologic-native in the ISA pipeline. If it does require the Rust toolchain, then yes, I agree, makes sense to combine the Docker images.
  • Re: [code2fn] Creates set of Fortran examples and bulk Gromet generation script #233 breaking the Rust tests - you are right, we should add the - data/gromet/** directory to the paths in the relevant section of .drone.yml.

@adarshp
Copy link
Contributor

adarshp commented Jun 20, 2023

Re: generating GroMET JSON at test time, I can see both pros and cons. I won't be able to attend the 6/20 meeting due to a prior conflict, but I'll go with whatever route is chosen.

Pros:

  • We don't need to have large GroMET diffs
  • This may help reduce regression bugs.

Cons:

  • Now folks developing the Rust components will also need to either (1) deal with installing the Python components (which can be a pain especially if we are restricting to Python 3.8.*) or (2) run the Dockerized PA REST service.
  • Even if installing the Python dependencies etc. is successful, there is now a considerable amount of overhead added to testing the Rust components - prior to running any tests, they will need to have the PA REST API running, and each testing call will involve running the PA pipeline. Since @Free-Quarks is the one developing Rust components that ingest GroMET, we should probably check with him about whether either of these options is preferable.

name="skema"
authors=[
name = "skema"
authors = [
{name="Enrique Noriega", email="[email protected]"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@myedibleenso We should add Keith here too

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kwalcock , if you'd like to be added to the list of authors in this PR, please post your preferred email here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for thinking of me. If it isn't specifically just for the Python code, which I haven't worked on, then it would be OK. LUM would probably like to have it be [email protected]. If it is, then credit to where it is due, of course.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kwalcock is being too humble. Of course he should be listed

pyproject.toml Outdated
"skema.program_analysis" = "skema/program_analysis"
"skema.skema_py" = "skema/skema_py"
# re-map skema/text_reading/python to skema.text_reading
"skema.text_reading" = "skema/text_reading/python"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@myedibleenso : Let's keep in mind that the python code that lives under mention_linking will be deprecated in favor of model_linker introduced in #246

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@enoriega , if you want to open a PR for this branch that replaces that code, feel free to pull these changes into another branch and do so. Otherwise we can take care of it in a subsequent PR.

@myedibleenso
Copy link
Collaborator Author

myedibleenso commented Jun 20, 2023

@adarshp , thanks for your feedback.

multiple .dockerignore files

  • I noticed that we now have separate .dockerignore files for the skema-py and skema-rs Dockerfiles. Does Docker automatically pick up the right one based on the filename while building the image?

Ah, thanks for reminding me to include a section for docker builds in the documentation. The weird naming convention relates to a somewhat obscure docker feature that has been around for a bit now. I can't recall where I first learned about it, but here is a comment that may have first announced it: moby/moby#12886 (comment)

This is a feature supported by Buildkit (the default builder for Docker Desktop and Docker Engine since v23). By specifying a Dockerfile with -f during a build, Docker will look for a corresponding dockerignore file named <dockerfile-name>.dockerignore. So something like f "Dockerfile.skema-py" will prompt docker to look for an ignore file called Dockerfile.skema-py.dockerignore first before defaulting to .dockerignore.

Most newer Docker installs support this out of the box, but here is an example for explicitly invoking use of buildkit (which includes this feature):

DOCKER_BUILDKIT=1 docker build -f "Dockerfile.skema-py" -t "lumai/askem-skema-py:test" .

You can verify the new dockerignore is being applied here by building locally using this branch with the following command:

DOCKER_BUILDKIT=1 docker build --progress=plain --no-cache -f "Dockerfile.skema-py" -t "lumai/askem-skema-py:test" .

(I left a RUN tree skema line in Dockerfile.skema-py to confirm while testing)

More info on enabling buildkit: https://docs.docker.com/build/buildkit/#getting-started

Python version

  • After we migrate to using tree-sitter for Python files, we should relax the requires-python==3.8.* constraint in pyproject.toml since we won't be relying on the specific form of the Python 3.8 AST anymore.

Looking forward to getting out of that corner!

Rust dependency and graspologic

  • Does graspologic really need Rust? I don't see it mentioned in the documentation. There does seem to be a companion package, graspologic-native that does depend on Rust (at least to build from source), but I don't think we are using graspologic-native in the ISA pipeline. If it does require the Rust toolchain, then yes, I agree, makes sense to combine the Docker images.

I took another look at this and realized the source of the problem: my local docker image builds on my Mac were failing, because no wheel appears to be available for ARM64/M1. When the wheel is absent, graspologic-native gets pulled in along with its dependency on Rust.

Technically, we don't need to publish images for the M1. If we wanted to do so, though, I think including Rust will make things much easier. This also has some other benefits (unified image, adding support to call Rust from Python, etc).

Drone CI triggers for Rust tests

Ok, I can include that change here. It won't address the failing tests, but it should help us avoid the same problem in the future.

@adarshp
Copy link
Contributor

adarshp commented Jun 20, 2023

@myedibleenso - thanks for clarifying about the multiple .dockerignore files and the graspologic issue!

@Free-Quarks
Copy link
Collaborator

Hi @myedibleenso and @adarshp

As for the gromet JSON's in the repo, I would not be opposed to generating them during test time, however my rust components that are being tested on the JSON's often lag behind in version support from the code2fn module, so if we always generate with the most recent version of code2fn, it would likely cause test failures until I update my end for the most recent version, which isn't a huge deal and might be good for people to know when there is version mismatch. but this is why there are so many JSON's currently and the different versions. It is a bit of a pain to update the JSON too since I basically manually curate the JSON's on the repo from their sourcing being the google drive: https://drive.google.com/drive/folders/1RnNVTifUESFXXkNh3tc8J6kot0xink07?usp=drive_link , so I am completely open to suggestions on how to manage the JSON's and their tests.

@myedibleenso
Copy link
Collaborator Author

Thanks, @Free-Quarks !

my rust components that are being tested on the JSON's often lag behind in version support from the code2fn module, so if we always generate with the most recent version of code2fn, it would likely cause test failures until I update my end for the most recent version, which isn't a huge deal and might be good for people to know when there is version mismatch

I actually think this would be a good thing, as it would immediately alert us of drift. I'll discuss generating gromet JSON at test time with the PA team and circle back with you.

@myedibleenso myedibleenso merged commit fe5f565 into main Jun 22, 2023
4 checks passed
@myedibleenso myedibleenso deleted the ghp/mkdocs-migration branch June 22, 2023 06:31
github-actions bot added a commit that referenced this pull request Jun 22, 2023
…247)

Resolves #251

This PR is a larger than originally intended. It started as a switch to
a different static site generator that led to a series of other changes
to correct broken imports and streamline installation.

## Summary of changes
- [**documentation**]: Move from Haskell-based doc generation to mkdocs
(Python-based static site generator)
- new site will include API docs for python and rust modules (scala
components are not included for the time being)
- [**documentation**]: Publish code coverage reports for Python module
(Scala and Rust TBD)
- [**misc**]: Refactor module and corresponding `pyproject.toml` to
account for missing project deps and broken references
- [**misc**] fixed link in `skema_py/README.md`
- [**misc**] Added `.unused` suffix to `program_analysis` files that are
unused and soon to be removed
- [**integration** / @adarshp @jastier ]: Update Dockerfile to include
all skema python module components
- This removes the need for a separate dockerfile for `img2mml`; for the
time being, we still publish a `lumai/askem-skema-img2mml` image (which
now mirrors `skema-py`)
- [**img2mml**]: `img2mml` model
(`cnn_xfmer_OMML-90K_best_model_RPimage.pt` from
`https://kraken.sista.arizona.edu/skema/img2mml/models/` is now
retrieved at image build time and included in the published docker image
(NOTE: we'll probably want to revisit this in the future)
- [**text reading** / @enoriega ]: `text_reading` python utilities are
included as another sub-module (at the cost of some additional
dependencies noted in `pyproject.toml`)
- docker-compose files are updated with new invocations reflecting new
package structure (ex. `uvicorn skema.skema_py.server:app --host 0.0.0.0
--port 8000`)
- [**code2fn** / @vincentraymond-ua ]: corrected a bug related to naming
of `tree-sitter-fortran` grammar in
[`build_tree_sitter_fortran.py`](https://github.com/ml4ai/skema/blob/7c3c8959df3a909ce758313f035efd91c93938af/skema/program_analysis/TS2CAST/build_tree_sitter_fortran.py#L9)
(`build, my-languages.so` -> `build/ts-fortran.so`)

## Notes
- ISA's inclusion means the `skema-py` Docker image will now include a
Rust installation. We may want to consider combining this with the
`skema-rs` components to generate a single image. This might simplify
calling things like the code comment extractor from Python for a unified
pipeline. fe5f565
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants