MedaCy seeks to create a unified platform to streamline research efforts in medical text mining while also providing an interface to easily apply models to real-world problems. Because of this, contributions to medaCy are often consequences and direct by-products of active research projects. However, without the contributions, bug fixes/reports, and suggestions of practitioners, medaCy could not grow and thrive.
This contribution guide is designed to inform:
- Researchers on how they can efficiently utilize medaCy to make their work more accessible to practitioners.
- Practitioners on how they can tune medaCy's cutting-edge functionalities to their specific application.
Please do a search before posting an issue/bug report - your problem may already be solved! If your search comes up empty - congratulations, you may have something to contribute!
At its most basic, one can fork medaCy, clone down their fork, and use their favorite text editor to develop. However, some up-front set-up effort goes a long way towards streamlining the contribution process and keeping organized. This section details a suggested set-up for efficient development, testing, and experimentation with medaCy utilizing PyCharm.
Assumptions of this section:
- You are working in a UNIX-based operating system.
- Part 2 assumes you have PyCharm Professional installed - PyCharm Professional is provided with the JetBrains University License. (This isn't strictly necessary, but the useful Remote Host feature is disabled in the Community Edition.)
Part 1: Development Installation
- If you are shaky with git - this link provides an excellent description of the branching model medaCy follows to organize contributions.
- Fork medaCy and copy the clone link.
- On your machine, ensure you have Python 3 installed. Set up a virtual environment and activate it.
- Run the bash commands:
python --version
and
pip list
Upgrade pip to the latest version as suggested. Your Python version should be above 3.4 and your installed packages should be few in number - if either of these conditions does not hold, return to the previous step.
- In a directory separate from the one created by the virtual environment set-up command, clone down your fork of medaCy.
- While inside your cloned fork, ensure you are on at least the development branch or a branch of the development branch.
This can be verified by running
git status
and branching can be done with
git checkout <branch-name>
- Run
pip install -e .
This will install medaCy in editable mode inside of your virtual environment and will take several minutes to install dependencies - medaCy stands on the shoulders of giants! Errors you are likely to encounter here involve the installation of SciPy and NumPy. Search for the error messages online, as they are usually fixable by installing some extra dependencies. Most likely, your Python installation is missing C headers required by SciPy.
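The checks from Part 1 can also be scripted. The following is a minimal sketch and not part of medaCy: the helper name `check_environment` is hypothetical, and it assumes the package imports under the name `medacy`. It reports whether the interpreter is newer than 3.4, whether a virtual environment appears to be active, and whether NumPy, SciPy, and medaCy can be located.

```python
import importlib.util
import sys


def check_environment(packages=("numpy", "scipy", "medacy")):
    """Return a dict of basic development-environment checks.

    Hypothetical helper for illustration; not part of medaCy.
    The import name "medacy" is an assumption.
    """
    report = {
        # The guide asks for a Python version above 3.4.
        "python_ok": sys.version_info[:2] > (3, 4),
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
    }
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for check, passed in check_environment().items():
        print(f"{check}: {'ok' if passed else 'MISSING'}")
```

If any line reports a problem, revisit the corresponding step above before continuing.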
Part 2: Developing with PyCharm
PyCharm can streamline development efforts - especially if you are developing locally and running medaCy on a remote machine for model building.
Part 3: Logging
MedaCy uses the logging module to give users insight into how medaCy is handling their data. Ensure you log critical steps in any functionality you implement, at the appropriate logging levels, to make it easy for users to debug.
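As an illustration of logging at appropriate levels, here is a minimal sketch using Python's standard logging module. The function `load_documents` and its behavior are hypothetical and are not part of medaCy's API; only the logging calls themselves are standard.

```python
import logging

# A module-level logger named after the module lets users filter output by component.
logger = logging.getLogger(__name__)


def load_documents(paths):
    """Hypothetical example function; not part of medaCy."""
    # INFO: a critical step that users will want to see during normal runs.
    logger.info("Loading %d documents", len(paths))
    loaded = []
    for path in paths:
        if not path.endswith(".txt"):
            # WARNING: something unexpected that the function can recover from.
            logger.warning("Skipping unsupported file: %s", path)
            continue
        # DEBUG: fine-grained detail that is only useful when diagnosing problems.
        logger.debug("Accepted file: %s", path)
        loaded.append(path)
    if not loaded:
        # ERROR: the step could not produce anything useful.
        logger.error("No usable documents found among %d paths", len(paths))
    return loaded


if __name__ == "__main__":
    # The application, not library code, chooses the level and the handlers.
    logging.basicConfig(level=logging.INFO)
    load_documents(["a.txt", "notes.pdf"])
```

Keeping handler and level configuration out of library code leaves users free to raise or lower verbosity, for example through the pytest log options described in the testing section.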
All components of medaCy have associated unit tests. Please ensure these all pass before submitting pull requests. When medaCy runs unit tests, it first automatically installs the END dataset, then uses it to test various functionalities of the package. Some tests involve building a model over the dataset - these may take some time to complete.
After installing medaCy for development, make sure that pytest is installed. Then:
- For quick testing of the whole framework, run:
python setup.py test
- For more fine-grained testing on individual files with colorful log output, run:
pytest -s tests/tools/test_data_manager.py -o log_cli=True --log-cli-level=INFO
This will show log output during tests and allow you to adjust the logging level for the test file being run. Read the pytest documentation for details.
Note that some of the unit tests require knowledge about the configuration of your machine, and that those tests will be skipped if the relevant settings are not specified in the config.json file. These settings include the location of a MetaMap binary file on your machine, which GPU core to use for certain tests, and the location of a word embeddings file. Your contributions may not affect functionality that depends on these features; however, all pull requests will be tested against the full unit test suite.
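A configuration-dependent test can be skipped along these lines. This is a minimal sketch, not medaCy's actual test code: the helper `load_test_config` and the key `metamap_path` are hypothetical, so consult the project's config.json for the real setting names. It uses the standard library's unittest skip decorator, which pytest also honors when collecting tests.

```python
import json
import os
import unittest


def load_test_config(path="config.json"):
    """Return the settings stored in `path`, or an empty dict if the file is absent."""
    if not os.path.isfile(path):
        return {}
    with open(path) as f:
        return json.load(f)


CONFIG = load_test_config()


class TestMetaMapDependentFeature(unittest.TestCase):
    # "metamap_path" is a hypothetical key used only for illustration.
    @unittest.skipUnless(CONFIG.get("metamap_path"), "MetaMap path not set in config.json")
    def test_metamap_binary_exists(self):
        # Runs only when the setting is present; otherwise reported as skipped.
        self.assertTrue(os.path.isfile(CONFIG["metamap_path"]))
```

With this pattern, a machine without the setting reports the test as skipped rather than failed, while a fully configured machine still exercises it.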