Documentation & Updates (#5)

- add colab notebook etc to README
- various usability enhancements

Signed-off-by: Peter <[email protected]>
pszemraj · Jan 21, 2023 · 419eb3b
1 parent 7405685

Showing 4 changed files with 98 additions and 75 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -1,29 +1,3 @@
-```{todo} THIS IS SUPPOSED TO BE AN EXAMPLE. MODIFY IT ACCORDING TO YOUR NEEDS!
-
-   The document assumes you are using a source repository service that promotes a
-   contribution model similar to [GitHub's fork and pull request workflow].
-   While this is true for the majority of services (like GitHub, GitLab,
-   BitBucket), it might not be the case for private repositories (e.g., when
-   using Gerrit).
-
-   Also notice that the code examples might refer to GitHub URLs or the text
-   might use GitHub specific terminology (e.g., *Pull Request* instead of *Merge
-   Request*).
-
-   Please make sure to check the document having these assumptions in mind
-   and update things accordingly.
-```
-
-```{todo} Provide the correct links/replacements at the bottom of the document.
-```
-
-```{todo} You might want to have a look on [PyScaffold's contributor's guide],
-
-   especially if your project is open source. The text should be very similar to
-   this template, but there are a few extra contents that you might decide to
-   also include, like mentioning labels of your issue tracker or automated
-   releases.
-```
 
 # Contributing
 
@@ -41,6 +15,25 @@ considerate, reasonable, and respectful**. When in doubt,
 [Python Software Foundation's Code of Conduct] is a good reference in terms of
 behavior guidelines.
 
+---
+
+- [Contributing](#contributing)
+  - [Issue Reports](#issue-reports)
+  - [Documentation Improvements](#documentation-improvements)
+    - [creating pyscaffold-compatible documentation](#creating-pyscaffold-compatible-documentation)
+    - [Working on the documentation](#working-on-the-documentation)
+  - [Code Contributions](#code-contributions)
+    - [Submit an issue](#submit-an-issue)
+    - [Create an environment](#create-an-environment)
+    - [Clone the repository](#clone-the-repository)
+    - [Implement your changes](#implement-your-changes)
+    - [Submit your contribution](#submit-your-contribution)
+    - [Troubleshooting](#troubleshooting)
+  - [Maintainer tasks](#maintainer-tasks)
+    - [Releases](#releases)
+
+---
+
 ## Issue Reports
 
 If you experience bugs or general issues with `textsum`, please have a look
@@ -62,47 +55,43 @@ you help us to identify the root cause of the issue.
 ## Documentation Improvements
 
 You can help improve `textsum` docs by making them more readable and coherent, or
-by adding missing information and correcting mistakes.
+by adding missing information and correcting mistakes. Currently, this is easy because there is no official documentation: the README.md file is the only documentation outside of the [wiki]. If you want to improve it, please do so and submit a pull request.
 
-`textsum` documentation uses [Sphinx] as its main documentation compiler.
-This means that the docs are kept in the same repository as the project code, and
-that any documentation update is done in the same way was a code contribution.
+### creating pyscaffold-compatible documentation
 
-```{todo} Don't forget to mention which markup language you are using.
+First, install [pyscaffoldext-markdown] and [pyscaffoldext-sphinx] extensions (as well as all other extensions):
 
-    e.g.,  [reStructuredText] or [CommonMark] with [MyST] extensions.
+```bash
+pip install pyscaffold[all]
 ```
 
-```{todo} If your project is hosted on GitHub, you can also mention the following tip:
-
-   :::{tip}
-      Please notice that the [GitHub web interface] provides a quick way of
-      propose changes in `textsum`'s files. While this mechanism can
-      be tricky for normal code contributions, it works perfectly fine for
-      contributing to the docs, and can be quite handy.
-
-      If you are interested in trying this method out, please navigate to
-      the `docs` folder in the source [repository], find which file you
-      would like to propose changes and click in the little pencil icon at the
-      top, to open [GitHub's code editor]. Once you finish editing the file,
-      please write a message in the form at the bottom of the page describing
-      which changes have you made and what are the motivations behind them and
-      submit your proposal.
-   :::
+Then, clone this repo and update the documentation:
+
+```bash
+git clone https://github.com/pszemraj/textsum.git
+putup textsum --force --markdown
 ```
 
+This will create a new directory `docs` with the documentation. You can now edit the files in `docs` and commit the changes.
+
+### Working on the documentation
+
 When working on documentation changes in your local machine, you can
 compile them using [tox] :
 
 ```
+
 tox -e docs
+
 ```
 
 and use Python's built-in web server for a preview in your web browser
 (`http://localhost:8000`):
 
 ```
+
 python3 -m http.server --directory 'docs/_build/html'
+
 ```
 
 ## Code Contributions
@@ -335,37 +324,26 @@ on [PyPI], the following steps can be used to release a new version for
     to collectively create software are general and can be applied to all sorts
     of environments, including private companies and proprietary code bases.
 
-
 [black]: https://pypi.org/project/black/
-[commonmark]: https://commonmark.org/
 [contribution-guide.org]: http://www.contribution-guide.org/
-[creating a pr]: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request
 [descriptive commit message]: https://chris.beams.io/posts/git-commit
 [docstrings]: https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html
-[first-contributions tutorial]: https://github.com/firstcontributions/first-contributions
 [flake8]: https://flake8.pycqa.org/en/stable/
 [git]: https://git-scm.com
-[github web interface]: https://docs.github.com/en/github/managing-files-in-a-repository/managing-files-on-github/editing-files-in-your-repository
-[github's code editor]: https://docs.github.com/en/github/managing-files-in-a-repository/managing-files-on-github/editing-files-in-your-repository
-[github's fork and pull request workflow]: https://guides.github.com/activities/forking/
 [guide created by freecodecamp]: https://github.com/freecodecamp/how-to-contribute-to-open-source
 [miniconda]: https://docs.conda.io/en/latest/miniconda.html
-[myst]: https://myst-parser.readthedocs.io/en/latest/syntax/syntax.html
 [other kinds of contributions]: https://opensource.guide/how-to-contribute
 [pre-commit]: https://pre-commit.com/
 [pypi]: https://pypi.org/
-[pyscaffold's contributor's guide]: https://pyscaffold.org/en/stable/contributing.html
 [pytest can drop you]: https://docs.pytest.org/en/stable/usage.html#dropping-to-pdb-python-debugger-at-the-start-of-a-test
 [python software foundation's code of conduct]: https://www.python.org/psf/conduct/
-[restructuredtext]: https://www.sphinx-doc.org/en/master/usage/restructuredtext/
-[sphinx]: https://www.sphinx-doc.org/en/master/
 [tox]: https://tox.readthedocs.io/en/stable/
 [virtual environment]: https://realpython.com/python-virtual-environments-a-primer/
 [virtualenv]: https://virtualenv.pypa.io/en/stable/
 
-
 ```{todo} Please review and change the following definitions:
 ```
 
-[repository]: https://github.com/<USERNAME>/textsum
-[issue tracker]: https://github.com/<USERNAME>/textsum/issues
+[repository]: https://github.com/pszemraj/textsum
+[issue tracker]: https://github.com/pszemraj/textsum/issues
+[wiki]: https://github.com/pszemraj/textsum/wiki
diff --git a/README.md b/README.md
@@ -12,11 +12,35 @@
 
 # textsum
 
+ <a href="https://colab.research.google.com/gist/pszemraj/ff8a8486dc3303199fe9c9790a606fff/textsum-summarize-text-files-example.ipynb">
+  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
+</a>
+<a href="https://pypi.org/project/textsum/"> <img src="https://img.shields.io/pypi/v/textsum.svg" alt="PyPI-Server"/></a>
+
+<br>
+
 > utility for using transformers summarization models on text docs
 
-The purpose of this package is to provide a simple interface (python API, CLI, gradio web UI) for using summarization models on text documents of arbitrary length.
+This package provides easy-to-use interfaces for using summarization models on text documents of arbitrary length. Currently implemented interfaces include a Python API, a CLI, and a shareable demo app.
+
+For details, explanations, and docs, see the [wiki](https://github.com/pszemraj/textsum/wiki)
+
+⚠️ _This is a WIP, but general functionality is available_ ⚠️
+
+---
 
-⚠️ **WARNING**: _This package is a WIP and is not ready for production use. Some things may not work yet._ ⚠️
+- [textsum](#textsum)
+  - [Installation](#installation)
+    - [Full Installation](#full-installation)
+    - [Additional Details](#additional-details)
+  - [Usage](#usage)
+    - [Python API](#python-api)
+    - [CLI](#cli)
+    - [Demo App](#demo-app)
+  - [Contributing](#contributing)
+  - [Roadmap](#roadmap)
+
+---
 
 ## Installation
 
@@ -27,7 +51,7 @@ Install using pip:
 pip install textsum
 ```
 
-The `textsum` package is now installed in your virtual environment. You can now use the CLI or python API to summarize text docs see the [Usage](#usage) section for more details.
+The `textsum` package is now installed in your virtual environment. The CLI commands and Python API can now be used to summarize text docs from anywhere. See the [Usage](#usage) section for more details.
 
 ### Full Installation
 
@@ -125,6 +149,12 @@ This will start a local server that you can access in your browser & a shareable
 
 ---
 
+## Contributing
+
+Contributions are welcome! Please open an issue or PR if you have any ideas or suggestions.
+
+See the [CONTRIBUTING.md](CONTRIBUTING.md) file for details on how to contribute.
+
 ## Roadmap
 
 - [x] add CLI for summarization of all text files in a directory
@@ -133,6 +163,7 @@ This will start a local server that you can access in your browser & a shareable
 - [x] put on pypi
 - [ ] optimum inference integration, LLM.int8 inference
 - [ ] better documentation [in the wiki](https://github.com/pszemraj/textsum/wiki), details on improving performance (speed, quality, memory usage, etc.)
+- [ ] improvements to OCR helper module
 
 _Other ideas? Open an issue or PR!_
 

diff --git a/src/textsum/cli.py b/src/textsum/cli.py
@@ -233,7 +233,7 @@ def main(args):
         )
 
     logging.info(f"finished summarization loop - output dir: {output_dir.resolve()}")
-    summarizer.save_params(output_dir=output_dir, hf_tag=args.model_name)
+    summarizer.save_params(output_path=output_dir, hf_tag=args.model_name)
     logging.info("finished summarizing files")
 
 

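Note on the `cli.py` change: the call site has to change together with the rename of the `save_params` keyword in `summarize.py`, because Python rejects keyword arguments that a function does not declare. A minimal sketch of that failure mode follows; `OldSummarizer`, `NewSummarizer`, and `call_with_old_keyword` are hypothetical stand-ins and not part of `textsum`.

```python
from pathlib import Path


class OldSummarizer:
    """Hypothetical stand-in for the class before this commit."""

    def save_params(self, output_dir=None, hf_tag=None):
        return Path(output_dir) if output_dir is not None else Path.cwd()


class NewSummarizer:
    """Hypothetical stand-in for the class after this commit."""

    def save_params(self, output_path=None, hf_tag=None):
        return Path(output_path) if output_path is not None else Path.cwd()


def call_with_old_keyword(summarizer) -> bool:
    """Return True if the pre-commit call style (output_dir=...) still works."""
    try:
        summarizer.save_params(output_dir="out", hf_tag="some/model")
        return True
    except TypeError:
        # raised because `output_dir` is no longer a declared parameter
        return False
```

Any external code that calls `save_params(output_dir=...)` by keyword would need the same update as `cli.py`; positional calls are unaffected by the rename.
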
diff --git a/src/textsum/summarize.py b/src/textsum/summarize.py
@@ -138,6 +138,10 @@ def get_inference_params(self):
         """get the inference parameters currently being used"""
         return self.inference_params
 
+    def update_loglevel(self, loglevel: int = logging.INFO):
+        """update the loglevel of the logger"""
+        self.logger.setLevel(loglevel)
+
     def summarize_and_score(self, ids, mask, **kwargs):
         """
         summarize_and_score - summarize a batch of text and return the summary and output scores
@@ -157,6 +161,9 @@ def summarize_and_score(self, ids, mask, **kwargs):
         # put global attention on <s> token
         global_attention_mask[:, 0] = 1
 
+        self.logger.debug(
+            f"generating summary for batch of size {input_ids.shape} with {kwargs}"
+        )
         if self.is_general_attention_model:
             summary_pred_ids = self.model.generate(
                 input_ids,
@@ -180,6 +187,7 @@ def summarize_and_score(self, ids, mask, **kwargs):
             skip_special_tokens=True,
             remove_invalid_values=True,
         )
+        self.logger.debug(f"summary: {summary}")
         score = round(summary_pred_ids.sequences_scores.cpu().numpy()[0], 4)
 
         return summary, score
@@ -200,15 +208,14 @@ def summarize_via_tokenbatches(
         :return: a list of summaries, a list of scores, and a list of the input text for each batch
         """
 
-        logger = logging.getLogger(__name__)
         # log all input parameters
         if batch_length and batch_length < 512:
-            logger.warning(
+            self.logger.warning(
                 f"WARNING: entered batch_length was too low at {batch_length}, resetting to 512"
             )
             batch_length = 512
 
-        logger.debug(
+        self.logger.debug(
             f"batch_length: {batch_length} batch_stride: {batch_stride}, kwargs: {kwargs}"
         )
         if kwargs:
@@ -246,7 +253,7 @@ def summarize_via_tokenbatches(
                 "summary_score": score,
             }
             gen_summaries.append(_sum)
-            logger.debug(f"\n\t{result[0]}\nScore:\t{score}")
+            self.logger.debug(f"\n\t{result[0]}\nScore:\t{score}")
             pbar.update()
 
         pbar.close()
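
Note on the logging changes: the hunks above drop the function-local `logging.getLogger(__name__)` in favor of an instance-level `self.logger`, and add `update_loglevel` to adjust it. The constructor is not part of this diff, so the sketch below assumes `self.logger` is created in `__init__`; `SummarizerSketch` and `clamp_batch_length` are hypothetical names used only for illustration.

```python
import logging
from typing import Optional


class SummarizerSketch:
    """Hypothetical class illustrating a per-instance logger."""

    def __init__(self, loglevel: int = logging.INFO):
        # assumption: the real class creates self.logger in its constructor
        self.logger = logging.getLogger(f"{__name__}.SummarizerSketch")
        self.logger.setLevel(loglevel)

    def update_loglevel(self, loglevel: int = logging.INFO):
        """update the loglevel of the logger"""
        self.logger.setLevel(loglevel)

    def clamp_batch_length(self, batch_length: Optional[int] = None) -> Optional[int]:
        """Mirror the 512-token floor applied in summarize_via_tokenbatches."""
        if batch_length and batch_length < 512:
            # f-string so the rejected value actually appears in the message
            self.logger.warning(
                f"batch_length was too low at {batch_length}, resetting to 512"
            )
            batch_length = 512
        return batch_length
```

With this layout, `update_loglevel(logging.DEBUG)` makes the `self.logger.debug(...)` calls pass the logger's level check; whether they are displayed still depends on the handlers configured by the application.
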
@@ -374,10 +381,12 @@ def summarize_file(
         **kwargs,
     ) -> Path:
         """
-        summarize_file - a function that takes a text file and returns a summary
+        summarize_file - summarize a text file and save the summary to a file
 
         :param str or Path file_path: the path to the text file
         :param str or Path output_dir: the directory to save the summary to, defaults to None (current working directory)
+        :param int batch_length: number of tokens to use in each batch, defaults to None (self.token_batch_length)
+        :param int batch_stride: number of tokens to stride between batches, defaults to None (self.batch_stride)
         :param bool lowercase: whether to lowercase the text prior to summarization, defaults to False
 
         :return Path: the path to the summary file
@@ -406,22 +415,26 @@ def summarize_file(
 
     def save_params(
         self,
-        output_dir: str or Path = None,
+        output_path: str or Path = None,
         hf_tag: str = None,
         verbose: bool = False,
     ) -> None:
         """
         save_params - save the parameters of the run to a json file
 
         :param dict params: parameters to save
-        :param str or Path output_dir: directory to save the parameters to
+        :param str or Path output_path: directory or filepath to save the parameters to
         :param str hf_tag: the model tag on huggingface (will be used instead of self.model_name_or_path)
         :param bool verbose: whether to log the parameters
 
         :return: None
         """
-        output_dir = Path(output_dir) if output_dir is not None else Path.cwd()
-        metadata_path = output_dir / "summarization_parameters.json"
+        output_path = Path(output_path) if output_path is not None else Path.cwd()
+        metadata_path = (
+            output_path / "summarization_parameters.json"
+            if output_path.is_dir()
+            else output_path
+        )  # if output_path is a file, use that, otherwise use the default name
 
         exported_params = self.get_inference_params().copy()
         exported_params["META_huggingface_model"] = (
@@ -436,3 +449,4 @@ def save_params(
         logging.debug(f"Saved parameters to {metadata_path}")
         if verbose:
             self.logger.info(f"parameters: {exported_params}")
+            print(f"saved parameters to {metadata_path}")
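
Note on the new `output_path` handling in `save_params`: the path resolution can be isolated as a small standalone function. This is a sketch of the logic shown in the diff, not the library's API; `resolve_metadata_path` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_metadata_path(output_path: Optional[Union[str, Path]] = None) -> Path:
    """Pick the JSON file that the run parameters would be written to."""
    # fall back to the current working directory when nothing is given
    output_path = Path(output_path) if output_path is not None else Path.cwd()
    # existing directory -> default file name inside it; anything else -> used as-is
    return (
        output_path / "summarization_parameters.json"
        if output_path.is_dir()
        else output_path
    )
```

One consequence worth keeping in mind: `Path.is_dir()` returns `False` for a path that does not exist, so a directory that has not been created yet is treated as a file path rather than as a directory to place the default file in.
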