Documentation & Updates (#5)
- add colab notebook etc to README
- various usability enhancements

Signed-off-by: Peter <[email protected]>
pszemraj authored Jan 21, 2023
1 parent 7405685 commit 419eb3b
Showing 4 changed files with 98 additions and 75 deletions.
102 changes: 40 additions & 62 deletions CONTRIBUTING.md
@@ -1,29 +1,3 @@
```{todo} THIS IS SUPPOSED TO BE AN EXAMPLE. MODIFY IT ACCORDING TO YOUR NEEDS!
The document assumes you are using a source repository service that promotes a
contribution model similar to [GitHub's fork and pull request workflow].
While this is true for the majority of services (like GitHub, GitLab,
BitBucket), it might not be the case for private repositories (e.g., when
using Gerrit).
Also notice that the code examples might refer to GitHub URLs or the text
might use GitHub specific terminology (e.g., *Pull Request* instead of *Merge
Request*).
Please make sure to check the document having these assumptions in mind
and update things accordingly.
```

```{todo} Provide the correct links/replacements at the bottom of the document.
```

```{todo} You might want to have a look on [PyScaffold's contributor's guide],
especially if your project is open source. The text should be very similar to
this template, but there are a few extra contents that you might decide to
also include, like mentioning labels of your issue tracker or automated
releases.
```

# Contributing

@@ -41,6 +15,25 @@ considerate, reasonable, and respectful**. When in doubt,
[Python Software Foundation's Code of Conduct] is a good reference in terms of
behavior guidelines.

---

- [Contributing](#contributing)
- [Issue Reports](#issue-reports)
- [Documentation Improvements](#documentation-improvements)
- [creating pyscaffold-compatible documentation](#creating-pyscaffold-compatible-documentation)
- [Working on the documentation](#working-on-the-documentation)
- [Code Contributions](#code-contributions)
- [Submit an issue](#submit-an-issue)
- [Create an environment](#create-an-environment)
- [Clone the repository](#clone-the-repository)
- [Implement your changes](#implement-your-changes)
- [Submit your contribution](#submit-your-contribution)
- [Troubleshooting](#troubleshooting)
- [Maintainer tasks](#maintainer-tasks)
- [Releases](#releases)

---

## Issue Reports

If you experience bugs or general issues with `textsum`, please have a look
@@ -62,47 +55,43 @@ you help us to identify the root cause of the issue.
## Documentation Improvements

You can help improve `textsum` docs by making them more readable and coherent, or
by adding missing information and correcting mistakes.
by adding missing information and correcting mistakes. Currently, this is easy as there is no official documentation. The README.md file is the only documentation, outside of the [wiki]. If you want to improve it, please do so and submit a pull request.

`textsum` documentation uses [Sphinx] as its main documentation compiler.
This means that the docs are kept in the same repository as the project code, and
that any documentation update is done in the same way as a code contribution.
### creating pyscaffold-compatible documentation

```{todo} Don't forget to mention which markup language you are using.
First, install [pyscaffoldext-markdown] and [pyscaffoldext-sphinx] extensions (as well as all other extensions):

e.g., [reStructuredText] or [CommonMark] with [MyST] extensions.
```bash
pip install pyscaffold[all]
```

```{todo} If your project is hosted on GitHub, you can also mention the following tip:
:::{tip}
Please note that the [GitHub web interface] provides a quick way to
propose changes to `textsum`'s files. While this mechanism can
be tricky for normal code contributions, it works perfectly fine for
contributing to the docs, and can be quite handy.
If you are interested in trying this method out, please navigate to
the `docs` folder in the source [repository], find the file you
would like to change, and click the little pencil icon at the
top to open [GitHub's code editor]. Once you finish editing the file,
please write a message in the form at the bottom of the page describing
which changes you have made and the motivations behind them, then
submit your proposal.
:::
Then, clone this repo and update the documentation:

```bash
git clone https://github.com/pszemraj/textsum.git
putup textsum --force --markdown
```

This will create a new directory `docs` with the documentation. You can now edit the files in `docs` and commit the changes.

### Working on the documentation

When working on documentation changes in your local machine, you can
compile them using [tox] :

```
tox -e docs
```

and use Python's built-in web server for a preview in your web browser
(`http://localhost:8000`):

```
python3 -m http.server --directory 'docs/_build/html'
```

## Code Contributions
@@ -335,37 +324,26 @@ on [PyPI], the following steps can be used to release a new version for
to collectively create software are general and can be applied to all sorts
of environments, including private companies and proprietary code bases.


[black]: https://pypi.org/project/black/
[commonmark]: https://commonmark.org/
[contribution-guide.org]: http://www.contribution-guide.org/
[creating a pr]: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request
[descriptive commit message]: https://chris.beams.io/posts/git-commit
[docstrings]: https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html
[first-contributions tutorial]: https://github.com/firstcontributions/first-contributions
[flake8]: https://flake8.pycqa.org/en/stable/
[git]: https://git-scm.com
[github web interface]: https://docs.github.com/en/github/managing-files-in-a-repository/managing-files-on-github/editing-files-in-your-repository
[github's code editor]: https://docs.github.com/en/github/managing-files-in-a-repository/managing-files-on-github/editing-files-in-your-repository
[github's fork and pull request workflow]: https://guides.github.com/activities/forking/
[guide created by freecodecamp]: https://github.com/freecodecamp/how-to-contribute-to-open-source
[miniconda]: https://docs.conda.io/en/latest/miniconda.html
[myst]: https://myst-parser.readthedocs.io/en/latest/syntax/syntax.html
[other kinds of contributions]: https://opensource.guide/how-to-contribute
[pre-commit]: https://pre-commit.com/
[pypi]: https://pypi.org/
[pyscaffold's contributor's guide]: https://pyscaffold.org/en/stable/contributing.html
[pytest can drop you]: https://docs.pytest.org/en/stable/usage.html#dropping-to-pdb-python-debugger-at-the-start-of-a-test
[python software foundation's code of conduct]: https://www.python.org/psf/conduct/
[restructuredtext]: https://www.sphinx-doc.org/en/master/usage/restructuredtext/
[sphinx]: https://www.sphinx-doc.org/en/master/
[tox]: https://tox.readthedocs.io/en/stable/
[virtual environment]: https://realpython.com/python-virtual-environments-a-primer/
[virtualenv]: https://virtualenv.pypa.io/en/stable/


```{todo} Please review and change the following definitions:
```

[repository]: https://github.com/<USERNAME>/textsum
[issue tracker]: https://github.com/<USERNAME>/textsum/issues
[repository]: https://github.com/pszemraj/textsum
[issue tracker]: https://github.com/pszemraj/textsum/issues
[wiki]: https://github.com/pszemraj/textsum/wiki
37 changes: 34 additions & 3 deletions README.md
@@ -12,11 +12,35 @@

# textsum

<a href="https://colab.research.google.com/gist/pszemraj/ff8a8486dc3303199fe9c9790a606fff/textsum-summarize-text-files-example.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a href="https://pypi.org/project/textsum/"> <img src="https://img.shields.io/pypi/v/textsum.svg" alt="PyPI-Server"/></a>

<br>

> utility for using transformers summarization models on text docs
The purpose of this package is to provide a simple interface (python API, CLI, gradio web UI) for using summarization models on text documents of arbitrary length.
This package provides easy-to-use interfaces for using summarization models on text documents of arbitrary length. Currently implemented interfaces include a Python API, a CLI, and a shareable demo app.

For details, explanations, and docs, see the [wiki](https://github.com/pszemraj/textsum/wiki)

⚠️ _This is a WIP, but general functionality is available_ ⚠️

---

⚠️ **WARNING**: _This package is a WIP and is not ready for production use. Some things may not work yet._ ⚠️
- [textsum](#textsum)
- [Installation](#installation)
- [Full Installation](#full-installation)
- [Additional Details](#additional-details)
- [Usage](#usage)
- [Python API](#python-api)
- [CLI](#cli)
- [Demo App](#demo-app)
- [Contributing](#contributing)
- [Roadmap](#roadmap)

---

## Installation

@@ -27,7 +51,7 @@ Install using pip:
pip install textsum
```

The `textsum` package is now installed in your virtual environment. You can now use the CLI or python API to summarize text docs see the [Usage](#usage) section for more details.
The `textsum` package is now installed in your virtual environment. The CLI commands and Python API can now be used to summarize text docs from anywhere. See the [Usage](#usage) section for more details.

### Full Installation

@@ -125,6 +149,12 @@ This will start a local server that you can access in your browser & a shareable

---

## Contributing

Contributions are welcome! Please open an issue or PR if you have any ideas or suggestions.

See the [CONTRIBUTING.md](CONTRIBUTING.md) file for details on how to contribute.

## Roadmap

- [x] add CLI for summarization of all text files in a directory
@@ -133,6 +163,7 @@ This will start a local server that you can access in your browser & a shareable
- [x] put on pypi
- [ ] optimum inference integration, LLM.int8 inference
- [ ] better documentation [in the wiki](https://github.com/pszemraj/textsum/wiki), details on improving performance (speed, quality, memory usage, etc.)
- [ ] improvements to OCR helper module

_Other ideas? Open an issue or PR!_

2 changes: 1 addition & 1 deletion src/textsum/cli.py
@@ -233,7 +233,7 @@ def main(args):
)

logging.info(f"finished summarization loop - output dir: {output_dir.resolve()}")
summarizer.save_params(output_dir=output_dir, hf_tag=args.model_name)
summarizer.save_params(output_path=output_dir, hf_tag=args.model_name)
logging.info("finished summarizing files")
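The one-line change in `cli.py` follows from the parameter rename in `save_params`: Python binds keyword arguments by name, so a call that still passed `output_dir=` would fail once the signature uses `output_path`. A minimal sketch of that behavior, using a hypothetical stand-in function `save_params_stub` that is not part of `textsum` and mirrors only the renamed keyword:

```python
from pathlib import Path


def save_params_stub(output_path=None, hf_tag=None):
    """Hypothetical stand-in that mirrors only the renamed keyword."""
    return Path(output_path) if output_path is not None else Path.cwd()


# The renamed keyword is accepted.
new_style = save_params_stub(output_path="out", hf_tag="some/model")

# The old keyword no longer matches any parameter and raises TypeError.
try:
    save_params_stub(output_dir="out", hf_tag="some/model")
    old_keyword_error = None
except TypeError as err:
    old_keyword_error = err
```

Any other caller that passes `output_dir=` by keyword would need the same update; positional callers are unaffected.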


32 changes: 23 additions & 9 deletions src/textsum/summarize.py
@@ -138,6 +138,10 @@ def get_inference_params(self):
"""get the inference parameters currently being used"""
return self.inference_params

def update_loglevel(self, loglevel: int = logging.INFO):
"""update the loglevel of the logger"""
self.logger.setLevel(loglevel)

def summarize_and_score(self, ids, mask, **kwargs):
"""
summarize_and_score - summarize a batch of text and return the summary and output scores
@@ -157,6 +161,9 @@ def summarize_and_score(self, ids, mask, **kwargs):
# put global attention on <s> token
global_attention_mask[:, 0] = 1

self.logger.debug(
f"generating summary for batch of size {input_ids.shape} with {kwargs}"
)
if self.is_general_attention_model:
summary_pred_ids = self.model.generate(
input_ids,
@@ -180,6 +187,7 @@ def summarize_and_score(self, ids, mask, **kwargs):
skip_special_tokens=True,
remove_invalid_values=True,
)
self.logger.debug(f"summary: {summary}")
score = round(summary_pred_ids.sequences_scores.cpu().numpy()[0], 4)

return summary, score
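The logging changes in these hunks rely on an instance-level logger whose level can be changed at runtime. A minimal sketch of that pattern, assuming a hypothetical `SummarizerStub` class that keeps a `logging.Logger` in `self.logger` (how the real class creates its logger is outside this diff, and the logger name here is an assumption):

```python
import logging


class SummarizerStub:
    """Hypothetical stand-in; only the logger handling is modelled."""

    def __init__(self):
        # Assumed setup: the real class configures its logger elsewhere.
        self.logger = logging.getLogger("summarizer_stub")
        self.logger.setLevel(logging.INFO)

    def update_loglevel(self, loglevel: int = logging.INFO):
        """update the loglevel of the logger"""
        self.logger.setLevel(loglevel)


stub = SummarizerStub()
# At INFO level, debug messages are filtered out.
debug_before = stub.logger.isEnabledFor(logging.DEBUG)
stub.update_loglevel(logging.DEBUG)
# After lowering the level, debug messages pass the logger's level check.
debug_after = stub.logger.isEnabledFor(logging.DEBUG)
```

With this pattern, the `self.logger.debug(...)` calls added in the diff stay silent by default and can be switched on per instance.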
@@ -200,15 +208,14 @@ def summarize_via_tokenbatches(
:return: a list of summaries, a list of scores, and a list of the input text for each batch
"""

logger = logging.getLogger(__name__)
# log all input parameters
if batch_length and batch_length < 512:
logger.warning(
self.logger.warning(
f"WARNING: entered batch_length was too low at {batch_length}, resetting to 512"
)
batch_length = 512

logger.debug(
self.logger.debug(
f"batch_length: {batch_length} batch_stride: {batch_stride}, kwargs: {kwargs}"
)
if kwargs:
@@ -246,7 +253,7 @@ def summarize_via_tokenbatches(
"summary_score": score,
}
gen_summaries.append(_sum)
logger.debug(f"\n\t{result[0]}\nScore:\t{score}")
self.logger.debug(f"\n\t{result[0]}\nScore:\t{score}")
pbar.update()

pbar.close()
@@ -374,10 +381,12 @@ def summarize_file(
**kwargs,
) -> Path:
"""
summarize_file - a function that takes a text file and returns a summary
summarize_file - summarize a text file and save the summary to a file
:param str or Path file_path: the path to the text file
:param str or Path output_dir: the directory to save the summary to, defaults to None (current working directory)
:param int batch_length: number of tokens to use in each batch, defaults to None (self.token_batch_length)
:param int batch_stride: number of tokens to stride between batches, defaults to None (self.batch_stride)
:param bool lowercase: whether to lowercase the text prior to summarization, defaults to False
:return Path: the path to the summary file
@@ -406,22 +415,26 @@ def summarize_file(

def save_params(
self,
output_dir: str or Path = None,
output_path: str or Path = None,
hf_tag: str = None,
verbose: bool = False,
) -> None:
"""
save_params - save the parameters of the run to a json file
:param dict params: parameters to save
:param str or Path output_dir: directory to save the parameters to
:param str or Path output_path: directory or filepath to save the parameters to
:param str hf_tag: the model tag on huggingface (will be used instead of self.model_name_or_path)
:param bool verbose: whether to log the parameters
:return: None
"""
output_dir = Path(output_dir) if output_dir is not None else Path.cwd()
metadata_path = output_dir / "summarization_parameters.json"
output_path = Path(output_path) if output_path is not None else Path.cwd()
metadata_path = (
output_path / "summarization_parameters.json"
if output_path.is_dir()
else output_path
) # if output_path is a file, use that, otherwise use the default name

exported_params = self.get_inference_params().copy()
exported_params["META_huggingface_model"] = (
@@ -436,3 +449,4 @@ def save_params(
logging.debug(f"Saved parameters to {metadata_path}")
if verbose:
self.logger.info(f"parameters: {exported_params}")
print(f"saved parameters to {metadata_path}")
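The path selection that the new `save_params` performs can be looked at in isolation. The sketch below copies that logic into a standalone helper; the name `resolve_metadata_path` and the example file names are assumptions made for illustration and do not appear in `textsum`:

```python
import tempfile
from pathlib import Path


def resolve_metadata_path(output_path=None) -> Path:
    """Mirror the path selection shown in the diff (hypothetical helper name)."""
    output_path = Path(output_path) if output_path is not None else Path.cwd()
    # An existing directory gets the default file name; anything else is used as-is.
    return (
        output_path / "summarization_parameters.json"
        if output_path.is_dir()
        else output_path
    )


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    expected_default = tmp_dir / "summarization_parameters.json"
    from_dir = resolve_metadata_path(tmp_dir)
    from_file = resolve_metadata_path(tmp_dir / "run_01.json")
    from_missing_dir = resolve_metadata_path(tmp_dir / "not_created_yet")
```

Because `Path.is_dir()` returns `False` for paths that do not exist, a directory that has not been created yet is treated as a file path in this sketch, so callers that want the default file name would need to create the directory first.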
