nlp.pipe(..., n_process>1) won't return if wrapped by tqdm() and zip() #8798

Open
ivyleavedtoadflax opened this issue Jul 23, 2021 · 4 comments
Labels
bug Bugs and behaviour differing from documentation scaling Scaling, serving and parallelizing spaCy

Comments

@ivyleavedtoadflax
Contributor

ivyleavedtoadflax commented Jul 23, 2021

This is a pretty specific set of circumstances that I discovered today, but on the off chance that it is useful to someone, here it is.

If you pass the output of nlp.pipe(..., n_process>1) to zip() and wrap that zip() in tqdm(), the loop hangs indefinitely. See below.

How to reproduce the behaviour

#!/usr/bin/env python

import pandas as pd
import spacy
from tqdm import tqdm

data = [
    {"text": "I just wanna tell you how I'm feeling", "id": 0},
    {"text": "Gotta make you understand", "id": 1},
    {"text": "Never gonna give you up", "id": 2},
    {"text": "Never gonna let you down", "id": 3},
    {"text": "Never gonna run around and desert you", "id": 4},
    {"text": "Never gonna make you cry", "id": 5},
    {"text": "Never gonna say goodbye", "id": 6},
    {"text": "Never gonna tell a lie and hurt you", "id": 7},
]


df = pd.DataFrame(data)

nlp = spacy.load("en_core_web_md")

# Works with a single process

for id, doc in tqdm(
    zip(
        df["id"],
        nlp.pipe(
            df["text"],
            n_process=1,
        ),
    )
):

    print(id, doc.text)


# Works with no zip and multiple processes

for doc in tqdm(
    nlp.pipe(
        df["text"],
        n_process=2,
    ),
):

    print(doc.text)

# Hangs with multiple processes and zip

for id, doc in tqdm(
    zip(
        df["id"],
        nlp.pipe(
            df["text"],
            n_process=2,
        ),
    )
):

    print(id, doc.text)

Output:

$ python script.py
0it [00:00, ?it/s]0 I just wanna tell you how I'm feeling
1 Gotta make you understand
2 Never gonna give you up
3 Never gonna let you down
4 Never gonna run around and desert you
5 Never gonna make you cry
6 Never gonna say goodbye
7 Never gonna tell a lie and hurt you
8it [00:00, 977.21it/s]
0it [00:00, ?it/s]I just wanna tell you how I'm feeling
Gotta make you understand
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
8it [00:00, 345.82it/s]
0it [00:00, ?it/s]0 I just wanna tell you how I'm feeling
1 Gotta make you understand
2 Never gonna give you up
3 Never gonna let you down
4 Never gonna run around and desert you
5 Never gonna make you cry
6 Never gonna say goodbye
7 Never gonna tell a lie and hurt you
8it [00:00, 342.59it/s]

Your Environment

Info about spaCy

  • spaCy version: 3.0.1
  • Platform: Linux-5.8.0-50-generic-x86_64-with-glibc2.29
  • Python version: 3.8.5
  • Pipelines: en_ner_bc5cdr_md (0.4.0), en_core_web_md (3.0.0), en_ner_craft_md (0.4.0), en_ner_bionlp13cg_md (0.4.0), en_ner_jnlpba_md (0.4.0)
@ivyleavedtoadflax ivyleavedtoadflax changed the title nlp.pipe won't return if wrapped by tqdm and zipper nlp.pipe won't return if wrapped by tqdm and zip Jul 23, 2021
@ivyleavedtoadflax ivyleavedtoadflax changed the title nlp.pipe won't return if wrapped by tqdm and zip nlp.pipe(..., n_process>1) won't return if wrapped by tqdm and zip Jul 23, 2021
@ivyleavedtoadflax ivyleavedtoadflax changed the title nlp.pipe(..., n_process>1) won't return if wrapped by tqdm and zip nlp.pipe(..., n_process>1) won't return if wrapped by tqdm() and zip() Jul 23, 2021
@adrianeboyd
Contributor

Thanks for the report, I can reproduce the behavior where it hangs.

As a workaround, I think it works if you wrap tqdm around the texts rather than around zip:

for i, doc in zip(ids, nlp.pipe(tqdm(texts), n_process=2)):
    print(doc)
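
A related way to keep ids attached to docs without an outer zip() is spaCy's `as_tuples=True` option on `nlp.pipe`, which accepts `(text, context)` pairs and yields `(doc, context)` pairs. The sketch below is only an illustration, not something confirmed in this thread to avoid the hang: the helper name `pipe_with_ids` and its `progress` parameter are hypothetical, and the records are assumed to be dicts with "text" and "id" keys as in the reproduction script. The progress wrapper is applied to the input list, which has a length, in line with the workaround above.

```python
def pipe_with_ids(nlp, records, n_process=2, progress=None):
    """Yield (id, doc) pairs for records shaped like {"text": ..., "id": ...}.

    Hypothetical helper: `nlp` is any object exposing spaCy's
    `pipe(..., as_tuples=True, n_process=...)` interface, and `progress`
    is an optional wrapper such as tqdm, applied to the input list
    rather than to the output generator.
    """
    # Build (text, context) pairs; the context travels alongside each doc.
    pairs = [(record["text"], record["id"]) for record in records]

    # Wrap the sized input, not the generator returned by nlp.pipe.
    source = progress(pairs) if progress is not None else pairs

    for doc, record_id in nlp.pipe(source, as_tuples=True, n_process=n_process):
        yield record_id, doc
```

With spaCy and tqdm installed, usage would look like `for record_id, doc in pipe_with_ids(nlp, data, progress=tqdm): print(record_id, doc.text)`, where `nlp` and `data` are the objects from the reproduction script.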

@adrianeboyd adrianeboyd added bug Bugs and behaviour differing from documentation scaling Scaling, serving and parallelizing spaCy labels Jul 26, 2021
@ivyleavedtoadflax
Contributor Author

ah nice, thanks @adrianeboyd

@adrianeboyd
Contributor

As a note, I've marked this as a bug because it shouldn't hang like this, but since there's an easy workaround it's going to be pretty low priority for us to fix.

Maybe some of the changes related to error handling have caused this? I'm not sure. In any case, it's better to use tqdm on something with a length rather than a generator.
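
To make the point about length concrete: a list supports len(), while a zip object or a generator does not, so a progress bar wrapped around them cannot infer a total (which is why the outputs above show "8it" rather than "8/8"). The snippet below is a small stdlib-only sketch; `length_or_none` is a hypothetical helper written for illustration and is not part of tqdm or spaCy.

```python
def length_or_none(iterable):
    """Return len(iterable) if it is sized, otherwise None."""
    try:
        return len(iterable)
    except TypeError:
        # zip objects and generators do not implement __len__
        return None


texts = ["Never gonna give you up", "Never gonna let you down"]
ids = [2, 3]

print(length_or_none(texts))                   # 2
print(length_or_none(zip(ids, texts)))         # None
print(length_or_none(text for text in texts))  # None
```

If a progress bar does have to wrap a generator, tqdm accepts an explicit `total=` argument, for example `tqdm(some_generator, total=len(texts))`; whether that changes the hanging behaviour described in this issue is not established here.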

@adrianeboyd
Contributor

As a note, we've seen that tqdm can run into deadlocks when errors are raised during the loop. With Python 3.12 you can also see the new deprecation warning about using fork while threads are alive: https://discuss.python.org/t/concerns-regarding-deprecation-of-fork-with-alive-threads/33555

The spaCy test suite would hang on all OSes with Python 3.12 prior to 467c824. (This commit is just a workaround for the test suite / common use cases. It doesn't fix the underlying issue with deadlocks and tqdm.)
