This repository has been archived by the owner on Nov 22, 2022. It is now read-only.

where is _round_robin_process_groups? #1232

Open
jiangxiluning opened this issue Jan 19, 2020 · 8 comments

Comments

@jiangxiluning

_round_robin_process_group = dist_c10d._round_robin_process_groups(

@jiangxiluning jiangxiluning changed the title whese is _round_robin_process_groups? where is _round_robin_process_groups? Jan 19, 2020
@mwu1993
Contributor

mwu1993 commented Jan 21, 2020

The diff was 151be72; either @chenyangyu1988 or the PyTorch docs might be able to help further.

@jiangxiluning
Author

 _round_robin_process_group = dist_c10d._round_robin_process_groups

My PyTorch is 1.3.1, but I cannot find the _round_robin_process_groups method.
@mwu1993

@jiangxiluning
Author

@chenyangyu1988 you import '_round_robin_process_groups' from torch, but it does not exist there.

(pytext-nlp) luning@luning-mate:~/dev/tools/pytext$ pytext train < demo/configs/distributed_docnn.json 
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
No config file specified, reading from stdin
WARNING - Applying old config adapter for version=0. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=1. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=2. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=3. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=4. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=5. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=6. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=7. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=8. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=9. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=10. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=11. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=12. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=13. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=14. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=15. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=16. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=17. Please consider migrating your old configs to the latest version.
WARNING - Applying old config adapter for version=18. Please consider migrating your old configs to the latest version.

===Starting training...

=== Starting training, World size is 2
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])

Parameters: PyTextConfig:
    auto_resume_from_snapshot: False
    debug_path: /tmp/model.debug
    distributed_world_size: 2
    export_caffe2_path: None
    export_onnx_path: /tmp/model.onnx
    export_torchscript_path: None
    gpu_streams_for_distributed_training: 1
    include_dirs: None
    load_snapshot_path: 
    modules_save_dir: 
    random_seed: None
    report_eval_results: False
    save_all_checkpoints: False
    save_module_checkpoints: False
    save_snapshot_path: /tmp/model.pt
    task: DocumentClassificationTask.Config:
        data: Data.Config:
            batcher: PoolingBatcher.Config:
                eval_batch_size: 16
                num_shuffled_pools: 1
                pool_num_batches: 10000
                test_batch_size: 16
                train_batch_size: 16
            in_memory: True
            sort_key: None
            source: TSVDataSource.Config:
                column_mapping: {}
                delimiter: 	
                drop_incomplete_rows: False
                eval_filename: base_dir/test_tiny.tsv
                field_names: ['text', 'doc_label']
                quoted: False
                test_filename: base_dir/test_tiny.tsv
                train_filename: base_dir/train_tiny.tsv
        metric_reporter: ClassificationMetricReporter.Config:
            additional_column_names: []
            model_select_metric: ComparableClassificationMetric.ACCURACY
            output_path: /tmp/test_out.txt
            pep_format: False
            recall_at_precision_thresholds: [0.2, 0.4, 0.6, 0.8, 0.9]
            target_label: None
            text_column_names: ['text']
        model: DocModel.Config:
            decoder: MLPDecoder.Config:
                activation: Activation.RELU
                dropout: 0.0
                freeze: False
                hidden_dims: []
                layer_norm: False
                load_path: None
                out_dim: None
                save_path: None
                shared_module_key: None
            embedding: WordEmbedding.Config:
                cpu_only: False
                delimiter:  
                embed_dim: 100
                embedding_init_range: None
                embedding_init_strategy: EmbedInitStrategy.RANDOM
                export_input_names: ['tokens_vals']
                freeze: False
                load_path: None
                lowercase_tokens: True
                min_freq: 1
                mlp_layer_dims: []
                padding_idx: None
                pretrained_embeddings_path: 
                save_path: None
                shared_module_key: None
                skip_header: True
                vocab_file: 
                vocab_from_all_data: False
                vocab_from_pretrained_embeddings: False
                vocab_from_train_data: True
                vocab_size: 0
            inputs: ModelInput:
                dense: None
                labels: LabelTensorizer.Config:
                    allow_unknown: False
                    column: doc_label
                    is_input: False
                    label_vocab: None
                    pad_in_vocab: False
                tokens: TokenTensorizer.Config:
                    add_bos_token: False
                    add_eos_token: False
                    column: text
                    is_input: True
                    max_seq_len: None
                    tokenizer: Tokenizer.Config:
                        lowercase: True
                        split_regex: \s+
                    use_eos_token_for_bos: False
                    vocab: VocabConfig:
                        build_from_data: True
                        size_from_data: 0
                        vocab_files: []
                    vocab_file_delimiter:  
            output_layer: ClassificationOutputLayer.Config:
                freeze: False
                label_weights: None
                load_path: None
                loss: CrossEntropyLoss.Config:
                save_path: None
                shared_module_key: None
            representation: BiLSTMDocAttention.Config:
                dropout: 0.4
                freeze: False
                load_path: None
                lstm: BiLSTM.Config:
                    bidirectional: True
                    dropout: 0.4
                    freeze: False
                    load_path: None
                    lstm_dim: 32
                    num_layers: 1
                    pack_sequence: True
                    save_path: None
                    shared_module_key: None
                mlp_decoder: None
                pooling: SelfAttention.Config:
                    attn_dimension: 64
                    dropout: 0.4
                save_path: None
                shared_module_key: None
        trainer: TaskTrainer.Config:
            do_eval: True
            early_stop_after: 0
            epochs: 10
            fp16_args: FP16OptimizerFairseq.Config:
                init_loss_scale: 128
                min_loss_scale: 0.0001
                scale_tolerance: 0.0
                scale_window: None
                threshold_loss_scale: None
            load_best_model_after_train: True
            max_clip_norm: None
            num_accumulated_batches: 1
            num_batches_per_epoch: None
            num_samples_to_log_progress: 1000
            optimizer: Adam.Config:
                eps: 1e-08
                lr: 0.001
                weight_decay: 1e-05
            report_train_metrics: True
            scheduler: None
            sparsifier: None
            target_time_limit_seconds: None
    test_out_path: /tmp/test_out.txt
    torchscript_quantize: False
    use_config_from_snapshot: True
    use_cuda_for_testing: True
    use_cuda_if_available: True
    use_deterministic_cudnn: False
    use_fp16: False
    use_tensorboard: True
    version: 19


        # for debug of GPU
        use_cuda_if_available: True
        device_id: 0
        world_size: 2
        torch.cuda.is_available(): True
        cuda.CUDA_ENABLED: True
        cuda.DISTRIBUTED_WORLD_SIZE: 2
        
# for debug of FP16: fp16_enabled=False
Traceback (most recent call last):
  File "/home/luning/.pyenv/versions/pytext-nlp/bin/pytext", line 8, in <module>
    sys.exit(main())
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/pytext/main.py", line 369, in train
    train_model_distributed(config, metric_channels)
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/pytext/main.py", line 91, in train_model_distributed
    config.distributed_world_size,
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/pytext/main.py", line 114, in run_single
    metadata=metadata,
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/pytext/workflow.py", line 101, in train_model
    config, dist_init_url, device_id, rank, world_size, metric_channels, metadata
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/pytext/workflow.py", line 130, in prepare_task
    config.gpu_streams_for_distributed_training,
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/pytext/workflow.py", line 73, in _set_distributed
    rank, world_size, dist_init_url, device_id, gpu_streams=gpu_streams
  File "/home/luning/.pyenv/versions/3.7.5/envs/pytext-nlp/lib/python3.7/site-packages/pytext/utils/distributed.py", line 42, in dist_init
    _round_robin_process_group = dist_c10d._round_robin_process_groups(
AttributeError: module 'torch.distributed' has no attribute '_round_robin_process_groups'

@pietern

pietern commented Jan 28, 2020

It's only available in the unstable (nightly) builds. If PyText supports 1.3.1, it should degrade gracefully when it can't find the function.
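A minimal sketch of the graceful degradation suggested above, assuming the factory takes a list of process groups (as the truncated call in `pytext/utils/distributed.py` suggests, but not confirmed here). The helper name `round_robin_or_default` and its `new_group`/`default_group` parameters are hypothetical; `getattr` with a default replaces the direct attribute access that raised the `AttributeError` in the traceback.

```python
from types import SimpleNamespace


def round_robin_or_default(dist_module, new_group, default_group, gpu_streams):
    """Hypothetical helper: build a round-robin process group when the installed
    torch.distributed provides _round_robin_process_groups, else fall back."""
    # getattr with a default avoids the AttributeError seen on PyTorch 1.3.1.
    factory = getattr(dist_module, "_round_robin_process_groups", None)
    if factory is None or gpu_streams <= 1:
        # Graceful degradation: keep using the single default process group.
        return default_group
    # Assumption: the factory accepts a list of process groups.
    return factory([new_group() for _ in range(gpu_streams)])


# Usage with stand-in modules (no torch required):
old_dist = SimpleNamespace()  # attribute missing, as on 1.3.1
new_dist = SimpleNamespace(_round_robin_process_groups=lambda groups: ("rr", len(groups)))
print(round_robin_or_default(old_dist, object, "default", 2))  # default
print(round_robin_or_default(new_dist, object, "default", 2))  # ('rr', 2)
```

In real code, `dist_module` would be `torch.distributed` and `new_group` something like `torch.distributed.new_group`; the stand-ins only show the fallback logic.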

@jiangxiluning
Author

So which PyTorch version should I use? @pietern @chenyangyu1988
I cannot find this method in PyTorch's master branch either.

@pietern

pietern commented Jan 28, 2020

It's in the nightly releases and perhaps also in 1.4 (not sure).

See pytorch/pytorch@0282c5a for the commit.
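Since it is unclear which release first ships this function, one way to tell whether a particular install has it is to probe for the attribute rather than rely on a version number. A minimal sketch; the helper name `find_round_robin` is hypothetical:

```python
import importlib


def find_round_robin():
    """Return torch.distributed._round_robin_process_groups, or None if the
    installed PyTorch lacks it (or PyTorch is not installed at all)."""
    try:
        dist = importlib.import_module("torch.distributed")
    except ImportError:
        return None
    return getattr(dist, "_round_robin_process_groups", None)


if __name__ == "__main__":
    fn = find_round_robin()
    print("available" if fn is not None else "missing in this PyTorch build")
```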

@jiangxiluning
Author

@pietern 1.4.0 does not have it. How could a PyText release depend on this unstable feature without even suggesting which version to use?
