This repository has been archived by the owner on Nov 1, 2024. It is now read-only.

How to get a parallel dataset from the already shared raw tokenized data? #42

Open
himanshu034 opened this issue Jul 21, 2021 · 1 comment

Comments

@himanshu034

Hi, I downloaded https://dl.fbaipublicfiles.com/transcoder/TransCoder_tokenized_test_set_functions.zip and looked into the raw tokenized parallel data, which is in .tok format. It seems the same methods are written in all three languages: C++, Python, and Java. I need to know how the binarized .pth files, such as "python_sa-cpp_sa-python_sa" and "cpp_sa-python_sa-cpp_sa", are generated.
Any help would be much appreciated.

@malachaux
Contributor

This repo is now deprecated. Please refer to our new repository: https://github.com/facebookresearch/CodeGen

