In the data-processing part, it doesn't seem to be used? The ids in the training set are sparse; only words present in the dictionary get an id.
Also, there is:

embedding = tf.get_variable('embedding', [self.config.vocab_size, self.config.embedding_dim])
embedding_inputs = tf.nn.embedding_lookup(embedding, self.input_x)

These word vectors are randomly initialized. Will this part actually get trained during training?
Id 0 is a placeholder; it was deliberately reserved for the subsequent padding step. When a sequence is shorter than the fixed length, it is padded with 0s at the front, which is standard sequence preprocessing.
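For illustration, the left-padding described above can be sketched in plain Python (pad_to is a hypothetical helper, not from the repo):

```python
def pad_to(ids, max_len, pad_id=0):
    """Left-pad a list of token ids with pad_id, or truncate to max_len."""
    if len(ids) >= max_len:
        return ids[:max_len]
    return [pad_id] * (max_len - len(ids)) + ids

# e.g. pad_to([5, 9, 2], 6) -> [0, 0, 0, 5, 9, 2]
```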
Also, embedding is a TensorFlow variable, so it is trained automatically as part of the training process.
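To see what the lookup itself does, here is a NumPy sketch: tf.nn.embedding_lookup is essentially a row gather from the embedding matrix, so gradients flow back only into the rows whose ids actually appear in the batch. The sizes and the input batch below are made up for illustration:

```python
import numpy as np

vocab_size, embedding_dim = 5000, 64
rng = np.random.default_rng(0)
# Random initialization, analogous to tf.get_variable with a default initializer
embedding = rng.standard_normal((vocab_size, embedding_dim))

input_x = np.array([[0, 0, 12, 7],
                    [0, 3, 3, 9]])   # a left-padded batch of id sequences
# Row gather: output[i, j] = embedding[input_x[i, j]]
embedding_inputs = embedding[input_x]
# shape: (batch, seq_len, embedding_dim) == (2, 4, 64)
```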
I trained using the word vectors from https://github.com/Embedding/Chinese-Word-Vectors, and the result improved slightly, by about 2 points.
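A minimal sketch of seeding the embedding matrix from such pretrained vectors, assuming the common text format of one word followed by its floats per line; load_pretrained and the toy data are illustrative, not from the repo:

```python
import numpy as np

def load_pretrained(lines, word_to_id, embedding_dim, seed=0):
    """Build an embedding matrix; words missing from the file stay random."""
    rng = np.random.default_rng(seed)
    matrix = rng.standard_normal((len(word_to_id), embedding_dim)) * 0.1
    for line in lines:
        parts = line.rstrip().split()
        word, vec = parts[0], parts[1:]
        if word in word_to_id and len(vec) == embedding_dim:
            matrix[word_to_id[word]] = np.asarray(vec, dtype=np.float64)
    return matrix

# Toy usage with an in-memory "file"
lines = ["你好 0.1 0.2 0.3", "世界 0.4 0.5 0.6"]
word_to_id = {"<PAD>": 0, "你好": 1, "世界": 2}
matrix = load_pretrained(lines, word_to_id, 3)
```

The resulting matrix can then be fed to the embedding variable as its initial value instead of a purely random start.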
Also, I inspected the length distribution: the 50th percentile is already 600+. I don't see the code keeping an unk token, so many words should get filtered out, which would shrink seq_len; even so, it's worth considering increasing max_seq_len.
Training set (text length):
count 49999.000000
mean    913.320506
std     930.094315
min       8.000000
25%     350.000000
50%     688.000000
75%    1154.000000
max   27467.000000

Validation set (text length):
count  4999.000000
mean    882.249050
std     863.752597
min      15.000000
25%     380.000000
50%     626.000000
75%    1072.000000
max   10919.000000
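Picking max_seq_len from such a distribution can be sketched with NumPy percentiles; the lengths array below is a small illustrative sample, not the real dataset:

```python
import numpy as np

# Illustrative sample of document lengths
lengths = np.array([8, 350, 380, 626, 688, 913, 1072, 1154, 27467])

# Cover e.g. 75% of samples without truncation
max_seq_len = int(np.percentile(lengths, 75))
```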