
For a very large (10 GB+) CSV file, how can you build a dataset cleverly so that it doesn't exhaust the computer's memory? #15

Answered by cx-333
cx-333 asked this question in Q&A

Solving the loading problem for very large CSV files by borrowing the way image datasets are built

For a very large CSV file, assume its rows are samples and its columns are features; sometimes the last column also holds a label. You can write each row of the CSV into its own txt file, named after the sample, so that each txt file stores all the features of one sample. The sample names are kept in a separate txt file (if there are labels, the sample names and labels can be stored together in one file).
Below is a program that splits the CSV this way: the features of each sample are written to their own txt file (named after the sample), and the sample names are written to a separate file.

import pandas as pd

# Read the CSV one row at a time; only a single sample is in memory at once.
file = pd.read_csv('file.csv', index_col=0, iterator=True, chunksize=1)
for data in file:
    sample = data.index.tolist()[0]                  # sample name (row index)
    value = str(data.values.tolist())                # e.g. '[[1.0, 2.0]]'
    value = value.replace('[', '').replace(']', '')  # -> '1.0, 2.0'
    with open('data/{}.txt'.format(sample), 'w', encoding='utf-8') as f, \
            open('samples.txt', 'a', encoding='utf-8') as s:
        f.write(value)                    # one sample's features per file
        s.write('{}\n'.format(sample))    # append the sample name to the list
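Once the CSV is split, the files can be loaded lazily the same way an image dataset is: keep only the list of sample names in memory and read each sample's file on demand. Below is a minimal sketch of such a map-style dataset; the class name `LazySampleDataset` is hypothetical, and it assumes the file layout produced by the splitting script above (`samples.txt` plus one `data/<sample>.txt` per sample). If you use PyTorch, subclass `torch.utils.data.Dataset` instead and return tensors.

```python
import os


class LazySampleDataset:
    """Map-style dataset: reads one per-sample txt file on demand,
    so memory use stays constant regardless of the original CSV size."""

    def __init__(self, sample_list='samples.txt', data_dir='data'):
        # samples.txt holds one sample name per line (written by the split script)
        with open(sample_list, encoding='utf-8') as f:
            self.samples = [line.strip() for line in f if line.strip()]
        self.data_dir = data_dir

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        name = self.samples[idx]
        path = os.path.join(self.data_dir, '{}.txt'.format(name))
        with open(path, encoding='utf-8') as f:
            # features were written as comma-separated numbers
            return [float(x) for x in f.read().split(',')]
```

Because `__getitem__` only touches one small file, this scales to arbitrarily large original CSVs; a DataLoader with multiple workers can then batch and shuffle the samples as usual.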

Answer selected by ZhikangNiu