[Question] How to hashlib a list of csv files with aiofile? #29
thanks @Natim. Would asyncio.gather process, say, 100 independent CSV files in parallel or in a non-blocking manner? For example, hashing 100,000 CSV files with hashlib took 30 minutes, so it would be interesting to see how long this approach takes. |
You might want to watch my talk about IO Bound and CPU Bound mixes: https://www.youtube.com/watch?v=eJBbM3RpEUI |
To elaborate a little bit.

Single processing unit

If you don't mind a single-process CSV hasher, you can use aiofile to read your files line by line, and the files will be processed concurrently:

import asyncio
import hashlib
from aiofile import AIOFile, LineReader
async def hashlib_file(filename):
    # Open file
    async with AIOFile(filename, 'rb') as afd:
        # Create hasher
        hasher = hashlib.sha256()
        async for line in LineReader(afd):
            # For each line update hasher
            hasher.update(line)
    # return hexdigest
    return (hasher.hexdigest(), filename)

async def main():
    FILES = (
        "worker.py",
        "README.md",
    )
    actions = [hashlib_file(f) for f in FILES]
    results = await asyncio.gather(*actions)
    for filehash, filename in results:
        print(f"{filehash}\t{filename}")

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
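A small variant of the same single-process idea, sketched here as an assumption rather than something stated above: aiofile also ships a Reader helper that yields fixed-size chunks, so if you only need the digest (and not the line structure), you can skip the line splitting. The function name and the 64 KiB chunk size are illustrative.

import asyncio
import hashlib
from aiofile import AIOFile, Reader

async def hash_file_chunks(filename, chunk_size=64 * 1024):
    # Hash raw bytes in fixed-size chunks instead of per line
    hasher = hashlib.sha256()
    async with AIOFile(filename, 'rb') as afd:
        async for chunk in Reader(afd, chunk_size=chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest(), filename

It can be dropped into the same main() above in place of hashlib_file.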
Multi processing unit

If you want something really fast, you should use one subprocess per file (sha256sum here), so the hashing itself runs outside the Python process:

import asyncio
async def hashlib_file(filename):
    proc = await asyncio.create_subprocess_exec(
        "sha256sum", filename,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )
    stdout, stderr = await proc.communicate()
    value, _ = stdout.decode().split()
    return value, filename

async def main():
    FILES = (
        "worker.py",
        "README.md",
    )
    actions = [hashlib_file(f) for f in FILES]
    results = await asyncio.gather(*actions)
    for filehash, filename in results:
        print(f"{filehash}\t{filename}")

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
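If an external sha256sum binary is not available (or cannot be spawned on your platform), a similar multi-core effect can be sketched with the standard library only. This is an assumption, not code from the comments above: hashlib runs inside a ProcessPoolExecutor and asyncio only coordinates the results.

import asyncio
import hashlib
from concurrent.futures import ProcessPoolExecutor

def hash_file_sync(filename):
    # Plain blocking hashing; this runs inside a worker process
    hasher = hashlib.sha256()
    with open(filename, 'rb') as fd:
        for chunk in iter(lambda: fd.read(64 * 1024), b''):
            hasher.update(chunk)
    return hasher.hexdigest(), filename

async def main():
    FILES = (
        "worker.py",
        "README.md",
    )
    loop = asyncio.get_event_loop()
    with ProcessPoolExecutor() as pool:
        tasks = [loop.run_in_executor(pool, hash_file_sync, f) for f in FILES]
        results = await asyncio.gather(*tasks)
    for filehash, filename in results:
        print(f"{filehash}\t{filename}")

if __name__ == "__main__":
    # The __main__ guard matters for process pools on some platforms (e.g. Windows)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

The CPU-bound hashing is spread over the pool's worker processes, so it is not limited to a single core the way the single-process version is.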
|
@scheung38 I added the code in my previous message. |
What is the performance of the single vs the multi processing option? |
I am going to let you try it on your huge CSV files and tell us; it might be interesting. For small files you won't see the difference. For huge ones, I am interested. |
I am not dealing with a huge CSV file, but maybe 100,000 small CSV files? Appreciate your feedback though... |
Using standard non-async: 12.5 sec
Tried your single processing version: 71 sec
Tried your multiprocessing version: Error: NotImplementedError
So I am not sure why it is slower than the sync version? |
It seems you are using Windows, so you cannot exec a unix process from there. |
Your code is wrong for the async test 😂
|
Is the single or the multi processing one wrong? |
thanks