
size of Kinetics-600 #28

Open
cmhungsteve opened this issue Jun 6, 2018 · 49 comments

@cmhungsteve

I am wondering how large Kinetics-600 is. I am downloading it now and have finished around 330 GB.
I saw someone say Kinetics-400 is around 311 GB. Does that mean Kinetics-600 is around 470 GB?
Just curious about that.
Thank you.

@chrischute

Training set is 604 GB (392k clips), downloaded with this improved script. Scaling by the number of clips, that would make validation ~46 GB (30k clips) and test ~92 GB (60k clips).

@cmhungsteve
Author

Thank you for the reply. What do you mean by "scaling by the number of clips"?

@chrischute

No problem. I just meant that I had only downloaded the training set videos, so I was estimating the validation and test set sizes using the number of clips as given by the annotation files (i.e. multiplying by 30/392 to get the validation size, and 60/392 to get the test set size). I ended up downloading the validation clips and can confirm they're just over 46 GB total.
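That back-of-envelope scaling can be written out explicitly. A small sketch using the numbers quoted in this thread (only the training-set size is measured; the rest is a proportional estimate):

```python
# Rough size estimates for Kinetics-600 splits, scaling the measured
# training-set size by clip counts from the annotation files.
TRAIN_GB = 604          # measured training set size (this thread)
TRAIN_CLIPS = 392_000   # approx. clips in the training annotations

def estimate_gb(n_clips: int) -> float:
    """Estimate a split's size, assuming size is proportional to clip count."""
    return TRAIN_GB * n_clips / TRAIN_CLIPS

val_gb = estimate_gb(30_000)   # validation, ~46 GB
test_gb = estimate_gb(60_000)  # test, ~92 GB
```

This matches the ~46 GB validation figure confirmed above.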

@cmhungsteve
Author

Got it. Thank you.
Do you know the main difference between the improved script mentioned above and the one in your repo?

@chrischute

Kinetics is made up of 10-second clips from full YouTube videos. The original script downloads the full video for each example, then extracts the 10-second clip once it's downloaded. The improved script by @jremmons only downloads the 10-second clip you need.

You can see the line changes here: #16
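For reference, the clip-only approach boils down to resolving the direct media stream URL with youtube-dl and letting ffmpeg seek into the stream and cut just the 10-second window. A minimal sketch of the command construction (flag choices here are illustrative and may differ from @jremmons' actual script):

```python
def build_clip_cmds(youtube_id: str, t_start: int, duration: int, out_path: str):
    """Build the two commands the clip-only approach relies on:
    1) youtube-dl -g resolves the direct stream URL without downloading,
    2) ffmpeg seeks into that stream and cuts only the clip,
       so the full video is never downloaded.
    """
    url = f"https://www.youtube.com/watch?v={youtube_id}"
    resolve = ["youtube-dl", "-g", "-f", "mp4", url]
    # "<stream_url>" is replaced with the output of the first command.
    cut = ["ffmpeg", "-ss", str(t_start), "-i", "<stream_url>",
           "-t", str(duration), "-c:v", "libx264", "-c:a", "aac",
           out_path]
    return resolve, cut
```

Whether the cut step re-encodes (as above) or stream-copies is exactly the trade-off discussed later in this thread.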

@cmhungsteve
Author

I see. Thank you so much.

@escorciav
Collaborator

escorciav commented Jun 25, 2018

Note: I did not manage to get clean* videos with that script. It would be nice to see the effect of that on classification accuracy.

  • clean means that when you play the clip in a player, e.g. VLC, you don't see black frames. Black frames might suggest that the audio and video streams are not synced.

@jremmons

jremmons commented Jun 26, 2018

@cmhungsteve I think I fixed the issue that @escorciav mentioned with my latest commits.

https://github.com/jremmons/ActivityNet/blob/master/Crawler/Kinetics/download.py

@cmhungsteve
Author

@jremmons Thank you!!

@chrischute

@jremmons FWIW I sampled about 20 videos downloaded with the old script, and never saw the artifacts referred to in the other post (viewing in QuickTime Player). Any insight into why I might not have had the issue? Am I just getting lucky in sampling videos with no artifacts?

@escorciav
Collaborator

escorciav commented Jun 26, 2018

@chrischute out of curiosity, did you take care to sample videos with t_start significantly different from zero?
For the record, I used VLC or the Fedora 27 default video player, aided with non-free codecs such as x264.

@chrischute

@escorciav I did try sampling a video with t_start greater than 0 (abseiling/YEgqBGmmPV8). There were no artifacts. I ran the download script on an Ubuntu machine with Python 3.6 and ffmpeg 3.1.3-static. I downloaded the mp4 to my Mac and viewed it in QuickTime Player.

@jremmons

jremmons commented Jun 26, 2018

@escorciav I also didn't notice any issues with the first script I wrote. The current version of my script now re-encodes, as the original download.py did, just without downloading the entire YouTube video (this is just to be safe). If someone can provide an example of a video where this issue occurs, that would help a lot.

It would be a huge performance win for most people if the script didn't have to re-encode. If we can't reproduce this problem, it might be worth going back to a version that doesn't re-encode.

@escorciav
Collaborator

TL;DR: as we agreed in the other PR, it's great to have this alternative for downloading and clipping the videos. If anyone wants to start quickly, please use it. Take my words as a disclaimer note, like the ones in the agreements we usually don't read 😉

I am really happy with your contribution and comments. My original comment was more a scientific/engineering question than a note to discourage use of this script.

I don't have much bandwidth to test it in the next two weeks; I will try, but I can't promise anything. If you have a Docker image or conda environment, please share it; that would reduce the amount of work.

@cmhungsteve
Author

@jremmons I tried your script, but all the downloaded files are just empty text files. I also tried printing "status" at line 137, and it always showed something like this: "('QcVuxQAgrzU_000007_000017', False, b'')".
However, I have no problem downloading with the old script (I used the same command).
Can you help me figure out what the problem is? Thank you.

@jremmons

jremmons commented Jun 27, 2018

@cmhungsteve that is strange... I literally just used the same script today to download kinetics-600 without any issues. Are you sure you are using the most up to date version here? If you have questions about my script though, we should chat on #16.

@cmhungsteve
Author

I am not really sure why. Is it because my FFmpeg version is not correct, or am I missing some library?
Here is the info shown by ffmpeg -version:
ffmpeg version 3.4.2-1~16.04.york0.2 Copyright (c) 2000-2018 the FFmpeg developers
built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.9) 20160609
configuration: --prefix=/usr --extra-version='1~16.04.york0.2' --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared
libavutil 55. 78.100 / 55. 78.100
libavcodec 57.107.100 / 57.107.100
libavformat 57. 83.100 / 57. 83.100
libavdevice 57. 10.100 / 57. 10.100
libavfilter 6.107.100 / 6.107.100
libavresample 3. 7. 0 / 3. 7. 0
libswscale 4. 8.100 / 4. 8.100
libswresample 2. 9.100 / 2. 9.100
libpostproc 54. 7.100 / 54. 7.100

and my youtube-dl version is "2018.6.25", which I think is the newest.

@sophia-wright-blue

I know this is a rather open-ended question, but I was looking for some guidance on the time it takes to download the entire dataset (e.g. with num-jobs=24 on an AWS p2 instance with 8 cores). Thank you, @cmhungsteve @jremmons

@okankop

okankop commented Nov 21, 2018

@cmhungsteve that is strange... I literally just used the same script today to download kinetics-600 without any issues. Are you sure you are using the most up to date version here? If you have questions about my script though, we should chat on #16.

Did you manage to download all the clips? When I try to download the dataset, around 10% of the clips cannot be downloaded because the video is unavailable, there are copyright issues, or the user closed the account. Is that normal?

@cmhungsteve
Author

@okankop yeah....there are lots of videos with copyright issues. I think it's normal.

@MannyKayy

An update on the stats using @jremmons version of the download script.

Training set: 589GB (380802 clips)
Validation set: 45GB (29097 clips)
Test set: 19GB (12617 clips)

@dandelin

While inspecting downloaded videos, I found that joblib's parallelism would interfere with ffmpeg's transcoding (with a URL stream) and yield corrupted videos. The problem was solved by replacing joblib with Python's built-in multiprocessing module.
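A minimal sketch of that swap, using the stdlib multiprocessing module. The download_clip worker here is a hypothetical stub standing in for the real per-clip download function (youtube-dl + ffmpeg), not the actual code from the script:

```python
from multiprocessing import Pool

def download_clip(clip_id: str) -> tuple:
    # Placeholder for the real per-clip worker; returns a
    # (clip_id, succeeded) status tuple like the original script's.
    return clip_id, True

def download_all(clip_ids, num_jobs: int = 24):
    """Parallelize clip downloads with multiprocessing.Pool
    instead of joblib.Parallel."""
    with Pool(processes=num_jobs) as pool:
        return pool.map(download_clip, clip_ids)
```

Separate processes (rather than joblib's default backends) keep each ffmpeg invocation fully isolated, which is the behavior the comment above relies on.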

@lesoleil

lesoleil commented Mar 6, 2019

An update on the stats using @jremmons version of the download script.

Training set: 589GB (380802 clips)
Validation set: 45GB (29097 clips)
Test set: 19GB (12617 clips)

Hi, can you share the Kinetics-600 data files? Thanks a lot! @MannyKayy

@xiaoyang-coder

Hi, can you share the Kinetics-600 data files? Thanks a lot! @MannyKayy

@MannyKayy

What is the recommended way of sharing this dataset? It's roughly 0.6 TB and I am having trouble making a torrent of this dataset.

@escorciav
Collaborator

escorciav commented Apr 6, 2019

Maybe we should reach out to the CVDF foundation. They could probably host the data, as they have done for other datasets.

Please thumbs up this message if you deem it essential. It would help to make a strong case.

@hollowgalaxy

hollowgalaxy commented Jul 23, 2019

@sophia-wright-blue were you able to make it run on the p2 instance? I was running into a too-many-requests issue, which I created an issue for: #51 (comment)
@cmhungsteve @jremmons It seems you did not have issues with sending too many requests from youtube-dl's side. Was the machine you used for downloading personal, or a server?

@sophia-wright-blue

Hi @hollowgalaxy , I was able to download it on an Azure VM, good luck!

@hollowgalaxy

hollowgalaxy commented Jul 24, 2019

Thanks for letting me know, @sophia-wright-blue,
I just tried it. I ran a Standard D4s v3 (4 vcpus, 16 GiB memory) and it did not work.

@sophia-wright-blue

@hollowgalaxy, I think I played around with that number a bit; I don't remember what finally worked. You might also wanna try this repo: https://github.com/Showmax/kinetics-downloader which worked for me.

@MahdiKalayeh

Apparently, YouTube has recently started to extensively block large-scale downloading with youtube-dl. I have tried using the Crawler code for Kinetics and always get an HTTP 429 error. So it does not matter which approach/code you use; YouTube apparently just does not allow systematic downloading. It would be great if ActivityNet hosted the videos on some server so researchers would still be able to use Kinetics.

@escorciav
Collaborator

@MahdiKalayeh Could you pls confirm if #51 is the same error message that you got?

@MahdiKalayeh

MahdiKalayeh commented Aug 30, 2019

@escorciav Yes, the same error.

@escorciav
Collaborator

Let's track it there. Thanks 😉

@MStumpp

MStumpp commented Sep 21, 2019

@MannyKayy were you able to upload your download of Kinetics-600 so we may download it from there? Thanks

@MannyKayy

@MStumpp Unfortunately not. @escorciav It may be worthwhile for the CVDF to reach out to the authors for a copy of the full original Kinetics dataset.

@escorciav
Collaborator

I contacted the Kinetics maintainers, and they are aware of the request. The ball is in their court. I will follow up with them by the end of the week.

@sailordiary

I contacted the Kinetics maintainers, and they are aware of the request. The ball is in their court. I will follow up with them by the end of the week.

It's been a month, so I guess that possibility's gone out the window by now...?

@tyyyang

tyyyang commented Nov 8, 2019

Any update on this track?

@escorciav
Collaborator

#28 (comment)

Regarding ☝️, I haven't heard back from them officially. My feeling is that the maintainers have knocked on multiple doors and have not found any solution yet.

The most viable solutions that I'm aware of are:

@kaiqiangh

kaiqiangh commented Nov 29, 2019

I would like to ask some questions (they might repeat someone else's).

  1. Currently, I have downloaded the "val" set of the Kinetics-600 dataset and got 28k clips (3.9 GB). Is that correct?

  2. Got an error at the final step (saving download_report.json):
     Traceback (most recent call last):
       File "download.py", line 220, in <module>
         main(**vars(p.parse_args()))
       File "download.py", line 200, in main
         fobj.write(json.dumps(status_lst))
       File "/usr/lib/python3.5/json/__init__.py", line 230, in dumps
         return _default_encoder.encode(obj)
       File "/usr/lib/python3.5/json/encoder.py", line 198, in encode
         chunks = self.iterencode(o, _one_shot=True)
       File "/usr/lib/python3.5/json/encoder.py", line 256, in iterencode
         return _iterencode(o, 0)
       File "/usr/lib/python3.5/json/encoder.py", line 179, in default
         raise TypeError(repr(o) + " is not JSON serializable")
     TypeError: b'' is not JSON serializable
     What is this report? Is it very important?

  3. Got an error when downloading the "test" split:
     FileNotFoundError: [Errno 2] No such file or directory: '3c4ab9ca-5eb6-4525-8e4d-ac4111536577.mp4.part'
     Anyone got a similar one?

Thanks in advance.
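The TypeError in point 2 above is json.dumps failing on raw bytes: the status tuples carry the captured stderr output as bytes (the b'' seen earlier in this thread). A minimal fix sketch, decoding bytes via a default hook before dumping (the status list shape is taken from the tuples quoted in this thread):

```python
import json

def to_serializable(obj):
    """Fallback for json.dumps: decode bytes fields in the status tuples."""
    if isinstance(obj, bytes):
        return obj.decode("utf-8", errors="replace")
    raise TypeError(f"{obj!r} is not JSON serializable")

# Shape of the status tuples seen earlier in this thread.
status_lst = [("QcVuxQAgrzU_000007_000017", False, b"")]
report = json.dumps(status_lst, default=to_serializable)
```

The report is just a per-clip success log; regenerating it this way does not affect the downloaded videos themselves.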

@mahsunaltin

I am also working with the Kinetics dataset for academic purposes, and I got the same error (429). Can you please share it with us via Google Drive or something else? @kaiqiangh

@kaiqiangh

Hi @mahsunaltin, I also have this issue and cannot download the whole dataset. Not sure how to solve it.

@mahsunaltin

When you wrote Currently, I downloaded "val" set from kinetics-600 dataset and got 28k clips (3.9G). Is it correct?, I thought you had already downloaded the whole val set. @kaiqiangh

@kaiqiangh

Hi, I checked the log files and found some errors that led to an incomplete val set. I then re-ran the code, and the val dataset was overwritten. I am still working on it. By the way, I tried to download videos from another server of mine, but still get the 429 error. Do you have any solution for that?

@mahsunaltin

I have already tried various techniques to download the dataset, and like everyone else I got the 429 error. In fact, if we can change the IP address every 50 videos, there is no problem. So I have a somewhat tricky way to download the dataset using Colab (I know it's not very elegant :) but so far so good):

  • Download the dataset 50 by 50 from Colab and reset all runtimes every 50 videos.
  • Or a VPN can be used locally.
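The batching part of that workaround can be sketched as a simple chunking helper; the runtime reset or VPN switch between chunks remains a manual step (clip_ids and the download call are placeholders):

```python
def batches(items, size=50):
    """Split the download list into chunks of `size`, so the runtime
    (and thus the IP address) can be reset between chunks."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical usage:
# for batch in batches(clip_ids):
#     download(batch)  # then reset the Colab runtime / switch VPN
```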

@AmeenAli

Regarding ActivityNet:
It was published; here you go: https://drive.google.com/file/d/12YOTnPc4zCwum_R9CSpZAI9ppAei8KMG/view

@KT27-A

KT27-A commented May 26, 2020

If anyone cannot download samples due to error 429, you can use --cookies to download them. See https://daveparrish.net/posts/2018-06-22-How-to-download-private-YouTube-videos-with-youtube-dl.html.
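A sketch of that cookies workaround as a command line. The cookie-file path is illustrative; the video ID is one quoted earlier in this thread. The cookies file must be exported from a browser session that is logged in to YouTube:

```shell
# cookies.txt exported from a logged-in browser session
# (e.g. via a "cookies.txt" browser extension)
youtube-dl --cookies /path/to/cookies.txt \
    -f mp4 \
    "https://www.youtube.com/watch?v=QcVuxQAgrzU"
```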

BTW, it seems that a lot of the videos are private and cannot be accessed. How can we download the private videos to make the dataset complete?

@eglerean

Thanks @Katou2!
Do you mean that some of those videos are private, i.e. cannot be accessed by anybody except the uploader (and YouTube, of course)?
