Small changes to support reading CommonCrawl files from S3 #23
Hello! I've been using ArchiveSpark with the CommonCrawl files stored on S3. I found a few items that needed small fixes, so I thought I'd send in a PR. I wouldn't say this is 100% ready to merge - I'm not sure if there are any automated tests to run - but I have been using the code with these modifications for a few weeks without issues.
Commit message follows:
This change makes a few modifications to the HDFS utils. Importantly, the FileSystem objects from the Hadoop libraries are now retrieved from the URI of the files. This allows accessing CommonCrawl WARC files on filesystems other than the one currently configured in the Hadoop conf.
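The idea of resolving the filesystem from each file's URI (in Hadoop, `FileSystem.get(uri, conf)` rather than `FileSystem.get(conf)`, which always returns the cluster default) can be sketched with plain `java.net.URI`. This is only an illustration of the dispatch, not the actual patch, and since Hadoop is not on the classpath here the sketch stops at extracting the scheme; the paths are hypothetical CommonCrawl-style examples:

```java
import java.net.URI;

public class FsByUri {
    // Derive the filesystem scheme from each file's own URI instead of
    // relying on the cluster-wide default (fs.defaultFS). In Hadoop this
    // corresponds to calling FileSystem.get(uri, conf) per file rather
    // than FileSystem.get(conf) once.
    static String schemeFor(String path) {
        URI uri = URI.create(path);
        // A path with no scheme falls back to the configured default FS.
        return uri.getScheme() == null ? "default" : uri.getScheme();
    }

    public static void main(String[] args) {
        // Hypothetical paths: S3, HDFS, and a scheme-less local path.
        System.out.println(schemeFor("s3a://commoncrawl/crawl-data/file.warc.gz")); // s3a
        System.out.println(schemeFor("hdfs://namenode:8020/data/file.warc.gz"));    // hdfs
        System.out.println(schemeFor("/local/data/file.warc.gz"));                  // default
    }
}
```

With the per-URI lookup, a job whose default filesystem is HDFS can still open `s3a://` paths directly, which is exactly the CommonCrawl-on-S3 case described above.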
Additionally, there is a small fix for occasionally corrupted WARC records encountered in the output from CommonCrawl.
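The commit doesn't spell out the fix, but the general skip-and-continue approach to corrupted records in a gzipped WARC stream can be sketched as follows. This is a hedged illustration, not the actual change: the record list, helper names, and the per-record gzip framing are assumptions for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TolerantWarcReader {
    // Attempt to decompress each record independently; a record whose
    // gzip data is corrupt is skipped instead of aborting the whole file.
    static List<String> readTolerantly(List<byte[]> records) {
        List<String> out = new ArrayList<>();
        for (byte[] rec : records) {
            try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(rec))) {
                out.add(new String(in.readAllBytes()));
            } catch (IOException e) {
                // Corrupted record: log/skip and continue with the next one.
            }
        }
        return out;
    }

    // Test helper: gzip a string into a standalone member.
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream g = new GZIPOutputStream(bos)) {
            g.write(s.getBytes());
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> recs = new ArrayList<>();
        recs.add(gzip("WARC/1.0 record A"));
        recs.add(new byte[] {0, 1, 2, 3}); // corrupt: not valid gzip
        recs.add(gzip("WARC/1.0 record B"));
        // Both good records survive; the corrupt one is dropped.
        System.out.println(readTolerantly(recs));
    }
}
```

This mirrors how CommonCrawl WARCs are laid out (one gzip member per record), which is what makes skipping a single bad record feasible without losing the rest of the file.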