Small changes to support reading CommonCrawl files from S3 #23
Hello! I've been using ArchiveSpark with the CommonCrawl files stored on S3. I found a few items that needed small fixes, so I thought I'd send in a PR. I wouldn't say this is 100% ready to merge - I'm not sure if there are any automated tests to run - but I have been using the code with these modifications for a few weeks without issues.
Commit message follows:
This change makes a few modifications to the HDFS utils. Importantly, the FileSystem objects from the Hadoop libraries are now retrieved from the URI of the files. This allows accessing CommonCrawl WARC files on filesystems other than the one currently configured in the Hadoop conf.
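The idea of resolving the filesystem from each file's URI (in Hadoop, `FileSystem.get(uri, conf)` rather than `FileSystem.get(conf)`, which always returns the cluster default) can be sketched with plain `java.net.URI`. This is only an illustration of the dispatch, not the actual patch, and since Hadoop is not on the classpath here the sketch stops at extracting the scheme; the paths are hypothetical CommonCrawl-style examples:

```java
import java.net.URI;

public class FsByUri {
    // Derive the filesystem scheme from each file's own URI instead of
    // relying on the cluster-wide default (fs.defaultFS). In Hadoop this
    // corresponds to calling FileSystem.get(uri, conf) per file rather
    // than FileSystem.get(conf) once.
    static String schemeFor(String path) {
        URI uri = URI.create(path);
        // A path with no scheme falls back to the configured default FS.
        return uri.getScheme() == null ? "default" : uri.getScheme();
    }

    public static void main(String[] args) {
        // Hypothetical paths: S3, HDFS, and a scheme-less local path.
        System.out.println(schemeFor("s3a://commoncrawl/crawl-data/file.warc.gz")); // s3a
        System.out.println(schemeFor("hdfs://namenode:8020/data/file.warc.gz"));    // hdfs
        System.out.println(schemeFor("/local/data/file.warc.gz"));                  // default
    }
}
```

With the per-URI lookup, a job whose default filesystem is HDFS can still open `s3a://` paths directly, which is exactly the CommonCrawl-on-S3 case described above.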
Additionally, there is a small fix for occasionally corrupted WARC records encountered in the output from CommonCrawl.
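The commit doesn't spell out the fix, but the general skip-and-continue approach to corrupted records in a gzipped WARC stream can be sketched as follows. This is a hedged illustration, not the actual change: the record list, helper names, and the per-record gzip framing are assumptions for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TolerantWarcReader {
    // Attempt to decompress each record independently; a record whose
    // gzip data is corrupt is skipped instead of aborting the whole file.
    static List<String> readTolerantly(List<byte[]> records) {
        List<String> out = new ArrayList<>();
        for (byte[] rec : records) {
            try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(rec))) {
                out.add(new String(in.readAllBytes()));
            } catch (IOException e) {
                // Corrupted record: log/skip and continue with the next one.
            }
        }
        return out;
    }

    // Test helper: gzip a string into a standalone member.
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream g = new GZIPOutputStream(bos)) {
            g.write(s.getBytes());
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> recs = new ArrayList<>();
        recs.add(gzip("WARC/1.0 record A"));
        recs.add(new byte[] {0, 1, 2, 3}); // corrupt: not valid gzip
        recs.add(gzip("WARC/1.0 record B"));
        // Both good records survive; the corrupt one is dropped.
        System.out.println(readTolerantly(recs));
    }
}
```

This mirrors how CommonCrawl WARCs are laid out (one gzip member per record), which is what makes skipping a single bad record feasible without losing the rest of the file.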