Small changes to support reading CommonCrawl files from S3 #23

Open: mmisiewicz wants to merge 1 commit into base: master

mmisiewicz

Hello! I've been using ArchiveSpark with the CommonCrawl files stored on S3. I found a few items that needed small fixes and I thought I'd send in a PR. I wouldn't say this is 100% ready to
merge - I'm not sure if there are any automated tests to run - but I have been using the code with
these modifications for a few weeks without issues.

Commit message follows:
This change makes a few modifications to the HDFS utils. Importantly,
the FileSystem objects from the hadoop libraries are retrieved from
the URI of the files. This will allow accessing CommonCrawl WARC files
on filesystems other than the currently configured one in the HadoopConf.

Additionally, there is a small fix for WARC records in the CommonCrawl
output that are sometimes corrupted.
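
For readers unfamiliar with the Hadoop API, the idea of resolving the `FileSystem` from a file's URI rather than from the configured default can be sketched roughly as follows. This is only an illustration, not the code in this PR: the object `FsFromUri` and the helper `fileSystemFor` are hypothetical names, and the example uses a `file://` path so it runs without any S3 setup. An `s3a://` path would presumably also need the hadoop-aws connector and credentials on the classpath and in the configuration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FsFromUri {
  // Hypothetical helper (not ArchiveSpark's actual code).
  // Path.getFileSystem picks the FileSystem implementation from the
  // path's own scheme and authority; a path without a scheme falls
  // back to the default filesystem from the configuration.
  def fileSystemFor(path: String, conf: Configuration): FileSystem =
    new Path(path).getFileSystem(conf)

  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // The filesystem configured as the default (fs.defaultFS),
    // which is typically HDFS on a cluster.
    val defaultFs = FileSystem.get(conf)

    // The filesystem that matches the path itself, independent of the default.
    val fromPath = fileSystemFor("file:///tmp", conf)

    println(s"default: ${defaultFs.getUri}, from path: ${fromPath.getUri}")
  }
}
```

With this approach a job whose default filesystem is HDFS can still open WARC files given as fully qualified URIs on another store, since each path selects its own `FileSystem`.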
