WARC files written in ArchiveSpark incompatible with warcio #26

Open
parismic opened this issue Jul 13, 2021 · 0 comments

warcio raises
warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: response
at the second WARC record (the one after the warcinfo record) in a WARC file written with ArchiveSpark.
Both warcio and ArchiveSpark state that they follow the ISO 28500 draft: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
warcio works fine with WARC files written by Heritrix.
I have posted an issue on warcio as well.

warcio also returns a warning before the error:

WARNING: Record not followed by newline, perhaps Content-Length is invalid
Offset: 433
Remainder: b'WARC/1.0\r\n'

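It could be that ArchiveSpark needs to write an additional empty line between records: the remainder shown above is the first line of the next record, found where warcio expected a blank line.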

warcio also reports this parser exception:

warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: WARC-Type: response
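
The missing-separator hypothesis can be checked without warcio. ISO 28500 defines a record as the header, a CRLF, the content block, and then two CRLFs, so every block should be followed by b'\r\n\r\n'. Below is a minimal sketch that walks an uncompressed WARC byte string and reports, per record, whether that trailer is present. The helper names `build_record` and `check_record_separators` are hypothetical and belong to neither warcio nor ArchiveSpark, and the sample records are synthetic, not taken from the failing file.

```python
CRLF = b"\r\n"


def build_record(warc_type: bytes, payload: bytes, terminate: bool = True) -> bytes:
    """Build a tiny synthetic WARC record (hypothetical helper, for illustration only)."""
    header = CRLF.join([
        b"WARC/1.0",
        b"WARC-Type: " + warc_type,
        b"Content-Length: " + str(len(payload)).encode("ascii"),
    ])
    record = header + CRLF + CRLF + payload
    # A well-formed record ends with two CRLFs after the content block.
    return record + (CRLF + CRLF if terminate else b"")


def check_record_separators(data: bytes):
    """Return (offset, ok) per record; ok is True if the block is followed by CRLF CRLF."""
    results = []
    pos = 0
    while pos < len(data):
        header_end = data.find(CRLF + CRLF, pos)
        if header_end == -1:
            break
        length = None
        # Skip the version line (e.g. WARC/1.0) and look for Content-Length.
        for line in data[pos:header_end].split(CRLF)[1:]:
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError(f"record at offset {pos} has no Content-Length")
        block_end = header_end + len(CRLF + CRLF) + length
        results.append((pos, data[block_end:block_end + 4] == CRLF + CRLF))
        # Move past whatever CRLFs are actually present before the next record.
        pos = block_end
        while data[pos:pos + 2] == CRLF:
            pos += 2
    return results


good = build_record(b"warcinfo", b"abc") + build_record(b"response", b"hello")
bad = build_record(b"warcinfo", b"abc", terminate=False) + build_record(b"response", b"hello")

print([ok for _, ok in check_record_separators(good)])  # [True, True]
print([ok for _, ok in check_record_separators(bad)])   # [False, True]
```

Running `check_record_separators` on the decompressed bytes of an ArchiveSpark output file would show whether the warcinfo record is the one missing its trailer. For gzipped WARC files, each gzip member would need to be decompressed first; this sketch does not handle that.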