< Table of Contents	Building ArchiveSpark (advanced) >

Recipes / Examples

To help you get common tasks done more quickly, we have prepared a few recipes that you can copy and customize. In most cases, you only need to change the paths so they point to your data, or replace the Data Specification being used. The provided Data Specifications are described in DataSpecs.

Building a corpus with title + text for a selected set of URLs
Analyzing term / entity distributions in a dataset
Extracting hyperlinks from webpages
Extracting embedded resources from webpages
Loading WARC / Generating CDX (enable more efficient processing)
Downloading a web archive dataset as WARC/CDX from the Wayback Machine

These recipes are meant to serve as templates for your own tasks. To tailor them to your needs, feel free to combine elements from different recipes.

More application-specific examples can be found in the related projects, such as:

Create semantic Web triples from ArchiveSpark records with ArchiveSpark2Triples.
Analyze medical journals at the Medical Heritage Library (MHL) with MHLonArchiveSpark.
Analyze the temporal Web, starting from keyword queries issued to Tempas (Temporal Archive Search), with Tempas2ArchiveSpark.

Interoperability

We have shown that recipes can be reused across different kinds of archival datasets and data sources, e.g., web archives and digital journals. For more information, please read (and cite):

H. Holzmann, Emily Novak Gustainis and Vinay Goel. Universal Distant Reading through Metadata Proxies with ArchiveSpark. 5th IEEE International Conference on Big Data (BigData). Boston, MA, USA. December 2017. Get full-text PDF

