To get common tasks done more quickly, we have prepared a few recipes that you can copy and customize for your needs. In most cases, all you need to do is change the paths to point to your data or replace the Data Specification being used. The provided Data Specifications are described in DataSpecs.
- Building a corpus with title + text for a selected set of URLs
- Analyzing term / entity distributions in a dataset
- Extracting hyperlinks from webpages
- Extracting embedded resources from webpages
- Loading WARC / Generating CDX (enable more efficient processing)
- Downloading a web archive dataset as WARC/CDX from the Wayback Machine
These recipes are meant to serve as templates for your tasks. To tailor them to your needs, feel free to combine elements from different recipes.
More application-specific examples can be found in the related projects, such as:
- Create semantic Web triples from ArchiveSpark records with ArchiveSpark2Triples.
- Analyze medical journals at the Medical Heritage Library (MHL) with MHLonArchiveSpark.
- Analyze the temporal Web, starting from keywords issued to Tempas (Temporal Archive Search), with Tempas2ArchiveSpark.
We have shown that recipes can be reused across different kinds of archival datasets and data sources, e.g., web archives and digital journals. For more information, please read (and cite):
H. Holzmann, Emily Novak Gustainis and Vinay Goel. Universal Distant Reading through Metadata Proxies with ArchiveSpark. 5th IEEE International Conference on Big Data (BigData). Boston, MA, USA. December 2017.