This tool scrapes the public metadata sources from set of stratum0 and stratum1 servers. It grabs:
- cvmfs/info/v1/repositories.json
And then for every repo it finds (that it's not told to ignore), it grabs:
- cvmfs/<repo>/.cvmfs_status.json
- cvmfs/<repo>/.cvmfspublished
pip install cvmfs-server-scraper
#!/usr/bin/env python3
import logging
from cvmfsscraper import scrape, scrape_server, set_log_level
# server = scrape_server("aws-eu-west1.stratum1.cvmfs.eessi-infra.org")
set_log_level(logging.DEBUG)
servers = scrape(
stratum0_servers=[
"stratum0.tld",
],
stratum1_servers=[
"stratum1-no.tld",
"stratum1-au.tld",
],
repos=[],
ignore_repos=[],
)
# Note that the order of servers is undefined.
print(servers[0])
for repo in servers[0].repositories:
print("Repo: " + repo.name )
print("Root size: " + repo.root_size)
print("Revision: " + repo.revision)
print("Revision timestamp: " + repo.revision_timestamp)
print("Last snapshot: " + str(repo.last_snapshot))
Note that if you are using a Stratum1 server with S3 as its backend, you need to set repos explicitly.
This is because the S3 backend does not have a cvmfs/info/v1/repositories.json
file. Also, the GeoAPI
status will be NOT_FOUND
for these servers.
# Data structure
## Server
A server object, representing a specific server that has been scraped.
````python
servers = scrape(...)
server_one = servers[0]
server.name
The name of the server, usually its fully qualified domain name.
server.geoapi_status
A GeoAPIstatus enum object. Defined in constants.py
. The possible values are:
- OK (0: OK)
- LOCATION_ERROR (1: GeoApi gives wrong location)
- NO_RESPONSE (2: No response)
- NOT_FOUND (9: The server has no repository available so the GeoApi cannot be tested)
- NOT_YET_TESTED (99: The server has not yet been tested)
server.repositories
A list of repository objects, sorted by name. Empty if no repositores are scraped on the server.
server.ignored_repositories
List of repositories names that are to be ignored by the scraper.
server.forced_repositories
A list of repository names that the server is forced to scrape. If a repo name exists in both ignored_repositories and forced_repositories, it will be scraped.
A repository object, representing a single repository on a scraped server.
servers = scrape(...)
repo_one = servers[0].repositories[0]
repo_one.name
The fully qualified name of the repository.
repo_one.server
The server object to which the repository belongs.
repo_one.path
The path for the repository on the server. May differ from the name. To get a complete URL, one can do:
url = "http://" + repo_one.server.name + repo_one.path
These attributes are populated from cvmfs_status.json
:
Attribute | Value |
---|---|
last_gc | Timestamp of last garbage collection |
last_snapshot | Timestamp of the last snapshot |
Information from .cvmfspublished
is also provided. For explanations for these keys, please see CVMFS' official documentation. The field value in the table is the field key from .cvmfspublished
.
Attribute | Field |
---|---|
alternative_name | A |
full_name | N |
is_garbage_collectable | G |
metadata_cryptographic_hash | M |
micro_cataogues | L |
reflog_checksum_cryptographic_hash | Y |
revision_timestamp | T |
root_catalogue_ttl | D |
root_cryptographic_hash | C |
root_size | B |
root_path_hash | R |
signature | The end signature blob |
signing_certificate_cryptographic_hash | X |
tag_history_cryptographic_hash | H |