spraakbanken/swegov-opendata-rs

swegov-opendata-rs

Tools used for collecting SFS (Svensk Författningssamling) from Riksdagens öppna data.

MIT licensed

Maturity badge - level 1

CI(check) CI(scheduled) CI(test)

fetch-sfs

Binary to run for collecting SFS.

Uses webcrawler and opendata-spider.

Takes roughly 1 hour to fetch all SFS data.

opendata-spiders

Lives in opendata-spider.

Uses swegov-opendata.

sfs

Contains the concrete spider for collecting SFS.

This spider generates URLs that search for documents of type SFS in 20-year spans, using the data.riksdagen.se/dokumentlista path.
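The span generation described above can be sketched roughly as follows. This is a minimal illustration, not the spider's actual code: the names `YearSpan`, `year_spans` and `dokumentlista_url` are hypothetical, and the `https` scheme and the query parameters (`doktyp`, `from`, `tom`, `utformat`) are assumptions about how the dokumentlista path is queried.

```rust
/// An inclusive span of years, e.g. 1900 through 1919.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct YearSpan {
    pub from: u32,
    pub to: u32,
}

/// Split the inclusive range `first..=last` into consecutive spans of at
/// most `width` years each. The final span is shortened to end at `last`.
pub fn year_spans(first: u32, last: u32, width: u32) -> Vec<YearSpan> {
    assert!(width > 0, "span width must be positive");
    let mut spans = Vec::new();
    let mut start = first;
    while start <= last {
        let end = (start + width - 1).min(last);
        spans.push(YearSpan { from: start, to: end });
        start = end + 1;
    }
    spans
}

/// Build a search URL for one span.
/// ASSUMPTION: the scheme and the query parameter names are illustrative
/// and not taken from this repository.
pub fn dokumentlista_url(span: YearSpan) -> String {
    format!(
        "https://data.riksdagen.se/dokumentlista/?doktyp=sfs&from={}-01-01&tom={}-12-31&utformat=json",
        span.from, span.to
    )
}
```

Calling `year_spans(1880, 1925, 20)` yields three spans, the last one cut short at 1925, and each span maps to one start URL for the crawl.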

Each list is scraped for dok_id values, which are used to fetch the individual documents, and for nasta_sida, which points to the next page of the dokumentlista.
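As a rough sketch of that step, assuming the list page has already been deserialized: the types `ListPage` and `Request` and the function `follow_up_requests` are hypothetical names for illustration, and the `https` scheme is an assumption.

```rust
/// One page of a dokumentlista result, reduced to the two fields the
/// spider follows. (Hypothetical type; the real data model is richer.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ListPage {
    pub dok_ids: Vec<String>,
    pub nasta_sida: Option<String>,
}

/// A follow-up request produced by scraping a list page.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Request {
    /// Fetch a single document by its URL.
    Document(String),
    /// Fetch the next page of the document list.
    NextPage(String),
}

/// Turn a list page into follow-up requests: one per dok_id, plus the
/// next page if nasta_sida is present.
pub fn follow_up_requests(page: &ListPage) -> Vec<Request> {
    let mut requests: Vec<Request> = page
        .dok_ids
        .iter()
        // ASSUMPTION: the https scheme is illustrative.
        .map(|id| Request::Document(format!("https://data.riksdagen.se/dokument/{id}")))
        .collect();
    if let Some(next) = &page.nasta_sida {
        requests.push(Request::NextPage(next.clone()));
    }
    requests
}
```

A page without nasta_sida produces only document requests, which is how the crawl of one span ends.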

All fetched pages are stored to disk in JSON format, except for pages containing HTML fragments, which are stored as-is. The documents are grouped by year.
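One way to picture the grouping by year is a path helper like the one below. The directory layout (`<root>/<year>/<dok_id>.<ext>`), the file extensions, and the names `Payload` and `storage_path` are assumptions made for illustration, not the layout this tool necessarily uses.

```rust
use std::path::{Path, PathBuf};

/// What kind of content was fetched for a document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Payload {
    /// A page stored in JSON format.
    Json,
    /// An HTML fragment stored as-is.
    HtmlFragment,
}

/// Compute where a fetched document would be written.
/// ASSUMPTION: one directory per year and a file named after the dok_id.
pub fn storage_path(root: &Path, year: u32, dok_id: &str, payload: Payload) -> PathBuf {
    let extension = match payload {
        Payload::Json => "json",
        Payload::HtmlFragment => "html",
    };
    root.join(year.to_string())
        .join(format!("{dok_id}.{extension}"))
}
```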

This spider handles the following inconsistencies in the API:

  • Fetching the data in JSON format sometimes doesn't include the text field.
    • Instead, the documents are fetched as XML and translated to JSON in the processing step.
  • data.riksdagen.se/dokument/<dok_id> is supposed to return the document with the given dok_id.
    • sometimes an empty document with no data is returned
    • sometimes just the html field of a document is returned
    • for both problems above, the path data.riksdagen.se/dokumentstatus/<dok_id> is needed instead
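The fallback in the last bullet can be sketched as a small decision function. How a response is actually classified is left out here; `DocumentOutcome` and `fallback_url` are hypothetical names, and the `https` scheme is an assumption.

```rust
/// What came back from data.riksdagen.se/dokument/<dok_id>.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocumentOutcome {
    /// The document was returned with its data.
    Complete,
    /// An empty document with no data was returned.
    Empty,
    /// Just the html field of the document was returned.
    HtmlOnly,
}

/// Decide whether a second request is needed. For the two problem cases
/// the dokumentstatus path is requested; otherwise nothing more is fetched.
pub fn fallback_url(dok_id: &str, outcome: DocumentOutcome) -> Option<String> {
    match outcome {
        DocumentOutcome::Complete => None,
        DocumentOutcome::Empty | DocumentOutcome::HtmlOnly => {
            // ASSUMPTION: the https scheme is illustrative.
            Some(format!("https://data.riksdagen.se/dokumentstatus/{dok_id}"))
        }
    }
}
```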

sfs-corpus

Uses swegov-opendata. Builds corpus files for processing with Sparv.

swegov-opendata

Data model for the documents and document lists from Riksdagens öppna data, with serde serialization and deserialization.

webcrawler

Lives in webcrawler.

Generic web crawler that defines an interface for spiders.

The spiders work in two steps:

  • scraping a URL for new URLs and/or data
  • processing the fetched data
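The two steps above could be captured by an interface along these lines. This is a simplified, synchronous sketch using only the standard library; the trait `Spider`, the driver `crawl` and the toy `LineSpider` are hypothetical and are not the crate's real API, which may well be asynchronous and shaped differently.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical spider interface mirroring the two steps.
pub trait Spider {
    type Item;
    /// URLs the crawl starts from.
    fn start_urls(&self) -> Vec<String>;
    /// Step 1: scrape a fetched page for data items and new URLs.
    fn scrape(&self, url: &str, body: &str) -> (Vec<Self::Item>, Vec<String>);
    /// Step 2: process one piece of fetched data.
    fn process(&mut self, item: Self::Item);
}

/// Minimal breadth-first driver. `fetch` stands in for the HTTP client and
/// returns `None` when a page cannot be fetched. Returns the number of
/// pages successfully fetched; each URL is queued at most once.
pub fn crawl<S, F>(spider: &mut S, mut fetch: F) -> usize
where
    S: Spider,
    F: FnMut(&str) -> Option<String>,
{
    let mut queue: VecDeque<String> = spider.start_urls().into();
    let mut seen: HashSet<String> = queue.iter().cloned().collect();
    let mut visited = 0;
    while let Some(url) = queue.pop_front() {
        let Some(body) = fetch(&url) else { continue };
        visited += 1;
        let (items, new_urls) = spider.scrape(&url, &body);
        for item in items {
            spider.process(item);
        }
        for new_url in new_urls {
            if seen.insert(new_url.clone()) {
                queue.push_back(new_url);
            }
        }
    }
    visited
}

/// Toy spider for illustration: lines starting with "-> " are links,
/// every other line is a data item that gets collected.
pub struct LineSpider {
    pub start: String,
    pub collected: Vec<String>,
}

impl Spider for LineSpider {
    type Item = String;

    fn start_urls(&self) -> Vec<String> {
        vec![self.start.clone()]
    }

    fn scrape(&self, _url: &str, body: &str) -> (Vec<String>, Vec<String>) {
        let mut items = Vec::new();
        let mut urls = Vec::new();
        for line in body.lines() {
            match line.strip_prefix("-> ") {
                Some(url) => urls.push(url.to_string()),
                None => items.push(line.to_string()),
            }
        }
        (items, urls)
    }

    fn process(&mut self, item: String) {
        self.collected.push(item);
    }
}
```

Keeping scraping and processing as separate methods lets the driver own the queue and deduplication, while each spider only decides what to extract and what to do with it.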

References

MSRV Policy

The MSRV (Minimum Supported Rust Version) is fixed for a given minor (1.x) version. However, it can be increased when bumping minor versions, i.e. going from 1.0 to 1.1 allows us to increase the MSRV. Users unable to increase their Rust version can use an older minor version instead. Below is a list of swegov-opendata-rs versions and their MSRV:

  • v0.1: Rust 1.74.

Note however that swegov-opendata-rs also has dependencies, which might have different MSRV policies. We try to stick to the above policy when updating dependencies, but this is not always possible.
