WIP: I212 remote qa bulkrax refactored #263

Draft
wants to merge 22 commits into base: main

Commits (22)
c328ca3
[i212] - connect LOC to remote qa
ShanaLMoore Nov 28, 2022
04be464
[i212] - add subject partial for controlled vocabulary and spec
ShanaLMoore Nov 29, 2022
efe37b5
Merge branch 'main' into i212-remote-qa
ShanaLMoore Nov 29, 2022
d03e058
[i212] bring over implementation from UCSC
ShanaLMoore Dec 1, 2022
9b1716a
[i212] handle subject urls
ShanaLMoore Dec 1, 2022
b11db8c
[i212] - remove binding pry
ShanaLMoore Dec 1, 2022
67ea62d
[i212] implement buffer from ucsc
ShanaLMoore Dec 1, 2022
ff69d70
comment out unnecessary code from UCSC for now
ShanaLMoore Dec 2, 2022
568e83e
comment out unused code (for now)
ShanaLMoore Dec 2, 2022
0feb630
update allinson flex
ShanaLMoore Dec 2, 2022
d032e33
revert commenting out
ShanaLMoore Dec 2, 2022
118a1bd
pairing w jeremy notes
ShanaLMoore Dec 5, 2022
24b7a40
Merge branch 'main' into i212-remote-qa-bulkrax
ShanaLMoore Dec 5, 2022
6ed1e07
[i212] remove unused code and save label to solr
ShanaLMoore Dec 5, 2022
ccc4a7b
wip - remove unused code
ShanaLMoore Dec 6, 2022
e70a707
wip w jeremey's changes
ShanaLMoore Dec 7, 2022
bc3f9d2
Update has_local_processing.rb
ShanaLMoore Dec 8, 2022
d46d364
add exception handling
ShanaLMoore Dec 9, 2022
7acb1e3
Set up possibility to handle dynamic controlled values with bulkrax
ShanaLMoore Dec 9, 2022
b0c1737
reset gemfile to point to jeremy's authority wrapper branch
ShanaLMoore Dec 9, 2022
213e76d
remove unusued concerns
ShanaLMoore Dec 9, 2022
19a3066
[i212] add specs and auto correct rubocop
ShanaLMoore Dec 9, 2022
2 changes: 2 additions & 0 deletions Gemfile
@@ -86,6 +86,8 @@ group :development do
end

gem 'bulkrax', '~> 4.4'
+gem 'linkeddata'
+gem 'qa', git: 'https://github.com/samvera/questioning_authority.git', branch: 'jeremyf---extracting-logic-for-determining-qa-authority'

gem 'allinson_flex', git: 'https://github.com/samvera-labs/allinson_flex.git'
gem 'blacklight', '~> 6.7'
37 changes: 23 additions & 14 deletions Gemfile.lock
@@ -1,6 +1,6 @@
GIT
remote: https://github.com/samvera-labs/allinson_flex.git
-revision: c1dfe12e5386f8f4325b3a19dde949ecef1fede3
+revision: ab9b353a3034c182218356c9320ebe0b6b952a79
specs:
allinson_flex (0.1.0)
json_schemer
@@ -93,6 +93,21 @@ GIT
tinymce-rails (~> 5.10)
valkyrie (~> 2, >= 2.1.1)

+GIT
+remote: https://github.com/samvera/questioning_authority.git
+revision: d3c51a573f7d96f2cddb01a1c2379373b8559482
+branch: jeremyf---extracting-logic-for-determining-qa-authority
+specs:
+qa (5.10.0)
+activerecord-import
+deprecation
+faraday (< 3.0, != 2.0.0)
+geocoder
+ldpath
+nokogiri (~> 1.6)
+rails (>= 5.0, < 7.1)
+rdf
+
GIT
remote: https://github.com/tawan/active-elastic-job.git
revision: 092a8102cd38cffd7203d408fa03998cdff9dc03
@@ -158,7 +173,7 @@ GEM
activemodel (= 5.2.8)
activesupport (= 5.2.8)
arel (>= 9.0)
-activerecord-import (1.4.0)
+activerecord-import (1.4.1)
activerecord (>= 4.2)
activerecord-nulldb-adapter (0.8.0)
activerecord (>= 5.2.0, < 7.1)
@@ -505,7 +520,7 @@ GEM
railties (>= 3.2, < 8.0)
gender_detector (0.1.2)
unicode_utils (>= 1.3.0)
-geocoder (1.8.0)
+geocoder (1.8.1)
gitlab (4.19.0)
httparty (~> 0.20)
terminal-table (>= 1.5.1)
@@ -637,7 +652,7 @@ GEM
rdf (~> 3.1)
json-schema (2.8.1)
addressable (>= 2.4)
-json_schemer (0.2.21)
+json_schemer (0.2.24)
ecma-re-validator (~> 0.3)
hana (~> 1.3)
regexp_parser (~> 2.0)
@@ -680,10 +695,11 @@ GEM
rdf-turtle
rdf-vocab (>= 0.8)
slop
-ldpath (1.1.0)
+ldpath (1.2.0)
nokogiri (~> 1.8)
parslet
rdf (~> 3.0)
+rdf-vocab (~> 3.0)
legato (0.7.0)
multi_json
libxml-ruby (3.1.0)
@@ -850,15 +866,6 @@ GEM
public_suffix (2.0.5)
puma (4.3.12)
nio4r (~> 2.0)
-qa (5.8.1)
-activerecord-import
-deprecation
-faraday (< 2.0)
-geocoder
-ldpath
-nokogiri (~> 1.6)
-rails (>= 5.0, < 6.2)
-rdf
racc (1.6.0)
rack (2.2.3)
rack-proxy (0.7.2)
@@ -1283,6 +1290,7 @@ DEPENDENCIES
jbuilder (~> 2.5)
jquery-rails
launchy
+linkeddata
listen (>= 3.0.5, < 3.2)
lograge
mods (~> 2.4)
@@ -1298,6 +1306,7 @@ DEPENDENCIES
pronto-rubocop
pry-byebug
puma (~> 4.3)
+qa!
rack-test (= 0.7.0)
rails (~> 5.2.5)
rails-controller-testing
1 change: 0 additions & 1 deletion app/indexers/collection_indexer.rb
@@ -4,7 +4,6 @@ class CollectionIndexer < Hyrax::CollectionIndexer
# This indexes the default metadata. You can remove it if you want to
# provide your own metadata and indexing.
include Hyrax::IndexesBasicMetadata

# Uncomment this block if you want to add custom indexing behavior:
def generate_solr_document
super.tap do |solr_doc|
119 changes: 119 additions & 0 deletions app/models/concerns/bulkrax/has_local_processing.rb
@@ -0,0 +1,119 @@
# frozen_string_literal: true

module Bulkrax
module HasLocalProcessing
AuthorityInfo = Struct.new(:authority, :subauthority, :id, :uri, keyword_init: true)

def add_local
add_controlled_fields
end

private

# Controlled fields expect an ActiveTriples instance as a value, but Bulkrax only imports strings.
# Use the imported string values to look up or create valid ActiveTriples URIs and add them
# to the Entry's parsed_metadata in the format that the actor stack expects.
def add_controlled_fields
controlled_field_names.each do |field_name|
raw_metadata_for_field = raw_metadata.select { |k, _v| k.match?(/#{field_name.downcase}(_\d+)?/) }
next if raw_metadata_for_field.blank?

all_values = raw_metadata_for_field.values.compact.flat_map { |value| value.split(/\s*[|]\s*/) }
parsed_metadata[field_name] = []
next if all_values.blank?

all_values.each do |value|
auth_id = sanitize_controlled_field_uri(value) # assume user-provided URI references a valid authority
next if auth_id.blank?

info = extract_authority_info_from(auth_id)
# Fetch once and reuse the result; fetch_remote_label returns the URI itself when the lookup fails.
label = fetch_remote_label(info)
# Do not buffer a failed lookup, otherwise the bare URI would be served as the label until it expires.
cache_label(info.uri, label) unless label == info.uri

# TODO: fetch and cache the authority in a background job that pulls LOC terms into the local db
# (Authority.fetch_cache_term).
# Append rather than assign by index so skipped values do not leave nil gaps.
parsed_metadata[field_name] << label
end
end
end

def sanitize_controlled_field_uri(value)
return unless value.match?(::URI::DEFAULT_PARSER.make_regexp)

# Normalize the scheme to http (only at the start of the string) and
# remove a single trailing forward slash if one is present.
value.strip.sub(/\Ahttps:/, 'http:').chomp('/')
end

def extract_authority_info_from(url)
uri = URI.parse(url)
domain = uri.host.downcase # should this come from the metadata profile?
# Focus on implementing LOC first. Match the domain suffix so hosts such as "localhost" are not treated as LOC.
authority = :LOC if domain.end_with?('loc.gov')
# authority = get_field(field_name)
path_segments = uri.path.split('/') # => ["", "authorities", "subjects", "sh85001932"]
AuthorityInfo.new(authority: authority, subauthority: path_segments.third, id: path_segments.last, uri: url)
end

def fetch_remote_label(info)
# info.uri may be a plain string or an ActiveTriples::Resource; normalize it to a string URL
# so the buffer lookup below never runs against nil.
url = info.uri.is_a?(ActiveTriples::Resource) ? info.uri.id.dup : info.uri.to_s

# If a fresh label is buffered, return it; otherwise discard the stale entries and refetch.
if (buffer = LdBuffer.find_by(url: url))
return buffer.label if buffer.updated_at > 1.year.ago

LdBuffer.where(url: url).destroy_all
end

begin
request_header = { subauthority: info.subauthority }
context = Qa::AuthorityRequestContext.new(subauthority: info.subauthority, headers: request_header)
authority = Qa.authority_for(vocab: info.authority, subauthority: info.subauthority, context: context)
# authority = Qa::Authorities::LinkedData::GenericAuthority.new(info.authority) # how to get auth?
# label = authority.find(info.id, request_header: request_header)[:label]
authority.find(info.id)[:label].join
rescue StandardError => e
# IOError could result from a 500 error on the remote server
# SocketError results if there is no server to connect to
# Both are StandardError subclasses, so there is no need to rescue Exception.
Rails.logger.error "Unable to fetch #{url} from the authoritative source.\n#{e.message}"
info.uri
end
end

def cache_label(url, label)
Rails.logger.info "Adding buffer entry - label: #{label}, url: #{url}"
# Reuse an existing row for this URL instead of inserting a duplicate on every import.
LdBuffer.find_or_initialize_by(url: url).update(label: label)

# Delete the oldest records if we have more than 5K in the buffer
overflow = LdBuffer.count - 5000
return unless overflow.positive?

ids = LdBuffer.order(created_at: :asc).limit(overflow).pluck(:id)
LdBuffer.where(id: ids).delete_all
end

def metadata_schema
AllinsonFlex::DynamicSchema.last
end

def controlled_field_names
# Build the list once; appending inside the method would add duplicates on every call.
@controlled_vocabulary_properties ||= metadata_schema.schema['properties'].select do |_key, value|
value["controlled_values"] != ["null"]
end.keys
end

def get_field(field_name)
metadata_schema.schema['properties'][field_name]
end
end
end
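
Review note: the value handling in add_controlled_fields (split on pipes, sanitize the URI, derive authority info from it) can be exercised without Rails or Bulkrax. The following is a minimal sketch under stated assumptions, not the PR's implementation: split_values, sanitize_uri and extract_authority_info are hypothetical names, the AuthorityInfo struct is re-declared locally, and array indexing stands in for ActiveSupport's `third`.

```ruby
require 'uri'

# Local re-declaration of the struct used in the PR, so the sketch runs standalone.
AuthorityInfo = Struct.new(:authority, :subauthority, :id, :uri, keyword_init: true)

# Split a raw CSV cell such as "a | b" into individual values.
def split_values(raw)
  raw.to_s.split(/\s*\|\s*/).reject(&:empty?)
end

# Return a normalized http URI string, or nil when the value is not an http(s) URI.
def sanitize_uri(value)
  candidate = value.to_s.strip
  parsed = URI.parse(candidate)
  return nil unless parsed.is_a?(URI::HTTP) && parsed.host

  # Force the http scheme and drop a single trailing slash.
  candidate.sub(/\Ahttps:/i, 'http:').chomp('/')
rescue URI::InvalidURIError
  nil
end

# Derive authority, subauthority and term id from an id.loc.gov style URL.
def extract_authority_info(url)
  uri = URI.parse(url)
  segments = uri.path.split('/').reject(&:empty?) # e.g. ["authorities", "subjects", "sh85001932"]
  authority = uri.host.to_s.downcase.end_with?('loc.gov') ? :LOC : nil
  AuthorityInfo.new(authority: authority, subauthority: segments[1], id: segments.last, uri: url)
end

values = split_values('https://id.loc.gov/authorities/subjects/sh85001932/ | not a uri')
clean = sanitize_uri(values.first)
info = extract_authority_info(clean)
puts info.to_h
```

Values that fail sanitizing come back as nil, which matches the `next` guard in the importer: free-text values are skipped rather than sent to the remote authority.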
8 changes: 8 additions & 0 deletions app/models/ld_buffer.rb
@@ -0,0 +1,8 @@
# frozen_string_literal: true

# This model corresponds to an entry in our linked data buffer.
# When we retrieve a label from an externally controlled vocabulary,
# the URL and label are buffered here so they need not be looked up again.
# Old buffer entries are automatically deleted.
class LdBuffer < ApplicationRecord
end
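
Review note: the buffering policy described above (reuse a label until it is a year old, keep at most 5,000 entries, evict the oldest first) can be illustrated with a small in-memory stand-in. This is a sketch only, assuming a plain Ruby hash in place of the ld_buffers table; LabelBuffer and its method names are hypothetical and are not part of the PR.

```ruby
# Hypothetical in-memory stand-in for the LdBuffer table; not the ActiveRecord model.
class LabelBuffer
  Entry = Struct.new(:label, :stored_at)

  ONE_YEAR = 365 * 24 * 60 * 60 # seconds

  def initialize(max_entries: 5000, max_age: ONE_YEAR)
    @max_entries = max_entries
    @max_age = max_age
    @entries = {} # Ruby hashes keep insertion order, so the first key is the oldest
  end

  # Return the buffered label, or nil when it is missing or older than max_age.
  def fetch(url, now: Time.now)
    entry = @entries[url]
    return nil unless entry

    if now - entry.stored_at > @max_age
      @entries.delete(url) # stale: drop it so the caller refetches
      return nil
    end
    entry.label
  end

  # Store a label, then evict the oldest entries beyond max_entries.
  def store(url, label, now: Time.now)
    @entries.delete(url) # re-inserting moves the key to the newest position
    @entries[url] = Entry.new(label, now)
    @entries.shift while @entries.size > @max_entries
    label
  end

  def size
    @entries.size
  end
end

buffer = LabelBuffer.new(max_entries: 2, max_age: 10)
start = Time.at(0)
buffer.store('http://example.org/a', 'A', now: start)
buffer.store('http://example.org/b', 'B', now: start + 1)
buffer.store('http://example.org/c', 'C', now: start + 2)
puts buffer.size
```

Keying on the URL means a repeated import updates one entry instead of adding rows, which is the property a unique index on the url column would give the real table.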
54 changes: 54 additions & 0 deletions config/authorities/linked_data/loc_direct.json
@@ -0,0 +1,54 @@
{
"QA_CONFIG_VERSION": "2.1",
"prefixes": {
"loc": "http://id.loc.gov/vocabulary/identifiers/",
"madsrdf": "http://www.loc.gov/mads/rdf/v1#"
},
"term": {
"url": {
"@context": "http://www.w3.org/ns/hydra/context.jsonld",
"@type": "IriTemplate",
"template": "http://id.loc.gov/authorities/{subauth}/{term_id}",
"variableRepresentation": "BasicRepresentation",
"mapping": [
{
"@type": "IriTemplateMapping",
"variable": "term_id",
"property": "hydra:freetextQuery",
"required": true
},
{
"@type": "IriTemplateMapping",
"variable": "subauth",
"property": "hydra:freetextQuery",
"required": false,
"default": "names"
}
]
},
"qa_replacement_patterns": {
"term_id": "term_id",
"subauth": "subauth"
},
"term_id": "ID",
"language": ["en"],
"results": {
"id_ldpath": "loc:lccn | madsrdf:code",
"label_ldpath": "skos:prefLabel :: xsd:string",
"altlabel_ldpath": "skos:altLabel :: xsd:string",
"sameas_ldpath": "skos:exactMatch | owl:sameAs :: xsd:anyURI",
"narrower_ldpath": "madsrdf:hasNarrowerAuthority :: xsd:anyURI",
"broader_ldpath": "madsrdf:hasBroaderAuthority :: xsd:anyURI"
},
"subauthorities": {
"subjects": "subjects",
"names": "names",
"classification": "classification",
"child_subject": "childrensSubjects",
"genre": "genreForms",
"demographic": "demographicTerms",
"music_performance": "performanceMediums"
}
},
"search": {}
}
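
Review note: to make the term template in this config concrete, the sketch below expands `http://id.loc.gov/authorities/{subauth}/{term_id}` by plain string substitution. It is an illustration only: QA performs its own expansion internally, expand_term_url is a hypothetical helper, and it assumes the "subauthorities" map translates a subauthority name into the path segment used in the URL.

```ruby
# Values copied from the config above; the helper itself is hypothetical.
TEMPLATE = 'http://id.loc.gov/authorities/{subauth}/{term_id}'
DEFAULT_SUBAUTH = 'names'
SUBAUTHORITIES = {
  'subjects' => 'subjects',
  'names' => 'names',
  'classification' => 'classification',
  'child_subject' => 'childrensSubjects',
  'genre' => 'genreForms',
  'demographic' => 'demographicTerms',
  'music_performance' => 'performanceMediums'
}.freeze

# Expand the template for a term id, falling back to the default subauthority.
def expand_term_url(term_id:, subauth: nil)
  raise ArgumentError, 'term_id is required' if term_id.to_s.empty?

  name = subauth || DEFAULT_SUBAUTH
  segment = SUBAUTHORITIES.fetch(name) { raise ArgumentError, "unknown subauthority: #{name}" }
  replacements = { 'subauth' => segment, 'term_id' => term_id }
  TEMPLATE.gsub(/\{(\w+)\}/) { replacements.fetch(Regexp.last_match(1)) }
end

puts expand_term_url(term_id: 'sh85001932', subauth: 'subjects')
puts expand_term_url(term_id: 'n79021164')
```

One consequence worth checking, if the assumption about the map holds: the importer derives the subauthority from the URL path, so a genre URL yields "genreForms", which is a value in this map rather than a key ("genre").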