Holding Pen: Merger Musings #3537

ksachs · 2018-07-10T09:47:33Z

Some musings about the merger.
Or more precisely: when do we need to update and merge?

Merging records is the most complex and error-prone action we have.
Don't do it unless necessary.

Process:

harvest: oai from arXiv, feed from publisher - original.V1
conversion to inspire-json - leads to basic_record.V1
enrichment - adds enrichment_record.V1 to basic_record.V1 (visible in HoldingPen)
harvest of updated version - original.V2
conversion to inspire-json - leads to basic_record.V2
Here we could compare basic_record.V2 and basic_record.V1.
No change -> end workflow
If the pdf changed (can we see that?), replace only the fulltext and re-run refextract
enrichment
auto-merge incl. info from BAI tables - leads to (partially-)merged_record.V2 (visible in HoldingPen)

It is difficult (impossible for me) to tell which information really came from arXiv and what should be updated. I'm not sure this is the best procedure.

To determine whether an update and merge is necessary, I would base the comparison on the converted INSPIRE JSON record, not on the original harvest. Publisher metadata can be very rich and might change in a place we don't use. In addition, its structure might change. The conversion to JSON is a good filter to avoid such problems.
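A minimal sketch of that comparison, assuming both harvests are available as converted records (Python dicts). The function names and the three action labels are hypothetical, and the list of ignored fields is an assumption based on the example further down, where bucket ids, version ids, file URLs and the acquisition timestamp differ between harvests although nothing changed at arXiv. The `checksum` entries in `_files` might answer the question whether a changed PDF can be seen.

```python
import copy


def split_record(record):
    """Split a converted record into (metadata, file checksums).

    Assumption: bucket ids, version ids, file URLs and the acquisition
    timestamp change on every harvest, so they are ignored here.
    """
    metadata = copy.deepcopy(record)  # never modify the caller's record
    checksums = {f["key"]: f["checksum"] for f in metadata.pop("_files", [])}
    metadata.pop("documents", None)  # these entries only describe the files
    metadata.get("acquisition_source", {}).pop("datetime", None)
    return metadata, checksums


def decide_action(basic_v1, basic_v2):
    """Return 'skip', 'refresh_fulltext' or 'merge' (hypothetical labels)."""
    old_meta, old_files = split_record(basic_v1)
    new_meta, new_files = split_record(basic_v2)
    if old_meta != new_meta:
        return "merge"             # real metadata change: run the merger
    if old_files != new_files:
        return "refresh_fulltext"  # only files changed: re-run refextract
    return "skip"                  # no change: end the workflow
```

Anything that is not explicitly ignored counts as a change, so this errs on the side of merging; the ignore list would need tuning against real harvests.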

Example: arXiv:1807.02123

At arXiv:

Current metadata
[v1] Thu, 5 Jul 2018 18:00:12 GMT (65kb,D)

 <?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2018-07-10T08:34:55Z</responseDate>
<request verb="GetRecord" identifier="oai:arXiv.org:1807.02123" metadataPrefix="arXiv">http://export.arxiv.org/oai2</request>
<GetRecord>
<record>
<header>
 <identifier>oai:arXiv.org:1807.02123</identifier>
 <datestamp>2018-07-09</datestamp>
 <setSpec>physics:astro-ph</setSpec>
 <setSpec>physics:gr-qc</setSpec>
 <setSpec>physics:hep-th</setSpec>
</header>
<metadata>
 <arXiv xmlns="http://arxiv.org/OAI/arXiv/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://arxiv.org/OAI/arXiv/ http://arxiv.org/OAI/arXiv.xsd">
  <id>1807.02123</id>
  <created>2018-07-05</created>
  <authors>
   <author>
    <keyname>Isi</keyname>
    <forenames>Maximiliano</forenames>
   </author>
   <author>
    <keyname>Stein</keyname>
    <forenames>Leo C.</forenames>
   </author>
  </authors>
  <title>Measuring stochastic gravitational-wave energy beyond general relativity</title>
  <categories>gr-qc astro-ph.CO astro-ph.HE hep-th</categories>
  <comments>18 pages (plus appendices), 1 figure</comments>
  <report-no>LIGO-P1700234</report-no>
  <license>http://arxiv.org/licenses/nonexclusive-distrib/1.0/</license>
  <abstract>  Gravity theories beyond general relativity (GR) can change the properties of ....</abstract>
 </arXiv>
</metadata>
</record>
</GetRecord>
</OAI-PMH>
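To see at a glance which fields arXiv delivers, a response like the one above can be flattened with the Python standard library. This is only an illustrative sketch, not the conversion hepcrawl actually performs; the function name `arxiv_fields` and the output keys are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of the arXiv metadata block, taken from the response above.
NS = {"arxiv": "http://arxiv.org/OAI/arXiv/"}


def arxiv_fields(xml_text):
    """Extract the arXiv metadata fields from an OAI-PMH GetRecord response."""
    root = ET.fromstring(xml_text.strip())
    meta = root.find(".//arxiv:arXiv", NS)

    def text(tag):
        node = meta.find("arxiv:" + tag, NS)
        return node.text.strip() if node is not None and node.text else None

    authors = [
        "{}, {}".format(
            author.findtext("arxiv:keyname", "", NS),
            author.findtext("arxiv:forenames", "", NS),
        )
        for author in meta.findall("arxiv:authors/arxiv:author", NS)
    ]
    return {
        "id": text("id"),
        "created": text("created"),
        "authors": authors,
        "title": text("title"),
        "categories": (text("categories") or "").split(),
        "comments": text("comments"),
        "report_no": text("report-no"),
        "license": text("license"),
    }
```

Judging by the values in basic_record.V1 below, `created` appears to end up in `preprint_date`, `comments` in `public_notes` and `report-no` in `report_numbers`.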

INSPIRE 1st harvest

HP 1117913

Can anyone find out which arXiv metadata were harvested?

basic_record.V1

record after conversion to json

"id": 1117913, 
"metadata": {
  "$schema": "https://labs.inspirehep.net/schemas/records/hep.json", 
  "_collections": [
    "Literature"
  ], 
  "_files": [
    {
      "bucket": "2544991d-f864-4e3c-84db-175a3d9d796b", 
      "checksum": "md5:a4c818b1694a6a502a0a2f21674ca92e", 
      "key": "1807.02123.tar.gz", 
      "size": 66761, 
      "version_id": "13acfe1f-bc07-4965-82d8-d814fa47e17f"
    }, 
    {
      "bucket": "2544991d-f864-4e3c-84db-175a3d9d796b", 
      "checksum": "md5:005bb51602500a9a0b66c925205e2afd", 
      "key": "1807.02123.pdf", 
      "size": 916611, 
      "version_id": "33c371b6-69ca-4aa2-974d-8413d01be527"
    }
  ], 
  "abstracts": [
    {
      "source": "arXiv", 
      "value": "Gravity theories beyond general relativity (GR) can change t....."
    }
  ], 
  "acquisition_source": {
    "datetime": "2018-07-09T03:34:57.462577", 
    "method": "hepcrawl", 
    "source": "arXiv", 
    "submission_number": "1117913"
  }, 
  "arxiv_eprints": [
    {
      "categories": [
        "gr-qc", 
        "astro-ph.CO", 
        "astro-ph.HE", 
        "hep-th"
      ], 
      "value": "1807.02123"
    }
  ], 
  "authors": [
    {
      "full_name": "Isi, Maximiliano"
    }, 
    {
      "full_name": "Stein, Leo C."
    }
  ], 
  "documents": [
    {
      "fulltext": true, 
      "hidden": true, 
      "key": "1807.02123.pdf", 
      "material": "preprint", 
      "original_url": "http://export.arxiv.org/pdf/1807.02123", 
      "source": "arxiv", 
      "url": "/api/files/2544991d-f864-4e3c-84db-175a3d9d796b/1807.02123.pdf"
    }
  ], 
  "license": [
    {
      "license": "arXiv nonexclusive-distrib 1.0", 
      "material": "preprint", 
      "url": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"
    }
  ], 
  "preprint_date": "2018-07-05", 
  "public_notes": [
    {
      "source": "arXiv", 
      "value": "18 pages (plus appendices), 1 figure"
    }
  ], 
  "report_numbers": [
    {
      "source": "arXiv", 
      "value": "LIGO-P1700234"
    }
  ], 
  "titles": [
    {
      "source": "arXiv", 
      "title": "Measuring stochastic gravitational-wave energy beyond general relativity"
    }
  ]

enrichment_record.V1

Information added during the workflow

"citeable": true, 
"control_number": 1681259, 
"core": true, 
"curated": false, 
"document_type": [
  "article"
], 
"inspire_categories": [
  {
    "source": "arxiv", 
    "term": "Gravitation and Cosmology"
  }, 
  {
    "source": "arxiv", 
    "term": "Astrophysics"
  }, 
  {
    "source": "arxiv", 
    "term": "Theory-HEP"
  }
], 
"number_of_pages": 18, 
"references": [
 ...    
],

Update

HP 1119139

basic_record.V2

looks very much the same as basic_record.V1

"id": 1119139, 
"metadata": {
  "$schema": "https://labs.inspirehep.net/schemas/records/hep.json", 
  "_collections": [
    "Literature"
  ], 
  "_files": [
    {
      "bucket": "7a52c6cf-2889-4233-8fb6-4fdfccf87f53", 
      "checksum": "md5:a4c818b1694a6a502a0a2f21674ca92e", 
      "key": "1807.02123.tar.gz", 
      "size": 66761, 
      "version_id": "d8afbc29-0514-43a5-9263-2adef7b8d371"
    }, 
    {
      "bucket": "7a52c6cf-2889-4233-8fb6-4fdfccf87f53", 
      "checksum": "md5:005bb51602500a9a0b66c925205e2afd", 
      "key": "1807.02123.pdf", 
      "size": 916611, 
      "version_id": "ba0ccc5d-2c9e-42fd-9dd5-3ecb47bb412a"
    }
  ], 
  "abstracts": [
    {
      "source": "arXiv", 
      "value": "Gravity theories beyond general relativity (GR) can change the properties of gravitational waves: their polarizations, dispersion, speed, and, importantly, energy content are all heavily theory- dependent. All these corrections can potentially be probed by measuring the stochastic gravitational- wave background. However, most existing treatments of this background beyond GR overlook modifications to the energy carried by gravitational waves, or rely on GR assumptions that are invalid in other theories. This may lead to mistranslation between the observable cross-correlation of detector outputs and gravitational-wave energy density, and thus to errors when deriving observational constraints on theories. In this article, we lay out a generic formalism for stochastic gravitational- wave searches, applicable to a large family of theories beyond GR. We explicitly state the (often tacit) assumptions that go into these searches, evaluating their generic applicability, or lack thereof. Examples of problematic assumptions are: statistical independence of linear polarization amplitudes; which polarizations satisfy equipartition; and which polarizations have well-defined phase velocities. We also show how to correctly infer the value of the stochastic energy density in the context of any given theory. We demonstrate with specific theories in which some of the traditional assumptions break down: Chern-Simons gravity, scalar-tensor theory, and Fierz-Pauli massive gravity. In each theory, we show how to properly include the beyond-GR corrections, and how to interpret observational results."
    }
  ], 
  "acquisition_source": {
    "datetime": "2018-07-10T03:36:36.182790", 
    "method": "hepcrawl", 
    "source": "arXiv", 
    "submission_number": "1117913"
  }, 
  "arxiv_eprints": [
    {
      "categories": [
        "gr-qc", 
        "astro-ph.CO", 
        "astro-ph.HE", 
        "hep-th"
      ], 
      "value": "1807.02123"
    }
  ], 
  "authors": [
    {
      "full_name": "Isi, Maximiliano"
    }, 
    {
      "full_name": "Stein, Leo C."
    }
  ], 
  "documents": [
    {
      "fulltext": true, 
      "hidden": true, 
      "key": "1807.02123.pdf", 
      "material": "preprint", 
      "original_url": "http://export.arxiv.org/pdf/1807.02123", 
      "source": "arxiv", 
      "url": "/api/files/7a52c6cf-2889-4233-8fb6-4fdfccf87f53/1807.02123.pdf"
    }
  ], 
  "license": [
    {
      "license": "arXiv nonexclusive-distrib 1.0", 
      "material": "preprint", 
      "url": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"
    }, 
    {
      "license": "arXiv nonexclusive-distrib 1.0", 
      "url": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"
    }
  ], 
  "preprint_date": "2018-07-05", 
  "public_notes": [
    {
      "source": "arXiv", 
      "value": "18 pages (plus appendices), 1 figure"
    }
  ], 
  "report_numbers": missing due to problem with merger
  "titles": [
    {
      "source": "arXiv", 
      "title": "Measuring stochastic gravitational-wave energy beyond general relativity"
    }
  ]
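Spotting the differences between the two records by eye is tedious. A field-level report along these lines could help; it is a sketch only, the helper `diff_fields` is hypothetical, and the two dicts are shortened excerpts of the `metadata` shown above.

```python
def diff_fields(old, new, ignore=()):
    """Report top-level fields that were added, removed or changed."""
    report = {}
    for key in sorted(set(old) | set(new)):
        if key in ignore:
            continue
        if key not in new:
            report[key] = "removed"
        elif key not in old:
            report[key] = "added"
        elif old[key] != new[key]:
            report[key] = "changed"
    return report


LICENSE_URL = "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"
NAME = "arXiv nonexclusive-distrib 1.0"
TITLE = "Measuring stochastic gravitational-wave energy beyond general relativity"

# Excerpts only: three of the fields from the records in this issue.
excerpt_v1 = {
    "license": [{"license": NAME, "material": "preprint", "url": LICENSE_URL}],
    "report_numbers": [{"source": "arXiv", "value": "LIGO-P1700234"}],
    "titles": [{"source": "arXiv", "title": TITLE}],
}
excerpt_v2 = {
    "license": [
        {"license": NAME, "material": "preprint", "url": LICENSE_URL},
        {"license": NAME, "url": LICENSE_URL},
    ],
    "titles": [{"source": "arXiv", "title": TITLE}],
}

print(diff_fields(excerpt_v1, excerpt_v2))
# {'license': 'changed', 'report_numbers': 'removed'}
```

For these excerpts the report flags the duplicated license entry and the missing report number, while the unchanged title is not listed.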

merged_record.V2

I assume this is additional information from enrichment and automerge.
Difficult to say since the steps in between are not accessible to me.

    "ids": [
      {
        "schema": "INSPIRE BAI", 
        "value": "M.Isi.1"
      }
    ], 
    "record": {
      "$ref": "http://labs.inspirehep.net/api/authors/1275240"
    }, 
    "signature_block": "ISm", 
    "uuid": "3ec51e6f-56c3-4a36-82fe-bdf56e91afd0"

    "ids": [
      {
        "schema": "INSPIRE BAI", 
        "value": "L.C.Stein.2"
      }
    ], 
    "record": {
      "$ref": "http://labs.inspirehep.net/api/authors/1056947"
    }, 
    "signature_block": "STANl", 
    "uuid": "34cd70d9-c68b-4f64-b77e-29240ecb0120"

"citeable": true, 
"control_number": 1681259, 
"core": true, 
"curated": false, 
"document_type": [
  "article"
], 
"inspire_categories": [
  {
    "source": "arxiv", 
    "term": "Gravitation and Cosmology"
  }, 
  {
    "source": "arxiv", 
    "term": "Astrophysics"
  }, 
  {
    "source": "arxiv", 
    "term": "Theory-HEP"
  }
], 
"legacy_creation_date": "2018-07-09", 
"number_of_pages": 18, 
"self": {
  "$ref": "http://labs.inspirehep.net/api/literature/1681259"
}, 
"texkeys": [
  "Isi:2018miq"
],

ksachs added this to the Ingestion tools in PROD milestone Jul 10, 2018
