documents: Import documents from RERODOC #135

sebdeleze · 2020-02-07T11:01:55Z

Harvests and imports data from RERODOC, based on defined sets.
Activates Celery beat for tasks processing.
Adds manual translations file.
Removes fake institutions and documents fixtures.
Logs errors in a file for tracking import failures.
Re-enables marshmallow serializers, closes #79.
Changes JSON schema properties for titles and abstracts.
Updates detail view.

Co-Authored-by: Sébastien Délèze [email protected]

jma · 2020-02-18T08:02:13Z

scripts/server

@@ -22,7 +22,8 @@ script_path=$(dirname "$0")

 export FLASK_ENV=development
 # Start Worker and Server
-pipenv run celery worker -A invenio_app.celery -l INFO & pid_celery=$!
+pipenv run celery worker -A invenio_app.celery -l INFO --autoscale=8,1 --beat --without-heartbeat & pid_celery=$!


I hope you have a good machine, as it can take up to 8 cores...

jma · 2020-02-18T08:03:40Z

scripts/setup

+
+# Create document sample
+if $local; then
+    curl -L -H 'Content-Type:application/json' --data-binary "@./data/complete_document_sample.json" -XPOST https://localhost:5000/api/documents/ --insecure


Perhaps the instance URL should be a script option, or at least the port (5000).

jma · 2020-02-18T08:08:49Z

sonar/modules/api.py

+
+        # If file with the same key exists and file size is the same as the
+        # registered file, we don't do anything
+        if key in self.files and kwargs.get('size',


Can you use the checksum instead?
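
A minimal sketch of what a checksum-based comparison could look like. The helper names `compute_md5_checksum` and `is_same_file` are hypothetical, and the `md5:<hexdigest>` format of the registered checksum is an assumption about how the file storage records it; the actual attribute to compare against is not shown in this diff.

```python
import hashlib
import io


def compute_md5_checksum(stream, chunk_size=8192):
    """Return 'md5:<hexdigest>' for a binary stream (hypothetical helper).

    The stream is read in chunks and then moved back to the position it
    had before the call, so the caller can still save it afterwards.
    """
    start = stream.tell()
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b''):
        digest.update(chunk)
    stream.seek(start)
    return 'md5:' + digest.hexdigest()


def is_same_file(registered_checksum, stream):
    """Tell whether the stream content matches the registered checksum.

    Assumption: the registered checksum is stored as 'md5:<hexdigest>'.
    """
    return registered_checksum == compute_md5_checksum(stream)


if __name__ == '__main__':
    content = io.BytesIO(b'hello')
    registered = compute_md5_checksum(content)
    print(is_same_file(registered, content))
```

Unlike a size comparison, this also detects a file whose content changed while its length stayed the same, at the cost of reading the whole stream once.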

jma · 2020-02-18T08:12:30Z

sonar/modules/documents/cli.py


-        click.secho('Done', fg='green', nl=True)
+def load_oai_configuration_json():


Why is it hardcoded? Why not pass it as an input to the script, or at least as an instance config variable?
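
One possible shape for this, as a sketch only. It assumes the hardcoded value is the path to a JSON file, which the diff excerpt does not show. The function name `load_oai_configuration`, the `override_path` argument and the config key `SONAR_DOCUMENTS_OAI_CONFIG_FILE` are all hypothetical; `config` stands for any mapping, such as the application's instance configuration.

```python
import json

# Hypothetical config key; not taken from the actual code base.
CONFIG_KEY = 'SONAR_DOCUMENTS_OAI_CONFIG_FILE'


def load_oai_configuration(config, override_path=None):
    """Load the OAI configuration from a JSON file.

    The path comes from `override_path` (for example a CLI argument) when
    given, and otherwise from the instance configuration mapping.
    """
    path = override_path or config.get(CONFIG_KEY)
    if not path:
        raise ValueError(
            'No OAI configuration file given: pass a path or set '
            + CONFIG_KEY
        )
    with open(path) as json_file:
        return json.load(json_file)
```

In a click command, `override_path` could be fed from an optional argument, so that the instance config value acts as the default.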

jma · 2020-02-18T08:17:17Z

sonar/modules/documents/dojson/utils.py

+    def create_institution(institution_key):
+        """Create institution if not existing and return it.
+
+        :param str institution_key: Key (PID) of the institution.


The returns field is missing from the docstring.

jma · 2020-02-18T09:27:09Z

sonar/modules/pdf_extractor/utils.py

@@ -39,7 +39,8 @@ def extract_text_from_file(file):
    """Extract full-text from file."""
    # Process pdf text extraction
    text = subprocess.check_output(


Do you check if an error occurs?

Yes, this method is called inside a try/except block. I don't think it is this function's responsibility to do this check.
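
For illustration, the split of responsibilities described here could look like the sketch below. The function names and the command passed in are hypothetical and are not the project's actual extraction call; the only library behavior relied on is that `subprocess.check_output` raises `CalledProcessError` on a non-zero exit status and `OSError` when the executable cannot be started.

```python
import logging
import subprocess
import sys

logger = logging.getLogger(__name__)


def run_extraction(command):
    """Run an extraction command and return its output as text.

    Errors are deliberately not handled here; they propagate to the caller.
    """
    return subprocess.check_output(command).decode('utf-8')


def extract_or_none(command):
    """Caller-side handling: log the failure and return None."""
    try:
        return run_extraction(command)
    except (subprocess.CalledProcessError, OSError) as error:
        logger.error('Text extraction failed: %s', error)
        return None


if __name__ == '__main__':
    # Stand-in command using the current Python interpreter.
    print(extract_or_none([sys.executable, '-c', 'print("ok")']))
```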

jma · 2020-02-18T09:28:46Z

sonar/theme/templates/sonar/macros/macro.html

+    </dt>
+    <dd class="col-sm-10">
+      {% if dict is string %}


Can you rename dict to dict_or_string?

jma · 2020-02-18T09:29:08Z

sonar/theme/templates/sonar/macros/macro.html

+    </dt>
+    <dd class="col-sm-10">
+      {% if list is string %}


list -> list_or_string

jma · 2020-02-18T09:29:38Z

tests/conftest.py

-def document_fixture(app, db, organization_fixture):
-    """Create a document."""
+def document_json_fixture(app, db, organization_fixture):
+    """JSON docoument fixture."""


jma · 2020-02-18T10:02:09Z

tests/ui/institutions/test_jsonresolvers.py


-    assert record.replace_refs().get('institution')['name'] == 'Università ' \
-        'della Svizzera italiana'
+    assert document_fixture.replace_refs().get(


Be careful with the resolvers. Changing the name can cause a huge number of documents to be reindexed...

OK, as discussed, I will open another PR to fine-tune these resolvers and to prefer multiple calls to the backend to get the referenced object's information.

* Harvests and imports data from RERODOC, based on defined sets.
* Activates Celery beat for tasks processing.
* Adds manual translations file.
* Removes fake institutions and documents fixtures.
* Logs errors in a file for tracking import failures.
* Re-enables marshmallow serializers, closes #79.
* Changes JSON schema properties for titles and abstracts.
* Updates detail view.
* Closes #76.

Co-Authored-by: Sébastien Délèze <[email protected]>

sebdeleze requested a review from jma February 7, 2020 13:56

jma requested changes Feb 18, 2020

sebdeleze requested a review from jma February 19, 2020 12:56

jma approved these changes Feb 20, 2020

sebdeleze merged commit f1a419c into rero:dev Feb 21, 2020

sebdeleze deleted the sed-rerodoc-harvester branch February 21, 2020 09:45
