Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

documents: Import documents from RERODOC #135

Merged
merged 1 commit into from
Feb 21, 2020
Merged

documents: Import documents from RERODOC #135

merged 1 commit into from
Feb 21, 2020

Conversation

sebdeleze
Copy link
Contributor

  • Harvests and imports data from RERODOC, based on defined sets.
  • Activates Celery beat for tasks processing.
  • Adds manual translations file.
  • Removes fake institutions and documents fixtures.
  • Logs errors in a file for tracking import failures.
  • Re-enables marshmallow serializers, closes Re-enable marshmallow checks #79.
  • Changes JSON schema properties for titles and abstracts.
  • Updates detail view.

Co-Authored-by: Sébastien Délèze [email protected]

@sebdeleze sebdeleze requested a review from jma February 7, 2020 13:56
scripts/server Outdated
@@ -22,7 +22,8 @@ script_path=$(dirname "$0")

export FLASK_ENV=development
# Start Worker and Server
pipenv run celery worker -A invenio_app.celery -l INFO & pid_celery=$!
pipenv run celery worker -A invenio_app.celery -l INFO --autoscale=8,1 --beat --without-heartbeat & pid_celery=$!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope you have a good machine as it can takes up to 8 cores...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected

scripts/setup Outdated

# Create document sample
if $local; then
curl -L -H 'Content-Type:application/json' --data-binary "@./data/complete_document_sample.json" -XPOST https://localhost:5000/api/documents/ --insecure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps the instance url should be an script option or at least the 5000 port.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


# If file with the same key exists and file size is the same as the
# registered file, we don't do anything
if key in self.files and kwargs.get('size',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the checksum instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


click.secho('Done', fg='green', nl=True)
def load_oai_configuration_json():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it hardcoded? Why not as input of the script or at least as an instance config variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

def create_institution(institution_key):
"""Create institution if not existing and return it.

:param str institution_key: Key (PID) of the institution.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returns is missing in the doc string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected

@@ -39,7 +39,8 @@ def extract_text_from_file(file):
"""Extract full-text from file."""
# Process pdf text extraction
text = subprocess.check_output(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you check if an error occurs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the call of this method is into a try/catch. I think it's not the responsability of this function to do this check.

</dt>
<dd class="col-sm-10">
{% if dict is string %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rename dict into dict_or_string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

</dt>
<dd class="col-sm-10">
{% if list is string %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list -> list_or_string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

def document_fixture(app, db, organization_fixture):
"""Create a document."""
def document_json_fixture(app, db, organization_fixture):
"""JSON docoument fixture."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spelling

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected


assert record.replace_refs().get('institution')['name'] == 'Università ' \
'della Svizzera italiana'
assert document_fixture.replace_refs().get(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be careful with the resolvers. Changing the name can cause a huge number of document reindexing...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, as discussed, I will do another PR to fine tune this resolvers and to prefer multiple call to backend for having the reference object information.

* Harvests and imports data from RERODOC, based on defined sets.
* Activates Celery beat for tasks processing.
* Adds manual translations file.
* Removes fake institutions and documents fixtures.
* Logs errors in a file for tracking import failures.
* Re-enables marshmallow serializers, closes #79.
* Changes JSON schema properties for titles and abstracts.
* Updates detail view.
* Closes #76.

Co-Authored-by: Sébastien Délèze <[email protected]>
@sebdeleze sebdeleze requested a review from jma February 19, 2020 12:56
@sebdeleze sebdeleze merged commit f1a419c into rero:dev Feb 21, 2020
@sebdeleze sebdeleze deleted the sed-rerodoc-harvester branch February 21, 2020 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants