-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
documents: Import documents from RERODOC #135
Conversation
scripts/server
Outdated
@@ -22,7 +22,8 @@ script_path=$(dirname "$0") | |||
|
|||
export FLASK_ENV=development | |||
# Start Worker and Server | |||
pipenv run celery worker -A invenio_app.celery -l INFO & pid_celery=$! | |||
pipenv run celery worker -A invenio_app.celery -l INFO --autoscale=8,1 --beat --without-heartbeat & pid_celery=$! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope you have a good machine as it can takes up to 8 cores...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corrected
scripts/setup
Outdated
|
||
# Create document sample | ||
if $local; then | ||
curl -L -H 'Content-Type:application/json' --data-binary "@./data/complete_document_sample.json" -XPOST https://localhost:5000/api/documents/ --insecure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps the instance url should be an script option or at least the 5000 port.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sonar/modules/api.py
Outdated
|
||
# If file with the same key exists and file size is the same as the | ||
# registered file, we don't do anything | ||
if key in self.files and kwargs.get('size', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use the checksum instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sonar/modules/documents/cli.py
Outdated
|
||
click.secho('Done', fg='green', nl=True) | ||
def load_oai_configuration_json(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it hardcoded? Why not as input of the script or at least as an instance config variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
def create_institution(institution_key): | ||
"""Create institution if not existing and return it. | ||
|
||
:param str institution_key: Key (PID) of the institution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returns is missing in the doc string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corrected
@@ -39,7 +39,8 @@ def extract_text_from_file(file): | |||
"""Extract full-text from file.""" | |||
# Process pdf text extraction | |||
text = subprocess.check_output( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you check if an error occurs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the call of this method is into a try/catch. I think it's not the responsability of this function to do this check.
</dt> | ||
<dd class="col-sm-10"> | ||
{% if dict is string %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rename dict into dict_or_string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
</dt> | ||
<dd class="col-sm-10"> | ||
{% if list is string %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list -> list_or_string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
tests/conftest.py
Outdated
def document_fixture(app, db, organization_fixture): | ||
"""Create a document.""" | ||
def document_json_fixture(app, db, organization_fixture): | ||
"""JSON docoument fixture.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spelling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected
|
||
assert record.replace_refs().get('institution')['name'] == 'Università ' \ | ||
'della Svizzera italiana' | ||
assert document_fixture.replace_refs().get( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
be careful with the resolvers. Changing the name can cause a huge number of document reindexing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, as discussed, I will do another PR to fine tune this resolvers and to prefer multiple call to backend for having the reference object information.
* Harvests and imports data from RERODOC, based on defined sets. * Activates Celery beat for tasks processing. * Adds manual translations file. * Removes fake institutions and documents fixtures. * Logs errors in a file for tracking import failures. * Re-enables marshmallow serializers, closes #79. * Changes JSON schema properties for titles and abstracts. * Updates detail view. * Closes #76. Co-Authored-by: Sébastien Délèze <[email protected]>
Co-Authored-by: Sébastien Délèze [email protected]