Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for .xlsx upload #233

Closed
46 of 48 tasks
Don-Isdale opened this issue Mar 26, 2021 · 5 comments
Closed
46 of 48 tasks

support for .xlsx upload #233

Don-Isdale opened this issue Mar 26, 2021 · 5 comments
Assignees

Comments

@Don-Isdale
Copy link
Collaborator

Don-Isdale commented Mar 26, 2021

Introduction

Accepting .xlsx files via upload will take Pretzel closer to the working environment of the users, and reduce friction for them.

This task will result in 1 or more spreadsheet templates which are designed to contain the information required for a Pretzel dataset upload. Example spreadsheet templates : genetic / linkage map, QTLs, SNP list.

The node.js server will call out to bash scripts which are already prototyped, which convert .xlsx -> CSV -> Pretzel JSON

The template will contain a worksheet with general instructions on use and guidelines for data formats and naming conventions.
The column headers will guide the users to meet the format requirements.

It is important that format errors are reported back to the user and include information such as

  • the location of the error (worksheet name, possibly row/col),
  • some text which includes or helps to identify the error,
  • a reference to guidance about the specific format requirement - online help or the guide in the template.

The spreadsheet can contain multiple datasets, each in a single worksheet. (a future feature might enable 1 block per worksheet, or per-spreadsheet file using api/Dataset/blockFeaturesAdd to make it easier for users to handle larger datasets).

The first worksheet can contain metadata, possibly in a 2D table to contain the metadata for each worksheet/dataset. There may also be some metadata which is common to all the datasets in a file.
A separate worksheet, possibly last, will contain guidance / documentation for the users.
Data-checking functions can be built into the spreadsheet.

A common cause of difficulty for users is duplicate dataset names caused by repeated upload.
This function should probably delete the existing dataset if the spreadsheet is uploaded again (using the API function which also deletes blocks and features). The API parameters could include a flag to enable this, which could be a checkbox in the upload GUI; users may sometimes want to not overwrite work they have done previously. - (for discussion)
This will significantly streamline the process for experienced users also, because finding the correct dataset to delete in 'All Datasets' takes extra time before each upload, which may be repeated several times as the data is refined.

The existing file upload should suffice for .xlsx in the first implementation. A good option to follow that would be addition of drag & drop to the file upload, for a further improvement in UX flow.
There are a number of options; a quick scan suggests github.com/adopted-ember-addons/ember-file-upload as a first choice. More options at emberObserver : file-upload and drag-and-drop.

In a later stage, an export function can also be added to output a Pretzel dataset in this Pretzel standard spreadsheet format.
This could be used for data exchange between users of different Pretzel instances; JSON format could also be used, but the spreadsheet format is more familiar and useful for most people.
This is probably easiest to do via an API, using similar tools (ssconvert, perl/jq).

These earlier notes, following, will be developed in discussion to refine the requirements and select the initial feature set.


(notes from 2021Mar20 :)

-	add some spreadsheet templates for GMs etc :
file name : LinkageMap_....xlsx		(optional : project, date, Variety1xVariety2 or Variety1_Variety2_Variety3_...)
first worksheet : metadata
worksheet : Variety1 x Variety2		(The worksheet name is used for the Dataset name)
meta	name	value (try json in MS Excel)
name	chr	pos
Marker	Chromosome	Position
Chromosome	Marker	Position
No outside " required; there may be spaces within names. The script will strip off outer quotes and spaces.
There may be punctuation in Marker and Chromosome names.

Marker : names should match names used in other datasets, so that they can be aligned.
Chromosome : leading 'chr' is not required and will be stripped off.
Position : will be interpreted as a number.

The script assumes that Markers / SNPs for a single chromosome are grouped together, not intermingled.
The order in which chromosomes are given is the order they will be displayed in Pretzel in the Dataset Explorer (left panel).
Within a chromosome,  Markers / SNPs may be unsorted; they will be displayed on the Pretzel axis in Position order.


Example of using the script

Overview of MVP

  • in upload panel : use file selector to select .xlsx, and submit
  • file name is e.g. Linkage_Map_myMaps.xlsx
  • .xltx template includes Metadata worksheet which contains namespace, commonName
  • each map is in a separate worksheet, named 'Variety1 x Variety2', which is the datasetName
  • if dataset with that name exists, remove it before inserting the new dataset; if the user has permission to do so

Options subsequent to MVP

  • support file name SNP_List_mySNPS.xlsx
  • a drag & drop zone can be added to the file selector

Initial outline of sub-tasks

  • add .xlsx as a recognised file type to the API /api/Dataset/ (similar interface to /upload, perhaps added as another file type option in upload(), or possibly a distinct URL. .zip can be handled on the script side) (099fe0c)
    • [2-4H/2H] handle .xlsx in .upload() (099fe0c)
    • [1-2H/1H] complete the error handling; pass back error messages
      Show status on GUI after completion : done, or error. (547ed1c)
    • [1-2H] GUI : identify .xlsx as an option (2134b13)
  • add functions to handle this by calling out to script files packaged with the server
    • [2-4H] call out to script, passing received file and params (099fe0c)
      • [2-4H] single input file, so can perhaps be streamed to script or written to a file
        (tried streaming - the server was getting pipe error when writing - apparently the sub-process has closed its input; can revisit this later, not essential for this task - used a file instead).
    • [2-4H] include scripts folder in server directory
      Done, but the installation process on the server is yet to be tested.
      • [/4H] handled running ssconvert within docker, which required additional handling for spaces in fileName. cwd is pretzel/, de455bf
  • use ssconvert for .xlsx -> csv (sudo apt install gnumeric). An alternative is libreoffice --headless --convert-to csv, but this doesn't seem to handle multiple worksheets within the .xlsx. ssconvert handles other spreadsheet formats such as .ods, so this could be supported as long as the template maps to the same csv. An option which may not be in the initial implementation is support for compressed files (gzip, .zip).
  • the script uses snps2Dataset.pl to convert csv -> JSON.
    It would be possible to drive this from javascript within node.js, but supporting bash scripts makes this easier.
  • process the resulting JSON using existing functions
    • [2-4H] read and parse the result JSON files. Because there can be multiple, perhaps use a work dir to hold the results, and rm them after parsing. could re-use existing functionality by configuring the script to call back in via API.
      (look at options for streaming parsing of JSON : stream-json JSONStream )
      a2b6720

Later options

  • drag & drop

    • [2-8H/3H] add a drop zone (2134b13)
      • [/0.5H] add toggle replaceDataset (aed3ee1)
    • [4-16H/0.6H] (wide estimates because incorporating new modules) receive files and forward to API (2134b13)
  • add support for SNP List (da49ba4)

  • add support for QTL, Genome
    ( 756f60e : add QTL import; this is a first pass, will evolve.)

  • more error-checking

    • [/2H] check that column order is not changed from template, (304cafb : handle column order variation)
    • [/4H] detect incomplete / malformed spreadsheets and displays errors in GUI :
      report errors for : missing columns, no dataset worksheet. Handle absence of parentName and shortName in Metadata.
      (3918a75)
    • recognise Template in worksheet name, and report as a warning to user (b45cfb4)
    • [/2H] c19d31d : add file descriptors for communicating more error and warning details back to server
    • additional error information :
      • [1H] if generated JSON is 0 size
      • [2H] show context of error if JSON parse error at a position. 840c8d8
        • [1H/1.5H] show the message in a textarea so it is not folded.
          c99d41e: upload spreadsheet : show multiline error message in a text area so newlines are shown.
  • add .zip support, also .gz.

  • different metadata for worksheets within a file (cfb8eaf 3, also noted in later comment)

  • [/4H]record the original Chromosome name in block.meta when applying 'Chromosome Renaming' (5e83c05)


Issues

  • [2-16H / 5H] after deleting, then uploading, then deleting, 2nd delete does not work, and get error in FE and BE.
    ( guess one of the frameworks has a stale dataset handle when the 2nd delete happens)
    ... Uncaught Error: Attempted to handle event deleteRecord on <dataset:Red x Blue> while in state root.deleted.saved. '
    Testing confirmed that this also applies to the previous upload facility - JSON via file selector, so it is not just related to drag&drop or xlsx upload; possibly caused by framework changes from Ember 2 -> 3.
    Solved in bdf429b.

  • [1-2H/0.2H] (refn : support for .xlsx upload #233) colon in meta value : conversion is wrapping with "", then another "" is added in the script
    e.g. in JSON : "meta" : {"DOI" : ""DOI: 10.1... -z""
    02231b6: upload spreadsheet : handle punctuation in meta values

Current Items

branch : feature/qtlUpload

  • [1-2H/4H] Does not allow blank Start/End fields, even when flanking marker field is defined. 02dba91
  • [1-4H/7.5H] Consider what happens if the flanking marker is not defined in the parent. 77969f1
    • [/1H] 797ba1e: QTL upload : require that all FMs of each feature are found.
  • [2-3H/0H] Should allow a single marker, not necessarily two markers (how will this be displayed considering we assume intervals so far).
    this is handled.
    • [/7H] QTL representation : show as a diamond if rect height would be hard to see. ad0e51f
  • [1-3H/6H] should not get dialog (when parent is already in view) and exception when adding QTL dataset
    018d0ba: use parentBlock instead of referenceBlock if the parent has a parent.
    • [3-5H/4H] instead of requesting all features of the parent block, could request just the features by name, because the number of flankingMarkers is probably not large.
      fd6965d: QTL value : request just the required flanking markers by name, instead of .allFeatures of the parent.
      53214ec : QTL value : finish change in preceding commit.
  • [1-3H/3H] change name of datasets created by QTL upload : change - to the datasetName defined in the worksheet name, with -parentName appended
    ddddc38, de8ebf5 : dnaSequenceSearch : sanitize datasetId, and use it directly if no .dbName
    5bb5ab3 : API blockFeaturesCounts : handle param interval reversed
    a9f675e : Handle alternate result type from blockFeaturesCounts() when interval not known
    8225929 : spreadsheet upload : don't wrap Feature .value[] numbers in quotes

1ad462f : use worksheetname for dataset if parentName given in Metadata instead of column

  • [1-2H/1H] support parent name in Metadata worksheet as an alternative to parent column in QTL worksheet
    1ad462f : QTL spreadsheet upload : support parentName in Metadata worksheet as an alternative to parentName column in QTL worksheet;

  • [1H/0H] handle absence of 'Chromosome Renaming' worksheet
    Handled this sheet being empty in 8225929, and that has also addressed this item.

from testing 2021Sep14 8:30pm reported in slack :

  • [1-2H/0H] check brushing of parent of QTL
    tested OK, after e18c87f.

  • [1-2H/6H] unview parent if it was viewed via viewing the QTL. 9ba3e3f

  • [1-3H/3H] update issues after upload, tab refresh clears the problem
    (address what can be done in this time frame; further changes may be involved)
    e18c87f : after QTL upload, enable QTL to be re-viewed

  • [1-2H/0.5H] QTL name if parentName column : dataset-parentName. 7c1ccfe

    • [/2H] QTL spreadsheet upload : change QTL dataset name. 3cfefaba
    • [/0.5H] 752580e : QTL spreadsheet upload : don't include currentTrait in dataset name or meta.shortName
  • [/1H] 5e1e2ca: QTL spreadsheet upload : handle and report block.scope not matched in dataset.parent

  • [/1H] a01e471: spreadsheet upload : remove non-ascii chars on the outside of values

  • [1-2H] report error in GUI if Metadata is not given for QTL worksheet

  • [/5H] QTL spreadsheet upload : sort by parentName (if defined) then chr column. 445eab2

  • [/1H] QTL spreadsheet upload : when matching parent of QTL, exclude datasets which are copied from another server. 3a82877 (bc7d4a4)


after 2.11.0

  • [1-2H/1H] if the parentName is not in the db, report error to frontend GUI. ab40144

  • [1-2H/3H] the citation is truncated in the Metadata worksheet
    (the citation column of the dataset worksheet is OK)
    067719a: QTL spreadsheet upload : don't split on comma within quoted cell value in the Metadata worksheet

  • [4-8H] upload : convert to ascii : unicode punctuation which has a ascii equivalent
    This is not limited to spreadsheet upload; can apply to CSV, table and JSON also,
    refn:
    https://stackoverflow.com/questions/4808967/replacing-unicode-punctuation-with-ascii-approximations

[1-2H/2.5H] sort of parentName failed when it was in a column further right, so there is a column id issue probably comma or punctuation in headers
0e10d22 : handle comma in colums prior to the parentName column

7bee892 : QTL spreadsheet upload : avoid extra output to stdout which was obstructing the error message display

@Don-Isdale Don-Isdale self-assigned this Mar 26, 2021
Don-Isdale added a commit that referenced this issue Mar 31, 2021
start of #233 : support for .xlsx upload
dataset.js : Dataset.upload() : add .xlsx support.
add uploadSpreadsheet.bash : with functions  deletePunctuation(), readMetadata(), fileName2DatasetName() (based on functions_convert.bash), linkageMap(), and case $fileName.

snps2Dataset.pl : headerLine() : add recognition of name,chr,pos (used in template)
 bb64502  9462 Mar 30 21:44 backend/common/models/dataset.js
 d90b20d  2053 Mar 30 22:01 backend/scripts/uploadSpreadsheet.bash
 57e2efa 12961 Mar 30 22:03 resources/tools/dev/snps2Dataset.pl
Don-Isdale added a commit that referenced this issue Apr 1, 2021
…, insert to db. Handle undefined .axis1d; change GM scope to name.

part of #233

dataset.js :
factor to form uploadParsedTry()
upload() : add cbWrap(), loadAfterDelete(), removeExisting
before delete : add where._id as a source of datasetId.

uploadSpreadsheet.bash : linkageMap() : echo fileName and datasetName to stdout

Having 2 GM blocks with same scope caused 1 block to not have an axis1d; the following changes handle that :

axis-ticks-selected.js : renderTicks() : split get(axis1d.featureTicks) to axis1d.get(featureTicks) - since update to Ember 3, .get() doesn't seem to do .get() on sub-fields. Handle clickedFeaturesByAxis being undefined.
renderLabels() : split get(axis1d.featureTicks) as above.
featuresOfBlockLookup() : handle featuresOfBlockLookup being undefined.
selectedFeaturesOfBlockLookup() : handle get(... listName + 'ByBlock') being undefined.

block-adj.js : zoomedDomains() and domains()  : change mapBy to map, to handle axis1d being undefined.
receivedAllCheck() : handle undefined values in axes1d.

block.js : getSummary() : check if interval is empty (caused by data block or GM with only a single feature position), and return [] instead of sending API request (0-length interval is not handled - server crash).

stacks.js : getAxis1d() : use Ember.set for .axis1d (to avoid Ember warning : it is being tracked by a tracking context ... must invalidate the property when updating it).

snps2Dataset.pl : snpLine() : Set scope to be the chr name, without trimming off .[1-9]; this will ensure each GM block gets its own axis.

backend/ :
 b599b60 12798 Mar 31 18:22  common/models/dataset.js
 d999b3d  2216 Mar 31 16:10  scripts/uploadSpreadsheet.bash
frontend/app/ :
 b11bc5e  9320 Mar 31 20:14  components/draw/axis-ticks-selected.js
 17feed7 23302 Mar 31 19:35  models/block-adj.js
 49ac15b 49136 Mar 31 18:59  services/data/block.js
 d49b1a6 77055 Mar 31 20:32  utils/stacks.js
 8cf6fe2 13066 Apr  1 11:21 resources/tools/dev/snps2Dataset.pl
Don-Isdale added a commit that referenced this issue Apr 5, 2021
uploadSpreadsheet.bash : fileName2DatasetName() : handle filename containing SNP_List_ and worksheet name prefixed with SNP.  Add snpList().
file-drop-zone.hbs : minor label text change.
part of #233.
 ff5a1d1 3211 Apr  1 22:00 backend/scripts/uploadSpreadsheet.bash
 8b813cb 1911 Apr  1 21:54 frontend/app/templates/components/panel/upload/file-drop-zone.hbs
@Don-Isdale
Copy link
Collaborator Author

Don-Isdale commented Apr 8, 2021

Spreadsheet Template Format

  • Able to combine in a single file all the different worksheet types relating to a specific species / variety / project.
    This enables related data to be organised together and ensures that the version of datasets in the worksheets are synchronised.

Worksheet name identifies the type and name of the dataset which it contains.
The spreadsheet file names are arbitrary, not limited or interpreted by the Pretzel application.


Dataset worksheets :

  • Map|
  • SNP|
  • Alignment|
  • QTL|
  • Genome|

Vertical bar | is used as the separator between the worksheet dataset type label and the dataset name.
The dataset name appears after the label, e.g. 'Map| Red x Blue' (outside spaces will be trimmed)

There are 2 distinct worksheet types for SNP (which has a 1 bp position) and Alignment (which has Start / End), e.g. the alignment of probes (for SNPs) to a reference assembly.

Non-dataset worksheets :

  • Metadata
  • Chromosome Renaming
  • Chromosomes to Omit

These may contain comments : # comment (in column 1)
The comments in the provided templates will guide the user through populating the spreadsheet with data, explaining the format and field types. URLs in the comments will refer to further explanation in the user guide, and to the folder in the github repository where the user may download templates to start a new file.


Example worksheet 'Metadata' :

# Add columns for each dataset in Workbook

Field Alignment|EST_SNP
commonName Lentil
parentName Lens_culinaris_2.0
platform SNP_OPA
shortName SNP_OPA
  • extra field names other than these 4 will be placed in Dataset.meta.fieldName : value
    e.g. source, citation, notes, year, ...

Example worksheet 'Chromosome Renaming'

From To
Lcu.2RBY.Chr1 Lc1
Lcu.2RBY.Chr2 Lc2
Lcu.2RBY.Chr3 Lc3
Lcu.2RBY.Chr4 Lc4
Lcu.2RBY.Chr5 Lc5
Lcu.2RBY.Chr6 Lc6
Lcu.2RBY.Chr7 Lc7

Example worksheet 'Chromosomes to Omit'

Lcu.2RBY.unitig

Column names of the dataset worksheet types :

  • Map:
Marker Chromosome Position
  • SNP:
Name Chromosome Position
  • Alignment :
Name Chromosome Start End
  • Genome :
Chromosome Start End

Additional columns other than these fields : values are placed in Feature .values.fieldName, or for Genome : Block.meta.fieldName


Additional worksheets which don't match the names defined above are ignored - this enables users to keep additional information and data preparation worksheets in the same file.

@Don-Isdale
Copy link
Collaborator Author

Don-Isdale commented Apr 9, 2021

implementing the template format described in the above comment :

  • [1H/1H] recognise worksheet name labels, and not file name labels (0bfaee7)
  • Metadata :
    • [2H] 2D table : handle multiple columns (2H cfb8eaf : done the bash part)
    • [2H/2.5H] extra field names other than these 4 will be placed in Dataset.meta.fieldName : value (c4331d2)
  • [3H/2H] Additional columns other than these fields : values are placed in Feature .values.fieldName, (45e8941)
    • or for Genome : Block.meta.fieldName
  • [0.5H/0.5H] filter out # comments (0bfaee7)

Don-Isdale added a commit that referenced this issue Apr 9, 2021
…not file name

uploadSpreadsheet.bash : readMetadata() : filter out comments (doesn't handle 2D table yet).
fileName2DatasetName() : trim worksheet label up to |
columnsKeyStringPrepare() : map column names Start,End to pos,end.
linkageMap() : match worksheet by label Map| instead of name  x
snpList() : match worksheet by label Alignment| instead of SNP .
chrRenamePrepare() : wrap use of chrRenameSed with quotes because it is based on fileName which may contain whitespace.
main loop : drop case fileName in - no longer matching on file name, only worksheet label.
part of #233, changes described in comment : Spreadsheet Template Format
 7e7e399 8497 Apr  9 13:00 backend/scripts/uploadSpreadsheet.bash
@Don-Isdale
Copy link
Collaborator Author

Don-Isdale commented May 18, 2021

issues reported in testing :

  • max length of sheet names
    can't set the dataset name needed
    that is a hard MS limitation ..
    may need to put a long name in meta or something

@gabrielkg
Copy link
Member

Testing QTL upload on dev v2.10.0+83031857:

  • Does not allow blank Start/End fields, even when flanking marker field is defined
  • Consider what happens if the flanking marker is not defined in the parent
  • Should allow a single marker, not necessarily two markers (how will this be displayed considering we assume intervals so far)

Adding Start position, the file was uploaded successfully. Then attempting to load the dataset:

  • Pop-up explaining that parent dataset needs to be loaded appears even when parent already in the view, so dataset could not be loaded
  • (Note: After the above error, it appeared to break brushing capability.)

@gabrielkg
Copy link
Member

Testing QTL upload on dev v2.10.0+9ba3e3f6:

Issues noted above are fixed.

It was noted that if the QTLs in the Excel sheet are ordered by eg: trait, and different traits appear on the same chromosome, the result is multiple blocks created with the same scope.

This was referenced Sep 27, 2021
Don-Isdale added a commit that referenced this issue Oct 1, 2021
…tics

PR #269 : Master Release v2.11.1
branch feature/qtlUpload
 #233 : error message for unmatched parentName.  comma in Metadata worksheet (e.g. citation).
 #197 : update Trait table when QTL block is unviewed
 #216 : explorer : Recent / Favourites
 #257 : axis brush single-value GM without overlap or contain
@Don-Isdale Don-Isdale mentioned this issue Oct 6, 2021
21 tasks
@Don-Isdale Don-Isdale mentioned this issue Jan 18, 2022
14 tasks
@github-project-automation github-project-automation bot moved this to done in Pretzel Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: done
Development

No branches or pull requests

2 participants