-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support for .xlsx upload #233
Comments
start of #233 : support for .xlsx upload dataset.js : Dataset.upload() : add .xlsx support. add uploadSpreadsheet.bash : with functions deletePunctuation(), readMetadata(), fileName2DatasetName() (based on functions_convert.bash), linkageMap(), and case $fileName. snps2Dataset.pl : headerLine() : add recognition of name,chr,pos (used in template) bb64502 9462 Mar 30 21:44 backend/common/models/dataset.js d90b20d 2053 Mar 30 22:01 backend/scripts/uploadSpreadsheet.bash 57e2efa 12961 Mar 30 22:03 resources/tools/dev/snps2Dataset.pl
…, insert to db. Handle undefined .axis1d; change GM scope to name. part of #233 dataset.js : factor to form uploadParsedTry() upload() : add cbWrap(), loadAfterDelete(), removeExisting before delete : add where._id as a source of datasetId. uploadSpreadsheet.bash : linkageMap() : echo fileName and datasetName to stdout Having 2 GM blocks with same scope caused 1 block to not have an axis1d; the following changes handle that : axis-ticks-selected.js : renderTicks() : split get(axis1d.featureTicks) to axis1d.get(featureTicks) - since update to Ember 3, .get() doesn't seem to do .get() on sub-fields. Handle clickedFeaturesByAxis being undefined. renderLabels() : split get(axis1d.featureTicks) as above. featuresOfBlockLookup() : handle featuresOfBlockLookup being undefined. selectedFeaturesOfBlockLookup() : handle get(... listName + 'ByBlock') being undefined. block-adj.js : zoomedDomains() and domains() : change mapBy to map, to handle axis1d being undefined. receivedAllCheck() : handle undefined values in axes1d. block.js : getSummary() : check if interval is empty (caused by data block or GM with only a single feature position), and return [] instead of sending API request (0-length interval is not handled - server crash). stacks.js : getAxis1d() : use Ember.set for .axis1d (to avoid Ember warning : it is being tracked by a tracking context ... must invalidate the property when updating it). snps2Dataset.pl : snpLine() : Set scope to be the chr name, without trimming off .[1-9]; this will ensure each GM block gets its own axis. backend/ : b599b60 12798 Mar 31 18:22 common/models/dataset.js d999b3d 2216 Mar 31 16:10 scripts/uploadSpreadsheet.bash frontend/app/ : b11bc5e 9320 Mar 31 20:14 components/draw/axis-ticks-selected.js 17feed7 23302 Mar 31 19:35 models/block-adj.js 49ac15b 49136 Mar 31 18:59 services/data/block.js d49b1a6 77055 Mar 31 20:32 utils/stacks.js 8cf6fe2 13066 Apr 1 11:21 resources/tools/dev/snps2Dataset.pl
uploadSpreadsheet.bash : fileName2DatasetName() : handle filename containing SNP_List_ and worksheet name prefixed with SNP. Add snpList(). file-drop-zone.hbs : minor label text change. part of #233. ff5a1d1 3211 Apr 1 22:00 backend/scripts/uploadSpreadsheet.bash 8b813cb 1911 Apr 1 21:54 frontend/app/templates/components/panel/upload/file-drop-zone.hbs
Spreadsheet Template Format
Worksheet name identifies the type and name of the dataset which it contains. Dataset worksheets :
Vertical bar There are 2 distinct worksheet types for SNP (which has a 1 bp position) and Alignment (which has Start / End), e.g. the alignment of probes (for SNPs) to a reference assembly. Non-dataset worksheets :
These may contain comments : # comment (in column 1) Example worksheet 'Metadata' :# Add columns for each dataset in Workbook
Example worksheet 'Chromosome Renaming'
Example worksheet 'Chromosomes to Omit'
Column names of the dataset worksheet types :
Additional columns other than these fields : values are placed in Feature .values.fieldName, or for Genome : Block.meta.fieldName Additional worksheets which don't match the names defined above are ignored - this enables users to keep additional information and data preparation worksheets in the same file. |
implementing the template format described in the above comment :
|
…not file name uploadSpreadsheet.bash : readMetadata() : filter out comments (doesn't handle 2D table yet). fileName2DatasetName() : trim worksheet label up to | columnsKeyStringPrepare() : map column names Start,End to pos,end. linkageMap() : match worksheet by label Map| instead of name x snpList() : match worksheet by label Alignment| instead of SNP . chrRenamePrepare() : wrap use of chrRenameSed with quotes because it is based on fileName which may contain whitespace. main loop : drop case fileName in - no longer matching on file name, only worksheet label. part of #233, changes described in comment : Spreadsheet Template Format 7e7e399 8497 Apr 9 13:00 backend/scripts/uploadSpreadsheet.bash
issues reported in testing :
|
Testing QTL upload on dev v2.10.0+83031857:
Adding Start position, the file was uploaded successfully. Then attempting to load the dataset:
|
Testing QTL upload on dev v2.10.0+9ba3e3f6: Issues noted above are fixed. It was noted that if the QTLs in the Excel sheet are ordered by eg: trait, and different traits appear on the same chromosome, the result is multiple blocks created with the same scope. |
…tics PR #269 : Master Release v2.11.1 branch feature/qtlUpload #233 : error message for unmatched parentName. comma in Metadata worksheet (e.g. citation). #197 : update Trait table when QTL block is unviewed #216 : explorer : Recent / Favourites #257 : axis brush single-value GM without overlap or contain
Introduction
Accepting .xlsx files via upload will take Pretzel closer to the working environment of the users, and reduce friction for them.
This task will result in 1 or more spreadsheet templates which are designed to contain the information required for a Pretzel dataset upload. Example spreadsheet templates : genetic / linkage map, QTLs, SNP list.
The node.js server will call out to bash scripts which are already prototyped, which convert .xlsx -> CSV -> Pretzel JSON
The template will contain a worksheet with general instructions on use and guidelines for data formats and naming conventions.
The column headers will guide the users to meet the format requirements.
It is important that format errors are reported back to the user and include information such as
The spreadsheet can contain multiple datasets, each in a single worksheet. (a future feature might enable 1 block per worksheet, or per-spreadsheet file using api/Dataset/blockFeaturesAdd to make it easier for users to handle larger datasets).
The first worksheet can contain metadata, possibly in a 2D table to contain the metadata for each worksheet/dataset. There may also be some metadata which is common to all the datasets in a file.
A separate worksheet, possibly last, will contain guidance / documentation for the users.
Data-checking functions can be built into the spreadsheet.
A common cause of difficulty for users is duplicate dataset names caused by repeated upload.
This function should probably delete the existing dataset if the spreadsheet is uploaded again (using the API function which also deletes blocks and features). The API parameters could include a flag to enable this, which could be a checkbox in the upload GUI; users may sometimes want to not overwrite work they have done previously. - (for discussion)
This will significantly streamline the process for experienced users also, because finding the correct dataset to delete in 'All Datasets' takes extra time before each upload, which may be repeated several times as the data is refined.
The existing file upload should suffice for .xlsx in the first implementation. A good option to follow that would be addition of drag & drop to the file upload, for a further improvement in UX flow.
There are a number of options; a quick scan suggests github.com/adopted-ember-addons/ember-file-upload as a first choice. More options at emberObserver : file-upload and drag-and-drop.
In a later stage, an export function can also be added to output a Pretzel dataset in this Pretzel standard spreadsheet format.
This could be used for data exchange between users of different Pretzel instances; JSON format could also be used, but the spreadsheet format is more familiar and useful for most people.
This is probably easiest to do via an API, using similar tools (ssconvert, perl/jq).
These earlier notes, following, will be developed in discussion to refine the requirements and select the initial feature set.
(notes from 2021Mar20 :)
Overview of MVP
Options subsequent to MVP
Initial outline of sub-tasks
Show status on GUI after completion : done, or error. (547ed1c)
(tried streaming - the server was getting pipe error when writing - apparently the sub-process has closed its input; can revisit this later, not essential for this task - used a file instead).
Done, but the installation process on the server is yet to be tested.
It would be possible to drive this from javascript within node.js, but supporting bash scripts makes this easier.
(look at options for streaming parsing of JSON : stream-json JSONStream )
a2b6720
Later options
drag & drop
add support for SNP List (da49ba4)
add support for QTL, Genome
( 756f60e : add QTL import; this is a first pass, will evolve.)
more error-checking
report errors for : missing columns, no dataset worksheet. Handle absence of parentName and shortName in Metadata.
(3918a75)
c99d41e: upload spreadsheet : show multiline error message in a text area so newlines are shown.
add .zip support, also .gz.
different metadata for worksheets within a file (cfb8eaf 3, also noted in later comment)
[/4H]record the original Chromosome name in block.meta when applying 'Chromosome Renaming' (5e83c05)
Issues
( guess one of the frameworks has a stale dataset handle when the 2nd delete happens)
... Uncaught Error: Attempted to handle event
deleteRecord
on <dataset:Red x Blue> while in state root.deleted.saved. 'Testing confirmed that this also applies to the previous upload facility - JSON via file selector, so it is not just related to drag&drop or xlsx upload; possibly caused by framework changes from Ember 2 -> 3.
Solved in bdf429b.
e.g. in JSON : "meta" : {"DOI" : ""DOI: 10.1... -z""
02231b6: upload spreadsheet : handle punctuation in meta values
Current Items
branch : feature/qtlUpload
this is handled.
018d0ba: use parentBlock instead of referenceBlock if the parent has a parent.
fd6965d: QTL value : request just the required flanking markers by name, instead of .allFeatures of the parent.
53214ec : QTL value : finish change in preceding commit.
ddddc38, de8ebf5 : dnaSequenceSearch : sanitize datasetId, and use it directly if no .dbName
5bb5ab3 : API blockFeaturesCounts : handle param interval reversed
a9f675e : Handle alternate result type from blockFeaturesCounts() when interval not known
8225929 : spreadsheet upload : don't wrap Feature .value[] numbers in quotes
1ad462f : use worksheetname for dataset if parentName given in Metadata instead of column
[1-2H/1H] support parent name in Metadata worksheet as an alternative to parent column in QTL worksheet
1ad462f : QTL spreadsheet upload : support parentName in Metadata worksheet as an alternative to parentName column in QTL worksheet;
[1H/0H] handle absence of 'Chromosome Renaming' worksheet
Handled this sheet being empty in 8225929, and that has also addressed this item.
from testing 2021Sep14 8:30pm reported in slack :
[1-2H/0H] check brushing of parent of QTL
tested OK, after e18c87f.
[1-2H/6H] unview parent if it was viewed via viewing the QTL. 9ba3e3f
[1-3H/3H] update issues after upload, tab refresh clears the problem
(address what can be done in this time frame; further changes may be involved)
e18c87f : after QTL upload, enable QTL to be re-viewed
[1-2H/0.5H] QTL name if parentName column : dataset-parentName. 7c1ccfe
[/1H] 5e1e2ca: QTL spreadsheet upload : handle and report block.scope not matched in dataset.parent
[/1H] a01e471: spreadsheet upload : remove non-ascii chars on the outside of values
[1-2H] report error in GUI if Metadata is not given for QTL worksheet
[/5H] QTL spreadsheet upload : sort by parentName (if defined) then chr column. 445eab2
[/1H] QTL spreadsheet upload : when matching parent of QTL, exclude datasets which are copied from another server. 3a82877 (bc7d4a4)
after 2.11.0
[1-2H/1H] if the parentName is not in the db, report error to frontend GUI. ab40144
[1-2H/3H] the citation is truncated in the Metadata worksheet
(the citation column of the dataset worksheet is OK)
067719a: QTL spreadsheet upload : don't split on comma within quoted cell value in the Metadata worksheet
[4-8H] upload : convert to ascii : unicode punctuation which has a ascii equivalent
This is not limited to spreadsheet upload; can apply to CSV, table and JSON also,
refn:
https://stackoverflow.com/questions/4808967/replacing-unicode-punctuation-with-ascii-approximations
[1-2H/2.5H] sort of parentName failed when it was in a column further right, so there is a column id issue probably comma or punctuation in headers
0e10d22 : handle comma in colums prior to the parentName column
7bee892 : QTL spreadsheet upload : avoid extra output to stdout which was obstructing the error message display
The text was updated successfully, but these errors were encountered: