This repository has been archived by the owner on Jul 18, 2024. It is now read-only.

[DataCap Application] Kernelogic - Various open datasets onboarding #457

Closed
kernelogic opened this issue Jun 29, 2022 · 28 comments

Comments

@kernelogic

kernelogic commented Jun 29, 2022

Large Dataset Notary Application

To apply for DataCap to onboard your dataset to Filecoin, please fill out the following.

Core Information

  • Organization Name: Fei Yan - Kernelogic
  • Website / Social Media: https://slingshot.kernelogic.ca/ @feiya200
  • Total amount of DataCap being requested (between 500 TiB and 5 PiB): 5 PiB
  • Weekly allocation of DataCap requested (usually between 1-100TiB): 500 TiB
  • On-chain address for first allocation: f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Please respond to the questions below by replacing the text saying "Please answer here". Include as much detail as you can in your answer.

Project details

Share a brief history of your project and organization.

This will probably be my last individual Slingshot 2.x LDN. The Genomic Data Commons dataset has not yet been stored by many teams.

I have successfully completed several LDNs on other datasets, and my record shows that I have followed the decentralization rules with zero self-dealing.

https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/60
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/59
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/46
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/297
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/298
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/304

What is the primary source of funding for this project?

I hope to use this DataCap at no cost with active community providers on Slack, especially SPX / enterprise-sp-wg members.

What other projects/ecosystem stakeholders is this project associated with?

Slingshot, enterprise-sp-wg.

Use-case details

Describe the data being stored onto Filecoin
Archival data from the Genomic Data Commons.

Due to download restrictions on the original dataset, this LDN is now repurposed to store various AWS open datasets that qualify for the existing Slingshot 2 and the upcoming Slingshot 3.

Where was the data in this dataset sourced from?
https://portal.gdc.cancer.gov/

1. (New) Ford Multi-AV Seasonal Dataset
2. (New) Cancer Cell Line Encyclopedia (CCLE)
3. (New) Allen Brain Observatory - Visual Coding AWS Public Data Set
4. (Existing) Fly Brain Anatomy
5. (Existing) Foldingathome COVID-19
6. (Existing) NASANEX

Can you share a sample of the data? A link to a file, an image, a table, etc., are good ways to do this.
Raw sequence data, stored as BAM files, make up the bulk of data stored at the NCI Genomic Data Commons (GDC).
https://portal.gdc.cancer.gov/files/54ac0975-cce0-40a9-a557-9a1c938ce167

https://registry.opendata.aws/ford-multi-av-seasonal/
https://registry.opendata.aws/ccle/
https://registry.opendata.aws/allen-brain-observatory/
https://registry.opendata.aws/janelia-flylight/
https://registry.opendata.aws/foldingathome-covid19/
https://registry.opendata.aws/nasanex/

Confirm that this is a public dataset that can be retrieved by anyone on the Network (i.e., no specific permissions or access rights are required to view the data).

Curated dataset https://github.com/filecoin-project/slingshot/blob/master/datasets.md

What is the expected retrieval frequency for this data?

I expect some retrievals during the prize judging period, as well as from anyone interested in downloading this dataset.

For how long do you plan to keep this dataset stored on Filecoin?

Per Slingshot rules, a minimum of 1 year; most likely 520 days.
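
Deal durations on Filecoin are expressed in epochs rather than days, so the 1-year minimum and the likely 520-day duration can be converted as in the rough sketch below. It assumes the 30-second Filecoin epoch; the helper name `days_to_epochs` is made up for this illustration and is not part of any Filecoin tool.

```python
# Assumption: one Filecoin epoch lasts 30 seconds.
EPOCH_SECONDS = 30
EPOCHS_PER_DAY = 24 * 60 * 60 // EPOCH_SECONDS  # 2880 epochs per day


def days_to_epochs(days: int) -> int:
    """Convert a deal duration in whole days to a number of epochs."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return days * EPOCHS_PER_DAY


if __name__ == "__main__":
    for label, days in (("minimum (1 year)", 365), ("planned", 520)):
        print(f"{label}: {days} days = {days_to_epochs(days)} epochs")
```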

DataCap allocation plan

In which geographies (countries, regions) do you plan on making storage deals?

All regions.

How will you be distributing your data to storage providers? Is there an offline data transfer process?

I will upload my prepared CAR files to a web server and coordinate with providers to download and propose offline deals.

How do you plan on choosing the storage providers with whom you will be making deals? This should include a plan to ensure the data is retrievable in the future both by you and others.

I plan to deal with SPX, approved slingshot restore SPs and enterprise-sp-wg members, as well as any real community providers who are interested.

To name a few from the community that I deal with regularly: PIKNIK, Holon, CabrinaHuang, HarryM, BigBear, j1v, XinAn Xu, WillTechMusing.

I am also exploring auctions on https://www.bigd.exchange/

How will you be distributing deals across storage providers?

Evenly across all providers I propose to, provided they can handle the volume. If a miner is itself a notary, that miner will receive no more than 10% of the total granted DataCap.
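
The distribution rule described in this answer can be sketched roughly as follows. This is only an illustration under stated assumptions, not the applicant's actual tooling: the function name `plan_distribution`, the placeholder provider IDs, and the choice to hand any capped excess back to the non-notary providers are all hypothetical.

```python
from typing import Dict, List, Set


def plan_distribution(
    total_tib: float,
    providers: List[str],
    notary_providers: Set[str],
    notary_cap: float = 0.10,  # the 10% ceiling for providers that are also notaries
) -> Dict[str, float]:
    """Split DataCap evenly, capping notary-run providers at notary_cap * total.

    Assumption: whatever the cap withholds from notary providers is shared
    evenly among the remaining providers. If every provider is a notary,
    the excess simply stays unallocated.
    """
    if not providers:
        raise ValueError("at least one provider is required")

    cap = total_tib * notary_cap
    even_share = total_tib / len(providers)

    notaries = [p for p in providers if p in notary_providers]
    others = [p for p in providers if p not in notary_providers]

    # Notary providers get the even share, but never more than the cap.
    plan = {p: min(even_share, cap) for p in notaries}

    # Everything left over is split evenly among non-notary providers.
    remaining = total_tib - sum(plan.values())
    for p in others:
        plan[p] = remaining / len(others)
    return plan


if __name__ == "__main__":
    # Hypothetical example: 5 PiB (5120 TiB) across four providers, one a notary.
    plan = plan_distribution(5120, ["sp-a", "sp-b", "sp-c", "sp-d"], {"sp-a"})
    for provider, tib in plan.items():
        print(f"{provider}: {tib:.1f} TiB")
```

With many providers the even share falls below the cap on its own, so the cap only matters when the provider list is short.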

Do you have the resources/funding to start making deals as soon as you receive DataCap? What support from the community would help you onboard onto Filecoin?

I have all I need to start making deals.
@large-datacap-requests

Thanks for your request!

Heads up, you’re requesting more than the typical weekly onboarding rate of DataCap!

@large-datacap-requests

Thanks for your request!
Everything looks good. 👌

A Governance Team member will review the information provided and contact you back pretty soon.

@dkkapur
Collaborator

dkkapur commented Jul 6, 2022

Datacap Request Trigger

Total DataCap requested

5 PiB

Expected weekly DataCap usage rate

500 TiB

Client address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

@large-datacap-requests

DataCap Allocation requested

Multisig Notary address

f01858410

Client address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

DataCap allocation requested

250TiB


Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacecyhkqndhk7m46k5b7r7rlhfywj5biqjfolbpuboepvxcen7f3ss6

Address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Datacap Allocated

250.00TiB

Signer Address

f1e77zuityhvvw6u2t6tb5qlnsegy2s67qs4lbbbq

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecyhkqndhk7m46k5b7r7rlhfywj5biqjfolbpuboepvxcen7f3ss6


Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacec6l2kdw5ulfmeygtc56kn53dbn4bxu3chdmnd4bvxfip4y7bk34i

Address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Datacap Allocated

250.00TiB

Signer Address

f1pszcrsciyixyuxxukkvtazcokexbn54amf7gvoq

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacec6l2kdw5ulfmeygtc56kn53dbn4bxu3chdmnd4bvxfip4y7bk34i

@kernelogic
Author

I want to make some changes to this LDN. The "Genomic Data Commons" dataset originally named in this application is actually mostly private and requires the downloader to be a medical professional in the US. I did not realize this until I started downloading.

So that this LDN does not go to waste, and because v2.8 is ending soon, I propose changing the dataset for this LDN to the V3 datasets that I have already prepared, namely the following for now:

  1. Ford Multi-AV Seasonal Dataset
  2. Cancer Cell Line Encyclopedia (CCLE)
  3. Allen Brain Observatory - Visual Coding AWS Public Data Set

@large-datacap-requests

DataCap Allocation requested

Request number 2

Multisig Notary address

f01858410

Client address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

DataCap allocation requested

500TiB
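
The two tranche sizes in this thread (250 TiB for the first request, 500 TiB for request number 2) match what a "lesser of a share of the total or a multiple of the weekly rate" rule would give for 5 PiB total and 500 TiB per week. The sketch below assumes such a rule (5% of total or 50% of weekly for the first tranche, 10% of total or 100% of weekly for the second); the thread itself does not state the rule, and the names `TRANCHE_RULES` and `tranche_tib` are hypothetical.

```python
TIB_PER_PIB = 1024

# Assumed tranche rule (not stated in this thread):
# request number -> (fraction of total request, multiple of weekly rate)
TRANCHE_RULES = {
    1: (0.05, 0.5),
    2: (0.10, 1.0),
}


def tranche_tib(total_tib: float, weekly_tib: float, request_number: int) -> float:
    """Return the tranche size in TiB as the lesser of the two assumed limits."""
    share_of_total, multiple_of_weekly = TRANCHE_RULES[request_number]
    return min(total_tib * share_of_total, weekly_tib * multiple_of_weekly)


if __name__ == "__main__":
    total = 5 * TIB_PER_PIB  # 5 PiB requested in total
    weekly = 500             # 500 TiB requested per week
    for n in (1, 2):
        print(f"request {n}: {tranche_tib(total, weekly, n):.0f} TiB")
    # At the requested weekly rate, the full 5 PiB would take about this long:
    print(f"weeks at full rate: {total / weekly:.2f}")
```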

@large-datacap-requests

Thanks for your request!

Heads up, you’re requesting more than the typical weekly onboarding rate of DataCap!



14 participants