This repository has been archived by the owner on Jul 18, 2024. It is now read-only.

Speedium - NIH NCBI Sequence Read Archive [ 1 / 27 ] #488

Closed
cryptowhizzard opened this issue Jul 11, 2022 · 55 comments

Comments

@cryptowhizzard

cryptowhizzard commented Jul 11, 2022

Large Dataset Notary Application

To apply for DataCap to onboard your dataset to Filecoin, please fill out the following.

Core Information

  • Organization Name: Speedium network
  • Website / Social Media: www.speedium.nl
  • Total amount of DataCap being requested (between 500 TiB and 5 PiB): 5 PiB
  • Weekly allocation of DataCap requested (usually between 1-100TiB): 500 TiB
  • On-chain address for first allocation: f1mgnwoczfj25foxn4555wvwyak6rsynzy7z73azy

Please respond to the questions below by replacing the text saying "Please answer here". Include as much detail as you can in your answer.

Project details

Share a brief history of your project and organization.

Speedium / Dcent has participated in Slingshot since 2.6. We have successfully stored more than 15 different datasets with 20+ different miners.

What is the primary source of funding for this project?

Company account

What other projects/ecosystem stakeholders is this project associated with?

No related association

Use-case details

Describe the data being stored onto Filecoin

NIH NCBI Sequence Read Archive (SRA) on AWS
The Sequence Read Archive (SRA), produced by the [National Center for Biotechnology Information (NCBI)](https://www.ncbi.nlm.nih.gov/) at the [National Library of Medicine (NLM)](http://nlm.nih.gov/) at the [National Institutes of Health (NIH)](http://www.nih.gov/), stores raw DNA sequencing data and alignment information from high-throughput sequencing platforms. 

Where was the data in this dataset sourced from?

AWS

Can you share a sample of the data? A link to a file, an image, a table, etc., are good ways to do this.

https://registry.opendata.aws/ncbi-sra/

Confirm that this is a public dataset that can be retrieved by anyone on the Network (i.e., no specific permissions or access rights are required to view the data).

Yes.

What is the expected retrieval frequency for this data?

Multiple times per year

For how long do you plan to keep this dataset stored on Filecoin?

18 months or longer

DataCap allocation plan

In which geographies (countries, regions) do you plan on making storage deals?

EU / US / Australia / Asia
We intend to store 10 replicas of this data. The dataset has a size of 13.4 PiB.

How will you be distributing your data to storage providers? Is there an offline data transfer process?

The data will be transferred both offline and online.

How do you plan on choosing the storage providers with whom you will be making deals? This should include a plan to ensure the data is retrievable in the future both by you and others.

We have a few providers who worked with us during the Slingshot Restore program, and we would like to continue working with them for the ongoing Slingshot competition.

How will you be distributing deals across storage providers?

At most 2 copies per storage provider, and only if they are stored on different miners / locations.

Do you have the resources/funding to start making deals as soon as you receive DataCap? What support from the community would help you onboard onto Filecoin?

Yes, we have the resources.

@large-datacap-requests

Thanks for your request!
Everything looks good. 👌

A Governance Team member will review the information provided and contact you back pretty soon.

@Sunnyiscoming
Collaborator

1. Total amount of DataCap being requested (between 500 TiB and 5 PiB): 135 PiB
The total amount must be no more than 5 PiB. Perhaps you can reapply in batches, one after another.

2. Weekly allocation of DataCap requested (usually between 1-100TiB): 2 PiB
Can you list the storage providers and their nodes to show that you can store 2 PiB per week?

@cryptowhizzard
Author

@Sunnyiscoming That is OK. I will alter the application title and structure it in 5 PiB batches. That means 27 batches in total, so I will change the title to [ 1 of 27 ] etc. and reference this original application when we move on to the next one, which we will do once the previous one is more than 50% done.

My plan is to distribute these to Holon, DLTX and PikNik, and I will also distribute them to whoever else wants them.
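
The batch count can be checked against the figures given in the application: 10 replicas of a 13.4 PiB dataset, with each application capped at 5 PiB. A minimal sketch of that arithmetic follows; the function and parameter names are illustrative and do not come from the thread.

```python
import math


def batches_needed(dataset_pib: float, replicas: int, cap_pib: float) -> int:
    """Number of applications needed when each one is capped at cap_pib."""
    total_pib = dataset_pib * replicas  # total DataCap across all replicas
    return math.ceil(total_pib / cap_pib)


# Figures from the application form: 13.4 PiB dataset, 10 replicas, 5 PiB cap.
print(batches_needed(13.4, 10, 5.0))
```

With the 13.14 PiB figure quoted later in the thread, the same calculation gives the same number of batches.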

@large-datacap-requests

Thanks for your request!

Heads up, you’re requesting more than the typical weekly onboarding rate of DataCap!

@cryptowhizzard cryptowhizzard changed the title Speedium - NIH NCBI Sequence Read Archive Speedium - NIH NCBI Sequence Read Archive [ 1 / 27 ] Jul 20, 2022

@Sunnyiscoming
Collaborator

You have submitted a large number of dataset applications.
I would like to confirm the dataset size with you, but I can't see the size of Amazon's public dataset. Can you provide any evidence of it?

@cryptowhizzard
Author

Hello @Sunnyiscoming

The total size of this dataset is 13.14 PiB

According to the official AWS source, you can retrieve that figure with the following command:

aws s3 ls --summarize --human-readable --recursive s3://bucket-name/

The bucket name and all other information are on their public page:

https://registry.opendata.aws/ncbi-sra/
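
With `--summarize`, the listing ends with a "Total Objects" line and a "Total Size" line. As a rough sketch, those two lines could be turned into numbers as below. The helper name `parse_summary` and the sample text are made up for illustration, and the unit table assumes the binary suffixes printed with `--human-readable`. Note that a recursive listing of a multi-PiB bucket enumerates every object and can take a long time.

```python
import re

# Binary multipliers for the suffixes printed with --human-readable (assumed set).
UNITS = {"Bytes": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30,
         "TiB": 2**40, "PiB": 2**50, "EiB": 2**60}


def parse_summary(output: str) -> tuple[int, float]:
    """Return (object count, size in bytes) from human-readable summary lines."""
    objects = re.search(r"Total Objects:\s*(\d+)", output)
    size = re.search(r"Total Size:\s*([\d.]+)\s*(\w+)", output)
    if not objects or not size:
        raise ValueError("summary lines not found")
    return int(objects.group(1)), float(size.group(1)) * UNITS[size.group(2)]


# Made-up sample text, not real output for the SRA bucket.
sample = "Total Objects: 3\n   Total Size: 2.5 TiB\n"
count, size_bytes = parse_summary(sample)
print(count, size_bytes / 2**50)  # object count and size in PiB
```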

@kevzak
Collaborator

kevzak commented Aug 16, 2022

@dkkapur @raghavrmadya FYI

@large-datacap-requests

Deleting comment

@raghavrmadya does not have permission to post this comment.

Please contact the assignee of this issue.

@large-datacap-requests

Stats & Info for DataCap Allocation

Multisig Notary address

f01858410

Client address

f1mgnwoczfj25foxn4555wvwyak6rsynzy7z73azy

Last two approvers

kernelogic & s0nik42

Rule to calculate the allocation request amount

800% of weekly dc amount requested

DataCap allocation requested

10.23GiB

Total DataCap granted for client so far

4.51PiB

Datacap to be granted to reach the total amount requested by the client (5 PiB)

499.28TiB

Stats

| Number of deals | Number of storage providers | Previous DC Allocated | Top provider | Remaining DC |
|---|---|---|---|---|
| 147859 | 25 | 5.27TiB | 21.12 | 21.77GiB |

@filplus-checker-app

DataCap and CID Checker Report [1]

  • Organization: NIH- National Institute of Health
  • Client: f1mgnwoczfj25foxn4555wvwyak6rsynzy7z73azy

Approvers

  • 2 IreneYoung
  • 4 kernelogic
  • 1 liyunzhi-666
  • 1 MegTei
  • 1 psh0691
  • 2 s0nik42

Storage Provider Distribution

The below table shows the distribution of storage providers that have stored data for this client.

If this is the first time a provider takes a verified deal, it will be marked as new.

For most DataCap applications, the restrictions below should apply.

  • Storage provider should not exceed 30% of total datacap.
  • Storage provider should not be storing duplicate data for more than 20%.
  • Storage provider should have published its public IP address.
  • All storage providers should be located in different regions.

⚠️ 24.99% of total deal sealed by f01208803 are duplicate data.

⚠️ 36.91% of total deal sealed by f01208189 are duplicate data.

| Provider | Location | Total Deals Sealed | Percentage | Unique Data | Duplicate Deals |
|---|---|---|---|---|---|
| f01156975 | Melbourne, Victoria, AU (Anycast Global Backbone) | 59.94 TiB | 1.36% | 49.66 TiB | 17.15% |
| f01208632 | Melbourne, Victoria, AU (Anycast Global Backbone) | 57.25 TiB | 1.30% | 55.94 TiB | 2.29% |
| f01157271 | Melbourne, Victoria, AU (Anycast Global Backbone) | 56.15 TiB | 1.27% | 54.37 TiB | 3.17% |
| f01208803 | Melbourne, Victoria, AU (Anycast Global Backbone) | 50.77 TiB | 1.15% | 38.08 TiB | 24.99% |
| f01208189 | Melbourne, Victoria, AU (Anycast Global Backbone) | 49.48 TiB | 1.12% | 31.22 TiB | 36.91% |
| f01156901 | Melbourne, Victoria, AU (Anycast Global Backbone) | 48.42 TiB | 1.10% | 41.64 TiB | 14.00% |
| f01157018 | Melbourne, Victoria, AU (Anycast Global Backbone) | 44.28 TiB | 1.00% | 42.72 TiB | 3.53% |
| f01157249 | Melbourne, Victoria, AU (Anycast Global Backbone) | 42.80 TiB | 0.97% | 41.92 TiB | 2.04% |
| f01157027 | Melbourne, Victoria, AU (Anycast Global Backbone) | 39.09 TiB | 0.89% | 37.37 TiB | 4.40% |
| f01156835 | Melbourne, Victoria, AU (Anycast Global Backbone) | 17.88 TiB | 0.41% | 17.26 TiB | 3.49% |
| f01208154 | Melbourne, Victoria, AU (Anycast Global Backbone) | 17.07 TiB | 0.39% | 17.01 TiB | 0.37% |
| f01156538 | Melbourne, Victoria, AU (Anycast Global Backbone) | 15.20 TiB | 0.34% | 15.20 TiB | 0.00% |
| f022352 | Oslo, Oslo, NO (Blix Solutions AS) | 77.31 TiB | 1.75% | 70.50 TiB | 8.81% |
| f02000937 | Chengdu, Sichuan, CN (China Mobile Communications Group Co., Ltd.) | 280.56 TiB | 6.36% | 280.56 TiB | 0.00% |
| f01915033 | Chengdu, Sichuan, CN (China Mobile Communications Group Co., Ltd.) | 94.50 TiB | 2.14% | 94.50 TiB | 0.00% |
| f01972376 (new) | Maywood Park, Oregon, US (Flexential Colorado Corp.) | 962.98 TiB | 21.82% | 962.32 TiB | 0.07% |
| f01972364 (new) | Maywood Park, Oregon, US (Flexential Colorado Corp.) | 926.84 TiB | 21.01% | 926.84 TiB | 0.00% |
| f01952350 | Maywood Park, Oregon, US (Flexential Colorado Corp.) | 238.13 TiB | 5.40% | 236.38 TiB | 0.73% |
| f01944347 | Maywood Park, Oregon, US (Flexential Colorado Corp.) | 226.63 TiB | 5.14% | 226.63 TiB | 0.00% |
| f01392893 | Amsterdam, North Holland, NL (Fusix Networks B.V.) | 50.54 TiB | 1.15% | 50.54 TiB | 0.00% |
| f01199430 | Heerhugowaard, North Holland, NL (Wijnand Schouten trading as Speedium) | 582.29 TiB | 13.20% | 575.54 TiB | 1.16% |
| f01786387 | Heerhugowaard, North Holland, NL (Wijnand Schouten trading as Speedium) | 197.76 TiB | 4.48% | 193.13 TiB | 2.34% |
| f01201327 | Heerhugowaard, North Holland, NL (Wijnand Schouten trading as Speedium) | 134.84 TiB | 3.06% | 134.84 TiB | 0.00% |
| f01771403 | Heerhugowaard, North Holland, NL (Wijnand Schouten trading as Speedium) | 77.05 TiB | 1.75% | 77.05 TiB | 0.00% |
| f01937642 | Heerhugowaard, North Holland, NL (Wijnand Schouten trading as Speedium) | 64.63 TiB | 1.46% | 61.50 TiB | 4.84% |
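
The two numeric restrictions listed in this report (at most 30% of total DataCap per provider, at most 20% duplicate data per provider) can be applied to the table rows mechanically. The sketch below is only an illustration of that check, not the checker's actual implementation: the `ProviderRow` type, the strict "greater than" comparison and the choice of five sample rows are assumptions. The percentages are copied from the table.

```python
from dataclasses import dataclass


@dataclass
class ProviderRow:
    provider: str
    share_pct: float      # "Percentage" column
    duplicate_pct: float  # "Duplicate Deals" column


MAX_SHARE_PCT = 30.0      # a provider should not exceed 30% of total DataCap
MAX_DUPLICATE_PCT = 20.0  # a provider should not store more than 20% duplicates


def violations(rows):
    """Return (provider, rule, value) for every threshold a row exceeds."""
    found = []
    for row in rows:
        if row.share_pct > MAX_SHARE_PCT:
            found.append((row.provider, "share", row.share_pct))
        if row.duplicate_pct > MAX_DUPLICATE_PCT:
            found.append((row.provider, "duplicate", row.duplicate_pct))
    return found


# A handful of rows from the table above, not the full list.
rows = [
    ProviderRow("f01972376", 21.82, 0.07),
    ProviderRow("f01199430", 13.20, 1.16),
    ProviderRow("f01208803", 1.15, 24.99),
    ProviderRow("f01208189", 1.12, 36.91),
    ProviderRow("f01156975", 1.36, 17.15),
]
for item in violations(rows):
    print(item)
```

On these rows only the two providers named in the warnings above exceed a threshold, both on the duplicate-data rule.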

Provider Distribution

Deal Data Replication

The table below shows how much unique data is replicated across how many storage providers.

  • No more than 30% of unique data are stored with less than 4 providers.

⚠️ 68.00% of deals are for data replicated across less than 4 storage providers.

| Unique Data Size | Total Deals Made | Number of Providers | Deal Percentage |
|---|---|---|---|
| 1.50 PiB | 1.50 PiB | 1 | 34.82% |
| 426.80 TiB | 854.16 TiB | 2 | 19.36% |
| 200.33 TiB | 610.07 TiB | 3 | 13.83% |
| 191.27 TiB | 795.27 TiB | 4 | 18.02% |
| 52.69 TiB | 276.05 TiB | 5 | 6.26% |
| 30.88 TiB | 202.22 TiB | 6 | 4.58% |
| 18.19 TiB | 135.50 TiB | 7 | 3.07% |
| 352.00 GiB | 2.84 TiB | 8 | 0.06% |
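
The warning above can be approximately reproduced from the table by adding up the "Deal Percentage" values for rows with fewer than 4 providers. This is a sketch under the assumption that the checker sums that column; because the table values are rounded, the result may differ from the reported 68.00% in the last digit.

```python
# (number of providers, deal percentage) pairs copied from the table above.
rows = [(1, 34.82), (2, 19.36), (3, 13.83), (4, 18.02),
        (5, 6.26), (6, 4.58), (7, 3.07), (8, 0.06)]

MIN_PROVIDERS = 4               # replication level the rule asks for
MAX_LOW_REPLICATION_PCT = 30.0  # allowed share below that level


def low_replication_pct(rows, min_providers=MIN_PROVIDERS):
    """Sum the deal percentages of rows with fewer than min_providers."""
    return sum(pct for providers, pct in rows if providers < min_providers)


pct = low_replication_pct(rows)
print(f"{pct:.2f}% of deals are below {MIN_PROVIDERS} providers")
print("limit exceeded:", pct > MAX_LOW_REPLICATION_PCT)
```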

Replication Distribution

Deal Data Shared with other Clients

The table below shows how much unique data is shared with other clients.
Usually different applications own different data, and their data should not resolve to the same CID.

However, this is possible if all the clients below use the same software to prepare the exact same dataset, or if they belong to a series of LDN applications for the same dataset.

⚠️ CID sharing has been observed.

| Other Client | Application | Total Deals Affected | Unique CIDs | Approvers |
|---|---|---|---|---|
| f1z7jogzx4x42wtilzb4lu6iotlad5rptt2acbzpi | Speedium network | 44.17 TiB | 1,341 | 1 flyworker, 1 kernelogic, 4 MegTei, 2 psh0691, 3 Reiers, 3 s0nik42 |

Footnotes

  1. To manually trigger this report, add a comment with text checker:manualTrigger

@hyunmoon

hyunmoon commented Mar 9, 2023

checker:manualTrigger

@filplus-checker-app

DataCap and CID Checker Report Summary [1]

Storage Provider Distribution

⚠️ 2 storage providers sealed too much duplicate data - f01208189: 20.70%, f01208803: 20.66%

Deal Data Replication

⚠️ 35.15% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients [2]

⚠️ CID sharing has been observed. (Top 3)

Full report

Click here to view the full report.

Footnotes

  1. To manually trigger this report, add a comment with text checker:manualTrigger

  2. To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

@data-programs
Collaborator

KYC

This user’s identity has been verified through filplus.storage
