This repository has been archived by the owner on Jul 18, 2024. It is now read-only.

[DataCap Application] FileDrive Labs - Datasets Landing Plan - [2/3] #1267

Closed
laurarenpanda opened this issue Nov 16, 2022 · 46 comments
Labels
kyc verified User has passed KYC check

Comments

@laurarenpanda

Large Dataset Notary Application

To apply for DataCap to onboard your dataset to Filecoin, please fill out the following.

Core Information

  • Organization Name: FileDrive Labs
  • Website / Social Media: https://twitter.com/FileDrive1
  • Total amount of DataCap being requested (between 500 TiB and 5 PiB): 5 PiB
  • Weekly allocation of DataCap requested (usually between 1-100TiB): 500 TiB
  • On-chain address for first allocation: f14uhjnqrocqcenbjfaergw2uvaimysi4snv2oepy

Please respond to the questions below by replacing the text saying "Please answer here". Include as much detail as you can in your answer.

Project details

Share a brief history of your project and organization.

FileDrive Datasets Landing Plan is a project for onboarding more valuable public datasets onto the Filecoin network. Over several phases, we plan to bring 10 PiB of data to Filecoin and promote 100 PiB of storage power growth.


About FileDrive Datasets

FileDrive Datasets is a platform to effectively connect the huge storage market that Filecoin has built with publishers of public datasets.
The Filecoin network provides reliable, secure, and affordable decentralized storage services, and FileDrive Labs wants to deliver these benefits to end-users by building a public dataset platform.
It is challenging to attract traditional cloud storage and object-based storage users to the Filecoin network so that they can benefit from it. Developers in the Filecoin ecosystem, such as FileDrive Labs, need to face this challenge together.
As a member of the Filecoin ecosystem, FileDrive Labs has been committed to developing useful tools that make it easier for users to store their data on the Filecoin network.

FileDrive Datasets integrates a group of tools to provide a storage service that is compatible with both cloud storage and object-based storage and offers a better user experience, in order to attract more users.
Ongoing projects behind it:
- Go-Graphsplit: https://github.com/filedrive-team/go-graphsplit
- DS-Cluster: https://github.com/filedrive-team/go-ds-cluster
- Filejoy: https://github.com/filedrive-team/filejoy

Article about FileDrive Datasets on Filecoin Blog:
- Large Datasets: FileDrive: https://filecoin.io/blog/posts/large-datasets-filedrive/



About FileDrive Labs

FileDrive Labs has always defined itself as a tool developer and infrastructure builder in the Filecoin ecosystem. Since 2019, we have focused on technical solutions and development based on the IPFS protocol and the Filecoin network, and we do our best to contribute to the community.
Over 80% of our team are qualified engineers, and half of them have more than 10 years of development experience in multiple industries, including communications, the Internet, and blockchain.
Since 2020, we have participated in the Slingshot Competition, become one of the top teams, and stored over 5 PiB of useful data from public datasets on the Filecoin network.
To contribute to the Filecoin community, we developed the open-source data prep tool Graphsplit, the FIL+ project dashboard filplus.info, and the storage provider discovery platform filfind.info.
In addition, since March 2022 we have held a weekly online event named FileDrive Meetup, which aims to provide a platform for community members to follow the latest trends of the Filecoin network as well as our work and research.

Please check the following links for more details.
- GitHub: https://github.com/filedrive-team
- Twitter: https://twitter.com/FileDrive1
- Eventbrite: https://www.eventbrite.hk/o/filedrive-labs-42456337463
- YouTube Channel: https://www.youtube.com/channel/UCxcZC1dtBUlQvZY7DX13W1w
- Medium: https://medium.com/@FileDrive1

What is the primary source of funding for this project?

FileDrive Labs, rewards from the Slingshot Competition, Filecoin DevGrants, Microgrants, and a series of hackathons.

What other projects/ecosystem stakeholders is this project associated with?

FileDrive Dataset is an open dataset platform on the IPFS network, and all data will be stored on the Filecoin network through the Filecoin Plus project. The primary ecosystem stakeholders are therefore IPFS and Filecoin.

Use-case details

Describe the data being stored onto Filecoin

FileDrive Datasets Landing Plan #1
- Datasets: 6
- Total data capacity: 2451.1TiB


List of Datasets in #1:

1. ZINC Database
- 3D models for molecular docking screens.
- Size: 924.5 TiB

2. Transiting Exoplanet Survey Satellite (TESS)
- The Transiting Exoplanet Survey Satellite (TESS) is a multi-year survey that will discover exoplanets in orbit around bright stars across the entire sky using high-precision photometry. The survey will also enable a wide variety of stellar astrophysics, solar system science, and extragalactic variability studies. More information about TESS is available at MAST and the TESS Science Support Center.
- Size: 285.6 TiB

3. Smithsonian Open Access
- The Smithsonian’s mission is the "increase and diffusion of knowledge" and has been collecting since 1846. The Smithsonian, through its efforts to digitize its multidisciplinary collections, has created millions of digital assets and related metadata describing the collection objects. On February 25th, 2020, the Smithsonian released over 2.8 million CC0 interdisciplinary 2-D and 3-D images, related metadata, and additionally, research data from researchers across the Smithsonian. The 2.8 million "open access" collections are a subset of the Smithsonian’s 155 million objects, 2.1 million library volumes and 156,000 cubic feet of archival collections held in 19 museums, 9 research centers, libraries, archives and the National Zoo. Digitization of collections is ongoing.
- Size: 621.2 TiB

4. Community Earth System Model v2 ARISE (CESM2 ARISE)
- Data from ARISE-SAI Experiments with CESM2
- Size: 263.5 TiB

5. 3DCoMPaT: Composition of Materials on Parts of 3D Things
- 3D CoMPaT is a richly annotated large-scale dataset of rendered compositions of Materials on Parts of thousands of unique 3D Models. This dataset primarily focuses on stylizing 3D shapes at part-level with compatible materials. Each object with the applied part-material compositions is rendered from four equally spaced views as well as four randomized views. We introduce a new task, called Grounded CoMPaT Recognition (GCR), to collectively recognize and ground compositions of materials on parts of 3D objects. We present two variations of this task and adapt state-of-art 2D/3D deep learning methods to solve the problem as baselines for future research. We hope our work will help ease future research on compositional 3D Vision.
- Size: 42.8 TiB

6. Reference Elevation Model of Antarctica (REMA)
- The Reference Elevation Model of Antarctica - 2m GSD Digital Elevation Models (DEMs) and mosaics from 2009 to the present. The REMA project seeks to fill the need for high-resolution time-series elevation data in the Antarctic. The time-dependent nature of the strip DEM files allows users to perform change detection analysis and to compare observations of topography data acquired in different seasons or years. The mosaic DEM tiles are assembled from multiple strip DEMs with the intention of providing a more consistent and comprehensive product over large areas. REMA data is constructed from in-track and cross-track high-resolution (~0.5 meter) imagery acquired by the Maxar constellation of optical imaging satellites.
- Size: 313.5 TiB
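
As a quick cross-check, the six sizes listed above can be summed and compared with the stated total of 2451.1 TiB. This is only a minimal sketch: the dictionary keys and variable names are illustrative, and the figures are copied from the list as written.

```python
# Dataset sizes in TiB, copied from the list above.
dataset_sizes_tib = {
    "ZINC Database": 924.5,
    "TESS": 285.6,
    "Smithsonian Open Access": 621.2,
    "CESM2 ARISE": 263.5,
    "3DCoMPaT": 42.8,
    "REMA": 313.5,
}

# "Total data capacity" from the plan summary.
stated_total_tib = 2451.1

# Round to one decimal place to avoid floating-point noise in the sum.
computed_total_tib = round(sum(dataset_sizes_tib.values()), 1)

print(f"computed total: {computed_total_tib} TiB")
print(f"matches stated total: {computed_total_tib == stated_total_tib}")
```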

Where was the data in this dataset sourced from?

All data is from public open datasets.

Can you share a sample of the data? A link to a file, an image, a table, etc., are good ways to do this.

FileDrive Datasets: 
https://datasets.filedrive.io/

Original Source:
https://registry.opendata.aws/zinc15/
https://registry.opendata.aws/tess/
https://registry.opendata.aws/smithsonian-open-access/
https://registry.opendata.aws/ncar-cesm2-arise/
https://registry.opendata.aws/3dcompat/
https://registry.opendata.aws/pgc-rema/

Confirm that this is a public dataset that can be retrieved by anyone on the Network (i.e., no specific permissions or access rights are required to view the data).

Yes, it is. All data can be retrieved by anyone on Filecoin Network.

What is the expected retrieval frequency for this data?

The data of FileDrive Dataset will be pinned on the IPFS network before being stored on Filecoin, so users will have two main ways to retrieve it: through IPFS or through Filecoin. The retrieval frequency therefore depends on users' needs.

For how long do you plan to keep this dataset stored on Filecoin?

This data will be stored for at least 1 year on Filecoin, so the verified deals will use a 1-year minimum deal duration (from 356 to 530 days).
Ideally, this project will be a permanent archive on the Filecoin network, as long as there are actual users and data requirements.
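
For reference, deal durations on Filecoin are counted in epochs of 30 seconds each, so one day corresponds to 2,880 epochs. The sketch below converts the day range quoted above into epochs. The function name is illustrative, and the day values are taken exactly as written in the answer.

```python
EPOCH_SECONDS = 30  # Filecoin epoch (block time) in seconds
EPOCHS_PER_DAY = 24 * 60 * 60 // EPOCH_SECONDS  # 2880 epochs per day


def days_to_epochs(days: int) -> int:
    """Convert a deal duration in days to Filecoin epochs."""
    return days * EPOCHS_PER_DAY


# Day range as stated in the answer above.
min_days, max_days = 356, 530

print(f"{min_days} days = {days_to_epochs(min_days)} epochs")
print(f"{max_days} days = {days_to_epochs(max_days)} epochs")
```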

DataCap allocation plan

In which geographies (countries, regions) do you plan on making storage deals?

All regions, as long as data transmission can be stable and successful.

How will you be distributing your data to storage providers? Is there an offline data transfer process?

Through both online and offline storage deals. A data transfer strategy might be considered in certain situations.
The expected data onboarding rate is 500 TiB per week, which is roughly 71 TiB per day.
However, it will be influenced by factors such as the speed of data transmission (especially inter-regional transmission), daily power growth of storage providers, base fees, equipment performance, etc. For these reasons, the actual data onboarding rate may differ from our expectations.
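
The daily figure follows from dividing the weekly rate by seven. The sketch below reproduces that arithmetic and also estimates how many weeks the full 5 PiB request would last at the expected rate. It assumes binary units (1 PiB = 1024 TiB) and a constant rate, which the factors listed above make unlikely in practice, so treat the result as a best case.

```python
import math

WEEKLY_RATE_TIB = 500   # expected weekly onboarding rate stated in the plan
TOTAL_REQUEST_PIB = 5   # total DataCap requested in this application
TIB_PER_PIB = 1024      # assumption: binary units

daily_rate_tib = WEEKLY_RATE_TIB / 7
total_request_tib = TOTAL_REQUEST_PIB * TIB_PER_PIB
weeks_at_full_rate = total_request_tib / WEEKLY_RATE_TIB

print(f"daily rate: {daily_rate_tib:.1f} TiB/day")
print(f"weeks to use the full request: {weeks_at_full_rate:.2f}")
print(f"whole weeks needed: {math.ceil(weeks_at_full_rate)}")
```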

How do you plan on choosing the storage providers with whom you will be making deals? This should include a plan to ensure the data is retrievable in the future both by you and others.

Main criteria for choosing storage providers:
- location
- transmission speed
- deal success rate
- previous experience of real data storage
- stability of their nodes
- reputation score

How will you be distributing deals across storage providers?

All storage deals will be verified deals, provided that the storage providers have enough FIL to pledge and their equipment can handle the load.

Do you have the resources/funding to start making deals as soon as you receive DataCap? What support from the community would help you onboard onto Filecoin?

Yes, we do.
Suggestions and feedback can help us optimize this project and cooperate with more great storage providers from all over the world.

@large-datacap-requests

Thanks for your request!

Heads up, you’re requesting more than the typical weekly onboarding rate of DataCap!

@large-datacap-requests

Thanks for your request!
Everything looks good. 👌

A Governance Team member will review the information provided and contact you back pretty soon.

@raghavrmadya
Collaborator

Datacap Request Trigger

Total DataCap requested

5PiB

Expected weekly DataCap usage rate

500TiB

Client address

f14uhjnqrocqcenbjfaergw2uvaimysi4snv2oepy

@large-datacap-requests

large-datacap-requests bot commented Nov 22, 2022

DataCap Allocation requested

Multisig Notary address

f02049625

Client address

f14uhjnqrocqcenbjfaergw2uvaimysi4snv2oepy

DataCap allocation requested

250TiB

Id

f0608bb3-9102-4064-8a08-fc6afd65a477

@Joss-Hua

I have some knowledge of FileDrive Labs and its related products, and I have made a face-to-face visit to the team to discuss this LDN. I have confirmed that the above information is reliable, so I am proposing the first allocation.

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzaceckf5m4ck6x2gypwrw2n3jgr76cnhhznr6magxe56vsvi6wm2jixk

Address

f14uhjnqrocqcenbjfaergw2uvaimysi4snv2oepy

Datacap Allocated

250.00TiB

Signer Address

f1tfg54zzscugttejv336vivknmsnzzmyudp3t7wi

Id

f0608bb3-9102-4064-8a08-fc6afd65a477

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceckf5m4ck6x2gypwrw2n3jgr76cnhhznr6magxe56vsvi6wm2jixk

@newwebgroup

We met with the FileDrive Labs team over Zoom to discuss this LDN, and we are willing to support them in the first round.

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacead2hs2hexuij6j6x2u3yudvwkjw2wmbrs2gvw6ie6r6vbsosrbdo

Address

f14uhjnqrocqcenbjfaergw2uvaimysi4snv2oepy

Datacap Allocated

250.00TiB

Signer Address

f1e77zuityhvvw6u2t6tb5qlnsegy2s67qs4lbbbq

Id

f0608bb3-9102-4064-8a08-fc6afd65a477

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacead2hs2hexuij6j6x2u3yudvwkjw2wmbrs2gvw6ie6r6vbsosrbdo

@filplus-checker

DataCap and CID Checker Report [1]

  • Organization: FileDrive Labs
  • Client: f14uhjnqrocqcenbjfaergw2uvaimysi4snv2oepy

Storage Provider Distribution

The table below shows the distribution of storage providers that have stored data for this client.

If this is the first time a provider takes a verified deal, it will be marked as new.

For most DataCap applications, the restrictions below should apply.

  • Storage provider should not exceed 25% of total datacap.
  • Storage provider should not be storing duplicate data for more than 20%.
  • Storage provider should have published its public IP address.
  • All storage providers should be located in different regions.

⚠️ f01227975 has sealed 44.46% of total datacap.

⚠️ f01228008 has sealed 29.79% of total datacap.

⚠️ f0522948 has sealed 25.13% of total datacap.

| Provider | Location | Total Deals Sealed | Percentage | Unique Data | Duplicate Deals |
| --- | --- | --- | --- | --- | --- |
| f01227975 | Hong Kong, Central and Western, HK | 45.33 TiB | 44.46% | 45.33 TiB | 0.00% |
| f01228008 | Sydney, New South Wales, AU | 30.38 TiB | 29.79% | 30.38 TiB | 0.00% |
| f0522948 | Singapore, Singapore, SG | 25.63 TiB | 25.13% | 25.63 TiB | 0.00% |
| f0134516 | Hong Kong, Central and Western, HK | 640.00 GiB | 0.61% | 640.00 GiB | 0.00% |
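
The 25% rule can be re-checked from the table above. This is a minimal sketch, not the checker's actual implementation: it assumes each percentage is the provider's share of the total sealed deals in the table, which reproduces the reported figures to within rounding of the inputs. Variable names are illustrative.

```python
GIB_PER_TIB = 1024

# Total deals sealed per provider in TiB, copied from the table above.
sealed_tib = {
    "f01227975": 45.33,
    "f01228008": 30.38,
    "f0522948": 25.63,
    "f0134516": 640 / GIB_PER_TIB,  # listed as 640.00 GiB
}

MAX_SHARE = 0.25  # "Storage provider should not exceed 25% of total datacap."

total_tib = sum(sealed_tib.values())
shares = {sp: size / total_tib for sp, size in sealed_tib.items()}

# Providers whose share of sealed deals is above the limit.
over_limit = [sp for sp, share in shares.items() if share > MAX_SHARE]

for sp, share in shares.items():
    status = "over limit" if sp in over_limit else "ok"
    print(f"{sp}: {share:.2%} ({status})")
```

The three providers flagged this way are the same three that carry a warning in the report.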

Provider Distribution

Deal Data Replication

The table below shows how much unique data is replicated across storage providers.

  • No more than 25% of unique data should be stored with fewer than 4 providers.

⚠️ 97.55% of deals are for data replicated across less than 4 storage providers.

| Unique Data Size | Total Deals Made | Number of Providers | Deal Percentage |
| --- | --- | --- | --- |
| 15.36 TiB | 15.36 TiB | 1 | 15.07% |
| 4.59 TiB | 9.19 TiB | 2 | 9.01% |
| 24.97 TiB | 74.91 TiB | 3 | 73.47% |
| 640.00 GiB | 2.50 TiB | 4 | 2.45% |
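
The 97.55% warning can be reproduced from the table above by summing the deals whose data sits with fewer than four providers. This sketch is illustrative only and assumes the percentage is taken over "Total Deals Made". It also checks that total deals divided by unique data is close to the provider count in each row.

```python
GIB_PER_TIB = 1024

# (unique data in TiB, total deals in TiB, number of providers), from the table above.
rows = [
    (15.36, 15.36, 1),
    (4.59, 9.19, 2),
    (24.97, 74.91, 3),
    (640 / GIB_PER_TIB, 2.50, 4),  # unique data listed as 640.00 GiB
]

MIN_PROVIDERS = 4

total_deals_tib = sum(deals for _, deals, _ in rows)
under_replicated_tib = sum(deals for _, deals, n in rows if n < MIN_PROVIDERS)
under_share = under_replicated_tib / total_deals_tib

print(f"{under_share:.2%} of deals are replicated across fewer than {MIN_PROVIDERS} providers")

# Sanity check: deals / unique data should roughly equal the number of providers.
for unique, deals, n in rows:
    print(f"{n} provider(s): replication factor ~ {deals / unique:.2f}")
```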

Replication Distribution

Deal Data Shared with other Clients

The table below shows how much unique data is shared with other clients.
Usually, different applications own different data, which should not resolve to the same CID.

⚠️ CID sharing has been observed.

| Other Client | Application | Total Deals Affected | Unique CIDs | Verifier |
| --- | --- | --- | --- | --- |
| f1bycr5r3ymkgqvkuxoemgsmnuawyawptwj44mqdi | FileDrive Labs | 195.81 TiB | 1,600 | LDN v3 multisig |
| f1sejgqbuwsf74qifuxqykwotyu5aswuwhubxghqa | FileDrive Labs | 49.97 TiB | 820 | LDN v3 multisig |

Footnotes

  1. To manually trigger this report, add a comment with text checker:manualTrigger

@large-datacap-requests

large-datacap-requests bot commented Dec 16, 2022

DataCap Allocation requested

Request number 2

Multisig Notary address

f02049625

Client address

f14uhjnqrocqcenbjfaergw2uvaimysi4snv2oepy

DataCap allocation requested

500TiB

Id

c4d39f80-3a31-40ca-8e7e-382cca1334fb

@stcloudlisa

I would like to support them for the following reasons:

Data can be retrieved
Reasonable SP distribution

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacedzgj3uxfgzsec6sypsju7wlgh3gux3lnabjkt7ntzq5esvzc5bpm

Address

f14uhjnqrocqcenbjfaergw2uvaimysi4snv2oepy

Datacap Allocated

1.34PiB

Signer Address

f1jvvltduw35u6inn5tr4nfualyd42bh3vjtylgci

Id

693fbd37-bee1-44ca-8ea0-e6ddacec1945

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedzgj3uxfgzsec6sypsju7wlgh3gux3lnabjkt7ntzq5esvzc5bpm

@Sunnyiscoming
Collaborator

checker:manualTrigger

@filplus-checker-app

DataCap and CID Checker Report Summary [1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 51.25% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients [2]

⚠️ CID sharing has been observed. (Top 3)

Full report

Click here to view the full report.

Footnotes

  1. To manually trigger this report, add a comment with text checker:manualTrigger

  2. To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

@large-datacap-requests

large-datacap-requests bot commented Mar 1, 2023

DataCap Allocation requested

Request number 6

Multisig Notary address

f02049625

Client address

f14uhjnqrocqcenbjfaergw2uvaimysi4snv2oepy

DataCap allocation requested

1.03TiB

Id

7f74192e-2a21-4e83-81f5-957033f7fc4a

@filplus-checker-app

DataCap and CID Checker Report Summary [1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 44.53% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients [2]

⚠️ CID sharing has been observed. (Top 3)

Full report

Click here to view the full report.

Footnotes

  1. To manually trigger this report, add a comment with text checker:manualTrigger

  2. To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

@data-programs
Collaborator

KYC

This user’s identity has been verified through filplus.storage

17 participants