This repository has been archived by the owner on Jul 18, 2024. It is now read-only.

[DataCap Application] Smithsonian Open Access #391

Closed
JackFsy opened this issue Jun 7, 2022 · 86 comments

Comments

@JackFsy

JackFsy commented Jun 7, 2022

Large Dataset Notary Application

To apply for DataCap to onboard your dataset to Filecoin, please fill out the following.

Core Information

  • Organization Name: BigData Exchange
  • Website / Social Media: www.bigd.exchange / https://twitter.com/BigD_Exchange
  • Total amount of DataCap being requested (between 500 TiB and 5 PiB): 3.24 PiB
  • Weekly allocation of DataCap requested (usually between 1-100 TiB): 300 TiB
  • On-chain address for first allocation: f1fkh47gdwclmovclfz2kvorhhdofvgr5y7fjvkmi

Please respond to the questions below by replacing the text saying "Please answer here". Include as much detail as you can in your answer.

Project details

Share a brief history of your project and organization.

BigData Exchange is a Filecoin data storage marketplace that connects clients with valuable public data and storage providers with storage capacity. Headquartered and operated in Singapore, there are currently over 30 people globally in the core team.
With the current demand for Filecoin plus deals, there are several benefits that BigData Exchange brings to the Filecoin network:

- An innovative storage platform to incentivize more people to source, migrate, and store valuable public data on Filecoin
- A transparent marketplace to connect Fil+ clients and storage providers, which adds visibility to the SP selection process and prevents potential self-dealing activities
- An index of public data stored on Filecoin, which paves the way to a vibrant retrieval and service market for Fil+ deals

The goal is not to profit from DataCap. We will donate any proceeds to data preservation organizations or use them to fund and incentivize activities that aim to onboard more useful data onto Filecoin.

What is the primary source of funding for this project?

The project is self funded.

What other projects/ecosystem stakeholders is this project associated with?

The project is self funded.

Use-case details

Describe the data being stored onto Filecoin

Smithsonian Open Access is a public open dataset available in AWS S3. The Smithsonian’s mission is the "increase and diffusion of knowledge," and it has been collecting since 1846. Through its efforts to digitize its multidisciplinary collections, the Smithsonian has created millions of digital assets and related metadata describing the collection objects.

Where was the data in this dataset sourced from?

On February 25th, 2020, the Smithsonian released over 2.8 million CC0 interdisciplinary 2-D and 3-D images, related metadata, and research data from researchers across the Smithsonian. The 2.8 million "open access" collections are a subset of the Smithsonian’s 155 million objects, 2.1 million library volumes and 156,000 cubic feet of archival collections held in 19 museums, 9 research centers, libraries, archives and the National Zoo. Digitization of collections is ongoing.

Can you share a sample of the data? A link to a file, an image, a table, etc., are good ways to do this.

https://registry.opendata.aws/smithsonian-open-access/

Confirm that this is a public dataset that can be retrieved by anyone on the Network (i.e., no specific permissions or access rights are required to view the data).

Yes

What is the expected retrieval frequency for this data?

It will serve as a public data copy on the decentralized web that is open for anybody’s access. The retrieval frequency for this data will depend on user demand.

For how long do you plan to keep this dataset stored on Filecoin?

18 months

DataCap allocation plan

In which geographies (countries, regions) do you plan on making storage deals?

We do not have regional preference currently, but our ideal goal would be to make sure that users from all over the world can access this valuable open data.

How will you be distributing your data to storage providers? Is there an offline data transfer process?

We will provide an online URL for the storage providers to download CAR files.

How do you plan on choosing the storage providers with whom you will be making deals? This should include a plan to ensure the data is retrievable in the future both by you and others.

We will list the dataset described above on BigData Exchange and await bids from storage providers. Once the auction period ends, we will check SP’s bidding price, reputation, geographic information, sealing speed, etc. to select the final storage providers to work with.

How will you be distributing deals across storage providers?

The dataset will have 3 replicas and we are planning to distribute them to different storage providers with up to 20% each.
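
As a rough sanity check on this plan, the arithmetic can be sketched as below. The figures (3.24 PiB requested, 300 TiB per week, a 20% cap) come from this application. The helper names are hypothetical, and two readings are assumptions: that the 20% cap applies to each provider's share of the total, and that allocations would arrive at exactly the requested weekly rate.

```python
import math

TIB_PER_PIB = 1024


def min_providers(max_share_percent: int) -> int:
    """Smallest number of providers such that none exceeds the given share."""
    return math.ceil(100 / max_share_percent)


def weeks_to_allocate(total_pib: float, weekly_tib: float) -> int:
    """Whole weeks needed to receive the total at a constant weekly rate."""
    return math.ceil(total_pib * TIB_PER_PIB / weekly_tib)


providers = min_providers(20)         # assumed: 20% cap per provider
weeks = weeks_to_allocate(3.24, 300)  # 3.24 PiB at 300 TiB per week
print(f"at least {providers} providers, about {weeks} weeks")
```

Under these assumptions the plan needs at least five distinct storage providers and roughly three months of weekly allocations.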

Do you have the resources/funding to start making deals as soon as you receive DataCap? What support from the community would help you onboard onto Filecoin?

We will use BigData Exchange to find the right storage providers. For each allocation, we will create auctions with a 7-day auction period. When the auction ends, we will be ready to make deals with storage providers directly. The goal is not to profit from DataCap. We will donate any proceeds to data preservation organizations or use them to fund and incentivize activities that aim to onboard more useful data onto Filecoin.
@large-datacap-requests

Thanks for your request!

Heads up, you’re requesting more than the typical weekly onboarding rate of DataCap!

@large-datacap-requests

Thanks for your request!
Everything looks good. 👌

A Governance Team member will review the information provided and contact you back pretty soon.

@Yvette516

How would you make deals with SP online?

@JackFsy
Author

JackFsy commented Jun 21, 2022

Hey @Yvette516, we will be proposing offline deals with SPs. The process works as follows:

  1. Prepare the CAR files and upload them to an online server.
  2. SPs download the CAR files from the provided URL.
  3. Propose an offline deal to each SP once it has the files ready to import.

@Sunnyiscoming
Collaborator

@JackFsy Hey. Could you send an email to [email protected] with your official domain in order to confirm your identity?

@JackFsy
Author

JackFsy commented Jun 23, 2022

> @JackFsy Hey. Could you send an email to [email protected] with your official domain in order to confirm your identity?

Hey @Sunnyiscoming , sent. Thanks

@kernelogic

I support this LDN. Similarly to Estuary and FilSwan, utility based LDNs are worth consideration.

@JackFsy
Author

JackFsy commented Jun 28, 2022

> I support this LDN. Similarly to Estuary and FilSwan, utility based LDNs are worth consideration.

Hi @kernelogic , thank you for your support!

@Fenbushi-Filecoin

I talked to the team at Austin. I can support this LDN application.

@cryptowhizzard

cryptowhizzard commented Jun 28, 2022

Hi there,

In today's governance call there were some slides, including one stating that the revenue from this dataset would be donated. Can you post those slides in this request, @JackFsy?

From what I understand, @kernelogic (Techgreedy) and Xinan will build and distribute this dataset for you. Is that right?

If I understand both points correctly, I will support this application.

@Destore2023

I support this LDN.

100% match with LDN's demand. A Public Dataset Hub is easier for SP to find data and apply for DC.

@kernelogic

kernelogic commented Jun 28, 2022

Hey @cryptowhizzard

To clarify, I am not building or distributing the dataset through this LDN. Maybe Xin'an will be doing it.

As I understand from the slides, BigD Exchange will build and distribute the dataset themselves to provide more liquidity on their platform, and the proceeds will be donated back to the community.

@JackFsy
Author

JackFsy commented Jun 30, 2022

> I talked to the team at Austin. I can support this LDN application.

Hey @Fenbushi-Filecoin , thank you for your support!


Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacedyvkp2wxekl77dpdnyto6jylwp7ilr6a3qo7mobmnosm2o2xmksq

Address

f1fkh47gdwclmovclfz2kvorhhdofvgr5y7fjvkmi

Datacap Allocated

1.20PiB

Signer Address

f1krmypm4uoxxf3g7okrwtrahlmpcph3y7rbqqgfa

Id

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedyvkp2wxekl77dpdnyto6jylwp7ilr6a3qo7mobmnosm2o2xmksq

@large-datacap-requests

We have found some problems in the information provided in the Approved Comment.
We could not find Id field in the information provided

Please, take a look at the comment and edit the body of the comment providing all the required information.

@cbtan21

cbtan21 commented Feb 6, 2023

@cryptowhizzard @xinaxu Thanks for signing! We still have not received the DataCap. Do either of you know what the issue is?

@filplus-checker-app

DataCap and CID Checker Report [1]

  • Organization: BigData Exchange
  • Client: f1fkh47gdwclmovclfz2kvorhhdofvgr5y7fjvkmi

Approvers

  • 1ane-1 (1)
  • cryptowhizzard (1)
  • kernelogic (6)
  • liyunzhi-666 (1)
  • MetaWaveInfo (1)
  • newwebgroup (3)
  • stcouldlisa (1)
  • xinaxu (2)

Storage Provider Distribution

The table below shows the distribution of storage providers that have stored data for this client.

A provider taking a verified deal for the first time is marked as new.

For most DataCap applications, the following restrictions should apply.

  • No storage provider should exceed 30% of the total DataCap.
  • No storage provider should store more than 20% duplicate data.
  • Each storage provider should have published its public IP address.
  • All storage providers should be located in different regions.

✔️ Storage provider distribution looks healthy.

| Provider | Location | Total Deals Sealed | Percentage | Unique Data | Duplicate Deals |
|---|---|---|---|---|---|
| f01919423 (new) | Sydney, New South Wales, AU (Andrew Sjoquist Enterprises Pty Ltd) | 96.98 TiB | 4.29% | 96.98 TiB | 0.00% |
| f01858429 | Boston, Massachusetts, US (Comcast Cable Communications, LLC) | 49.91 TiB | 2.21% | 49.84 TiB | 0.13% |
| f09848 | Rancho Santa Margarita, California, US (Cox Communications Inc.) | 90.89 TiB | 4.02% | 89.98 TiB | 1.00% |
| f01949183 | Maywood Park, Oregon, US (Flexential Colorado Corp.) | 134.70 TiB | 5.96% | 134.70 TiB | 0.00% |
| f01972364 | Maywood Park, Oregon, US (Flexential Colorado Corp.) | 99.61 TiB | 4.41% | 99.61 TiB | 0.00% |
| f01972376 | Maywood Park, Oregon, US (Flexential Colorado Corp.) | 64.86 TiB | 2.87% | 64.86 TiB | 0.00% |
| f01889668 (new) | San Jose, California, US (HONG KONG Megalayer Technology Co.,Limited) | 331.59 TiB | 14.67% | 328.17 TiB | 1.03% |
| f01518369 (new) | San Jose, California, US (HONG KONG Megalayer Technology Co.,Limited) | 304.30 TiB | 13.46% | 300.53 TiB | 1.24% |
| f01926686 | Hangzhou, Zhejiang, CN (Jiangxi Jiujiang IDC) | 99.34 TiB | 4.40% | 99.31 TiB | 0.03% |
| f0187709 | Moscow, Moscow, RU (MTS PJSC) | 11.84 TiB | 0.52% | 11.84 TiB | 0.00% |
| f01933917 | Hong Kong, Central and Western, HK (OneAsia Network Limited) | 313.69 TiB | 13.88% | 299.42 TiB | 4.55% |
| f01926635 | Hong Kong, Central and Western, HK (OneAsia Network Limited) | 299.55 TiB | 13.25% | 299.55 TiB | 0.00% |
| f01999119 | Hong Kong, Central and Western, HK (OneAsia Network Limited) | 132.89 TiB | 5.88% | 132.77 TiB | 0.09% |
| f01989888 | Hong Kong, Central and Western, HK (OneAsia Network Limited) | 31.05 TiB | 1.37% | 31.05 TiB | 0.00% |
| f01873432 | Las Vegas, Nevada, US (PiKNiK & Company Inc.) | 53.08 TiB | 2.35% | 48.41 TiB | 8.80% |
| f01402814 | Singapore, Singapore, SG (StarHub Ltd) | 46.83 TiB | 2.07% | 46.83 TiB | 0.00% |
| f01908671 | Atlanta, Georgia, US (Unitas Global LLC) | 56.27 TiB | 2.49% | 56.27 TiB | 0.00% |
| f01119939 | Atlanta, Georgia, US (Unitas Global LLC) | 42.78 TiB | 1.89% | 42.78 TiB | 0.00% |

Provider Distribution
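
The 30% and 20% limits quoted above can be re-checked against the table with a short sketch. The rows are copied from the report. The two formulas (a provider's share of the summed deals, and duplicates as sealed minus unique, divided by sealed) are assumptions that happen to reproduce the table's percentages; they are not confirmed to be how the checker itself computes them.

```python
# (provider ID, total deals sealed in TiB, unique data in TiB), from the report table
ROWS = [
    ("f01919423", 96.98, 96.98), ("f01858429", 49.91, 49.84),
    ("f09848", 90.89, 89.98), ("f01949183", 134.70, 134.70),
    ("f01972364", 99.61, 99.61), ("f01972376", 64.86, 64.86),
    ("f01889668", 331.59, 328.17), ("f01518369", 304.30, 300.53),
    ("f01926686", 99.34, 99.31), ("f0187709", 11.84, 11.84),
    ("f01933917", 313.69, 299.42), ("f01926635", 299.55, 299.55),
    ("f01999119", 132.89, 132.77), ("f01989888", 31.05, 31.05),
    ("f01873432", 53.08, 48.41), ("f01402814", 46.83, 46.83),
    ("f01908671", 56.27, 56.27), ("f01119939", 42.78, 42.78),
]

MAX_SHARE = 30.0      # percent of all deals held by one provider
MAX_DUPLICATE = 20.0  # percent of one provider's deals that are duplicates


def check_distribution(rows):
    """Return the summed deal volume and the providers over either limit."""
    total = sum(sealed for _, sealed, _ in rows)
    findings = []
    for provider, sealed, unique in rows:
        share = 100 * sealed / total                  # assumed formula
        duplicate = 100 * (sealed - unique) / sealed  # assumed formula
        if share > MAX_SHARE or duplicate > MAX_DUPLICATE:
            findings.append((provider, round(share, 2), round(duplicate, 2)))
    return total, findings


total, findings = check_distribution(ROWS)
print(f"total sealed: {total:.2f} TiB; providers over a limit: {findings}")
```

With these rows no provider exceeds either limit, which is consistent with the "healthy" verdict in this report.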

Deal Data Replication

The table below shows how much unique data is replicated across each number of storage providers.

  • No more than 30% of unique data should be stored with fewer than 4 providers.

✔️ Data replication looks healthy.

| Unique Data Size | Total Deals Made | Number of Providers | Deal Percentage |
|---|---|---|---|
| 7.22 TiB | 7.22 TiB | 1 | 0.32% |
| 73.84 TiB | 147.84 TiB | 2 | 6.54% |
| 171.02 TiB | 513.61 TiB | 3 | 22.72% |
| 75.50 TiB | 303.84 TiB | 4 | 13.44% |
| 150.27 TiB | 763.38 TiB | 5 | 33.78% |
| 58.75 TiB | 354.78 TiB | 6 | 15.70% |
| 18.28 TiB | 131.36 TiB | 7 | 5.81% |
| 800.00 GiB | 7.78 TiB | 8 | 0.34% |
| 2.59 TiB | 28.47 TiB | 9 | 1.26% |
| 160.00 GiB | 1.88 TiB | 10 | 0.08% |

Replication Distribution
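
The replication rule can be worked through from the table in a few lines. This sketch assumes the 30% limit is measured against the Deal Percentage column, that is, the share of all deals made; that is one reading consistent with the verdict shown here, not a confirmed description of the checker. Measuring by unique data size instead would give a different figure.

```python
# (total deals made in TiB, number of providers), from the replication table
ROWS = [
    (7.22, 1), (147.84, 2), (513.61, 3), (303.84, 4), (763.38, 5),
    (354.78, 6), (131.36, 7), (7.78, 8), (28.47, 9), (1.88, 10),
]

MIN_PROVIDERS = 4  # data on fewer providers counts as under-replicated
LIMIT = 30.0       # maximum allowed share, in percent


def low_replication_share(rows):
    """Percent of all deals whose data sits with fewer than MIN_PROVIDERS."""
    total = sum(deals for deals, _ in rows)
    low = sum(deals for deals, providers in rows if providers < MIN_PROVIDERS)
    return 100 * low / total


share = low_replication_share(ROWS)
print(f"{share:.1f}% of deals are on fewer than {MIN_PROVIDERS} providers "
      f"(limit {LIMIT:.0f}%)")
```

Under that assumption the share comes out just below the 30% limit, so the margin behind the "healthy" result is narrow.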

Deal Data Shared with other Clients

The table below shows how much unique data is shared with other clients.
Usually, different applications own different data, which should not resolve to the same CID.

However, this can happen if all of the clients below used the same software to prepare the exact same dataset, or if they belong to a series of LDN applications for the same dataset.

⚠️ CID sharing has been observed.

| Other Client | Application | Total Deals Affected | Unique CIDs | Approvers |
|---|---|---|---|---|
| f3wn64jznjqmr3s3qt4tfwynxo6k73wdxkhd6wo2l4rwpjdyyddirah2ugfaibvlzz2asaj3ehsp7bwc2zvnia | Jiaxing Yangtze Delta Region Blockchain Technology Research Institute | 583.25 TiB | 3,194 | Fenbushi-Filecoin (3), flyworker (1), kernelogic (2), NDLABS-OFFICE (1), psh0691 (1), steven004 (1), xinaxu (1) |
| f1bstbq5bi72kyovhh7zoo2f6l6uivsjz4ey5dnqq | FilSwan | 168.03 TiB | 2,447 | cryptowhizzard (3), IreneYoung (1), kernelogic (7), liyunzhi-666 (2), psh0691 (1) |
| f1o54sve7ede7im4caux3ug7lsyjmbue7ss3zzl6y | FilSwan | 113.47 TiB | 1,246 | cryptowhizzard (3), IreneYoung (3), jamerduhgamer (1), Joss-Hua (1), kernelogic (9), liyunzhi-666 (2), xingjitansuo (1) |
| f1r3d25hl2y7rqlsu2mgczdethy4qqjmkfdlmibfq | NEXRAD - FilSwan | 21.81 TiB | 698 | cryptowhizzard (1), IreneYoung (2), jamerduhgamer (1), kernelogic (5), liyunzhi-666 (1), Reiers (1), xingjitansuo (1) |
| f3td7znsz2q6laexewfqogczo74xyiwgbysuf6i75k2wrdqwzyvln7wvzmxaz3jvqzskwxqerdwaetdvuejama | Unknown | 10.25 TiB | 297 | Unknown |
| f3wzua4wihcouehv5datojpgalyapegvdfkhkqamxdtr4tzcsn7kylkoaapmmxisbp3tzekgijb32lrswf5t5q | Ninth Heaven Guild | 1.00 TiB | 32 | |
| f1toz5izxdse43peqyd7zktmyqilvhf6u72z74gfq | Starboard Networks | 288.00 GiB | 9 | |
| f1xwlfbvtamrhw7bp5paao4vxybgsbqdibfen3h7q | Unknown | 96.00 GiB | 2 | Unknown |

Footnotes

  1. To manually trigger this report, add a comment with text checker:manualTrigger

@github-actions

This application has not seen any responses in the last 10 days. This issue will be marked with Stale label and will be closed in 4 days. Comment if you want to keep this application open.

github-actions bot added the Stale label on Jul 21, 2023
@github-actions

This application has not seen any responses in the last 14 days, so for now it is being closed. Please feel free to contact the Fil+ Gov team to re-open the application if it is still being processed. Thank you!

github-actions bot closed this as not planned on Jul 26, 2023
@cryptowhizzard

checker:manualTrigger

@filplus-checker-app

DataCap and CID Checker Report Summary [1]

Retrieval Statistics

  • Overall Graphsync retrieval success rate: 24.63%
  • Overall HTTP retrieval success rate: 2.23%
  • Overall Bitswap retrieval success rate: 0.00%

Storage Provider Distribution

⚠️ 1 storage provider sealed too much duplicate data - f01999119: 47.36%

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients [2]

⚠️ CID sharing has been observed. (Top 3)

Full report

Click here to view the CID Checker report.
Click here to view the Retrieval Dashboard.
Click here to view the Retrieval report.

Footnotes

  1. To manually trigger this report, add a comment with text checker:manualTrigger

  2. To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...
