[Request for Discussion] Software inventory metadata schema and inventory collection #41
I've asked a handful of developers here at 18F for some feedback on approach and schema. Here are some highlights:
Would be interesting to be able to record if projects:
That being said, good to make it as easy as possible for folks to get started. We don't want to overwhelm folks.
I firmly believe that the proposed schema should include a standards-compliant 3.5mm headphone jack.
But seriously, some thoughts:

Data Format

Some thoughts on proposed data format standards for agency publication (code.gov consumption). NOTE: a related but distinct feature of code.gov should be the publication of its aggregated inventory. There may be value in providing this inventory in many formats (expecting many varied consumers), whereas below I advocate for a single data format (expecting code.gov to be the sole consumer).

JSON

This should be the standard, IMO, for all the reasons that JSON has become popular: easily readable, ubiquitous, expressive (i.e. it allows for collections (arrays), unlike CSV), and libraries exist in all major languages/platforms for JSON generation. If only one format is supported, I suggest it be JSON.

XML

This wasn't mentioned, but is ubiquitous enough to warrant discussion. As I see it, JSON can do everything XML can do while being more readable, easier to construct, and less complex to define (no WSDL, etc.). If the schema were intended for broad consumption, I might suggest discussing XML support, but seeing as the schema is primarily intended for consumption exclusively by code.gov, I don't think the added complexity yields much additional benefit.

CSV

I don't see any benefits to supporting CSV, which lacks support for multi-dimensional collections (i.e. arrays) beyond the single dimension of the rows in a CSV table. One could hack around this constraint by supporting dynamic column headers. One could argue that CSV is simpler to publish when maintaining an inventory by hand (e.g. by exporting an Excel spreadsheet). While this is true, I don't think that benefit outweighs the inherent limits of the format. It also seems that in the long term, we would want agencies to programmatically generate their inventory file rather than hand-crafting it manually; not supporting CSV may nudge them in the desired direction.

YML

For the sake of completeness -- not mentioned above, but worth discussing. Same attributes as JSON, but somewhat more human-readable and somewhat more fragile (whitespace dependency). I don't see a benefit to supporting YML in addition to JSON.
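To make the format comparison concrete, here is a small editorial illustration (Python, standard library only; the field names are placeholders, not a proposed schema) of a record whose multi-valued field is natural in JSON but must be flattened to fit CSV:

```python
import csv
import io
import json

# A toy inventory record with a collection-valued field.
record = {
    "name": "example-project",
    "description": "An illustrative inventory entry.",
    "languages": ["Python", "JavaScript"],  # arrays are native to JSON
}

# JSON represents the collection directly.
print(json.dumps(record, indent=2))

# CSV forces a workaround: either dynamic headers (languages_1,
# languages_2, ...) or an ad-hoc delimiter packed into a single cell.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "description", "languages"])
writer.writeheader()
writer.writerow({**record, "languages": ";".join(record["languages"])})
print(buf.getvalue())
```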
Collection Methodology

Pros and cons either way. A "pull" methodology seems to be the simplest (it avoids "push" credential checking, account maintenance, etc., and puts the burden of initiation on code.gov centrally rather than on each agency individually). One benefit of a "push" methodology would be more real-time reporting, though I'm not convinced that real-time reporting is very important in this project or outweighs the additional complexity.

CRUD

It's also worth talking about how specific actions should be taken and how certain situations should be interpreted. For example, if a record suddenly stops being included in an agency's reported inventory, what does code.gov do? Delete it? Ignore the omission? Should code.gov assign unique identifiers or require them as part of inventory submission (to avoid duplication and enable "upserts")? Which inventory actions should be idempotent? Etc.

Collected Data

1,000% agree with @theresaanna on fully spelling out field names rather than using abbreviated/Hungarian-like naming.

Relationships / Reuse

If one agency does start using code from another agency, how is that represented in the code.gov data model?
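The CRUD questions above can be made concrete with a minimal sketch. This assumes code.gov keys records by a stable unique identifier and treats each harvest as an authoritative snapshot, which is only one possible answer to the questions the comment raises:

```python
# In-memory stand-in for the code.gov record store; a real system would
# use a database. Keys are (agency, record_id) pairs.

def sync_agency_inventory(store: dict, agency: str, harvested: list[dict]) -> None:
    """Replace an agency's records with its latest harvested snapshot.

    Running this twice with the same input yields the same state, so the
    sync is idempotent. Records absent from the snapshot are deleted,
    which is one interpretation of "a record stops being reported".
    """
    incoming = {(agency, r["id"]): r for r in harvested}
    # Drop records this agency no longer reports.
    stale = [k for k in store if k[0] == agency and k not in incoming]
    for key in stale:
        del store[key]
    # Insert or update ("upsert") everything in the snapshot.
    store.update(incoming)
```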
If data entry is provided, then the format (CSV or JSON) doesn't matter, because the view can be exposed either way. The format does matter for bulk import of metadata, and for that, I'd prefer JSON. I think it's best to ask agencies to submit their inventory to code.gov (this is where that bulk-import feature helps), rather than rely on them to publish on their own sites and pull from there (not all government agencies have up-to-date and convenient sites, and if you provide the platform for receiving the information, it'll probably be easier and faster to get the data than requiring them to sustain their own platform for publishing). Some incentive should be provided to ensure project managers submit this data. Using the data to power a "featured projects" page might be one way to incentivize timely submissions. As for fields,
Adding to what @ctubbsii wrote: Last Updated could be renamed to something like Updated and become an array that holds more information, such as LastCommitDate, LastMetadataUpdate, LastPullRequest, etc. The Languages field should be an array, not a comma-separated string; it will be easier to index that way, IMO. I see little value in supporting CSV or XML. As @rossdakin points out, not offering CSV will point people in the right direction :)
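For illustration, @niden's suggestion might look roughly like this as JSON (shown here as an object keyed by event type, since the suggested names are distinct fields; all field names are assumptions, not an agreed schema):

```python
import json

entry = {
    "name": "example-project",
    "languages": ["Python", "Go"],  # an array, not a comma-separated string
    "updated": {                    # hypothetical replacement for Last Updated
        "lastCommitDate": "2016-11-01",
        "lastMetadataUpdate": "2016-11-15",
        "lastPullRequest": "2016-10-28",
    },
}
print(json.dumps(entry, indent=2))
```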
Government approval processes often become roadblocks and cause systems and data to become stale and unreliable for their purposes. I fear the same for this effort. As red tape is added, this data could become so dated that nobody finds it useful. I suggest that Code.gov needs to get in front of this problem before the culture settles. Encourage agencies to push metadata updates as quickly and as often as possible while reducing red tape in these processes. Make the update process responsive by eliminating any approval processes aside from standard security and authorization measures. I would hate to see all this effort reduced to the usual "I technically did my part" checkbox I find in too many government tasks.
Which data format is the best fit: CSV or JSON?
Is it best to retrieve or ask agencies to submit their inventories?
As I mentioned on our call today, since the majority of our code is behind the NASA firewall, it would reduce perceived risk and increase NASA's adoption of this policy if URLs were optional. Of course, for open source repositories such as the ones NASA maintains here: www.github.com/nasa, we would include the URL fields, as they are important in this context. I think title, description, and POC are all important for code discovery and for setting up potential collaborations between government parties. Schema comments I have, within the Projects array ... also, from a schema standpoint, we should decide whether attributes should be included with NULL values or whether those NULL-valued attributes should be omitted.
If considering a JSON format, it may be useful to follow / look at the npm package file format: https://docs.npmjs.com/files/package.json
And Git hooks would be a good way to submit this information while pushing to GitHub, for projects hosted on that platform.
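A hedged sketch of that idea: a hook script (installed as .git/hooks/pre-push and made executable) that posts refreshed metadata when a developer pushes. The endpoint URL and payload fields are hypothetical; no push API has actually been defined:

```python
#!/usr/bin/env python3
import json
import subprocess
import urllib.request

# Hypothetical agency-side endpoint that accepts inventory updates.
INVENTORY_ENDPOINT = "https://example.agency.gov/inventory/update"

def repo_metadata() -> dict:
    """Collect a minimal metadata payload from the local repository."""
    toplevel = subprocess.run(
        ["git", "rev-parse", "--show-toplevel"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    last_commit = subprocess.run(
        ["git", "log", "-1", "--format=%cI"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return {"name": toplevel.rsplit("/", 1)[-1], "lastCommitDate": last_commit}

def main() -> None:
    req = urllib.request.Request(
        INVENTORY_ENDPOINT,
        data=json.dumps(repo_metadata()).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

if __name__ == "__main__":
    main()
```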
Personally, I would prefer to avoid XML because it isn't as well supported by tools like Jekyll, which may be used for the display / web visualization of the data. Another thought: should the fields / the spelling of the fields be aligned with the type of information that can be grabbed from sources like the GitHub REST API? This would allow, at least for open / GitHub repos, the ability to absorb all projects by only knowing the organization names. This is something I am doing for the @LLNL organization to create a software portal, much like what Code.gov will become, at software.llnl.gov.
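To illustrate, absorbing every repository of a known organization via the GitHub REST API (v3) might look like the sketch below. The API fields read from each response are real; the inventory field names they map onto are assumptions:

```python
import requests

def harvest_org(org: str) -> list[dict]:
    """Fetch all public repos for an org and map them to inventory records."""
    records, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/orgs/{org}/repos",
            params={"per_page": 100, "page": page},
            headers={"Accept": "application/vnd.github.v3+json"},
        )  # unauthenticated requests are subject to low rate limits
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return records
        for r in batch:
            records.append({
                "name": r["name"],
                "description": r["description"],
                "repositoryURL": r["html_url"],
                "languages": [r["language"]] if r["language"] else [],
                "lastCommitDate": r["pushed_at"],
            })
        page += 1

# e.g. harvest_org("LLNL") or harvest_org("nasa")
```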
Oh, and I also agree with @jasonduley that the ability for agencies to push into the repository would greatly ease the integration of "inside the firewall" code hosting.
I want to take a second before responding myself to thank @rossdakin for his detailed post above. He did a great job laying out reasoning behind multiple formats and each delivery mechanism. Thank you for taking the time to share that and add to the conversation. Now, some thoughts in no particular order....
Can you tell me more of how NASA treats internally-resolvable urls as a risk? I'd think as the govt works towards more inner-sourcing and reuse, that being able to go "to" the code will be a big help.
@jbjonesjr
@jasonduley -- Would providing the links, even if they are inaccessible, be an issue? It seems like if it were possible to provide the "where" now, that would assist with identifying where new connections need to be established. @jbjonesjr -- One other thought is that the number of sources for the metadata we (all) would be scraping is fairly limited... There are only so many tools for hosting code. GitHub.com obviously, but also: GitLab, Bitbucket.org, Bitbucket Server, SourceForge, etc. By deciding on a common format and building tools for scraping that data out of these sources, all of the agencies would be able to contribute collaboratively.
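One way to structure that collaboration, sketched here with illustrative names: agree on the common record format once, define one small interface, and let each agency contribute a scraper per hosting tool:

```python
from abc import ABC, abstractmethod

class InventoryScraper(ABC):
    """One implementation per hosting tool, all emitting the common format."""

    @abstractmethod
    def scrape(self, org: str) -> list[dict]:
        """Return inventory records for the given organization/group."""

class GitHubScraper(InventoryScraper):
    def scrape(self, org: str) -> list[dict]:
        ...  # call api.github.com, as sketched earlier in the thread

class GitLabScraper(InventoryScraper):
    def scrape(self, org: str) -> list[dict]:
        ...  # call the GitLab API for the corresponding group

SCRAPERS = {"github": GitHubScraper(), "gitlab": GitLabScraper()}
```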
@IanLee1521
Makes sense... For what it's worth, I suspect we would have similar issues at @LLNL.
Collected Data

Code.gov should reuse data element names and definitions from the Project Open Data metadata schema (https://project-open-data.cio.gov/v1.1/schema/) where possible. These are based on W3C DCAT (http://www.w3.org/TR/vocab-dcat/) and Dublin Core, which have been around for many years. Alternatively, if GitHub, GitLab, or another code repository has existing data elements and types, this project could use those fields. Code.gov could reuse the following fields from Project Open Data:
An example:
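A sketch of what such a record could look like, reusing Project Open Data field names; the values are invented, and softwareLanguage is the field credited to @bbrotsos elsewhere in this thread:

```python
import json

record = {
    "title": "Example Analysis Tool",
    "description": "A record reusing Project Open Data element names.",
    "identifier": "https://example.agency.gov/id/example-analysis-tool",
    "license": "https://www.apache.org/licenses/LICENSE-2.0",
    "contactPoint": {"fn": "Jane Doe", "hasEmail": "mailto:jane.doe@agency.gov"},
    "publisher": {"name": "Example Agency"},
    "bureauCode": ["000:00"],    # useful for analytics by agency/bureau
    "programCode": ["000:000"],  # useful for analytics by investment
    "softwareLanguage": ["Python"],
}
print(json.dumps(record, indent=2))
```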
There may be more fields to reuse. I also recommend adding fields which will be good for analytics of what agencies and investments are releasing their code:
By aligning to these field names, there is also hope of developing a common system for storing data sets, data assets, and code repositories. For example, we could potentially create an extension for CKAN or DKAN to also store code repositories. You could also reuse existing documentation.
I think it would be a good idea to follow the process established by the data.gov effort as much as possible. Since most agencies have been working on setting that up, they should be familiar with JSON and have processes for creating and maintaining the JSON data. Also, try not to invent a whole new schema; if possible, reuse data.gov data descriptions where you are talking about the same thing.
@mgifford I agree that we should make it easy for folks to get started, but you bring up some valuable data points we might collect. Thanks so much for your feedback. A question that remains for me is whether it's better to have an initial version of the schema that we add onto as agencies feel more comfortable or if it's better to be thorough up front.
@theresaanna I think you've got a lot of good material in the above discussion, and may have already seen this from some of my colleagues: https://18f.gsa.gov/2016/08/29/data-act-prototype-simplicty-is-key/ "... One of the earliest decisions our team grappled with centered on the data format we would receive from agencies. ..." I wanted to augment some of the earlier comments: it definitely seems like an "and", and making one machine-readable format is a good way to validate another (e.g. CSV to validate a "more formal" JSON/XML/... spec).
@rossdakin Thank you so much for your thoughtful feedback! You've brought up some great food for thought. I am in agreement with you that a JSON, pull-based system makes the most sense. Some thoughts:
My assumption is that code.gov always reflects the most recent version of agency inventories, meaning we'd delete the record. I don't know if this is a good assumption. Are there cases in which we'd want to hold onto old data? I imagine it'll be normal for software to drop out of inventories as it becomes replaced.
You bring up a great point. I think that for a first version, given the aggressive timeline the policy lays out, we won't be able to tackle this; however, I will add it to our backlog for addressing in the future. I cringe a little to say that, as this is admittedly something we'll want to think about sooner rather than later.
That is a fantastic question! I think we will need to have some discussion around how we might represent that - whether it's in the data model or a layer on top of it. Do you see any benefits to having it in the data model?
Because new code is often generated in association with research data, we are encouraging data submitters to the Ag Data Commons (https://data.nal.usda.gov) to also submit a pointer and metadata description for their software (which we hope is primarily managed in an open source code repository). Two points to make about this:
I would encourage processes to align as closely as possible with the existing open data.gov processes. I have no problem with additional value-added metadata.
Apologies if this question has been asked, but has there been discussion around creating a JSON conversion tool similar to the DCOI Strategic Plan's (https://datacenters.cio.gov/json-conversion-tool/)?
@jecb and others have brought up making this a tool or process to make generating the code inventories as easy as possible. I think the first step in doing so is mapping schema fields to some of the web-based repo hosting tools (e.g., GitHub, Bitbucket), especially those that have APIs. To that end, I've put together this table, which shows what this might look like.
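A sketch of the kind of mapping such a table might capture. The inventory field names on the left are assumptions; the right-hand paths are actual fields returned by the GitHub v3 and Bitbucket 2.0 repository APIs:

```python
# Dotted paths denote nested keys in each API's JSON response.
FIELD_MAP = {
    "github": {
        "name": "name",
        "description": "description",
        "repositoryURL": "html_url",
        "license": "license.spdx_id",
        "lastCommitDate": "pushed_at",
    },
    "bitbucket": {
        "name": "name",
        "description": "description",
        "repositoryURL": "links.html.href",
        "license": None,  # not exposed by the Bitbucket 2.0 repository API
        "lastCommitDate": "updated_on",
    },
}
```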
Collected Data

Maybe "softwareLanguage", as @bbrotsos has listed above, would be appropriate usage for this example.
@ctubbsii Thank you so much for all of your feedback. You've brought up some great points that are so valuable in helping us think this through. I've replied to much of your comment inline:
So, we will be collecting data about presumably many closed-source projects, and so a public URL may not be available.
I agree that it's not very future-proof. My assumption was that agencies would need a way to get in contact with the project maintainers if this inventory were to be useful. However, I'm not sure that's a good assumption. I'm planning to remove it as a required field unless a good argument is made to the contrary.
Interesting. I had thought of this field as a signifier of the activity of a project, but this would be hard to maintain unless we were pulling project info right from GitHub or similar. I agree that it would be useful to see when the metadata was last changed. The more I think about this field, though, the less convinced I am that we need it. Until we have a tool to generate the inventory JSON, I imagine this field will fail to be updated with changes, making it unreliable.
Agreed that it is important, but unfortunately not all software will have a license. Along the lines of the suggestion you made about incentives, perhaps there's a way to encourage folks to release code and help them decide which license is right.
There are projects that are built as platforms or specifically to be reused. For example, the eRegulations project (https://eregs.github.io/). This field will allow users to look specifically for these types of projects.
Definitely agreed on preferring to start small.
@niden these are great suggestions, thank you. I will implement your languages field suggestion - I agree. @okamanda, I'm interested in your thoughts here. Do you see a need for Last Updated that I don't? I worry that it will fail to be updated and then become unreliable data if folks are updating it manually.
@jbjonesjr I agree with you that sometimes there are version bumps solely for marketing and other purposes; however, nothing prevents someone from performing a pointless update to a code base solely to cause the last-updated field to change.[1]

[1] I'm assuming here that once a project's URL has been submitted to code.gov, the servers can automatically look for any updated projects and update their databases accordingly. Computers are lousy at determining which changes are important ones, so this would be a trivial trick for an unscrupulous person to make it appear that their project is getting lots of updates.
One thought on the topology. "Project" here seems synonymous with "repository" — I could see this being confusing when listing projects that have multiple repositories (e.g. a UI, an API, etc.). Possible mitigations:
Some feedback, from my personal opinion:
+1 for YML per @NoahKunin. It is assumed a developer would do this task, but it is left up to the agency how it is accomplished. It would not surprise me if some agencies task their Public Affairs or Security Offices with maintaining it, since it is public facing.
Question: How do you plan to track government contributions to existing public OSS projects that were not started by the government?
Maybe it was intentional to get a fresh perspective, but it seems like the original discussions on this topic from the policy should be required reading here. See:
Seems like we may be re-hashing many of the same points. In fact, it seems like there are at least four different threads on the metadata schema topic and it's a bit confusing to follow along. Here are the threads I've identified in chronological order:
Where possible, I'd suggest trying to de-duplicate or consolidate these threads, or at least update the first post on the thread to distinguish the different threads if each is meant to serve a distinct purpose.
Thanks @philipashlock. Especially with your helpful write-up in place, let's consolidate the discussion here. I'm going to close out the other issues and point folks to this thread, which is the most active overall.
@mattbailey0 -- Would it make sense to start working all of these discussions into the draft guidance, rather than continuing to solely use the issue threads?
Current Status

What's the status of this schema? There's documentation on code.gov that seems somewhat final, but there seem to be a number of important points that haven't been addressed, this issue is still open, and the code.gov site somewhat confusingly says that both the publication of the metadata schema and implementation of the schema by agencies are due December 9th (referring to Section 7.2).

Allow for revisions

Whether this is final or it's still possible to make some minor updates, I would suggest creating some expectations or provisions for a revision within a year or so, after there's sufficient experience and feedback from those who have implemented and consumed it. We did this with the Project Open Data metadata schema, and the 1.1 update not only allowed us to address issues that had come up, but also fully aligned it with the international standard established by voluntary consensus bodies (DCAT). It's understandable that there was a short timeline to establish this schema, but we don't want to create the impression that this draft will be locked in for perpetuity. One of the ways we addressed this with the Project Open Data schema is that in the v1.1 update we required implementors to explicitly state the version of the schema at the top of the file.

Use existing standards

While it may not seem like the development of this schema is part of a standards-making process, it really is if agencies are required to follow it. OMB A-119 sets out basic requirements for the use of standards in government, specifically: "this Circular directs agencies to use voluntary consensus standards in lieu of government-unique standards except where inconsistent with law or otherwise impractical." In other words, government should avoid creating government-specific standards unless it has a good reason to do so. Avoiding reinventing the wheel also meets the spirit of reuse set out in this policy. With that in mind, it would be good to review existing standards and document why they are or are not practical to use here. A number of existing schemas and specifications have been raised in this discussion, including: the Asset Description Metadata Schema for Software (ADMS.SW) used by federated national software catalogs across Europe, which integrates much of the DCAT vocabulary used for the Project Open Data data.json schema; the civic.json schema (with various flavors that have been used or proposed by the civic tech community in the U.S., including BetaNYC, Code for America, and DC Government); the Schema.org SoftwareSourceCode and SoftwareApplication schemas, which appear to be implemented by a relatively small number of websites (10 and fewer than 50,000, respectively); and the NIST specification for Asset Identification, which I think is mostly used to describe software in an operational environment rather than as an autonomous asset ready for reuse. The current schema appears to be largely based on the civic.json specification. The pros of this are that it's something that's already been developed by the community and it's relatively simple. The cons are that it's not clear that it has been widely used, well documented, or even proposed consistently enough to enable interoperability. The ADMS.SW specification seems like the most robust standard aligned with the needs of Code.gov.
The pros of this are that it's been developed through formal voluntary consensus bodies, is thoroughly documented, aligns with the DCAT schema used for the open data policy, and is implemented in a federated way by European government bodies, just as needed by U.S. federal agencies. The cons are that it appears overly complex, with very dense documentation. You can see a full PDF copy of the ADMS.SW spec here (copied from here) and a presentation about it here. The Schema.org schemas are fairly simple, well documented, and developed through a voluntary consensus process. One of the biggest pros is that these are supported by the major search engines, which means that they should be indexed by search engines, and that's the most likely way people will find software (not on code.gov). The con is that these are not yet well adopted, at least not SoftwareSourceCode, and the search engines do not yet appear to be doing anything special to index these. However, it's totally possible to implement one of the schemas mentioned above while also implementing a schema.org schema, but you'll want to be sure there's a good mapping between the two. We did this with the Project Open Data metadata schema, but it was fairly easy because the POD schema is merely an extension of DCAT and the schema.org Dataset schema was explicitly based on DCAT. None of the major search engines were doing anything special to index the schema.org Dataset schema when it was first implemented on Data.gov, but Google is now working on this more and expanding the Dataset schema for the way Google wants to index things like Science Datasets, and I think we can expect something similar to happen with software. So while it seems like a fairly final decision to develop something new based on the civic.json schema, I think it's worth considering whether more could be done to leverage the work that's gone into ADMS.SW, to reuse the elements in DCAT already used by the open data policy, to align with a formal voluntary consensus standard, and to allow for interoperability with the federated European software catalog. That said, more should be done to provide a simplified profile of ADMS.SW and to better understand the pros and cons of ADMS.SW in practice. We did this with POD v1.1 and DCAT by working with W3C to make data.json a formal representation of DCAT with JSON-LD, and I think we found a good compromise. When POD v1.0 was developed, it was mostly aligned with DCAT, but DCAT had not been finalized. POD v1.1 is now compatible with DCAT, and a large portion of national data catalogs around the world use DCAT. The European Union uses DCAT as the basis for their federated Europe-wide data catalog. And even where an existing specification isn't fully packaged to meet all the needs here, you can still assemble fields from existing vocabularies. This allows for field-level interoperability and can ensure you reuse properties that are already well defined rather than coin new ones that are vague or inconsistent.

Feedback on fields

In the meantime, here's some feedback on specific fields (some of this reiterates or emphasizes John's comments):
Missing Fields
Serialization Format

I recommend JSON for many of the reasons others have stated. It has worked for Project Open Data data.json, and we have built out the infrastructure to validate and harvest in this format. JSON-LD is also now the format recommended by Google for schema.org schemas and other structured data on webpages. Some have suggested YAML as an alternative because it's more human-readable and easy for folks to edit, but this also means it's more likely to result in poor or inconsistent data quality for any data structure with even moderate complexity. With the initial implementation of the Project Open Data data.json schema, many folks attempted to maintain their JSON metadata by hand, and this resulted in the majority of the problems we encountered with regard to harvesting and interoperability. I would strongly suggest that we do not rely on a structured data format that is edited by hand, but agencies are free to allow for this upstream as long as they validate it when compiling their aggregate copy. It's worth noting that JSON is actually a subset of YAML, so agencies could allow either YAML or JSON from individual offices if they're using a YAML parser, but they'll still have to validate it against the final JSON schema requirements and provide a comprehensive JSON version.
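A minimal sketch of that arrangement, assuming the PyYAML and jsonschema packages and a stand-in schema where the real code.gov JSON Schema would go:

```python
# pip install pyyaml jsonschema
import yaml
from jsonschema import validate

# Stand-in schema; the real code.gov schema definition would go here.
INVENTORY_SCHEMA = {
    "type": "object",
    "required": ["projects"],
    "properties": {"projects": {"type": "array"}},
}

def load_and_validate(path: str) -> dict:
    """Load a YAML or JSON inventory file and validate it before publication."""
    with open(path) as f:
        data = yaml.safe_load(f)  # a YAML parser accepts JSON input as well
    validate(instance=data, schema=INVENTORY_SCHEMA)
    return data
```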
I've attempted an initial mapping between code.json and ADMS.SW. Note that ADMS.SW follows the same conceptual model as DCAT used for Project Open Data data.json:
To clarify these relationships, I created a visual diagram similar to the Schema Object Model Diagram provided for the Project Open Data version of DCAT, but this diagram includes all the fields provided by ADMS.SW rather than being pared down to just the required, optional, and extended fields as is the case with the POD diagram. The property mapping and descriptions here are based on the full ADMS.SW documentation PDF and the HTML version of the RDF schema. I would refer to those documents for full property definitions. Also note that some of the properties here are synonymous with those in DCAT even if they use a different property name or namespace.

Software Repository

A Software Repository is a system or service that provides facilities for storage and maintenance of descriptions of Software Projects, Software Releases and Software Packages, and functionality that allows users to search and access these descriptions. A Software Repository will typically contain descriptions of several Software Projects, Software Releases and related Software Packages. An example of a Software Repository is the Apache Software Foundation Project Catalogue.
Software Project

A Software Project is a time-delimited undertaking with the objective to produce one or more software releases, materialised as software packages. Some projects are long-running undertakings and do not have a clear time-delimited nature or project organisation. In this case, the term 'software project' can be interpreted as the result of the work: a collection of related software releases that serve a common purpose. An example of a Software Project is the Apache HTTP Server Project.
Software Release

A Software Release is an abstract entity that reflects the intellectual content of the software at a particular point in time and represents those characteristics of the software that are independent of its physical embodiment. This abstract entity corresponds to the FRBR entity expression (the intellectual or artistic realization of a work). A release is typically associated with a version number. An example of a Software Release is the Apache HTTP Server 2.2.22 (httpd) release.
Software Package

A Software Package represents a particular physical embodiment of a Software Release, which is an example of the FRBR entity manifestation (the physical embodiment of an expression of a work). A Software Package is typically a downloadable computer file (but in principle it could also be a paper document) that implements the intellectual content of a Software Release. A particular Software Package is associated with one and only one Software Release, while all Packages of an Asset share the same intellectual content in different physical formats. An example of a Software Package is httpd-2.2.22.tar.gz, which represents the Unix source of the Apache HTTP Server 2.2.22 (httpd) software release. Software often has at least two kinds of physical embodiments: a source code package and a binary package. Binary packages are sometimes compiled for different operating systems or are released under different licences, e.g. in case of dual licensing. Also, scripting languages need some sort of packaging for installation systems used by end users.
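To make the cardinalities concrete, here is the four-entity ADMS.SW model restated as Python dataclasses. The field choices are illustrative, not the full ADMS.SW property sets:

```python
from dataclasses import dataclass, field

@dataclass
class SoftwarePackage:
    file_name: str   # e.g. "httpd-2.2.22.tar.gz"
    media_type: str  # e.g. "application/x-gzip"

@dataclass
class SoftwareRelease:
    version: str  # e.g. "2.2.22"
    # Each package embodies exactly one release; a release may have many
    # packages (source tarball, per-OS binaries, dual-licensed variants).
    packages: list[SoftwarePackage] = field(default_factory=list)

@dataclass
class SoftwareProject:
    name: str  # e.g. "Apache HTTP Server Project"
    releases: list[SoftwareRelease] = field(default_factory=list)

@dataclass
class SoftwareRepository:
    name: str  # e.g. "Apache Software Foundation Project Catalogue"
    projects: list[SoftwareProject] = field(default_factory=list)
```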
As @philipashlock noted on November 7,
The first part suggests to me that anything under the ... Part of my concern is how automated tools will handle the ...
@ckaran CC0 is addressed in OSI's FAQs. No decision was made by OSI on whether it meets their definition of "Open Source". However, it would be useful to know what definition of "Open Source" to use when completing the field. My personal recommendation is to avoid "traps" like CC0, where it's "open" with respect to copyright, but patent use rights are explicitly not conferred. MIT and BSD avoid the question entirely (not explicitly conferred), and GPL tends to impose restrictions on consumers that I don't think the government should be in the business of imposing, so I prefer ASL 2.0, myself, for government-released open source projects. (ASL 2.0 also provides a convention to use a NOTICE file for copyright notices, separate from the license, where it would be appropriate to add a brief text noting the license is not applicable domestically for the portions of code produced exclusively by government employees on behalf of the U.S. government.)
cc/ @benbalter if you have specific thoughts to share:
@ctubbsii You're right about the problems of patents, etc. with regard to CC0. The lab I work for has been working to avoid the problem by requiring all external contributors to sign a contributor license agreement (CLA) before their contributions will be included in any of the lab's projects (you can read the policy here). The lab's lawyers believe that will solve the issues directly related to patents and other IP rights. Note that the policy was adopted by the lab on 19 Dec 2016, but it already has one issue; we can't currently post our CLA, nor can we accept CLAs at the current time, as (by design) an executed CLA will contain what can be argued to be personally identifiable information (PII). The lawyers I've talked to tell me that means the lab must obey the Privacy Act, which requires some more work. So, if you read the policy and expect that we'll be able to start accepting contributions immediately, I'm sorry to say that we can't.
@ctubbsii Honestly, if we could, I would recommend the standard OSI-approved licenses, including the ASL 2.0, for exactly that reason. Unfortunately, most of the work produced by my lab doesn't have copyright attached, which means that copyright-based licenses may fail in court.
@ckaran Not sure what you mean by "doesn't have copyright attached". My guess is that you mean "public domain" or you simply mean that nobody is interested in asserting copyright. If it's the former, it probably only applies domestically. The creators still may own copyright internationally, so a license is still worth recommending. If you mean the latter, well, omission of a copyright notice does not disclaim copyrights. If a work isn't covered by copyright (because it's public domain, for instance), an infringement claim would certainly fail in court... but I'm not sure why that matters. That only matters if the creators intend to enforce/assert their copyright claims in the face of a particular infringement, via a lawsuit. If you know the work is public domain in the jurisdiction where the violation occurred... simply don't pursue it with a lawsuit in that case... it's really as simple as that. The license still communicates the limitations of the rights granted in jurisdictions where copyright is applicable (who cares if it's void in jurisdictions where it's not applicable?) and communicates a minimum set of rights guaranteed to everybody else. This instills confidence in the project's users, allowing them to use it according to the license conditions without fear of reprisal. Often (as in the case of ASL 2.0), it also explicitly conveys the rights the project expects contributors to grant, in order for the contributions to be accepted into the project (in other cases, this might be implicit). This is valuable to a project, even if some portions of the project are not subject to copyright protections (public domain).
@ctubbsii Sorry, I've been talking with our legal counsel for too long. Yes, I mean works that are in the public domain. I've talked with the appropriate people in the Justice Department to see if US Government works have copyright outside of the US. They told me that the US Government's position is that it does, but the lawyer I spoke with wasn't able to find any case law to back that up. What's more, it would have to be litigated in the courts of each nation individually, so there isn't a single 'right' answer. As for why all this is important, it comes down to severability and warranty/liability. Assume that some Government work is licensed under the Apache License 2.0, which is a license that depends on copyright. Someone can sue the Government claiming that the clauses that depend on copyright are void, and (because there is no severability clause), so are all the other clauses. If a court agrees that the license as a whole is void simply because the US Government doesn't have copyright within the US, then that includes the clauses regarding warranty and liability, which means that the Government might be on the hook for damages in some manner[1]. Moreover, downstream users/projects may also have problems[1]. For works that have copyright and are contributed to the Government, I think that the Government would be OK with any of the standard OSI-approved licenses. However, work that is created by Government employees might be in the public domain, so then you have a weird mix of stuff that is protected by the license, and stuff that might not be[1]. Will this cause an issue? I don't know, but I'm not interested in finding out. [1] I'm not a lawyer, this is not legal advice, and as far as I know, this has not yet been litigated in a court.
@ckaran Oh, I see. Perhaps code.gov should fork ASL 2.0 (which is permitted) and add a severability clause. (Note: I'm currently promoting a discussion on the Apache Mailing Lists about adding this in some future version of the license, perhaps 2.1.)
@ctubbsii I've thought about forking it, but that could also start to fork Open Source (there will be questions about which licenses are compatible with other licenses, which could be problematic; @massonpj, is this a good assessment?) @ctubbsii I've seen your discussions on the ASL lists; I think that is the best way to go. Not only could everyone (Government and private) use the same license, it would also mean that the license is OSI-approved, which the forked license may not be. The reason this is important is that some journals will only accept code that is under an OSI-approved license; JOSS is one of them. See the discussion here for some of the issues. Basically, what I want are modifications to the standard Open Source licenses that ensure that works that don't have copyright attached have all the following:
[1] Public domain code by definition doesn't have copyright protections, but in a mixed work that has some copyrighted material and some public domain material, the copyrighted material should not be effectively reduced to being public domain; if that was what the authors had intended, then they would have put it in the public domain. That means the license has to be inherently flexible enough to handle this case. IP protections mean that public domain work doesn't get hammered by patent headaches from contributions.
Yes. @ctubbsii, while anyone can create their own license, the OSI's License Review Process "ensures that licenses and software labeled as 'open source' conforms to existing community norms and expectations." Simply creating a new license and labeling it an "open source software license" is not good.
@massonpj Obviously, any new license should be approved by both FSF and OSI. The biggest issue I think OSI would have is seeing it as "duplicative" if it's too similar.
Part of the Federal Source Code Policy requires that federal agencies make available an inventory of metadata describing their custom software. We’re exploring ways for agencies to provide their inventories. We want to implement a solution that works well for agencies and we need your help to do that.
The Federal Source Code Policy describes code.gov as “the primary discoverability portal for custom-developed code intended both for Government-wide reuse and for release as OSS.” The inventory data that agencies provide will be made available through code.gov. The data we collect should make it possible for agencies to find projects relevant to their needs.
We see two primary areas where decisions need to be made: the data format and what data is collected.
Data Format
The two options we are considering are CSV and JSON. The assumed benefit to a CSV-based approach is that it is easier for agencies to create and maintain a CSV than JSON. With this approach, we might create a system for agencies to submit their inventory CSV.
With a JSON-based approach, we might ask agencies to make the “inventory.json” available on their website and we would have a system to retrieve inventories as they change. One drawback to JSON is that it is more effort to maintain, takes specialized knowledge, and we may need to provide a tool to build the JSON. On the other hand, JSON is easy to work with programmatically and matches what Data.gov does, meaning many agencies have some familiarity with the process that inventory updating would entail.
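As an illustration of such a retrieval system, the harvester could poll each agency's published file with HTTP conditional requests so unchanged inventories cost almost nothing; the URL pattern and use of ETags are assumptions about how agencies would publish:

```python
import requests

def fetch_inventory(url: str, etag: str | None) -> tuple[dict | None, str | None]:
    """Fetch an inventory.json, skipping the download if it is unchanged."""
    headers = {"If-None-Match": etag} if etag else {}
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:  # unchanged since the last harvest
        return None, etag
    resp.raise_for_status()
    return resp.json(), resp.headers.get("ETag")

# e.g. fetch_inventory("https://example.agency.gov/inventory.json", None)
```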
The unanswered questions on data format are:
Collected Data
In either data format, we need to determine what data we will collect. Below is a list of fields we are considering accepting.
Proposed required fields:
Proposed optional fields:
For an idea of what the data might look like, we have an early draft of a schema with example content: https://gist.github.com/theresaanna/a82bfb39b64362bca04e4644706b0ce4
The questions that we are looking to answer here are:
Thanks for your feedback! It's crucial in meeting our goal of providing a system and schema that are easy to use and meet agencies' needs.