Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data consistency issues #438

Closed
kurtishouser opened this issue Apr 25, 2019 · 5 comments
Closed

Data consistency issues #438

kurtishouser opened this issue Apr 25, 2019 · 5 comments

Comments

@kurtishouser
Copy link
Contributor

I'll be submitting a pull request this afternoon to correct some errors with the cfr_index property on three files in the avas folder.

I've noticed several other consistency issues and would be happy to correct these as well in a separate pull request. I want to be sure before I spend the time though.

They are:

  • Should all AVAs have a top-level name property? Many do not.
  • Should properties.ava_id be all lower case? Some are not.
  • What is the correct date format? Both YYYY-MM-DD and YYYY/MM/DD are used. The README indicates that it should be the former.
  • I've encountered a lot of escaped newline characters , /n, throughout many of the text properties. Some are trailing and some in the body. Those at the end should definitely be removed. And since this is a dataset we should probably avoid adding any kind of formatting as well such as what I'm seeing in the used_maps properties.

Maybe there should be some detailed documentation regarding how each GeoJSON file should be formatted. We could even write some tests to ensure the integrity of the data. Contributors would have to run the tests first to make sure they all pass before submitting a pull request.

Also, when the individual AVA files are updated. Who's responsible for updating the consolidated versions?

This is a fantastic project. I look forward to contributing to it further.

Thanks,

Kurtis

@MicheleTobias
Copy link
Contributor

Hi Kurtis,
Thanks for you ideas and questions!

I think most of the information you have questions about was automatically created from the Electronic CFR page when we made the templates for the AVAs waiting to be digitized, and I know at least some of the inconsistencies like the CFR Index (See Issue #333 ) originate from there . I suspect the escape characters also came from there. I'd be happy to have your ideas of how to fix some of these issues.

The date format has been a bit of a moving target. Sometimes QGIS seems to reformat the dashes or slashes so I've been letting that slide for now, but ideally they should be consistent.

I'm the person who creates the big file with all the AVAs when a new AVA is submitted. The R script I use is in the rcode folder. I'm realizing that is out of date after a couple of pull request the last few days so I'm going to go update that now.

best,
Michele

@MicheleTobias
Copy link
Contributor

Now that my brain has had a chance to think about all these things, I'm planning to write a script to clean up the format issues you pointed out.

@kurtishouser
Copy link
Contributor Author

I think a good first step would be to document what should be in the GeoJSON files as well as the overall structure. What properties are required, what's optional, data type, format, etc. And also what to do if the information isn't provided. For example with cfr_author and removed I see a lot of N/A, NA, not listed and empty strings. Leaving them null would probably provide the most flexibility.

Be very specific. For example, ava_id is a lower case, string. It is a snake_case version of the official AVA name. Replace spaces and punctuation with an underscore. Ignore periods. I would avoid double underscores. So sta__rita_hills would be sta_rita_hills instead, including the filename.

Also, it might be better to remove the top level name property entirely since it's redundant.

I've also encountered leading spaces in the cfr_author as well as the trailing newlines mentioned above so stripping these out with your scripts would be helpful too.

Additional thoughts (maybe open a separate issue for each to discuss with others?):

  • For properties with multiple values such as county, state, within and contains perhaps convert them to an array instead of a pipe-delimited string.
  • Add property for partial overlap as indicated on https://www.ttb.gov/wine/us_by_ava.shtml.
  • Format (prettify) the GeoJSON files. It will make them easier to edit and any changes will be easier to see with git diff.
{
  "type": "FeatureCollection",
  "features": [{
    "type": "Feature",
    "geometry": {
      "type": "MultiPolygon",
      "coordinates": [
        [
          [
            [-93.8900165363182, 35.5418952876998],
            [-93.9211350905844, 35.5089957584565],
            [-93.858897982052, 35.5089957584565],
            [-93.8900165363182, 35.5418952876998]
          ]
        ]
      ]
    },
    "properties": {
      "ava_id": "altus",
      "name": "Altus",
      "aka": null,
      "created": null,
      "removed": null,
      "county": "Franklin",
      "state": "AR",
      "within": "Arkansas Mountain|Ozark Mountain",
      "contains": null,
      "petitioner": null,
      "cfr_author": null,
      "cfr_index": "9.77",
      "cfr_revision_history": "[T.D. ATF-176, 49 FR 22471, May 30, 1984]",
      "approved_maps": "(1) Ozark Quadrangle, 1966 ...",
      "boundary_description": "General.  The Altus viticultural area ...",
      "used_maps": null,
      "valid_start": null,
      "valid_end": null,
      "lcsh": null,
      "sameas": null
    }
  }]
}

Have a great weekend!

Kurtis

@wildintellect
Copy link

For properties with multiple values such as county, state, within and contains perhaps convert them to an array instead of a pipe-delimited string.

Most GIS software that reads geojson won't be able to handle or edit such attributes

Add property for partial overlap as indicated on https://www.ttb.gov/wine/us_by_ava.shtml.

I think the contains and within are actually in the AVA text descriptions. I'm not sure if any mention overlap. Easy enough to derive if needed.

Format (prettify) the GeoJSON files. It will make them easier to edit and any changes will be easier to see with git diff.

Most GIS software that writes geojson will not preserve this, if a consistent method could be found that could be useful for the diffs.

Fields are defined in the Readme. I'm not sure there's a way to consistently handle empty fields, once again the software used to edit sometimes changes if it's null. Many of these fixes can be applied at the R script after submission.

@MicheleTobias
Copy link
Contributor

I'm going to make individual issues for the pieces of this that need to be addressed and close this issue. I agree with Alex that some of these are things that aren't possible given that the focus here is producing a spatial data format readable by GIS software, however, I am getting started on an R script to fix the major inconsistency issues. See the consistent-formatting branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants