read_lines deals with Carriage Return (?) different #1210

ldecicco-USGS · 2021-05-13T17:24:52Z

I'm seeing a difference in how readr 2.0 parses lines compared to 1.4:

obs_url <- "https://nwis.waterdata.usgs.gov/nwis/qwdata?multiple_site_no=04024430,04024000&multiple_parameter_cds=34247,30234,32104,34220&param_cd_operator=OR&list_of_search_criteria=multiple_site_no,multiple_parameter_cds&group_key=NONE&sitefile_output_format=html_table&column_name=agency_cd&column_name=site_no&column_name=station_nm&inventory_output=0&rdb_inventory_output=file&TZoutput=0&pm_cd_compare=Greater%20than&radio_parm_cds=previous_parm_cds&qw_attributes=0&format=rdb&rdb_qw_attributes=expanded&date_format=YYYY-MM-DD&rdb_compression=value&qw_sample_wide=0&begin_date=2010-11-03"

lines <- readLines(obs_url)
base_meta <- lines[grep("\\#", lines)]
length(base_meta)
[1] 123

packageVersion("readr")
[1] ‘1.4.0.9000’
lines2 <- readr::read_lines(obs_url)
meta_lines <- lines2[grep("\\#", lines2)]
length(meta_lines)
[1] 117

packageVersion("readr")
[1] ‘1.4.0’
lines_OG <- readr::read_lines(obs_url)
meta_lines <- lines_OG[grep("\\#", lines_OG)]
length(meta_lines)
[1] 123

Raw text:

# M  - presence verified but not quantified\r\n# Description of val_qual_tx:\n# b  - value extrapolated at low end\n# c  - see result comment\n# n  - below the reporting level but at or above the detection level\n# t  - below the detection level\n#\r\n

So the lines that are parsed incorrectly end with a bare \n, whereas the lines that readr picks up correctly end with \r\n.
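One way to confirm that a payload mixes the two line endings is to count them directly. The following is only a rough sketch: `raw_txt` is a shortened, hypothetical stand-in for the raw text above, not the actual server response.

```r
# Hypothetical sample imitating the raw text above: some lines end in \r\n, others in bare \n.
raw_txt <- "# M  - presence verified\r\n# Description of val_qual_tx:\n# b  - value extrapolated\n#\r\n"

# Helper (illustrative name): count fixed-string occurrences of `pattern` in `x`.
count_fixed <- function(pattern, x) {
  lengths(regmatches(x, gregexpr(pattern, x, fixed = TRUE)))
}

n_crlf    <- count_fixed("\r\n", raw_txt)          # lines ending in CRLF
n_bare_lf <- count_fixed("\n", raw_txt) - n_crlf   # LFs not preceded by a CR

c(crlf = n_crlf, bare_lf = n_bare_lf)
```

If both counts are non-zero, the text has mixed line endings.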


jimhester · 2021-05-13T17:45:17Z

Yeah, having a mix of newlines is definitely the issue. I'll see if there is something we can do to work around files like this.
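Until a fix is available, a possible user-side workaround is to read the file as a single string and split on either ending. This is a minimal base-R sketch, not the approach taken in readr or vroom; the temporary file and its contents are made up for illustration.

```r
# Write a small hypothetical file with mixed line endings.
# writeBin() is used so the bytes are written exactly as given.
tmp <- tempfile()
writeBin(charToRaw("# a\r\n# b\n# c\n#\r\ndata\r\n"), tmp)

# Read the whole file as one string, then split on CRLF or bare LF.
txt   <- readChar(tmp, nchars = file.size(tmp), useBytes = TRUE)
lines <- strsplit(txt, "\r?\n")[[1]]

# Select lines containing "#", as in the original report.
meta <- lines[grep("#", lines, fixed = TRUE)]
length(meta)
```

Because the split pattern treats the carriage return as optional, lines with either ending are separated the same way and no stray \r is left at the end of a line.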

jimhester · 2021-05-17T19:25:04Z

This should be fixed by tidyverse/vroom@2b94f88

library("readr")

obs_url <- "https://nwis.waterdata.usgs.gov/nwis/qwdata?multiple_site_no=04024430,04024000&multiple_parameter_cds=34247,30234,32104,34220&param_cd_operator=OR&list_of_search_criteria=multiple_site_no,multiple_parameter_cds&group_key=NONE&sitefile_output_format=html_table&column_name=agency_cd&column_name=site_no&column_name=station_nm&inventory_output=0&rdb_inventory_output=file&TZoutput=0&pm_cd_compare=Greater%20than&radio_parm_cds=previous_parm_cds&qw_attributes=0&format=rdb&rdb_qw_attributes=expanded&date_format=YYYY-MM-DD&rdb_compression=value&qw_sample_wide=0&begin_date=2010-11-03"
lines2 <- readr::read_lines(obs_url)
meta_lines <- lines2[grep("\\#", lines2)]
length(meta_lines)
#> [1] 123

Created on 2021-05-17 by the reprex package (v2.0.0)

jimhester closed this as completed May 17, 2021
