Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ogr2ogr - handling of isolated double quote characters in lines of a csv file #11660

Closed
klausbraun opened this issue Jan 15, 2025 · 5 comments · Fixed by #11665
Closed

ogr2ogr - handling of isolated double quote characters in lines of a csv file #11660

klausbraun opened this issue Jan 15, 2025 · 5 comments · Fixed by #11665
Assignees

Comments

@klausbraun
Copy link

What is the bug?

I'm in a situation where I try to import the geonames data set from https://download.geonames.org/export/dump/allCountries.zip

https://download.geonames.org/export/dump/allCountries.zip

This file contains information about several points on the earth including place names, elevation, population, coordinates and so on. The format is csv with tabs as delimiters and without header. The number of rows respectivly places is 13005646.

When I add an appropriate header to the file and import it to PostGIS using ogr2ogr I only see 36281745 entries in my table.

I did some investigations and saw that ogr2ogr shows a - at least for me - strange behavior when lines with only one double quote character occur like in the fourth line in the attached example (Ile de l" Union).

In this case ogr2ogr obviously looks for an ending double quote character. If this ending double quote character doesn't occur in the same line, all lines between the line with the starting double quote character and the ending double quote character and the line containing the ending double quote character itself will silently be ignored. This also occurs when I use other formats than PostGIS. In the example attached the command:

ogr2ogr -f "ESRI Shapefile" -oo X_POSSIBLE_NAMES=longitude -oo Y_POSSIBLE_NAMES=latitude -a_srs EPSG:4326 -lco ENCODING=UTF-8 -skipfailures Bugreportogr2ogrExample.shp Bugreportogr2ogrExample.csv
creates a shapefile with five entries where the records with the geonameids 1563246, 1567934, 1567935, 1567936 and 1567937 are missing.

For my understandig I would expect that every line in a csv file corresponds to one record in the translated file. I understand that it might be difficult to decide what the double quote characters are standing for and that there might be situations where they are really used to enquote something. But in the case of the geonames file with tab delimiters all between the tabs belongs to one attribute and enquoting is not neccessary.

At least it might be helpful to have a command line option to define the character used for enquoting if needed and - if possible - to give a warning when the number of lines in the csv file or any other input file with vector data differs from the number of records in the output file.

Best

Klaus

Bugreportogr2ogrExample .csv

Steps to reproduce the issue

  1. Download attached file.

  2. Run

ogr2ogr -f "ESRI Shapefile" -oo X_POSSIBLE_NAMES=longitude -oo Y_POSSIBLE_NAMES=latitude -a_srs EPSG:4326 -lco ENCODING=UTF-8 -skipfailures excerptFromGeonames.shp excerptFromGeonames.csv

  1. Count the lines in the input csv file and the output shapefile.

Versions and provenance

System: OpenSUSE 15.6
GDAL 3.10.0
GDAL latest binary from repo https://download.opensuse.org/repositories/Application:/Geo/15.6/

Additional context

No response

@jratike80
Copy link
Collaborator

It is common in CSV to use double quotes around strings, and they are needed when the string contains special characters, like the one that is also used as a separator. A newline character can also be a part of a string.

I read https://gdal.org/en/stable/drivers/vector/csv.html#format and edited your sample like this:

1563245	Île de l” Union	"Ile de l"" Union"...
1567937	Shui-k”ou Kuan	"Shui-k""ou Kuan"

So:

  • add double quotes on both sides of the strings in the "asciiname" field (hopefully single " does not appear is other fields)
  • duplicate the double quote character in the asciiname string

It is sure a bit tricky to write a sed command or something that pre-processes the data so that GDAL can read it right, but I fear that it is what you need to do. CSV looks simple but it may be the hardest to handle "format" that exists.

@rouault
Copy link
Member

rouault commented Jan 15, 2025

Fix in #11665 . The CSV driver had some logic to handle such situation, but it was partly broken , as it was only working if there was an even number of double-quotes per line

@jratike80
Copy link
Collaborator

I will rename my home village into "OzulanBari in OpenStreetMap and destroy your fix.

rouault added a commit to rouault/gdal that referenced this issue Jan 15, 2025
rouault added a commit to rouault/gdal that referenced this issue Jan 16, 2025
@klausbraun
Copy link
Author

Personally I'd really prefer to leave the geonames data or csv data in genereal untouched unless I know that there is a mistake which should be corrected. Therefore I'm not convinced of the idea of duplicating the double quote character in the asciiname string. Furthermore, to do this by hand in a csv file with 13005646 entries would really be a hard job, and doing it automatically would potentially lead to wrong results.

Especially in the case of Ile de l"Union and according to my understanding of the French language - provided it is french - I guess that the double quote in this case is a typo and should be an apostrophe (U+0027). Using two double quotes then would even be more wrong.

At least it would be helpful to create a warning when the number of lines in the csv file or any other input file differs from the number of records in the output file. Otherwise if people do not check and especially in case of large files records can get lost unnoticed.

Thank's for closing this issue.

@jratike80
Copy link
Collaborator

jratike80 commented Jan 16, 2025

This change would not add another " into the result, the extra " used as an escape would be removed by ogr2ogr.
1563245 Île de l” Union "Ile de l"" Union"...

But the change that Even did will remove the need to edit the geonames data. It is still good to remember that it may be necessary to edit the CSV data before GDAL can interpret it right. You cannot imagine what all sorts of CSV files people have.

I was perhaps wrong when I was remembering that a CSV data may have newlines within the fields and counting the number of lines would not show the number of entries. Newlines may appear in the header, though #9172.

Here is an example from Excel. Table looks like this

Image

and this is the CSV, that GDAL surpisingly seems to parse correctly:

test    "linebreaks
in
a
cell"
value1  "value 2
with
breaks"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants