ogr2ogr - handling of isolated double quote characters in lines of a csv file #11660

klausbraun · 2025-01-15T16:15:00Z

What is the bug?

I'm in a situation where I try to import the geonames data set from https://download.geonames.org/export/dump/allCountries.zip

https://download.geonames.org/export/dump/allCountries.zip

This file contains information about several points on the earth including place names, elevation, population, coordinates and so on. The format is csv with tabs as delimiters and without header. The number of rows respectivly places is 13005646.

When I add an appropriate header to the file and import it to PostGIS using ogr2ogr I only see 36281745 entries in my table.

I did some investigations and saw that ogr2ogr shows a - at least for me - strange behavior when lines with only one double quote character occur like in the fourth line in the attached example (Ile de l" Union).

In this case ogr2ogr obviously looks for an ending double quote character. If this ending double quote character doesn't occur in the same line, all lines between the line with the starting double quote character and the ending double quote character and the line containing the ending double quote character itself will silently be ignored. This also occurs when I use other formats than PostGIS. In the example attached the command:

ogr2ogr -f "ESRI Shapefile" -oo X_POSSIBLE_NAMES=longitude -oo Y_POSSIBLE_NAMES=latitude -a_srs EPSG:4326 -lco ENCODING=UTF-8 -skipfailures Bugreportogr2ogrExample.shp Bugreportogr2ogrExample.csv
creates a shapefile with five entries where the records with the geonameids 1563246, 1567934, 1567935, 1567936 and 1567937 are missing.

For my understandig I would expect that every line in a csv file corresponds to one record in the translated file. I understand that it might be difficult to decide what the double quote characters are standing for and that there might be situations where they are really used to enquote something. But in the case of the geonames file with tab delimiters all between the tabs belongs to one attribute and enquoting is not neccessary.

At least it might be helpful to have a command line option to define the character used for enquoting if needed and - if possible - to give a warning when the number of lines in the csv file or any other input file with vector data differs from the number of records in the output file.

Best

Klaus

Bugreportogr2ogrExample .csv

Steps to reproduce the issue

Download attached file.
Run

ogr2ogr -f "ESRI Shapefile" -oo X_POSSIBLE_NAMES=longitude -oo Y_POSSIBLE_NAMES=latitude -a_srs EPSG:4326 -lco ENCODING=UTF-8 -skipfailures excerptFromGeonames.shp excerptFromGeonames.csv

Count the lines in the input csv file and the output shapefile.

Versions and provenance

System: OpenSUSE 15.6
GDAL 3.10.0
GDAL latest binary from repo https://download.opensuse.org/repositories/Application:/Geo/15.6/

Additional context

No response

The text was updated successfully, but these errors were encountered:

jratike80 · 2025-01-15T21:47:00Z

It is common in CSV to use double quotes around strings, and they are needed when the string contains special characters, like the one that is also used as a separator. A newline character can also be a part of a string.

I read https://gdal.org/en/stable/drivers/vector/csv.html#format and edited your sample like this:

1563245	Île de l” Union	"Ile de l"" Union"...
1567937	Shui-k”ou Kuan	"Shui-k""ou Kuan"

So:

add double quotes on both sides of the strings in the "asciiname" field (hopefully single " does not appear is other fields)
duplicate the double quote character in the asciiname string

It is sure a bit tricky to write a sed command or something that pre-processes the data so that GDAL can read it right, but I fear that it is what you need to do. CSV looks simple but it may be the hardest to handle "format" that exists.

Fixes OSGeo#11660

rouault · 2025-01-15T22:22:30Z

Fix in #11665 . The CSV driver had some logic to handle such situation, but it was partly broken , as it was only working if there was an even number of double-quotes per line

jratike80 · 2025-01-15T22:28:47Z

I will rename my home village into "OzulanBari in OpenStreetMap and destroy your fix.

Fixes OSGeo#11660

klausbraun · 2025-01-16T15:03:15Z

Personally I'd really prefer to leave the geonames data or csv data in genereal untouched unless I know that there is a mistake which should be corrected. Therefore I'm not convinced of the idea of duplicating the double quote character in the asciiname string. Furthermore, to do this by hand in a csv file with 13005646 entries would really be a hard job, and doing it automatically would potentially lead to wrong results.

Especially in the case of Ile de l"Union and according to my understanding of the French language - provided it is french - I guess that the double quote in this case is a typo and should be an apostrophe (U+0027). Using two double quotes then would even be more wrong.

At least it would be helpful to create a warning when the number of lines in the csv file or any other input file differs from the number of records in the output file. Otherwise if people do not check and especially in case of large files records can get lost unnoticed.

Thank's for closing this issue.

jratike80 · 2025-01-16T21:15:50Z

This change would not add another " into the result, the extra " used as an escape would be removed by ogr2ogr.
1563245 Île de l” Union "Ile de l"" Union"...

But the change that Even did will remove the need to edit the geonames data. It is still good to remember that it may be necessary to edit the CSV data before GDAL can interpret it right. You cannot imagine what all sorts of CSV files people have.

I was perhaps wrong when I was remembering that a CSV data may have newlines within the fields and counting the number of lines would not show the number of entries. Newlines may appear in the header, though #9172.

Here is an example from Excel. Table looks like this

and this is the CSV, that GDAL surpisingly seems to parse correctly:

test    "linebreaks
in
a
cell"
value1  "value 2
with
breaks"

Fixes #11660

rouault self-assigned this Jan 15, 2025

rouault added a commit to rouault/gdal that referenced this issue Jan 15, 2025

CSV: fix parsing files with double-quote inside a field value

2a995f5

Fixes OSGeo#11660

rouault mentioned this issue Jan 15, 2025

CSV: fix parsing files with double-quote inside a field value #11665

Merged

rouault added a commit to rouault/gdal that referenced this issue Jan 15, 2025

CSV: fix parsing files with double-quote inside a field value

dbf913c

Fixes OSGeo#11660

rouault added a commit to rouault/gdal that referenced this issue Jan 16, 2025

CSV: fix parsing files with double-quote inside a field value

bc1afe8

Fixes OSGeo#11660

rouault closed this as completed in #11665 Jan 16, 2025

rouault added a commit that referenced this issue Jan 16, 2025

CSV: fix parsing files with double-quote inside a field value

0d73e62

Fixes #11660

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ogr2ogr - handling of isolated double quote characters in lines of a csv file #11660

ogr2ogr - handling of isolated double quote characters in lines of a csv file #11660

klausbraun commented Jan 15, 2025

jratike80 commented Jan 15, 2025

rouault commented Jan 15, 2025

jratike80 commented Jan 15, 2025

klausbraun commented Jan 16, 2025

jratike80 commented Jan 16, 2025 •

edited

Loading

ogr2ogr - handling of isolated double quote characters in lines of a csv file #11660

ogr2ogr - handling of isolated double quote characters in lines of a csv file #11660

Comments

klausbraun commented Jan 15, 2025

What is the bug?

Steps to reproduce the issue

Versions and provenance

Additional context

jratike80 commented Jan 15, 2025

rouault commented Jan 15, 2025

jratike80 commented Jan 15, 2025

klausbraun commented Jan 16, 2025

jratike80 commented Jan 16, 2025 • edited Loading

jratike80 commented Jan 16, 2025 •

edited

Loading