-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ogr2ogr - handling of isolated double quote characters in lines of a csv file #11660
Comments
It is common in CSV to use double quotes around strings, and they are needed when the string contains special characters, like the one that is also used as a separator. A newline character can also be a part of a string. I read https://gdal.org/en/stable/drivers/vector/csv.html#format and edited your sample like this:
So:
It is sure a bit tricky to write a sed command or something that pre-processes the data so that GDAL can read it right, but I fear that it is what you need to do. CSV looks simple but it may be the hardest to handle "format" that exists. |
Fix in #11665 . The CSV driver had some logic to handle such situation, but it was partly broken , as it was only working if there was an even number of double-quotes per line |
I will rename my home village into |
Personally I'd really prefer to leave the geonames data or csv data in genereal untouched unless I know that there is a mistake which should be corrected. Therefore I'm not convinced of the idea of duplicating the double quote character in the asciiname string. Furthermore, to do this by hand in a csv file with 13005646 entries would really be a hard job, and doing it automatically would potentially lead to wrong results. Especially in the case of Ile de l"Union and according to my understanding of the French language - provided it is french - I guess that the double quote in this case is a typo and should be an apostrophe (U+0027). Using two double quotes then would even be more wrong. At least it would be helpful to create a warning when the number of lines in the csv file or any other input file differs from the number of records in the output file. Otherwise if people do not check and especially in case of large files records can get lost unnoticed. Thank's for closing this issue. |
This change would not add another But the change that Even did will remove the need to edit the geonames data. It is still good to remember that it may be necessary to edit the CSV data before GDAL can interpret it right. You cannot imagine what all sorts of CSV files people have. I was perhaps wrong when I was remembering that a CSV data may have newlines within the fields and counting the number of lines would not show the number of entries. Newlines may appear in the header, though #9172. Here is an example from Excel. Table looks like this and this is the CSV, that GDAL surpisingly seems to parse correctly:
|
What is the bug?
I'm in a situation where I try to import the geonames data set from https://download.geonames.org/export/dump/allCountries.zip
https://download.geonames.org/export/dump/allCountries.zip
This file contains information about several points on the earth including place names, elevation, population, coordinates and so on. The format is csv with tabs as delimiters and without header. The number of rows respectivly places is 13005646.
When I add an appropriate header to the file and import it to PostGIS using ogr2ogr I only see 36281745 entries in my table.
I did some investigations and saw that ogr2ogr shows a - at least for me - strange behavior when lines with only one double quote character occur like in the fourth line in the attached example (Ile de l" Union).
In this case ogr2ogr obviously looks for an ending double quote character. If this ending double quote character doesn't occur in the same line, all lines between the line with the starting double quote character and the ending double quote character and the line containing the ending double quote character itself will silently be ignored. This also occurs when I use other formats than PostGIS. In the example attached the command:
ogr2ogr -f "ESRI Shapefile" -oo X_POSSIBLE_NAMES=longitude -oo Y_POSSIBLE_NAMES=latitude -a_srs EPSG:4326 -lco ENCODING=UTF-8 -skipfailures Bugreportogr2ogrExample.shp Bugreportogr2ogrExample.csv
creates a shapefile with five entries where the records with the geonameids 1563246, 1567934, 1567935, 1567936 and 1567937 are missing.
For my understandig I would expect that every line in a csv file corresponds to one record in the translated file. I understand that it might be difficult to decide what the double quote characters are standing for and that there might be situations where they are really used to enquote something. But in the case of the geonames file with tab delimiters all between the tabs belongs to one attribute and enquoting is not neccessary.
At least it might be helpful to have a command line option to define the character used for enquoting if needed and - if possible - to give a warning when the number of lines in the csv file or any other input file with vector data differs from the number of records in the output file.
Best
Klaus
Bugreportogr2ogrExample .csv
Steps to reproduce the issue
Download attached file.
Run
ogr2ogr -f "ESRI Shapefile" -oo X_POSSIBLE_NAMES=longitude -oo Y_POSSIBLE_NAMES=latitude -a_srs EPSG:4326 -lco ENCODING=UTF-8 -skipfailures excerptFromGeonames.shp excerptFromGeonames.csv
Versions and provenance
System: OpenSUSE 15.6
GDAL 3.10.0
GDAL latest binary from repo https://download.opensuse.org/repositories/Application:/Geo/15.6/
Additional context
No response
The text was updated successfully, but these errors were encountered: