I am reading this article and would like to know the correct answer to this question.
The only thing that comes to mind is that in some countries the decimal separator is a comma, so there may be problems when sharing data in CSV, but I am not really sure of my answer.
project-management
David Gasquez
Answers:
CSV format specification is defined in RFC 4180. This specification was published because
Unfortunately, nothing has changed since 2005, when the RFC was published: we still have a wide variety of implementations. The general approach defined in RFC 4180 is to enclose fields containing characters such as commas in quotation marks. However, this recommendation is not always followed by different software.
The problem is that in various European locales the comma serves as the decimal point, so you write 0,005 instead of 0.005. In other cases, commas are used instead of spaces to mark digit groups, e.g. 4,000,000.00 (see here). In both cases, using commas can lead to errors when reading data from CSV files, because your software does not really know whether 0,005, 0,1 are two numbers or four different numbers (see example here).
Last but not least, if you store text in your data file, commas are much more common in text than, for example, semicolons, so if your text is not enclosed in quotation marks, such data can also easily be read with errors.
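To make the ambiguity concrete, here is a minimal sketch using Python's standard csv module. The helper name parse is made up for this illustration, and the sample values are just the ones from the paragraph above.

```python
import csv
import io


def parse(text, delimiter=","):
    """Parse a single line of delimited text (hypothetical helper)."""
    return next(csv.reader(io.StringIO(text), delimiter=delimiter))


# Unquoted decimal commas: the reader cannot tell decimals from separators.
unquoted = parse("0,005,0,1")

# Quoted fields, as RFC 4180 recommends, keep each number in one piece.
quoted = parse('"0,005","0,1"')

# A semicolon separator avoids the clash without any quoting.
semicolon = parse("0,005;0,1", delimiter=";")

print(len(unquoted), len(quoted), len(semicolon))
```

The unquoted line comes back as four fields, while the quoted and semicolon-separated variants each come back as two.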
Nothing makes commas better or worse field separators, as long as CSV files are used in accordance with recommendations such as RFC 4180, which guard against the problems described above. However, if there is a risk that a simplified CSV format that does not enclose fields in quotation marks will be used, or that the recommendation will be applied inconsistently, then other separators (e.g. the semicolon) seem to be a safer approach.
"," instead of a rarer separator bloats the data because you have to escape it all the time is true though. And obviously there are all those people who think they know how CSV works but really don't.
Technically, the comma is as good as any other character for use as a separator. The name of the format directly says that the values are comma-separated (Comma-Separated Values).
The description of the CSV format uses the comma as the separator.
Any field containing a comma should be double-quoted, so that does not cause a problem when reading data in. See point 6 of the description:
For example, the functions read.csv and write.csv from R use the comma as the separator by default.
values that are comma separated. Others allude to European formatting of numbers; this is not an issue for the CSV standard, as you correctly cite point 6 above. Divergences from "correct use" exist with any data format. The point is: know your data. Others mention tab or ; delimited files, however these can have the same issues as commas when you're dealing with data that is user-entered (perhaps via a form and captured by a database - I've had to wrangle with free-text entry fields that people have fat-fingered a tab into... it sucks).
In addition to being a digit separator in numbers, the comma also forms part of addresses (such as customer addresses) in many countries. While some countries have short, well-defined addresses, many others have long, winding addresses that sometimes include two commas on the same line. Good CSV files enclose all such data in double quotes, but over-simplistic, poorly written parsers do not handle reading and distinguishing such fields. (Then there is the problem of double quotes appearing as part of the data, such as a quotation from a poem.)
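As a rough illustration of the address and quotation cases, the sketch below writes one made-up record with Python's csv module and reads it back. The record contents are invented for the example; the quoting behaviour shown is that of the module's default dialect, which quotes only the fields that need it and doubles embedded double quotes.

```python
import csv
import io

# Hypothetical record: an address with commas and a quotation with double quotes.
row = ["Jane Doe", "Flat 4, 12 High Street, Springfield", 'She said "nevermore"']

buffer = io.StringIO()
csv.writer(buffer).writerow(row)  # default dialect quotes only where needed
encoded = buffer.getvalue()
print(encoded)

# A proper CSV reader recovers the original three fields.
decoded = next(csv.reader(io.StringIO(encoded)))

# An over-simplistic parser that just splits on commas does not.
naive = encoded.strip().split(",")
print(len(decoded), len(naive))
```

The naive split breaks the address apart and leaves stray quote characters in the data, which is exactly the failure described above.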
While @Tim's answer is correct, I would like to add that "csv" as a whole has no common standard. In particular, the escaping rules are not defined at all, leading to "formats" which are readable in one program but not in another. This is exacerbated by the fact that every "programmer" under the sun just thinks "oooh csv - I will build my own parser!" and then misses all of the edge cases.
Moreover, csv totally lacks the ability to store metadata or even the data type of a column, so you have to read several separate documents to understand the data.
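A small sketch of what the missing type information means in practice, using Python's csv module. The column names and the schema dictionary are assumptions made up for this example; the point is only that the schema has to come from somewhere outside the file.

```python
import csv
import io

# Hypothetical file contents; nothing in the file says which column is numeric.
data = "id,price,date\n1,0.005,2005-10-01\n"

rows = list(csv.DictReader(io.StringIO(data)))
first = rows[0]
print(first)  # every value is a plain string

# The column types have to be supplied separately (assumed schema).
schema = {"id": int, "price": float, "date": str}
typed = {name: schema[name](value) for name, value in first.items()}
print(typed)
```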
If you can ditch the comma delimiter and use a tab character you will have much better success. You can leave the file named .CSV and importing into most programs is usually not a problem. Just specify TAB delimited rather than comma when you import your file. If there are commas in your data you WILL have a problem when specifying comma delimited as you are well aware.
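For what it's worth, here is a minimal sketch of reading tab-delimited text with Python's csv module; the sample rows are made up. Commas in the data need no special handling here, although a tab inside a field would bring back the same quoting problem.

```python
import csv
import io

# Hypothetical tab-delimited content with commas inside a field.
tsv_text = "title\tauthor\nCats, Dogs, and Birds\tA. Writer\n"

# Specify the tab as the delimiter instead of the default comma.
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

for row in rows:
    print(row)
```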
| as a delimiter in home-brewed CSV-like text files of records (with book titles and other document metadata). | never occurs in the data I work with, so I can just write Perl scripts that simply split/join without checking for quoting of any kind. This was for a one-off project that just involves processing metadata saved from an MS Access database. For any larger project, or if you plan to keep data in this file format long-term, pick something more robust! I could always tweak something if this month's batch broke something.
split command for Stata I looked at, among other things, the Perl equivalent to see what it did and didn't do. Not the source code, just the functionality offered.
cut, sort, and uniq.
ASCII provides us with four "separator" characters, listed in the ascii(7) *nix man page: FS (file separator, 0x1C), GS (group separator, 0x1D), RS (record separator, 0x1E), and US (unit separator, 0x1F).
This answer provides a decent overview of their intended usage.
Of course, these control codes lack the human-friendliness (readability and input) of more popular delimiters, but are acceptable choices for internal and/or ephemeral exchange of data between programs.
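A minimal sketch of the idea in Python, using the unit separator (0x1F) between fields and the record separator (0x1E) between records. The function names encode and decode are made up for this example, and it assumes the data itself never contains these two control characters.

```python
UNIT_SEP = "\x1f"    # US: separates fields within a record
RECORD_SEP = "\x1e"  # RS: separates records


def encode(records):
    """Join fields and records with the ASCII separators (hypothetical helper)."""
    return RECORD_SEP.join(UNIT_SEP.join(fields) for fields in records)


def decode(text):
    """Split text produced by encode() back into records and fields."""
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


# Commas, quotes, newlines, semicolons and pipes need no escaping at all.
records = [["Doe, Jane", 'said "hi"'], ["multi\nline", "a;b|c"]]
round_trip = decode(encode(records))
print(round_trip == records)
```

No quoting layer is needed as long as that assumption holds, which is why this suits program-to-program exchange better than hand-edited files.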
The problem is not the comma; the problem is quoting. Regardless of which record and field delimiters you use, you need to be prepared for meeting them in the content. So you need a quoting mechanism. AND THEN you need a way for the quoting character(s) to appear too.
Following the RFC 4180 standard makes everything simpler for everybody.
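To illustrate that the delimiter choice does not remove the need for quoting, here is a small sketch with a pipe delimiter and Python's csv module. The sample row is invented for the example.

```python
import csv
import io

# Hypothetical row whose first field happens to contain the delimiter.
row = ["Either|Or", "plain"]

# Joining and splitting without quoting silently produces an extra field.
naive_fields = "|".join(row).split("|")

# The csv module applies the same quoting rules to any delimiter.
buffer = io.StringIO()
csv.writer(buffer, delimiter="|").writerow(row)
encoded = buffer.getvalue()
decoded = next(csv.reader(io.StringIO(encoded), delimiter="|"))

print(naive_fields)
print(decoded)
```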
I have personally had to write a script to probably fix the output from a program that got this wrong, so I am a bit militant about it. "probably fix" means that it worked for MY data, but I can see situations where it would fail. (In that program's defense, it was written before the standard.)