CSV CSV

View the latest, best formatted and most complete version of this manual online at http://mindprod.com/application/csv.manual.html.

Introduction

This package lets you read write, transform, align, change case, dedup, dump, entify, deentify, pack, patch, reshape, sort, template … comma, tab and semicolon- separated variable files, commonly known as CSV (Comma-Separated Value) files.

The csv package consists of a 20 Java classes/utilities CSVReader, CSVWriter, CSVAlign, CSVChangeCase, CSVCommaToTab CSVCondense, CSVDedup, CSVDeEntify, CSVDump, CSVEntify, CSVISBN, CSVPack, CSVPatch, CSVReshape, CSVSort, CSVTemplate, CSVTabToComma, CSVToSRS, CSVToTable, CSVTuple, LinesToCSV and TableToCSV. It also handles tab-separated and semicolon-separated files. This is the format use by Microsoft Word and other Microsoft products. This version now supports # embedded comments.

Complete Java source is provided.

You can use them either as standalone utilities or as classes you can embed in your own Java programs.

The CSV utility suite are Java command line utilities, not GUIs (Graphic User Interfaces) with menus. You use them at the command prompt. Clicking them in menus will not work. Just typing their names on the command line will not work. You must use them as shown in the examples below. They will not work unless you have installed a Java JRE (Java Runtime Environment).

Note that CSV files are perhaps 10 times slower to process than binary files. CSV files are for data interchange with other languages or when human-readability or editability is important. If you want speed, use binary format files, e.g. DataInputStream or possibly the convenient but slower ObjectInputStream.

The CSV utilities work on text files. You need a text editor to create and view them, not a word processor. e.g. notepad, Visual Slick Edit or other suitable text editor. You must use a monspaced Fonts (aka fixed pitch, aka programmer font) to view your files, or they won’t look aligned.

I found that by converting data to CSV format, then I can often no longer need to write any programs to massage my data. I just use combinations of the CSV utility suite. In particular, I used it in hundreds of ways to extract the data from myriad web pages, Excel spreadsheets, pdfs etc to collect the information about state, county and city sales taxes used by the AmericanTax program.

Format of CSV files

You can configure the separator characters, but the usual format of a CSV file looks like this:
## favourite, alts, quotation
apple, "pear, peach", "he said ""ouch"", then stopped" # sample comment

, separates fields. Fields may contain spaces or 's.
"’s enclose fields that contain ,s or "s.
"s are doubled that occur incidentally inside "s.
# starts a comment on the end of a line.
## is a special comment that provides names for the various fields separated by commas, no quotes allowed.

I found that when composing CSV files by hand, it is best to put in the quotes even when they are not needed. This has two benefits: If I accidentally leave out a comma, the parser will detect the error. If I fail to notice a comma embedded in a field, it will still work.

Summary of Components

Click on the name to get more information about each component. The package consists of:

Summary of CSV Utilities and Classes
Name Use as Class? Use as Utility? Purpose
CSV displays statistics about a CSV file, length in bytes, length in lines, number of comment lines, shortest line, longest line.
CSVReader reads a CSV file. Only of use to Java programmers.
CSVWriter writes a CSV file. Only of use to Java programmers.
CSVAlign aligns a CSV file in columns.
CSVChangeCase converts selected columns in a CSV file to lower, upper or book title case.
CSVCommaToTab converts a comma-separated file to a tab-separated file.
CSVCondense removes blank lines and condenses multiple spaces in fields to one space. Unlike CSVPack it removes entire blank lines.
CSVDeDup removes “duplicate’ records in a CSV file.
CSVDedDupField removes duplicate (repeated) letters within a field of a CSV file. It does not deDup entire records. It is designed to tidy fields that are composed of option letters such as the HTMLmacros embellishment.csv file.
CSVDeEntify convert HTML (Hypertext Markup Language) entities back to UTF-8 characters.
CSVDump dumps out a CSV file on the console.
CSVEntify convert awkward characters to HTML entities.
CSVISBN convert ISBN10 to ISBN13 or ISBN13 to ISBN10 in selected columns.
CSVPack removes all excess spaces and quotes from a file. Unlike CSVCondense it does not remove blank lines.
CSVPatch patches/updates columns is a CSV file using a CSV file of from:to pairs.
CSVReshape add, remove, rearrange the order of fields in a CSV file.
CSVSort sort a CSV file alphabetically, case/case insensitively, or numerically on multiple columns.
CSVSortField sorts duplicate (repeated) letters within a field of a CSV file. It does not sort entire records. It is designed to tidy fields that are composed of option letters such as the HTMLmacros embellishment.csv file.
CSVTabToComma converts a tab-separated file to a comma-separated file.
CSVToSRS converts a CSV file to a Funduc Search and Replace script.
CSVTemplate Expands boilerplate with variable CSV data.
CSVToTable Converts a CSV file to an HTML table with awkward characters entified.
CSVTuple Breaks lines into fixed length tuples.
LinesToCSV converts a text file of lines into a CSV file blocking multiple lines into one CSV record.
TableToCSV Converts an HTML table to a UTF-8 CSV file with entities converted to Unicode and tags stripped.

CSV

CSV displays statistics about a CSV file such as length in bytes and lines, count of comment line, shortest and longest line.

java.exe com.mindprod.csv.CSV somefile.csv

alternatively:

java.exe -jar csv.jar somefile.csv

or even:

csv.jar somefile.csv

and if you have Jet:

csv.exe somefile.csv

For this last technique to work, you must have the *.jar extension associated to invoke java.exe.

CSVReader

CSVReader reads a CSV file. CSVReader is of interest only to Java programmers. You don’t have to be a programmer to use the utilities in the csv package. Here is how you typically use CSVReader:

For more information, read the enclosed Javadoc or look at the documentation in the CSVReader and CSVWriter source on the constructors. Also look at the Java source and Javadoc for the various utilities that use CSVReader and CSVWriter.

Tips

CSVWriter

CSVWriter writes a CSV file. CSVWriter is of interest only to Java programmers. You don’t have to be a programmer to use the utilities in the csv package. Here is how some code that shows off some of the things you can do with a CSVWriter:

CSVAlign

CSVAlign aligns the fields in a CSV file in columns for easier proofreading. CSVAlign is done purely in RAM (Random Access Memory), so is not suitable for giant files. It does its job very quickly without any console output. It left aligns alpha data and right aligns numeric data. It does not align on decimal points.

java.exe com.mindprod.csv.CSVAlign somefile.csv

alternatively:

java.exe -jar csvalign.jar somefile.csv

or even:

csvalign.jar somefile.csv

and if you have Jet:

csvalign.exe somefile.csv

For this last technique to work, you must have the *.jar extension associated to invoke java.exe.

The output overwrites the input file, so if you have any doubts about your parameters, make a backup copy first.

There is a companion program called Align for aligning text files that are not CSV files. It breaks columns on commas or spaces, (except quoted fields) and discards the commas in the output.

CSVChangeCase

Change the case of selected fields in a CSV file. List the name of the file to be modified on the command line followed by the 0-based columns to change. After each column, insert either the letter l for lower case, u for upper case or t for Book Title Case. Don’t use CSVChangeCase on columns containing entities. That will just change the case of the entity, not the case of the character the entity represents. CSVDeEntify them first, then CSVChangeCase, then CSVEntify them again.

or alternatively 

CSVCommaToTab

CSVCommaToTab Converts a comma-separated file to a tab-separated file.

Here is how you use it:

java.exe com.mindprod.csv.CSVCommaToTab somefile.csv

alternatively:

java.exe -jar csvcommatotab.jar somefile.csv

The output overwrites the input file, so if you have any doubts it will work, make a backup copy first.

CSVDedup

Removes “duplicate’ records in a CSV file. Currently there is no way just to be notified of dups without eliminating them. To get that effect make a backup copy and compare it with the deduped version. In the simplest case you use it like this:

java.exe com.mindprod.csv.CSVDedup somefile.csv

rem ignore 0-based cols 3 and 4 when comparing for duplicates.
java.exe com.mindproc.csv.CSVDedup somefile.csv -keepfirst  3 4

alternatively:

java.exe -jar csvdedup.jar somefile.csv -keeplast

If adjacent records in the file are identical, all but the first will be eliminated from the output file. If CSV file contains identical records, but they are not adjacent, they will be left intact. To eliminate them, you must sort the file first with CSVSort to arrange that duplicates are adjacent and the one you want to keep comes first. The comparison is case sensitive, i.e. CSVDedup will not eliminate records that differ only in case.

The output overwrites the input file, so if you have any doubts about your parameters, make a backup copy first.

It is also possible to eliminate records that are almost identical. You specify a list of 0-based columns to ignore when comparing adjacent records. If the other columns match, then the record is eliminated. You could use this for example to collapse a list of countries, cities, populations to a list of just countries, by sorting by country, then doing a CSVDedup asking it to ignore columns 1 and 2 so that records with the same country but different cities or populations would be considered duplicated. You could then use CSVReshape to chop off the remaining city and population columns.

java.exe com.mindprod.csv.CSVDedup countrycitypop.csv -delete 1 2

alternatively:

java.exe -jar csvdeup.jar countrycitypop.csv 1 2

The output overwrites the input file, so if you have any doubts about your parameters, make a backup copy first.

CSVDeDup supports a DeDupStrategy parameter -keepfirst -keeplast and -delete:

CSVDeDupField

CSVDeDupField sorts a letters with in single field of a CSV file in ascending order using a case-sensitive sort, then it removed duplicate letters. It does not reorder the records. It both removes spaces and deDups. It probably not of wide interest. Here is what it does:

Before

After

CSVCondense

Like CSVPack, CSVCondense removes all excess spaces and quotes from a CSV file to make it as compact as possible, but in addition it removes blank lines and collapses multiple spaces inside fields down to one. Unlike CSVPack, it removes entire blank lines.

Here is how you use it:

java.exe com.mindprod.csv.CSVCondense somefile.csv

alternatively:

java.exe -jar csvcondense.jar somefile.csv

The output overwrites the input file, so if you have any doubts about your parameters, make a backup copy first.

CSVDeEntify

CSVDeEntify converts HTML entities in a UTF-8 encoded CSV file to UTF-8 characters. You select which 0-based columns you want to be converted. The first column is column 0.

Here is how you use it:

alternatively:

The output overwrites the input file, so if you have any doubts about your parameters, make a backup copy first.

CSVDump

CSVDump dumps out the contents of the CSV file to the console. You might just as easily dump it out with a word processor or text editor. This is primarily a debugging tool to check that all your punctuation is correctly placed, since it will complain if the CSV file is malformed.

Here is how you use it:

alternatively:

java.exe -jar csvdump.jar somefile.csv

It leaves the input file unchanged.

CSVEntify

CSVEntify converts awkward characters in a UTF-8 encoded CSV file to HTML entities. You select which 0-based columns you want to be converted. The first column is column 0.

Here is how you use it:

rem entify columns 0 4 and 5
java.exe com.mindprod.csv.CSVEntify somefile.csv 0 4 5

alternatively:

rem entify columns 0 4 and 5
java.exe -jar csventify.jar somefile.csv  0 4 5

The output overwrites the input file, so if you have any doubts about your parameters, make a backup copy first.

CSVISBN

CSVISBN converts ISBN10 to ISBN13 or ISBN13 to ISBN10 in selected columns. You select which 0-based columns you want to be converted. The first column is column 0.

Here is how you use it:

rem convert ISBNs ISBN10->ISBN13 and ISBN13->ISBN10  in columns 0 4 and 5
java.exe com.mindprod.csv.CSVISBN somefile.csv 0 4 5

alternatively:

rem convert ISBNs ISBN10->ISBN13 and ISBN13->ISBN10  in columns 0 4 and 5
java.exe -jar csvisbn.jar somefile.csv  0 4 5

The output overwrites the input file, so if you have any doubts about your parameters, make a backup copy first.

CSVPack

CSVPack removes all excess spaces and quotes from a CSV file to make it as compact as possible. Unlike CSVCondense, it does not remove blank lines.

Here is how you use it:

java.exe com.mindprod.csv.CSVPack somefile.csv

alternatively:

java.exe -jar csvpack.jar somefile.csv

The output overwrites the input file, so if you have any doubts about your parameters, make a backup copy first.

CSVPatch

CSVPatch patches/updates 0-based columns in a CSV file using a CSV file of from:to pairs. It is like a column-limited, case-sensitive, multiple source-target pair, search/replace that operates on a single CSV file.

Here is how you use it:

alternatively:

java.exe -jarl csvpatch somefiletopatch.csv somefileoffromtopairs.csv 0 3 2

You specify the columns you want to patch. The column numbers are 0-based (the first column in column 0 ). Typically this would be used to update a column of URLs (Uniform Resource Locators) with new values for some of the URLS, The from-to pairs of old and new values live in the second file. They are applied to each field in the column in the first file to bring those values up to date. There must be a perfect case-insensitive match of the complete field for any replacement to take place. If the fields are URLs, then both must be in & form or plain & form to match properly. Fields are stripped of lead and trailing spaces before comparing/replacing. This utility is useful in conjunction with the program BrokenLinks to automatically refresh URLs in CSV files that have recently been redirected. The patch file must have at least two complete columns. If there are duplicates in column 0, the first duplicate will be the one used.

Both columns and the files must use a consistent &/& convention for & in URLs if you are replacing URLs in HTML text. CSVPatch just blindly replaces perfect matches. Similarly sometimes various utilities, browsers or servers interconvert spaces in URLs to/from %20. CSVPatch does not account for this. It insists on an exact match. If CVSPatch fails, look in the file and adjust either the patch file or the file to be patched so the strings match.

CSVReshape

CSVReshape adds, removes and rearranges the order of fields in a CSV file.

Here is how you use it:

java.exe com.mindprod.csv.CSVReshape somefile.csv 0 3 2

alternatively:

java.exe -jar csvreshape.jar somefile.csv 0 3 2

You specify the columns you want retained in the order you want them to appear. The column numbers are 0-based (the first column in column 0). Any columns you do not mention will be dropped. You can duplicate columns. You can create new empty columns simply by using a column number larger than any column you used.

The output overwrites the input file, so if you have any doubts about your parameters, make a backup copy first.

CSVSort

CSVSort sorts a CSV file alphabetically, case/case insensitively or numerically on multiple columns. It works by reading all the data into memory, sorting it, then writing it out. The size of the file it can sort it thus limited by your virtual RAM capacity.

Use it like this:

java.exe com.mindprod.csv.CSVSort somefile.csv 0s+ 2i- 5n+ 3x+

alternatively:

java.exe -jar csvsort.jar somefile.csv 0s+ 2i- 5n+ 3x+

You give a list of the columns you want sort, 0-based.

s case sensitive (lower case are sorted after upper case)
i case insensitive (lower and upper case are intermixed as equivalent)
l sort by the length of the field rather than the contents. Longer fields sort later.
d double or float, usually with a decimal point, possibly with an exponent.
n numeric without a decimal point (long or int). Empty fields are treated as 0.
x hex (e.g. 08f6b or \ue7ff or 0x123456789abcdef up to 64-bits) without a decimal point.
f family (e.g. Charlie Brown) sorts by family/surname then given name.
+ ascending
- descending
99 0-based column number. First column in column 0.

Empty columns are treated as 0 or "". The output overwrites the input file, so if you have any doubts about your parameters, make a backup copy first. CSVSort preserves the position of # comments at the top and bottom of the file. # comments just prior to a data record are moved to stay just ahead of the data record in its new sorted position. Files need not be aligned prior to sorting, however, you will usually need to align them after sorting.

The sort is done purely in RAM, son is not suitable for giant files. The sort produces packed, not aligned files. If you want aligned, you will need to use CSVAlign on the sorted result.

CSVTabToComma

CSVTabToComma Converts a tab-separated file to a comma-separated file.

Here is how you use it:

java.exe com.mindprod.csv.CSVTabToComma somefile.csv

alternatively:

java.exe -jar csvtabtocomma.jar somefile.csv

The output overwrites the input file, so if you have any doubts it will work, make a backup copy first.

CSVSortField

CSVSortField sorts a letters with in single field of a CSV file in ascending order using a case-sensitive sort. It does not reorder the records. It removes spaces, but does not deDup. It probably not of wide interest. You could think of it as like putting the Scrabble tiles for word in alphabetical order. Here is what it does:

Before

After

CSVToSRS

Converts a CSV file to a Funduc Search and Replace script. It converts a CSV file of string pairs, source/target to produce a corresponding *.srs file that you can tweak to create Funduc Search Replace Script to create a script to do bulk search/replaces over many files. It is of no use if you don’t own Funduc Search/Replace, other than as an example for your own utilities to export CSV data to other formats.

java.exe com.mindprod.csv.CSVToSRS somepairs.csv

alternatively:

java.exe -jar csvtosrs.jar somepairs.csv

In this example, the output will be somepairs.srs in the same directory as somepairs.csv. You don’t explicitly specify the name of the output file. The input file is left unchanged.

CSVTemplate

Expands boilerplate with variable CSV data. The output appears in a file with the same name as the input file and the same extension as the template file. Be careful. If you called files script.csv and script.txt the output would go into script.txt, overwriting your template. The template file is just a text file with strings of the form %0 %1… embedded in it. Each line expands the template replacing the %x string with the value of the corresponding column. The template may span several lines. The template is expanded once for each line in the CSV file, with all the expansions glued together in the output file.

or alternatively 

Here is a sample template that displays three ISBN (International Standard Book Number) numbers as links to amazon.com.

CSVToTable

Converts a CSV file to an HTML table with awkward characters entified. It takes a *.csv file and produces a correspondingly named *.html file.

or alternatively 
java.exe -jar csvtotable.jar somefile.csv  alignedmiddle date  number ""
You can optionally provide css classes for the tr and each td column.

CSVTuple

Converts a CSV file into fixed length tuples. If your file looked like this, and you asked for tuples of length 4 you would see this transformation:

a,b,c,d,e,f
x,y

becomes:
a,b,c,d
a,b,c,e
a,b,c,f
x,y,,

java.exe com.mindprod.csv.CSVTuple somefile.csv 4
or alternatively 
java.exe -jar tuple.jar somefile.csv 4

You can specify how many fields you want included in each tuple. Short lines are padded with empty fields.

LinesToCSV

converts a text file of lines into a CSV file blocking multiple lines into one CSV record.

java.exe com.mindprod.csv.LinesToCSV somefile.txt 3

alternatively:

java.exe -jar linestocsv.jar somefile.txt 3

You specify the number of lines that should be grouped together as fields on a single CSV file line. The text file should contain an even multiple of this number of lines.

Note that the input file is not a CSV file, but an ordinary *.txt text file. The output is a CSV file with a matching name, e.g. somefile.csv in this example.

Let us say the input file somefile.txt looked like this: The first line is the name of a species, the second is the number of legs, and the third, the colour. This pattern of three lines repeats for each animal.

lion
4
brown
ostrich
2
black and white
zebra
4
striped black and white

The output somefile.csv would look like this:

lion,4,brown
ostrich,2,black and white
zebra,4,striped black and white

TableToCSV

Converts a HTML file containing tables to a UTF-8 CSV with entities converted to their equivalent Unicode and tags stripped. It takes a *.html file and produces a correspondingly named *.csv file.

java.exe com.mindprod.csv.TableToCSV somefile.html
or alternatively 
java.exe -jar tabletocsv.jar somefile.html

Futures

CSVAlign will align any ## comments.

Eventually I would like to figure out a way to control all the utilities using field names rather than field numbers. It takes two passes though the input, but does not require holding all the data in RAM, thus it can handle arbitrarily large files.

Acquiring CSV

PackageVersionReleasedLicenceLanguageNotes 
csv
CSVReader/Writer
6.9 2013-02-19 free Java
more infoprecismanualscreenshotbrowse source repository
for the current version of CSVReader/Writer.
Classes to read, write, align, condense, sort, convert, export, pack… CSV (comma-separated value) files.
download 2.2MB zip for CSVReader/Writer Java source, compiled class files, jar and documentation to run on your own machine as an application.

Runs on any OS that supports Java e.g. W2K/XP/W2003/Vista/W2008/W7-32/W7-64/W8-32/W8-64/W2012/Linux/LinuxARM/LinuxX86/LinuxX64/Ubuntu/Solaris/SolarisSPARC/SolarisSPARC64/SolarisX86/SolarisX64/OSX.

First install the most recent Java.

To install, extract the zip download with WinZip, (or similar unzip utility) into any directory you please, often J:\ — ticking off the use folder names option.

To check out the corresponding source from the Subversion repository, use the TortoiseSVN repo-browser to
access csv source in repository with [Tortoise] Subversion client on wush.net/svn/mindprod/com/mindprod/csv/.

After you have installed the jar, you can run it as an application. Type:

java -jar J:\com\mindprod\csv\csv.jar parms

adjusting as necessary to account for where the jar file is.

download ASP PAD XML program description for the current version of CSVReader/Writer.

$1989.00 US donated so far. If the CMP utilities solved your problem, please donate a buck or two, or donate to one of the charities featured in the footer public service ads throughout the website and get a tax receipt.

CSVReader/Writer is free. Full source included. You may even include the source code, modified or unmodified in free/commercial open source/proprietary programs that you write and distribute. Non-military use only.
 
 
download CSV
HTML Broken link fixer student project
monospaced Fonts
text editor

available on the web at:

http://mindprod.com/application/csv.manual.html
ClustrMaps is down

optional Replicator mirror
of mindprod.com
on local hard disk J:

J:\mindprod\application\csv.manual.html
logo
Please the feedback from other visitors, or your own feedback about the site.
Contact Roedy.
Blog
IP:[65.110.21.43]
Your face IP:[67.202.56.112]
You are visitor number 11.