Read Csv File and Store in Datatable Python
CSV (comma-separated value) files are a common file format for transferring and storing data. The ability to read, manipulate, and write data to and from CSV files using Python is a key skill to master for any data scientist or business analyst. In this post, we'll go over what CSV files are, how to read CSV files into Pandas DataFrames, and how to write DataFrames back to CSV files post analysis.
Pandas is the most popular data manipulation package in Python, and DataFrames are the Pandas data type for storing tabular 2D data.
- Load CSV files to Python Pandas
- 1. File Extensions and File Types
- 2. Data Representation in CSV files
- Other Delimiters / Separators – TSV files
- Delimiters in Text Fields – Quotechar
- 3. Python – Paths, Folders, Files
- Finding your Python Path
- File Loading: Absolute and Relative Paths
- 4. Pandas CSV File Loading Errors
- Advanced Read CSV Files
- Specifying Data Types
- Skipping and Picking Rows and Columns From File
- Custom Missing Value Symbols
- CSV Format Advantages and Disadvantages
- Additional Reading
Load CSV files to Python Pandas
The basic process of loading data from a CSV file into a Pandas DataFrame (with all going well) is achieved using the "read_csv" function in Pandas:
    # Load the Pandas libraries with alias 'pd'
    import pandas as pd

    # Read data from file 'filename.csv'
    # (in the same directory that your python process is based)
    # Control delimiters, rows, column names with read_csv (see later)
    data = pd.read_csv("filename.csv")

    # Preview the first 5 lines of the loaded data
    data.head()
While this code seems simple, an understanding of a few fundamental concepts is required to fully grasp and debug the operation of the data loading procedure if you run into issues:
- Understanding file extensions and file types – what do the letters CSV actually mean? What's the difference between a .csv file and a .txt file?
- Understanding how data is represented inside CSV files – if you open a CSV file, what does the data actually look like?
- Understanding the Python path and how to reference a file – what is the absolute and relative path to the file you are loading? What directory are you working in?
- CSV data formats and errors – common errors with the function.
Each of these topics is discussed below, and we finish this tutorial by looking at some more advanced CSV loading mechanisms and giving some broad advantages and disadvantages of the CSV format.
1. File Extensions and File Types
The first step to working with comma-separated-value (CSV) files is understanding the concept of file types and file extensions.
- Data is stored on your computer in individual "files", or containers, each with a different name.
- Each file contains data of different types – the internals of a Word document are quite different from the internals of an image.
- Computers determine how to read files using the "file extension", that is the code that follows the dot (".") in the filename.
- So, a filename is typically in the form "<random name>.<file extension>". Examples:
  - project1.DOCX – a Microsoft Word file called Project1.
  - shanes_file.TXT – a simple text file called shanes_file.
  - IMG_5673.JPG – an image file called IMG_5673.
- Other well known file types and extensions include: XLSX – Excel, PDF – Portable Document Format, PNG – images, ZIP – compressed file format, GIF – animation, MPEG – video, MP3 – music etc. See a complete list of extensions here.
- A CSV file is a file with a ".csv" file extension, e.g. "data.csv", "super_information.csv". The "CSV" in this case lets the computer know that the data contained in the file is in "comma separated value" format, which we'll discuss below.
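As a small illustration of the name/extension split described in the list, the sketch below uses Python's standard pathlib module. The filenames are the examples from the list; no files are opened, and the variable names are only for illustration.

```python
from pathlib import Path

# Example filenames taken from the list above; nothing is read from disk.
filenames = ["project1.DOCX", "shanes_file.TXT", "IMG_5673.JPG", "data.csv"]

# Path.stem is the part before the final dot, Path.suffix is the extension.
# The extension is lower-cased so that ".CSV" and ".csv" compare equal.
parts = {name: (Path(name).stem, Path(name).suffix.lower()) for name in filenames}

for name, (stem, extension) in parts.items():
    print(f"{name}: name={stem!r}, extension={extension!r}")
```

Checking the suffix like this is a cheap sanity test before handing a file to a loader, although the extension is only a label and says nothing certain about the contents.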
File extensions are hidden by default on a lot of operating systems. The first step that any self-respecting engineer, software engineer, or data scientist will do on a new computer is to ensure that file extensions are shown in their Explorer (Windows) or Finder (Mac) windows.
To check if file extensions are showing in your system, create a new text document with Notepad (Windows) or TextEdit (Mac) and save it to a folder of your choice. If you can't see the ".txt" extension in your folder when you view it, you will have to change your settings.
- In Microsoft Windows: Open Control Panel > Appearance and Personalization. Now, click on Folder Options or File Explorer Options, as it is now called > View tab. In this tab, under Advanced Settings, you will see the option Hide extensions for known file types. Uncheck this option and click on Apply and OK.
- In Mac OS: Open Finder > in the menu, click Finder > Preferences, click Advanced, and select the checkbox for "Show all filename extensions".
2. Data Representation in CSV files
A "CSV" file, that is, a file with a "csv" filetype, is a basic text file. Any text editor, such as Notepad on Windows or TextEdit on Mac, can open a CSV file and show the contents. Sublime Text is a wonderful and multi-functional text editor option for any platform.
CSV is a standard for storing tabular data in text format, where commas are used to separate the different columns, and newlines (carriage return / press enter) are used to separate rows. Typically, the first row in a CSV file contains the names of the columns for the data.
An example table data set and the corresponding CSV-format data are shown in the diagram below.
Note that almost any tabular data can be stored in CSV format – the format is popular because of its simplicity and flexibility. You can create a text file in a text editor, save it with a .csv extension, and open that file in Excel or Google Sheets to see the table form.
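To make the text-to-table mapping concrete, here is a minimal sketch that parses a few lines of CSV text into a DataFrame. The column names and values are made up for illustration, and io.StringIO stands in for a file on disk so the snippet runs without any data file.

```python
import io

import pandas as pd

# Hypothetical CSV content: the first row holds the column names,
# each following line is one row, and commas separate the columns.
csv_text = "name,age,city\nAlice,34,Dublin\nBob,27,Cork\n"

# io.StringIO lets read_csv treat the string as if it were an open file.
table = pd.read_csv(io.StringIO(csv_text))

print(table)
```

Saving the same text in a file with a .csv extension and passing the filename to read_csv should give the same table.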
Other Delimiters / Separators – TSV files
The comma separation scheme is by far the most popular method of storing tabular data in text files.
However, the choice of the ',' comma character to delimit columns is arbitrary, and can be substituted where needed. Popular alternatives include tab ("\t") and semi-colon (";"). Tab-separated files are known as TSV (Tab-Separated Value) files.
When loading data with Pandas, the read_csv function is used for reading any delimited text file; the delimiter is changed using the sep parameter.
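A short sketch of the sep parameter follows, assuming the same small made-up table stored once with tabs and once with semicolons. Both strings are illustrative and are read through io.StringIO rather than from real files.

```python
import io

import pandas as pd

# The same hypothetical table, written with two different delimiters.
tsv_text = "name\tage\nAlice\t34\nBob\t27\n"
ssv_text = "name;age\nAlice;34\nBob;27\n"

# sep tells read_csv which character separates the columns.
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
ssv_df = pd.read_csv(io.StringIO(ssv_text), sep=";")

print(tsv_df)
print(ssv_df)
```

Both calls produce the same two-column DataFrame; only the delimiter passed in sep differs.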
Delimiters in Text Fields – Quotechar
One complication in creating CSV files is if you have commas, semicolons, or tabs actually in one of the text fields that you want to store. In this case, it's important to use a "quote character" in the CSV file to enclose these fields.
The quote character can be specified in Pandas.read_csv using the quotechar argument. By default (as with many systems), it's set as the standard quotation mark ("). Any commas (or other delimiters as demonstrated below) that occur between two quote characters will be ignored as column separators.
In the example shown, a semicolon-delimited file, with quotation marks as a quotechar is loaded into Pandas, and shown in Excel. The use of the quotechar allows the "NickName" column to contain semicolons without being split into more columns.
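The behaviour described above can be sketched in a few lines. The data is invented for illustration (only the "NickName" column name comes from the text), and the file is again simulated with io.StringIO.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data. The first NickName value contains
# a semicolon, so it is wrapped in quotation marks.
ssv_text = (
    "Name;NickName;Age\n"
    'Shane;"Shaney; the author";30\n'
    'Anna;"Annie";25\n'
)

# quotechar marks which character encloses fields that contain the delimiter.
quoted_df = pd.read_csv(io.StringIO(ssv_text), sep=";", quotechar='"')

print(quoted_df)
```

Without the quotes around the first nickname, that row would contain four fields instead of three and the parser would likely fail or misalign the columns.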
3. Python – Paths, Folders, Files
When you specify a filename to Pandas.read_csv, Python will look in your "current working directory". Your working directory is typically the directory that you started your Python process or Jupyter notebook from.
Finding your Python Path
Your Python path can be displayed using the built-in os module. The os module provides operating-system-dependent functionality to Python programs and scripts.
To find your current working directory, the function required is os.getcwd(). The os.listdir() function can be used to display all files in a directory, which is a good check to see if the CSV file you are loading is in the directory as expected.
    # Find out your current working directory
    import os
    print(os.getcwd())
    # Out: /Users/shane/Documents/blog

    # Display all of the files found in your current working directory
    print(os.listdir(os.getcwd()))
    # Out: ['test_delimted.ssv', 'CSV Blog.ipynb', 'test_data.csv']
In the example above, my current working directory is the '/Users/shane/Documents/blog' directory. Any files that are placed in this directory will be immediately available to the Python file open() function or the Pandas read_csv function.
Instead of moving the required data files to your working directory, you can also change your current working directory to the directory where the files reside using os.chdir().
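The following is a minimal sketch of changing the working directory and checking that a file is visible from it. It uses a temporary directory and an empty placeholder file named test_data.csv so that it can run anywhere; in practice you would pass the real folder that holds your data to os.chdir().

```python
import os
import tempfile

original_dir = os.getcwd()  # remember where we started

with tempfile.TemporaryDirectory() as data_dir:
    # Create an empty placeholder file standing in for a real data file.
    open(os.path.join(data_dir, "test_data.csv"), "w").close()

    try:
        # Move the working directory to the folder that holds the file.
        os.chdir(data_dir)
        files_here = os.listdir(os.getcwd())
        csv_found = "test_data.csv" in files_here
    finally:
        # Always switch back, otherwise later relative paths would break.
        os.chdir(original_dir)

print(files_here, csv_found)
```

Restoring the original directory in a finally block is a design choice, not a requirement: it keeps the rest of a script or notebook from being surprised by a changed working directory.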
File Loading: Absolute and Relative Paths
When specifying file names to the read_csv function, you can supply either absolute or relative file paths.
- A relative path is the path to the file if you start from your current working directory. In relative paths, typically the file will be in a subdirectory of the working directory and the path will not start with a drive specifier, e.g. (data/test_file.csv). The characters '..' are used to move to a parent directory in a relative path.
- An absolute path is the complete path from the base of your file system to the file that you want to load, e.g. c:/Documents/Shane/data/test_file.csv. Absolute paths will start with a drive specifier (c:/ or d:/ in Windows, or '/' in Mac or Linux).
It's recommended and preferred to use relative paths where possible in applications, because absolute paths are unlikely to work on different computers due to different directory structures.
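The difference between the two kinds of path can be sketched with the standard pathlib module. The path data/test_file.csv is the example from the list above; the file does not need to exist for this snippet, because the paths are only constructed and inspected, never opened.

```python
from pathlib import Path

# A relative path: interpreted starting from the current working directory.
relative_path = Path("data") / "test_file.csv"

# '..' moves up to the parent directory before descending again.
parent_relative = Path("..") / "data" / "test_file.csv"

# Joining the working directory and the relative path gives an absolute path.
absolute_path = Path.cwd() / relative_path

print(relative_path, relative_path.is_absolute())
print(parent_relative, parent_relative.is_absolute())
print(absolute_path, absolute_path.is_absolute())
```

A pathlib.Path object can be passed to pd.read_csv in place of a string, which avoids hand-writing slashes that differ between Windows and Mac or Linux.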
4. Pandas CSV File Loading Errors
The most common errors you'll get while loading data from CSV files into Pandas will be:
- FileNotFoundError: File b'filename.csv' does not exist
  A File Not Found error is typically an issue with path setup, current directory, or file name confusion (file extension can play a part here!)
- UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte
  A Unicode Decode Error is typically caused by not specifying the encoding of the file, and happens when you have a file with non-standard characters. For a quick fix, try opening the file in Sublime Text, and re-saving with encoding 'UTF-8'.
- pandas.parser.CParserError: Error tokenizing data.
  Parse Errors can be caused in unusual circumstances to do with your data format – try to add the parameter "engine='python'" to the read_csv function call; this changes the data reading function internally to a slower but more stable method.
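One way to handle the first two errors in code, rather than by re-saving the file, is sketched below. The helper load_csv_with_fallback and the list of encodings it tries are assumptions made for illustration, not part of Pandas; the demonstration file is a throwaway written in Latin-1 so that the default UTF-8 attempt fails.

```python
import os
import tempfile

import pandas as pd


def load_csv_with_fallback(path, encodings=("utf-8", "latin-1"), **read_csv_kwargs):
    """Hypothetical helper: try each encoding in turn until one decodes the file."""
    if not os.path.exists(path):
        # Report the working directory, since a wrong path is the usual cause.
        raise FileNotFoundError(
            f"{path!r} not found; current working directory is {os.getcwd()!r}"
        )
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding, **read_csv_kwargs)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    raise ValueError(f"none of the encodings {encodings} could decode {path!r}")


# Demonstration with a throwaway file saved as Latin-1 rather than UTF-8.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "latin1_example.csv")
    with open(sample_path, "wb") as handle:
        handle.write("name,city\nRenée,Orléans\n".encode("latin-1"))
    loaded = load_csv_with_fallback(sample_path)

print(loaded)
```

Extra keyword arguments are forwarded to read_csv, so engine="python" can be passed through the same helper when a tokenizing error appears. Note that Latin-1 can decode any byte sequence, so it never fails; treat it as a last resort and check the resulting text by eye.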
Advanced Read CSV Files
There are some additional flexible parameters in the Pandas read_csv() function that are useful to have in your arsenal of data science techniques:
Specifying Data Types
As mentioned before, CSV files do not contain any type information for data. Data types are inferred through examination of the top rows of the file, which can lead to errors. To manually specify the data types for different columns, the dtype parameter can be used with a dictionary of column names and data types to be applied, for example: dtype={"name": str, "age": np.int32}.
Note that for dates and date times, the format, columns, and other behaviour can be adjusted using the parse_dates, date_parser, dayfirst, and keep_date_col parameters.
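A minimal sketch of explicit typing follows, using the dtype dictionary from the text plus parse_dates for a date column. The sample rows and the birth_date column are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data with a text column, an integer column and a date column.
csv_text = "name,age,birth_date\nAlice,34,1990-02-15\nBob,27,1997-11-03\n"

typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"name": str, "age": np.int32},  # force these column types
    parse_dates=["birth_date"],            # convert this column to datetimes
)

print(typed_df.dtypes)
```

Printing dtypes after loading is a quick way to confirm that each column ended up with the type you intended.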
Skipping and Picking Rows and Columns From File
The nrows parameter specifies how many rows from the top of the CSV file to read, which is useful to take a sample of a large file without loading it completely. Similarly, the skiprows parameter allows you to specify rows to leave out, either at the start of the file (provide an int), or throughout the file (provide a list of row indices). Finally, the usecols parameter can be used to specify which columns in the data to load.
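The three parameters can be tried on a small in-memory table, as in the sketch below. The data is invented; note that the row indices given to skiprows count lines of the file from zero, so line 0 is the header row.

```python
import io

import pandas as pd

# Hypothetical five-row table; file line 0 is the header.
csv_text = "id,name,score\n1,a,10\n2,b,20\n3,c,30\n4,d,40\n5,e,50\n"

# Read only the first two data rows.
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Skip file lines 1 and 3 (the rows with id 1 and id 3); keep the header.
skip_some = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 3])

# Load only two of the three columns.
two_cols = pd.read_csv(io.StringIO(csv_text), usecols=["id", "score"])

print(first_two)
print(skip_some)
print(two_cols)
```

Be aware that an integer passed to skiprows counts lines from the very top of the file, header included, so the first line that is not skipped is then used as the header.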
Custom Missing Value Symbols
When data is exported to CSV from different systems, missing values can be specified with different tokens. The na_values parameter allows you to customise the characters that are recognised as missing values. The default values interpreted as NA/NaN are: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
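As a quick sketch of custom missing-value tokens, the snippet below treats '.' and '??' as missing in addition to the defaults. The names and salaries are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical export where missing salaries appear as '??', '.' or 'NA'.
csv_text = "name,salary\nAlice,50000\nBob,??\nCara,.\nDan,NA\n"

# The tokens in na_values are added to the default list, so 'NA' still counts.
na_df = pd.read_csv(io.StringIO(csv_text), na_values=[".", "??"])

missing_count = int(na_df["salary"].isna().sum())
print(na_df)
print("missing salaries:", missing_count)
```

Because the column now contains missing values, Pandas stores it as floating point rather than integer.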
    # Advanced CSV loading example
    import pandas as pd

    data = pd.read_csv(
        "data/files/complex_data_example.tsv",     # relative python path to subdirectory
        sep='\t',                                   # Tab-separated value file.
        quotechar="'",                              # single quote allowed as quote character
        dtype={"salary": int},                      # Parse the salary column as an integer
        usecols=['name', 'birth_date', 'salary'],   # Only load the three columns specified.
        parse_dates=['birth_date'],                 # Interpret the birth_date column as a date
        skiprows=10,                                # Skip the first 10 rows of the file
        na_values=['.', '??']                       # Take any '.' or '??' values as NA
    )
CSV Format Advantages and Disadvantages
As with all technical decisions, storing your data in CSV format has both advantages and disadvantages. Be aware of the potential pitfalls and issues that you will run into as you load, store, and exchange data in CSV format:
On the plus side:
- CSV format is universal and the data can be loaded by almost any software.
- CSV files are simple to understand and debug with a basic text editor.
- CSV files are quick to create and load into memory before analysis.
Nonetheless, the CSV format has some negative sides:
- There is no data type information stored in the text file; all typing (dates, int vs float, strings) is inferred from the data only.
- There's no formatting or layout information storable – things like fonts, borders, column width settings from Microsoft Excel will be lost.
- File encodings can become a problem if there are non-ASCII compatible characters in text fields.
- CSV format is inefficient; numbers are stored as characters rather than binary values, which is wasteful. You will find, however, that your CSV data compresses well using zip compression.
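The compression point can be checked with a short sketch that writes the same DataFrame to a plain CSV file and to a zip-compressed one, then compares the file sizes. The table contents and file names are invented for illustration, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical, fairly repetitive table of 1000 rows.
frame = pd.DataFrame({
    "id": range(1000),
    "category": ["alpha", "beta", "gamma", "delta"] * 250,
    "value": [0.25, 0.5, 0.75, 1.0] * 250,
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example.csv")
    zip_path = os.path.join(tmp_dir, "example.zip")

    # Write the DataFrame back to disk, once plain and once zip-compressed.
    frame.to_csv(csv_path, index=False)
    frame.to_csv(zip_path, index=False, compression="zip")

    csv_size = os.path.getsize(csv_path)
    zip_size = os.path.getsize(zip_path)

    # read_csv can load the compressed file directly.
    restored = pd.read_csv(zip_path, compression="zip")

print(f"plain CSV: {csv_size} bytes, zipped CSV: {zip_size} bytes")
```

How much space is saved depends on the data; repetitive columns like the ones here compress far better than random values would.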
As an aside, in an attempt to counter some of these disadvantages, two prominent data science developers in the R and Python ecosystems, Wes McKinney and Hadley Wickham, recently introduced the Feather Format, which aims to be a fast, simple, open, flexible and multi-platform data format that supports multiple data types natively.
Additional Reading
- Official Pandas documentation for the read_csv function.
- Python 3 Notes on file paths, working directories, and using the OS module.
- Datacamp Tutorial on loading CSV files, including some additional OS commands.
- PythonHow Loading CSV tutorial.
Source: https://www.shanelynn.ie/python-pandas-read-csv-load-data-from-csv-files/