Getting started with pandas Reshaping and pivoting Save pandas dataframe to a csv file Creating DataFrames Indexing and selecting data Grouping Data Missing Data Series Pandas Datareader Merge, join, and concatenate Reading files into pandas DataFrame Duplicated data Resampling Read SQL Server to Dataframe String manipulation Pandas IO tools (reading and saving data sets)Data Types Meta: Documentation Guidelines Graphs and Visualizations MultiIndex Categorical data Map Values Grouping Time Series Data JSON Analysis: Bringing it all together and making decisions IO for Google BigQuery Computational Tools Dealing with categorical variables Gotchas of pandas Appending to DataFrame Simple manipulation of DataFrames Getting information about DataFrames pd.DataFrame.apply Working with Time Series Using .ix, .iloc, .loc, .at and .iat to access a DataFrame Shifting and Lagging Data Holiday Calendars Making Pandas Play Nice With Native Python Datatypes Cross sections of different axes with MultiIndex Read MySQL to DataFrame Boolean indexing of dataframes

Pandas IO tools (reading and saving data sets)

Remarks:

The pandas official documentation includes a page on IO Tools with a list of relevant functions to read and write to files, as well as some examples and common parameters.

Reading csv file into DataFrame

Example for reading file data_file.csv such as:

File:

index,header1,header2,header3
1,str_data,12,1.4
3,str_data,22,42.33
4,str_data,2,3.44
2,str_data,43,43.34

7, str_data, 25, 23.32

Code:

pd.read_csv('data_file.csv')

Output:

   index    header1  header2  header3
0      1   str_data       12     1.40
1      3   str_data       22    42.33
2      4   str_data        2     3.44
3      2   str_data       43    43.34
4      7   str_data       25    23.32

Some useful arguments:

sep The default field delimiter is a comma ,. Use this option if you need a different delimiter, for instance pd.read_csv('data_file.csv', sep=';')

index_col With index_col = n (n an integer) you tell pandas to use column n to index the DataFrame. In the above example:

pd.read_csv('data_file.csv',  index_col=0)

Output:

          header1  header2  header3
index
 1       str_data       12     1.40
 3       str_data       22    42.33
 4       str_data        2     3.44
 2       str_data       43    43.34
 7       str_data       25    23.32

skip_blank_lines By default blank lines are skipped. Use skip_blank_lines=False to include blank lines (they will be filled with NaN values)

pd.read_csv('data_file.csv',  index_col=0,skip_blank_lines=False)

Output:

         header1  header2  header3
index
 1      str_data       12     1.40
 3      str_data       22    42.33
 4      str_data        2     3.44
 2      str_data       43    43.34
NaN          NaN      NaN      NaN
 7      str_data       25    23.32

parse_dates Use this option to parse date data.

File:

date_begin;date_end;header3;header4;header5
1/1/2017;1/10/2017;str_data;1001;123,45
2/1/2017;2/10/2017;str_data;1001;67,89
3/1/2017;3/10/2017;str_data;1001;0

Code to parse columns 0 and 1 as dates:

pd.read_csv('f.csv', sep=';', parse_dates=[0,1])

Output:

  date_begin   date_end   header3  header4 header5
0 2017-01-01 2017-01-10  str_data     1001  123,45
1 2017-02-01 2017-02-10  str_data     1001   67,89
2 2017-03-01 2017-03-10  str_data     1001       0

By default, the date format is inferred. If you want to specify a date format you can use for instance

dateparse = lambda x: pd.datetime.strptime(x, '%d/%m/%Y')
pd.read_csv('f.csv', sep=';',parse_dates=[0,1],date_parser=dateparse)

Output:

  date_begin   date_end   header3  header4 header5
0 2017-01-01 2017-10-01  str_data     1001  123,45
1 2017-01-02 2017-10-02  str_data     1001   67,89
2 2017-01-03 2017-10-03  str_data     1001       0

More information on the function's parameters can be found in the official documentation.

Basic saving to a csv file

raw_data = {'first_name': ['John', 'Jane', 'Jim'],
            'last_name': ['Doe', 'Smith', 'Jones'],
            'department': ['Accounting', 'Sales', 'Engineering'],}
df = pd.DataFrame(raw_data,columns=raw_data.keys())
df.to_csv('data_file.csv')

Parsing dates when reading from csv

You can specify a column that contains dates so pandas would automatically parse them when reading from the csv

pandas.read_csv('data_file.csv', parse_dates=['date_column'])

Spreadsheet to dict of DataFrames

with pd.ExcelFile('path_to_file.xls) as xl:
    d = {sheet_name: xl.parse(sheet_name) for sheet_name in xl.sheet_names}

Read a specific sheet

pd.read_excel('path_to_file.xls', sheetname='Sheet1')

There are many parsing options for read_excel (similar to the options in read_csv.

pd.read_excel('path_to_file.xls',
              sheetname='Sheet1', header=[0, 1, 2],
              skiprows=3, index_col=0)  # etc.

Testing read_csv

import pandas as pd
import io

temp=u"""index; header1; header2; header3
1; str_data; 12; 1.4
3; str_data; 22; 42.33
4; str_data; 2; 3.44
2; str_data; 43; 43.34
7; str_data; 25; 23.32"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp),  
                 sep = ';', 
                 index_col = 0,
                 skip_blank_lines = True)
print (df)
         header1   header2   header3
index                               
1       str_data        12      1.40
3       str_data        22     42.33
4       str_data         2      3.44
2       str_data        43     43.34
7       str_data        25     23.32

List comprehension

All files are in folder files. First create list of DataFrames and then concat them:

import pandas as pd
import glob

#a.csv
#a,b
#1,2
#5,8

#b.csv
#a,b
#9,6
#6,4

#c.csv
#a,b
#4,3
#7,0

files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp) for fp in files]

#duplicated index inherited from each Dataframe
df = pd.concat(dfs)
print (df)
   a  b
0  1  2
1  5  8
0  9  6
1  6  4
0  4  3
1  7  0
#'reseting' index
df = pd.concat(dfs, ignore_index=True)
print (df)
   a  b
0  1  2
1  5  8
2  9  6
3  6  4
4  4  3
5  7  0
#concat by columns
df1 = pd.concat(dfs, axis=1)
print (df1)
   a  b  a  b  a  b
0  1  2  9  6  4  3
1  5  8  6  4  7  0
#reset column names
df1 = pd.concat(dfs, axis=1, ignore_index=True)
print (df1)
   0  1  2  3  4  5
0  1  2  9  6  4  3
1  5  8  6  4  7  0

Read in chunks

import pandas as pd    

chunksize = [n]
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
    delete(chunk)

Save to CSV file

Save with default parameters:

df.to_csv(file_name)

Write specific columns:

df.to_csv(file_name, columns =['col'])

Difault delimiter is ',' - to change it:

df.to_csv(file_name,sep="|")

Write without the header:

df.to_csv(file_name, header=False)

Write with a given header:

df.to_csv(file_name, header = ['A','B','C',...]

To use a specific encoding (e.g. 'utf-8') use the encoding argument:

df.to_csv(file_name, encoding='utf-8')

Parsing date columns with read_csv

Date always have a different format, they can be parsed using a specific parse_dates function.

This input.csv:

2016 06 10 20:30:00    foo
2016 07 11 19:45:30    bar
2013 10 12 4:30:00     foo

Can be parsed like this :

mydateparser = lambda x: pd.datetime.strptime(x, "%Y %m %d %H:%M:%S")
df = pd.read_csv("file.csv", sep='\t', names=['date_column', 'other_column'], parse_dates=['date_column'], date_parser=mydateparser)

parse_dates argument is the column to be parsed
date_parser is the parser function

Read & merge multiple CSV files (with the same structure) into one DF

import os
import glob
import pandas as pd

def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)

path = 'C:/Users/csvfiles'
fmask = os.path.join(path, '*mask*.csv')

df = get_merged_csv(glob.glob(fmask), index_col=None, usecols=['col1', 'col3'])

print(df.head())

If you want to merge CSV files horizontally (adding columns), use axis=1 when calling pd.concat() function:

def merged_csv_horizontally(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], axis=1)

Reading cvs file into a pandas data frame when there is no header row

If the file does not contain a header row,

File:

1;str_data;12;1.4
3;str_data;22;42.33
4;str_data;2;3.44
2;str_data;43;43.34

7; str_data; 25; 23.32

you can use the keyword names to provide column names:

df = pandas.read_csv('data_file.csv', sep=';', index_col=0,
                     skip_blank_lines=True, names=['a', 'b', 'c'])

df
Out: 
           a   b      c
1   str_data  12   1.40
3   str_data  22  42.33
4   str_data   2   3.44
2   str_data  43  43.34
7   str_data  25  23.32

Using HDFStore

import string
import numpy as np
import pandas as pd

generate sample DF with various dtypes

df = pd.DataFrame({
     'int32':    np.random.randint(0, 10**6, 10),
     'int64':    np.random.randint(10**7, 10**9, 10).astype(np.int64)*10,
     'float':    np.random.rand(10),
     'string':   np.random.choice([c*10 for c in string.ascii_uppercase], 10),
     })

In [71]: df
Out[71]:
      float   int32       int64      string
0  0.649978  848354  5269162190  DDDDDDDDDD
1  0.346963  490266  6897476700  OOOOOOOOOO
2  0.035069  756373  6711566750  ZZZZZZZZZZ
3  0.066692  957474  9085243570  FFFFFFFFFF
4  0.679182  665894  3750794810  MMMMMMMMMM
5  0.861914  630527  6567684430  TTTTTTTTTT
6  0.697691  825704  8005182860  FFFFFFFFFF
7  0.474501  942131  4099797720  QQQQQQQQQQ
8  0.645817  951055  8065980030  VVVVVVVVVV
9  0.083500  349709  7417288920  EEEEEEEEEE

make a bigger DF (10 * 100.000 = 1.000.000 rows)

df = pd.concat([df] * 10**5, ignore_index=True)

create (or open existing) HDFStore file

store = pd.HDFStore('d:/temp/example.h5')

save our data frame into `h5` (HDFStore) file, indexing [int32, int64, string] columns:

store.append('store_key', df, data_columns=['int32','int64','string'])

show HDFStore details

In [78]: store.get_storer('store_key').table
Out[78]:
/store_key/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "int32": Int32Col(shape=(), dflt=0, pos=2),
  "int64": Int64Col(shape=(), dflt=0, pos=3),
  "string": StringCol(itemsize=10, shape=(), dflt=b'', pos=4)}
  byteorder := 'little'
  chunkshape := (1724,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "int32": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "string": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "int64": Index(6, medium, shuffle, zlib(1)).is_csi=False}

show indexed columns

In [80]: store.get_storer('store_key').table.colindexes
Out[80]:
{
    "int32": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "string": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "int64": Index(6, medium, shuffle, zlib(1)).is_csi=False}

close (flush to disk) our store file

store.close()

Read Nginx access log (multiple quotechars)

For multiple quotechars use regex in place of sep:

df = pd.read_csv(log_file,
              sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
              engine='python',
              usecols=[0, 3, 4, 5, 6, 7, 8],
              names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
              na_values='-',
              header=None
                )