Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
The official Pandas documentation can be found here.
Should we include the non-documented ffill and bfill?
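For readers unfamiliar with the two methods in question, here is a minimal sketch of what they do on a Series with gaps. The variable names are made up for illustration only.

```python
import numpy as np
import pandas as pd

# A small Series with two missing values in the middle.
gaps = pd.Series([1.0, np.nan, np.nan, 4.0])

# ffill propagates the last valid value forward into the gaps.
forward = gaps.ffill()

# bfill propagates the next valid value backward into the gaps.
backward = gaps.bfill()

print(forward.tolist())
print(backward.tolist())
```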
The Pandas datareader is a subpackage that allows one to create a DataFrame from various internet data sources, currently including:
For more information, see here.
The pandas official documentation includes a page on IO Tools with a list of relevant functions to read and write to files, as well as some examples and common parameters.
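As a rough illustration of the read/write pattern those IO functions follow, the sketch below round-trips a tiny CSV through in-memory buffers. The column names and values are invented for the example, and only read_csv and to_csv are shown.

```python
import io

import pandas as pd

# Hypothetical CSV content; a file path could be passed instead of a buffer.
csv_text = "name,score\nalice,1.5\nbob,2.0\n"

# Read the CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Write the DataFrame back out as CSV, leaving out the row index.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read the written text again to confirm the round trip preserved the data.
round_tripped = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_tripped.equals(df))
```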
dtypes are not native to pandas. They are a result of pandas' close architectural coupling to numpy.
The dtype of a column does not have to correlate in any way with the Python type of the objects contained in the column.
Here we have a pd.Series with floats. The dtype will be float. Then we use astype to "cast" it to object.
pd.Series([1.,2.,3.,4.,5.]).astype(object)
0 1
1 2
2 3
3 4
4 5
dtype: object
The dtype is now object, but the elements of the Series are still Python floats. This makes sense once you recall that in Python everything is an object, so any value can be upcast to object.
type(pd.Series([1.,2.,3.,4.,5.]).astype(object)[0])
float
Here we try "casting" the floats to strings.
pd.Series([1.,2.,3.,4.,5.]).astype(str)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
dtype: object
The dtype is now object, but the type of each entry in the Series is str.
This is because pandas, by default here, stores strings in a numpy object array rather than in one of numpy's fixed-width string dtypes, so as far as the dtype is concerned they are just objects.
type(pd.Series([1.,2.,3.,4.,5.]).astype(str)[0])
str
Do not trust dtypes; they are an artifact of an architectural flaw in pandas. Specify them when you must, but do not rely on the dtype that is set on a column.
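One way to see what a column really holds is to look at the Python type of each element rather than at the dtype. The sketch below is only an illustration, using a made-up Series with mixed contents.

```python
import pandas as pd

# A hypothetical column holding a float, a string, and an int.
mixed = pd.Series([1.0, "two", 3])

# The dtype only reports "object" and says nothing about the elements.
print(mixed.dtype)

# Inspecting each element reveals the actual Python types.
type_names = [type(value).__name__ for value in mixed]
print(type_names)
```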
This meta post is similar to the python version http://stackoverflow.com/documentation/python/394/meta-documentation-guidelines#t=201607240058406359521.
Please make edit suggestions, and comment on those (in lieu of proper comments), so we can flesh out/iterate on these suggestions :)
It should be mentioned that if the key value does not exist, this will raise a KeyError. In those situations it may be better to use merge or get, which allows you to specify a default value if the key doesn't exist.
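The difference can be sketched as follows, assuming the lookup in question is label-based indexing on a Series. The labels and values are hypothetical.

```python
import pandas as pd

# Hypothetical Series indexed by label.
prices = pd.Series({"apple": 1.2, "pear": 0.8})

# Bracket indexing with a missing label raises KeyError.
try:
    prices["kiwi"]
    missing_raised = False
except KeyError:
    missing_raised = True

# get returns the supplied default instead of raising.
fallback = prices.get("kiwi", default=0.0)

# For a label that exists, get returns the stored value.
present = prices.get("apple", default=0.0)

print(missing_raised, fallback, present)
```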
A gotcha, in general, is a construct that is documented but not intuitive. Gotchas produce output that is usually unexpected because of their counter-intuitive nature.
The pandas package has several gotchas that can confuse anyone who is not aware of them. Some of them are presented on this documentation page.