pandas

1. pandas

1.1. Resources

https://pandastutor.com/ → run pandas code step by step
“Fluent Pandas” → good, comprehensive review
https://github.com/jvns/pandas-cookbook
https://github.com/tommyod/awesome-pandas

1.2. Pandas column typing

Although Python supports static typing (type hints), for now it does not apply to pandas columns
https://koalas.readthedocs.io/en/latest/user_guide/typehints.html

1.2.1. pandera

1.2.2. GitHub - pandas-dev/pandas-stubs: Public type stubs for pandas

1.3. Pandas: Setting no. of max rows - Stack Overflow

import pandas as pd
from IPython.display import display

with pd.option_context('display.max_rows', 100, 'display.max_columns', 10, 'display.max_colwidth', 100):
    display(df)  # display() is needed to render the dataframe inside a with block in Jupyter
    # some pandas stuff

1.4. API for arrays and dataframes

1.5. Declarative pipeline in pandas idea

In YAML

1.5.1. kmiller96/conduits: Declarative Data Pipelines

https://github.com/kmiller96/conduits

1.5.2. Collection of generic functions

1.6. Pandas inplace

1.7. Pandas tricks

Quickly create a date index with pandas

  aa = pd.date_range("2021-01-01", "2021-01-02").to_pydatetime().tolist()
  [x.strftime("%Y%m%d") for x in aa]

Dates in CSV
The date format is lost, so parse the column again after reading:

    df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
    df["date"] = df["date"].dt.date  # plain datetime.date objects

Apply a function to 2 columns

    df['point'] = df[['lat', 'lng']].apply(Point, axis=1)

Forward-looking moving average
rolling computes the moving average backwards and does not accept negative arguments, so to compute it forwards you have to reverse the series, compute the mean, and reverse it back:
```
  candidate["time_diff"][::-1].rolling(N, closed="left").mean()[::-1]
```
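
A minimal sketch of the reverse-and-roll trick on toy data, compared with pandas' built-in forward window indexer. The series `s` and the window size `N` are illustrative assumptions; unlike the snippet above, this version includes the current row in the window (no `closed="left"`).

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
N = 2

# Reverse, take the usual backward-looking mean, then reverse back.
forward_reversed = s[::-1].rolling(N).mean()[::-1]

# Same window expressed directly with a forward-looking indexer.
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=N)
forward_indexer = s.rolling(window=indexer, min_periods=N).mean()
```

Both results hold the mean of each row and the next `N - 1` rows, with NaN at the tail where the window is incomplete.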

df.isin

    import pandas as pd
    df.isin({'column': <list>})
    # Checks whether each element of df['column'] is in <list>; equivalent to Python's in
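
As a hedged illustration with made-up data (the column names `city` and `n` are assumptions), the dict form returns a boolean frame, while the Series form is the usual way to filter rows:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Madrid", "Paris", "Rome"], "n": [1, 2, 3]})

# Dict form: only the listed column is checked; other columns come back False.
mask = df.isin({"city": ["Madrid", "Rome"]})

# Series form: boolean mask used to keep matching rows.
selected = df[df["city"].isin(["Madrid", "Rome"])]
```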

df.where
If you are working with a DataFrame that contains nulls, take care not to drop them by accident when calling dropna(): its default is how='any', which drops every row that has a NaN in any column.
```
    df.where(df['a'] > 1).dropna(how='all')
```
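
A small sketch of the difference, on an assumed toy frame where column `b` already contains a legitimate NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10.0, np.nan, 30.0]})

# Rows that fail the condition become all-NaN.
masked = df.where(df["a"] > 1)

# how='all' removes only the fully masked rows and keeps the row with a real NaN.
kept_all = masked.dropna(how="all")

# The default how='any' also throws away the row whose only NaN was in 'b'.
kept_any = masked.dropna()
```

Plain boolean indexing, `df[df["a"] > 1]`, gives the same rows as `kept_all` without the intermediate masking step.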
Time filters
- between_time('0:15', '0:45')
- at_time
- If a date is used as the index, you can do things like
  df['2020-01':'2020-02']
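
The three filters above can be sketched as follows. The frames `intraday` and `daily` and the column `v` are illustrative assumptions; both frames use a DatetimeIndex.

```python
import pandas as pd

intraday = pd.DataFrame(
    {"v": range(4)},
    index=pd.date_range("2020-01-01 00:00", periods=4, freq="15min"),
)
# Rows whose time of day falls between 00:15 and 00:45 (both ends included).
window = intraday.between_time("0:15", "0:45")
# Rows at exactly 00:30.
exact = intraday.at_time("0:30")

daily = pd.DataFrame(
    {"v": range(4)},
    index=pd.date_range("2020-01-15", periods=4, freq="30D"),
)
# Partial-string slice: everything in January and February 2020.
jan_feb = daily.loc["2020-01":"2020-02"]
```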
df.groupby
- Afterwards, to get a regular index back (with the group keys as columns), call .reset_index()
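
A short sketch on assumed toy data (`store`, `sales`): after aggregating, the group key lives in the index until `.reset_index()` is called, and `as_index=False` avoids the extra step.

```python
import pandas as pd

df = pd.DataFrame({"store": ["a", "a", "b"], "sales": [1, 2, 5]})

grouped = df.groupby("store").sum()   # 'store' is now the index
flat = grouped.reset_index()          # 'store' is a regular column again

# Same result in one step.
flat_direct = df.groupby("store", as_index=False).sum()
```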

pd.melt
https://en.wikipedia.org/wiki/Wide_and_narrow_data

    pd.melt(df, id_vars=[...], value_vars=[...])
    # Converts from wide format to long/narrow format
    # id_vars are the variables that should stay as columns
    # value_vars are the variables that get collapsed into just 2 columns:
    # 1. A column named variable, holding the names of the columns passed in value_vars
    # 2. A column named value, holding the values that were previously spread across the value_vars columns

    df.pivot(index=None, columns=None, values=None)
    # This is the inverse function
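
A round trip on an assumed three-column frame (`id`, `x`, `y`) shows both directions:

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "x": [10, 20], "y": [30, 40]})

# Wide -> long: one row per (id, variable) pair.
long = pd.melt(wide, id_vars=["id"], value_vars=["x", "y"])

# Long -> wide again; reset_index turns 'id' back into a column.
back = long.pivot(index="id", columns="variable", values="value").reset_index()
back.columns.name = None  # drop the leftover 'variable' label on the columns
```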

Small tricks

    nsmallest() and nlargest() → on a large dataframe, return the N smallest/largest rows without a full sort_values
    ne() → not equal
    idxmin, idxmax → return the index label of the first occurrence of the minimum/maximum; on a boolean Series, idxmax gives the first label that meets the condition
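
A quick sketch of these helpers on a made-up Series (the values and labels are assumptions):

```python
import pandas as pd

s = pd.Series([5, 1, 9, 1, 7], index=list("abcde"))

smallest = s.nsmallest(2)        # two smallest values, first occurrences win ties
largest = s.nlargest(2)          # two largest values
not_one = s.ne(1)                # element-wise s != 1
label_of_min = s.idxmin()        # label of the first minimum
first_above_6 = (s > 6).idxmax() # first label where the condition is True
```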

One-hot encode and decode

  s = pd.Series(['dog', 'cat', 'dog', 'bird', 'fox', 'dog'])
  pd.get_dummies(s) # one-hot encoding
  pd.get_dummies(s).idxmax(1) # one-hot decoding

df.query
```
    filtered = df.query('id == 1149361 and date < @current_date')
```
Comparisons against NaN evaluate to False, so rows with a NaN in a compared column are excluded as if dropna() had been applied; be careful if there are any NaNs
query documentation
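
A hedged sketch of both points on toy data (the columns `id`, `date`, `score` and the variable `current_date` are illustrative): local variables are referenced with `@`, and rows whose compared value is NaN do not pass the filter.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2],
        "date": pd.to_datetime(["2021-01-01", "2021-03-01", "2021-01-01"]),
        "score": [0.5, np.nan, 0.9],
    }
)
current_date = pd.Timestamp("2021-02-01")

# '==' for equality, '@name' to reference a Python variable.
recent = df.query("id == 1 and date < @current_date")

# The row with score NaN is excluded because NaN > 0.1 is False.
scored = df.query("score > 0.1")
```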
merge_asof
Merge on the closest key (minimum distance) instead of on exact matches
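
A minimal sketch with assumed integer keys (column `t`): each row on the left is matched with the row on the right whose key is closest. Both frames must be sorted by the key.

```python
import pandas as pd

trades = pd.DataFrame({"t": [3, 8], "qty": [100, 200]})
quotes = pd.DataFrame({"t": [0, 7, 10], "price": [1.0, 1.1, 1.2]})

# direction="nearest" picks the closest key on either side;
# the default ("backward") only looks at earlier keys.
matched = pd.merge_asof(trades, quotes, on="t", direction="nearest")
```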

Utilities

df.to_clipboard
pd.options.mode.chained_assignment = None to silence the warning

  def get_dups(df, pkeys):
      df = df.copy()
      grouped = df.groupby(pkeys).count()
      return df.merge(grouped[grouped > 1].dropna(how='all').reset_index()[pkeys])
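
As an alternative sketch (not the helper above), `DataFrame.duplicated` with `keep=False` marks every row whose key appears more than once. The frame and the key column `id` are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 3, 3], "val": ["a", "b", "c", "d", "e"]})

# keep=False flags all members of each duplicated group, not just the repeats.
dups = df[df.duplicated(subset=["id"], keep=False)]
```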

Sum all fields except one, which is the id

  df.assign(sum=df.drop(columns='id').sum(axis=1))

1.7.0.1. Use itertuples instead of iterrows

https://www.linkedin.com/posts/dennisbakhuis_datascience-machinelearning-pandas-activity-6787643416915398657-PH1H

itertuples is the fastest option; iterrows builds a new Series for every row, which is probably why it takes longer
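
A small sketch of both iteration styles on assumed columns `lat` and `lng`; no timings are claimed here, only the difference in what each loop yields.

```python
import pandas as pd

df = pd.DataFrame({"lat": [40.4, 48.9], "lng": [-3.7, 2.4]})

# itertuples yields lightweight namedtuples; fields are read as attributes.
coords = [(row.lat, row.lng) for row in df.itertuples(index=False)]

# iterrows yields (index, Series) pairs, creating a Series per row.
coords_slow = [(row["lat"], row["lng"]) for _, row in df.iterrows()]
```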

1.8. pandas ecosystem

What to do if you run out of RAM to process all your data, want to chain pandas pipes, parallelize operations…

Pandas ecosystem: other pandas libraries/plugins for parallel or on-disk processing
Blaze ecosystem
The Blaze ecosystem is a set of libraries that help users store, describe, query and process data. It is composed of the following core projects:
- Blaze: An interface to query data on different storage systems
- Dask: Parallel computing through task scheduling and blocked algorithms
- Datashape: A data description language
- DyND: A C++ library for dynamic, multidimensional arrays
- Odo: Data migration between different storage systems
cudf GPU DataFrames https://docs.rapids.ai/api/cudf/stable/
pdpipe: pipelines for pandas
koalas: pandas + Spark
Swifter, with timing analysis
pandas is slower than numpy for heavy computations, but it runs much faster with Swifter
h2oai/datatable: A Python package for manipulating 2-dimensional tabular data structures
Exports to DataFrame

1.8.1. Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS

Modin, with Ray as a backend. By installing these, you might see significant benefit by changing just a single line (import pandas as pd to import modin.pandas as pd). Unlike the other tools, Modin aims to reach full compatibility with Pandas.
Dask, a larger and hence more complicated project. But Dask also provides Dask.dataframe, a higher-level, Pandas-like library that can help you deal with out-of-core datasets.
Vaex, which is designed to help you work with large data on a standard laptop. Its Pandas replacement covers some of the Pandas API, but it’s more focused on exploration and visualization.
RAPIDS, if you have access to NVIDIA graphics cards

1.8.2. Dask

Avoid duplicated columns! (pandas is fine with them, Dask is not)
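
One way to check for and drop duplicated column names in pandas before handing the frame to Dask. This is a sketch on a made-up frame; it keeps the first occurrence of each name.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])

# True for every column name that has already appeared further left.
dup_mask = df.columns.duplicated()

# Keep only the first occurrence of each column name.
deduped = df.loc[:, ~dup_mask]
```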

1.8.2.1. https://github.com/geopandas/dask-geopandas

1.8.3. Vaex

1.8.3.1. Uncluster Your Data Science Using Vaex • Maarten Breddels & Jovan Veljanoski • GOTO 2021 - YouTube

Store the expression, not the result, to save memory

Expressions that reference a column are neither computed nor stored; they remain as expressions and therefore use no memory
Get data from AWS (and more clouds) bucket directly

The talk explains how to use public cloud APIs to upload a huge file and work with it while the file stays on a remote server

1.8.4. InvestmentSystems/static-frame: Immutable and grow-only Pandas-like DataFrames with a more explicit and consistent interface.

Grow-only DataFrames can be used to define a hierarchy of data types: one type extends the next

1.8.5. https://pypi.org/project/mapply/

1.8.6. pyjanitor

https://stackoverflow.com/questions/74497801/import-janitor-as-jn-typeerror-type-object-is-not-subscriptable
For older Python versions:
pip install pyjanitor==0.23.1

1.9. pandas to dask

import dask.dataframe as dd
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/database_name")
# `query` is the SQL string to run; it is not defined in these notes
iterate = pd.read_sql(query, engine, chunksize=100000)
df = dd.from_pandas(data=next(iterate), npartitions=1)
for chunk in iterate:
    df2 = dd.from_pandas(chunk, npartitions=1)
    # To avoid lazy evaluation, otherwise df, df2 won't be up to date:
    df = df.compute()
    df2 = df2.compute()
    df = dd.concat([df, df2], axis=0)

1.9.1. Add metadata (ValueError: Metadata inference failed in …)

Error → ValueError: Metadata inference failed in

Understanding Dask’s meta keyword argument

1.9.2. Cannot hash with pandas

pd.util.hash_pandas_object

1.10. pandas to geodask

Not working yet
Maybe see this:

import pandas as pd
import geopandas as gpd
import dask_geopandas as dg


def read_sql_daskgeo(query, engine, chunksize=1000000, npartitions=1, geometry="geom", **kwargs):
    iterate = pd.read_sql(query, engine, chunksize=chunksize)

    df = dg.from_geopandas(data=gpd.GeoDataFrame(next(iterate), geometry=geometry), npartitions=npartitions)
    for chunk in iterate:
        df2 = dg.from_geopandas(gpd.GeoDataFrame(chunk, geometry=geometry), npartitions=npartitions)
        # To avoid lazy evaluation, otherwise df, df2 won't be up to date:
        df = df.compute()
        df2 = df2.compute()
        df = dg.concat([df, df2], axis=0)

        del df2, chunk
    del iterate

    return df

1.10.1. Ideas to make it work

Instead of creating a dask dataframe and then a

1.11. pandas reproducibility

https://pandas.pydata.org/docs/reference/api/pandas.util.hash_pandas_object.html

#!/bin/bash

python -c 'import pandas as pd; df = pd.read_csv("_file.csv"); print(pd.util.hash_pandas_object(df).sum())'
python -c 'import pandas as pd; print(pd.__version__)'
for dir in ~/.local/share/venvs/*; do
    echo -en "$dir $(readlink $dir/project)\n"
    source $dir/venv/bin/activate
        python -c 'import pandas as pd; print(pd.__version__)'
        python -c 'import pandas as pd; df = pd.read_csv("_file.csv"); print(pd.util.hash_pandas_object(df).sum())'
    deactivate
done

python -c 'import pandas as pd; df = pd.read_csv("file.csv"); print(pd.util.hash_pandas_object(df))' # Calculate hash by row
python -c 'import pandas as pd; df = pd.read_csv("file.csv"); print(pd.util.hash_pandas_object(df.T))' # Calculate hash by column

The hash changes if LibreOffice opens and saves the file (rounding errors)
- Apply a .round(6) before calculating the hash
- print(pd.util.hash_pandas_object(df.round(6)).sum())

You could iterate over all numeric types and round float values
https://stackoverflow.com/questions/58942810/how-to-loop-over-numeric-column-in-pandas-dataframe-and-filter-values
https://github.com/pandas-dev/pandas/blob/v1.5.2/pandas/_testing/asserters.py#L20-L28

  from pandas.core.dtypes.common import (
      is_bool,
      is_categorical_dtype,
      is_extension_array_dtype,
      is_interval_dtype,
      is_number,                  # ←
      is_numeric_dtype,           # ←
      needs_i8_conversion,
  )
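
A minimal sketch of the rounding idea using only public pandas API instead of the internal helpers above. The function name `stable_hash` and the default of 6 decimals are assumptions for illustration; values that sit right at a rounding boundary can still hash differently.

```python
import numpy as np
import pandas as pd


def stable_hash(df: pd.DataFrame, decimals: int = 6) -> int:
    """Hash a DataFrame after rounding its float columns (hypothetical helper)."""
    rounded = df.copy()
    # Select only floating-point columns; ints and other dtypes are left untouched.
    float_cols = rounded.select_dtypes(include=[np.floating]).columns
    if len(float_cols):
        rounded[float_cols] = rounded[float_cols].round(decimals)
    # One hash per row, summed into a single number as in the script above.
    return int(pd.util.hash_pandas_object(rounded).sum())


# Two frames that differ only by floating-point noise.
df_a = pd.DataFrame({"id": [1, 2], "x": [0.1 + 0.2, 1.0]})
df_b = pd.DataFrame({"id": [1, 2], "x": [0.3, 1.0]})
same = stable_hash(df_a) == stable_hash(df_b)
```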

1.12. Apache Arrow

Pandas done well

Apache Arrow and the “10 Things I Hate About pandas” - Wes McKinney
pandas 2.0 and the Arrow revolution (part I) Interoperability with polars

1.13. Polars

Pandas in Rust

1.13.1. Review [2023-02-02 Thu]

no df.query
df.select instead of df[["column", "column2"]]
Lazy evaluation: the NAIVE QUERY PLAN is a graph detailing the execution. Use
- .collect() to evaluate
- .show_graph() to get a more detailed plan graph

cleaner aggregation

    import polars as pl

    (
        pl.read_csv("sales data-set.csv")
        .group_by("Store")  # spelled groupby in older Polars releases
        .agg(
            [
                pl.col("Weekly_Sales").min().alias("Weekly_Sales_min"),
                pl.col("Weekly_Sales").mean().alias("Weekly_Sales_mean"),
                pl.col("Weekly_Sales").max().alias("Weekly_Sales_max"),
                # Assumption: the Dept_* aggregates were meant to read the Dept column
                pl.col("Dept").min().alias("Dept_min"),
                pl.col("Dept").mean().alias("Dept_mean"),
                pl.col("Dept").max().alias("Dept_max"),
            ]
        )
    )

joining on expressions

1.13.2. Awesome Polars

1.13.3. polars 0.27 release : rust examples and docs in rust, most are in python

1.13.4. Using Polars over Pandas or PySpark : dataengineering

1.13.5. Polars Ecosystem

Polars equivalents of pandas libraries
geopandas alternative: geopolars
pandera alternative: patito