Pandas utilities
This module contains utilities for data frame manipulation.
This module differs from ml_utils and others because this module treats pandas data frames more like database tables which hold various types of records. The other modules tend to treat data frames as data matrices (in a statistical/machine learning sense).
Manipulating and processing data frames
apply(df, func, *args, progress_bar, **kwargs) – Apply func to each row in the data frame
apply_groups(groups, func, *args, …) – Apply func to each group in groups
split_df(df, num_groups, chunk_size) – Split df into roughly equal-sized groups
join_df_list(dfs, join_col, …) – Join a list of data frames on a common column
filter_rows(df_filter, df_to_keep, …) – Filter rows from df_to_keep which have matches in df_filter
group_and_chunk_df(df, groupby_field, chunk_size) – Group df using the given field, and then create “groups of groups” with chunk_size groups in each outer group
get_group_extreme(df, ex_field, ex_type, …) – Find the row in each group of df with an extreme value for ex_field
Converting to and from data frames
dict_to_dataframe(dic, key_name, value_name) – Convert a dictionary into a two-column data frame using the given column names
dataframe_to_dict(df, key_field, value_field) – Convert two columns of a data frame into a dictionary
Other pandas helpers
get_series_union(*pd_series) – Take the union of values from the list of series
groupby_to_generator(groups) – Convert the groupby object to a generator of data frames
Reading and writing data frames
read_df(filename, filetype, sheet, **kwargs) – Read a data frame from a file
write_df(df, out, create_path, filetype, …) – Write a data frame to a file of the specified type
append_to_xlsx(df, xlsx[, sheet]) – Append df to xlsx
Definitions
pyllars.pandas_utils.append_to_xlsx(df: pandas.core.frame.DataFrame, xlsx: str, sheet='Sheet_1', **kwargs) → None
Append df to xlsx.
If the sheet already exists, it will be overwritten. If the file does not exist, it will be created.
N.B. This will not work with an open file handle! The xlsx argument must be the path to the file.
Parameters:
- df (pandas.DataFrame) – The data frame to write
- xlsx (str) – The path to the excel file
- sheet (str) – The name of the sheet, which will be truncated to 31 characters
- kwargs – Keyword arguments to pass to the appropriate “write” function
Returns: None – The sheet is appended to the excel file
pyllars.pandas_utils.apply(df: pandas.core.frame.DataFrame, func: Callable, *args, progress_bar: bool = False, **kwargs) → List
Apply func to each row in the data frame.
Unlike pandas.DataFrame.apply(), this function does not attempt to “interpret” the results and cast them back to a data frame, etc.
Parameters:
- df (pandas.DataFrame) – The data frame
- func (typing.Callable) – The function to apply to each row in df
- args – Positional arguments to pass to func
- kwargs – Keyword arguments to pass to func
- progress_bar (bool) – Whether to show a progress bar when waiting for results
Returns: results – The result of each function call
Return type: typing.List
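The documented behavior can be approximated in plain pandas. This is a minimal sketch (the helper name apply_rows is hypothetical; it is not the pyllars implementation and omits the progress bar):

```python
import pandas as pd

def apply_rows(df, func, *args, **kwargs):
    """Call func on each row (a Series) and collect the raw results in a list."""
    return [func(row, *args, **kwargs) for _, row in df.iterrows()]

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
results = apply_rows(df, lambda row: row["a"] + row["b"])
# results is a plain list, not a Series: [11, 22, 33]
```

Note that, as documented, the results are returned as-is rather than being coerced back into a Series or data frame.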
pyllars.pandas_utils.apply_groups(groups: pandas.core.groupby.generic.DataFrameGroupBy, func: Callable, *args, progress_bar: bool = False, **kwargs) → List
Apply func to each group in groups.
Unlike pandas.core.groupby.GroupBy.apply(), this function does not attempt to “interpret” the results by casting to a data frame, etc.
Parameters:
- groups (pandas.core.groupby.GroupBy) – The result of a call to groupby on a data frame
- func (typing.Callable) – The function to apply to each group in groups
- args – Positional arguments to pass to func
- kwargs – Keyword arguments to pass to func
- progress_bar (bool) – Whether to show a progress bar when waiting for results
Returns: results – The result of each function call
Return type: typing.List
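A minimal plain-pandas sketch of the documented behavior (not the pyllars implementation): call func on each group’s data frame and keep the raw results in a list.

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3]})
groups = df.groupby("g")

func = lambda group_df: group_df["x"].sum()
results = [func(group_df) for _, group_df in groups]
# one raw result per group, in group order: [3, 3]
```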
pyllars.pandas_utils.dataframe_to_dict(df: pandas.core.frame.DataFrame, key_field: str, value_field: str) → Dict
Convert two columns of a data frame into a dictionary.
Parameters:
- df (pandas.DataFrame) – The data frame
- key_field (str) – The field to use as the keys in the dictionary
- value_field (str) – The field to use as the values
Returns: the_dict – A dictionary which has one entry for each row in the data frame, with the keys and values as indicated by the fields
Return type: typing.Dict
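A sketch of the documented behavior in plain pandas (not the pyllars implementation): one dictionary entry per row.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1, 2, 3]})

# key_field="name", value_field="score"
the_dict = dict(zip(df["name"], df["score"]))
# {"a": 1, "b": 2, "c": 3}
```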
pyllars.pandas_utils.dict_to_dataframe(dic: Dict, key_name: str = 'key', value_name: str = 'value') → pandas.core.frame.DataFrame
Convert a dictionary into a two-column data frame using the given column names. Each entry in the data frame corresponds to one row.
Parameters:
- dic (typing.Dict) – A dictionary
- key_name (str) – The name to use for the column for the keys
- value_name (str) – The name to use for the column for the values
Returns: df – A data frame in which each row corresponds to one entry in dic
Return type: pandas.DataFrame
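The inverse direction can be sketched in plain pandas (again, not the pyllars implementation): each (key, value) pair becomes one row of a two-column frame with the given column names.

```python
import pandas as pd

dic = {"a": 1, "b": 2}

# key_name="key", value_name="value" (the documented defaults)
df = pd.DataFrame({"key": list(dic.keys()), "value": list(dic.values())})
```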
pyllars.pandas_utils.filter_rows(df_filter: pandas.core.frame.DataFrame, df_to_keep: pandas.core.frame.DataFrame, filter_on: List[str], to_keep_on: List[str], drop_duplicates: bool = True) → pandas.core.frame.DataFrame
Filter rows from df_to_keep which have matches in df_filter.
N.B. The order of the columns in filter_on and to_keep_on must match.
This is adapted from: https://stackoverflow.com/questions/44706485.
Parameters:
- df_filter (pandas.DataFrame) – The rows which will be used as the filter
- df_to_keep (pandas.DataFrame) – The rows which will be kept, unless they appear in df_filter
- filter_on (typing.List[str]) – The columns from df_filter to use for matching
- to_keep_on (typing.List[str]) – The columns from df_to_keep to use for matching
- drop_duplicates (bool) – Whether to remove duplicate rows from the filtered data frame
Returns: df_filtered – The rows of df_to_keep which do not appear in df_filter (considering only the given columns)
Return type: pandas.DataFrame
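This is essentially an “anti-join”. A minimal sketch of one common way to do it in plain pandas, using the merge indicator (the pyllars implementation, adapted from the Stack Overflow answer linked above, may differ in detail):

```python
import pandas as pd

df_filter = pd.DataFrame({"id": [1, 3]})
df_to_keep = pd.DataFrame({"id": [1, 2, 3, 4], "val": ["w", "x", "y", "z"]})

# Left-join, then keep only rows that found no match in df_filter.
merged = df_to_keep.merge(
    df_filter, left_on="id", right_on="id", how="left", indicator=True
)
df_filtered = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
# rows with id 2 and 4 survive
```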
pyllars.pandas_utils.get_group_extreme(df: pandas.core.frame.DataFrame, ex_field: str, ex_type: str = 'max', group_fields: Union[str, List[str], None] = None, groups: Optional[pandas.core.groupby.groupby.GroupBy] = None) → pandas.core.frame.DataFrame
Find the row in each group of df with an extreme value for ex_field.
ex_type must be either “max” or “min” and indicates which type of extreme to consider. Either group_fields or groups must be given.
Parameters:
- df (pandas.DataFrame) – The original data frame. Even if the groups are created externally, the original data frame must be given.
- ex_field (str) – The field for which to find the extreme values
- ex_type (str {"max" or "min"}, case-insensitive) – The type of extreme to consider
- groups (typing.Optional[pandas.core.groupby.GroupBy]) – If not None, then these groups will be used to find the extreme values
- group_fields (typing.Optional[typing.Union[str, typing.Sequence[str]]]) – If not None, then the field(s) by which to group the data frame. This value must be something which can be interpreted by pd.DataFrame.groupby.
Returns: ex_df – A data frame with rows which contain the extreme values for the indicated groups
Return type: pandas.DataFrame
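A minimal plain-pandas sketch for the ex_type="max" case (not the pyllars implementation; "min" would use idxmin instead):

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1, 5, 3, 2],
})

# Index of the row with the maximum "value" within each group,
# then select those full rows from the original frame.
idx = df.groupby("group")["value"].idxmax()
ex_df = df.loc[idx]
# one row per group: value 5 for "a", value 3 for "b"
```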
pyllars.pandas_utils.get_group_sizes(df: Optional[pandas.core.frame.DataFrame] = None, group_fields: Union[str, List[str], None] = None, groups: Optional[pandas.core.groupby.groupby.GroupBy] = None) → pandas.core.frame.DataFrame
Create a data frame containing the size of each group.
Parameters:
- df (pandas.DataFrame) – The data frame. If groups are given, then df is not needed.
- group_fields (typing.Optional[typing.Union[str, typing.Sequence[str]]]) – If not None, then the field(s) by which to group the data frame. This value must be something which can be interpreted by pd.DataFrame.groupby.
- groups (typing.Optional[pandas.core.groupby.GroupBy]) – If not None, then these groups will be used to find the counts.
Returns: df_counts – The data frame containing the counts. It contains one column for each of the group_fields (or whatever the index is if groups is instead provided), as well as a count column.
Return type: pandas.DataFrame
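A sketch of the documented output in plain pandas (not the pyllars implementation; the column name "count" is an assumption based on the description above):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"]})

# One row per group, with the group field plus a count column.
df_counts = df.groupby("group").size().reset_index(name="count")
```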
pyllars.pandas_utils.get_series_union(*pd_series) → Set
Take the union of values from the list of series.
Parameters: pd_series (typing.Iterable[pandas.Series]) – The list of pandas series
Returns: set_union – The union of the values in all series
Return type: typing.Set
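A one-line sketch of the documented behavior (not the pyllars implementation):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([3, 4])

# Set union across the values of all series.
set_union = set().union(*(s.values for s in (s1, s2)))
```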
pyllars.pandas_utils.group_and_chunk_df(df: pandas.core.frame.DataFrame, groupby_field: str, chunk_size: int) → pandas.core.groupby.generic.DataFrameGroupBy
Group df using the given field, and then create “groups of groups” with chunk_size groups in each outer group.
Parameters:
- df (pandas.DataFrame) – The data frame
- groupby_field (str) – The field for creating the initial grouping
- chunk_size (int) – The size of each outer group
Returns: groups – The groups
Return type: pandas.core.groupby.GroupBy
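One way to sketch this “groups of groups” idea in plain pandas (an assumption about the approach; not necessarily how pyllars implements it): assign each row its group’s 0-based index with ngroup, then bundle chunk_size consecutive group indices into one outer group.

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b", "c", "c", "d"]})
chunk_size = 2

# ngroup gives each row the 0-based index of its group (a=0, b=1, c=2, d=3).
group_ids = df.groupby("g").ngroup()

# Integer-dividing by chunk_size puts chunk_size groups in each outer group.
outer = df.groupby(group_ids // chunk_size)
num_outer = len(outer)  # {a, b} and {c, d} -> 2 outer groups
```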
pyllars.pandas_utils.groupby_to_generator(groups: pandas.core.groupby.groupby.GroupBy) → Generator
Convert the groupby object to a generator of data frames.
Parameters: groups (pandas.core.groupby.GroupBy) – The groups
Returns: group_generator – A generator over the data frames in groups
Return type: typing.Generator
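Iterating a groupby object yields (key, frame) pairs; a sketch of the documented conversion (not the pyllars implementation) just keeps the frames:

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3]})

# Drop the group keys, yielding only the per-group data frames.
group_generator = (group_df for _, group_df in df.groupby("g"))
sizes = [len(g) for g in group_generator]  # group "a" has 2 rows, "b" has 1
```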
pyllars.pandas_utils.intersect_masks(masks: Sequence[numpy.ndarray]) → numpy.ndarray
Take the intersection of all masks in the list.
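For boolean masks, the intersection is an elementwise AND. A minimal sketch (not the pyllars implementation):

```python
import numpy as np

masks = [
    np.array([True, True, False]),
    np.array([True, False, False]),
]

# Elementwise AND across all masks in the sequence.
intersection = np.logical_and.reduce(masks)
```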
pyllars.pandas_utils.join_df_list(dfs: List[pandas.core.frame.DataFrame], join_col: Union[str, List[str]], *args, **kwargs) → pandas.core.frame.DataFrame
Join a list of data frames on a common column.
Parameters:
- dfs (typing.Iterable[pandas.DataFrame]) – The data frames
- join_col (str or typing.List[str]) – The name of the column(s) to use for joining. All of the data frames in dfs must have this column (or all columns in the list).
- args – Positional arguments to pass to pandas.merge()
- kwargs – Keyword arguments to pass to pandas.merge()
Returns: joined_df – The data frame from joining all of those in the list on join_col. This function does not especially handle other columns which appear in all data frames; their names in the joined data frame will be adjusted according to the standard pandas suffix approach.
Return type: pandas.DataFrame
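Joining a list of frames pairwise can be sketched as a fold over pandas.merge (a minimal sketch; not necessarily how pyllars implements it):

```python
import functools
import pandas as pd

dfs = [
    pd.DataFrame({"id": [1, 2], "x": [10, 20]}),
    pd.DataFrame({"id": [1, 2], "y": [30, 40]}),
    pd.DataFrame({"id": [1, 2], "z": [50, 60]}),
]

# Repeatedly merge the running result with the next frame on the join column.
joined_df = functools.reduce(lambda left, right: pd.merge(left, right, on="id"), dfs)
```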
pyllars.pandas_utils.read_df(filename: str, filetype: str = 'AUTO', sheet: str = None, **kwargs) → pandas.core.frame.DataFrame
Read a data frame from a file.
By default, this function attempts to guess the type of the file based on its extension. Alternatively, the filetype can be explicitly specified. The supported types and extensions used for guessing are:
- excel: xls, xlsx
- hdf5: hdf, hdf5, h5, he5
- parquet: parq
- csv: all other extensions
N.B. In principle, matlab data files are hdf5, so this function should be able to read them. This has not been thoroughly tested, though.
Parameters:
- filename (str) – The input file
- filetype (str) – The type of file, which determines which pandas read function will be called. If AUTO, the function uses the extensions mentioned above to guess the filetype.
- sheet (str) – For excel or hdf5 files, this will be passed to extract the desired information from the file. Please see pandas.read_excel() or pandas.read_hdf() for more information on how values are interpreted.
- kwargs – Keyword arguments to pass to the appropriate read function.
Returns: df – The data frame
Return type: pandas.DataFrame
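The extension-based guessing described above can be sketched as follows (the helper name guess_filetype is hypothetical and not part of pyllars; only the extension table comes from the documentation):

```python
def guess_filetype(filename):
    """Guess the filetype from the final extension, per the table above."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext in {"xls", "xlsx"}:
        return "excel"
    if ext in {"hdf", "hdf5", "h5", "he5"}:
        return "hdf5"
    if ext == "parq":
        return "parquet"
    return "csv"  # all other extensions fall back to csv
```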
pyllars.pandas_utils.split_df(df: pandas.core.frame.DataFrame, num_groups: int = None, chunk_size: int = None) → pandas.core.groupby.generic.DataFrameGroupBy
Split df into roughly equal-sized groups.
The size of the groups can be specified by either giving the number of groups (num_groups) or the size of each group (chunk_size). The groups are contiguous rows in the data frame.
Parameters:
- df (pandas.DataFrame) – The data frame
- num_groups (int) – The number of groups
- chunk_size (int) – The size of each group. If both are given, num_groups has precedence over chunk_size.
Returns: groups – The groups
Return type: pandas.core.groupby.GroupBy
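Splitting into contiguous, roughly equal-sized groups can be sketched by grouping on an integer label derived from the row position (an assumption about the approach; not necessarily the pyllars implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})
num_groups = 3

# Rows 0..3 get label 0, rows 4..7 get label 1, rows 8..9 get label 2.
chunk = int(np.ceil(len(df) / num_groups))
labels = np.arange(len(df)) // chunk
groups = df.groupby(labels)
sizes = [len(g) for _, g in groups]  # roughly equal: [4, 4, 2]
```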
pyllars.pandas_utils.union_masks(masks: Sequence[numpy.ndarray]) → numpy.ndarray
Take the union of all masks in the list.
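The union counterpart is an elementwise OR. A minimal sketch (not the pyllars implementation):

```python
import numpy as np

masks = [
    np.array([True, False, False]),
    np.array([False, False, True]),
]

# Elementwise OR across all masks in the sequence.
union = np.logical_or.reduce(masks)
```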
pyllars.pandas_utils.write_df(df: pandas.core.frame.DataFrame, out, create_path: bool = False, filetype: str = 'AUTO', sheet: str = 'Sheet_1', compress: bool = True, **kwargs) → None
Write a data frame to a file of the specified type.
Unless otherwise specified, csv files are gzipped when written. By default, the filetype will be guessed based on the extension. The supported types and extensions used for guessing are:
- excel: xls, xlsx
- hdf5: hdf, hdf5, h5, he5
- parquet: parq
- csv: all other extensions (e.g., “gz” or “bed”)
Additionally, the filetype can be specified as ‘excel_writer’. In this case, the out object is taken to be a pd.ExcelWriter, and the df is appended to the writer. AUTO will also guess this correctly.
N.B. The hdf5 filetype has not been thoroughly tested.
Parameters:
- df (pandas.DataFrame) – The data frame
- out (str or pandas.ExcelWriter) – The (complete) path to the file. The file name WILL NOT be modified. In particular, “.gz” WILL NOT be added if the file is to be zipped. As mentioned above, if the filetype is passed as ‘excel_writer’, then this is taken to be a pd.ExcelWriter object.
- create_path (bool) – Whether to create the path directory structure to the file if it does not already exist. N.B. This will not attempt to create the path to an excel_writer since it is possible that it does not yet have one specified.
- filetype (str) – The type of output file to write. If AUTO, the function uses the extensions mentioned above to guess the filetype.
- sheet (str) – The name of the sheet (excel) or key (hdf5) to use when writing the file. This argument is not used for csv. For excel, the sheet is limited to 31 characters. It will be trimmed if necessary.
- compress (bool) – Whether to compress the output. This is only used for csv files.
- kwargs – Keyword arguments to pass to the appropriate “write” function.
Returns: None – The file is created as specified