Physionet utilities

This module contains functions for working with datasets from PhysioNet, including MIMIC and the Computing in Cardiology Challenge 2012 datasets.

In the future, this module may be renamed and updated to also work with the eICU dataset.

Please see the respective documentation for more details:

MIMIC-III utilities

fix_mimic_icds(icds) Add the decimal to the correct location for the given ICD codes

Definitions

pyllars.physionet_utils._fix_mimic_icd(icd)[source]

Add the decimal to the correct location for the ICD code

From the MIMIC documentation (https://mimic.physionet.org/mimictables/diagnoses_icd/):

> The code field for the ICD-9-CM Principal and Other Diagnosis Codes is six characters in length, with the decimal point implied between the third and fourth digit for all diagnosis codes other than the V codes. The decimal is implied for V codes between the second and third digit.
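
The rule quoted above can be sketched in a few lines of Python. This is an illustration only, not the source of _fix_mimic_icd, and the name add_icd9_decimal is hypothetical. Read literally, both quoted cases place the decimal after the third character (three digits, or "V" followed by two digits). E codes are not mentioned in the quote and are treated like ordinary codes here, which is an assumption.

```python
def add_icd9_decimal(icd: str) -> str:
    """Insert the implied decimal point into a MIMIC-style ICD-9 diagnosis code.

    Hypothetical helper: a sketch of the quoted rule, not the library's code.
    """
    icd = str(icd).strip()
    # A code of three characters or fewer is a bare category; no decimal needed.
    if len(icd) <= 3:
        return icd
    # Ordinary codes: decimal between the third and fourth digit.
    # V codes: "V" plus two digits, then the decimal (also after character 3).
    return icd[:3] + "." + icd[3:]


fixed = [add_icd9_decimal(code) for code in ["4019", "V1046", "401"]]
print(fixed)
```

Because the operation is plain string slicing, it is cheap per code but adds up over millions of rows, which is why filtering first can help.
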
pyllars.physionet_utils._get_cinc_2012_record_descriptor(record_file_df)[source]

Given the record file data frame, use the first six rows to extract the descriptor information. See the documentation (https://physionet.org/challenge/2012/, “General descriptors”) for more details.
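
As a rough sketch of the idea, the snippet below splits a record into its first six rows (the descriptors) and the remaining observations using only the standard library. It is not the library's implementation. The function name split_descriptors, the sample rows, the "Time,Parameter,Value" header, and the use of -1 as the missing-value marker are all assumptions made for illustration.

```python
import csv
import io
import math

# Made-up sample in the assumed "Time,Parameter,Value" layout.
SAMPLE_RECORD = """Time,Parameter,Value
00:00,RecordID,132539
00:00,Age,54
00:00,Gender,0
00:00,Height,-1
00:00,ICUType,4
00:00,Weight,-1
00:07,GCS,15
00:07,HR,73
"""


def split_descriptors(text):
    """Return (descriptors, observations) for one record (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    descriptors = {}
    for row in rows[:6]:  # the first six rows hold the general descriptors
        name, value = row["Parameter"], row["Value"]
        if name == "RecordID":
            descriptors[name] = value  # keep the identifier as a string
        else:
            number = float(value)
            # Assumption: negative values mark missing descriptors.
            descriptors[name] = math.nan if number < 0 else number
    return descriptors, rows[6:]


descriptors, observations = split_descriptors(SAMPLE_RECORD)
print(descriptors)
```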

pyllars.physionet_utils.create_followups_table(mimic_base, progress_bar=True)[source]

Create the FOLLOWUPS table, based on the admissions

In particular, the table has the following columns:
  • HADM_ID
  • FOLLOWUP_HADM_ID
  • FOLLOWUP_TIME: the difference between the discharge time of the
    first admission and the admit time of the second admission
  • SUBJECT_ID
Parameters:
  • mimic_base (path-like) – The path to the main MIMIC folder
  • progress_bar (bool) – Whether to show a progress bar for creating the table
Returns:

df_followups – The data frame constructed as described above. Currently, there is no need to create this table more than once. It can just be written to disk and loaded using get_followups after the initial creation.

Return type:

pd.DataFrame
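
The columns described above can be derived from an admissions-like table with a grouped shift. The following is a minimal pandas sketch on made-up data, not the code behind create_followups_table. The name build_followups is hypothetical, and dropping admissions that have no later admission is an assumption.

```python
import pandas as pd

# Made-up admissions: subject 1 has two stays, subject 2 has one.
admissions = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2],
    "HADM_ID": [10, 11, 20],
    "ADMITTIME": pd.to_datetime(["2100-01-01", "2100-02-01", "2100-03-01"]),
    "DISCHTIME": pd.to_datetime(["2100-01-05", "2100-02-03", "2100-03-04"]),
})


def build_followups(admissions: pd.DataFrame) -> pd.DataFrame:
    """Pair each admission with the same subject's next admission (sketch)."""
    ordered = admissions.sort_values(["SUBJECT_ID", "ADMITTIME"])
    by_subject = ordered.groupby("SUBJECT_ID")
    followups = pd.DataFrame({
        "HADM_ID": ordered["HADM_ID"],
        # The next admission of the same subject, if any.
        "FOLLOWUP_HADM_ID": by_subject["HADM_ID"].shift(-1),
        # Admit time of the second admission minus discharge time of the first.
        "FOLLOWUP_TIME": by_subject["ADMITTIME"].shift(-1) - ordered["DISCHTIME"],
        "SUBJECT_ID": ordered["SUBJECT_ID"],
    })
    # Assumption: admissions without a follow-up are left out of the table.
    return followups.dropna(subset=["FOLLOWUP_HADM_ID"]).reset_index(drop=True)


followups = build_followups(admissions)
print(followups)
```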

pyllars.physionet_utils.fix_mimic_icds(icds: Iterable[str]) → List[str][source]

Add the decimal to the correct location for the given ICD codes

Since adding the decimals is a string-based operation, it can be somewhat slow. Thus, it may make sense to perform any filtering before fixing the (possibly much smaller number of) ICD codes.

Parameters: icds (typing.Iterable[str]) – ICDs from the various MIMIC ICD columns
Returns: fixed_icds – The ICD codes with decimals in the correct location
Return type: List[str]
pyllars.physionet_utils.get_admissions(mimic_base: str, **kwargs) → pandas.core.frame.DataFrame[source]

Load the ADMISSIONS table

This function automatically treats the following columns as date-times:

  • ADMITTIME
  • DISCHTIME
  • DEATHTIME
  • EDREGTIME
  • EDOUTTIME
Parameters:
  • mimic_base (str) – The path to the main MIMIC folder
  • kwargs (<key>=<value> pairs) – Additional key words to pass to read_df
Returns:

admissions – The admissions table as a pandas data frame

Return type:

pandas.DataFrame
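
To illustrate what treating those columns as date-times means, here is a small sketch that uses pandas.read_csv directly on made-up in-memory data. It does not call get_admissions, and whether the library parses the columns in exactly this way is an assumption.

```python
import io

import pandas as pd

DATE_COLUMNS = ["ADMITTIME", "DISCHTIME", "DEATHTIME", "EDREGTIME", "EDOUTTIME"]

# Made-up rows; the first patient has no recorded death time.
csv_text = """HADM_ID,ADMITTIME,DISCHTIME,DEATHTIME,EDREGTIME,EDOUTTIME
100,2130-01-01 10:00:00,2130-01-04 12:00:00,,2130-01-01 08:00:00,2130-01-01 09:30:00
101,2131-05-02 07:00:00,2131-05-09 16:00:00,2131-05-09 16:00:00,2131-05-02 05:00:00,2131-05-02 06:45:00
"""

# parse_dates converts the listed columns to datetime64; blanks become NaT.
admissions = pd.read_csv(io.StringIO(csv_text), parse_dates=DATE_COLUMNS)

# With real date-times, durations are a simple subtraction.
length_of_stay = admissions["DISCHTIME"] - admissions["ADMITTIME"]
print(length_of_stay)
```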

pyllars.physionet_utils.get_cinc_2012_outcomes(cinc_2012_base, to_pandas=True)[source]

Load the Outcomes-a.txt file.

N.B. This file is assumed to be named “Outcomes-a.txt” and located directly in the cinc_2012_base directory

Parameters:
  • cinc_2012_base (path-like) – The path to the main folder for this CinC challenge
  • to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
Returns:

cinc_2012_base – The “Outcomes-a” table as either a pandas or dask data frame, depending on the value of to_pandas. It contains the following columns:

  • HADM_ID: string, a key into the record table
  • SAPS-I: integer, the SAPS-I score
  • SOFA: integer, the SOFA score
  • ADMISSION_ELAPSED_TIME: pd.Timedelta, time in the hospital, in days
  • SURVIVAL: time between ICU admission and observed death.
    If the patient survived (or the death was not recorded), then the value is np.nan.
  • EXPIRED: bool, whether the patient died in the hospital

Return type:

pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_cinc_2012_record(cinc_2012_base, record_id, wide=True)[source]

Load the record file for the given id.

N.B. This file is assumed to be named “<record_id>.txt” and located in the “<cinc_2012_base>/set-a” directory.

Parameters:
  • cinc_2012_base (path-like) – The path to the main folder for this CinC challenge
  • record_id (string-like) – The identifier for this record, e.g., “132539”
  • wide (bool) – Whether to return a “long” or “wide” data frame
N.B. According to the specification (“General descriptors”), six descriptors are recorded only when the patients are admitted to the ICU and are included only once at the beginning of the record.
Returns:

record_descriptors

The six descriptors:

  • HADM_ID: string, the record id. We call it “HADM_ID” to
    keep the nomenclature consistent with the MIMIC data
  • ICU_TYPE: string [“coronary_care_unit”,
    “cardiac_surgery_recovery_unit”, “medical_icu”, “surgical_icu”]
  • GENDER: string [‘female’, ‘male’]
  • AGE: float (or np.nan for missing)
  • WEIGHT: float (or np.nan for missing)
  • HEIGHT: float (or np.nan for missing)

Return type:

dictionary

observations: pd.DataFrame

The remaining time series entries for this record. This is returned as either a “long” or “wide” data frame with columns:

  • HADM_ID: string (added for easy joining, etc.)
  • ELAPSED_TIME: timedelta64[ns]
  • MEASUREMENT: the name of the measurement
  • VALUE: the value of the measurement

For a wide data frame, there is instead one column for each measurement.
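
The difference between the two layouts can be shown with a small pivot. This sketch uses made-up values and plain pandas. It is not the library's code, and resolving duplicate measurements at the same time by keeping the first value is an assumption.

```python
import pandas as pd

# A made-up "long" observations frame with the columns listed above.
long_df = pd.DataFrame({
    "HADM_ID": ["132539"] * 4,
    "ELAPSED_TIME": pd.to_timedelta([7, 7, 37, 37], unit="min"),
    "MEASUREMENT": ["HR", "GCS", "HR", "Temp"],
    "VALUE": [73.0, 15.0, 77.0, 35.1],
})

# "Wide": one row per time point, one column per measurement.
wide_df = long_df.pivot_table(
    index=["HADM_ID", "ELAPSED_TIME"],
    columns="MEASUREMENT",
    values="VALUE",
    aggfunc="first",  # assumption: keep the first value if a measurement repeats
).reset_index()
wide_df.columns.name = None  # drop the leftover "MEASUREMENT" axis label

print(wide_df)
```

Measurements that were not taken at a given time show up as NaN in the wide layout, so the long layout is usually more compact for sparse records.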

pyllars.physionet_utils.get_diagnosis_icds(mimic_base: str, drop_incomplete_records: bool = False, fix_icds: bool = False) → pandas.core.frame.DataFrame[source]

Load the DIAGNOSES_ICD table

Parameters:
  • mimic_base (str) – The path to the main MIMIC folder
  • drop_incomplete_records (bool) – Some of the ICD codes are missing. If this flag is True, then those records will be removed.
  • fix_icds (bool) – Whether to add the decimal point in the correct position for the ICD codes
Returns:

diagnosis_icds – The diagnosis ICDs table as a pandas data frame

Return type:

pandas.DataFrame

pyllars.physionet_utils.get_followups(mimic_base: str) → pandas.core.frame.DataFrame[source]

Load the (constructed) FOLLOWUPS table

Parameters: mimic_base (str) – The path to the main MIMIC folder
Returns: df_followups – A data frame containing the followup information
Return type: pandas.DataFrame
pyllars.physionet_utils.get_icu_stays(mimic_base, to_pandas=True, **kwargs)[source]

Load the ICUSTAYS table

Parameters:
  • mimic_base (path-like) – The path to the main MIMIC folder
  • to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
  • kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function
Returns:

icu_stays – The ICU stays table as either a pandas or dask data frame, depending on the value of to_pandas

Return type:

pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_lab_events(mimic_base, to_pandas=True, drop_missing_admission=False, parse_dates=True, **kwargs)[source]

Load the LABEVENTS table

Parameters:
  • mimic_base (path-like) – The path to the main MIMIC folder
  • to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
  • drop_missing_admission (bool) – About 20% of the lab events do not have an associated HADM_ID. If this flag is True, then those will be removed.
  • parse_dates (bool) – Whether to directly parse CHARTTIME as a date. The main reason to skip this (when parse_dates is False) is if the CHARTTIME column is skipped (using the usecols parameter).
  • kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function
Returns:

lab_events – The lab events table as either a pandas or dask data frame, depending on the value of to_pandas

Return type:

pd.DataFrame or dd.DataFrame
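
As a rough illustration of the drop_missing_admission and parse_dates options, the following sketch works on a small in-memory table with plain pandas rather than calling get_lab_events. The sample rows are made up, and whether the library drops rows in exactly this way is an assumption.

```python
import io

import pandas as pd

# Made-up lab events; the second row has no associated HADM_ID.
csv_text = """HADM_ID,ITEMID,CHARTTIME,VALUENUM
100,50820,2130-01-01 10:00:00,7.4
,50820,2130-01-02 09:00:00,7.3
101,50912,2130-01-03 08:30:00,1.1
"""

# Skipping CHARTTIME via usecols leaves no date column to parse,
# which is the situation in which date parsing would be turned off.
lab_events = pd.read_csv(
    io.StringIO(csv_text), usecols=["HADM_ID", "ITEMID", "VALUENUM"]
)

# Keep only the lab events that are tied to a hospital admission.
with_admission = lab_events.dropna(subset=["HADM_ID"])
print(with_admission)
```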

pyllars.physionet_utils.get_lab_items(mimic_base, to_pandas=True, **kwargs)[source]

Load the D_LABITEMS table

Parameters:
  • mimic_base (path-like) – The path to the main MIMIC folder
  • to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
  • kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function
Returns:

lab_items – The lab items table as either a pandas or dask data frame, depending on the value of to_pandas

Return type:

pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_notes(mimic_base, to_pandas=True, **kwargs)[source]

Load the NOTEEVENTS table

Parameters:
  • mimic_base (path-like) – The path to the main MIMIC folder
  • to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
  • kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function
Returns:

notes – The notes table as either a pandas or dask data frame, depending on the value of to_pandas

Return type:

pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_patients(mimic_base, to_pandas=True, **kwargs)[source]

Load the PATIENTS table

Parameters:
  • mimic_base (path-like) – The path to the main MIMIC folder
  • to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
  • kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function
Returns:

patients – The patients table as either a pandas or dask data frame, depending on the value of to_pandas

Return type:

pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_procedure_icds(mimic_base, to_pandas=True)[source]

Load the PROCEDURES_ICD table

Parameters:
  • mimic_base (path-like) – The path to the main MIMIC folder
  • to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
Returns:

procedure_icds – The procedure ICDs table as either a pandas or dask data frame, depending on the value of to_pandas

Return type:

pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_transfers(mimic_base, to_pandas=True, **kwargs)[source]

Load the TRANSFERS table

Parameters:
  • mimic_base (path-like) – The path to the main MIMIC folder
  • to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
  • kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function
Returns:

transfers – The transfers table as either a pandas or dask data frame, depending on the value of to_pandas

Return type:

pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.parse_rdsamp_datetime(fname, version=2)[source]

Extract the identifying information from the filename of the MIMIC-III header (*.hea) files

In this project, we refer to each of these files as an “episode”.

Parameters: fname (string) – The name of the file. It should be of one of the following forms:

version 1:
  /path/to/my/s09870-2111-11-04-12-36.hea
  /path/to/my/s09870-2111-11-04-12-36n.hea
version 2:
  /path/to/my/p000020-2183-04-28-17-47.hea
  /path/to/my/p000020-2183-04-28-17-47n.hea

Returns: episode_timestamp – A dictionary containing the time stamp and subject id for this episode. Specifically, it includes the following keys:
  • SUBJECT_ID: the patient identifier
  • EPISODE_ID: the identifier for this episode
  • EPISODE_BEGIN_TIME: the beginning time for this episode
Return type:dict
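
The filename layout described above can be parsed with a regular expression. The sketch below is not the source of parse_rdsamp_datetime, and the name parse_episode_filename is hypothetical. It handles both prefixes with one pattern instead of a version argument. Treating SUBJECT_ID as an integer and EPISODE_ID as the file stem without the trailing "n" are assumptions, since the documentation does not define them precisely.

```python
import re
from datetime import datetime
from pathlib import Path

# "s" (version 1) or "p" (version 2), the subject digits, the timestamp,
# and an optional trailing "n".
_EPISODE_PATTERN = re.compile(
    r"^(?P<prefix>[sp])(?P<subject>\d+)-"
    r"(?P<stamp>\d{4}-\d{2}-\d{2}-\d{2}-\d{2})n?$"
)


def parse_episode_filename(fname: str) -> dict:
    """Extract subject, episode id, and start time from a header filename."""
    stem = Path(fname).stem  # drop the directory and the ".hea" suffix
    match = _EPISODE_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unexpected header filename: {fname}")
    return {
        # Assumption: the subject identifier is the number after the prefix.
        "SUBJECT_ID": int(match["subject"]),
        # Assumption: the episode id is the stem without the trailing "n".
        "EPISODE_ID": f"{match['prefix']}{match['subject']}-{match['stamp']}",
        "EPISODE_BEGIN_TIME": datetime.strptime(match["stamp"], "%Y-%m-%d-%H-%M"),
    }


episode = parse_episode_filename("/path/to/my/p000020-2183-04-28-17-47n.hea")
print(episode)
```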