Physionet utilities¶

This module contains functions for working with datasets from physionet, including MIMIC and the Computing in Cardiology Challenge 2012 datasets.

In the future, this module may be renamed and updated to also work with the eICU dataset.

Please see the respective documentation for more details:

MIMIC-III: https://mimic.physionet.org/about/mimic/
CINC 2012: https://physionet.org/challenge/2012/
eICU: https://eicu-crd.mit.edu/about/eicu/

MIMIC-III utilities¶

fix_mimic_icds(icds) Add the decimal to the correct location for the given ICD codes

Definitions¶

pyllars.physionet_utils._fix_mimic_icd(icd)[source]¶

Add the decimal to the correct location for the ICD code

From the mimic documentation (https://mimic.physionet.org/mimictables/diagnoses_icd/):

> The code field for the ICD-9-CM Principal and Other Diagnosis Codes is six characters in length, with the decimal point implied between the third and fourth digit for all diagnosis codes other than the V codes. The decimal is implied for V codes between the second and third digit.
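The quoted rule can be sketched in a few lines. This is only an illustration, not the pyllars implementation: `add_icd9_decimal` is a hypothetical name, and the sketch assumes codes arrive as undotted strings. For V codes, the second and third digit follow the leading "V", so the decimal lands after the third character in both cases. E codes are not covered by the quoted passage and get no special handling here.

```python
def add_icd9_decimal(icd: str) -> str:
    """Insert the implied decimal point into an undotted ICD-9-CM diagnosis code.

    Hypothetical helper for illustration; not the pyllars implementation.
    """
    icd = str(icd).strip()

    # Codes of three characters or fewer have nothing after the decimal point.
    if len(icd) <= 3:
        return icd

    # Numeric codes: between the third and fourth digit ("4280" -> "428.0").
    # V codes: between the second and third digit after the "V", which is
    # the same character position ("V3001" -> "V30.01").
    return f"{icd[:3]}.{icd[3:]}"


print(add_icd9_decimal("4280"))   # 428.0
print(add_icd9_decimal("V3001"))  # V30.01
print(add_icd9_decimal("428"))    # 428
```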

pyllars.physionet_utils._get_cinc_2012_record_descriptor(record_file_df)[source]¶

Given the record file data frame, use the first six rows to extract the descriptor information. See the documentation (https://physionet.org/challenge/2012/, “General descriptors”) for more details.

pyllars.physionet_utils.create_followups_table(mimic_base, progress_bar=True)[source]¶

Create the FOLLOWUPS table, based on the admissions

In particular, the table has the following columns:

HADM_ID
FOLLOWUP_HADM_ID
FOLLOWUP_TIME: the difference between the discharge time of the first admission and the admit time of the second admission
SUBJECT_ID

Parameters:

mimic_base (path-like) – The path to the main MIMIC folder
progress_bar (bool) – Whether to show a progress bar for creating the table

Returns:	df_followups – The data frame constructed as described above. Currently, there is no need to create this table more than once. It can just be written to disk and loaded using get_followups after the initial creation.
Return type:	pd.DataFrame
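
One way to derive the columns described above from an admissions data frame is sketched below. This is an assumption-laden illustration, not the code behind create_followups_table: `build_followups` is a hypothetical name, the input is toy data, and the sketch assumes that admissions without a later admission for the same subject are simply left out.

```python
import pandas as pd


def build_followups(admissions: pd.DataFrame) -> pd.DataFrame:
    """Pair each admission with the same subject's next admission (hypothetical sketch)."""
    df = admissions.sort_values(["SUBJECT_ID", "ADMITTIME"])
    by_subject = df.groupby("SUBJECT_ID")

    followups = pd.DataFrame({
        "HADM_ID": df["HADM_ID"],
        # The next admission of the same subject, or NaN if there is none.
        "FOLLOWUP_HADM_ID": by_subject["HADM_ID"].shift(-1),
        # Admit time of the second admission minus discharge time of the first.
        "FOLLOWUP_TIME": by_subject["ADMITTIME"].shift(-1) - df["DISCHTIME"],
        "SUBJECT_ID": df["SUBJECT_ID"],
    })

    # Assumption: keep only admissions that actually have a follow-up.
    return (
        followups.dropna(subset=["FOLLOWUP_HADM_ID"])
        .astype({"FOLLOWUP_HADM_ID": int})
        .reset_index(drop=True)
    )


# Toy admissions: subject 1 is admitted twice, subject 2 once.
admissions = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2],
    "HADM_ID": [10, 11, 20],
    "ADMITTIME": pd.to_datetime(["2100-01-01", "2100-01-10", "2100-02-01"]),
    "DISCHTIME": pd.to_datetime(["2100-01-03", "2100-01-12", "2100-02-05"]),
})
followups = build_followups(admissions)
print(followups)
```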

pyllars.physionet_utils.fix_mimic_icds(icds: Iterable[str]) → List[str][source]¶

Add the decimal to the correct location for the given ICD codes

Since adding the decimals is a string-based operation, it can be somewhat slow. Thus, it may make sense to perform any filtering before fixing the (possibly much smaller number of) ICD codes.

Parameters:	icds (typing.Iterable[str]) – ICDs from the various mimic ICD columns
Returns:	fixed_icds – The ICD codes with decimals in the correct location
Return type:	List[str]
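
The advice to filter first can be illustrated with plain Python. In this sketch, `add_decimal` is a hypothetical stand-in for the per-code fix and `fix_selected` is a hypothetical helper; neither is part of pyllars. The codes are illustrative values, and the filter runs on the raw, undotted strings so that only the matching codes are rewritten.

```python
from typing import Iterable, List


def add_decimal(icd: str) -> str:
    """Hypothetical stand-in for the per-code fix: decimal after the third character."""
    return icd if len(icd) <= 3 else f"{icd[:3]}.{icd[3:]}"


def fix_selected(icds: Iterable[str], prefix: str) -> List[str]:
    """Filter on the raw (undotted) codes first, then fix only the survivors."""
    selected = [icd for icd in icds if icd.startswith(prefix)]
    return [add_decimal(icd) for icd in selected]


raw_codes = ["4280", "42731", "25000", "V3001", "4019"]
selected_codes = fix_selected(raw_codes, prefix="428")
print(selected_codes)  # ['428.0']
```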

pyllars.physionet_utils.get_admissions(mimic_base: str, **kwargs) → pandas.core.frame.DataFrame[source]¶

Load the ADMISSIONS table

This function automatically treats the following columns as date-times:

ADMITTIME
DISCHTIME
DEATHTIME
EDREGTIME
EDOUTTIME

Parameters:

mimic_base (str) – The path to the main MIMIC folder
kwargs (<key>=<value> pairs) – Additional key words to pass to read_df

Returns:	admissions – The admissions table as a pandas data frame
Return type:	pandas.DataFrame
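
The effect of treating these columns as date-times can be shown with pandas alone. The sketch below uses `pandas.read_csv` in place of read_df (whose behavior is not documented here) and reads a small in-memory table with made-up rows instead of the real ADMISSIONS file.

```python
import io

import pandas as pd

DATE_COLUMNS = ["ADMITTIME", "DISCHTIME", "DEATHTIME", "EDREGTIME", "EDOUTTIME"]

# Made-up rows in the shape of the ADMISSIONS table (subset of columns).
csv_text = io.StringIO(
    "SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DEATHTIME,EDREGTIME,EDOUTTIME\n"
    "1,10,2100-01-01 08:00:00,2100-01-03 12:00:00,,2100-01-01 06:30:00,2100-01-01 07:45:00\n"
    "2,20,2100-02-01 09:00:00,2100-02-05 10:00:00,2100-02-05 10:00:00,,\n"
)

# parse_dates turns the listed columns into datetime64 columns; empty cells become NaT.
admissions = pd.read_csv(csv_text, parse_dates=DATE_COLUMNS)

# With real date-times, durations are simple subtractions.
length_of_stay = admissions["DISCHTIME"] - admissions["ADMITTIME"]
print(admissions.dtypes)
print(length_of_stay)
```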

pyllars.physionet_utils.get_cinc_2012_outcomes(cinc_2012_base, to_pandas=True)[source]¶

Load the Outcomes-a.txt file.

N.B. This file is assumed to be named “Outcomes-a.txt” and located directly in the cinc_2012_base directory

Parameters:

cinc_2012_base (path-like) – The path to the main folder for this CinC challenge
to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame

Returns:

cinc_2012_base – The “Outcomes-a” table as either a pandas or dask data frame, depending on the value of to_pandas. It contains the following columns:

HADM_ID: string, a key into the record table

SAPS-I: integer, the SAPS-I score

SOFA: integer, the SOFA score

ADMISSION_ELAPSED_TIME: pd.timedelta, time in the hospital, in days

SURVIVAL: time between ICU admission and observed death.

If the patient survived (or the death was not recorded), then the value is np.nan.

EXPIRED: bool, whether the patient died in the hospital

Return type:

pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_cinc_2012_record(cinc_2012_base, record_id, wide=True)[source]¶

Load the record file for the given id.

N.B. This file is assumed to be named “<record_id>.txt” and located in the “<cinc_2012_base>/set-a” directory.

Parameters:

cinc_2012_base (path-like) – The path to the main folder for this CinC challenge
record_id (string-like) – The identifier for this record, e.g., “132539”
wide (bool) – Whether to return a “long” or “wide” data frame

N.B. According to the specification (see the CINC 2012 link above, “General descriptors”), six descriptors are recorded only when the patients are admitted to the ICU and are included only once at the beginning of the record.

Returns:

record_descriptors (dictionary) –

The six descriptors:

HADM_ID: string, the record id. We call it “HADM_ID” to keep the nomenclature consistent with the MIMIC data

ICU_TYPE: string [“coronary_care_unit”, “cardiac_surgery_recovery_unit”, “medical_icu”, “surgical_icu”]

GENDER: string [‘female’, ‘male’]

AGE: float (or np.nan for missing)

WEIGHT: float (or np.nan for missing)

HEIGHT: float (or np.nan for missing)

observations (pd.DataFrame) –

The remaining time series entries for this record. This is returned as either a “long” or “wide” data frame with columns:

HADM_ID: string (added for easy joining, etc.)

ELAPSED_TIME: timedelta64[ns]

MEASUREMENT: the name of the measurement

VALUE: the value of the measurement

For a wide data frame, there is instead one column for each measurement.
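
The difference between the two layouts can be sketched with a pandas pivot. The values below are illustrative and not taken from the dataset, and resolving repeated measurements with `aggfunc="first"` is an assumption of this sketch, not documented behavior of get_cinc_2012_record.

```python
import pandas as pd

# A "long" frame: one row per (time, measurement) pair. Values are illustrative.
long_df = pd.DataFrame({
    "HADM_ID": ["132539"] * 4,
    "ELAPSED_TIME": pd.to_timedelta(["00:07:00", "00:07:00", "00:37:00", "00:37:00"]),
    "MEASUREMENT": ["HR", "Temp", "HR", "Temp"],
    "VALUE": [73.0, 35.1, 77.0, 35.6],
})

# A "wide" frame: one row per time point, one column per measurement.
wide_df = long_df.pivot_table(
    index=["HADM_ID", "ELAPSED_TIME"],
    columns="MEASUREMENT",
    values="VALUE",
    aggfunc="first",  # assumption: keep the first value if a measurement repeats
).reset_index()
wide_df.columns.name = None  # drop the leftover "MEASUREMENT" axis label

print(wide_df)
```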

pyllars.physionet_utils.get_diagnosis_icds(mimic_base: str, drop_incomplete_records: bool = False, fix_icds: bool = False) → pandas.core.frame.DataFrame[source]¶

Load the DIAGNOSES_ICD table

Parameters:

mimic_base (str) – The path to the main MIMIC folder
drop_incomplete_records (bool) – Some of the ICD codes are missing. If this flag is True, then those records will be removed.
fix_icds (bool) – Whether to add the decimal point in the correct position for the ICD codes

Returns:	diagnosis_icds – The diagnosis ICDs table as a pandas data frame
Return type:	pandas.DataFrame

pyllars.physionet_utils.get_followups(mimic_base: str) → pandas.core.frame.DataFrame[source]¶

Load the (constructed) FOLLOWUPS table

Parameters:	mimic_base (str) – The path to the main MIMIC folder
Returns:	df_followups – A data frame containing the followup information
Return type:	pandas.DataFrame

pyllars.physionet_utils.get_icu_stays(mimic_base, to_pandas=True, **kwargs)[source]¶

Load the ICUSTAYS table

Parameters:

mimic_base (path-like) – The path to the main MIMIC folder
to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function

Returns:

icu_stays –

The ICU stays table as either a pandas or dask data frame, depending on the value of to_pandas

Return type:

pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_lab_events(mimic_base, to_pandas=True, drop_missing_admission=False, parse_dates=True, **kwargs)[source]¶

Load the LABEVENTS table

Parameters:

mimic_base (path-like) – The path to the main MIMIC folder
to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
drop_missing_admission (bool) – About 20% of the lab events do not have an associated HADM_ID. If this flag is True, then those will be removed.
parse_dates (bool) – Whether to directly parse CHARTTIME as a date. The main reason to skip this (when parse_dates is False) is if the CHARTTIME column is skipped (using the usecols parameter).
kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function

Returns:

lab_events –

The lab events table as either a pandas or dask data frame, depending on the value of to_pandas

Return type:

pd.DataFrame or dd.DataFrame
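
How drop_missing_admission and parse_dates interact with usecols can be sketched with plain pandas. `read_lab_events_sketch` is a hypothetical function that only mimics the documented flags; the rows and ITEMID values are made up. Asking pandas to parse CHARTTIME while usecols leaves that column out raises an error, which is why date parsing is switched off in the second call.

```python
import io

import pandas as pd

# Made-up rows in the shape of LABEVENTS (subset of columns); row 2 has no HADM_ID.
CSV_TEXT = (
    "ROW_ID,SUBJECT_ID,HADM_ID,ITEMID,CHARTTIME,VALUENUM\n"
    "1,1,10,50001,2100-01-01 09:00:00,7.4\n"
    "2,1,,50002,2100-01-02 09:00:00,140\n"
    "3,2,20,50001,2100-02-01 10:00:00,7.3\n"
)


def read_lab_events_sketch(source, drop_missing_admission=False, parse_dates=True, **kwargs):
    """Hypothetical reader mirroring the documented flags; not the pyllars code."""
    date_columns = ["CHARTTIME"] if parse_dates else False
    lab_events = pd.read_csv(source, parse_dates=date_columns, **kwargs)

    if drop_missing_admission:
        # Remove lab events that are not tied to a hospital admission.
        lab_events = lab_events.dropna(subset=["HADM_ID"])
    return lab_events


# Default behavior: all rows, CHARTTIME parsed as a date-time.
all_events = read_lab_events_sketch(io.StringIO(CSV_TEXT))

# CHARTTIME is skipped via usecols, so date parsing is turned off as well.
admitted_only = read_lab_events_sketch(
    io.StringIO(CSV_TEXT),
    drop_missing_admission=True,
    parse_dates=False,
    usecols=["SUBJECT_ID", "HADM_ID", "ITEMID", "VALUENUM"],
)
print(admitted_only)
```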

pyllars.physionet_utils.get_lab_items(mimic_base, to_pandas=True, **kwargs)[source]¶

Load the D_LABITEMS table

Parameters:

mimic_base (path-like) – The path to the main MIMIC folder
to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function

Returns:

lab_items –

The lab items table as either a pandas or dask data frame, depending on the value of to_pandas

Return type:

pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_notes(mimic_base, to_pandas=True, **kwargs)[source]¶

Load the NOTEEVENTS table

Parameters:

mimic_base (path-like) – The path to the main MIMIC folder
to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function

Returns:

notes –

The notes table as either a pandas or dask data frame, depending on the value of to_pandas

Return type:

pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_patients(mimic_base, to_pandas=True, **kwargs)[source]¶

Load the PATIENTS table

Parameters:

mimic_base (path-like) – The path to the main MIMIC folder
to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function

Returns:

patients –

The patients table as either a pandas or dask data frame, depending on the value of to_pandas

Return type:

pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_procedure_icds(mimic_base, to_pandas=True)[source]¶

Load the PROCEDURES_ICD table

Parameters:

mimic_base (path-like) – The path to the main MIMIC folder
to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame

Returns:

procedure_icds –

The procedure ICDs table as either a pandas or dask data frame, depending on the value of to_pandas

Return type:

pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_transfers(mimic_base, to_pandas=True, **kwargs)[source]¶

Load the TRANSFERS table

Parameters:

mimic_base (path-like) – The path to the main MIMIC folder
to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function

Returns:	transfers – The transfers table as either a pandas or dask data frame, depending on the value of to_pandas
Return type:	pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.parse_rdsamp_datetime(fname, version=2)[source]¶

Extract the identifying information from the filename of the MIMIC-III header (*.hea) files

In this project, we refer to each of these files as an “episode”.

Parameters:

fname (string) –

The name of the file. It should be of the form:

version 1:

/path/to/my/s09870-2111-11-04-12-36.hea
/path/to/my/s09870-2111-11-04-12-36n.hea

version 2:

/path/to/my/p000020-2183-04-28-17-47.hea
/path/to/my/p000020-2183-04-28-17-47n.hea

Returns:

episode_timestamp – A dictionary containing the time stamp and subject id for this episode. Specifically, it includes the following keys:

SUBJECT_ID: the patient identifier
EPISODE_ID: the identifier for this episode
EPISODE_BEGIN_TIME: the beginning time for this episode

Return type: dict
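
A minimal sketch of this kind of filename parsing follows, using only the standard library. It is not the pyllars implementation: `parse_header_filename` is a hypothetical name, a single pattern is used for both filename versions, and the choices to return SUBJECT_ID as an integer and to use the file stem as EPISODE_ID are assumptions made for illustration.

```python
import os
import re
from datetime import datetime

# One letter ("s" or "p"), the subject digits, the time stamp, an optional "n" suffix.
HEADER_NAME = re.compile(
    r"^[sp](?P<subject>\d+)-(?P<stamp>\d{4}-\d{2}-\d{2}-\d{2}-\d{2})n?\.hea$"
)


def parse_header_filename(fname: str) -> dict:
    """Extract subject id and begin time from a header file name (hypothetical sketch)."""
    base = os.path.basename(fname)
    match = HEADER_NAME.match(base)
    if match is None:
        raise ValueError(f"Unexpected header file name: {fname}")

    return {
        # Assumption: the subject id is the integer value of the digits.
        "SUBJECT_ID": int(match.group("subject")),
        # Assumption: the episode id is the file name without its extension.
        "EPISODE_ID": os.path.splitext(base)[0],
        "EPISODE_BEGIN_TIME": datetime.strptime(match.group("stamp"), "%Y-%m-%d-%H-%M"),
    }


episode = parse_header_filename("/path/to/my/p000020-2183-04-28-17-47.hea")
print(episode)
```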