Physionet utilities
This module contains functions for working with datasets from physionet, including MIMIC and the Computing in Cardiology Challenge 2012 datasets.
In the future, this module may be renamed and updated to also work with the eICU dataset.
Please see the respective documentation for more details:
- MIMIC-III: https://mimic.physionet.org/about/mimic/
- CINC 2012: https://physionet.org/challenge/2012/
- eICU: https://eicu-crd.mit.edu/about/eicu/
MIMIC-III utilities
fix_mimic_icds(icds): Add the decimal to the correct location for the given ICD codes
Definitions

pyllars.physionet_utils._fix_mimic_icd(icd)
Add the decimal to the correct location for the ICD code
From the MIMIC documentation (https://mimic.physionet.org/mimictables/diagnoses_icd/):
> The code field for the ICD-9-CM Principal and Other Diagnosis Codes is six characters in length, with the decimal point implied between the third and fourth digit for all diagnosis codes other than the V codes. The decimal is implied for V codes between the second and third digit.
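The quoted rule can be illustrated with a minimal sketch. This is not the module's implementation: the function name `add_icd9_decimal` is hypothetical, and the sketch assumes that "digit" in the quote excludes a leading letter, so that V codes split after "V" plus two digits and E codes split after "E" plus three digits. Codes that are too short to carry a decimal are returned unchanged.

```python
def add_icd9_decimal(code: str) -> str:
    """Insert the implied decimal point into an undotted ICD-9-CM code.

    Assumption: digits are counted after any leading letter, so
    numeric codes and V codes split after 3 characters, E codes after 4.
    """
    code = str(code).strip()
    split_at = 4 if code.upper().startswith("E") else 3
    if len(code) <= split_at:
        # Only the category is present; there is nothing after the decimal.
        return code
    return code[:split_at] + "." + code[split_at:]


examples = ["4019", "V1011", "E8859", "486"]
fixed = [add_icd9_decimal(c) for c in examples]
```

Under these assumptions, "4019" becomes "401.9" and "V1011" becomes "V10.11".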

pyllars.physionet_utils._get_cinc_2012_record_descriptor(record_file_df)
Given the record file data frame, use the first six rows to extract the descriptor information. See the documentation (https://physionet.org/challenge/2012/, “General descriptors”) for more details.

pyllars.physionet_utils.create_followups_table(mimic_base, progress_bar=True)
Create the FOLLOWUPS table, based on the admissions
In particular, the table has the following columns:
- HADM_ID
- FOLLOWUP_HADM_ID
- FOLLOWUP_TIME: the difference between the discharge time of the first admission and the admit time of the second admission
- SUBJECT_ID
Parameters: - mimic_base (path-like) – The path to the main MIMIC folder
- progress_bar (bool) – Whether to show a progress bar for creating the table
Returns: df_followups – The data frame constructed as described above. Currently, there is no need to create this table more than once. It can just be written to disk and loaded using get_followups after the initial creation.
Return type: pd.DataFrame
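As a rough illustration of how the FOLLOWUPS columns relate to the admissions, here is a plain-Python sketch. It is not the module's implementation (which returns a pandas data frame). The function name `build_followups` is hypothetical, and the sketch assumes that a follow-up is the next admission of the same subject in ADMITTIME order and that FOLLOWUP_TIME is the second admit time minus the first discharge time. The admissions below are made-up values.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_followups(admissions):
    """Pair each admission with the same subject's next admission."""
    by_subject = defaultdict(list)
    for adm in admissions:
        by_subject[adm["SUBJECT_ID"]].append(adm)

    followups = []
    for subject_id, stays in by_subject.items():
        # Order each subject's admissions chronologically.
        stays.sort(key=lambda a: a["ADMITTIME"])
        for first, second in zip(stays, stays[1:]):
            followups.append({
                "HADM_ID": first["HADM_ID"],
                "FOLLOWUP_HADM_ID": second["HADM_ID"],
                "FOLLOWUP_TIME": second["ADMITTIME"] - first["DISCHTIME"],
                "SUBJECT_ID": subject_id,
            })
    return followups


# Made-up admissions, deliberately out of chronological order.
toy_admissions = [
    {"SUBJECT_ID": 1, "HADM_ID": 101,
     "ADMITTIME": datetime(2150, 1, 15), "DISCHTIME": datetime(2150, 1, 20)},
    {"SUBJECT_ID": 1, "HADM_ID": 100,
     "ADMITTIME": datetime(2150, 1, 1), "DISCHTIME": datetime(2150, 1, 5)},
    {"SUBJECT_ID": 2, "HADM_ID": 200,
     "ADMITTIME": datetime(2151, 3, 1), "DISCHTIME": datetime(2151, 3, 2)},
]
toy_followups = build_followups(toy_admissions)
```

A subject with a single admission, such as subject 2 here, contributes no rows.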

pyllars.physionet_utils.fix_mimic_icds(icds: Iterable[str]) → List[str]
Add the decimal to the correct location for the given ICD codes
Since adding the decimals is a string-based operation, it can be somewhat slow. Thus, it may make sense to perform any filtering before fixing the (possibly much smaller number of) ICD codes.
Parameters: icds (typing.Iterable[str]) – ICDs from the various mimic ICD columns
Returns: fixed_icds – The ICD codes with decimals in the correct location
Return type: List[str]
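The advice about filtering first can be sketched as follows. The helper `add_decimal` is a hypothetical stand-in for the per-code fix, not the library function, and the code list is made up. The point is only that the filter runs on the raw, undotted codes, so the slower string fix touches just the codes that survive.

```python
def add_decimal(code: str) -> str:
    """Hypothetical stand-in for the per-code fix (E codes split after 4 characters)."""
    split_at = 4 if code.startswith("E") else 3
    if len(code) <= split_at:
        return code
    return code[:split_at] + "." + code[split_at:]


# Made-up raw codes as they might appear in an undotted ICD column.
raw_codes = ["4019", "25000", "V1011", "25001", "486"]

# Filter on the raw codes first (here: the 250 category) ...
selected_raw = [code for code in raw_codes if code.startswith("250")]

# ... then fix only the much smaller selection.
selected_fixed = [add_decimal(code) for code in selected_raw]
```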

pyllars.physionet_utils.get_admissions(mimic_base: str, **kwargs) → pandas.core.frame.DataFrame
Load the ADMISSIONS table
This function automatically treats the following columns as date-times:
- ADMITTIME
- DISCHTIME
- DEATHTIME
- EDREGTIME
- EDOUTTIME
Parameters: - mimic_base (str) – The path to the main MIMIC folder
- kwargs (<key>=<value> pairs) – Additional key words to pass to read_df
Returns: admissions – The admissions table as a pandas data frame
Return type: pandas.DataFrame
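To make the date-time handling concrete, the following plain-Python sketch parses the five listed columns from CSV text. It is only an illustration of the idea, not how `get_admissions` is implemented. The function name `parse_admission_dates` and the sample row are made up, and the sketch assumes timestamps of the form "YYYY-MM-DD HH:MM:SS" with empty strings for missing values.

```python
import csv
import io
from datetime import datetime

DATE_COLUMNS = ["ADMITTIME", "DISCHTIME", "DEATHTIME", "EDREGTIME", "EDOUTTIME"]


def parse_admission_dates(rows):
    """Convert the date-time columns of each row; empty values become None."""
    parsed = []
    for row in rows:
        row = dict(row)
        for col in DATE_COLUMNS:
            value = row.get(col)
            row[col] = datetime.fromisoformat(value) if value else None
        parsed.append(row)
    return parsed


# Made-up sample standing in for the ADMISSIONS file.
sample_csv = io.StringIO(
    "HADM_ID,ADMITTIME,DISCHTIME,DEATHTIME,EDREGTIME,EDOUTTIME\n"
    "1,2150-01-01 08:00:00,2150-01-05 12:30:00,,,\n"
)
admissions = parse_admission_dates(csv.DictReader(sample_csv))
```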

pyllars.physionet_utils.get_cinc_2012_outcomes(cinc_2012_base, to_pandas=True)
Load the Outcomes-a.txt file.
N.B. This file is assumed to be named “Outcomes-a.txt” and located directly in the cinc_2012_base directory
Parameters: - cinc_2012_base (path-like) – The path to the main folder for this CinC challenge
- to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
Returns: outcomes – The “Outcomes-a” table as either a pandas or dask data frame, depending on the value of to_pandas. It contains the following columns:
- HADM_ID: string, a key into the record table
- SAPS-I: integer, the SAPS-I score
- SOFA: integer, the SOFA score
- ADMISSION_ELAPSED_TIME: pd.timedelta, time in the hospital, in days
- SURVIVAL: time between ICU admission and observed death. If the patient survived (or the death was not recorded), then the value is np.nan.
- EXPIRED: bool, whether the patient died in the hospital
Return type: pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_cinc_2012_record(cinc_2012_base, record_id, wide=True)
Load the record file for the given id.
N.B. This file is assumed to be named “<record_id>.txt” and located in the “<cinc_2012_base>/set-a” directory.
Parameters: - cinc_2012_base (path-like) – The path to the main folder for this CinC challenge
- record_id (string-like) – The identifier for this record, e.g., “132539”
- wide (bool) – Whether to return a “long” or “wide” data frame
N.B. According to the specification (https://physionet.org/challenge/2012/, “General descriptors”), six descriptors are recorded only when the patients are admitted to the ICU and are included only once at the beginning of the record.
Returns:
- record_descriptors: dictionary
The six descriptors:
  - HADM_ID: string, the record id. We call it “HADM_ID” to keep the nomenclature consistent with the MIMIC data
  - ICU_TYPE: string [“coronary_care_unit”, “cardiac_surgery_recovery_unit”, “medical_icu”, “surgical_icu”]
  - GENDER: string [“female”, “male”]
  - AGE: float (or np.nan for missing)
  - WEIGHT: float (or np.nan for missing)
  - HEIGHT: float (or np.nan for missing)
- observations: pd.DataFrame
The remaining time series entries for this record. This is returned as either a “long” or “wide” data frame with columns:
  - HADM_ID: string (added for easy joining, etc.)
  - ELAPSED_TIME: timedelta64[ns]
  - MEASUREMENT: the name of the measurement
  - VALUE: the value of the measurement
For a wide data frame, there is instead one column for each measurement.
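The split between descriptors and observations, and the difference between the “long” and “wide” layouts, can be sketched in plain Python. This is an illustration only, not the module's code. The names `parse_cinc_record` and `long_to_wide` are hypothetical, the record text is a made-up sample, and the sketch assumes a "Time,Parameter,Value" header followed by six descriptor rows, with times given as "hh:mm".

```python
from datetime import timedelta


def parse_cinc_record(text):
    """Split a record into its six descriptors and the long-format observations."""
    lines = [line.strip() for line in text.strip().splitlines()]
    rows = [line.split(",") for line in lines[1:]]  # skip the header row
    descriptor_rows, observation_rows = rows[:6], rows[6:]

    # Descriptors are kept as raw strings in this sketch.
    descriptors = {name: value for _, name, value in descriptor_rows}

    observations = []
    for time_str, name, value in observation_rows:
        hours, minutes = (int(part) for part in time_str.split(":"))
        observations.append({
            "ELAPSED_TIME": timedelta(hours=hours, minutes=minutes),
            "MEASUREMENT": name,
            "VALUE": float(value),
        })
    return descriptors, observations


def long_to_wide(observations):
    """Pivot long rows into one mapping per time point, keyed by measurement name."""
    wide = {}
    for row in observations:
        wide.setdefault(row["ELAPSED_TIME"], {})[row["MEASUREMENT"]] = row["VALUE"]
    return wide


# Made-up sample in the assumed record layout.
sample_record = """Time,Parameter,Value
00:00,RecordID,132539
00:00,Age,54
00:00,Gender,0
00:00,Height,-1
00:00,ICUType,4
00:00,Weight,-1
00:07,GCS,15
00:07,HR,73
00:37,HR,77
"""
descriptors, long_rows = parse_cinc_record(sample_record)
wide_rows = long_to_wide(long_rows)
```

In the long layout each measurement is its own row; in the wide layout each time point holds all measurements taken at that time, and measurements not taken at that time are simply absent.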

pyllars.physionet_utils.get_diagnosis_icds(mimic_base: str, drop_incomplete_records: bool = False, fix_icds: bool = False) → pandas.core.frame.DataFrame
Load the DIAGNOSES_ICD table
Returns: diagnosis_icds – The diagnosis ICDs table as a pandas data frame
Return type: pandas.DataFrame

pyllars.physionet_utils.get_followups(mimic_base: str) → pandas.core.frame.DataFrame
Load the (constructed) FOLLOWUPS table
Parameters: mimic_base (str) – The path to the main MIMIC folder
Returns: df_followups – A data frame containing the followup information
Return type: pandas.DataFrame

pyllars.physionet_utils.get_icu_stays(mimic_base, to_pandas=True, **kwargs)
Load the ICUSTAYS table
Parameters: - mimic_base (path-like) – The path to the main MIMIC folder
- to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
- kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function
Returns: icu_stays – The ICU stays table as either a pandas or dask data frame, depending on the value of to_pandas
Return type: pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_lab_events(mimic_base, to_pandas=True, drop_missing_admission=False, parse_dates=True, **kwargs)
Load the LABEVENTS table
Parameters: - mimic_base (path-like) – The path to the main MIMIC folder
- to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
- drop_missing_admission (bool) – About 20% of the lab events do not have an associated HADM_ID. If this flag is True, then those will be removed.
- parse_dates (bool) – Whether to directly parse CHARTTIME as a date. The main reason to skip this (when parse_dates is False) is if the CHARTTIME column is skipped (using the usecols parameter).
- kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function
Returns: lab_events – The lab events table as either a pandas or dask data frame, depending on the value of to_pandas
Return type: pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_lab_items(mimic_base, to_pandas=True, **kwargs)
Load the D_LABITEMS table
Parameters: - mimic_base (path-like) – The path to the main MIMIC folder
- to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
- kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function
Returns: lab_items – The lab items table as either a pandas or dask data frame, depending on the value of to_pandas
Return type: pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_notes(mimic_base, to_pandas=True, **kwargs)
Load the NOTEEVENTS table
Parameters: - mimic_base (path-like) – The path to the main MIMIC folder
- to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
- kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function
Returns: notes – The notes table as either a pandas or dask data frame, depending on the value of to_pandas
Return type: pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_patients(mimic_base, to_pandas=True, **kwargs)
Load the PATIENTS table
Parameters: - mimic_base (path-like) – The path to the main MIMIC folder
- to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
- kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function
Returns: patients – The patients table as either a pandas or dask data frame, depending on the value of to_pandas
Return type: pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_procedure_icds(mimic_base, to_pandas=True)
Load the PROCEDURES_ICD table
Parameters: - mimic_base (path-like) – The path to the main MIMIC folder
- to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
Returns: procedure_icds – The procedure ICDs table as either a pandas or dask data frame, depending on the value of to_pandas
Return type: pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.get_transfers(mimic_base, to_pandas=True, **kwargs)
Load the TRANSFERS table
Parameters: - mimic_base (path-like) – The path to the main MIMIC folder
- to_pandas (bool) – Whether to read the table as a pandas (True) or dask (False) data frame
- kwargs (key=value pairs) – Additional keywords to pass to the appropriate read function
Returns: transfers – The transfers table as either a pandas or dask data frame, depending on the value of to_pandas
Return type: pd.DataFrame or dd.DataFrame

pyllars.physionet_utils.parse_rdsamp_datetime(fname, version=2)
Extract the identifying information from the filename of the MIMIC-III header (*.hea) files
In this project, we refer to each of these files as an “episode”.
Parameters: fname (string) – The name of the file. It should be of the form:
- version 1: /path/to/my/s09870-2111-11-04-12-36.hea or /path/to/my/s09870-2111-11-04-12-36n.hea
- version 2: /path/to/my/p000020-2183-04-28-17-47.hea or /path/to/my/p000020-2183-04-28-17-47n.hea
Returns: episode_timestamp – A dictionary containing the time stamp and subject id for this episode. Specifically, it includes the following keys:
- SUBJECT_ID: the patient identifier
- EPISODE_ID: the identifier for this episode
- EPISODE_BEGIN_TIME: the beginning time for this episode
Return type: dict
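The filename layouts above can be parsed with a regular expression, as in the following sketch. It is not the library's implementation. The name `parse_episode_filename` is hypothetical, and two choices are assumptions: SUBJECT_ID is returned as an integer, and EPISODE_ID is taken to be the file name without its extension, since the documentation does not say how the episode identifier is formed.

```python
import re
from datetime import datetime
from pathlib import Path

# One pattern per filename version; the trailing "n" is optional.
_STAMP = r"(?P<stamp>\d{4}-\d{2}-\d{2}-\d{2}-\d{2})n?$"
_PATTERNS = {
    1: re.compile(r"^s(?P<subject>\d{5})-" + _STAMP),
    2: re.compile(r"^p(?P<subject>\d{6})-" + _STAMP),
}


def parse_episode_filename(fname, version=2):
    """Extract subject id and begin time from a header file name."""
    stem = Path(fname).stem  # drops the directory and the ".hea" extension
    match = _PATTERNS[version].match(stem)
    if match is None:
        raise ValueError(f"unexpected file name for version {version}: {fname}")
    return {
        "SUBJECT_ID": int(match.group("subject")),  # assumption: integer id
        "EPISODE_ID": stem,  # assumption: the stem identifies the episode
        "EPISODE_BEGIN_TIME": datetime.strptime(
            match.group("stamp"), "%Y-%m-%d-%H-%M"
        ),
    }


episode_v2 = parse_episode_filename("/path/to/my/p000020-2183-04-28-17-47.hea")
episode_v1 = parse_episode_filename(
    "/path/to/my/s09870-2111-11-04-12-36n.hea", version=1
)
```

Passing a version 1 name with `version=2` (or the reverse) raises a `ValueError` in this sketch rather than guessing the layout.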