Model Details
Model Name: NumericBERT
Model Type: Transformer
Architecture: BERT
Training Method: Masked Language Modeling (MLM)
Training Data: MIMIC-IV lab values data
Training Hyperparameters:
Optimizer: AdamW
Learning Rate: 5e-5
Masking Rate: 20%
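The original training script is not included in this card; the following is a minimal sketch, assuming the Hugging Face Trainer (which uses AdamW by default) with the hyperparameters listed above. The base checkpoint, batch size, epoch count, and the encoded_dataset object are assumptions.

```python
# Minimal training sketch using the hyperparameters above (AdamW, lr 5e-5, 20% masking).
# Base checkpoint, batch size, epochs, and the dataset object are assumptions.
from transformers import (
    BertForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = BertForMaskedLM.from_pretrained("bert-base-uncased")    # placeholder base model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder tokenizer

# Mask 20% of tokens for masked language modeling, matching the masking rate above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.2)

args = TrainingArguments(
    output_dir="numericbert",
    learning_rate=5e-5,              # Trainer's default optimizer is AdamW
    per_device_train_batch_size=32,  # assumed batch size
    num_train_epochs=3,              # assumed epoch count
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=encoded_dataset,   # hypothetical tokenized MIMIC-IV lab-value dataset
)
trainer.train()
```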
Tokenization
Tokenizer: Custom numeric-to-text mapping using the TextEncoder class
Text Encoding Process:
The encoder converts non-negative integers into uppercase letter-based representations, so that numerical lab values can be expressed as sequences of letters. Each numeric value is first scaled and then converted into its corresponding letters via a predefined mapping. Finally, the corresponding lab ID is attached to each encoded value for the specified columns ('Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc').
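The TextEncoder implementation itself is not included in this card; the sketch below only illustrates the steps described above. The method names, the digit-to-letter mapping, the scaling factor, and the lab-ID prefix format are assumptions for illustration.

```python
# Illustrative sketch of the numeric-to-text encoding described above.
# Method names, letter mapping, scaling factor, and prefix format are assumptions.

LAB_COLUMNS = ['Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc']

class TextEncoder:
    LETTERS = "ABCDEFGHIJ"  # assumed mapping: one uppercase letter per digit 0-9

    def int_to_letters(self, value: int) -> str:
        """Convert a non-negative integer into an uppercase letter sequence."""
        if value < 0:
            raise ValueError("Only non-negative integers are supported.")
        return "".join(self.LETTERS[int(d)] for d in str(value))

    def scale_and_encode(self, value: float, scale: int = 10) -> str:
        """Scale a numeric value to an integer, then map it to letters."""
        return self.int_to_letters(round(value * scale))

    def encode_row(self, row: dict) -> str:
        """Attach each lab ID to its letter-encoded value, producing one text sequence."""
        tokens = [f"{lab}_{self.scale_and_encode(row[lab])}" for lab in LAB_COLUMNS]
        return " ".join(tokens)

# Example: one set of lab values becomes a letter-encoded text sequence.
encoder = TextEncoder()
print(encoder.encode_row({'Bic': 24.0, 'Crt': 1.1, 'Pot': 4.2, 'Sod': 140.0,
                          'Ure': 18.0, 'Hgb': 13.5, 'Plt': 250.0, 'Wbc': 7.8}))
```

Under these assumptions, a row of lab values becomes a single space-separated string such as "Bic_CEA Crt_BB ...", which is the text the model is trained on.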
Training Data Preprocessing
Column Selection: Numerical values are taken from the following lab columns: 'Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc'.
Text Encoding: The numeric values are encoded into text (see the sketch below).
Masking: 20% of the data is randomly masked during training.
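A brief sketch of this pipeline, assuming a pandas DataFrame of MIMIC-IV lab values and reusing the TextEncoder and LAB_COLUMNS from the sketch above; the file name is hypothetical, and masking itself is applied at training time by the MLM data collator.

```python
# Preprocessing sketch: select the eight lab columns and encode each row as text.
# The CSV path is hypothetical; TextEncoder and LAB_COLUMNS come from the sketch above.
import pandas as pd

labs = pd.read_csv("mimic_iv_lab_values.csv")   # hypothetical MIMIC-IV lab-value extract
labs = labs[LAB_COLUMNS].dropna()               # column selection

encoder = TextEncoder()
texts = labs.apply(lambda row: encoder.encode_row(row.to_dict()), axis=1).tolist()
# `texts` is then tokenized; masking (20%) is handled by the data collator during training.
```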
Model Output
The model outputs predictions for the masked values during training; these predictions are in the same encoded-text space as the inputs.
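As an illustration of querying the model, a hedged fill-mask sketch is shown below. The checkpoint name is hypothetical, the masked input follows the encoding format assumed in the earlier TextEncoder sketch, and the predictions come back as encoded text that must be mapped back to numeric values separately.

```python
# Inference sketch: ask the model to fill in one masked, letter-encoded lab value.
# The checkpoint name is hypothetical; the input format follows the TextEncoder sketch above.
from transformers import pipeline

fill = pipeline("fill-mask", model="numericbert")   # hypothetical checkpoint name

masked_text = "Bic_CEA Crt_BB Pot_EC Sod_BEAA Ure_BIA Hgb_BDF Plt_CFAA Wbc_[MASK]"
for prediction in fill(masked_text):
    # Each prediction contains the predicted (encoded) token and its score.
    print(prediction["token_str"], prediction["score"])
```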
Limitations and Considerations
Numeric Data Representation: The model relies on a custom text representation of numeric data, which may not capture all of the complex patterns present in the original numeric values.
Training Data Source: The model is trained on MIMIC-IV numeric data, and its performance may be influenced by the characteristics and biases present in that dataset.
Contact Information
For inquiries or additional information, please contact:
David Restrepo [email protected] MIT Critical Data
License: MIT