Model Details
Model Name: NumericBERT
Model Type: Transformer
Architecture: BERT
Training Method: Masked Language Modeling (MLM)
Training Data: MIMIC-IV lab values data
Training Hyperparameters:
Optimizer: AdamW
Learning Rate: 5e-5
Masking Rate: 20%
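The original training script is not included in this card; the following is a minimal sketch, assuming the Hugging Face Trainer (which uses AdamW by default) with the hyperparameters listed above. The base checkpoint, batch size, epoch count, and the encoded_dataset object are assumptions.

```python
# Minimal training sketch using the hyperparameters above (AdamW, lr 5e-5, 20% masking).
# Base checkpoint, batch size, epochs, and the dataset object are assumptions.
from transformers import (
    BertForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = BertForMaskedLM.from_pretrained("bert-base-uncased")    # placeholder base model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder tokenizer

# Mask 20% of tokens for masked language modeling, matching the masking rate above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.2)

args = TrainingArguments(
    output_dir="numericbert",
    learning_rate=5e-5,              # Trainer's default optimizer is AdamW
    per_device_train_batch_size=32,  # assumed batch size
    num_train_epochs=3,              # assumed epoch count
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=encoded_dataset,   # hypothetical tokenized MIMIC-IV lab-value dataset
)
trainer.train()
```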
Tokenization
Tokenizer: Custom numeric-to-text mapping using the TextEncoder class
Text Encoding Process:
The encoder converts non-negative integers into uppercase letter-based representations, so that numerical lab values can be expressed as sequences of letters. Each numeric value is first scaled and then converted into its corresponding letters via a predefined mapping. Finally, the corresponding lab ID is attached to each encoded value for the specified columns ('Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc').
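The TextEncoder implementation itself is not included in this card; the sketch below only illustrates the steps described above. The method names, the digit-to-letter mapping, the scaling factor, and the lab-ID prefix format are assumptions for illustration.

```python
# Illustrative sketch of the numeric-to-text encoding described above.
# Method names, letter mapping, scaling factor, and prefix format are assumptions.

LAB_COLUMNS = ['Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc']

class TextEncoder:
    LETTERS = "ABCDEFGHIJ"  # assumed mapping: one uppercase letter per digit 0-9

    def int_to_letters(self, value: int) -> str:
        """Convert a non-negative integer into an uppercase letter sequence."""
        if value < 0:
            raise ValueError("Only non-negative integers are supported.")
        return "".join(self.LETTERS[int(d)] for d in str(value))

    def scale_and_encode(self, value: float, scale: int = 10) -> str:
        """Scale a numeric value to an integer, then map it to letters."""
        return self.int_to_letters(round(value * scale))

    def encode_row(self, row: dict) -> str:
        """Attach each lab ID to its letter-encoded value, producing one text sequence."""
        tokens = [f"{lab}_{self.scale_and_encode(row[lab])}" for lab in LAB_COLUMNS]
        return " ".join(tokens)

# Example: one set of lab values becomes a letter-encoded text sequence.
encoder = TextEncoder()
print(encoder.encode_row({'Bic': 24.0, 'Crt': 1.1, 'Pot': 4.2, 'Sod': 140.0,
                          'Ure': 18.0, 'Hgb': 13.5, 'Plt': 250.0, 'Wbc': 7.8}))
```

Under these assumptions, a row of lab values becomes a single space-separated string such as "Bic_CEA Crt_BB ...", which is the text the model is trained on.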
Training Data Preprocessing
Column Selection: Numerical values are taken from the following lab columns: 'Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc'.
Text Encoding: The numeric values are encoded into text (see the sketch below).
Masking: 20% of the data is randomly masked during training.
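A brief sketch of this pipeline, assuming a pandas DataFrame of MIMIC-IV lab values and reusing the TextEncoder and LAB_COLUMNS from the sketch above; the file name is hypothetical, and masking itself is applied at training time by the MLM data collator.

```python
# Preprocessing sketch: select the eight lab columns and encode each row as text.
# The CSV path is hypothetical; TextEncoder and LAB_COLUMNS come from the sketch above.
import pandas as pd

labs = pd.read_csv("mimic_iv_lab_values.csv")   # hypothetical MIMIC-IV lab-value extract
labs = labs[LAB_COLUMNS].dropna()               # column selection

encoder = TextEncoder()
texts = labs.apply(lambda row: encoder.encode_row(row.to_dict()), axis=1).tolist()
# `texts` is then tokenized; masking (20%) is handled by the data collator during training.
```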
Model Output
The model outputs predictions for the masked values during training; these predictions are in the same encoded-text space as the inputs.
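As an illustration of querying the model, a hedged fill-mask sketch is shown below. The checkpoint name is hypothetical, the masked input follows the encoding format assumed in the earlier TextEncoder sketch, and the predictions come back as encoded text that must be mapped back to numeric values separately.

```python
# Inference sketch: ask the model to fill in one masked, letter-encoded lab value.
# The checkpoint name is hypothetical; the input format follows the TextEncoder sketch above.
from transformers import pipeline

fill = pipeline("fill-mask", model="numericbert")   # hypothetical checkpoint name

masked_text = "Bic_CEA Crt_BB Pot_EC Sod_BEAA Ure_BIA Hgb_BDF Plt_CFAA Wbc_[MASK]"
for prediction in fill(masked_text):
    # Each prediction contains the predicted (encoded) token and its score.
    print(prediction["token_str"], prediction["score"])
```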
Limitations and Considerations
Numeric Data Representation: The model relies on a custom text representation of numeric data, which may not capture all of the complex patterns present in the original numeric values.
Training Data Source: The model is trained on MIMIC-IV numeric data, and its performance may be influenced by the characteristics and biases present in that dataset.
Contact Information
For inquiries or additional information, please contact:
David Restrepo [email protected] MIT Critical Data
License: MIT