big final models do (mostly flux in oss, i think sd3.5 has a bit but not nearly as strong?)
most random pony or sdxl loras arent though, none of the trainers support it and its all hidden in research codebases that are impossible to use
in a somewhat similar vein (not rly though), has anyone over there experimented with taking a current encoder arch (ie modernbert), ripping out the transformers, replacing them with something like mamba2/griffin temporal mixing layers, then distilling the original model onto it? seems like it could be a lot less lossy than a straight static embedding layer but still better complexity-wise than self-attention
i was trying this earlier but the first shape error in the forward pass made me give up 😭
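fwiw, here's roughly the scaffolding i'd expect, as a totally untested sketch: i'm using plain bert-base instead of modernbert (because i'm sure of its `encoder.layer` / `.attention` module layout) and a GRU as a stand-in for a mamba2/griffin temporal mixing layer (those need extra deps). the point is just swapping the attention blocks and distilling the teacher's hidden states onto the student with MSE:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RecurrentMixer(nn.Module):
    """drop-in replacement for a self-attention block.
    a GRU stands in here for a mamba2/griffin temporal mixing layer."""
    def __init__(self, hidden_size):
        super().__init__()
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states, *args, **kwargs):
        mixed, _ = self.rnn(hidden_states)
        # bert's attention modules return a tuple, so mimic that to avoid unpack errors
        return (self.norm(hidden_states + mixed),)

teacher = AutoModel.from_pretrained("bert-base-uncased").eval()
student = AutoModel.from_pretrained("bert-base-uncased")
for layer in student.encoder.layer:
    # rip out self-attention, keep embeddings and FFNs as they are
    layer.attention = RecurrentMixer(student.config.hidden_size)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distill_step(texts):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        t_hidden = teacher(**batch).last_hidden_state
    s_hidden = student(**batch).last_hidden_state
    loss = nn.functional.mse_loss(s_hidden, t_hidden)  # hidden-state distillation
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(distill_step(["a tiny smoke-test sentence", "another one"]))
```

if your shape error was a tuple-vs-tensor unpack inside the layer, the tuple-return convention above is probably the culprit.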
4gb can-only-tune-gpt2-small-locally represent 💪
hey i was just trying to clarify the misinformation about dropout (and, to be completely fair, scheduling a change in dropout probabilities like you would LR during training might be a new concept), idk what y'all are arguing about now
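by "scheduling" i just mean mutating the dropout p over training the same way an LR scheduler mutates the optimizer's lr, e.g. something like this (numbers are purely illustrative):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(16, 2))

def set_dropout(model, p):
    # nn.Dropout reads self.p at call time, so mutating it mid-training works
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = p

num_epochs = 10
for epoch in range(num_epochs):
    set_dropout(model, 0.5 * (1 - epoch / num_epochs))  # linear decay, like an LR schedule
    # ... run the normal training loop for this epoch ...
```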
Perhaps my tone is too harsh, apologies.
Regardless, you really should've known this before spouting misinformation about pretty commonly known ML concepts. Unless you'd like to argue that everyone else is wrong about what dropout is
@TuringsSolutions Dropout noise is not static, it's randomized. According to the Hinton et al. paper that introduced the term:
On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present.
Refer to https://arxiv.org/abs/1207.0580.
Obviously this likely wasn't the first usage of a similar concept, but either you're right and literally every other developer ever has been lying about what dropout is, or you're lying to hype up your own nonsense jargon.
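and you can see the "not static" part for yourself in a couple lines of torch:

```python
import torch

drop = torch.nn.Dropout(p=0.5)  # a fresh module defaults to training mode
x = torch.ones(8)
print(drop(x))  # a random subset of units zeroed, survivors scaled by 1/(1-p)
print(drop(x))  # same input, a *different* random subset zeroed this time
```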
the transformers and llama.cpp impls were busted when you wrote this. even now l.cpp still doesn't implement SWA, so >4k context doesn't work. similar teething issues to llama 3, really; too early to say that anything is trash
you can try it urself if you just curl the endpoint, something like `curl 'https://huggingface.co./settings/hardware-items' -X PUT --data-raw '[{"sku":["CPU","AMD","Ryzen Zen 2 3000 (Ryzen 5)"],"mem":16,"num":-96417}]'`
weird. i wonder why it doesnt display properly
You don't either?
@nroggendorff
wth, i do on my end D:
i bet u aint got negative tflops though.
wow, i can't believe they finally figured out that LLMs are good at following patterns! /s
feels a bit disingenuous to try and claim that it's an "Open Cerebrum" to me? the entire point of cerebrum's work, from my perspective, is their dataset in the first place w/ its relatively small size, targeted concepts, and (presumably) human-written-ness (or at least it's what they imply). a collection of synthetic data from random datasets, even with care taken to filter things around, doesn't reaaaally feel very close to me?
regardless, nice work! even if it's not an exact replication in my book it could always be useful for something