Dr. Joao Paulo Schwarz Schuler

schuler

AI & ML interests

artificial intelligence


Organizations

None yet

schuler's activity

reacted to AdinaY's post with 👍 4 days ago
Two AI startups, DeepSeek & Moonshot AI, keep moving in perfect sync 👇

✨ Last December: DeepSeek & Moonshot AI released their reasoning models on the SAME DAY.
DeepSeek: deepseek-ai/DeepSeek-R1
MoonShot: https://github.com/MoonshotAI/Kimi-k1.5

✨ Last week: Both teams published papers on modifying attention mechanisms on the SAME DAY AGAIN.
DeepSeek: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (2502.11089)
Moonshot: MoBA: Mixture of Block Attention for Long-Context LLMs (2502.13189)

✨ TODAY:
DeepSeek unveiled Flash MLA: an efficient MLA decoding kernel for NVIDIA Hopper GPUs, optimized for variable-length sequences.
https://github.com/deepseek-ai/FlashMLA

Moonshot AI introduces Moonlight: a 3B/16B MoE trained on 5.7T tokens using Muon, pushing the Pareto frontier with fewer FLOPs.
moonshotai/Moonlight-16B-A3B

What's next? 👀
reacted to onekq's post with 👀 4 days ago
Huge disappointment with Claude Sonnet 3.7 😞 Big performance regression. Worse than the June 2024 version. 👎
onekq-ai/WebApp1K-models-leaderboard

I'm sure this version improves on something, just not the thing my leaderboard measures. This proves the point that no model can be the best at everything.
posted an update 4 days ago
📢 Old Research Alert: Making Computer Vision Models Smaller & Smarter!

Years ago, I coded an optimization for the first layers of a convolutional neural network (computer vision) and never ended up posting it here. The optimization decreases the number of parameters while increasing accuracy. It relies on separating (branching) chromatic and achromatic information through the first layers of the network.
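
As a rough illustration of the branching idea, here is a minimal Free Pascal sketch. It is not the code from the repositories linked below; the papers use a Lab-style light/chroma conversion, so this simple average-and-difference split is only an assumed stand-in:

program LightChromaSplit;
var
  R, G, B: Single;        // input RGB pixel in [0..1]
  Light, CA, CB: Single;  // achromatic value + two chromatic values
begin
  R := 0.8; G := 0.4; B := 0.2;

  // Achromatic (brightness) information, fed to one branch of the network.
  Light := (R + G + B) / 3;

  // Chromatic information (color differences), fed to the other branch.
  CA := R - G;
  CB := B - (R + G) / 2;

  WriteLn('Light: ', Light:0:3, '  ChromaA: ', CA:0:3, '  ChromaB: ', CB:0:3);
end.

Each branch then runs its own smaller convolutional layers before the feature maps are merged deeper in the network, which is where the parameter savings come from.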

YouTube videos:
https://www.youtube.com/watch?v=u4vZZmBMFLw
https://www.youtube.com/watch?v=-BD293yqdKI

Source codes:
https://github.com/joaopauloschuler/two-branch-plant-disease
https://github.com/joaopauloschuler/two-path-noise-lab-plant-disease

Research papers:
https://www.researchgate.net/publication/361511874_Color-Aware_Two-Branch_DCNN_for_Efficient_Plant_Disease_Classification
https://www.researchgate.net/publication/355215213_Reliable_Deep_Learning_Plant_Leaf_Disease_Classification_Based_on_Light-Chroma_Separated_Branches

May the force be with you.
posted an update 12 days ago
🔮 GPT-3 implemented in pure Free Pascal!
https://github.com/joaopauloschuler/gpt-3-for-pascal

This implementation follows the GPT-3 Small architecture from the landmark paper "Language Models are Few-Shot Learners":
┌───────────────────────┐
│     Input Layer       │
├───────────────────────┤
│ Token & Positional    │
│     Embedding         │
├───────────────────────┤
│   12x Transformer     │
│      Blocks           │
│  - 12 heads           │
│  - 768 hidden dims    │
│  - 3072 intermediate  │
├───────────────────────┤
│   Output Layer        │
└───────────────────────┘

Clean Pascal Implementation
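// Adds the 12 transformer blocks of GPT-3 Small: 12 attention heads,
// 768 hidden dimensions and 3072 (= 4*768) intermediate dimensions each.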
for CntLayer := 1 to {Layers=}12 do
begin
  Result.AddTransformerBlockCAI(
    {Heads=}12, 
    {intermediate dimensions=}4*768, 
    {NoForward=}true, 
    {HasNorm=}true, 
    false
  );
end;

replied to their post 16 days ago

If you run into any roadblock while modifying an existing model with this optimization so that you can train the optimized model from scratch, please feel free to ask for help.

posted an update 19 days ago
📢 New Research Alert: Making Language Models Smaller & Smarter!

Thrilled to share the latest technical report demonstrating how to reduce language model parameters by 77% while maintaining performance.

The secret? Grouped pointwise convolutions. Yes, we brought a method from computer vision to the transformer arena.

🔑 Key Findings:
• 77% parameter reduction.
• Maintained model capabilities.
• Improved generalization.

Paper: https://www.researchgate.net/publication/388835829_SAVING_77_OF_THE_PARAMETERS_IN_LARGE_LANGUAGE_MODELS_TECHNICAL_REPORT
Code: https://github.com/joaopauloschuler/less-parameters-llm
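
To see where the savings come from, here is a small Free Pascal sketch that counts the weights of a pointwise (1x1) layer with and without grouping. It is illustrative only, not code from the repository above, and the channel sizes and group count are assumed for the example:

program GroupedPointwiseParams;

// Weight count of a pointwise (1x1) convolution mapping InCh input channels
// to OutCh output channels with Groups groups: each output channel only sees
// InCh div Groups inputs, so the weight count shrinks by the group factor.
function PointwiseParams(InCh, OutCh, Groups: Integer): Integer;
begin
  PointwiseParams := (InCh div Groups) * OutCh;
end;

begin
  WriteLn('Standard pointwise (1 group):  ', PointwiseParams(768, 3072, 1));
  WriteLn('Grouped pointwise (16 groups): ', PointwiseParams(768, 3072, 16));
end.

With 16 groups, each output channel connects to only 48 of the 768 inputs, so this layer alone needs 16x fewer weights; the technical report shows how to apply this kind of grouping inside a language model while keeping its capabilities.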