Dr. Joao Paulo Schwarz Schuler

schuler

AI & ML interests

artificial intelligence


Organizations

None yet

schuler's activity

reacted to AdinaY's post with 👍 4 days ago
Two AI startups, DeepSeek & Moonshot AI, keep moving in perfect sync 👇

✨ Last December: DeepSeek & Moonshot AI released their reasoning models on the SAME DAY.
DeepSeek: deepseek-ai/DeepSeek-R1
MoonShot: https://github.com/MoonshotAI/Kimi-k1.5

✨ Last week: Both teams published papers on modifying attention mechanisms on the SAME DAY AGAIN.
DeepSeek: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (2502.11089)
Moonshot: MoBA: Mixture of Block Attention for Long-Context LLMs (2502.13189)

✨ TODAY:
DeepSeek unveiled Flash MLA: an efficient MLA decoding kernel for NVIDIA Hopper GPUs, optimized for variable-length sequences.
https://github.com/deepseek-ai/FlashMLA

Moonshot AI introduces Moonlight: a 3B/16B MoE trained on 5.7T tokens using Muon, pushing the Pareto frontier with fewer FLOPs.
moonshotai/Moonlight-16B-A3B

What's next? 👀
reacted to onekq's post with 👀 4 days ago
Huge disappointment with Claude Sonnet 3.7 😞 Big performance regression. Worse than the June 2024 version. 👎
onekq-ai/WebApp1K-models-leaderboard

I'm sure this version improves on something, just not the thing my leaderboard measures. This proves the point that no model can be the best at everything.
posted an update 4 days ago
📢 Old Research Alert: Making Computer Vision Models Smaller & Smarter!

Years ago, I coded an optimization for the first layers of a convolutional neural network (computer vision) and never ended up posting it here. The optimization decreases the number of parameters while increasing accuracy. It relies on separating (branching) chromatic and achromatic information through the first layers of the network.
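
As a rough illustration of the branching idea, here is a minimal Free Pascal sketch. It is not the code from the repositories linked below; the papers use a Lab-style light/chroma conversion, so this simple average-and-difference split is only an assumed stand-in:

program LightChromaSplit;
var
  R, G, B: Single;        // input RGB pixel in [0..1]
  Light, CA, CB: Single;  // achromatic value + two chromatic values
begin
  R := 0.8; G := 0.4; B := 0.2;

  // Achromatic (brightness) information, fed to one branch of the network.
  Light := (R + G + B) / 3;

  // Chromatic information (color differences), fed to the other branch.
  CA := R - G;
  CB := B - (R + G) / 2;

  WriteLn('Light: ', Light:0:3, '  ChromaA: ', CA:0:3, '  ChromaB: ', CB:0:3);
end.

Each branch then runs its own smaller convolutional layers before the feature maps are merged deeper in the network, which is where the parameter savings come from.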

YouTube videos:
https://www.youtube.com/watch?v=u4vZZmBMFLw
https://www.youtube.com/watch?v=-BD293yqdKI

Source codes:
https://github.com/joaopauloschuler/two-branch-plant-disease
https://github.com/joaopauloschuler/two-path-noise-lab-plant-disease

Research papers:
https://www.researchgate.net/publication/361511874_Color-Aware_Two-Branch_DCNN_for_Efficient_Plant_Disease_Classification
https://www.researchgate.net/publication/355215213_Reliable_Deep_Learning_Plant_Leaf_Disease_Classification_Based_on_Light-Chroma_Separated_Branches

May the force be with you.
posted an update 12 days ago
🔮 GPT-3 implemented in pure Free Pascal!
https://github.com/joaopauloschuler/gpt-3-for-pascal

This implementation follows the GPT-3 Small architecture from the landmark paper "Language Models are Few-Shot Learners":
┌───────────────────────┐
│     Input Layer       │
├───────────────────────┤
│ Token & Positional    │
│     Embedding         │
├───────────────────────┤
│   12x Transformer     │
│      Blocks           │
│  - 12 heads           │
│  - 768 hidden dims    │
│  - 3072 intermediate  │
├───────────────────────┤
│   Output Layer        │
└───────────────────────┘

Clean Pascal Implementation
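// Adds the 12 transformer blocks of GPT-3 Small: 12 attention heads,
// 768 hidden dimensions and 3072 (= 4*768) intermediate dimensions each.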
for CntLayer := 1 to {Layers=}12 do
begin
  Result.AddTransformerBlockCAI(
    {Heads=}12, 
    {intermediate dimensions=}4*768, 
    {NoForward=}true, 
    {HasNorm=}true, 
    false
  );
end;

replied to their post 16 days ago

If you run into any roadblock while modifying an existing model with this optimization so that you can train the optimized model from scratch, please feel free to ask for help.

posted an update 19 days ago
📢 New Research Alert: Making Language Models Smaller & Smarter!

Thrilled to share the latest technical report demonstrating how to reduce language model parameters by 77% while maintaining performance.

The secret? Grouped pointwise convolutions. Yes, we brought a method from computer vision to the transformer arena.

🔑 Key Findings:
• 77% parameter reduction.
• Maintained model capabilities.
• Improved generalization.

Paper: https://www.researchgate.net/publication/388835829_SAVING_77_OF_THE_PARAMETERS_IN_LARGE_LANGUAGE_MODELS_TECHNICAL_REPORT
Code: https://github.com/joaopauloschuler/less-parameters-llm
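
To see where the savings come from, here is a small Free Pascal sketch that counts the weights of a pointwise (1x1) layer with and without grouping. It is illustrative only, not code from the repository above, and the channel sizes and group count are assumed for the example:

program GroupedPointwiseParams;

// Weight count of a pointwise (1x1) convolution mapping InCh input channels
// to OutCh output channels with Groups groups: each output channel only sees
// InCh div Groups inputs, so the weight count shrinks by the group factor.
function PointwiseParams(InCh, OutCh, Groups: Integer): Integer;
begin
  PointwiseParams := (InCh div Groups) * OutCh;
end;

begin
  WriteLn('Standard pointwise (1 group):  ', PointwiseParams(768, 3072, 1));
  WriteLn('Grouped pointwise (16 groups): ', PointwiseParams(768, 3072, 16));
end.

With 16 groups, each output channel connects to only 48 of the 768 inputs, so this layer alone needs 16x fewer weights; the technical report shows how to apply this kind of grouping inside a language model while keeping its capabilities.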