ksecurity

ksecurity45

AI & ML interests

None yet

Recent Activity

replied to s-emanuilov's post about 2 months ago

Hey HF community! 👋 Excited to share Monkt - a tool I built to solve the eternal headache of processing documents for ML/AI pipelines.

What it does: Converts PDFs, Word, PowerPoint, Excel, web pages or raw HTML into clean Markdown or structured JSON.

Great for:
✔ LLM training dataset preparation
✔ Knowledge base construction
✔ Research paper processing
✔ Technical documentation management

It has API access for integration into ML pipelines. Check it out at https://monkt.com/ if you want to save time on document processing infrastructure. Looking forward to your feedback!

replied to as-cle-bert's post about 2 months ago

🎉 Early New Year releases 🎉

Hi HuggingFacers 🤗, I decided to ship early this year, and here's what I came up with:

PdfItDown (https://github.com/AstraBert/PdfItDown) - If you're like me and have your whole RAG pipeline optimized for PDFs but not for other data formats, here is your solution! With PdfItDown, you can convert Word documents, presentations, HTML pages, Markdown files and (why not?) CSVs and XMLs to PDF format, for seamless integration with your RAG pipelines. Built upon MarkItDown by Microsoft.
GitHub Repo 👉 https://github.com/AstraBert/PdfItDown
PyPI Package 👉 https://pypi.org/project/pdfitdown/

SenTrEv v1.0.0 (https://github.com/AstraBert/SenTrEv/tree/v1.0.0) - If you need to evaluate the retrieval performance of your text embedding models, I have good news for you 🥳🥳 The new release of SenTrEv now supports dense and sparse retrieval (thanks to FastEmbed by Qdrant) with text-based file formats (.docx, .pptx, .csv, .html, .xml, .md, .pdf) and new relevance metrics!
GitHub Repo 👉 https://github.com/AstraBert/SenTrEv
Release Notes 👉 https://github.com/AstraBert/SenTrEv/releases/tag/v1.0.0
PyPI Package 👉 https://pypi.org/project/sentrev/

Happy New Year and have fun! 🥂

Organizations

None yet

models

None public yet

datasets

None public yet