-
My current CPU is a Ryzen 7 1700X, my GPU is a Radeon 5700 XT 8GB, and I have 32GB of RAM. I'm running Tabby through Vulkan and I mainly code in Python and JavaScript.

I tried several models, and between low latency and answer accuracy the ones that worked best were StarCoder-3B and DeepSeekCoder-1.3B. I looked at the benchmarks of each model in their repos and saw that DeepSeekCoder outperforms StarCoder, but in my tests (creating an example JS file that can handle axios API calls and data-to-CSV conversion; a rough sketch is at the end of this post) StarCoder gave me more accurate suggestions while being a bit slower. I also checked the ML leaderboards provided by Tabby, and I saw that there are plans to add StarCoder2-3B to the Tabby model registry (not yet available when I checked).

I'm relatively new to using AI models to help me work, so maybe that is also one of the reasons for my doubts. I also tend to prefer OSS; I checked both licenses and they seem similar to me. Should I trust the benchmarks and use DeepSeekCoder-1.3B, or trust what I saw in my own tests and stick with StarCoder-3B? Or switch to StarCoder2-3B when it becomes available? For chat I'm currently using WizardCoder-3B. Benchmarks I checked:
Thanks.
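For reference, here is a stripped-down sketch of the kind of test file I asked the models to help complete. The endpoint URL and field handling are made-up placeholders, not a real API; the point was just to see how each model handled the axios call and the CSV conversion.

```js
// test-prompt.js - simplified version of the kind of file used in my tests.
// The API endpoint below is a placeholder, not a real service.
const axios = require("axios");
const fs = require("fs");

async function fetchRows() {
  // Fetch JSON data from a (hypothetical) REST endpoint; expects an array of objects.
  const response = await axios.get("https://example.com/api/users");
  return response.data;
}

function toCsv(rows) {
  if (!rows.length) return "";
  const headers = Object.keys(rows[0]);
  // Quote every value and escape embedded quotes so commas in fields don't break columns.
  const escape = (value) => `"${String(value).replace(/"/g, '""')}"`;
  const lines = rows.map((row) => headers.map((h) => escape(row[h])).join(","));
  return [headers.join(","), ...lines].join("\n");
}

async function main() {
  const rows = await fetchRows();
  fs.writeFileSync("output.csv", toCsv(rows));
  console.log(`Wrote ${rows.length} rows to output.csv`);
}

main().catch((err) => {
  console.error("Request failed:", err.message);
  process.exit(1);
});
```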
-
LLM evaluation is actually the most mystical part of the entire ecosystem, a combination of scientific measurement and intuitive feeling. :)

In general, we receive pretty good feedback on the performance of DeepSeekCoder-6.7B (which tops the leaderboard). We also see some mixed feedback between DeepSeekCoder-1.3B and StarCoder-3B. If a particular model runs better in your environment but ranks lower on the leaderboard, it's likely that your codebase and workflow are simply closer to that model's training data.

Anyway, our suggestion is to use the leaderboard as a reference and stick with the model that feels best to you. Lastly, FYI: for enterprise or team use cases, we do offer consultative support on fine-tuning models for better performance.
-
I'm interested in whether there are (or could be) different models for different sorts of user experience. One of the most annoying things (and there are many) about Copilot is that it tends to talk down to me and tries to explain everything as if I don't know it, and half the time it is hallucinating, or doesn't know but won't admit it. I think it's been vastly oversold - but maybe it really is worth $10 a month or whatever if you don't know how to program properly and have only written scripts in Python/JS. It's seriously annoying if you're someone like me who has spent 15 years working in a variety of languages, in particular in low-level and systems programming. Stupid comments like "Open the file" right before an open syscall actively waste my time removing them, and these models might be strong with Python, but I regularly have to tell them why what they've just written in C++ flat out won't build or is dangerous.

I'd very much like it if there were models biased towards experienced users, because Copilot treats me like a novice. I want to use a tool like this when I can't find the information easily through Google, when it's a complex problem with no simple solution and I really have to break it down and explain it in very small pieces so that it will understand, or when I want it to just shut up and do the boring work for me, not explain the code I sent it that I already understand very well because I wrote it - e.g. refactoring, or writing something I know how to do but want done faster than I can type. I understand Tabby is separate from the models, which take many man-hours to fine-tune, but do people think this sort of distinction is something that might come around in the future? I haven't looked into all the benchmarks, but I presume they mostly test accuracy and the ability to converse across a broad range of topics, not suitability for different classes of users.

I'm also concerned that some of these models have been trained on large sets of code (like all of GitHub), which makes sense obviously, but not all those repositories contain high-quality code. Even among the large and mature FOSS projects I use, or the proprietary code I've come across, very few I've worked on have what I personally would consider nearly enough meaningful comments or well-named variables. I spend a lot of time rewriting what the model has come up with. But that's another whole question for a different day 😅