How are the "primary languages" for a file extension decided? #5345
Some file extensions can belong to different programming languages (like I ask because
Replies: 1 comment
How Linguist Works details how files are assessed in isolation through each of the strategies, in the order they appear in the list.

Essentially Linguist works like a funnel: a lot of languages go in the top and it tries to whittle the list down to one language at each step. If it gets to the end and there is still more than one language, it takes the first, on the assumption that the classifier has determined that to be the most likely language based on the samples Linguist has (the final thing the classifier does is sort the languages by score after assessing the content).

Everything beyond the extension strategy relies upon the content, which means empty files that share a common extension are most likely (there are a few exceptions) to fall all the way through to the classifier, which is never going to be able to make a good guess as it has nothing to work with.

In the case of As an empty file won't match either, it'll fall all the way to the classifier, which will try to tokenise the empty content and come up with a score of zero for all languages, so it returns the list of languages in the order it received them, in this case Smalltalk first.

So in short, generally speaking, Linguist doesn't have a concept of a primary language for an extension, but rather relies upon the content when there are multiple languages associated with the same extension. There are a few exceptions, but this is the general rule. Given an empty file in isolation, even a human wouldn't be able to do a better job without some additional context 😁.
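
To make the funnel a bit more concrete, here is a rough Python sketch of the idea. It is not Linguist's actual code (Linguist is written in Ruby and has more strategies than this). The `.ext` extension, the `OtherLang` language, the regex patterns and the token sets are all made up for illustration; only the shape of the process (narrow the list at each step, stop when one language is left, otherwise take the first result from the classifier) follows the description above.

```python
import re
from typing import Dict, List, Optional, Set

# Hypothetical data: ".ext" and "OtherLang" are placeholders, not real Linguist entries.
EXTENSIONS: Dict[str, List[str]] = {".ext": ["Smalltalk", "OtherLang"]}
# Made-up content patterns standing in for content-based heuristics.
HEURISTICS: Dict[str, str] = {
    "Smalltalk": r"\bself\b",
    "OtherLang": r"\$\w+\$",
}
# Made-up token sets standing in for the classifier's training samples.
TRAINING_TOKENS: Dict[str, Set[str]] = {
    "Smalltalk": {"self", "ifTrue", "yourself"},
    "OtherLang": {"group", "template"},
}


def by_extension(name: str, content: str, candidates: List[str]) -> List[str]:
    """Start the funnel: every language registered for the extension."""
    for ext, langs in EXTENSIONS.items():
        if name.endswith(ext):
            return list(langs)
    return candidates


def by_heuristics(name: str, content: str, candidates: List[str]) -> List[str]:
    """Keep only languages whose pattern matches the content."""
    matched = [lang for lang in candidates if re.search(HEURISTICS[lang], content)]
    return matched or candidates  # nothing matched: pass the list through unchanged


def by_classifier(name: str, content: str, candidates: List[str]) -> List[str]:
    """Score each language against the content and sort, best first."""
    tokens = set(re.findall(r"\w+", content))
    # sorted() is stable, so languages with equal scores keep their incoming order.
    return sorted(
        candidates,
        key=lambda lang: len(tokens & TRAINING_TOKENS[lang]),
        reverse=True,
    )


def detect(name: str, content: str) -> Optional[str]:
    candidates: List[str] = []
    for strategy in (by_extension, by_heuristics, by_classifier):
        candidates = strategy(name, content, candidates)
        if len(candidates) == 1:  # the funnel narrowed to one language: done
            break
    return candidates[0] if candidates else None


print(detect("empty.ext", ""))                  # empty file: all scores tie
print(detect("page.ext", "$name$ template"))    # a heuristic decides
```

With an empty file neither pattern matches and every score is zero, so the stable sort hands back the list in the order it arrived and the first entry wins. That is the same effect as described above: the result reflects list order, not any notion of a "primary" language.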