How are the "primary languages" for a file extension decided? #5345

Sparkles-Laurel · 2021-04-28T08:24:27Z


Some file extensions can belong to several programming languages (like .h, which can be C, C++, or Objective-C). But sometimes source files are empty. How is their language detected then?

I ask because .cs files are usually C# files, but they are detected as Smalltalk when they are empty.

Answered by lildude


lildude · 2021-04-28T10:06:47Z

Maintainer

The "How Linguist Works" documentation details how each file is assessed in isolation by each of the strategies, in the order they appear in the list.

Essentially, Linguist works like a funnel: many languages go in at the top, and each step tries to whittle the list down to a single language. If more than one language remains at the end, it takes the first one, on the assumption that the classifier has determined it to be the most likely language based on the samples Linguist has (the last thing the classifier does, after assessing the content, is sort the languages by score).

Everything beyond the extension strategy relies on the content, which means that empty files sharing a common extension will most likely (there are a few exceptions) fall all the way through to the classifier, which can never make a good guess because it has nothing to work with.
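To make the funnel concrete, here is a minimal Python sketch of the idea. It is not Linguist's actual code (Linguist is written in Ruby): the names `detect`, `by_extension`, `by_toy_heuristic`, and `restrict`, as well as the extension table and the content rule, are all invented for illustration.

```python
from typing import Callable, List, Optional

# A strategy receives the file name, its content, and the current candidates,
# and returns the languages it thinks are still possible (maybe none).
Strategy = Callable[[str, str, List[str]], List[str]]

# Toy data invented for this sketch; these are NOT Linguist's real tables.
TOY_EXTENSIONS = {".cs": ["Smalltalk", "C#"], ".py": ["Python"]}


def restrict(langs: List[str], candidates: List[str]) -> List[str]:
    """Keep only languages still in play; no candidates yet means no restriction."""
    return [lang for lang in langs if not candidates or lang in candidates]


def by_extension(name: str, content: str, candidates: List[str]) -> List[str]:
    """Look only at the file name, never at the content."""
    for ext, langs in TOY_EXTENSIONS.items():
        if name.endswith(ext):
            return restrict(langs, candidates)
    return []


def by_toy_heuristic(name: str, content: str, candidates: List[str]) -> List[str]:
    """Hypothetical content rule: these keywords suggest C#."""
    if "using " in content or "namespace " in content:
        return restrict(["C#"], candidates)
    return []  # nothing matched, so this step narrows nothing


def detect(name: str, content: str, strategies: List[Strategy]) -> Optional[str]:
    candidates: List[str] = []
    for strategy in strategies:
        narrowed = strategy(name, content, candidates)
        if len(narrowed) == 1:
            return narrowed[0]     # funnel reached a single language
        if narrowed:
            candidates = narrowed  # keep the shorter list and continue
    # More than one language left: take the first, as described above.
    return candidates[0] if candidates else None


STRATEGIES = [by_extension, by_toy_heuristic]
print(detect("Program.cs", "using System;", STRATEGIES))
print(detect("empty.cs", "", STRATEGIES))
```

With this toy table, the non-empty file is narrowed to C# by the content rule, while the empty file gives the content rule nothing to match, so the first language left over from the extension step wins.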

The .cs extension is shared by C# and Smalltalk, so .cs files are subject to these heuristics:

https://github.com/github/linguist/blob/32ec19c013a7f81ffaeead25e6e8f9668c7ed574/lib/linguist/heuristics.yml#L127-L132

Because an empty file won't match either heuristic, it falls all the way to the classifier, which tries to tokenise the empty content and comes up with a score of zero for every language. It therefore returns the list of languages in the order it received them, in this case with Smalltalk first.
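The tie described here can be reproduced with a toy scorer. This is only a sketch under stated assumptions: the real classifier is trained on Linguist's samples, whereas the `TOY_WEIGHTS` table and the `classify` function below are made up, and the sketch relies on Python's stable sort to keep tied languages in their input order.

```python
from typing import Dict, List

# Invented token weights, purely illustrative; not derived from Linguist's samples.
TOY_WEIGHTS: Dict[str, Dict[str, float]] = {
    "C#": {"using": 1.0, "namespace": 1.0},
    "Smalltalk": {"ifTrue:": 1.0, "yourself": 1.0},
}


def classify(content: str, candidates: List[str]) -> List[str]:
    """Return the candidates sorted by descending score for this content."""
    tokens = content.split()  # an empty file yields no tokens at all
    scores = {
        lang: sum(TOY_WEIGHTS.get(lang, {}).get(tok, 0.0) for tok in tokens)
        for lang in candidates
    }
    # sorted() is stable, even with reverse=True, so languages with equal
    # scores keep the order in which they were passed in.
    return sorted(candidates, key=lambda lang: scores[lang], reverse=True)


print(classify("", ["Smalltalk", "C#"]))
print(classify("using System;", ["Smalltalk", "C#"]))
```

For the empty string every score is zero, so the output order is simply the input order. Swapping the input order swaps the result, which is why the outcome for an empty file says nothing about which language is more plausible.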

In short, Linguist generally doesn't have a concept of a primary language for an extension; when multiple languages are associated with the same extension, it relies on the content instead. There are a few exceptions, but this is the general rule.

Given an empty file in isolation, even a human wouldn't be able to do a better job without some additional context 😁.
