Replies: 6 comments
-
Sounds like a major undertaking. :) I recall that something like this may have been mentioned as a goal years ago. Even though it’s probably rather early now, if the idea is still in its infancy, below are some comments, questions, wishes and other random thoughts of mine on the matter. (I haven’t asked the opinions of other FIN-CLARIN people, but I don’t think there is anything very controversial here.) I hope this text isn’t too long for a GitHub issue comment. Although I’d be glad to get some comments, I don’t expect you to answer the questions at this stage; they just reflect what I’d find relevant and important in such a major change.
-
Thanks for all the input! Some comments:
This could be nice to have. Our goal is to allow users to upload their own corpora and search those in Korp. Discussions are upcoming, but we must make changes to the corpus chooser anyway and with a rewrite, I think we should at least make sure that the new/rewritten one is more extendable.
Not stored in a database; I agree that versioning is important. But otherwise, we don't know yet. I think either:
The final JSON format could be very similar to what we have now after running the configuration files. And of course, as you pointed out in 4., anyone with a Korp instance may do as they choose when it comes to generating the JSON, but it would be nice to have a solution that fits everyone.
Noted. The JSON itself will of course not be a problem in the end, but the code that creates the JSON has to be flexible enough to accept fields that are not in standard Korp.
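As a minimal sketch of what "flexible enough" could mean, the JSON-generating code might check only a few required fields and pass everything else through untouched. The field names and the `STANDARD_FIELDS` set below are assumptions made for illustration, not the actual Korp schema.

```python
import json

# Hypothetical set of fields known to "standard" Korp (an assumption).
STANDARD_FIELDS = {"id", "title", "description", "attributes"}
REQUIRED_FIELDS = {"id", "title"}


def build_corpus_json(config: dict) -> str:
    """Serialize a corpus configuration, keeping non-standard fields as-is."""
    missing = REQUIRED_FIELDS - config.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    # Fields outside STANDARD_FIELDS are passed through, not rejected.
    return json.dumps(config, ensure_ascii=False, sort_keys=True)


def extra_fields(config: dict) -> list:
    """List the site-specific fields, e.g. for logging or documentation."""
    return sorted(config.keys() - STANDARD_FIELDS)
```

With this approach a site-specific field such as a licence label would survive the conversion without the standard code having to know about it.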
Yes, this is something that must be included. We also have some attributes where the name is the same, but configuration differs.
Templating/inheritance is what comes to mind. Make an abstract corpus that the similar ones will inherit/copy and augment with the needed fields. We also need this.
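One way such inheritance could work is a recursive merge of a child configuration onto its parent. The sketch below is only an illustration: the `inherits` and `abstract` keys and the corpus names are hypothetical, not part of any existing Korp format.

```python
from copy import deepcopy


def merge(base: dict, override: dict) -> dict:
    """Return base updated with override, merging nested dicts recursively."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def resolve(name: str, configs: dict, _seen: tuple = ()) -> dict:
    """Resolve a corpus config by following its (hypothetical) 'inherits' key."""
    if name in _seen:
        raise ValueError(f"inheritance cycle at {name!r}")
    config = configs[name]
    # Bookkeeping keys are not part of the resulting configuration.
    own = {k: v for k, v in config.items() if k not in ("inherits", "abstract")}
    parent = config.get("inherits")
    if parent is None:
        return deepcopy(own)
    return merge(resolve(parent, configs, _seen + (name,)), own)
```

A concrete corpus would then list only its own title and the attributes that differ from the abstract one; everything else comes from the parent.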
I think we want to keep as much as possible to not make this task too big.
Interesting; both this and a faceted selection create the need for the backend to be able to answer more specific questions than "all corpora in mode X". The backend could keep the configurations in a simple in-memory database/document store, so that we can make queries. Dynamic modes seem easy to implement, but they should perhaps first be part of a larger discussion on how we actually want corpus selection to work in the future.
6 Custom attributes, controllers and such
We will define all custom functions/components in the frontend and refer to those with strings in the configuration. (This is the part that I will work on first, to clean up our current configuration files). It is of course possible to define the code as strings and store them in the backend configuration files, but that will probably just make things more complicated. If we need it, we can add flags to the components so that they are configurable. Something like:
To avoid situations where the components are very alike ("autocomplete25items", "autocomplete50items" etc.).
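A rough sketch of the idea, written in Python only for illustration (the real components would live in the frontend JavaScript): the configuration refers to a component by name and may pass options. The registry, the `max_items` option and the spec shape are all assumptions.

```python
def autocomplete(options: dict) -> str:
    """Stand-in for a frontend component; returns a description of itself."""
    max_items = options.get("max_items", 25)
    return f"autocomplete(max_items={max_items})"


# Hypothetical registry mapping the names used in configurations to components.
COMPONENTS = {"autocomplete": autocomplete}


def resolve_component(spec):
    """Accept either a bare name or {"name": ..., "options": {...}}."""
    if isinstance(spec, str):
        name, options = spec, {}
    else:
        name, options = spec["name"], spec.get("options", {})
    if name not in COMPONENTS:
        raise ValueError(f"unknown component: {name!r}")
    return COMPONENTS[name](options)
```

A single "autocomplete" component with a `max_items` option then replaces separately named variants of the same component.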
Yes! This is a goal, and should happen before we can call this done. I think it will simplify things if the attribute definitions and translations are closer to each other.
I agree that this would be nice and at least some metadata should be easy to allow translations for.
-
Thanks for your answers and comments! Your plans and ideas sound very promising to me. Here are some further thoughts of mine.
Sounds interesting: such a feature has also been our wish. We have thought it to be rather complicated to implement, but I suppose that moving corpus configuration to the backend would clear one hurdle. I’d like to hear more about that at some point, even though we aren’t using all the components perhaps required by your planned solution, such as Sparv.
Of these options, I personally might prefer YAML, if it’s easy enough to support all the desired configuration features in it. And if YAML input is supported, it might not be difficult to support JSON input with the same semantics, too. Configurations as (Python) code would probably be the most flexible approach, but I agree that some kind of a DSL would then be nice to cover at least most of the cases.
Right; that would work at least for corpus attributes. I’d also like to see some kind of templating within individual string values, so that in a corpus template, you could define something like
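One hypothetical form such string-value templating could take, sketched with Python's `string.Template`; the `${year}` placeholder and the corpus fields are assumptions for illustration, not an existing Korp feature:

```python
from string import Template


def expand_strings(value, variables: dict):
    """Recursively substitute ${name} placeholders in all string values."""
    if isinstance(value, str):
        # safe_substitute leaves unknown placeholders untouched.
        return Template(value).safe_substitute(variables)
    if isinstance(value, dict):
        return {k: expand_strings(v, variables) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_strings(v, variables) for v in value]
    return value


corpus_template = {
    "title": "Newspaper ${year}",
    "description": "Issues published in ${year}",
    "limited_access": False,
}
corpus_1990 = expand_strings(corpus_template, {"year": "1990"})
```

Here `corpus_1990["title"]` becomes "Newspaper 1990", while non-string values such as `limited_access` are passed through unchanged.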
I don’t know if that would be required by faceted corpus selection in the frontend if it were applied to the corpora of a specific mode whose configurations had already been retrieved from the backend. However, I think faceted queries might also be useful for those using the backend as a Web service, so I’d like to see that kind of a feature.
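To make the idea of faceted queries concrete, here is a minimal in-memory sketch. The corpus fields (`modes`, `language`) and the function name are assumptions; a real backend might use a document store instead of a plain list.

```python
def query_corpora(corpora: list, **facets) -> list:
    """Return the ids of corpora matching every given facet value."""

    def matches(corpus: dict) -> bool:
        for field, wanted in facets.items():
            actual = corpus.get(field)
            if isinstance(actual, (list, tuple, set)):
                # Multi-valued field: the wanted value must be one of them.
                if wanted not in actual:
                    return False
            elif actual != wanted:
                return False
        return True

    return [corpus["id"] for corpus in corpora if matches(corpus)]


corpora = [
    {"id": "news_fi", "modes": ["default", "parallel"], "language": "fi"},
    {"id": "news_sv", "modes": ["default"], "language": "sv"},
]
```

For example, `query_corpora(corpora, modes="default", language="fi")` selects only `news_fi`, whereas querying by mode alone corresponds to the current "all corpora in mode X".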
Sounds reasonable.
I think that on the one hand, it would be nice to have all the corpus configuration code in the same place (backend), but on the other hand, it would be better to have the JavaScript code of the components and functions where other JavaScript code is (frontend). Having them in the frontend might also make it easier to reuse code. And if a change in the frontend requires modifying the code of a component, you wouldn’t need to touch the configuration in the backend.
I think some kind of component parametrization would indeed be great.
I’ve long had the idea of supporting in corpus (and in particular corpus folder) configurations something like
name: {
  fi: "Sanomalehti",
  sv: "Tidning",
  en: "Newspaper",
},
but I’ve never got around to implementing it myself and I probably won’t do that now that the goal is to move configurations to the backend.
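Resolving such a translated name could be as simple as the following sketch, which also accepts a plain string so that existing untranslated names keep working. The function name and the fallback order are assumptions.

```python
def localized(value, lang: str, fallbacks: tuple = ("en",)) -> str:
    """Pick the label for lang from a plain string or a {lang: label} dict."""
    if isinstance(value, str):
        return value  # untranslated name: the same in every language
    for code in (lang, *fallbacks):
        if code in value:
            return value[code]
    # Last resort: any available translation, or an empty string.
    return next(iter(value.values()), "")


name = {"fi": "Sanomalehti", "sv": "Tidning", "en": "Newspaper"}
```

With this, `localized(name, "sv")` gives "Tidning", and a language without a translation falls back to the English label.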
-
(These questions might be more appropriate in the Korp backend repo, but as this issue is open here and as the backend issues seem rather inactive, I ask them here.)
What is the current state of moving Korp corpus configurations to the backend? I have noticed that the code in the By reading the frontend code (and We’d thus like to get at least a preview of what the corpus configurations now look like, in order to estimate the work we’ll have ahead in converting our corpus configurations to the new format. You perhaps have a script for converting the JavaScript configurations to the new format (have you?), but our extensions and the ways we have used JavaScript code to generate corpus configurations might mean more work in the conversion. If possible, we’d also like to see the backend code processing the configurations, to assess if we need to amend it somehow to cater for our needs. Moreover, it’s now a bit difficult for us to try to keep up with your frontend development code without being able to use the backend supporting corpus configurations.
However, I also understand that you might not wish to disclose code (or a configuration format) that is under heavy development and that might still change significantly. In contrast, I don’t think the possible lack or sparsity of documentation would be a major problem for us at this stage.
[EDITED: I just got information from Krister Lindén who had talked to Markus Forsberg that the Min Språkbank service being developed is not yet in a state that you would make it available to others. When it is ready enough, we’re interested in getting it, too, to see if or how we could adapt it to our needs.]
Last but not least: thank you for all the great work!
-
@MartinHammarstedt and @majsan, thank you for all the information via email and repositories! I now answer myself based on the answers I got from you, for future record and also for developers at other Korp sites. Please correct or amend me if I misunderstood something.
-
Here are some thoughts of mine on the current corpus configuration representation on the backend and some ideas or proposals for enhancements or additional (optional) features. This is once again rather a long comment, but I hope you bear with me.
A thought on specifying the containing modes and folders in a corpus configuration file
I was first a bit surprised at the fact that the modes and folders for each corpus are now specified in the configuration of the corpus and not that of the folder. I thought it would be faster to find the corpora in all the folders of a mode than to scan all corpora to find those which are to be included in a mode. However, I then realized that the current system allows adding a corpus as a single file, without having to modify any mode file.
Differences between the output of the conversion script and the current configurations
I noticed the following differences between the output of the corpus configuration conversion script and the current YAML configurations:
I think the current attribute definitions are in general simpler than those generated by the script. It also seemed to me that many commonly used groups of attributes were not grouped, such as those corresponding to
Did you convert the output of the conversion script to the current configurations step by step, with one-off scripts or one-liners, or how did you do that?
General thoughts regarding enhancements of the configuration format
I think any improvements to the corpus configuration format should be backward-compatible: the existing corpus configurations should continue to work as they are, so that you need not change your corpus configurations unless you wish to do so. I admit that I’m a bit hesitant to suggest enhancements to your carefully crafted configuration representation, but it was a bit different than I had expected. In particular, I’d be keen to reduce redundancy in both the YAML source and the resulting JSON and JavaScript objects, even though that means some extra processing.
I think I should be able to implement any changes myself but I’d like to get feedback from you (@majsan and @MartinHammarstedt and whoever else might be or may wish to be involved), in particular on such features that you have already considered and rejected for some reason. And please also tell me if some of my ideas conflict with your future plans. Many of the features would affect or require support in both the backend and frontend. I’ll list them here briefly but I can open new issues in the backend repository for more details of at least those that would primarily affect the backend.
Enhancement ideas
Then to my enhancement ideas. I have marked with a ⭐ those that I’d definitely like to have. Please let me know what you think.
-
Since it is complicated and time-consuming to add and especially to maintain corpora in the current setup, we want to move the configuration to the backend. Another benefit is that we will not have to rebuild the frontend when adding corpora.
Exactly how this will look is to be determined. Step 1 is to move as much code as possible out from the corpus configuration files.