Releases: ml4ai/skema
ASKEM_SKEMA_Milestone_12
ASKEM SKEMA Milestone 12 release.
- Code2FN
- Support to automatically ingest library interface (url and multi-file ingestion)
- Added support for dependency generation
- module_dependencies field to GrometFNModuleCollection
- module_location script to automatically extract and locate Python dependencies
- Major progress in porting Python AST front-end to tree-sitter.
- Text Reading
- Implementation of encoder-based scenario context engine
- Adapted instruction-tuned T5 model to extract time and location scenario context.
- Improved the sieve grounder with a cross-platform neural model
- Explored use of LLM-derived data augmentation to improve training data quality.
- Eqn Reading
- Added support for exporting MathExpressionTree.
- Added support for physics symbols.
- ISA
- Completion of ISA workflow endpoint.
- MORAE
- Implemented support for exporting "generalized" AMR.
- Added support for nonlinear differential equation extraction and representation.
- MOVIZ
- Added support for navigating "up" the containment/parent hierarchy within the Function Network display.
ASKEM_SKEMA_Milestone_11
ASKEM SKEMA Milestone 11 release.
- Code2FN
- Ingest all of V3 of the CISM model code base
- Improved Fortran tree-sitter front-end
- In-progress migration of Python AST front-end to tree-sitter
- Includes support for handling common Python 2 idioms
- Extension of Gromet CAST and FN schema to support Gotos
- Text Reading
- Updated the core pipeline to use the updated NLP processors transformer backend model, improving runtime and reducing memory requirements.
- Incorporated sieve-based DKG grounding module
- Implemented transformer-based model of location and scenario context
- Eqn Reading
- Numerous math idiom extensions to support physics equations common to climate and space weather.
- Added support for representing and serializing the minimal-typed DECAPODEs representation.
- ISA
- Implemented support for equation-to-equation alignment
- Started implementation of ISA with MathExpressionTree data structures
- Implemented ISA API endpoint
- Started work on equation and code alignment.
- MORAE
- Dynamics linespace identification
- LLM-assisted Code2AMR
- AMR-enrichment – using execution to derive parameter values from expression tree evaluation
- Multiple MORAE API endpoints
- MOVIZ
- Added URL-based file launching
- Improved network layout
- Added visual indicator of missing ports when wire exists
- Added framework for reverse reference to JSON FN linking
- Added tooltips for extra information per box
- Integrated metadata display
ASKEM_SKEMA_Milestone_10
ASKEM SKEMA Milestone 10 release.
- Code2FN
- Extracting Climate Models
- ClimLab progress
- CISM Halfar model code extraction
- tree-sitter SKEMA front-end framework
- Fortran support
- added pre-processor to convert fixed-form to free-form
- added support for compound-conditionals
- Python support
- started migration of PyAST to tree-sitter front-end
- ported: assignments, arithmetic operations, function definitions
- Matlab support
- Using SV2AIR3-Waterloo model as development target for ingestion
- added support for: assignments, binary operators, conditionals
- added unit tests
- Comment extraction
- replaced Rust comment extraction with tree-sitter
- current support: C, Cpp, Fortran, Python, Matlab, R
- GroMEt Generation
- refactored Function Call and Primitive Function Call handler
- Execution Framework
- initial support for Python built-in primitive operators and types
- track symbols throughout execution, returning history of values
- Infrastructure and bug fixes
- sync'd metadata and GroMEt schema versions
- added 'Debug' metadata entry to GroMEt for error logging
- increased unit test coverage
- tree-sitter comment extractor, parser build-tool, code2fn endpoints, CAST generation, GroMEt generation, execution engine
- Text Reading
- Improved grounding, transitioning from static word embeddings to contextualized word embeddings fine-tuned to domain annotations.
- Integration with DBK grounding annotations for epidemiology and climate.
- Evaluated performance improvement, before and after fine-tuning
- DistilBERT: 60.08 (5.85) --> 74.29 (3.07) MMR
- SPECTER: 59.64 (5.55) --> 73.71 (3.11) MMR
- Extracting relevant NLP annotations including temporal context
- Integrated with from-scratch re-implementation of Processors
- METAL
- Version 2 transformer model for contextualized embedding for linking adapted for climate domain
- Collected code repositories for training and testing
- Generated automated comments using GPT4 (whose quality is considerably higher than GPT3.5)
- Implemented first end-to-end evaluation in two settings:
- searching only within the file that contains the code snippet
- searching across large index over the entire corpus
- Conducted ablation study
- Eqn Reading
- Improved support for equation image, using data from University of Wisconsin
- Cleaned UWisc corpus and annotated
- Improved handling of plain-text within equations
- pMML2AMR pipeline
- Improved parser
- added support for subscripts, unicode for Newtonian derivative syntax
- improved support for AMR (e.g., infix expressions)
- added support for representing and serializing Decapodes
- handle Halfar equation
- ISA
- Incorporate extractions from Text Reading module to seed alignments
- Code refactoring
- MORAE
- Improvements to Code2AMR pipeline
- bug fixes
- developed test suite of synthetic data generated by GPT3.5
- test suite: synthetic test suite, SIDARTHE, CHIME_SIR, SEIRD Hackathon S1, Simple_SIR
- MOVIZ
- Added interface to display JSON
- Highlighting between JSON and FN views
- Improvements to FN layout algorithm, scaling to handle boxes with larger numbers of content elements
- Deployed MOVIZ client, which allows local upload of FN JSON files.
ASKEM_SKEMA_Milestone_9
ASKEM SKEMA Milestone 9 release.
- Code2FN
- Python idiom support
- nested functions (function closures)
- recursively called functions
- TS2CAST Fortran front-end developments
- preprocessor (id unsupported idioms, identify missing include files, fixing unsupported '&' line continuation character)
- compiler directives using GCC pre-processor
- derived types (classes/structs) as FN Records
- representing program, module and "outside" code in FN module namespaces
- handling Fortran 'contains'
- Initial support for tree-sitter-based MATLAB front-end
- Generalized JSON2GroMEt
- Additional GroMEt ingestion front-end
- source code comment to FN alignment
- bug fixes
- TextReading
- unified TA-1 metadata extractions library
- unified TA-1 text reading REST API
- updates to TR and Scenario Context extraction with initial support for climate and earth science domain
- added AMR linking utility to text extractions with scenario contexts; includes support for AMR Petri Net and RegNet
- bug fixes
- METAL
- METAL module with version 1 transformer model for contextualized embedding for linking adapted for the epidemiology domain
- development of synthetic epidemiology dataset
- METAL v1 with CodeBert backbone
- METAL v1 with GraphCodeBert backbone
- Eqn Reading
- new conversion service and REST API
- improvements to pipeline for generating data for training equation extraction model
- evaluation dataset cleanup
- service structure reorganization
- image2MathML model improvements
- service response time improvements
- MathML inspection and annotation GUIs
- new support for interpretation of presentation MathML to generate content MathML
- improvements to DECAPODES interpretation of dynamics equations
- ISA
- improved seed selection for seeded graph matching (SGM) algorithm
- variable name similarity measures
- expanded SGM method in graph matching
- MORAE
- improved support for model identification and extraction out of FN
- Eqn2PetriNet produces AMR PetriNet
- Eqn2RegNet produces AMR RegNet
- work on ABM representation
- MOVIZ
- updated MOVIZ to support point-and-click interaction for expanding and collapsing GroMEt boxes
- new layout mimics hand-drawn OmniGraffle representation of GroMEt FN
- demonstration client supports uploading of arbitrary GroMEt JSON files
- created live demo: https://ml4ai.github.io/moviz-client/#/
ASKEM_SKEMA_Milestone_8
ASKEM SKEMA Milestone 8 release. This includes:
- Code2FN
- TS2CAST Fortran front-end (tree-sitter based, version 1)
- Supports ingest of TIE-GCM cpktkm.F and cons.F, producing GrometFNModuleCollection
- handling continuation lines: '|' and '&'
- variable declaration and literal value creation
- single and multiple dimension array declaration, get, set and slice
- subroutine and function definition and calls
- primitive operators
- do loop (Fortran idiom similar to Python for-loop)
- if, else, else-if support
- Updates to FN Loop and Conditional representation
- removed explicit Loop/Conditional box wiring
- fixed handling of for-loop iterator loop condition test
- Handling compound conditions
- Improved support for comprehensions
- Support for functions as first-class objects
- bookkeeping of symbol table and variable environment: functions, records (classes) and variables
- CAST updates
- generalization of LiteralValue, removing specific types
- generalization of operator
- porting of cast_to_agraph.py
- Progress on GroMEt FN Execution Engine
- implemented algorithm to walk FN graph in execution order
- developed v1 execution framework primitive operator set
- API and infrastructure improvements
- front-end determines language type based on file extension
- FN diff utility
- General refactoring and name cleanup
- TextReading
- Version 1 of automated code comment linking
- Added additional grounding mechanisms to TR pipeline
- gazetteer-based grounding and composable grounding pipeline
- delegate grounding to MIRA's web API
- Added support for additional input formats, in addition to COSMOS
- plain text
- grounding through web API
- Created docker file to build a docker image compliant with xDD
- Created (with MIT) library to read and write extractions in canonical JSON format
- Expanded support for initial mention linker to support multi-module GroMEt FN
- Exposed embedding grounding mechanism on TR web service
- METAL
- Added space weather ontology support
- METAL module development
- Data collection
- generate artificial annotated data using gpt-3.5-turbo
- extracted 514 repositories from GitHub, keeping only 115 with more than 2 stars
- extracted function and class definitions, creating 5,887 code fragments
- Model architecture
- two independent transformer encoder models initialized with CodeBERT
- Evaluation Plan
- token-level and span-level F1-score
- Eqn Reading
- Improvements to Image2MathML pipeline
- improved train/test data generation from arXiv 2014-2018 corpus
- reprocessed eqn dataset
- Image2MathML model retrained, improving BLEU score to 0.95
- Translating Space Weather Equations to DECAPODES Wiring Diagrams
- ISA
- Improvements to equation conversion, including translation to canonical form
- Improvements to alignment visualization
- REST API development
- MORAE
- Improvements to FN-to-PetriNet translation
- Prototype support for edge extractions
- MOVIZ
- added optional JSON configuration specification interface to support drawing partially expanded FNs
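The Code2FN notes above mention that the front-end now determines the language type from the file extension. A minimal sketch of that idea follows; the mapping table and the `detect_language` function are hypothetical illustrations, not the SKEMA implementation, and only cover the two languages named in this release.

```python
from pathlib import Path

# Hypothetical extension table (an assumption for illustration only).
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".f": "fortran",
    ".f90": "fortran",
    ".f95": "fortran",
}


def detect_language(filename):
    """Return a language name for `filename`, or None if the extension is unknown."""
    # Lower-casing lets fixed-form Fortran files such as cons.F match ".f".
    suffix = Path(filename).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)


language = detect_language("cons.F")
```

Returning None for unknown extensions leaves the decision about how to handle unsupported files to the caller.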
ASKEM_SKEMA_Milestone_7
ASKEM SKEMA Milestone 7 release. This includes:
- Code2FN
- Added implicit conditional support for If statements.
- Fixed issue with Binary operations that use the same variable as both operands.
- Various Gromet formatting fixes such as removing extra new lines and consistent pathing format between different systems
- Developed API endpoint and example clients for Code2FN pipeline
- Developed script (single_file_ingester.py) to simplify Gromet generation for single files and code snippets
- Defined set of primitive operators and created framework for primitive execution
- Cleanup and rearranging of visitors in Code2FN Python to CAST and CAST to GroMEt steps.
- TextReading
- Integrated version 1 of the scenario context engine into the text reading pipeline, with support for location and temporal contexts. Highlights:
- Detection of specific dates, times, date ranges, and time ranges
- Detection of locations with different granularity levels: Abstract locations, countries, states/provinces, cities, organizations
- An efficient algorithm to associate parameter extractions with the candidate scenario context detections based on the proximity of occurrence
- METAL
- Improved the grounding mechanism to first consider the DKG concepts relevant to parameter extractions.
- Extended support of metadata alignment to extractions of collections of documents of arbitrary length.
- Updated the code to support metadata alignment on gromets with multiple modules
- Created a version of the metadata alignment based on contextualized embeddings (using SciBERT, though any other transformer model works as a drop-in replacement).
- ISA
- The conversion from MathML to graph representation has been preliminarily completed, including dealing with basic operations, parentheses, and arithmetic priority
- The conversion from graph representation to adjacency matrix is completed.
- The preliminary development of the structural alignment between equation graphs returns the matching ratio and the closest matched term/variable pairs between two equations.
- Based on alignment results, a method is proposed to facilitate the identification of similarities and differences between models presented in two papers, or the investigation of code implementation issues
- MORAE
- Target format of extraction changed from BiLayer to PetriNet in py-acset form
- Developed code to convert GroMEt to a graph database representation, allowing structural and dataflow queries to isolate and extract code roles.
- Thin-thread pipeline to TA-2 developed and tested at the Hackathon/Evaluation
- Basic integration into Terarium via REST API
- MOVIZ
- Added support for GroMEt class features that were added in last release
- Alterations to layout and visual design to better match hand-layout examples
- Added labels over original hand-layout examples to better support debugging
- Added interface for direct file GroMEt upload
- MOVIZ demonstrated to run locally on non-visualization SKEMA team machines to aid debugging and exploration of GroMEt function networks.
- MOVIZ demonstration on epidemiology model kernels.
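The ISA notes above describe converting an equation's graph representation to an adjacency matrix. The sketch below illustrates that step on a small expression graph; the node labels, the edge list, and the `adjacency_matrix` function are assumptions made for illustration and do not reflect SKEMA's actual data structures.

```python
def adjacency_matrix(nodes, edges):
    """Build a directed 0/1 adjacency matrix from node labels and (src, dst) edges."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
    return matrix


# Hypothetical graph for the expression y = a * (b + c):
# each operator node points to its operands.
nodes = ["=", "y", "*", "a", "+", "b", "c"]
edges = [
    ("=", "y"), ("=", "*"),
    ("*", "a"), ("*", "+"),
    ("+", "b"), ("+", "c"),
]
matrix = adjacency_matrix(nodes, edges)
```

Nesting the `+` node under `*` is how this toy graph encodes the parentheses; two such matrices are the kind of input a graph-matching step could compare.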
ASKEM_SKEMA_Milestone_6
ASKEM SKEMA Milestone 6 release. This includes:
- Adding support for implicit conditional checks in while loop conditions.
- Many Python data types may be interpreted as having a Boolean value. For example, if x is a list, then in the code 'while x:', the condition will evaluate to True if the list has elements, otherwise False. The Python2CAST translation now treats condition expression trees that do not have a Boolean operator at the head as implicitly wrapped in a Python bool() function call.
- Python permits lazy variable declaration. This means a new variable identifier may be declared within a conditional branch (must be declared in any branch of a conditional). These are now handled.
- This particularly caused issues when the introduced variables were used in keyword arguments in function calls within conditionals.
- Adding support for keyword only (kwonly) arguments
- These are arguments in a function definition that come after a '*' in the arguments list
- Improved support of the user module imports.
- Added functionality to better read user modules. This also fixes an issue where user modules couldn't be read in certain instances.
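The Python idioms named in this release can be illustrated with a short example. This is a sketch of the source-level constructs only, not of the CAST or GroMEt output; the function names are hypothetical.

```python
def drain(queue):
    """Pop items until the list is empty."""
    seen = []
    # Implicit conditional: a non-empty list is truthy, so this behaves
    # like `while bool(queue):`.
    while queue:
        seen.append(queue.pop(0))
    return seen


def scale(value, *, factor=1.0):
    """`factor` follows the bare `*`, so it can only be passed by keyword."""
    return value * factor


def signed(n):
    """`sign` is first introduced inside a conditional branch (lazy declaration)."""
    if n < 0:
        sign = -1.0
        # The lazily declared variable is used as a keyword argument.
        return scale(abs(n), factor=sign)
    return scale(n)
```

Calling `scale(3, 2.0)` raises a TypeError because `factor` is keyword-only.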
ASKEM_SKEMA_Milestone_5
ASKEM SKEMA Milestone 5 release. This includes:
- migration of Program Analysis pipeline from AutoMATES to the SKEMA repository
- Record (class/struct) inheritance and calls to super
- identifying Record attribute (field) introduction outside of constructor (init)
- support for general slicing
- primitive support for raise
- initial support for Ellipsis, as used in numpy slicing
- Support to ingest bucky_v2 code base
ASKEM_SKEMA_Dec_2022_Demo
Release of SKEMA for the ASKEM December 2022 Demo.