
Zorse #642

Open
slh1109 opened this issue Apr 17, 2024 · 5 comments
Labels
2-annual-review Annual Review for a Project or Working Group

Comments


slh1109 commented Apr 17, 2024

Project description

This project aims to collect a dataset of production COBOL and associated mainframe languages (JCL, REXX, PL/I) which Large Language Models (LLMs) can be fine-tuned on. It also aims to develop an evaluation suite to measure LLMs' ability to comprehend, explain, and write COBOL. This project will:

  • Improve the utility of LLMs in the mainframe domain, helping engineers maintain mainframe applications
  • Create an industry-standard benchmark to track LLM performance on COBOL over time

Dataset

The dataset should be composed of high-quality, permissively licensed COBOL code. The code should be representative of production COBOL applications and should be cleaned of any personally identifiable information (PII).
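As one illustration of the cleaning step, the sketch below shows a minimal regex-based PII scrub over a tree of COBOL sources. The patterns, file extensions, and directory names are illustrative assumptions, not part of this proposal; a production pipeline would need far more robust detection (names, account numbers, embedded test data, etc.).

```python
# Minimal sketch of a PII-scrubbing pass over collected COBOL sources.
# Patterns and paths are illustrative assumptions only.
import re
from pathlib import Path

PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED-SSN]"),          # US SSN-like numbers
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[REDACTED-EMAIL]"),  # email addresses
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[REDACTED-CARD]"),        # card-like digit runs
]

def scrub(text: str) -> str:
    """Replace PII-looking spans with placeholder tokens."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

def scrub_tree(src_dir: str, out_dir: str) -> None:
    """Scrub every COBOL source file under src_dir into a parallel tree under out_dir."""
    src, out = Path(src_dir), Path(out_dir)
    for path in src.rglob("*"):
        if path.suffix.lower() in {".cbl", ".cob", ".cpy"}:
            target = out / path.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(scrub(path.read_text(errors="ignore")))

if __name__ == "__main__":
    scrub_tree("raw_cobol", "cleaned_cobol")
```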

Evaluation Suite

The evaluation suite should comprise a series of tasks that quantitatively measure an arbitrary LLM's ability to read and write COBOL. BloopAI's COBOLEval benchmark can be used as a foundation for the suite; it is a translation into COBOL of OpenAI's widely used HumanEval benchmark.
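For a sense of how such a harness could check a single task, the sketch below compiles a model-generated COBOL solution with GnuCOBOL (cobc), compares its output to an expected result, and scores pass@1 across tasks. The file names, the -free source-format flag, and the expected-output convention are assumptions for illustration only; COBOLEval defines its own task format and test harness.

```python
# Hypothetical sketch of a HumanEval-style check: compile a model-generated
# COBOL solution with GnuCOBOL and compare its stdout to an expected result.
# Task format and scoring here are illustrative assumptions, not COBOLEval's.
import subprocess
import tempfile
from pathlib import Path

def run_task(solution_source: str, expected_output: str) -> bool:
    """Return True if the generated COBOL program compiles and prints expected_output."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "solution.cbl"
        exe = Path(tmp) / "solution"
        src.write_text(solution_source)
        compiled = subprocess.run(
            ["cobc", "-x", "-free", "-o", str(exe), str(src)],
            capture_output=True, text=True,
        )
        if compiled.returncode != 0:  # a compile failure counts as a failed sample
            return False
        run = subprocess.run([str(exe)], capture_output=True, text=True, timeout=10)
        return run.stdout.strip() == expected_output.strip()

def pass_at_1(results: list[bool]) -> float:
    """Fraction of tasks solved with a single generation per task."""
    return sum(results) / len(results) if results else 0.0
```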

Statement on alignment with Open Mainframe Project Mission and Vision statements

Enable the mainframe to be more consumable by developers with a transparent experience in leveraging the value propositions of the mainframe.

Are there similar/related projects out there?

None that we are aware of for mainframe languages. Software Heritage archives decommissioned software systems of all languages.

External dependencies (including licenses)

https://github.com/BloopAI/COBOLEval (MIT)

Sponsor from TAC

Joe Bostian

Proposed Project Stage

Sandbox

License and contribution guidelines

unknown

Current or desired source control repository

GitHub


Initial committers

tbd

Infrastructure requests

tbd

Communication channels

email, Google Docs, Zoom meetings


Website

none/tbd

Release methodology and mechanics

tbd

Social media accounts

none/tbd

Community size and any existing sponsorship

Initial team of around a half dozen: John Mertic [email protected]; Ed Airey [email protected]; Elpida Tzortzatos [email protected]; Jim Porell [email protected]; Joseph Bostian [email protected]; Leonard Santalucia [email protected]; Louis Knight-Webb [email protected]; Per Kroll [email protected]; Venkatauday Balabhadrapatruni [email protected]; Goran Begic [email protected]; Gabriel Gordon-Hall [email protected]; Stephen Hodges [email protected]

@venkatzhub

My take:

  1. I do not think we would build or want to build an LLM from scratch for COBOL. There are a LOT of open source models out there that can spell COBOL to some extent. Leveraging one of those as a starting point / foundation for our needs is the way to go. bloop.ai has an open source model that has been trained on GnuCOBOL.
  2. The primary goal for this community IMHO is not to build a model - but to gather / create good quality IBM Enterprise COBOL code that can be used to fine-tune an open source model that we pick. This code should have clear ownership / provenance so that we can prove the model has been trained on code with the right IP and licenses (see the provenance sketch after this list). Models will evolve; the data and the mechanics of how to fine-tune/train a model will remain. Hence, the question that needs to be answered is: what is the path to getting quality Enterprise COBOL data good enough to fine-tune an existing LLM?
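To make point 2 concrete, here is one possible shape for a fine-tuning record that carries provenance metadata alongside the code, so the training corpus can be audited for ownership and licensing. The JSONL layout, field names, and example values are assumptions, not an agreed schema.

```python
# Illustrative sketch: attach provenance metadata to each fine-tuning sample
# so the training corpus can be audited for ownership and licensing.
# Field names and example values are assumptions, not an agreed schema.
import json

def make_record(code: str, source_path: str, license_id: str,
                contributor: str, origin_url: str) -> dict:
    return {
        "text": code,                    # the Enterprise COBOL source itself
        "meta": {
            "path": source_path,         # original member / file name
            "license": license_id,       # SPDX identifier, e.g. "Apache-2.0"
            "contributor": contributor,  # organization that donated the code
            "origin": origin_url,        # where the code was obtained
        },
    }

if __name__ == "__main__":
    record = make_record(
        code="IDENTIFICATION DIVISION. PROGRAM-ID. PAYROLL. ...",
        source_path="PAYROLL.cbl",
        license_id="Apache-2.0",
        contributor="Example Corp",
        origin_url="https://example.com/code-donation/123",
    )
    with open("zorse_train.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```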

@jmertic jmertic added the 1-new-project-wg New Project or Working Group application label Apr 30, 2024
@markbsigler

Agree with @venkatzhub

IBM has Granite LLMs under the Apache 2.0 license, but the training data is fairly limited, with only 727 COBOL programs compared to 4M+ C++ programs. Further, there is currently no coverage of PL/I, HLASM, REXX, JCL, et al.

@markbsigler

@venkatzhub I'm reading your email but will reply here to maintain the trail.

IBM references their Project CodeNet with a detailed spreadsheet on each language and the quantity of accepted submissions, and further notes that the code was sourced from two Japanese coding-challenge websites. It's overwhelmingly C++ and Python.

@venkatzhub

Thanks @markbsigler!

@jmertic jmertic changed the title AI for mainframe: COBOL ML model creation Zorse Jun 26, 2024
@jmertic

jmertic commented Jul 25, 2024

Project approved on 2024-07-11

@jmertic jmertic added 2-annual-review Annual Review for a Project or Working Group and removed 1-new-project-wg New Project or Working Group application labels Jul 25, 2024