Request for Dataset Processing Scripts #1

Open
TanmDL opened this issue Dec 5, 2024 · 6 comments

Comments

TanmDL commented Dec 5, 2024

Hello,

Your work is fantastic, and I truly appreciate the effort that went into it. However, I have a few questions about the processing of the two datasets. If you could share the scripts used for this, it would be very helpful for us. I assure you that I will properly cite your project.

Thank you!

Owner

AmayaGS commented Dec 11, 2024

Hi @TanmDL, really sorry it took me a while to answer you! Thanks very much for your interest in this work :)

All the scripts I used for the preprocessing are in the GitHub repo. I just updated the README file with instructions for running the code, including the preprocessing steps: tissue segmentation, patching and feature embedding. This may answer your questions, but if not, please let me know which specific step needs further explanation. Hope this helps!

Owner

AmayaGS commented Dec 11, 2024

Any thoughts you might have on making the README clearer for users would be very valuable to me, so please let me know if anything is unclear or could be explained in more detail :)

Author

TanmDL commented Dec 11, 2024

Thank you for your response. Could you please clarify which classification task was studied here? Initially, I thought it would involve classes like CD68 and CD138, but I found the following labels: label_dict: {'0': 'Pauci-Immune', '1': 'Lymphoid/Myeloid'}, which appear to represent subtypes. This has confused me. Could you please explain this part in more detail? Also, I want to add that the pipeline design is excellent. Thank you.

Owner

AmayaGS commented Dec 12, 2024

For the Rheumatoid Arthritis dataset I classified into inflammatory subtypes {'0': 'Pauci-Immune', '1': 'Lymphoid/Myeloid'}, and for the Sjogren dataset into absence or presence of Sjogren {'0': 'Not Sjogren', '1': 'Sjogren'}. However, depending on your data structure, I think you could use the code to target your stains as labels. If you give me more detail on that, I could suggest how to do it. For example, assuming you have multiple stains per patient and want to classify the stains, you could add a column to the patient_labels.csv file like so:

Patient_ID  Patient_stains  Patient_stains_numeric  label
Patient1    Patient1_CD68   Patient1.1_CD68         1
Patient1    Patient1_CD138  Patient1.2_CD138        2
Patient1    Patient1_CD20   Patient1.3_CD20         3
Patient1    Patient1_CD21   Patient1.4_CD21         4
Patient2    Patient2_CD68   Patient2.1_CD68         1
Patient2    Patient2_CD20   Patient2.3_CD20         3
Patient2    Patient2_CD21   Patient2.4_CD21         4

# Label/split configurations
labels:
  label: 'label'  # column name for target label
  label_dict: {'CD68': 1, 'CD138': 2, 'CD20': 3, 'CD21': 4}  # Stain type numeric coding dictionary
  n_classes: 4  # number of target classes
  patient_id: 'Patient_stains_numeric'  # column name for each unique file

# Parsing configurations
parsing:
  patient_ID: 'img.split("_")[0]'  # "Patient1.1_stain" -> Patient1.1
  stain: 'img.split("_")[1]'  # "Patient1.1_stain" -> stain
  stain_types: {'NA': 0, 'CD68': 1, 'CD138': 2, 'CD20': 3, 'CD21': 4}  # Stain types

Of course this is off the top of my head. I haven't tested it and it would depend on your file structure, but it should work.
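
To make the suggestion above concrete, here is a minimal sketch of how the two `img.split("_")` parsing expressions would behave and how rows shaped like the patient_labels.csv example could be generated from file names. It assumes file names of the form `Patient1.1_CD68`. The helpers `parse_image_name` and `build_label_rows` are hypothetical names used for illustration only; they are not functions from the repo, and like the config above this has not been run against the actual pipeline.

```python
import csv
import io

# Stain coding copied from the suggested config; 0 ("NA") is used for unknown stains.
STAIN_TYPES = {"NA": 0, "CD68": 1, "CD138": 2, "CD20": 3, "CD21": 4}


def parse_image_name(img):
    """Split 'Patient1.1_CD68' into (file id, stain, numeric label).

    Hypothetical helper mirroring the two split expressions in the
    parsing config; not part of the repository.
    """
    patient_id = img.split("_")[0]  # "Patient1.1_CD68" -> "Patient1.1"
    stain = img.split("_")[1]       # "Patient1.1_CD68" -> "CD68"
    label = STAIN_TYPES.get(stain, STAIN_TYPES["NA"])
    return patient_id, stain, label


def build_label_rows(image_names):
    """Build one row per image, using the column names from the example table."""
    rows = []
    for img in image_names:
        file_id, stain, label = parse_image_name(img)
        patient = file_id.split(".")[0]  # "Patient1.1" -> "Patient1"
        rows.append({
            "Patient_ID": patient,
            "Patient_stains": f"{patient}_{stain}",
            "Patient_stains_numeric": img,
            "label": label,
        })
    return rows


if __name__ == "__main__":
    names = ["Patient1.1_CD68", "Patient1.2_CD138", "Patient2.3_CD20"]
    rows = build_label_rows(names)
    # Write the rows as CSV text to an in-memory buffer and print it.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    print(out.getvalue())
```

One thing to check before training: the suggested labels run from 1 to 4 while n_classes is 4. Losses that take zero-based class indices, such as PyTorch's CrossEntropyLoss, would expect 0 to 3, so the coding may need shifting depending on how the repo consumes the label column.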

Author

TanmDL commented Dec 13, 2024

Thank you for your kind reply and for giving me a fantastic idea. Could you please send me the dataset links so that I can download and test them?

Owner

AmayaGS commented Dec 17, 2024

Unfortunately, to protect patient privacy, I am not able to share these datasets publicly, as they come from clinical trial and research datasets. I am currently exploring options to publish a multistain dataset and can let you know how that goes. However, if it works out, it wouldn't be until mid-next year. Very sorry about that!
