Import ZTF DR22 on Epyc #452
Per the README, checked the checksums of the files. This operation has not yet completed, probably because I set its priority to the lowest niceness level:

nohup md5sum -c checksum.md5 > ~/md5-output.log &
# Process ID is 127216
renice 20 127216

At the time of this writing, it is still running, and one of the parquet files has failed its checksum so far:

wc -l checksum.md5 ~/md5-output.log
  176561 checksum.md5
  105392 /astro/users/dtj1s/md5-output.log
  281953 total

grep -v ' OK$' ~/md5-output.log
./0/field000385/ztf_000385_zr_c11_q2_dr22.parquet: FAILED
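For reference, a minimal Python sketch (assuming the md5sum -c log format shown above) that tallies progress and failures from the output log, equivalent to the wc/grep commands:

from pathlib import Path

# Sketch: summarize the md5sum -c output written to ~/md5-output.log.
log_path = Path.home() / "md5-output.log"
lines = log_path.read_text().splitlines()
failures = [line for line in lines if not line.endswith(": OK")]
print(f"{len(lines)} files checked so far, {len(failures)} failed")
for line in failures:
    print(line)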
The import also failed with this message:
The script I used:

from dask.distributed import Client
from hats_import.pipeline import pipeline_with_client
from hats_import.catalog.arguments import ImportArguments


def main(input_path: str, output_path: str = "./output", output_artifact="test_cat"):
    args = ImportArguments(
        sort_columns="objectid",
        ra_column="objra",
        dec_column="objdec",
        input_path=input_path,
        output_artifact_name=output_artifact,
        output_path=output_path,
        file_reader="parquet",
    )
    with Client(n_workers=10, memory_limit="10GB", threads_per_worker=2) as client:
        pipeline_with_client(args, client)


if __name__ == "__main__":
    main(
        "/data3/epyc/data3/hats/raw/ztf/lc_dr22",
        "/data3/epyc/data3/hats_import/catalogs/ztf_dr22",
        "ztf_lc",
    )

The most common errors printed to the console were of this form:
I will try to resume the import in a way that specifically only includes the Parquet files.
Verification of checksums concluded, and only one file was found to have failed. Restored this file by re-reading it from the source and verifying the checksum:

curl https://irsa.ipac.caltech.edu/data/ZTF/lc/lc_dr22/0/field000385/ztf_000385_zr_c11_q2_dr22.parquet -o fixme.parquet
md5sum fixme.parquet
12b316eac3768bfac393c503ff94f55e  # correct
cp fixme.parquet ./0/field000385/ztf_000385_zr_c11_q2_dr22.parquet
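A small Python cross-check of the restored file against the manifest (a sketch with hypothetical helpers; assumes md5sum's "digest  path" line format and that it is run from the raw data root, like the md5sum -c above):

import hashlib

def manifest_digest(manifest_path, rel_path):
    # Look up the expected digest for one file in the checksum.md5 manifest.
    for line in open(manifest_path):
        digest, _, path = line.strip().partition("  ")
        if path == rel_path:
            return digest
    return None

def file_digest(path):
    # Stream the file through md5 in 1 MiB chunks.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

rel = "./0/field000385/ztf_000385_zr_c11_q2_dr22.parquet"
print(file_digest(rel) == manifest_digest("checksum.md5", rel))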
Modifying the script to use explicit glob patterns to identify the Parquet files only. Note also the change in output path, now possible due to corrected permissions.

import glob

from dask.distributed import Client
from hats_import.pipeline import pipeline_with_client
from hats_import.catalog.arguments import ImportArguments


def main(input_path: str, output_path: str = "./output", output_artifact="test_cat"):
    # Cannot use input_path directly because the input directory contains many, many
    # non-parquet files, which clog up the pipeline. Rather, use the glob module
    # to create an authoritative and complete list of parquet files.
    print(f"Reading {input_path} for *.parquet files")
    parquet_list = glob.glob(f"{input_path}/**/*.parquet", recursive=True)
    print(f"Parquet files found: {len(parquet_list)}")
    args = ImportArguments(
        sort_columns="objectid",
        ra_column="objra",
        dec_column="objdec",
        # input_path=input_path,
        input_file_list=parquet_list,
        output_artifact_name=output_artifact,
        output_path=output_path,
        file_reader="parquet",
    )
    with Client(n_workers=10, memory_limit="10GB", threads_per_worker=2) as client:
        pipeline_with_client(args, client)


if __name__ == "__main__":
    main(
        "/data3/epyc/data3/hats/raw/ztf/lc_dr22",
        "/data3/epyc/data3/hats/catalogs/ztf_dr22",
        "ztf_lc",
    )
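As a sanity check before launching the pipeline (a sketch, assuming checksum.md5 sits at the root of the raw directory and lists every Parquet file), the glob result can be compared against that manifest:

import glob
import os

# Sketch: compare the set of globbed Parquet paths against the manifest.
input_path = "/data3/epyc/data3/hats/raw/ztf/lc_dr22"
globbed = {
    os.path.relpath(p, input_path)
    for p in glob.glob(f"{input_path}/**/*.parquet", recursive=True)
}
manifest = set()
with open(os.path.join(input_path, "checksum.md5")) as f:
    for line in f:
        if not line.strip():
            continue
        path = line.split(maxsplit=1)[1].strip()
        if path.endswith(".parquet"):
            manifest.add(path.removeprefix("./"))
print("globbed:", len(globbed), "manifest:", len(manifest))
print("missing from disk:", len(manifest - globbed))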
With a hint from @delucchi-cmu, created batched indexes of the Parquet files, to reduce the amount of file I/O and to compensate for the fact that the files' size distribution is logarithmic:

import glob
from pathlib import Path

import binpacking
from dask.distributed import Client
from hats_import.catalog.arguments import ImportArguments
from hats_import.pipeline import pipeline_with_client


def prepare_batches(input_path: str, n_batches: int, prefix: str) -> list[Path]:
    """
    Given a directory to search for `.parquet` files, find them all,
    get their sizes, and sort them into n_batches files, each of which
    will contain roughly an equal number of bytes. These batches
    will be written to a family of files that all start with `prefix`
    and are numbered.
    """
    print(f"Reading {input_path} for *.parquet files")
    parquet_list = glob.glob(f"{input_path}/**/*.parquet", recursive=True)
    print(f"Parquet files found: {len(parquet_list)}")
    file_sizes = [Path(p).stat().st_size for p in parquet_list]
    print(f"Total size of the files: {sum(file_sizes)}")
    print(f"Sorting into {n_batches} batches")
    batches = binpacking.to_constant_bin_number(
        list(zip(file_sizes, parquet_list)), n_batches, weight_pos=0
    )
    # Now write out the batch files
    batch_files = []
    for i, b in enumerate(batches):
        batch_file = Path(f"{prefix}_{i:03d}.batch")
        batch_file.write_text("\n".join([fname for sz, fname in b]) + "\n")
        batch_files.append(batch_file)
    return batch_files


def main(input_path: str, n_batches: int, output_path: str = "./output", output_artifact="test_cat"):
    batch_files = prepare_batches(input_path, n_batches, "./batch")
    args = ImportArguments(
        sort_columns="objectid",
        ra_column="objra",
        dec_column="objdec",
        # input_path=input_path,
        input_file_list=batch_files,
        output_artifact_name=output_artifact,
        output_path=output_path,
        file_reader="indexed_parquet",
    )
    # Attempt to scale splitting phase.
    with Client(n_workers=8, memory_limit="32GB", threads_per_worker=2) as client:
        pipeline_with_client(args, client)

This caused the planning, mapping, and binning phases to complete much more quickly, each phase on the order of 20 to 30 minutes. The splitting phase is still taking a long time (~8h projected) but appears to be on track to complete without errors or warnings.
Uh oh.
Will try restarting in case it was a question of resources.
No such luck. Error is stable, and as such, this particular attempt appears to be at an end.
I have another scheme for batching the files that keeps the fields together. I had originally expected that it was only an optimization, but perhaps there is something about it that lets it get past this obstacle. This is the code for creating it:

import re

# n_bins and prefix are assumed to be defined earlier in the notebook.
def write_batches(parquet_file_sizes: list[tuple[int, str]]):
    total_size = sum([x[0] for x in parquet_file_sizes])
    bin_size = total_size // n_bins
    pat = re.compile(r'^.*/field(\d+)/.*$')
    field = None
    batch_count = 0
    bytes_in_bin = 0
    f_out = None

    def cycle_batch(done=False):
        nonlocal bytes_in_bin, batch_count, f_out
        if f_out is not None:
            f_out.close()
            print("Closing batch", batch_count, "containing", bytes_in_bin, "bytes")
        bytes_in_bin = 0
        batch_count += 1
        if not done:
            f_out = open(f"{prefix}_{batch_count:03d}.batch", "w")

    cycle_batch()
    while parquet_file_sizes:
        file_sz, file_name = parquet_file_sizes.pop(0)
        next_field = pat.match(file_name).group(1)
        # Only start a new batch once the current one is over the target size
        # AND we are crossing a field boundary.
        if bytes_in_bin > bin_size and next_field != field:
            cycle_batch()
        f_out.write(file_name + '\n')
        field = next_field
        bytes_in_bin += file_sz
    cycle_batch(done=True)
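A sketch of how the helper above might be driven (hypothetical driver; n_bins and prefix are assumed to be module-level settings, and n_bins is derived here from the ~1.5 GiB-per-batch target used below):

import glob
from pathlib import Path

# Sorting by full path keeps each field directory's files contiguous,
# which is what lets write_batches() keep fields together.
input_path = "/data3/epyc/data3/hats/raw/ztf/lc_dr22"
paths = sorted(glob.glob(f"{input_path}/**/*.parquet", recursive=True))
parquet_file_sizes = [(Path(p).stat().st_size, p) for p in paths]

# Target roughly 1.5 GiB of Parquet data per batch index file.
prefix = "./batch"
n_bins = max(1, sum(sz for sz, _ in parquet_file_sizes) // (3 * 2**29))
write_batches(parquet_file_sizes)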
Used the above to create index files with a limit of 1.5 GiB apiece. This ran smoothly for planning and mapping, but splitting, even though it seems to be continuing, has run into this error. That batch has 128 Parquet files in it, most of which are about 20 MiB, and it only contains files from field 478; the files are:
Splitting completed after 6h, but then the import stopped and failed, re-reporting the error above. Reran this, but it errored out exactly the same way on exactly the same batch. Reading the schema out of all of these files, 126 of them look like this:
and two of them look like this:
The only two files in that batch which do not have the list schema are:
There are two other batches with field 478, and all of the files in those batches use the list schema.
Other batches that contain the non-list schema:
I don't know why these others didn't cause the same problem. Luck of the draw?
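A sketch of the kind of check that surfaces the odd files out, grouping the files listed in one batch index by their Parquet schema (the batch file name here is a placeholder):

from collections import defaultdict

import pyarrow.parquet as pq

# Sketch: read each file's schema and group the file paths by schema text,
# so a batch with mixed schemas shows more than one group.
batch_file = "./batch_000.batch"  # placeholder name
groups = defaultdict(list)
with open(batch_file) as f:
    for line in f:
        path = line.strip()
        if path:
            groups[pq.read_schema(path).to_string()].append(path)

for schema_text, paths in groups.items():
    print(f"--- {len(paths)} files share this schema ---")
    print(schema_text[:500])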
Found 129 empty parquet files.

import pyarrow.parquet as pq
from pathlib import Path

field_dir = Path("/data3/epyc/data3/hats/raw/ztf/lc_dr22/")
files = field_dir.glob("*/*/*.parquet")
num_good = 0
for file in files:
    parquet_file = pq.ParquetFile(file)
    num_rows = parquet_file.metadata.num_rows
    if num_rows == 0:
        print(file)
    else:
        num_good += 1
print("num_good", num_good)
Died right at the end with this:
This looked to me like an error near the end, so I simply re-ran it, expecting it to resume what wasn't done. Unfortunately, it seems to have started all over again, and I don't know how to characterize what state the output file may be in.
If "Finishing" gets to 100%, then the job is finished. This error just means that dask did a bad job of saying goodbye. |
Job finished, getting to 100%. Tested the output catalog:

from hats.io.validation import is_valid_catalog

output_path = "/data3/epyc/data3/hats/catalogs/ztf_dr22/"
ztf_dr22_path = output_path + "ztf_lc"
print(is_valid_catalog(ztf_dr22_path))  # prints: True

Catalog is valid.
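As a quick end-to-end check (a sketch; assumes lsdb is installed and that its read_hats reader matches this version of hats):

import lsdb

# Lazily open the finished catalog and print its summary metadata.
catalog = lsdb.read_hats("/data3/epyc/data3/hats/catalogs/ztf_dr22/ztf_lc")
print(catalog)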
Notebook containing the final code is here: https://github.com/lincc-frameworks/notebooks_lf/blob/main/sprints/2024/12_13/ztf_dr22_import_prebatched.ipynb
Raw data is here:
/data3/epyc/data3/hats/raw/ztf/lc_dr22
The result catalog should go here:
/data3/epyc/data3/hats/catalogs/ztf_dr22/ztf_lc
I haven't checked the checksums of the files, so failures are possible.