pip fails to build wheels for cffi (gcc not found) in ubuntu.1804.armarch.open using image alpine-3.15-helix-arm32v7 #1423
Comments
on one hand, I'm not sure why we set on the other hand, the recent runtime changes were back to a previously-used Alpine Docker image. seems like something changed w/ the various Python packages while they were using a different (newer) container. bottom line, we don't have much experience w/ the floating "fix" choices may include
/cc @epananth and @davfost due to Docker implications /fyi @lbussell as well |
Affects customers, so I believe the process is that SSA should triage and establish impact |
set aside my general comments about why we need helix-scripts/ or to meet its Python requirements in Docker containers. removing that everywhere could break some customers. we could add an issue to the Docker epic around finding whether our customers need Python in containers and (maybe) removing it more broadly. a total removal wasn't intended to be my (1) choice. rethinking the options above, we seem to have four choices. three involve updates to https://github.com/dotnet/dotnet-buildtools-prereqs-docker/blob/main/src/alpine/3.15/helix/arm32v7/Dockerfile and the runtime team picking up a new Docker tag; the last changes https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines?path=/resources/helix-scripts/runtime_python_requirements.txt and requires a helix-machines rollout (currently planned for the 29th). updating the Dockerfile likely involves reverting @hoyosjs' recent change which may have broken the runtime team (so that the next tag doesn't cause similar issues).
(1) sounds fastest but is also somewhat "dirty" in that it makes a |
@carlossanlop can you tell us what the overall impact of this error is for your PR? |
I am unsure of the impact to my team, I was hoping dnceng could help initially determine if this was something to worry about or not. But maybe there's someone from @dotnet/runtime-infrastructure that might know if this could affect runtime. I only happened to notice it in an expected test failure in an unrelated PR when reading the console log, but it probably happens in all runs using the same queue, it just does not get reported as an error at all. |
We unfortunately cannot update the tag at the moment, as we are blocked by an external issue (microsoft/msquic#3958) which heavily impacts our CI (dotnet/runtime#91757) -- this was the reason we pinned the tag in the first place (dotnet/runtime#94609) I wonder if this is a problem only 3.15 has? Because 3.15 is EOL now. We were looking into updating to 3.16 (but it is now blocked by the issue above as well) |
Building and installing dependencies when the container starts is going to be expensive and unreliable IMHO. I was under the impression that the whole prereq repo exists to have everything ready to run the tests. If the images are meant for testing, we should perhaps add a step to verify that Helix is able to run. And I think we can temporarily disable Quic tests on Alpine arm32 @CarnaViire. Not great but we should not block everything else IMHO. (or we can investigate and fix msquic ourselves or move back to a released version instead of tracking current main) |
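A minimal sketch of the "verify that Helix is able to run" step suggested above, assuming the check would run while the image is built and that it is enough to confirm the needed Python modules import. The module names are taken from this thread (`cffi`, `cryptography`); the function names and the idea of calling this from a Dockerfile step are hypothetical, not something the prereqs repo is known to do.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


def check_prerequisites(module_names):
    """Print a short report and return a process exit code (0 = ok, 1 = missing)."""
    missing = missing_modules(module_names)
    if missing:
        print("missing Python modules: " + ", ".join(missing))
        return 1
    print("all required Python modules are importable")
    return 0


# Hypothetical list; the real one would be derived from the Helix requirements file.
exit_code = check_prerequisites(["cffi", "cryptography"])
print("exit code:", exit_code)
```

Passing the returned code to `sys.exit` in an image build step would make a missing prebuilt dependency fail the image build instead of triggering a source build on every test run.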
@wfurt yeah, I was thinking about fixing the Arm32 alpine image on a specific dotnet/msquic commit instead of main, to be able to "unlock" the image for others -- let's continue this discussion in dotnet/runtime#91757 |
what's happening here is a quirk of how Helix runs commands in a Docker container. we execute as I hinted above, I'm not quite sure of the overall reasoning except that we run
please let us know whether we need to help anything once you have a Docker tag containing the "right" |
@carlossanlop, @CarnaViire, @wfurt I see runtime builds continue to hit this issue but I'm not sure why it remains in this repo. could we please move it to dotnet/runtime since we aren't helping❔ |
@dougbu we didn't update runtime yet to consume the rolling docker image (that's why it still shows up), but I will do that shortly. But I didn't do any changes in the docker image beside fixing the msquic version. Would just a rolling image be enough, or are any additional changes needed? |
I don't know what'll happen next. it's possible the if not, solution would be additional Dockerfile changes I described above in a fair amount of detail. |
But there will not (should not :D) be a problem with msquic anymore... it is mitigated within the dockerfile in dotnet/dotnet-buildtools-prereqs-docker#933. |
this is more complicated and beyond Alpine IMHO. @lbussell disabled builds for Debian 11 on ARM32 as it was failing in a similar way. I spent the last few days fiddling with ARM32 builds and here is what I have found. There are no binary wheels for ARM32. Forcing new So building the dependencies from sources is our only option IMHO. That can happen either during image creation (like the Debian 11 in pre-req repo or when Helix queue is set up) or during test execution (like this particular case). I would strongly suggest focusing on the first one as doing it on every run is going to be expensive and unreliable IMHO. I know the current behavior was put in with the idea that Helix can fix itself and make it work, but in practice that does not seem to be the case. It would also require us to put build tools into test images - making them bigger, and we would pay the cost over and over again. Now, the problem with Debian is that the new My recommendation would be to relax the package version requirement and perhaps disable the runtime attempts to install what it wants. I feel it would be better to fail clearly and prompt action instead of consuming resources and dragging unnecessary baggage. And all this is difficult to test, e.g. we see cases like this when something suddenly starts failing without an obvious reason. I know there was an argument about security for the update. But as far as I can tell the cryptography 3.2.1 from Debian does not have any known flaw and #929 was working fine AFAIK. While we may use newer packages on mainstream images, having a viable option on Alpine & Debian is also important. The security patches come from the OS vendor - just like anything else in the OS image. So I think there is an action for the Helix/infra team @dougbu. The MsQuic regression is out of the picture now as @CarnaViire mentioned. But we still need to figure out packages and dependencies for ARM32 - unless we can convince PMs and leadership to drop support for it. 
Long-term we should probably also improve test coverage for Helix. The .NET supported OS matrix should be the guiding document and we need some solution for oddballs, like CentOS 6 and Ubuntu 16 were until recently. I'm happy to share more details and/or share my dev environment - a machine from the Helix dev pool and my personal Raspberry. |
@wfurt I'm on vacation but am wondering about options here. We can't generally lower the requirements on our side but we may be able to special-case the problematic Docker containers. What specific change are you requesting❔ |
Backing up a bit… Is the issue here about |
I suspect the solution will be to special case an upper bound for the @wfurt (or anyone w/ access to an ARM32 Alpine 3.15 or Debian 11 box) could you please run the following short Python script on your test machines and let me know the results (I'm hoping the OSes are consistent)❔ I suspect
import platform
print(platform.machine())
I could likely create PRs somewhere to test but it should be faster if someone already has a box or VM available. |
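A slightly fuller version of the script requested above, as a sketch. It assumes the special-casing would be written with PEP 508 environment markers in the requirements file (the thread does not state this explicitly), and prints the values those markers are evaluated against so they can be compared across containers. `marker_environment` is a hypothetical helper name.

```python
import os
import platform
import sys


def marker_environment():
    """Collect the values PEP 508 environment markers compare against."""
    return {
        "os_name": os.name,
        "sys_platform": sys.platform,
        "platform_machine": platform.machine(),
        "platform_system": platform.system(),
        "platform_release": platform.release(),
        "platform_version": platform.version(),
        "python_version": ".".join(platform.python_version_tuple()[:2]),
        "python_full_version": platform.python_version(),
        "implementation_name": sys.implementation.name,
    }


# Print one "name = value" line per marker variable.
for name, value in sorted(marker_environment().items()):
    print(f"{name} = {value!r}")
```

Running it inside each problematic container and on the host would show which of these values, if any, differ between them.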
I get Support for |
I think it'll be pretty easy to special case
that's not good b/c however, |
I'm not sure what the comment means @dougbu. This is for 32bit docker image
base machine shows 64 as expected
|
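A sketch of why the container and the base machine can disagree: on POSIX systems `platform.machine()` comes from `uname`, which describes the kernel, while the pointer size of the running interpreter describes the userland it was built for. `interpreter_bits` is a hypothetical helper name; no output is shown because it depends on the machine.

```python
import platform
import struct
import sys


def interpreter_bits():
    """Pointer width of the running Python interpreter, in bits."""
    return struct.calcsize("P") * 8


# Reported by the kernel via uname, so a container inherits it from the host.
print("platform.machine():", platform.machine())
# Properties of the interpreter binary itself, so they reflect the container userland.
print("interpreter bits:", interpreter_bits())
print("sys.maxsize > 2**32:", sys.maxsize > 2**32)
```

The interpreter's pointer size is not one of the values available to environment markers, so this only helps a script that runs before the requirements are installed.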
@wfurt I'm looking for a way to specify a |
After checking old available wheels, I suspect this is the crux of the problem. There are no ARM32 wheels listed there, likely meaning the Docker container has always been building With that in mind, we likely need to downgrade There are alternatives but none of them are quick:
Note that resolving the open questions about identifying the problematic systems in a way we can use in our requirements file and choosing the |
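A sketch of the "no ARM32 wheels listed" check, assuming ARM32 corresponds to the `armv7l` platform tag (an assumption based on the `arm32v7` image name). It relies only on the standard wheel filename layout; the filenames below are hypothetical placeholders, not a real index listing, and both function names are invented for illustration.

```python
def wheel_platform_tags(filename):
    """Return the platform tags encoded in a wheel filename.

    Wheel names follow
    {distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl,
    and the platform field may hold several tags joined by dots.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    platform_field = filename[: -len(".whl")].split("-")[-1]
    return platform_field.split(".")


def wheels_for_arch(filenames, arch):
    """Keep the wheel filenames that carry a platform tag ending in the given arch."""
    return [
        name
        for name in filenames
        if any(tag.endswith("_" + arch) for tag in wheel_platform_tags(name))
    ]


# Hypothetical filenames, for illustration only.
listing = [
    "example_pkg-1.0.0-cp36-abi3-manylinux2014_x86_64.whl",
    "example_pkg-1.0.0-cp36-abi3-manylinux2014_aarch64.whl",
    "example_pkg-1.0.0-cp36-abi3-musllinux_1_1_x86_64.whl",
]
print(wheels_for_arch(listing, "armv7l"))  # [] because no listed wheel targets armv7l
```

When the result is empty for the target architecture, pip has to fall back to the source distribution, which is where the compiler (and, for some packages, Rust) requirement comes from.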
/cc @ilyas1974 and @markwilkie ^^ for awareness of the state here 😦 |
I checked the requirement semantics @dougbu and it is going to be difficult IMHO. The versions report the kernel version, not the actual user-mode setup. In the case of containers the kernel version comes from the base OS (like Ubuntu 18/20) and I could not figure out how to get any reasonable info about the actual container. As far as the cryptography: It seems like prior to We ditched Alpine 3.15, 3.16 provides
And Debian 11
|
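A sketch of the distinction described above, assuming the container ships an `/etc/os-release` file: `platform.release()` reports the host kernel, while the os-release file describes the container's own userland. This value is not visible to requirements-file markers, so it would only help a wrapper script that chooses which requirements to install; both function names are hypothetical.

```python
import platform
import shlex


def parse_os_release(text):
    """Parse os-release content (KEY=value lines) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        parts = shlex.split(raw)  # strips shell-style quoting around the value
        info[key] = parts[0] if parts else ""
    return info


def container_os(path="/etc/os-release"):
    """Return (ID, VERSION_ID) from the userland's os-release file, or None if unreadable."""
    try:
        with open(path, encoding="utf-8") as handle:
            info = parse_os_release(handle.read())
    except OSError:
        return None
    return info.get("ID"), info.get("VERSION_ID")


print("kernel release (from the host):", platform.release())
print("userland distribution:", container_os())
```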
Unfortunately none of the Python "environment variable" options provide real information about the container. I see https://github.com/dotnet/dotnet-buildtools-prereqs-docker/blob/7d0f9b0c308f54dc80bf275b69dc41d1b1856109/src/debian/11/helix/arm32v7/Dockerfile#L4C5-L4C26 sets
Right. We might however be able to muddle through w/ Rust 1.48 in the relevant containers (while installing a compatible Must admit however that we're talking about layering hacks on hacks. I prefer the alternatives I mentioned, even if they take longer to implement or get buy-in for. |
didn't forget this issue during the holiday break. current status is I'm waiting for DDFun to set up a Raspberry Pi system for me to test things out on. that at least will allow me to experiment w/ an ARM32 system and get things working again, hopefully with near-current on the other hand, the "real" fixes here are for just Alpine 3.17 and Debian 11 on ARM32, correct❔ want to make sure I'm not missing another platform combo you're having problems with what looks like this issue. |
it is rolling battle - but I think Alpine & Debian 11 are the tail at this moment @dougbu |
/fyi this remains very much on our radar, especially because we provide Raspbian queues likely affected in a similar fashion. any summary of how (or if) people get around these issues will help w/ VMs as well as Docker containers. our resource constraints and urgent requirements elsewhere mean I'm not busy investigating at the moment |
I just found out about this issue which is surprisingly similar to the one I'm hitting in a completely different scenario: I'm maintaining python packages for the SynoCommunity using Synology Linux DSM for multiple archs (armv5-7-8, i686, ppc, x86_64). Somehow, can't tell when exactly,
Totally not elegant but it seems to work?! Further, this problem seems specific to Python 3.10, as it builds OK with 3.11. Just saying that I'll keep an eye on this thread, and let you know if I find a reasonable fix or workaround. Cheers! |
Build
https://dev.azure.com/dnceng-public/public/_build/results?buildId=469944
Build leg reported
https://dev.azure.com/dnceng-public/public/_build/results?buildId=469944&view=logs&j=fd49b2ea-3ee3-5298-d793-bdb9fd631e7e&t=26d44400-a1b8-5ed1-d719-0cc35b38251b
Pull Request
dotnet/runtime#93906
Known issue core information
Fill out the known issue JSON section by following the step by step documentation on how to create a known issue
@dotnet/dnceng
Release Note Category
Release Note Description
Additional information about the issue reported
Happens in (ubuntu.1804.armarch.open) using docker image mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.15-helix-arm32v7-20230807201723-7ea784e
The repo had some recent PRs modifying the alpine images:
My PR is not making any changes that could be causing this.
This error is only visible when there's an actual test error happening in these machines, otherwise, the CI leg passes.
Log file with the error: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-93906-merge-9e1b570d2b3441d698/System.IO.Compression.Brotli.Tests/1/console.b000a501.log?helixlogtype=result
Known issue validation
Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=469944
Error message validated:
Building wheel for cffi (pyproject.toml) did not run successfully.
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 11/15/2023 12:17:17 AM UTC
Report
Summary