Could not load mining kernel

Something is going horribly wrong for me in the mining kernel. I'm running this on arm64 with 32 GB of RAM and the same amount of swap.

2025-05-22T12:36:38.306762Z  INFO poke{src="timer"}:do_poke:slam:interpret: slogger: candidate block timestamp updated: 0x8000000d36cd27d6
2025-05-22T12:36:38.306762Z DEBUG next_effect: nockapp::nockapp::driver: Waiting for recv on next effect
2025-05-22T12:36:38.306814Z DEBUG next_effect: nockapp::nockapp::driver: Waiting for recv on next effect

thread 'serf' panicked at crates/nockvm/rust/nockvm/src/mem.rs:301:23:
Box<dyn Any>

thread 'tokio-runtime-worker' panicked at crates/nockchain/src/mining.rs:175:14:
Could not load mining kernel: OneshotChannelError(RecvError(()))
2025-05-22T12:36:38.309510Z  WARN nockchain::mining: Error during mining attempt: JoinError::Panic(Id(2932), "Could not load mining kernel: OneshotChannelError(RecvError(()))", ...)

It’s related to this code in mem.rs. The Box<dyn Any> panic at mem.rs:301 in the log above looks like new calling panic_any when new_ returns an error:

    /**  Initialization:
     * The initial frame is a west frame. When the stack is initialized, a number of slots is given.
     * We add three extra slots to store the “previous” frame, stack, and allocation pointer. For the
     * initial frame, the previous allocation pointer is set to the beginning (low boundary) of the
     * arena, the previous frame pointer is set to NULL, and the previous stack pointer is set to NULL.
     * size is in 64-bit (i.e. 8-byte) words.
     * top_slots is how many slots to allocate to the top stack frame.
     */
    pub fn new(size: usize, top_slots: usize) -> NockStack {
        let result = Self::new_(size, top_slots);
        match result {
            Ok((stack, _)) => stack,
            Err(e) => std::panic::panic_any(e),
        }
    }

    pub fn new_(size: usize, top_slots: usize) -> Result<(NockStack, usize), NewStackError> {
        if top_slots + RESERVED > size {
            return Err(NewStackError::StackTooSmall);
        }
        let free = size - (top_slots + RESERVED);
        #[cfg(feature = "mmap")]
        let mut memory = Memory::allocate(AllocType::Mmap, size)?;
        #[cfg(feature = "malloc")]
        let mut memory = Memory::allocate(AllocType::Malloc, size)?;
        let start = memory.as_mut_ptr() as *mut u64;

        // Here, frame_offset < alloc_offset, so the initial frame is West
        let frame_offset = RESERVED + top_slots;
        let stack_offset = frame_offset;
        // FIXME: This was alloc_offset = size; why?
        let alloc_offset = size;

        unsafe {
            // Store previous frame/stack/alloc info in reserved slots
            let prev_frame_slot = frame_offset - (FRAME + 1);
            let prev_stack_slot = frame_offset - (STACK + 1);
            let prev_alloc_slot = frame_offset - (ALLOC + 1);

            *(start.add(prev_frame_slot)) = ptr::null::<u64>() as u64; // "frame pointer" from "previous" frame
            *(start.add(prev_stack_slot)) = ptr::null::<u64>() as u64; // "stack pointer" from "previous" frame
            *(start.add(prev_alloc_slot)) = start as u64; // "alloc pointer" from "previous" frame
        };

        assert_eq!(alloc_offset - stack_offset, free);
        Ok((
            NockStack {
                start: start as *const u64,
                size,
                frame_offset,
                stack_offset,
                alloc_offset,
                memory,
                pc: false,
            },
            free,
        ))
    }

I’d love to learn more about why this happens and how to avoid it. It’s not sporadic; it happens every time I try to start PoW.

logs/min2-1747915616.log:2025-05-22T12:40:27.525940Z  INFO poke{src="libp2p"}:do_poke:slam:interpret: slogger: [%mining-on 14.013.155.469.355.287.694 17.658.163.466.538.601.719 16.139.960.547.538.818.049 13.146.085.519.865.444.801 3.604.770.390.141.248.621]
logs/min2-1747915616.log:thread 'tokio-runtime-worker' panicked at crates/nockchain/src/mining.rs:175:14:
logs/min2-1747915616.log:Could not load mining kernel: OneshotChannelError(RecvError(()))
logs/min2-1747915616.log:2025-05-22T12:40:27.692917Z  WARN nockchain::mining: Error during mining attempt: JoinError::Panic(Id(3138), "Could not load mining kernel: OneshotChannelError(RecvError(()))", ...)

What operating system are you running this from?

This is on Ubuntu 24.04. What are you thinking?


[All Irrelevant, See Next Post]

I can spin up a virtual machine and see if it happens for me.
For Reproducibility:

https://cdimage.ubuntu.com/daily-live/20240421/noble-desktop-arm64.iso

The above was strange; trying a server ISO install instead:

https://cdimage.ubuntu.com/releases/24.04/release/ubuntu-24.04.2-live-server-arm64.iso

The thing about Linux on ARM, from what I’ve seen, is that the repositories aren’t always the same, and software behaves differently, especially Rust*. Things that build on amd64 for me sometimes break on arm64.

*: Alpine Linux with its non-GNU libs has similar issues


Hey @grilledasparagus, based on your screenshot in the other thread, you also have this issue! Or at least had it this morning. Definitely let us know if it’s gone now and you’re generating actual proofs.

I am experiencing the same issue on Debian 12. My machine has 64GB of RAM.


AMD or ARM?

Mine is AMD.


So I spun up that virtual machine, but before I even started building Hoon I grepped for "mining" in the logs on my original machine and also saw this:

2025-05-22T18:09:56.265749Z  INFO poke{src="libp2p"}:do_poke:slam:interpret: slogger: [%mining-on 2.355.513.181.070.318.655 809.918.659.070.895.438 1.802.357.504.238.026 8.368.239.197.549.738.390 9.645.348.589.553.451.187]
thread 'tokio-runtime-worker' panicked at crates/nockchain/src/mining.rs:175:14:
Could not load mining kernel: OneshotChannelError(RecvError(()))
2025-05-22T18:09:56.396984Z  WARN nockchain::mining: Error during mining attempt: JoinError::Panic(Id(558), "Could not load mining kernel: OneshotChannelError(RecvError(()))", ...)

It’s around crates/nockchain/src/mining.rs:175; the error is at the kernel = expression under the "Spawns a new std::thread" comment:

pub async fn mining_attempt(candidate: NounSlab, handle: NockAppHandle) -> () {
    let snapshot_dir =
        tokio::task::spawn_blocking(|| tempdir().expect("Failed to create temporary directory"))
            .await 
            .expect("Failed to create temporary directory");
    let hot_state = zkvm_jetpack::hot::produce_prover_hot_state();
    let snapshot_path_buf = snapshot_dir.path().to_path_buf();
    let jam_paths = JamPaths::new(snapshot_dir.path());
    // Spawns a new std::thread for this mining attempt
    let kernel =
        Kernel::load_with_hot_state_huge(snapshot_path_buf, jam_paths, KERNEL, &hot_state, false)
            .await
            .expect("Could not load mining kernel");
    let effects_slab = kernel
        .poke(MiningWire::Candidate.to_wire(), candidate)
        .await
        .expect("Could not poke mining kernel with candidate");
    for effect in effects_slab.to_vec() {
        let Ok(effect_cell) = (unsafe { effect.root().as_cell() }) else {
            drop(effect);
            continue;
        };
        if effect_cell.head().eq_bytes("command") {
            handle
                .poke(MiningWire::Mined.to_wire(), effect)
                .await
                .expect("Could not poke nockchain with mined PoW");
        }
    }
}

Annoying, isn’t it? If you look at the logs in my first post, we’re dealing with the same execution path. Kernel::load_with_hot_state_huge calls SerfThread::new, asking it to allocate 32 GB of memory (or at least enough for a 32 GB Nock stack). NockStack is what actually allocates the memory.

Or in our case, it doesn’t…

I’ve already notified @logan about this error on Telegram.


It’s really hard to keep track of that chat while debugging this myself. Please update us here if you come across more info from affected users or the team. They might push a fix, but given everything they have on their hands right now, I’m sure they’d appreciate having the relevant info gathered here in this thread.


Great stuff in Telegram on this:

The issue here is that Linux disallows obvious overcommits for MAP_ANONYMOUS mappings by default. The NockStack mmaps 128 GB, which is more than most systems can commit. Anybody affected by this issue should run sudo sysctl -w vm.overcommit_memory=1 and try again.

Notably, this overcommit limit does not affect MAP_SHARED mappings, which is why we can map 1 TB for LMDB in vere without issue.
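
For anyone who wants to see the failure in isolation, here is a minimal sketch (assuming the libc crate; this is not repo code) that makes the same shape of request: one 128 GB anonymous, private mapping. With the default vm.overcommit_memory=0 on a machine with far less than 128 GB of RAM plus swap, the mmap call is refused with ENOMEM; with vm.overcommit_memory=1 it succeeds, because nothing is committed until pages are actually touched.

    use libc::{mmap, munmap, MAP_ANONYMOUS, MAP_FAILED, MAP_PRIVATE, PROT_READ, PROT_WRITE};
    use std::{io, ptr};

    fn main() {
        // 128 GB of anonymous, private address space. No pages are touched,
        // so nothing is resident yet; only the commit accounting matters.
        let size: usize = 128usize << 30;
        let addr = unsafe {
            mmap(
                ptr::null_mut(),
                size,
                PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_PRIVATE,
                -1,
                0,
            )
        };
        if addr == MAP_FAILED {
            // Under heuristic overcommit (mode 0) this is where ENOMEM shows up.
            eprintln!("mmap of 128 GB failed: {}", io::Error::last_os_error());
        } else {
            println!("mmap of 128 GB succeeded at {:p}", addr);
            unsafe { munmap(addr, size) };
        }
    }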

I would PR this to the repo README but I can’t. Here’s the commit, Logan.


This is very promising, thank you!

vm.overcommit_memory=1
Always overcommit. Appropriate for some scientific applications. Classic example is code using sparse arrays and just relying on the virtual memory consisting almost entirely of zero pages.

Clearly applies in this context.
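
One practical note: sysctl -w only changes the running kernel, so to survive a reboot the setting also needs to go into /etc/sysctl.conf or a file under /etc/sysctl.d/. If you want to verify what your machine is currently using, here is a tiny sketch (again, not part of nockchain) that just reads the policy out of procfs:

    use std::fs;

    fn main() -> std::io::Result<()> {
        // sysctl -w vm.overcommit_memory=1 writes this file; read it back to verify.
        let mode = fs::read_to_string("/proc/sys/vm/overcommit_memory")?;
        match mode.trim() {
            "0" => println!("0: heuristic overcommit (default); huge anonymous mappings may be refused"),
            "1" => println!("1: always overcommit; the 128 GB NockStack mapping should succeed"),
            "2" => println!("2: never overcommit; the mapping will almost certainly be refused"),
            other => println!("unexpected value: {other}"),
        }
        Ok(())
    }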


Works on my machine now

Many thanks. Your wisdom is appreciated.

This solved my problem - now I’m off mining. Thanks!


Love to see it
