eGPU
Known in English as “GPU enclosures”. Requires a Thunderbolt 3 or 4 port (or the upcoming Thunderbolt 5).
My eGPU experiments with Linux Mint 21.3 (Ubuntu 22.04 base), kernel 6.8.
- WKGL17 Wikingoo
  - purchased
  - ✅ RTX 3060 OK
  - ✗ RTX 5060 mostly OK (crashes depending on the model being run)
- WKG-L19C70 Wikingoo
  - The seller says it will work better with the RTX 5060 than the L17 …
  - purchased
  - careful: once the right physical position is found, don't move anything anymore.
  - ✅ RTX 3060 OK
  - ❌ RTX 5060 failed
- ADT UT4G
  - USB4/TB3/TB4 to PCIe x16 adapter for eGPU
- ADT UT4G-BK7
  - TB3/TB4 to PCIe x16, PCIe 4.0 x4 GPU dock
- AOOSTAR
  - AG02: OCuLink/USB4, with PSU
  - EG02: TB5 + OCuLink
In the end, only small models with small context windows run comfortably …
Nvidia
NVidia Driver Installation Guide
- nvidiafb: framebuffer support
- nvidia_modeset: Kernel Mode Setting (KMS) support
- nvidia_uvm: Unified Virtual Memory (UVM) support
- nvidia_drm: Direct Rendering Management (DRM) support
After a driver or CUDA crash, try unloading it with:

# This works because the nvidia driver is not attached to X11,
# thanks to `blacklist nvidia-drm` and `options nvidia-drm modeset=0 fbdev=0`
# in `/etc/modprobe.d/...`
sudo modprobe -r nvidia_drm nvidia_uvm nvidia_modeset nvidia

# The driver reloads properly on next use, and CUDA works again:
kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  580.105.08
The RTX 3060 works fine with version 580: nvidia-headless-580-open, nvidia-dkms-580-open.
nvidia-headless-575-open
$ sudo apt install nvidia-headless-575-open
The following NEW packages will be installed:
  libnvidia-cfg1-575 libnvidia-compute-575 libnvidia-decode-575
  libnvidia-gpucomp-575 nvidia-compute-utils-575 nvidia-dkms-575-open
  nvidia-firmware-575 nvidia-headless-575-open nvidia-headless-no-dkms-575-open
  nvidia-kernel-common-575 nvidia-kernel-source-575-open nvidia-persistenced
ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version
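This `ggml_cuda_init` failure means the installed kernel driver is older than the CUDA runtime the binary was built against. `nvidia-smi` prints the highest CUDA version the driver supports in its banner; comparing it with the runtime version is a plain version comparison. A tiny POSIX-shell helper (the `version_ge` name is mine, not part of any NVIDIA tooling):

```shell
# version_ge A B -> exit 0 if dotted version A >= B (sort -V does the work).
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example: a driver advertising CUDA 12.7 vs a runtime needing 12.9:
if version_ge "12.7" "12.9"; then
    echo "driver is sufficient"
else
    echo "driver too old for this CUDA runtime"
fi
```

With the 575 packages the driver's supported CUDA version falls short, hence the error; the 580 series fixed it here.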
nvidia-uvm
$ modinfo nvidia-uvm
filename: /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-uvm.ko.zst
version: 580.126.09
supported: external
license: Dual MIT/GPL
srcversion: B7E9DECF7BD1D315EBCCCF0
depends: nvidia
name: nvidia_uvm
retpoline: Y
vermagic: 6.14.0-37-generic SMP preempt mod_unload modversions
sig_id: PKCS#7
signer: NS5x-NS7xAU Secure Boot Module Signature key
sig_key: 3B:82:8F:E4:B9:99:2E:1F:E5:76:9C:33:AC:26:A9:F0:0A:1A:E3:46
sig_hashalgo: sha512
signature: 66:E9:9A:75:7C:2D:5B:1C:56:B9:CD:CE:E4:64:3B:5F:66:BB:F3:B2:
8F:E8:34:44:62:FD:02:32:A3:27:A8:EA:20:BB:BA:87:6F:F7:F8:6E:
F5:27:67:07:97:55:39:39:B2:7E:DE:01:F1:E5:64:AF:3A:29:98:90:
8D:A3:7A:0C:D9:D2:60:A8:15:C1:55:6E:F1:53:FE:85:D2:07:54:12:
B0:A4:D5:76:96:D4:A9:5F:85:B4:75:18:B4:38:A2:8B:15:3D:8C:8B:
F3:0A:AA:1E:F6:81:F1:27:CC:1E:22:EC:E6:72:BC:DC:3A:FD:39:2F:
F4:BF:DE:47:38:7E:1D:FE:04:D1:29:24:AD:CB:46:44:7F:4F:62:67:
38:FA:96:10:58:47:02:C8:65:05:67:7A:53:A6:70:76:A1:10:39:56:
0B:B3:5F:98:E2:D3:F1:FC:7E:85:02:E0:37:04:E4:91:E6:7D:92:25:
FE:3E:CD:0F:E1:26:B8:78:FA:C6:DB:AD:AA:CB:A9:22:2E:E7:20:DA:
91:46:FC:14:EB:54:54:B4:AF:1D:66:72:9B:C2:99:18:1B:57:77:14:
FD:65:14:B0:96:A5:0A:78:A4:AA:E2:F3:49:96:85:53:A3:28:50:C9:
E4:74:89:65:C7:24:19:BC:AF:4C:15:5E:55:8C:53:CC
parm: uvm_conf_computing_channel_iv_rotation_limit:ulong
parm: uvm_ats_mode:Set to 0 to disable ATS (Address Translation Services). Any other value is ignored. Has no effect unless the platform supports ATS. (int)
parm: uvm_perf_prefetch_enable:uint
parm: uvm_perf_prefetch_threshold:uint
parm: uvm_perf_prefetch_min_faults:uint
parm: uvm_perf_thrashing_enable:uint
parm: uvm_perf_thrashing_threshold:uint
parm: uvm_perf_thrashing_pin_threshold:uint
parm: uvm_perf_thrashing_lapse_usec:uint
parm: uvm_perf_thrashing_nap:uint
parm: uvm_perf_thrashing_epoch:uint
parm: uvm_perf_thrashing_pin:uint
parm: uvm_perf_thrashing_max_resets:uint
parm: uvm_perf_map_remote_on_native_atomics_fault:uint
parm: uvm_disable_hmm:Force-disable HMM functionality in the UVM driver. Default: false (HMM is enabled if possible). However, even with uvm_disable_hmm=false, HMM will not be enabled if is not supported in this driver build configuration, or if ATS settings conflict with HMM. (bool)
parm: uvm_perf_migrate_cpu_preunmap_enable:int
parm: uvm_perf_migrate_cpu_preunmap_block_order:uint
parm: uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
parm: uvm_perf_pma_batch_nonpinned_order:uint
parm: uvm_cpu_chunk_allocation_sizes:OR'ed value of all CPU chunk allocation sizes. (uint)
parm: uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes allocated and freed, 2 = per-allocation origin tracking. (int)
parm: uvm_force_prefetch_fault_support:uint
parm: uvm_debug_enable_push_desc:Enable push description tracking (uint)
parm: uvm_debug_enable_push_acquire_info:Enable push acquire information tracking (uint)
parm: uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid, sys. (charp)
parm: uvm_perf_access_counter_migration_enable:Whether access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm: uvm_perf_access_counter_batch_count:uint
parm: uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a notification.Valid values: [1, 65535] (uint)
parm: uvm_perf_reenable_prefetch_faults_lapse_msec:uint
parm: uvm_perf_fault_batch_count:uint
parm: uvm_perf_fault_replay_policy:uint
parm: uvm_perf_fault_replay_update_put_ratio:uint
parm: uvm_perf_fault_max_batches_per_service:uint
parm: uvm_perf_fault_max_throttle_per_service:uint
parm: uvm_perf_fault_coalesce:uint
parm: uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0. (int)
parm: uvm_perf_map_remote_on_eviction:int
parm: uvm_block_cpu_to_cpu_copy_with_ce:Use GPU CEs for CPU-to-CPU migrations. (int)
parm: uvm_exp_gpu_cache_peermem:Force caching for mappings to peer memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm: uvm_exp_gpu_cache_sysmem:Force caching for mappings to system memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm: uvm_downgrade_force_membar_sys:Force all TLB invalidation downgrades to use MEMBAR_SYS (uint)
parm: uvm_channel_num_gpfifo_entries:uint
parm: uvm_channel_gpfifo_loc:charp
parm: uvm_channel_gpput_loc:charp
parm: uvm_channel_pushbuffer_loc:charp
parm: uvm_enable_va_space_mm:Set to 0 to disable UVM from using mmu_notifiers to create an association between a UVM VA space and a process. This will also disable pageable memory access via either ATS or HMM. (int)
parm: uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)
parm: uvm_peer_copy:Choose the addressing mode for peer copying, options: phys [default] or virt. Valid for Ampere+ GPUs. (charp)
parm: uvm_debug_prints:Enable uvm debug prints. (int)
parm: uvm_enable_builtin_tests:Enable the UVM built-in tests. (This is a security risk) (int)
parm: uvm_release_asserts:Enable uvm asserts included in release builds. (int)
parm: uvm_release_asserts_dump_stack:dump_stack() on failed UVM release asserts. (int)
parm: uvm_release_asserts_set_global_error:Set UVM global fatal error on failed release asserts. (int)
$ systool -m nvidia_uvm -v
Module = "nvidia_uvm"
Attributes:
coresize = "2154496"
initsize = "0"
initstate = "live"
refcnt = "4"
srcversion = "B7E9DECF7BD1D315EBCCCF0"
taint = "OE"
uevent = <store method only>
version = "580.126.09"
Parameters:
uvm_ats_mode = "1"
uvm_block_cpu_to_cpu_copy_with_ce= "0"
uvm_channel_gpfifo_loc= "auto"
uvm_channel_gpput_loc= "auto"
uvm_channel_num_gpfifo_entries= "1024"
uvm_channel_pushbuffer_loc= "auto"
uvm_conf_computing_channel_iv_rotation_limit= "2147483648"
uvm_cpu_chunk_allocation_sizes= "2166784"
uvm_debug_enable_push_acquire_info= "0"
uvm_debug_enable_push_desc= "0"
uvm_debug_prints = "0"
uvm_disable_hmm = "Y"
uvm_downgrade_force_membar_sys= "1"
uvm_enable_builtin_tests= "0"
uvm_enable_debug_procfs= "0"
uvm_enable_va_space_mm= "1"
uvm_exp_gpu_cache_peermem= "0"
uvm_exp_gpu_cache_sysmem= "0"
uvm_fault_force_sysmem= "0"
uvm_force_prefetch_fault_support= "0"
uvm_global_oversubscription= "1"
uvm_leak_checker = "0"
uvm_page_table_location= "(null)"
uvm_peer_copy = "phys"
uvm_perf_access_counter_batch_count= "256"
uvm_perf_access_counter_migration_enable= "-1"
uvm_perf_access_counter_threshold= "256"
uvm_perf_fault_batch_count= "256"
uvm_perf_fault_coalesce= "1"
uvm_perf_fault_max_batches_per_service= "20"
uvm_perf_fault_max_throttle_per_service= "5"
uvm_perf_fault_replay_policy= "2"
uvm_perf_fault_replay_update_put_ratio= "50"
uvm_perf_map_remote_on_eviction= "1"
uvm_perf_map_remote_on_native_atomics_fault= "0"
uvm_perf_migrate_cpu_preunmap_block_order= "2"
uvm_perf_migrate_cpu_preunmap_enable= "1"
uvm_perf_pma_batch_nonpinned_order= "6"
uvm_perf_prefetch_enable= "1"
uvm_perf_prefetch_min_faults= "1"
uvm_perf_prefetch_threshold= "51"
uvm_perf_reenable_prefetch_faults_lapse_msec= "1000"
uvm_perf_thrashing_enable= "1"
uvm_perf_thrashing_epoch= "2000"
uvm_perf_thrashing_lapse_usec= "500"
uvm_perf_thrashing_max_resets= "4"
uvm_perf_thrashing_nap= "1"
uvm_perf_thrashing_pin= "300"
uvm_perf_thrashing_pin_threshold= "10"
uvm_perf_thrashing_threshold= "3"
uvm_release_asserts = "1"
uvm_release_asserts_dump_stack= "0"
uvm_release_asserts_set_global_error= "0"
With `options nvidia_uvm uvm_disable_hmm=1`, the RTX 5060 Ti crash happens later (but still happens).
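For reference, a sketch of how the `uvm_disable_hmm=1` option can be set persistently (the file name below is arbitrary; any `.conf` under `/etc/modprobe.d/` works):

```shell
# /etc/modprobe.d/nvidia-uvm-hmm.conf  (hypothetical file name)
# Force-disable HMM in the UVM driver; see the uvm_disable_hmm parm above.
options nvidia_uvm uvm_disable_hmm=1
```

After editing, run `sudo update-initramfs -u` and reload nvidia_uvm (or reboot); the active value can be checked in `/sys/module/nvidia_uvm/parameters/uvm_disable_hmm`.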
RTX series
- 30 (Ampere)
- 40 (Ada)
- 50 (Blackwell)
Gigabyte Windforce OC 12G GeForce RTX 3060
NVIDIA GA104 [GeForce RTX 3060]
$ nvidia-smi
Sat Nov 22 10:11:42 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:05:00.0 Off |                  N/A |
|  0%   33C    P8             10W /  170W |       7MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
PNY OC 16GB GeForce RTX 5060 Ti
I have trouble with this card:
- with the Cyid TB3-HL7 (JHL7540) PCIe/Thunderbolt bridge: complete failure
- with the Wikingoo WKGL17-C50 PCIe/Thunderbolt bridge: card recognized and llama-bench runs
  - however the whole setup is not stable; it crashes without warning, often requiring a forced power-off of the laptop …
Ticket opened with NVIDIA: kgspBootstrap_GH100: GSP-FMC reported an error while attempting to boot GSP
I bought a certified Thunderbolt cable (€50) to replace the one supplied with the eGPU. It seems to work better, but it still crashes easily:

kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from pRmApi->Control() …
NVRM: nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover …
nvidia-dkms-565
With the nvidia 565 driver, the cards are not recognized.
nvidia-headless-580-open
Installed packages:
$ dpkg --get-selections | grep -i nvidia
libnvidia-cfg1-580:amd64            install
libnvidia-common-580                install
libnvidia-compute-580:amd64         install
libnvidia-decode-580:amd64          install
libnvidia-gpucomp-580:amd64         install
libnvidia-ml-dev:amd64              install
nvidia-cuda-dev:amd64               install
nvidia-dkms-580-open                install
nvidia-driver-assistant             install
nvidia-firmware-580                 install
nvidia-headless-580-open            install
nvidia-headless-no-dkms-580-open    install
nvidia-kernel-common-580            install
nvidia-kernel-source-580-open       install
nvidia-modprobe                     install
nvidia-persistenced                 install
nvidia-utils-580                    install
Log when hot-plugging the PNY OC 16GB 5060 Ti (log 251125)
A possible lead:

kernel: NVRM: Xid (PCI:0000:05:00): 79, GPU has fallen off the bus.
kernel: NVRM: GPU 0000:05:00.0: GPU has fallen off the bus.
...
kernel: [drm:nv_drm_dev_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000500] Failed to allocate NvKmsKapiDevice
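Xid 79 (“GPU has fallen off the bus”) is the classic symptom of a dropped Thunderbolt/PCIe link. It can be watched live with `journalctl -k -f | grep --line-buffered 'NVRM: Xid'`. A small sed helper (illustrative, not from any NVIDIA tool) to pull the Xid code out of such lines:

```shell
# extract_xid LINE -> prints the numeric Xid code from an NVRM kernel log line.
extract_xid() {
    printf '%s\n' "$1" | sed -n 's/.*NVRM: Xid ([^)]*): \([0-9]*\),.*/\1/p'
}

extract_xid 'kernel: NVRM: Xid (PCI:0000:05:00): 79, GPU has fallen off the bus.'
# prints: 79
```

The Xid code is what NVIDIA support asks for first; 79 almost always points at the link or power delivery rather than the GPU itself.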
Around the Internet one can find:
- kernel command line: pcie_aspm=off
  - with this, the PCI bridge is identified but nothing else
- kernel command line: nvidia.NVreg_EnableGpuFirmware=0
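For the record, kernel command-line options like these are set on Ubuntu/Mint via GRUB (a sketch; append to the existing line rather than replacing it):

```shell
# In /etc/default/grub, add the option to the default command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"
# then regenerate the GRUB config and reboot:
sudo update-grub
# After reboot, verify the option is active:
cat /proc/cmdline
```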
It looks like X11 sees the card:
$ inxi -G
Graphics:
Device-1: Intel Raptor Lake-P [Iris Xe Graphics] driver: i915 v: kernel
Device-2: NVIDIA driver: nvidia v: 580.105.08
Device-3: Chicony USB2.0 Camera driver: uvcvideo type: USB
Display: x11 server: X.Org v: 21.1.11 with: Xwayland v: 23.2.6 driver: X:
loaded: modesetting unloaded: fbdev,vesa dri: iris gpu: i915
resolution: 1920x1080~60Hz
API: EGL v: 1.5 drivers: iris,kms_swrast,swrast
platforms: gbm,x11,surfaceless,device
API: OpenGL v: 4.6 compat-v: 4.5 vendor: intel mesa
v: 25.0.7-0ubuntu0.24.04.2 renderer: Mesa Intel Iris Xe Graphics (RPL-P)
I blacklisted the nvidia-drm module to keep X11 out of the picture as much as possible:

$ cat /etc/modprobe.d/nvidia-blacklist-drm.conf
# Nvidia card used for AI inference, not for display.
#
# For this to take effect:
# $ sudo update-initramfs -u

# DRM (Direct Rendering Manager) integration for X11/Wayland.
blacklist nvidia-drm
# Display mode handling (resolutions, refresh rates).
blacklist nvidia-modeset
# Framebuffer driver for the console.
blacklist nvidia-fbdev
and

$ cat /etc/modprobe.d/nvidia-graphics-drivers-kms.conf
# Nvidia modesetting support. Set to 0 or comment to disable kernel modesetting
# and framebuffer console support. This must be disabled in case of Mosaic or SLI.
#options nvidia-drm modeset=1

# Inference only, no display duties.
options nvidia-drm modeset=0 fbdev=0
With the Wikingoo WKGL17-C50 bridge
With some models there is a “CUDA Error”, and the logs show:

kernel: NVRM: GPU at PCI:0000:05:00: GPU-ab296f23-e6a6-a23b-b6c1-33f9b813df84
kernel: NVRM: GPU Board Serial Number: 0
kernel: NVRM: Xid (PCI:0000:05:00): 79, pid=3191, name=llama-bench, GPU has fallen off the bus.
kernel: NVRM: GPU 0000:05:00.0: GPU has fallen off the bus.
kernel: NVRM: GPU 0000:05:00.0: GPU serial number is 0.
kernel: NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
...
Cyid TB3-HL7
Uses the JHL7540, which handles the Thunderbolt → PCIe link.
The PCI bridge is there, but not the NVIDIA card:

$ lspci -t -v -k
...
+-07.0-[03-2c]----00.0-[04-2c]--+-01.0-[05]--
|                               +-02.0-[06]----00.0  Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge DD 2018]
|                               \-04.0-[07-2c]--
...
The Gigabyte Windforce OC 12GB NVIDIA GeForce RTX 3060 is seen:

+-07.0-[03-2c]----00.0-[04-2c]--+-01.0-[05]--+-00.0  NVIDIA Corporation GA104 [GeForce RTX 3060]
|                               |            \-00.1  NVIDIA Corporation GA104 High Definition Audio Controller
|                               +-02.0-[06]----00.0  Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge DD 2018]
|                               \-04.0-[07-2c]--
You have to find the right physical “position”. It is very finicky. 😩
The PNY GeForce RTX 5060 Ti is also present on the PCI bridge, but not identified:

+-07.0-[03-2c]----00.0-[04-2c]--+-01.0-[05]--+-00.0  NVIDIA Corporation Device 2d04
|                               |            \-00.1  NVIDIA Corporation Device 22eb
|                               +-02.0-[06]----00.0  Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge DD 2018]
|                               \-04.0-[07-2c]--

As a result, nvidia-smi doesn't see it, and neither do the other tools. 😩
$ sudo ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:07.0/0000:03:00.0/0000:04:01.0/0000:05:00.0 ==
modalias : pci:v000010DEd00002D04sv0000196Esd00001440bc03sc00i00
vendor   : NVIDIA Corporation
manual_install: True
driver   : nvidia-driver-575-open - third-party non-free
driver   : nvidia-driver-570 - third-party non-free
driver   : nvidia-driver-575 - third-party non-free
driver   : nvidia-driver-580 - third-party non-free recommended
driver   : nvidia-driver-580-server-open - distro non-free
driver   : nvidia-driver-580-open - third-party non-free
driver   : nvidia-driver-580-server - distro non-free
driver   : nvidia-driver-570-open - third-party non-free
driver   : nvidia-driver-570-server-open - distro non-free
driver   : nvidia-driver-570-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
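The “Device 2d04” in the lspci tree just means the local pciids database doesn't know the card yet; the modalias above carries the same vendor/device IDs (`sudo update-pciids` from pciutils refreshes the database). A small helper (illustrative; the function name is mine) to decode those IDs from a PCI modalias string:

```shell
# modalias_ids STRING -> prints "vendor=XXXX device=XXXX" from a PCI modalias.
modalias_ids() {
    printf '%s\n' "$1" | sed -n 's/pci:v0000\([0-9A-F]*\)d0000\([0-9A-F]*\).*/vendor=\1 device=\2/p'
}

modalias_ids 'pci:v000010DEd00002D04sv0000196Esd00001440bc03sc00i00'
# prints: vendor=10DE device=2D04  (10DE = NVIDIA)
```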
Wikingoo WKGL17-C50
Thunderbolt 4 PCI Bridge.
kernel: thunderbolt 0-1: new device found, vendor=0x215 device=0x41
kernel: thunderbolt 0-1: TB4 HOME TB4 eGFX
boltd[1107]: [c9030000-0080-TB4 eGFX        ] parent is e49f8780-a06c...
boltd[1107]: [c9030000-0080-TB4 eGFX        ] connected: connected (/sys/devices/pci0000:00/0000:00:0d.2/domain0/0-0/0-1)
boltd[1107]: [c9030000-0080-TB4 eGFX        ] auto-auth: authmode: enabled, policy: iommu, iommu: yes -> ok
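If boltd's auto-authorization ever fails, the device can be inspected and authorized by hand with boltctl (from the `bolt` package). A sketch; `<uuid>` is a placeholder for the UUID that `boltctl list` prints:

```shell
# List Thunderbolt devices and their authorization state:
boltctl list
# Authorize one device for this session (use `boltctl enroll` to persist it):
boltctl authorize <uuid>
```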
