Environment Setup Notes

4 min read

Mostly a record of today's painful experience setting up an environment and running code on the servers.

Preface

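This week I need to reproduce the relbench experiment code. I had already run a few of the simpler experiments locally, and over these two days I need to get all of the experiments run.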

Server environment

The smaller one:

    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla T4                       Off | 00000000:3B:00.0 Off |                    0 |
    | N/A   37C    P0              26W /  70W |  12154MiB / 15360MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    |   1  Tesla T4                       Off | 00000000:5E:00.0 Off |                    0 |
    | N/A   26C    P8               9W /  70W |      4MiB / 15360MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    |   2  Tesla T4                       Off | 00000000:B1:00.0 Off |                    0 |
    | N/A   25C    P8               9W /  70W |      4MiB / 15360MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    |   3  Tesla T4                       Off | 00000000:D9:00.0 Off |                    0 |
    | N/A   25C    P8               9W /  70W |      4MiB / 15360MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+

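The GPU memory on this server is too small. At first I ran everything here, but the later graph models all failed with out of memory, so from then on I only used this server for the LightGBM tree models.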
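On a shared machine like this one, where GPU 0 already had about 12 GB in use, it can help to pick the emptiest card before launching a job. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_memory`, `pick_freest_gpu` and `query_nvidia_smi` are hypothetical and not part of relbench. It will not make a model fit on a 15 GB T4 if it simply needs more memory than that.

```python
import subprocess

# Ask nvidia-smi for plain CSV: one line per GPU, memory values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse lines like '0, 12154, 15360' into (index, free_mib) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        gpus.append((index, total - used))
    return gpus


def pick_freest_gpu(csv_text):
    """Return the index of the GPU with the most free memory.

    On a tie, the lowest-numbered GPU among the tied ones is returned.
    """
    return max(parse_gpu_memory(csv_text), key=lambda gpu: gpu[1])[0]


def query_nvidia_smi():
    """Run nvidia-smi and return its CSV output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `pick_freest_gpu(query_nvidia_smi())` and exporting the result as `CUDA_VISIBLE_DEVICES` before starting the training script would pin the job to that card.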

The larger one (3090):

    nvidia-smi
    Sun Jun 29 19:15:11 2025
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  On   | 00000000:25:00.0 Off |                  N/A |
    | 42%   52C    P2   123W / 350W |   2447MiB / 24576MiB |     21%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA GeForce ...  On   | 00000000:41:00.0 Off |                  N/A |
    | 30%   30C    P8    25W / 350W |      3MiB / 24576MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA GeForce ...  On   | 00000000:61:00.0 Off |                  N/A |
    | 40%   40C    P2   104W / 350W |   2063MiB / 24576MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA GeForce ...  On   | 00000000:81:00.0 Off |                  N/A |
    | 30%   29C    P8    21W / 350W |      3MiB / 24576MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   4  NVIDIA GeForce ...  On   | 00000000:A1:00.0 Off |                  N/A |
    | 39%   29C    P8    20W / 350W |      3MiB / 24576MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   5  NVIDIA GeForce ...  On   | 00000000:C1:00.0 Off |                  N/A |
    | 30%   29C    P8    21W / 350W |      3MiB / 24576MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   6  NVIDIA GeForce ...  On   | 00000000:E1:00.0 Off |                  N/A |
    | 77%   55C    P2   117W / 350W |   3511MiB / 24576MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

This is a server account that was only opened today, so most of the time went into setting up the environment.

Environment setup process

1. SSH connection

First I connected to the server over SSH and changed the initial default password.

2. conda

Installed conda. At first I put the environment into base by mistake, and after that I ended up reinstalling conda more than once.
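Installing into base is easy to do without noticing. One possible guard, sketched here under the assumption that the shell was set up with `conda activate` (which exports `CONDA_DEFAULT_ENV`), is a small check to run at the top of any install script. The function names are hypothetical.

```python
import os


def active_conda_env(environ=None):
    """Return the active conda environment name, or None outside conda."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


def ensure_not_base(environ=None):
    """Raise if no conda environment is active or if the active one is base."""
    env = active_conda_env(environ)
    if env is None:
        raise RuntimeError("no conda environment is active")
    if env == "base":
        raise RuntimeError("refusing to install into base; activate a dedicated environment first")
    return env
```

The `environ` argument exists only so the check can be exercised with a plain dictionary; in a real script `ensure_not_base()` with no arguments reads the process environment.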

3. pip relbench

4. An afternoon lost to environment compatibility

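The CUDA version on the new server is 11.8, but the torch that gets pulled in when relbench is installed directly with pip requires CUDA 12.6. Next came the torch_geometric version problem: torch versions that are too new are not compatible with it. After that it was scikit-learn's squared parameter. There was also the question of where models get cached, which I did not deal with in the end.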
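The CUDA mismatch can be caught before installing anything by comparing the "CUDA Version" field in the `nvidia-smi` header, which is the newest CUDA runtime the driver supports, with the CUDA version a torch wheel was built for. Below is a rough sketch with hypothetical helper names. The three labels are a simplification: NVIDIA documents minor-version compatibility within one major release, so a same-major but newer build is reported as "maybe" rather than rejected.

```python
import re


def parse_version(text):
    """Turn '11.8' into (11, 8) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_cuda_version(smi_header):
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi's header line."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_header)
    if match is None:
        raise ValueError("no CUDA version found in nvidia-smi output")
    return parse_version(match.group(1))


def classify_wheel(smi_header, wheel_cuda):
    """Classify a wheel built for `wheel_cuda` against the driver's CUDA version."""
    driver = driver_cuda_version(smi_header)
    wheel = parse_version(wheel_cuda)
    if wheel <= driver:
        return "supported"
    if wheel[0] == driver[0]:
        # Same major version: may still run via minor-version compatibility.
        return "maybe"
    return "unsupported"
```

For the 3090 server above, `classify_wheel(header, "12.6")` comes out as "unsupported" and `classify_wheel(header, "11.8")` as "supported". PyTorch publishes CUDA 11.8 builds under its cu118 wheel index on download.pytorch.org; which of those torch versions also works with torch_geometric still has to be checked separately.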
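The scikit-learn issue is most likely the `squared` keyword of `mean_squared_error`, which was deprecated in scikit-learn 1.4 (the release that added `root_mean_squared_error`) and removed in 1.6. That is an assumption, since the traceback is not recorded here. If so, one way around it without pinning an old scikit-learn is a small wrapper like the sketch below; the name `rmse` is hypothetical, and the pure-Python fallback is there only so the function also works where the newer function is unavailable.

```python
import math

try:
    # Available in scikit-learn 1.4 and later.
    from sklearn.metrics import root_mean_squared_error as _sk_rmse
except ImportError:
    # Older scikit-learn, or scikit-learn not installed at all.
    _sk_rmse = None


def rmse(y_true, y_pred):
    """Root mean squared error that does not rely on the `squared` keyword."""
    if _sk_rmse is not None:
        return float(_sk_rmse(y_true, y_pred))
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

The alternative is to pin scikit-learn below 1.6 in the environment, which leaves the calling code untouched but adds one more version constraint to juggle alongside torch and torch_geometric.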