PyTorch Bug: Corrupted Tensors After Failed Resize
In the world of deep learning, PyTorch is a powerhouse, enabling researchers and developers to build and train complex neural networks with remarkable flexibility. However, even the most robust libraries can sometimes harbor unexpected quirks. Recently, a peculiar bug has surfaced concerning how PyTorch handles tensor operations, specifically when resizing tensors that share storage with non-resizable buffers. This issue can lead to what we'll affectionately call "zombie tensors" – objects that appear to have one shape but are fundamentally corrupted, often resulting in perplexing crashes. Let's dive deep into this intriguing problem and understand its implications.
The Core of the Issue: A Sneaky Metadata Update
The heart of this bug lies in the resize_() operation within PyTorch. When you attempt to resize a tensor that is backed by storage that cannot be resized (such as the buffer of a NumPy array that has been attached to a PyTorch tensor using set_()), PyTorch is designed to throw an error. Specifically, you'll encounter a RuntimeError with a message like: "Trying to resize storage that is not resizable." This is the expected and correct behavior, as it alerts you to an invalid operation. The problem is the order of operations inside resize_(): before PyTorch checks whether the underlying storage can actually be resized, it has already updated the tensor's shape and stride metadata. This means that even though the operation ultimately fails, the tensor's size and stride fields are left describing the new, desired shape.
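To make the ordering concrete, here is a deliberately simplified, self-contained Python sketch. FakeTensor and buggy_resize_ are hypothetical stand-ins invented for illustration; the real logic lives in PyTorch's C++ core, but the sequence of steps mirrors the reported behavior.
# Hypothetical model of the ordering problem; not PyTorch internals.
class FakeTensor:
    def __init__(self):
        self.shape = (0,)
        self.storage_nbytes = 0          # backed by a locked, 0-byte buffer
        self.storage_resizable = False

def buggy_resize_(t, new_shape, element_size=4):
    # Step 1: shape metadata is updated unconditionally.
    t.shape = tuple(new_shape)
    # Step 2: only now is the storage checked.
    needed = element_size
    for dim in new_shape:
        needed *= dim
    if needed > t.storage_nbytes and not t.storage_resizable:
        raise RuntimeError("Trying to resize storage that is not resizable")

t = FakeTensor()
try:
    buggy_resize_(t, (5, 5, 5))
except RuntimeError:
    pass
print(t.shape, t.storage_nbytes)   # (5, 5, 5) 0 -- the same inconsistency
Because step 1 is never rolled back when step 2 throws, the metadata and the storage disagree from that point on.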
Imagine you have a tensor that's essentially a view into a NumPy array. NumPy arrays, by their nature, have fixed storage. If you then try to use resize_() on this PyTorch tensor, PyTorch first changes its mind about what size the tensor should be, and then discovers that the NumPy array storage can't accommodate this new size. The result is a tensor that thinks it's, say, a 5x5x5 multidimensional array, but its actual storage is still the original, empty, 0-byte buffer. This creates a deeply unsettling inconsistency – a "zombie tensor". It possesses the shape metadata of a much larger object, yet its underlying storage is practically non-existent.
This inconsistency is the breeding ground for further problems. Any subsequent attempt to interact with this corrupted tensor – whether it's printing its contents, accessing its elements, or performing further operations – is likely to lead to a cascade of errors. You might see a RuntimeError indicating a mismatch between dimensions and size, or worse, a Segmentation Fault, which is a much more severe low-level memory access error. These crashes can be incredibly difficult to debug, especially when the corrupted tensor is created deep within a complex model or training loop, far removed from the initial resize_() call that caused the corruption.
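As a defensive measure, a corrupted tensor of this kind can often be detected before it crashes anything. Here is a rough sanity check, a sketch that assumes a contiguous tensor with a storage offset of zero; looks_corrupted is a hypothetical helper, not a PyTorch API. It compares the bytes the shape claims to need against the bytes the storage actually holds:
import torch

def looks_corrupted(t: torch.Tensor) -> bool:
    # Bytes implied by the tensor's shape and dtype.
    needed = t.numel() * t.element_size()
    # Bytes actually held by the underlying storage.
    available = t.untyped_storage().nbytes()
    return needed > available
For the zombie tensor produced below, numel() reports 125 elements of 4 bytes each while the storage holds 0 bytes, so this check flags it immediately.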
Reproducing the "Zombie Tensor" Phenomenon
To truly understand a bug, it's essential to be able to reproduce it reliably. The PyTorch team has provided a minimal, yet effective, reproduction case that clearly demonstrates this issue. It starts by creating a tensor with empty, non-resizable storage: a zero-element NumPy array is wrapped with torch.from_numpy(), and its untyped_storage() is extracted. Because torch.from_numpy() shares the NumPy buffer rather than copying it, that storage cannot grow. The resulting locked_storage is then injected into a fresh PyTorch tensor via set_().
import torch
import numpy as np
# Create non-resizable storage (0 bytes)
locked_storage = torch.from_numpy(np.array([], dtype=np.int32)).untyped_storage()
# Inject into a fresh tensor
t = torch.tensor([], dtype=torch.int32)
t.set_(locked_storage)
# Attempt to resize (Expected: Fail, maintain original shape)
# (Actual: Fails, but updates shape to 5x5x5)
try:
    t.resize_((5, 5, 5))
except RuntimeError:
    pass
# Verify corruption
print(f"Shape: {t.shape}")
print(f"Storage: {t.untyped_storage().nbytes()}")
print(t) # CRASH
When this code snippet is executed, the try...except block is designed to catch the expected RuntimeError when t.resize_((5, 5, 5)) is called. Indeed, the RuntimeError is caught, confirming that PyTorch recognizes the storage is not resizable. However, the subsequent print statements reveal the problem:
- Shape: torch.Size([5, 5, 5]): the tensor's shape has been misleadingly updated to reflect the target size of (5, 5, 5).
- Storage: 0: the underlying storage remains at 0 bytes, nowhere near the 500 bytes (125 int32 elements at 4 bytes each) a 5x5x5 tensor requires.
Finally, print(t) triggers the crash. This could manifest as a RuntimeError directly from PyTorch, or as a more severe segmentation fault, depending on the exact circumstances and the PyTorch version. The expected behavior, as outlined by the bug report, is that if resize_() throws an exception due to locked storage, the tensor's metadata (shape and stride) should remain unchanged. In this case, the shape should have stayed torch.Size([0]), satisfying the strong exception guarantee: a failed operation should leave the program state exactly as it was before the call.
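Until the underlying issue is fixed, one possible user-level workaround is to approximate that guarantee yourself. The sketch below is only an illustration (safe_resize_ is a hypothetical wrapper, not a PyTorch API): it snapshots the tensor's shape, stride, and storage offset before calling resize_(), and restores them with set_() if the call throws.
import torch
import numpy as np

def safe_resize_(t: torch.Tensor, new_shape):
    # Snapshot the metadata so it can be restored if resize_() fails.
    old_shape, old_stride = t.shape, t.stride()
    old_offset = t.storage_offset()
    try:
        t.resize_(new_shape)
    except RuntimeError:
        # Roll back the shape/stride update the failed resize_() left behind.
        t.set_(t.untyped_storage(), old_offset, old_shape, old_stride)
        raise

locked_storage = torch.from_numpy(np.array([], dtype=np.int32)).untyped_storage()
t = torch.tensor([], dtype=torch.int32)
t.set_(locked_storage)
try:
    safe_resize_(t, (5, 5, 5))
except RuntimeError:
    pass
print(t.shape)   # torch.Size([0]) -- metadata restored, no zombie tensor
This is only a band-aid around the reported behavior; the proper fix is for resize_() itself to defer the metadata update until the storage check has passed.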
Why This Matters: The Importance of Exception Safety
The concept of exception safety is crucial in software development, especially in libraries that are used for complex computations like PyTorch. When an operation fails, the library should leave the program in a consistent state. There are generally three levels of exception safety:
- Basic Exception Safety: Guarantees that if an exception occurs, the program remains in a valid, usable state, though its specific values might change.
- Strong Exception Guarantee: Guarantees that if an operation fails, the program state is unchanged, as if the operation never happened. This is often referred to as the