AutoPartGen: Autoregressive 3D Part Generation from Meta AI
- Smars
- Open Source, AI Research
- 30 Jun, 2026
Why This Project Exists
Most 3D generation models treat objects as monolithic blobs. You feed in a text prompt or an image, and out comes a single mesh — one undifferentiated piece of geometry with no understanding of what constitutes a chair leg versus a chair seat. For downstream tasks like animation, rigging, or scene composition, this is a dead end. You get a model that looks right but cannot be manipulated in meaningful ways.
The alternative — manual 3D part segmentation — is tedious work that skilled artists do reluctantly and everyone else avoids entirely. You need deep domain knowledge to know where the boundaries between parts should fall, and the process does not scale.
AutoPartGen from Meta AI and Oxford’s Visual Geometry Group takes a different approach. Instead of generating a single mesh, it generates parts one at a time in an autoregressive sequence. Give it an image, a mesh, or a set of 2D masks, and it produces a decomposed 3D object where each part is a separate, usable mesh. The model decides autonomously how many parts an object should have and where the boundaries lie.
The paper appeared at NeurIPS 2025 and ships with code, weights, and a working pipeline. The public release is a re-implementation built on TripoSG components, available under a noncommercial research license.
What It Actually Does
AutoPartGen builds on 3DShape2VecSet, a latent 3D representation that turns voxel grids into compact latent vectors. The key insight is that this latent space exhibits strong compositional properties — parts of an object can be represented as additive combinations of latent vectors, which makes it natural for part-based generation.
The generation pipeline works as follows:
Input conditioning. The model accepts three types of inputs, alone or in combination:
- An image of an object (DINOv2 features extracted as conditioning)
- An indexed 2D mask where each region corresponds to a desired part
- An existing 3D mesh (reconstructed, scanned, or generated)
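An indexed mask can be pictured as a 2D grid of integers, one label per pixel. The sketch below uses plain Python and assumes that 0 means background and each positive integer names one desired part. That labeling convention is an assumption for illustration, not something taken from the AutoPartGen repository.

```python
from collections import Counter

# Hypothetical 6x8 indexed mask: 0 = background (assumed), 1..N = desired parts.
mask = [
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 2, 2, 2, 2, 2, 2, 0],
    [0, 2, 2, 2, 2, 2, 2, 0],
    [0, 3, 0, 0, 0, 0, 4, 0],
    [0, 3, 0, 0, 0, 0, 4, 0],
]


def part_pixel_counts(indexed_mask, background=0):
    """Return {part_id: pixel_count} for every non-background label."""
    counts = Counter(value for row in indexed_mask for value in row)
    counts.pop(background, None)  # the background is not a part
    return dict(sorted(counts.items()))


counts = part_pixel_counts(mask)
print(f"{len(counts)} requested parts:", counts)
```

Counting pixels per label is a cheap sanity check before inference: a label covering only a handful of pixels is probably a painting mistake rather than an intended part.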
Autoregressive part generation. Rather than predicting all parts simultaneously, AutoPartGen generates one part at a time. At each step, a DiT (Diffusion Transformer) predicts the next part’s latent representation conditioned on previously generated parts, the overall object shape, and the input modality. The process continues until the model outputs an end-of-sequence token — meaning it determines autonomously when all parts have been produced.
Latent-to-mesh conversion. Each predicted latent is decoded through a Shape VAE back into geometry, then extracted as an iso-surface using either DiffDMC (faster, requires a CUDA extension) or a marching cubes fallback. Parts are simplified, cleaned of floaters, and optionally smoothed before export.
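"Floaters" are small disconnected fragments that iso-surface extraction tends to leave behind. One common way to remove them is to group faces into connected components and drop the small ones. The sketch below does this with a union-find over vertex indices. It is an illustration of the general idea, not the repository's post-processing code, and the `min_faces` threshold is a hypothetical parameter.

```python
def remove_floaters(faces, min_faces=2):
    """Keep only faces whose connected component has at least min_faces faces.

    faces: list of (i, j, k) vertex-index triples.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # All three vertices of a face belong to the same component.
    for i, j, k in faces:
        union(i, j)
        union(j, k)

    # Count how many faces each component contains.
    sizes = {}
    for face in faces:
        root = find(face[0])
        sizes[root] = sizes.get(root, 0) + 1

    return [face for face in faces if sizes[find(face[0])] >= min_faces]


# Two triangles sharing an edge, plus one isolated triangle (the floater).
faces = [(0, 1, 2), (1, 2, 3), (10, 11, 12)]
cleaned = remove_floaters(faces)
print(cleaned)
```

A real pipeline would more likely threshold on surface area or volume than on raw face count, but the component-grouping step is the same.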
Assembly. The individual part meshes can be reassembled into a coherent whole without additional optimization. Each part is a separate GLB file, and a combined mesh is also exported.
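If the parts already share one coordinate frame, reassembly reduces to concatenating vertex lists and shifting each part's face indices by the number of vertices that came before it. The sketch below shows that bookkeeping in plain Python. The `(vertices, faces)` tuple layout is an assumption for illustration, and this is not the repository's export code.

```python
def combine_parts(parts):
    """Merge part meshes into a single mesh.

    parts: list of (vertices, faces), where vertices is a list of (x, y, z)
    tuples and faces is a list of (i, j, k) indices into that part's vertices.
    """
    all_vertices, all_faces = [], []
    for vertices, faces in parts:
        offset = len(all_vertices)  # index shift for this part's faces
        all_vertices.extend(vertices)
        all_faces.extend(tuple(i + offset for i in face) for face in faces)
    return all_vertices, all_faces


seat = ([(0, 0, 1), (1, 0, 1), (0, 1, 1)], [(0, 1, 2)])
leg = ([(0, 0, 0), (0.1, 0, 0), (0, 0, 1)], [(0, 1, 2)])

vertices, faces = combine_parts([seat, leg])
print(len(vertices), faces)
```

The second part's face now points at vertices 3, 4, and 5, which is why the parts remain individually addressable inside the combined mesh.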
The whole pipeline runs in under a minute on a single GPU for typical objects.
Who Would Use This
AutoPartGen targets a specific gap in the 3D content pipeline — the space between “generate a mesh” and “make that mesh usable for animation, rigging, or scene assembly.”
- 3D content creators: Automatic part decomposition means you can take a generated or scanned object and immediately get separate meshes for each logical component. No manual separation required.
- Animation and rigging workflows: Rigging a monolithic mesh is painful. Rigging individual parts that already correspond to semantic components (legs, arms, torso) is dramatically faster. AutoPartGen provides this decomposition automatically.
- Game development: Scene composition with part-level control allows procedural variation: swap out chair legs, change table surfaces, combine parts from different objects. The part-level granularity makes this possible without manual mesh editing.
- Robotics and simulation: Simulated environments need objects with meaningful articulation points. AutoPartGen’s part decomposition maps naturally to joint locations and rigid body segments.
- Research: The autoregressive approach to 3D part generation is itself a research contribution. The codebase provides a clean reference implementation for anyone working on compositional 3D generation.
When You Might Not Need It
Some caveats worth knowing before investing time:
- Noncommercial license only. The FAIR Noncommercial Research License means you cannot use this in production or in commercial products. Academic and research use is the intended scope. If you need commercial 3D part generation, this is not the tool.
- Re-implementation, not the original. The released code is a re-implementation using TripoSG components. The authors note it may underperform the original system reported in the paper, so expect results to differ from the paper’s quantitative evaluations.
- Single GPU, moderate memory. The 2048-latent checkpoint is the default release path. The optional 4096-latent checkpoint improves fine detail but requires more VRAM and runs slower. Both need a CUDA-capable GPU.
- Input quality matters. The model conditions heavily on input images. Poor lighting, heavy occlusion, or unusual angles will produce worse decompositions. The background-removal step (RMBG-1.4) helps, but that model is itself licensed for noncommercial use.
- Part semantics are learned, not specified. You cannot tell the model “give me exactly 5 parts with these names.” The model decides part count and boundaries based on its training. Indexed masks provide some control, but the underlying generation is still probabilistic.
- Early-stage project. The repository shows 14 stars on GitHub and 3 commits. The paper is solid but the codebase is fresh. Expect rough edges, limited documentation, and potentially breaking changes.
Quick Start
Prerequisites: Python 3.10, CUDA 12.1, a GPU with sufficient VRAM (the default model works on consumer cards).
Step 1 — Environment setup:
conda create -n autopartgen python=3.10 pip -y
conda activate autopartgen
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
pip install -e .
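Before downloading weights, it can help to confirm that the installed PyTorch build actually sees the GPU. The sketch below is a generic check, not part of the AutoPartGen repository. It returns a report even when `torch` is missing, so it is safe to run in any environment.

```python
import importlib.util


def check_environment():
    """Report whether torch is importable and whether it can see a CUDA device."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_build": None,
        "cuda_available": False,
    }
    if importlib.util.find_spec("torch") is None:
        return report
    try:
        import torch
    except ImportError:
        return report
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


env_report = check_environment()
print(env_report)
```

With the pinned install above, a healthy setup should report a 2.5.1 version string and a CUDA 12.1 build. If `cuda_available` is false, the usual suspects are a CPU-only wheel or a driver mismatch.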
Step 2 — Download weights from Hugging Face:
hf download facebook/autopartgen \
    autopartgen_dit.pth autopartgen_vae.pth \
    --local-dir checkpoints
Step 3 — Run inference on an image:
python inference.py \
    --image examples/image/apple_character/image.png \
    --output_path outputs/apple_character
Output: a directory containing mesh_000.glb, mesh_001.glb, etc. (individual parts) plus mesh_combined.glb (the assembled whole).
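For downstream scripts, the output directory can be split into numbered part files and the combined file by name alone. The sketch below relies only on the `mesh_NNN.glb` and `mesh_combined.glb` naming described above. It runs against a temporary directory of empty placeholder files so that it works without running inference first.

```python
import tempfile
from pathlib import Path


def list_outputs(output_dir):
    """Return (sorted part files, combined file or None) for an output directory."""
    out = Path(output_dir)
    # Numbered parts look like mesh_000.glb; mesh_combined.glb is excluded here.
    parts = sorted(
        p for p in out.glob("mesh_*.glb") if p.stem[len("mesh_"):].isdigit()
    )
    combined = out / "mesh_combined.glb"
    return parts, (combined if combined.exists() else None)


# Demo on placeholder files; point list_outputs at your real output path instead.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("mesh_000.glb", "mesh_001.glb", "mesh_combined.glb"):
        (Path(tmp) / name).touch()
    parts, combined = list_outputs(tmp)
    part_names = [p.name for p in parts]
    combined_name = combined.name if combined else None

print(part_names, combined_name)
```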
The Architecture Under The Hood
Three components do the heavy lifting:
3DShape2VecSet VAE. Converts between voxel grids and latent vectors. The latent space has compositional properties that make part-level generation natural — you can add latent vectors to combine parts, and the decoder produces coherent geometry from the sum.
DiT backbone. A Diffusion Transformer that predicts the next part’s latent representation. It conditions on previously generated parts (preventing overlap), the overall object shape (maintaining coherence), and input features (DINOv2 for images, vertex features for meshes). Two checkpoints: 2048 latent tokens (default, faster) and 4096 latent tokens (finetuned, better detail).
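A rough way to see why the 4096-token checkpoint costs more: if the transformer applies full self-attention over the latent tokens (an assumption about the architecture, not a statement from the paper or the repository), the number of token pairs grows with the square of the token count.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs in full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


ratio = attention_pairs(4096) / attention_pairs(2048)
print(ratio)  # 4.0
```

Doubling the tokens therefore quadruples the attention score computation per layer. Actual runtime and memory depend on the attention kernel and on everything else in the network, so treat this as an upper-level intuition only.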
Autoregressive loop. The generation loop predicts one part, checks if the end token was emitted, and continues if not. This means the model learns to decide when an object has “enough” parts — a chair gets 4 legs plus a seat, a car gets wheels plus body, without explicit instruction.
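The control flow of that loop can be sketched in a few lines. In the version below, `predict_next` is a hypothetical stand-in for the diffusion model, the string labels stand in for part latents, and the `max_parts` safety cap is an added assumption, not a documented setting.

```python
from typing import Callable, List, Optional

END = None  # stand-in for the end-of-sequence signal


def generate_parts(
    predict_next: Callable[[List[str]], Optional[str]],
    max_parts: int = 32,
) -> List[str]:
    """Call predict_next on the parts so far until it signals END."""
    parts: List[str] = []
    while len(parts) < max_parts:
        next_part = predict_next(parts)  # conditioned on everything generated so far
        if next_part is END:
            break
        parts.append(next_part)
    return parts


def toy_predictor(previous: List[str]) -> Optional[str]:
    """Toy model: emits a fixed chair decomposition, then the end signal."""
    plan = ["seat", "leg_1", "leg_2", "leg_3", "leg_4"]
    return plan[len(previous)] if len(previous) < len(plan) else END


chair_parts = generate_parts(toy_predictor)
print(chair_parts)
```

The caller never passes a part count. The sequence length falls out of when the predictor chooses to stop, which is the property the paragraph above describes.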
The default guidance settings vary by mode: image-only generation uses different CFG scales than mesh-conditioned or mask-conditioned generation. The config files document these values.
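For readers unfamiliar with CFG (classifier-free guidance), the scale controls how far the conditional prediction is pushed away from the unconditional one. The sketch below shows the standard combination rule on plain lists. The scale values used are arbitrary illustrations, not the values from the repository's config files.

```python
def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond_pred = [0.0, 1.0]  # model output without conditioning
cond_pred = [1.0, 3.0]    # model output with conditioning

# scale = 1.0 reproduces the conditional prediction; larger values extrapolate.
print(cfg_combine(uncond_pred, cond_pred, 1.0))
print(cfg_combine(uncond_pred, cond_pred, 2.0))
```

A higher scale follows the conditioning signal more strictly, usually at some cost in diversity, which is one plausible reason different input modes would ship with different defaults.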
Community and Maintenance
Authored by researchers at Meta AI and Oxford VGG, published July 2025. The repository shows 14 stars and 1 fork — early stage by any measure.
The repository history contains 3 commits. The codebase is clean but minimal. Dependencies are declared in pyproject.toml and requirements.txt. The Hugging Face model page is properly organized with documentation.
The FAIR Noncommercial Research License is explicit: research use only. This is not a commercial product and should not be treated as one. Forking for internal research is permitted and straightforward.
The acknowledgements section reveals significant upstream dependencies: TripoSG for the VAE and DiT structure, HunyuanDiT for transformer blocks, DINOv2 for image features, DiffDMC for surface extraction, and TRELLIS for post-processing. This is a research integration effort as much as a novel contribution.
Bottom Line
AutoPartGen solves a real problem in 3D content pipelines: going from a single object to a decomposed, part-level representation without manual intervention. The autoregressive approach — generating parts sequentially until the model decides it is done — is elegant and produces results that other methods struggle to match.
The practical constraint is the noncommercial license. For researchers working on 3D generation, compositional scene synthesis, or part-level manipulation, this is a valuable reference implementation with released weights. For anyone hoping to use this in a product, the license makes that a non-starter.
The paper is worth reading regardless. The insight that 3DShape2VecSet latent spaces have compositional properties suitable for part generation is itself a useful finding for the broader 3D generation community.
Model weights: https://huggingface.co/facebook/autopartgen