Pre-trained models & data sourcing #66
Replies: 16 comments 33 replies
-
@lucidrains Hey, I know you aren't actively open-source researching, but I just wanted to let you know that the linear attention layers really paid off! Off to train the transformer! :) Early results, but while training on a smaller dataset it took around 6 epochs to reach a loss of 3, and using the 14k-model dataset it reached that in about 1 epoch, which is a sign that the transformer scales well with data.
-
Hi @MarcusLoppe, thanks for the great work : )
-
Hi @MarcusLoppe, excellent work. Can you share the dataset on Google Drive? My country's internet is bad. Thank you very much. : )
-
Thank you for your generous sharing! I read all your posts in the discussion area and got a lot of inspiration. I'm especially impressed by your enthusiasm for working on this with pretty limited computing resources. I always take insufficient computing resources as an excuse for my laziness; your persistent trials are really a good example to me!
-
Besides, I wonder if there are any tutorials or reference code on how to simplify a mesh, i.e. reducing the face count from more than 20k to fewer than 800 so that training is possible. Any help on this will be much appreciated!
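For reference, one common approach (not necessarily what was used in this project) is quadric-error decimation; a minimal sketch with Open3D, where "input.obj" and the 800-triangle target are placeholders:

```python
import open3d as o3d

# Hedged sketch: decimate a dense mesh down to a target triangle count.
mesh = o3d.io.read_triangle_mesh("input.obj")                        # placeholder path
mesh = mesh.simplify_quadric_decimation(target_number_of_triangles=800)
mesh.remove_degenerate_triangles()                                   # clean up decimation artifacts
mesh.remove_duplicated_vertices()
print(f"faces after decimation: {len(mesh.triangles)}")
o3d.io.write_triangle_mesh("simplified.obj", mesh)
```

Note that going from 20k+ faces down to under 800 is a very aggressive reduction, so the simplified mesh may lose a lot of detail.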
-
Hi @MarcusLoppe, this is really exciting stuff! Referring to the image in your first post on this page: was it made with models that were part of the training data, or was it from a separate set of validation/test data? I did a comparison with the demo_mesh models (loading the autoencoder from your 0.47 checkpoint) but failed to reproduce the mesh. When training on the (augmented) demo_mesh models, the corresponding result was (as expected) almost flawless.
-
Hey @MarcusLoppe, great work on using this repository to reproduce/improve upon the results from the MeshGPT paper, at least with respect to the autoencoder training. I tried to reproduce your results using your dataset from the Google Drive and the MeshGPT_demo.ipynb notebook (I just hacked in the npz dataset and did not do any data duplication). I managed ~50 epochs in 24 hours; however, I only got to a reconstruction loss around 1.0. Especially after the first 10 epochs, training slows down considerably. Even now, after 48 hours, I'm only looking at a loss around 0.7. Is there any other trick you use to get to 0.36 loss in 24 hours, or are you simply using a more powerful GPU that can do more epochs in the same time?
-
Hi @MarcusLoppe, thanks for your great work!
-
@MarcusLoppe Are you on Discord? It would be great to have a Discord server for all the people interested in extending MeshGPT. My Discord id is
-
Hi @MarcusLoppe, thank you for the great work and the checkpoints. I was wondering what augmentations you used to increase the dataset size.
-
Thank you so much for the wonderful contribution! It's a little bit late to be involved in the discussion, but I am just reproducing the results with your provided pre-trained autoencoder and GPT-2 small/medium, getting something like the figure here with a text prompt only. Am I missing something important, or does it still require category-specific fine-tuning? I currently have 8 free 4090s, and I am wondering whether you would still be interested in collaborating on training it? I would very much like to share the weights as well.
-
I have received your email.
-
Hello @MarcusLoppe, this is a very interesting project, well done. I have tried your model and it is impressive; however, if you input more than a single word it weirdly breaks everything. Any idea why this happens? For example, this is the result for "tree":, this one for "a tall tree":, and this is the result for "a tall tree with small leaves":
-
Hi, I have read part of the MeshAutoencoder code and I would like to ask about the encoder's encode section: how are the input vertices, faces, face_edges, face_mask and face_edges_mask processed? In other words, what happens when you load a mesh model and extract features with encode normally? I tried to emulate your code using PyTorch, but I ran into a lot of difficulties, most importantly calculating the edges and connecting the vertex data to the face data. Please help me. @MarcusLoppe
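As a rough sketch of the idea (not the repository's actual implementation, and with hypothetical function names), the two steps that usually trip people up are gathering per-face vertex coordinates and deriving face_edges, i.e. pairs of faces that share an edge:

```python
import torch

def gather_face_coords(vertices, faces):
    # vertices: (num_vertices, 3) float tensor of xyz coordinates
    # faces:    (num_faces, 3) long tensor of vertex indices
    # returns:  (num_faces, 3, 3) coordinates of each face's three corners
    return vertices[faces]

def derive_face_edges(faces):
    # Treat two faces as connected when they share at least two vertices (a common edge).
    num_faces = faces.shape[0]
    # (F, F, 3, 3): compare every vertex index of face i with every vertex index of face j
    shared = faces.unsqueeze(1).unsqueeze(-1) == faces.unsqueeze(0).unsqueeze(-2)
    shared_count = shared.any(-1).sum(-1)                        # (F, F) shared-vertex counts
    adjacency = (shared_count >= 2) & ~torch.eye(num_faces, dtype=torch.bool)
    return adjacency.nonzero()                                   # (num_edges, 2) face-index pairs

# Two triangles sharing the edge (1, 2)
vertices = torch.tensor([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [1., 1., 0.]])
faces = torch.tensor([[0, 1, 2], [1, 3, 2]])
print(gather_face_coords(vertices, faces).shape)  # torch.Size([2, 3, 3])
print(derive_face_edges(faces))                   # tensor([[0, 1], [1, 0]])
```

face_mask and face_edges_mask are typically just boolean masks marking which entries are real data versus padding when meshes of different sizes are batched together.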
-
Hi, does anyone know if there is a Discord server? Could you please invite me to it? I am also very interested in extending MeshGPT. Thanks very much.
-
There are tons of mesh models that are free for research usage; however, the mesh models are either too big or too hard to download.
In this thread I'll explain where I source my meshes from and share early results of a pre-trained model.
I'm hoping that this will open up a discussion to accelerate this project, please comment with your thoughts and your training process.
Training notebook available in my fork: MarcusLoppe/meshgpt-pytorch
Training data:
Currently I'm only training on meshes with fewer than 250 triangles, due to training on Kaggle's free GPU.
I managed to get around 1184 models from ShapeNet + ModelNet40 but it wasn't enough data for the model to generalize.
I wanted to download and use Objaverse, but it's around 8.9TB, which I don't have space for, so using my script-kiddie skills I managed to download the dataset with file-size limits (Link).
I downloaded all the models between 2 and 40KB, which resulted in 37k models; after filtering out all meshes above 250 faces I was left with a total of 13.5k models!
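A minimal sketch of that face-count filter (the actual filtering script isn't shown in this post; the directory name is a placeholder):

```python
import os
import trimesh

def filter_by_face_count(paths, max_faces=250):
    # Keep only meshes with at most `max_faces` triangles.
    kept = []
    for path in paths:
        mesh = trimesh.load(path, force='mesh')   # force='mesh' flattens scenes into a single mesh
        if len(mesh.faces) <= max_faces:
            kept.append(path)
    return kept

glb_files = [os.path.join('objaverse_downloads', f)               # placeholder directory
             for f in os.listdir('objaverse_downloads') if f.endswith('.glb')]
small_meshes = filter_by_face_count(glb_files)
print(f'{len(small_meshes)} of {len(glb_files)} meshes have <= 250 faces')
```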
Using the 1184 models from ShapeNet & ModelNet and the 13.5k from Objaverse, I augmented them 15 times and got a dataset of 218k meshes.
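The exact augmentations aren't specified in this post; a sketch of typical label-preserving mesh augmentations (random yaw rotation, uniform scaling, small translation), applied 15 times per mesh:

```python
import numpy as np

def augment(vertices, rng):
    # vertices: (V, 3) array of xyz coordinates, assumed roughly normalized to [-1, 1]
    angle = rng.uniform(0, 2 * np.pi)                       # random rotation about the up (y) axis
    cos, sin = np.cos(angle), np.sin(angle)
    rot = np.array([[cos, 0, sin], [0, 1, 0], [-sin, 0, cos]])
    scale = rng.uniform(0.75, 1.0)                          # shrink slightly, stay inside bounds
    shift = rng.uniform(-0.05, 0.05, size=(1, 3))           # small translation
    return (vertices @ rot.T) * scale + shift

rng = np.random.default_rng(0)
vertices = np.random.rand(100, 3) * 2 - 1                   # dummy mesh vertices
augmented = [augment(vertices, rng) for _ in range(15)]     # 15 copies; faces stay unchanged
```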
Results:
I've experimented with different codebook sizes and model setups, and I've uploaded the models I think will perform well.
Scaling up the model results in better performance, but since I've only trained on 186M tokens it's probably an over-parameterized model.
Training times using a single P100:
Auto-encoder: around 30hrs+ (12hrs to 0.6, then it was a bit slower)
Transformers: varies, but I think it was more than 48hrs each; I stopped when the learning progression slowed down and improved by only 0.06 per epoch, which made it unreasonable to train further using a single GPU.
Available models / datasets:
Auto-encoder 51M parameters: mesh-encoder_16k_2_4_0.339.pt
Transformer GPT-2 small - 141M parameters: mesh-transformer.16k_768_12_12_loss_2.335.pt
Transformer GPT-2 small/medium - 321M parameters: mesh-transformer_16k_768_24_16_loss_2.147.pt
Dataset 186M tokens: objverse_shapenet_modelnet_max_250faces_186M_tokens.npz
Finetune dataset 14.1M tokens: shapenet_25_x50_finetune_dataset.npz
https://drive.google.com/drive/folders/1C1l5QrCtg9UulMJE5n_on4A9O9Gn0CC5?usp=sharing
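For anyone loading those .npz files, the key names and array layouts aren't documented here, so a quick inspection sketch:

```python
import numpy as np

# List the arrays stored in the dataset archive before wiring it into a training notebook.
data = np.load('objverse_shapenet_modelnet_max_250faces_186M_tokens.npz', allow_pickle=True)
print(data.files)                                   # names of the stored arrays
for name in data.files:
    arr = data[name]
    print(name, getattr(arr, 'shape', None), getattr(arr, 'dtype', None))
```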
Thoughts:
Since Objaverse contains a variety of meshes, from Minecraft characters to furniture, I'm highly confident that it will be able to encode and decode all kinds of shapes it hasn't seen before.
I've attached an image below of the output & the ground truth; the goal of this model is to ensure that the encoder and the decoder can talk to each other using the codebook.
When the auto-encoder can compress the mesh structure into codes that are generalizable and not over-fitted to a dataset, that's when I believe we'll see some awesome results!
This model is currently training and will do so for a few more 12hr sessions; please let me know if you have access to better GPUs so we can get this project up and running for real!
I decided on using a 16k codebook size since my hypothesis is that a small codebook makes the transformer's job harder. Imagine a 2k vs a 16k codebook while the transformer is generating the tokens for the base of a chair: for each token generated with the 2k codebook it might have only 1 or 2 token options that lead to a good base, but with a 16k codebook it might have 1-6 options that also result in a good base.
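For reference, a sketch of how a 16k codebook might be configured; the codebook_size argument is an assumption about the MeshAutoencoder constructor, so check the signature in the version of meshgpt-pytorch you have installed:

```python
from meshgpt_pytorch import MeshAutoencoder

# Assumed arguments -- verify against your installed meshgpt-pytorch version.
autoencoder = MeshAutoencoder(
    num_discrete_coors = 128,   # coordinate quantization resolution
    codebook_size = 16384,      # the 16k codebook discussed above (assumed keyword)
)
```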
Data sources:
ModelNet40: https://www.kaggle.com/datasets/balraj98/modelnet40-princeton-3d-object-dataset/data
ShapeNet - Extracted model labels Repository: https://huggingface.co/datasets/ShapeNet/shapenetcore-gltf
Objaverse - Downloader Repository: https://huggingface.co/datasets/allenai/objaverse
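The author's actual downloader (the "Link" in the post) isn't reproduced here; a hedged sketch using the objaverse Python package, downloading a subset and then keeping only files in the 2-40KB window mentioned above:

```python
import os
import objaverse  # pip install objaverse

# Download a small subset of Objaverse and filter by file size on disk.
uids = objaverse.load_uids()
objects = objaverse.load_objects(uids=uids[:1000], download_processes=4)   # uid -> local .glb path

small = {uid: path for uid, path in objects.items()
         if 2_000 <= os.path.getsize(path) <= 40_000}
print(f'kept {len(small)} of {len(objects)} downloaded models')
```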
Results: it's about 14k models, so with the limited training time and hardware it's a great result.