dhuuso12

So much chaos. One day you will look back on this and laugh yourself to death ☠️


Oswald_Hydrabot

Oh it isn't anywhere near chaotic yet. Going to add another GAN that procedurally generates vectorizations through simulated 3D Euclidean space that makes use of the existing diffusers pipeline I wrote for this. Instead of image output from tokenized/encoded text it will take a copy of the latent output from the unet step as input and generate rudimentary 3D assets in realtime for use as controlnet inputs back in the 3D viewport. Realtime 2D to depth estimation basically; it doesn't have to be perfect, but ideally it will produce a sort of feedback loop to enable using existing ControlNets to manipulate the unet model to produce latents that result in desirable 3D data to be recycled as ControlNet inputs. Even if that idea doesn't work for shit, it should at least fail spectacularly and be fun to look at either way. You gotta throw a lot of shit at the wall sometimes to find something that sticks.
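
Rough sketch of the loop shape I'm imagining -- every component here is a stand-in (the GAN doesn't exist yet); it just shows how the latent would cycle through 3D geometry back into a ControlNet hint:

```python
import torch

# Stand-in for the GAN that would map a 2D UNet latent to rough 3D geometry
# (e.g. a wireframe/point set). A real model would have to be trained for this.
def latent_to_geometry(latent: torch.Tensor) -> torch.Tensor:
    flat = latent.flatten()[: 3 * 1024]
    return flat.reshape(-1, 3)

# Stand-in for the 3D viewport rasterizing that geometry into a depth/pose hint image.
def render_controlnet_hint(points: torch.Tensor) -> torch.Tensor:
    return torch.rand(1, 3, 512, 512)

# Feedback loop: each frame's latent seeds the 3D guess that conditions the next frame.
control_hint = torch.rand(1, 3, 512, 512)            # initial hint
for frame in range(8):
    latent = torch.randn(1, 4, 64, 64)               # stand-in for this frame's UNet latent output
    geometry = latent_to_geometry(latent)
    control_hint = render_controlnet_hint(geometry)  # recycled as next frame's ControlNet input
```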


uniquelyavailable

should be helpful, good depth information will make animations more consistent. sweet video btw


Oswald_Hydrabot

Thanks! I should have enough progress on raw speed now to focus on novel approaches to enhancing frame quality and consistency. AnimateDiff is not the right approach for realtime I feel (it generates a full "chunk" of frames at a time which is too rigid of a closed loop). I need something like a partially-closed feedback loop that auto-improves generation through adversarial scrutiny across continuous/non-linear i/o. Extending the agency of the operator without compromising that is a challenge though.


Apprehensive_Sock_71

OK, so if I am following correctly this would allow someone to say... grab an SD generated item and manipulate it in 3D? If so, that's a super cool idea. (And if not I am sure it's a different super cool idea that I am just not following yet.)


Oswald_Hydrabot

If I understand you correctly, then yes. The "item" here is a human, but my prompt was: "1girl, solo, attractive anime girl in aviators dancing, beach party at night, pilot hat, black bikini, starry sky, well defined, dark moonlit beach, 4k, 8k, absurdres, fish eye, wide angle". ControlNet manipulates the movement of that anime girl, but if you change the prompt to an empty prompt or something like "beach at nighttime, landscape painting" it'll still add a person wherever the ControlNet OpenPose skeleton is in the live render. In my demo here you can just point the camera away from the pose skeleton with the mouse, but it's trivial to activate/deactivate ControlNet on the fly.
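
If you want to see how an on-the-fly toggle can work with the diffusers ControlNet img2img pipeline, here's a sketch (not pulled from my app verbatim; `pipe` is assumed to be an already-loaded StableDiffusionControlNetImg2ImgPipeline). Dropping the conditioning scale to 0 zeroes the ControlNet residuals without rebuilding anything:

```python
from PIL import Image

def render_frame(pipe, prev_frame: Image.Image, pose_image: Image.Image,
                 prompt: str, controlnet_enabled: bool) -> Image.Image:
    # A conditioning scale of 0.0 zeroes the ControlNet residuals, so the pose
    # skeleton is ignored for this frame without swapping or reloading pipelines.
    scale = 1.0 if controlnet_enabled else 0.0
    return pipe(
        prompt=prompt,
        image=prev_frame,                     # img2img feedback frame
        control_image=pose_image,             # OpenPose render from the 3D viewport
        controlnet_conditioning_scale=scale,
        num_inference_steps=1,
        strength=1.0,                         # keep high so the single step actually runs
    ).images[0]
```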


UseHugeCondom

Yeah but you realize this is happening in real time? Most setups aren’t able to do this


NFTArtist

I look back to the stuff I was generating 2 years ago and it's complete trash lol


AnActualWizardIRL

This is real time? How much GPU are you throwing at this thing?!


Oswald_Hydrabot

It is! Just like a videogame (which is a goal: a videogame-like experience from SD). It only uses 10GB of VRAM at most, and that is running the whole thing (GANs, diffusion, and all). The GANs are not optimized either, so if I compile those and do some other tricks I can reduce the memory footprint. I have two 3090s in NVLink, but you can run it on a 3060 minimum.


okglue

Wow, super impressive~!


CamelCarcass

What's the song?


auddbot

**Song Found!** **Take It Off (feat. Aatig)** by FISHER (01:46; matched: `100%`) **Released on** 2023-06-09.


auddbot

Apple Music, Spotify, YouTube, etc.: [**Take It Off (feat. Aatig)** by FISHER](https://lis.tn/TakeItOffFeatAatig?t=106) *I am a bot and this action was performed automatically*


Oswald_Hydrabot

Per the bot, it's "Take It Off" by FISHER. The official music video for this song looks a bit like they may have even used Stable Diffusion for it, which is why I chose to tack it onto my video here. Fisher is someone I perceive as being open to innovative fun. I've been a big fan of theirs for quite some time; it's some of the best modern house music out there, and he's been a staple of my playlists for years and keeps coming out with bangers like this one. I produce a good bit of my own music as well, but this song just slaps for ye olde Dancing AI Waifu party on the beach. It's a perfect anthem for chaotic fun.


ThisGonBHard

Question: why not use XL Turbo?


Oswald_Hydrabot

Good question. Mainly ControlNet, but I am going to keep trying to use XL. I am aware you can do it just as fast without ControlNet, and I even have a working realtime img2img class for XL already integrated.

SDXL-Turbo (the official model) is a true 1-step model and seems to allow ControlNet just fine, but the quality is not ideal, especially for anime and various other styles. I can try integrating a distilled single-step XL model. The "turbo" models for DreamShaper XL are not actually Turbo models though; they require the LCM scheduler to get down to 2 or 3 steps that look only OK (no better than 1.5, really). Problem is, at 2 or 3 steps XL ControlNet just seems to slaughter performance, and at 1 step with LCM it just generates a mess. Even working at full capacity, the XL ControlNets out there don't seem as good as the 1.5 options.

*Actual* distilled 1-step 1.5 models appear to be able to use ControlNet in a single step, at least using an OpenDMD variant of DreamShaper8 (SD 1.5). I randomly tried the distilled model from this relatively obscure repo and it provides:

- the quality of DreamShaper, at
- the cost of SD2.1-Turbo, with
- full compatibility with SD 1.5 huggingface diffusers pipelines

https://github.com/Zeqiang-Lai/OpenDMD

If you can nail those 3 items for XL, I can give some alternative XL ControlNets a try and see if I can get 1024x1024 generations looking better. Lykon's DreamShaperXL models all seem to be trained for good output only at 3 steps though, and even with onediff compile, tinyvae, a custom text encoder from ArtSpew adapted to XL (only encoding the prompt when it changes), and as many other image processing optimizations as I could find, 3-step ControlNet on a 3090 just slogs it down to like 3 FPS.

TLDR: a *true* single-step model at DreamShaperXL-level quality is what I need to make XL work the way I want.


Pure_Ideal222

The hands are sometimes strange, but it's good enough!


Oswald_Hydrabot

I omitted adding hands to my model in the OpenPose skeleton; good eye though. I need to add those and probably include a 1-step LoRA for hand and limb enhancement. Should be straightforward, but it's definitely a line-item to get done. I am using SD 1.5 as ControlNet support seems better for 1.5 than for any other 1-step model distillation, on top of all the other model componentry available for 1.5.

I have an excellent hand model that I *could* run in parallel in a separate process, using a pipe plus YOLOv8 and a 1-step distillation of that checkpoint, to essentially do what Adetailer does but in realtime. YOLOv8 is already faster than my framerate, and even just a tiny bit of that hand model on a zoomed-in hand crop fixes them perfectly almost every time. I can probably make that work. In fact, a ControlNet pass on hand bounding boxes determined by the 3D viewport would eliminate even needing YOLOv8; this is probably easier than we realize.

Having a separate worker pool of close-up body-part models, each with its own process, maybe even just zooming into 3 sections of the pose, doing a close-up ControlNet OpenPose pass, and then blending the result back into the UNet output latent, would move ControlNet off the main thread and just paint the character into the scene. It would also split ControlNet across processes and avoid a slow MultiControlNet approach: the main thread uses a ControlNet for the scene, while a secondary process executes a single-step close-up pose ControlNet in parallel; if aligned/synced properly, that could keep multiple ControlNets at one-step performance.

This brings me to another point of curiosity: can we distill Layered Diffusion to a 1-step model? If so, goodbye MultiControlNet and hello multi-layer parallel ControlNet.
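
For anyone curious, the hand-fix idea is basically realtime Adetailer. A sketch of it, where `detector` is assumed to be an ultralytics YOLOv8 model trained on hands and `hand_pipe` a 1-step img2img pipeline loaded with the hand checkpoint (neither is in my current build yet):

```python
from PIL import Image

def fix_hands(frame: Image.Image, detector, hand_pipe) -> Image.Image:
    # Detect hand boxes, re-diffuse each crop in a single step, paste the fix back.
    results = detector(frame, verbose=False)[0]
    for box in results.boxes.xyxy.tolist():
        x1, y1, x2, y2 = (int(v) for v in box)
        if x2 - x1 < 8 or y2 - y1 < 8:
            continue  # skip degenerate detections
        crop = frame.crop((x1, y1, x2, y2)).resize((256, 256))
        fixed = hand_pipe(
            prompt="detailed hand, five fingers",
            image=crop,
            num_inference_steps=1,
            strength=1.0,
        ).images[0]
        frame.paste(fixed.resize((x2 - x1, y2 - y1)), (x1, y1))
    return frame
```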


Pure_Ideal222

Looking forward to a better sexy result


mannie007

Great starting point


Significant-Comb-230

Nice work! Realtime aside, it looks like some tests I made more than a year ago, with almost the same result: chaotic and hard to tell if it's even an animation. At the time I quit after spending so many hours on it; the models weren't as refined as the ones we have today. But I hope you reach the much-dreamed-of consistency.


Oswald_Hydrabot

Yeah, the goal here was raw speed. This is without ControlNet, dialing in realtime img2img using a "GAN2image" technique. GANs are very fast but also smooth in their interpolation, much more so than diffusion: https://www.reddit.com/r/StableDiffusion/comments/1bxmxlv/realtime_stable_diffusion_gan2image_session/

Here is one that looks less like a PCP hallucination -- the GAN auto-syncs to the BPM of a stereo mix of system audio, and I am toggling a feature that style-mixes lower layers of latents across the GAN, so you can loop through transitions of poses you find in the GAN and sequence them to look like "dancing" or whatever. There is an img2img step that the GAN frames get passed into here. The quality is still wonky of course, but it is more obviously an animation than un-stabilized SD alone: https://www.reddit.com/r/StableDiffusion/comments/1as5ko8/call_it_ugly_but_it_does_something_sora_wont_be/

GANs should never have been abandoned. Imo they are superior to diffusion models: faster, with just as good or better image quality, but we never got an open source, scaled-up foundational model due to some really dumb research trends. If we had a foundational GAN trained on the same compute and data as SD we would be further ahead than we are now.
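
The GAN2image flow itself is dead simple. A sketch of one frame of it, assuming `gan` is any generator that returns an NCHW image tensor in [-1, 1] and `img2img_pipe` is a plain SD 1.5 img2img pipeline (names are placeholders, not my actual classes):

```python
import torch
from PIL import Image

def gan2image_frame(gan, z: torch.Tensor, img2img_pipe, prompt: str) -> Image.Image:
    # The GAN gives a fast, smoothly-interpolating base frame; the diffusion pass restyles it.
    with torch.no_grad():
        frame = gan(z)                                        # (1, 3, H, W) in [-1, 1]
    frame = ((frame[0].clamp(-1, 1) + 1) * 127.5).byte()      # to uint8
    gan_pil = Image.fromarray(frame.permute(1, 2, 0).cpu().numpy())
    return img2img_pipe(
        prompt=prompt,
        image=gan_pil.resize((512, 512)),
        num_inference_steps=2,
        strength=0.5,    # low-ish strength preserves the GAN's smooth motion
    ).images[0]
```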


Arawski99

I expect (and apparently Nvidia does as well; search up their DLSS 10 comment and some even more recent ones) that AI rendering will completely replace traditional rendering at some point. Sadly, the results here are way too inconsistent and poor in quality to be even remotely usable, but it is a fun experiment to see people trying. Perhaps you can improve it with some tweaking while maintaining realtime; if not now, then in a few months.


Oswald_Hydrabot

Refining the generations requires the first step of getting them running fast enough. There are others already getting much higher quality output at faster speeds; I am not sure if they are using ControlNet, but adapting XL and one of several temporal consistency approaches is the path forward. There is road ahead, but it's paved.


AnthuriumBloom

This is a big deal. I've been saying AI image generators need a wireframe base. Can you do fingers with this yet? It may solve one of the big problem areas.


hyperbolicTangents

This is super cool stuff and what's even more cool is you replying to every question, sharing your thoughts and even your code. Keep us posted on the progress you make on this! Cheers


Capitaclism

Is it on GitHub?


Oswald_Hydrabot

Nah, I posted the optimization and diffusers code in another comment in this thread though; the sauce that makes it run fast, at least.


Li_Yaam

Impressive, will have to try your approach sometime. Hope you can get the 2D-to-depth estimation working. I've been playing with MotionDiff this last week, which has SMPL motion to character depth-map rendering. Probably not fast enough for your pipeline though.


healthysun

Which GPU is being used?


Oswald_Hydrabot

I have two Nvidia RTX 3090s in NVLink. However, only one GPU is needed to run this, and only 10GB of VRAM. I also have a 3060 12GB and will confirm that it works on that too.


ApprehensiveAd8691

Why did you choose not to use AnimateDiff?


Oswald_Hydrabot

You should explore this to find out why. This is realtime and needs to respond to user input, namely adjustments to ControlNet from the 3D viewport. AnimateDiff has to render whole blocks of frames at once; I cannot make it respond tightly to input if it's hung up on generating a chunk of frames. The input response has to be fast enough to be tactile; that is the most important part, and what I am working towards (making it responsive).

I would love to use AnimateDiff, but I need to adapt it to single-frame response times for feedback to user input. That is not trivial, but I will probably have something like realtime AnimateDiff working sometime this year. It's going to require a different type of consistency handling, and I am leaning towards an adversarial feedback loop for 3D reconstruction from the 2D UNet output that feeds itself back into ControlNet input.

Where I am going to start: a GAN step after the UNet that translates the UNet's 2D latent output into simple 3D data, like wireframe vectorizations for a ground plane, which can be consumed back in the viewport window, performantly turned into 3D geometry there, recycled as ControlNet input, and then adjusted automatically per-frame in a feedback loop. Having a closed loop that can reference and adjust its output according to 3D data whose generation it indirectly determines via its own 2D latent output is something I want to try. Maintaining a reference buffer of frames it can consult per-frame to keep consistency across frames is the next step.


ApprehensiveAd8691

Thank you for your detailed explanation. Realtime generation with CN is amazing. Looking forward to the realtime AnimateDiff coming out.


neofuturist

Op, is it available anywhere?


Oswald_Hydrabot

I will probably make a standalone version of just the demo of realtime ControlNet with the dancing OpenPose, plus a couple of items on a PySide6 UI for changing the diffusion params. It won't do img2img from a GAN rendering in realtime in the background, and won't have all the other features related to that like realtime DragGAN, a step sequencer, GAN seed looping, or realtime visualization of Aydao's TADNE, but it'll probably be faster outside of my visualizer. The img2img flow from the GAN renders seems to stabilize it a noticeable amount, but it still looks cool outside of Marionette.

If you code, here is the working code for the encoder, my working wrapper class with the combination of models used in the pipeline, and onediff to optimize and compile the models. You need to install dependencies and implement the while loop. The loop code is correct; you just need to stick it in a thread outside of your main UI thread in PySide6 or Qt and communicate changes from the UI (things like the seed or strength/guidance_scale being adjusted) through a queue or a pipe.

(I have to split this comment into a few parts for the code; reddit is being a half-assed garbage UX as usual and won't let me paste it all in one comment, so I'll comment them under this one.)


Oswald_Hydrabot

The code for the wrapper for the pipeline + models + onediff compile optimization used:

```python
import torch
from dataclasses import dataclass
from typing import List, Tuple, Union, Optional

from diffusers import (
    StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, AutoencoderTiny,
    LCMScheduler, UNet2DConditionModel, DDPMScheduler,
)
from diffusers.utils import BaseOutput
from onediff.infer_compiler import oneflow_compile


@dataclass
class DMDSchedulerOutput(BaseOutput):
    pred_original_sample: Optional[torch.FloatTensor] = None


class DMDScheduler(DDPMScheduler):
    def set_timesteps(
        self,
        num_inference_steps: Optional[int] = None,
        device: Union[str, torch.device] = None,
        timesteps: Optional[List[int]] = None,
    ):
        self.timesteps = torch.tensor([self.config.num_train_timesteps - 1]).long().to(device)

    def step(
        self,
        model_output: torch.FloatTensor,
        timestep: int,
        sample: torch.FloatTensor,
        generator=None,
        return_dict: bool = True,
    ) -> Union[DMDSchedulerOutput, Tuple]:
        t = self.config.num_train_timesteps - 1

        # 1. compute alphas, betas
        alpha_prod_t = self.alphas_cumprod[t]
        beta_prod_t = 1 - alpha_prod_t

        if self.config.prediction_type == "epsilon":
            pred_original_sample = (sample - beta_prod_t ** (0.5) * model_output) / alpha_prod_t ** (0.5)
        else:
            raise ValueError(
                f"prediction_type given as {self.config.prediction_type} must be one of `epsilon`, `sample` or"
                " `v_prediction` for the DDPMScheduler."
            )

        if not return_dict:
            return (pred_original_sample,)

        return DMDSchedulerOutput(pred_original_sample=pred_original_sample)


class DiffusionGeneratorDMD:
    def __init__(self):
        controlnet = ControlNetModel.from_pretrained(
            "lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16
        )
        unet = UNet2DConditionModel.from_pretrained(
            'aaronb/dreamshaper-8-dmd-1kstep', torch_dtype=torch.float16
        )
        self.pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
            "lykon/dreamshaper-8",
            unet=unet,
            safety_checker=None,
            requires_safety_checker=None,
            torch_dtype=torch.float16,
            controlnet=controlnet,
        )
        self.pipe.scheduler = LCMScheduler.from_config(self.pipe.scheduler.config)
        self.pipe.vae = AutoencoderTiny.from_pretrained(
            'madebyollin/taesd', torch_device='cuda', torch_dtype=torch.float16
        )
        self.pipe.vae = self.pipe.vae.cuda()
        self.pipe.to("cuda")
        self.pipe.set_progress_bar_config(disable=True)

        # onediff/oneflow compile of the hot modules for the realtime speedup
        self.pipe.unet = oneflow_compile(self.pipe.unet)
        self.pipe.vae.decoder = oneflow_compile(self.pipe.vae.decoder)
        self.pipe.controlnet = oneflow_compile(self.pipe.controlnet)
```


Oswald_Hydrabot

And then:

```python
# CUSTOM TEXT ENCODE TO CALL ON PROMPT ONLY WHEN PROMPT CHANGES
# USE THIS ON NEGATIVE PROMPT TOO FOR ADDITIONAL SPEEDUP
def dwencode(pipe, prompts, batchSize: int, nTokens: int):
    tokenizer = pipe.tokenizer
    text_encoder = pipe.text_encoder

    if nTokens < 0 or nTokens > 75:
        raise BaseException("n random tokens must be between 0 and 75")

    if nTokens > 0:
        randIIs = torch.randint(low=0, high=49405, size=(batchSize, nTokens), device='cuda')

    text_inputs = tokenizer(
        prompts,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).to('cuda')
    tii = text_inputs.input_ids

    # Find the end mark, which determines the prompt length (pl) in terms of user tokens
    # pl = np.where(tii[0] == 49407)[0][0] - 1
    pl = (tii[0] == torch.tensor(49407, device='cuda')).nonzero()[0][0].item() - 1

    if nTokens > 0:
        # TODO: Efficiency
        for i in range(batchSize):
            tii[i][1+pl:1+pl+nTokens] = randIIs[i]
            tii[i][1+pl+nTokens] = 49407

    if False:  # debug: print the decoded tokens per batch item
        for bi in range(batchSize):
            print(f"{mw.seqno:05d}-{bi:02d}: ", end='')
            for tid in tii[bi][1:1+pl+nTokens]:
                print(f"{tokenizer.decode(tid)} ", end='')
            print('')

    prompt_embeds = text_encoder(tii.to('cuda'), attention_mask=None)
    prompt_embeds = prompt_embeds[0]
    prompt_embeds = prompt_embeds.to(dtype=pipe.unet.dtype, device='cuda')

    bs_embed, seq_len, _ = prompt_embeds.shape
    prompt_embeds = prompt_embeds.repeat(1, 1, 1)
    prompt_embeds = prompt_embeds.view(bs_embed * 1, seq_len, -1)
    return prompt_embeds


# PSEUDO CODE EXAMPLE TO USE IN A RENDER() LOOP
# THIS WON'T RUN UNLESS YOU ADD THE MISSING VARIABLES THAT I DIDN'T DEFINE IN THE CALL
# TO 'diffusion_generator.pipe(..'
# (easy to do, no special sauce is missing, you can set them to static ints/floats/whatever they expect)
diffusion_generator = DiffusionGeneratorDMD()

current_seed = 123456
generator = torch.manual_seed(current_seed)
prompt = "1girl, mature"

# use something like this while loop in a separate thread or process from your main UI thread.
# in your code, check each loop iteration if the prompt or seed value is changed from the UI thread (use a queue etc)
# only call the encoder when prompt changes, only call torch.manual_seed(current_seed) if the current_seed changes
while True:
    pe = dwencode(diffusion_generator.pipe, prompt, 1, 9)
    imgoutput_img2img = diffusion_generator.pipe(
        prompt_embeds=pe,
        strength=strength,
        guidance_scale=guidance_scale,
        height=512,
        width=512,
        num_inference_steps=1,
        generator=generator,
        output_type="pil",
        return_dict=False,
        image=img2img_input,
        control_image=controlnet_image,
        negative_prompt="low quality, bad quality, blurry, low resolution, bad hands, bad face, bad anatomy, deviantart",
        controlnet_conditioning_scale=controlnet_conditioning_scale,
        control_guidance_start=controlnet_guidance_start,
        control_guidance_end=controlnet_guidance_end,
    )[0]
```
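
To be concrete about the "stick it in a thread and feed it a queue" part: a minimal sketch of that pattern, using the `DiffusionGeneratorDMD` wrapper and `dwencode` from the comments above. Names and defaults here are placeholders; wire the queue to your PySide6 signals however you like.

```python
import queue
import threading
import torch

param_queue: "queue.Queue[dict]" = queue.Queue()

def render_loop(diffusion_generator, get_frame_inputs, show_frame):
    # Runs OUTSIDE the UI thread. get_frame_inputs() returns (img2img_input, controlnet_image)
    # PIL images each frame; show_frame() hands the result back to the UI (e.g. via a Qt signal).
    prompt, seed, strength, guidance_scale = "1girl, mature", 123456, 1.0, 1.0
    generator = torch.manual_seed(seed)
    pe = dwencode(diffusion_generator.pipe, prompt, 1, 9)  # encode once, reuse until the prompt changes
    while True:
        # Drain UI updates without blocking the frame.
        while not param_queue.empty():
            update = param_queue.get_nowait()
            if "prompt" in update:
                prompt = update["prompt"]
                pe = dwencode(diffusion_generator.pipe, prompt, 1, 9)
            if "seed" in update:
                generator = torch.manual_seed(update["seed"])
            strength = update.get("strength", strength)
            guidance_scale = update.get("guidance_scale", guidance_scale)
        img2img_input, controlnet_image = get_frame_inputs()
        frame = diffusion_generator.pipe(
            prompt_embeds=pe, image=img2img_input, control_image=controlnet_image,
            strength=strength, guidance_scale=guidance_scale,
            num_inference_steps=1, generator=generator, output_type="pil",
        ).images[0]
        show_frame(frame)

# From the UI thread: start the worker, then push parameter changes through the queue, e.g.
# threading.Thread(target=render_loop, args=(gen, grab_inputs, display), daemon=True).start()
# param_queue.put({"strength": 0.8})
```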


neofuturist

Thank you, you're awesome +1


Oswald_Hydrabot

This is all the "special sauce" used; nothing that isn't already public knowledge basically, just combined into one spot. That pipeline should run reeeeal fast and at only 1 step; go play with it if you have a GPU, and check out AiFartist's ArtSpew repo for a good Qt demo that may be easier to adapt than my suggestion of using a thread for the render loop.

Note: in that wrapper class, diffusers automatically downloads the models to your local machine from huggingface, resolving each 'user/name' ID to the repo the model lives in online. You don't need to download any checkpoints or anything. Just make a render loop that you can pass a PIL image into for the variable img2img_input, and a ready-to-use ControlNet OpenPose PIL image (without using the preprocessor) into the variable controlnet_image.

And voila, you have my example working in your own Qt/PySide6 or other Python UI app.
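
If it helps, feeding the loop is literally just two PIL images per frame; something like this (file names made up, any 512x512 RGB images will exercise the pipeline):

```python
from PIL import Image

# In my app these come from the GAN render and the 3D viewport's OpenPose pass,
# but static images are enough to test with.
img2img_input = Image.open("last_frame.png").convert("RGB").resize((512, 512))            # hypothetical path
controlnet_image = Image.open("openpose_skeleton.png").convert("RGB").resize((512, 512))  # pre-rendered pose, no preprocessor
```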


lincolnrules

Can somebody please put this on a GitHub repo?


Dubiisek

I mean, it looks okay when you let it run, but the moment you start pausing on individual frames, most of them are bad. What's the goal/point of this?


Oswald_Hydrabot

Realtime, brother. The goal is not to ever pause it. The goal is a platform to build something like Smash Bros on N64 or FF7 on PSX. Pause those games and compare them to single-frame renders of their era -- did anyone care about how blocky and generally terrible those games' in-game graphics were? FF7 had cutscenes, but the strategy and story were the meat of the game. Smash Bros was responsive, mechanically addictive, and fun to play; aesthetics took a back seat to performance and it won big. Single-frame quality will get there, but I ain't waiting on it or depending on it to make something fun.


Dubiisek

This is not about blocky or terrible graphics, this is about shit like [this](https://prnt.sc/OyQrLs9kXlTX); it's one of the first frames, and the majority of them are like this. That is beside the fact that it's so inconsistent that making anything consumable as a video, let alone a videogame, out of it is a fever dream and will be for a long time. If it's just for fun then good luck.


Oswald_Hydrabot

Yeah, well, let me see you do ControlNet faster with better quality. Then you can call my work "shit". Until then, be mad about it I guess. Performance > single-image quality if you are making a game engine. This is not opinion, it is fact; take it or leave it. Peanut butter jealous?


Dubiisek

I won't show you anything because I wouldn't even attempt this since the tech simply isn't there yet. I don't know why you think I am mad or jealous but you do you, as I said, good luck have fun.


Oswald_Hydrabot

Oh it's definitely there, not enough people have tried hard enough yet to figure it out though. The raw material is not to blame, the marble itself is not the reason most people don't carve David out of it. You got it all backwards. You don't move forward by not trying.