AutoTrack filter #56
Great, thank you.
Currently the filters just receive a single input. That works fine when using just a single mask, but in this case you probably want multiple inputs. Also, the current approach when blurring the background is to pass the webcam image into two branches, where the background branch is blurred and the foreground branch has the mask applied to it. "Zooming" both probably requires a filter after both branches are applied, and you need a separate input from the bodypix model, I guess. Open to your ideas on how best to achieve that. |
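For illustration, a minimal numpy/OpenCV sketch of what the two branches plus composite amount to (not the project's actual filter code; `mask` is assumed to be the bodypix segmentation scaled to the 0..1 range):

```python
import cv2
import numpy as np

def blur_background(frame: np.ndarray, mask: np.ndarray, blur_size: int = 21) -> np.ndarray:
    """Composite the webcam frame over a blurred copy of itself using the person mask.

    frame: HxWx3 uint8 webcam image
    mask:  HxW float mask in [0, 1] (e.g. from bodypix), 1 = person
    """
    background = cv2.blur(frame, (blur_size, blur_size))  # background branch
    alpha = mask[..., np.newaxis]                          # broadcast over the colour channels
    composited = alpha * frame + (1.0 - alpha) * background
    return composited.astype(np.uint8)
```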
Hi @de-code, here are some thoughts. So for replacing the background using a movie:

`webcam -> bodypix -> erode -> dilate -> blur -> motion_blur -> composite -> window`

What do you think?
Thank you for putting your thoughts down. I will need some time to read and understand it properly. But since you mentioned gstreamer: there is Google Coral's project-bodypix, which uses gstreamer. As far as I understand it only supports the Coral Edge TPU, but it might be something to look to for inspiration. I don't have any experience with the gstreamer API (only a vague recollection that it required more system dependencies).
Making it pull would be fine for small graphs.
Hi @de-code, I got inspired by your lazy pull approach, so I adopted it for a more general graph. Running the graph consists of recursively calling node.calculate() starting with the last node. Steps of Node.calculate():
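A minimal sketch of how such a recursive, lazily pulled calculation could look (the per-frame caching scheme and the attribute names are illustrative assumptions, not the actual implementation):

```python
from typing import Any, Dict, Optional

class Node:
    def __init__(self, inputs: Optional[Dict[str, "Node"]] = None):
        self.inputs = inputs or {}
        self._cached_frame_id: Optional[int] = None
        self._cached_value: Any = None

    def process(self, input_values: Dict[str, Any]) -> Any:
        """Pure function of the input values; implemented by concrete source/filter/sink nodes."""
        raise NotImplementedError

    def calculate(self, frame_id: int) -> Any:
        # Recompute at most once per frame, even if several downstream nodes pull from this one.
        if self._cached_frame_id != frame_id:
            input_values = {
                name: upstream.calculate(frame_id)  # recursively pull the upstream nodes
                for name, upstream in self.inputs.items()
            }
            self._cached_value = self.process(input_values)
            self._cached_frame_id = frame_id
        return self._cached_value

# Running the graph then amounts to calling calculate() on the last node for each frame:
#   sink_node.calculate(frame_id)
```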
This way the nodes can be purely functional and lazily pulled.

```yaml
nodes:
  - source: video_source
    # id: bg # id is optional. type name is default for id. Since each id must be unique in the graph,
    # if two nodes of the same type are created, use id to identify them
    # Source: https://www.pexels.com/video/carnival-rides-operating-in-an-amusement-park-3031943/
    input_path: "https://www.dropbox.com/s/oqftndbs29g8ekd/carnival-rides-operating-in-an-amusement-park-3031943-360p.mp4?dl=1"
    repeat: true
    preload: true
  - source: webcam
    device_path: "/dev/video0"
    fourcc: "MJPG"
  - filter: bodypix  # automatic connection from webcam.output to bodypix.input
    model_path: "https://storage.googleapis.com/tfjs-models/savedmodel/bodypix/mobilenet/float/050/model-stride16.json"
    # model_path: "https://storage.googleapis.com/tfjs-models/savedmodel/bodypix/resnet50/float/model-stride16.json"
    internal_resolution: 0.5
    threshold: 0.5
  - filter: erode
    # input: bodypix.output  # bodypix default output is the all mask
    value: 20
  - filter: dilate  # automatic connection
    value: 19
  - filter: box_blur
    value: 10
  - filter: motion_blur
    frame_count: 3
    decay: 0
  - filter: composite
    # input: motion_blur.output  # NOTE: this isn't needed. It would be the default connection anyway
    input_fg: webcam.output
    input_bg: video_source.output
  - filter: auto_track  # crops input to crop_mask, adds a padding and then resizes back to the size of input array
    # input: composite.output  # NOTE: not needed
    crop_input: bodypix.output_face  # Alternative is a union node of bodypix.output_lface and bodypix.output_rface
    padding: 20
  - sink: v4l2_loopback
    device_path: "/dev/video4"
```
Hi @benbatya thanks again. I haven't had much time to think about it yet but I will try to respond in part anyway (please let me know if anything doesn't make sense, I am sure I am missing a lot).
That I am not sure about either. I guess it is one of the major weaknesses of Python. There seems to be shared memory that could potentially be used. Perhaps one other consideration here is that the initial use-case will probably be live desktop usage, maybe even during a meeting, so maybe it's good / okay if it is not maxing out all of the CPUs. Another potential argument for pull in that setting is that it reduces issues with back-pressure: if the bodypix model and the webcam output are slower than the background video frame rate, then we can skip video frames rather than putting them on a queue (I guess that is different from a typical data flow scenario, where we definitely want to process all of the data; a small sketch of the frame-skipping idea follows at the end of this comment). Some benefits I believed I was seeing when I used the branches:
I am not sure whether the second point is actually true, as I am sometimes confused by it myself. So maybe it wasn't a great idea after all. I suppose that by mandating referenced layers to appear before the layers referencing them, like you suggest, you are also enforcing the graph to be acyclic. With the current implementation we could also provide a map of the layers by id to add them in. Otherwise your proposal seems to be a bit similar to how … The inputs are connected to … Some questions: I guess you prefer to more clearly separate between …? In the … In the … When you explained … I will add some other GitHub issues with potential features.
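On the back-pressure point above, a minimal sketch of how a pull-based background video source can skip frames when the rest of the graph is slow (wall-clock based; all names are illustrative):

```python
import time

class BackgroundVideoSource:
    def __init__(self, frames, fps: float = 30.0):
        self.frames = frames  # pre-decoded background video frames
        self.fps = fps
        self.start_time = time.monotonic()

    def calculate(self, frame_id: int):
        # Return the frame that should be visible *now*. If bodypix and the webcam
        # are slow, the skipped background frames are simply never requested,
        # so no queue builds up.
        elapsed = time.monotonic() - self.start_time
        index = int(elapsed * self.fps) % len(self.frames)
        return self.frames[index]
```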
Hi @de-code, my example config was to try to prove out what an "ideal" configuration might be. To answer your questions:
If composite functionality for nodes is desired, a sub-graph with clearly defined source inputs and sink outputs can be defined; this is how gstreamer GstGhostPads are defined. I think there's a lot of benefit to keeping the graph definition in Python to make creating new filter nodes simpler. bodypix takes up the majority of the CPU right now, so trying to multi-process it is counterproductive. It would be nice to move bodypix to the GPU for processing, but I guess that means updating python-tf-bodypix?
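A rough sketch of that sub-graph idea in Python (hypothetical; it only illustrates how the exposed inputs and output would forward to internal nodes, similar to ghost pads):

```python
from typing import Any, Dict

class Placeholder:
    """Stands in for an external input inside the sub-graph."""
    def __init__(self) -> None:
        self.value: Any = None

    def calculate(self, frame_id: int) -> Any:
        return self.value

class SubGraph:
    """Wraps an internal graph so it can be used like a single node.

    The exposed inputs and the exposed output play the role of ghost pads:
    they forward to clearly defined placeholder sources and a sink node inside.
    """
    def __init__(self, inputs: Dict[str, Any], placeholders: Dict[str, Placeholder], internal_sink: Any):
        self.inputs = inputs              # name -> outer node feeding the sub-graph
        self.placeholders = placeholders  # name -> placeholder source inside the sub-graph
        self.internal_sink = internal_sink

    def calculate(self, frame_id: int) -> Any:
        for name, outer_node in self.inputs.items():
            self.placeholders[name].value = outer_node.calculate(frame_id)
        return self.internal_sink.calculate(frame_id)
```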
A good example of a node network is https://viz.mediapipe.dev/demo/hair_segmentation.
Okay, these discussions are definitely useful. When I created this … So I guess at this point the main questions are:
It also depends on your available time. If you think it is still worth investing in this project, then in order to progress it, it could be a good time to start moving different aspects into separate, smaller tickets. The main questions for the autotrack feature are probably (given that it is still configuration-driven):
Another idea could also be to decouple the autotrack filter from the bodypix model further. For example, OpenCV has face detection, which may not be very useful for removing the background but could be used for autotrack (and would be a lot faster, I guess; not tested). i.e. one could imagine a bounding box face input, although it doesn't quite fit any of the other data being passed around. So maybe not a good idea, just throwing it out there anyway.
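For illustration, a sketch of how an OpenCV Haar-cascade face detector could provide such a bounding box input (not part of the project; the cascade file ships with OpenCV):

```python
import cv2

# Haar cascade face detector bundled with OpenCV; much cheaper than running bodypix.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def detect_face_box(frame, padding: int = 20):
    """Return a padded (x, y, w, h) bounding box around the largest detected face, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])  # pick the largest face
    height, width = frame.shape[:2]
    x0, y0 = max(x - padding, 0), max(y - padding, 0)
    x1, y1 = min(x + w + padding, width), min(y + h + padding, height)
    return int(x0), int(y0), int(x1 - x0), int(y1 - y0)
```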
You should get CUDA-enabled GPU support via TensorFlow. Otherwise the current Docker image won't have GPU support, because it doesn't currently include any CUDA libraries. I don't have a GPU on my laptop to test it with, though.
Just my thoughts: I have started looking at mediapipe by Google (mp) and I think it's a much better platform for flexible webcam manipulation than trying to get performance out of a Python app. It has a nice graph description API (more understandable than gstreamer), it automatically parallelizes node calculations, and it contains most of the functionality needed to get good results. The only major pieces needed AFAICS are to port the bodypix model to a calculator which can expose results (output_stream in mediapipe terms) and to make a calculator which sinks the results into v4l2loopback. Once I get through the tutorial, building mp nodes for v4l2loopback and auto_track will be the next goals. I apologise for suggesting this feature and then leaving this project. I like what you have done, but unfortunately I see the GIL as a fundamental blocker to performance, especially when alternatives like mediapipe and gstreamer exist. I would be extremely happy to collaborate on porting bodypix to mediapipe. Google's new background segmentation model is much more accurate than bodypix, but it doesn't appear to produce separate masks for different body parts or estimate feature locations AFAICS: https://ai.googleblog.com/2020/10/background-features-in-google-meet.html?m=1 https://drive.google.com/file/d/1lnP1bRi9CSqQQXUHa13159vLELYDgDu0/view
MediaPipe looks like a good project.
Why don't you use a separate simple face detection in the interim to build a proof-of-concept?
It seems interesting. I am not sure whether I read that correctly:
My TFlite knowledge is also very limited. I barely added support for that in
That is fine. I will try to check out MediaPipe more myself. And it is a good feature suggestion (just thinking whether this issue should be renamed to something like "architecture exploration / discussion"). I do want to stick with Python for now I think. Unless there was very limited exposure to other languages.
I am happy to share any limited knowledge I gained from using the bodypix model, or exchange ideas otherwise.
Did you happen to come across the actual model? (I can't seem to see it from the blog)
I think that the model is at the bottom of google-ai-edge/mediapipe#1460
Just a note that as part of #94 the internal representation has changed a bit. It is now closer to the nodes you described.
This is the tracking issue for the autotrack feature originally mentioned at python-tf-bodypix#67
Make AutoTrack a filter that can be applied.
This will require a new filter type which takes the bodypix masks as an argument to the filter function.
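A rough sketch of how such a filter type could differ from the current single-input filters (hypothetical signatures, just to illustrate the idea):

```python
from typing import Dict, Protocol
import numpy as np

class ImageFilter(Protocol):
    # current style: the filter only receives the image it should transform
    def __call__(self, image: np.ndarray) -> np.ndarray: ...

class MaskAwareFilter(Protocol):
    # proposed style: the filter additionally receives the bodypix mask(s),
    # e.g. the face mask that drives the auto-track crop
    def __call__(self, image: np.ndarray, masks: Dict[str, np.ndarray]) -> np.ndarray: ...
```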