Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Songs/Artists/Albums query performances by ~50 #348

Merged

Conversation

quentinguidee
Copy link
Contributor

@quentinguidee quentinguidee commented Feb 26, 2024

While requests are performant for many use cases, I have around 225000 listened songs (23000 different) with Your Spotify hosted on a Raspberry PI, so navigate on the client takes a lot of time, especially with the "Last year" or "All" timeframe.

This PR improves performances of getBestSongsNbOffseted, getBestArtistsNbOffseted and getBestAlbumsNbOffseted by a factor of ~10. My approach was to group all Infos of the same track at start. This avoids to compute many times the same Info, which helps a lot for the lightTrackLookupPipeline.

Benchmark

done on mac m1, not on the Raspberry

A request for the top songs with date range "All" before this PR takes in my case 11,8s:

server  | GET /spotify/top/songs?start=2017-11-18T12:18:39.000Z&end=2024-02-26T14:07:27.570Z&nb=20&offset=0 200 11764.380 ms - 25715

After this PR, this time drops to 1,4s, which makes the page loads almost instantly for the biggest request.

server  | GET /spotify/top/songs?start=2017-11-18T12:18:39.000Z&end=2024-02-26T14:07:27.570Z&nb=20&offset=0 200 1363.197 ms - 25095

Update: See below for another performance boost (#348 (comment))

For reviewer

  • I removed the $project: { ...getGroupByDateProjection(user.settings.timezone), id: 1 } step. I'm not sure what it does exactly. However, we can add it back if necessary.
  • I never used MongoDB so feel free to check if some steps can be simplified. I didn't test on previous versions of Mongo, so I'm not sure if I break compatibility with Mongo 4.4 required for Raspberry Pi 4 (even if I don't see where it could happen).

Related

#232, #162

@quentinguidee quentinguidee force-pushed the perfs/improve-get-best-songs branch from 6c4e32a to 160d897 Compare February 26, 2024 14:54
@quentinguidee
Copy link
Contributor Author

Also, I think a lot of requests can beneficiate from this improvement. If everything looks fine I might try to improve the others.

@quentinguidee
Copy link
Contributor Author

quentinguidee commented Feb 26, 2024

I just realised that the request is sorted by listening count, not total duration, so we can directly limit/offset at start, which would make the request instantaneous

Edit it seems to work (0,2s), but the total_count is not calculated correctly, so I reverted to the 1,2s version

@quentinguidee quentinguidee changed the title Improve getBestSongsNbOffseted performances by ~10 Improve getBestSongsNbOffseted performances by ~60 Feb 26, 2024
@quentinguidee quentinguidee force-pushed the perfs/improve-get-best-songs branch from e5fb3b2 to 160d897 Compare February 26, 2024 18:02
@quentinguidee quentinguidee changed the title Improve getBestSongsNbOffseted performances by ~60 Improve getBestSongsNbOffseted performances by ~10 Feb 26, 2024
@Yooooomi
Copy link
Owner

Hello! Thanks a lot for what you've done. It seems good to me.

The facet is a bit weird since I'm not used to it.
Would be interesting to test if it outputs the exact same information.
This performance boost is insane thanks a lot! I don't want to abuse or anything but yeah it would be very good to have a brand new look on other queries too.
Maybe (surely) getBestArtistsNbOffseted and getBestAlbumsNbOffseted would benefit from this too.

I've tried working on #162 at some point but did not find any better implementation back then.

Again, thanks a lot. Don't hesitate if you have any questions, and keep in touch when you feel like you're done with this :)

@quentinguidee
Copy link
Contributor Author

Hi!

Would be interesting to test if it outputs the exact same information.

They seem to have the exact same result in my case! However, it may be good to test again before merging with other data!

I'll try to improve the PR in the next days to cover more improvements 👍

@@ -560,45 +560,58 @@ export const getBestSongsNbOffseted = (
) =>
InfosModel.aggregate([
{ $match: basicMatch(user._id, start, end) },
{
$project: { ...getGroupByDateProjection(user.settings.timezone), id: 1 },
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is there a usage for this line that I removed? I'm not sure where it can be useful here

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't know either. Must be a line I forgot to remove one day. I think it's safe to remove it

@Yooooomi
Copy link
Owner

I'll try to improve the PR in the next days to cover more improvements 👍

Thanks a lot

@quentinguidee
Copy link
Contributor Author

quentinguidee commented Feb 27, 2024

With the same technique, 5669865 makes the /top/artists request faster, from 12.1s to 1.3s

For this request however:

  • I removed track and album from the result. I don't think it makes sense since an artists doesn't have only one track/one album, so it seems safe to remove;
  • I also removed the differents field, it seems to come from other requests like differentArtistsPer. I found the usage, I added it back

@quentinguidee
Copy link
Contributor Author

And to conclude, /top/albums from 12s to 1.4s 🚀

@quentinguidee
Copy link
Contributor Author

Should be ready for review! 🚀

@quentinguidee quentinguidee changed the title Improve getBestSongsNbOffseted performances by ~10 Improve Songs/Artists/Albums performances by ~10 Feb 27, 2024
@RomainNeup
Copy link

Hello @quentinguidee, congrats for the optimization it's pretty impressive 👏
If you have time, could you please have a look at the new mongoDb requests added in my PR #346 to see if it can also be optimized ?

@quentinguidee
Copy link
Contributor Author

quentinguidee commented Feb 27, 2024

@Yooooomi I have a MongoDB question. I think that almost every requests in this project could be instantaneous if we could have duration_ms, track_id, album_id, artists_id directly indexed in the InfosModel. I don't see another way to do this than adding these fields to all InfosModel. Do you know any way to achieve this automatically with MongoDB?

@Yooooomi
Copy link
Owner

I can write a migration that adds these fields in the info model.
I'll be working on it later today if this can help you.

@quentinguidee
Copy link
Contributor Author

Thanks! That would be helpful. Also this might open the path for more precise total duration when storing each listening duration separately

@Yooooomi
Copy link
Owner

Also this might open the path for more precise total duration when storing each listening duration separately

The Spotify API does allow to get the specific listening duration for the recently played tracks.

@quentinguidee
Copy link
Contributor Author

Yeah unfortunately, but this might be approximated with the time elapsed between two entries in the history? And by default set it to the track duration?

Also, iirc the spotify data when downloaded include the duration of each listening

@Yooooomi
Copy link
Owner

Spotify history is sometimes not precise, relying on timestamps to establish listening duration might be dangerous.
I'm pushing the migration to release/1.8.0 branch. You will be able to rebase/merge from there to have access to albumId, artistId and durationMs in the Infos model.

@Yooooomi
Copy link
Owner

It's pushed to release/1.8.0

@quentinguidee
Copy link
Contributor Author

quentinguidee commented Feb 27, 2024

@Yooooomi It seems that the migration does not trigger, is that because the timestamp of the migration is wrong? (probably the duplication of vscode which incremented the timestamp by one). I tried to npm migrate manually but it says its migrated (but the infos don't have the new fields)

@Yooooomi
Copy link
Owner

Yooooomi commented Feb 27, 2024

It did trigger on my instance. Are you sure you have the latest code built?
image

@quentinguidee
Copy link
Contributor Author

quentinguidee commented Feb 27, 2024

Oh the migration was marked as done in the collection "migrations", but was not done for a reason I don't understand. I did a mongorestore and restarted, and it worked

@quentinguidee
Copy link
Contributor Author

quentinguidee commented Feb 27, 2024

Yeah that works really well, I achieve a request of ~300ms instead of 12.1s with the new fields 🚀 And the requests are also simpler in the code (we can extract almost everything to a new function)

  InfosModel.aggregate([
    { $match: basicMatch(user._id, start, end) },
    {
      $group: {
        _id: "$id",
        duration_ms: { $sum: "$durationMs" },
        count: { $sum: 1 },
      },
    },
    {
      $facet: {
        infos: [
          { $sort: { count: -1, _id: 1 } },
          { $skip: offset },
          { $limit: nb },
        ],
        computations: [
          {
            $group: {
              _id: null,
              total_duration_ms: { $sum: "$duration_ms" },
              total_count: { $sum: "$count" },
            },
          },
        ],
      },
    },
    { $unwind: "$infos" },
    { $unwind: "$computations" },
    {
      $project: {
        _id: "$infos._id",
        result: {
          $mergeObjects: ["$infos", "$computations"],
        },
      },
    },
    {
      $replaceRoot: {
        newRoot: {
          $mergeObjects: ["$result", { _id: "$_id" }],
        },
      },
    },
    { $lookup: lightTrackLookupPipeline("_id") },
    { $unwind: "$track" },
    { $lookup: lightAlbumLookupPipeline() },
    { $unwind: "$album" },
    { $lookup: lightArtistLookupPipeline() },
    { $unwind: "$artist" },
  ]);

The $facet is a way to make two different data process from the same data source. So the first part takes only the most listened tracks, while the second takes the total count and duration from all tracks. Then both are merged

@Yooooomi
Copy link
Owner

Nice job! I don't remember why we compute total_duration_ms. This feels weird when we just take a subpart of the whole thing.

@quentinguidee
Copy link
Contributor Author

quentinguidee commented Feb 27, 2024

I think this is for the percentages here:
image

But if we remove them the request should be extremely fast

@Yooooomi
Copy link
Owner

I feel like we could do this in a request on the side. So that this request does not have to compute this for every page.

@quentinguidee quentinguidee changed the title Improve Songs/Artists/Albums performances by ~10 Improve Songs/Artists/Albums performances by ~50 Feb 27, 2024
@quentinguidee quentinguidee changed the title Improve Songs/Artists/Albums performances by ~50 Improve Songs/Artists/Albums query performances by ~50 Feb 27, 2024
@Yooooomi
Copy link
Owner

I just pushed a new file architecture to the branch so that it follows modern monorepo architecture. You might encounter issues. I would suggest you merge the 1.8.0 to your branch to see if there are any merge conflicts.

@Yooooomi Yooooomi changed the base branch from master to release/1.8.0 February 27, 2024 22:54
@quentinguidee quentinguidee force-pushed the perfs/improve-get-best-songs branch from 73148fc to cfe4995 Compare February 28, 2024 01:37
@quentinguidee quentinguidee force-pushed the perfs/improve-get-best-songs branch from cfe4995 to 12fec6d Compare February 28, 2024 01:44
@quentinguidee
Copy link
Contributor Author

quentinguidee commented Feb 28, 2024

Thanks! Everything looks fine, the 3 requests are almost instantaneous now. I have also refactor the 3 to use the same base request. I think the "differents" field and some lookup can be removed depending on the request (tracks/artists/albums) but that doesn't change the time so clean code is better imo for now while we recreate the requests

@quentinguidee
Copy link
Contributor Author

I'll keep this PR restricted to these 3 methods for now so we can iterate faster, so this one is ready for review 🧪

@Yooooomi
Copy link
Owner

Yooooomi commented Feb 28, 2024

Hello, this seems fine to me. Does it change the signature fontend? It seems the objects are not looked up anymore so the frontend cannot display information am I right?
Edit : didn't see the lookup pipeline my bad. Seems fine to me, going to merge this evening to test.

@Yooooomi Yooooomi merged commit be61317 into Yooooomi:release/1.8.0 Feb 28, 2024
@Yooooomi
Copy link
Owner

OK OK OK this is BLAZING fast. Got from 20.48s down to 0.81s on M1 Pro, I'll update this with numbers from intel i7. Infos mongo collection is 10 times bigger though but still only 41MB (32MB of indexes). Huge work thanks a lot. Can't wait to see how it can be applied to other requests. Many users will thank those performance improvements.

@quentinguidee
Copy link
Contributor Author

quentinguidee commented Feb 28, 2024

Yeah that changes entirely the feeling of the app! The tradeoff with document databases compared to relational db is that join requests are really bad and should be avoided as much as possible. Without that the performances are extremely good! I'll work on the other queries really soon, I have a lot of free time at the moment.

For the 30MB of indexes yeah that was predictable, but I think it is nothing compared to the space it took in RAM during the giant request (but I don't have benchmarks for that).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants