
Sourcery refactored main branch #2

Open · sourcery-ai[bot] wants to merge 1 commit into base: main from sourcery/main

Conversation


@sourcery-ai sourcery-ai bot commented Nov 23, 2023

Branch main refactored by Sourcery.

If you're happy with these changes, merge this Pull Request using the Squash and merge strategy.

See our documentation here.

Run Sourcery locally

Reduce the feedback loop during development by using the Sourcery editor plugin.

Review changes via command line

To manually merge these changes, make sure you're on the main branch, then run:

git fetch origin sourcery/main
git merge --ff-only FETCH_HEAD
git reset HEAD^

Help us improve this pull request!

@sourcery-ai sourcery-ai bot requested a review from ludoplex November 23, 2023 15:58

@sourcery-ai sourcery-ai bot left a comment


Sourcery timed out performing refactorings.

Due to GitHub API limits, only the first 60 comments can be shown.

Comment on lines -40 to +45
-    self.extras: dict = {}
-    self.extras['autofe'] = list_requirements("pyrecdp/autofe/requirements.txt")
+    self.extras: dict = {
+        'autofe': list_requirements("pyrecdp/autofe/requirements.txt")
+    }
     self.extras['LLM'] = list_requirements("pyrecdp/LLM/requirements.txt")
     self.extras["all"] = list(set(chain.from_iterable(self.extras.values())))

Function SetupSpec.__init__ refactored: the 'autofe' extras assignment is merged into the dict declaration.
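For reference, a minimal standalone sketch of the resulting pattern; the requirement lists here are invented, while the real ones come from the requirements.txt files:

from itertools import chain

extras: dict = {
    'autofe': ['pandas', 'scikit-learn'],   # hypothetical requirement lists
}
extras['LLM'] = ['transformers', 'pandas']
extras['all'] = list(set(chain.from_iterable(extras.values())))
print(sorted(extras['all']))  # ['pandas', 'scikit-learn', 'transformers']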

Comment on lines -34 to +37
file_names = dict((key, path_prefix + os.path.join(current_path, data_folder, filename)) for key, filename in SO_FILE.items())
file_names = {
key: path_prefix + os.path.join(current_path, data_folder, filename)
for key, filename in SO_FILE.items()
}

Function main refactored: the dict() call over a generator is replaced with a dict comprehension.
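A small self-contained illustration of the dict-comprehension form; the SO_FILE entries and paths below are invented for the example:

import os

SO_FILE = {'recsys': 'librecsys.so', 'categorify': 'libcategorify.so'}  # hypothetical
path_prefix, current_path, data_folder = 'file://', '/tmp', 'lib'
file_names = {
    key: path_prefix + os.path.join(current_path, data_folder, filename)
    for key, filename in SO_FILE.items()
}
print(file_names['recsys'])  # file:///tmp/lib/librecsys.so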

Comment on lines -68 to +79
str_fields1 = [StructField('%s' % i, StringType())
for i in self.string_cols1]
long_fields1 = [StructField('%s' % i, LongType())
for i in self.long_cols1]
str_fields2 = [StructField('%s' % i, StringType())
for i in self.string_cols2]
long_fields2 = [StructField('%s' % i, LongType())
for i in self.long_cols2]
bool_fields1 = [StructField('%s' % i, BooleanType())
for i in self.bool_cols1]
long_fields3 = [StructField('%s' % i, LongType())
for i in self.long_cols3]
str_fields3 = [StructField('%s' % i, StringType())
for i in self.string_cols3]
long_fields4 = [StructField('%s' % i, LongType())
for i in self.long_cols4]
bool_fields2 = [StructField('%s' % i, BooleanType())
for i in self.bool_cols2]
long_fields5 = [StructField('%s' % i, LongType())
for i in self.long_cols5]
bool_fields3 = [StructField('%s' % i, BooleanType())
for i in self.bool_cols3]
double_fields = [StructField('%s' % i, DoubleType())
for i in self.double_cols]
str_fields1 = [StructField(f'{i}', StringType()) for i in self.string_cols1]
long_fields1 = [StructField(f'{i}', LongType()) for i in self.long_cols1]
str_fields2 = [StructField(f'{i}', StringType()) for i in self.string_cols2]
long_fields2 = [StructField(f'{i}', LongType()) for i in self.long_cols2]
bool_fields1 = [StructField(f'{i}', BooleanType()) for i in self.bool_cols1]
long_fields3 = [StructField(f'{i}', LongType()) for i in self.long_cols3]
str_fields3 = [StructField(f'{i}', StringType()) for i in self.string_cols3]
long_fields4 = [StructField(f'{i}', LongType()) for i in self.long_cols4]
bool_fields2 = [StructField(f'{i}', BooleanType()) for i in self.bool_cols2]
long_fields5 = [StructField(f'{i}', LongType()) for i in self.long_cols5]
bool_fields3 = [StructField(f'{i}', BooleanType()) for i in self.bool_cols3]
double_fields = [StructField(f'{i}', DoubleType()) for i in self.double_cols]

Function RecsysSchema.toStructType refactored: the '%s' % i interpolation is replaced with f-strings.
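The change is purely notational: for a column name that is already a string, '%s' % i, f'{i}', and i itself produce the same value, so the resulting StructField is identical. A one-line check (column name is illustrative):

name = "tweet_id"
assert '%s' % name == f'{name}' == name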

Comment on lines -417 to +421

     if len(x)>rw:
         return hashit(x[rw])
     elif rw<0:
-        if len(x)>0:
-            return hashit(x[-1])
-        else:
-            return 0
+        return hashit(x[-1]) if len(x)>0 else 0

Function ret_word refactored: the nested if/else is collapsed into a conditional expression.
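The same simplification in isolation, with a stand-in hashit so the snippet runs on its own:

def hashit(token):          # stand-in for the real hashing helper
    return hash(token) % 1000

def last_or_zero(x):
    # equivalent to: if len(x) > 0: return hashit(x[-1]); else: return 0
    return hashit(x[-1]) if len(x) > 0 else 0

print(last_or_zero([]), last_or_zero(["word"]))  # 0, then a small hash value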

Comment on lines -437 to -445
-    if len(text_split)>1:
-        if text_split[1] in ['_']:
-            uhash += clean_text(text_split[1]) + clean_text(text_split[2])
-            text_split = text_split[2:]
-        else:
-            cl_loop = False
+    if len(text_split) > 1 and text_split[1] in ['_']:
+        uhash += clean_text(text_split[1]) + clean_text(text_split[2])
+        text_split = text_split[2:]
+    else:
+        cl_loop = False


Function extract_hash refactored: the nested ifs are merged into a single if with the conditions combined.
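Ignoring the else branch for brevity, the merge follows the general identity that a nested if with a single body equals one if with the conditions joined by `and`; the tokens below are illustrative:

text_split = ["#", "_", "tag"]
uhash_a = uhash_b = ""

if len(text_split) > 1:            # before
    if text_split[1] in ['_']:
        uhash_a += text_split[1] + text_split[2]

if len(text_split) > 1 and text_split[1] in ['_']:   # after
    uhash_b += text_split[1] + text_split[2]

assert uhash_a == uhash_b == "_tag"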

Comment on lines -1137 to +1157
df = spark.read.parquet(path_prefix+"/recsys2021/datapre_stage1/stage1_valid_all")
df = spark.read.parquet(
f"{path_prefix}/recsys2021/datapre_stage1/stage1_valid_all"
)

Function valid_stage2 refactored: string concatenation for the parquet path is replaced with an f-string.

Comment on lines -1193 to +1238

############# load decoder data
df = spark.read.parquet(path_prefix+current_path+"test1_decode")
print("data decoded!")

############# load dict from stage 1
dict_names = ['tweet', 'mention']
dict_dfs = [{'col_name': name, 'dict': spark.read.parquet(
"%s/%s/%s/%s" % (proc.path_prefix, proc.current_path, proc.dicts_path, name))} for name in dict_names]
dict_dfs = [
{
'col_name': name,
'dict': spark.read.parquet(
f"{proc.path_prefix}/{proc.current_path}/{proc.dicts_path}/{name}"
),
}
for name in dict_names
]
_, te_test_dfs, y_mean_all_df = getTargetEncodingFeaturesDicts(proc, mode='stage1', train_dict_load=False)

############# set up to stage 2
current_path = "/recsys2021/datapre_stage2/"
proc = DataProcessor(spark, path_prefix,
current_path=current_path, dicts_path=dicts_folder, shuffle_disk_capacity="1500GB",spark_mode='local')

############# count encoding
ce_test_dfs = CountEncodingFeatures(df, proc, gen_dict=True,mode="inference",train_generate=False)


Function inference_join refactored: the '%s' path formatting inside the dict build is replaced with an f-string.

ctr = positive/float(len(gt))
return ctr
return positive/float(len(gt))

Function calculate_ctr refactored: the intermediate ctr variable is removed and the expression is returned directly.
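A runnable sketch of the simplified function; the signature is assumed for illustration, and note that under Python 3 the division already yields a float, so float() is kept only for compatibility:

def calculate_ctr(gt, positive):   # hypothetical signature
    return positive / float(len(gt))

print(calculate_ctr([1, 0, 1, 0], 2))  # 0.5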

Comment on lines -56 to +60
feature_list = []
feature_list.append(stage1_reply_features)
feature_list.append(stage1_retweet_features)
feature_list.append(stage1_comment_features)
feature_list.append(stage1_like_features)
feature_list = [
stage1_reply_features,
stage1_retweet_features,
stage1_comment_features,
stage1_like_features,
]

Lines 56-174 refactored: the repeated feature_list.append calls become a single list literal.
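The literal form builds the same nested list as the four append calls; the placeholder feature lists below stand in for the real ones:

stage1_reply_features = ["reply_f1"]
stage1_retweet_features = ["retweet_f1"]
stage1_comment_features = ["comment_f1"]
stage1_like_features = ["like_f1"]

feature_list = [
    stage1_reply_features,
    stage1_retweet_features,
    stage1_comment_features,
    stage1_like_features,
]
print(len(feature_list))  # 4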

ctr = positive/float(len(gt))
return ctr
return positive/float(len(gt))

Function calculate_ctr refactored: as above, the intermediate ctr variable is removed.

Comment on lines -51 to +55
feature_list = []
feature_list.append(stage2_reply_features)
feature_list.append(stage2_retweet_features)
feature_list.append(stage2_comment_features)
feature_list.append(stage2_like_features)
feature_list = [
stage2_reply_features,
stage2_retweet_features,
stage2_comment_features,
stage2_like_features,
]

Lines 51-165 refactored: the repeated feature_list.append calls become a single list literal.

Comment on lines -16 to +30
test = pd.read_parquet(f'{data_path}/stage12_test')
print(test.shape)
print(f"load data took {time.time() - t1} s")

######## split data
t1 = time.time()
indexs = [i for i in range(distributed_nodes)]
step = int(len(test)/distributed_nodes)
indexs = list(range(distributed_nodes))
step = len(test) // distributed_nodes
tests = []
for i in range(distributed_nodes):
    if i<distributed_nodes-1:
        tests.append(test[i*step:(i+1)*step])
    else:
        tests.append(test[i*step:])


Lines 16-30 refactored: list(range(...)) replaces the identity comprehension and integer floor division // replaces int(.../...).
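The split logic shown with a small list standing in for the test dataframe; list(range(n)) and // give the same indices and step as the original comprehension and int() division for these positive sizes:

distributed_nodes = 3
test = list(range(10))                       # stand-in for the pandas dataframe
indexs = list(range(distributed_nodes))      # same as [i for i in range(distributed_nodes)]
step = len(test) // distributed_nodes        # same as int(len(test) / distributed_nodes) here
tests = []
for i in range(distributed_nodes):
    if i < distributed_nodes - 1:
        tests.append(test[i * step:(i + 1) * step])
    else:
        tests.append(test[i * step:])        # last chunk keeps the remainder
print(tests)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]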

Comment on lines -412 to +416

     if len(x)>rw:
         return hashit(x[rw])
     elif rw<0:
-        if len(x)>0:
-            return hashit(x[-1])
-        else:
-            return 0
+        return hashit(x[-1]) if len(x)>0 else 0

Function ret_word refactored: the nested if/else is collapsed into a conditional expression, as above.

Comment on lines -432 to -440
-    if len(text_split)>1:
-        if text_split[1] in ['_']:
-            uhash += clean_text(text_split[1]) + clean_text(text_split[2])
-            text_split = text_split[2:]
-        else:
-            cl_loop = False
+    if len(text_split) > 1 and text_split[1] in ['_']:
+        uhash += clean_text(text_split[1]) + clean_text(text_split[2])
+        text_split = text_split[2:]
+    else:
+        cl_loop = False


Function extract_hash refactored: the nested ifs are merged into a single condition, as above.

Comment on lines -463 to +460
-    elif x[-1]=='?' and x[-2]=='!':
+    elif x[-1] == '?':
         return(2)
-    elif x[-1]=='!' and x[-2]=='?':
-        return(3)
-    elif x[-1]=='!' and x[-2]!='?':
+    elif x[-1] == '!':

Function check_last_char_quest refactored: the x[-2] checks are dropped from the elif conditions.

Comment on lines -1160 to +1171

############# load decoder data
df = spark.read.parquet(path_prefix+current_path+'test1_decode.parquet')
print("data decoded!")

############# load dict from stage 1
dict_names = ['tweet', 'mention']
dict_dfs = [{'col_name': name, 'dict': pd.read_parquet(
"%s/%s/%s/%s" % (path_prefix, current_path, dicts_folder, name+'.parquet'))} for name in dict_names]
dict_dfs = [
{
'col_name': name,
'dict': pd.read_parquet(
f"{path_prefix}/{current_path}/{dicts_folder}/{name + '.parquet'}"
),
}
for name in dict_names
]
_, te_test_dfs, y_mean_all_df = getTargetEncodingFeaturesDicts(mode='stage1', train_dict_load=False)

############# set up to stage 2
current_path = "/recsys2021/datapre_stage2/"

############# count encoding
ce_test_dfs = CountEncodingFeatures(df, gen_dict=True,mode="inference",train_generate=False)


Function inference_join refactored: the '%s' path formatting is replaced with an f-string.

Comment on lines -72 to +78
dict_dfs = [{'col_name': name, 'dict': df.select(spk_func.col(name).alias('dict_col'))} for name in dict_names]
return dict_dfs
return [
{
'col_name': name,
'dict': df.select(spk_func.col(name).alias('dict_col')),
}
for name in dict_names
]

Function get_dict_for_asin refactored: the list comprehension is returned directly instead of being assigned to a temporary.

Comment on lines -120 to +133
dict_dfs = []
dict_dfs.append({'col_name': 'reviewer_id', 'dict': user_df})
dict_dfs.append({'col_name': 'asin', 'dict': asin_df})
dict_dfs.append({'col_name': 'category', 'dict': cat_df})
dict_dfs.append({'col_name': 'hist_asin', 'dict': asin_df})
dict_dfs.append({'col_name': 'hist_category', 'dict': cat_df})
dict_dfs.append({'col_name': 'noclk_hist_asin', 'dict': asin_df})
dict_dfs.append({'col_name': 'noclk_hist_category', 'dict': asin_cat_df})

dict_dfs = [
{'col_name': 'reviewer_id', 'dict': user_df},
{'col_name': 'asin', 'dict': asin_df},
{'col_name': 'category', 'dict': cat_df},
{'col_name': 'hist_asin', 'dict': asin_df},
{'col_name': 'hist_category', 'dict': cat_df},
{'col_name': 'noclk_hist_asin', 'dict': asin_df},
{'col_name': 'noclk_hist_category', 'dict': asin_cat_df},
]

Function categorify_dien_data refactored: the repeated dict_dfs.append calls become a single list literal.

Comment on lines -144 to +151
with open("/home/xxx/dien/" + f'/{output_name}.pkl', "rb") as f:
voc = dict((key, value) for (key,value) in pkl.load(f).items()) #nosec
dict_df = convert_to_spark_df(voc, proc.spark)
return dict_df
with open(f'/home/xxx/dien//{output_name}.pkl', "rb") as f:
voc = dict(pkl.load(f).items())
return convert_to_spark_df(voc, proc.spark)

Function load_voc refactored: the redundant dict rebuild is simplified and the result is returned directly.
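The dropped generator was only producing a shallow copy; a quick check with a toy vocabulary:

voc = {"book_1": 0, "book_2": 1}   # stand-in for the pickled vocabulary
assert dict((key, value) for (key, value) in voc.items()) == dict(voc.items()) == dict(voc)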

Comment on lines -212 to +216
for user, r in user_map.items():
for r in user_map.values():

Function save_to_local_splitByUser refactored: iteration switches from .items() to .values() since the key is unused.
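Since the user key is never used in the loop body, iterating the values is equivalent; the records below are illustrative:

user_map = {"u1": [("itemA", 1)], "u2": [("itemB", 0), ("itemC", 1)]}
for r in user_map.values():
    print(len(r))   # 1, then 2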

Comment on lines -226 to -235
-    idx = 0
     if len(source_path_dict[output_name + fix]) == 1:
         file_name = source_path_dict[output_name + fix][0]
         shutil.copy(file_name, f"{tgt_path}")
     else:
-        for file_name in source_path_dict[output_name + fix]:
+        for idx, file_name in enumerate(source_path_dict[output_name + fix]):
             #print(f"result renamed from {file_name} to {tgt_path}_{idx}")
             shutil.copy(file_name, f"{tgt_path}_{idx}")
             shutil.rmtree(file_name, ignore_errors=True)
-            idx += 1

Function result_rename_or_convert refactored: the manually incremented idx counter is replaced with enumerate().
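enumerate() supplies the index that was previously tracked by hand; the file names below are placeholders for the source_path_dict entries:

files = ["part-00000", "part-00001"]
for idx, file_name in enumerate(files):
    print(f"{file_name} -> target_{idx}")   # part-00000 -> target_0, part-00001 -> target_1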

Comment on lines -275 to +293
reviews_info_df = process_reviews(spark, "%s/%s/raw_data/reviews_Books.json" % (path_prefix, original_folder), proc, "reviews-info")
reviews_info_df = process_reviews(
spark,
f"{path_prefix}/{original_folder}/raw_data/reviews_Books.json",
proc,
"reviews-info",
)
#reviews_info_df.repartition(1).write.format("csv").option('sep', '\t').mode("overwrite").save("%s/%s/j2c_test/reviews-info-spark" % (path_prefix, original_folder))
t1 = timer()
print(f"parse reviews-info with spark took {(t1 - t0)} secs")

t0 = timer()
item_info_df = process_meta(spark, '%s/%s/raw_data/meta_Books.json' % (path_prefix, original_folder), proc, "item-info")
item_info_df = process_meta(
spark,
f'{path_prefix}/{original_folder}/raw_data/meta_Books.json',
proc,
"item-info",
)

Function main refactored: the '%s' path formatting is replaced with f-strings and the long calls are wrapped across lines.

Comment on lines -27 to +36
reviews_info_df = spark.read.schema(reviews_info_schema).option('sep', '\t').csv(path + "/reviews-info")
item_info_df = spark.read.schema(item_info_schema).option('sep', '\t').csv(path + "/item-info")
reviews_info_df = (
spark.read.schema(reviews_info_schema)
.option('sep', '\t')
.csv(f"{path}/reviews-info")
)
item_info_df = (
spark.read.schema(item_info_schema)
.option('sep', '\t')
.csv(f"{path}/item-info")
)

Function load_csv refactored: the paths use f-strings and the chained reader calls are wrapped across lines.

Comment on lines -94 to +108
dict_dfs = [{'col_name': name, 'dict': df.select(spk_func.col(name).alias('dict_col'))} for name in dict_names]
return dict_dfs
return [
{
'col_name': name,
'dict': df.select(spk_func.col(name).alias('dict_col')),
}
for name in dict_names
]

Function get_dict_for_asin refactored: the list comprehension is returned directly, as above.

Comment on lines -153 to +170
dict_dfs = []
dict_dfs.append({'col_name': 'reviewer_id', 'dict': user_df})
dict_dfs.append({'col_name': 'asin', 'dict': asin_df})
dict_dfs.append({'col_name': 'category', 'dict': cat_df})

dict_dfs = [
{'col_name': 'reviewer_id', 'dict': user_df},
{'col_name': 'asin', 'dict': asin_df},
{'col_name': 'category', 'dict': cat_df},
]

Function categorify_dien_data refactored: the repeated dict_dfs.append calls become a single list literal.

Comment on lines -331 to +368
reviews_info_df = process_reviews(spark, "%s/%s/raw_data/reviews_Books.json" % (path_prefix, original_folder), proc, "reviews-info")
reviews_info_df = process_reviews(
spark,
f"{path_prefix}/{original_folder}/raw_data/reviews_Books.json",
proc,
"reviews-info",
)
#reviews_info_df.repartition(1).write.format("csv").option('sep', '\t').mode("overwrite").save("%s/%s/j2c_test/reviews-info-spark" % (path_prefix, original_folder))
t1 = timer()
print(f"parse reviews-info with spark took {(t1 - t0)} secs")

t0 = timer()
item_info_df = process_meta(spark, '%s/%s/raw_data/meta_Books.json' % (path_prefix, original_folder), proc, "item-info")
item_info_df = process_meta(
spark,
f'{path_prefix}/{original_folder}/raw_data/meta_Books.json',
proc,
"item-info",
)

Function main refactored: the '%s' path formatting is replaced with f-strings and the long calls are wrapped across lines.

Comment on lines -41 to +49
dict_dfs = [{'col_name': name, 'dict': proc.spark.read.parquet(
"%s/%s/%s/%s" % (proc.path_prefix, proc.current_path, proc.dicts_path, name))} for name in to_categorify_cols]
dict_dfs = [
{
'col_name': name,
'dict': proc.spark.read.parquet(
f"{proc.path_prefix}/{proc.current_path}/{proc.dicts_path}/{name}"
),
}
for name in to_categorify_cols
]

Function categorifyAllFeatures refactored: the '%s' path formatting is replaced with an f-string.

Comment on lines -41 to +49
dict_dfs = [{'col_name': name, 'dict': proc.spark.read.format("arrow").load(
"%s/%s/%s/%s" % (proc.path_prefix, proc.current_path, proc.dicts_path, name))} for name in to_categorify_cols]
dict_dfs = [
{
'col_name': name,
'dict': proc.spark.read.format("arrow").load(
f"{proc.path_prefix}/{proc.current_path}/{proc.dicts_path}/{name}"
),
}
for name in to_categorify_cols
]

Function categorifyAllFeatures refactored: the '%s' path formatting is replaced with an f-string, as above.

Comment on lines -123 to +127
print(f"start launcu_milvus")
print("start launcu_milvus")
launch_milvus()
document_store = MilvusDocumentStore(host="localhost", username="", password="",
index="flat")
print(f"start create retriever")
print("start create retriever")

Function main refactored: f-strings without placeholders are replaced with plain string literals.
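An f-string with no replacement fields is just an ordinary string literal, so dropping the prefix cannot change the output:

assert f"start launcu_milvus" == "start launcu_milvus"
assert f"start create retriever" == "start create retriever"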

Comment on lines -43 to +46
file_names = dict((key, path_prefix + os.path.join(current_path, data_folder, filename)) for key, filename in SO_FILE.items())
file_names = {
key: path_prefix + os.path.join(current_path, data_folder, filename)
for key, filename in SO_FILE.items()
}

Function main refactored: the dict() call over a generator is replaced with a dict comprehension, as in the earlier hunk.


sweep-ai bot commented Nov 23, 2023

Apply Sweep Rules to your PR?

  • Apply: All new business logic should have corresponding unit tests.
  • Apply: Refactor large functions to be more modular.
  • Apply: Add docstrings to all functions and file headers.
