[EXTERNAL] ReadOnlySpan<ReadOnlyMemory<char>> in .NET Interactive and Jupyter Notebook #950

Open
Jrjuniorjr opened this issue Dec 18, 2020 · 2 comments

Jrjuniorjr commented Dec 18, 2020

Hi all, I'm new to ML.NET and I'm trying to use FeaturizeText with TF-IDF weighting.
In Jupyter Notebook with .NET Interactive, this code produces an error:


```csharp
using Microsoft.ML;
using Microsoft.ML.Data;
using System;
using Microsoft.ML.Transforms.Text;
using System.Collections.Generic;

namespace SentimentAnalysis
{
    public class Input
    {
        [LoadColumn(0)]
        public string Text { get; set; }
        [LoadColumn(1)]
        public int Rating { get; set; }
    }
    public class Output
    {
        public float[] Features { get; set; }
    }
    class Program
    {
        public static TextFeaturizingEstimator.Options GetOptions()
        {
            var vectorizedTextOptions = new TextFeaturizingEstimator.Options()
            {
                KeepDiacritics = false,
                KeepPunctuations = false,
                KeepNumbers = true,
                CaseMode = TextNormalizingEstimator.CaseMode.Lower,
                StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options()
                {
                    Language = TextFeaturizingEstimator.Language.English
                },

                // ngram options
                WordFeatureExtractor = new WordBagEstimator.Options()
                {
                    NgramLength = 1,
                    UseAllLengths = false, // Unigrams only (NgramLength = 2 with UseAllLengths = true would add bigrams)
                    Weighting = NgramExtractingEstimator.WeightingCriteria.TfIdf, // TF-IDF
                },

                // chargram options
                CharFeatureExtractor = null
            };
            return vectorizedTextOptions;
        }
        static void Main(string[] args)
        {
            var context = new MLContext();
            var list = new List<Input>()
            {
                new Input()
                {
                    Text = "This is machine learning example", Rating = 4
                },
                new Input()
                {
                    Text = "I like .NET", Rating = 5
                }
            };
            var samples = context.Data.LoadFromEnumerable<Input>(list);
            var options = GetOptions();
            var transformFitted = context.Transforms.Text.FeaturizeText(
                    "Features",
                    options,
                    "Text"
                ).Fit(samples);

            var tfIdfTransformed = transformFitted.Transform(samples);

            var predictionEngine = context.Model.CreatePredictionEngine<Input, Output>(transformFitted);
            Output prediction = null;

            // Slot names are the n-grams that label each position of the Features vector.
            VBuffer<ReadOnlyMemory<char>> slotNames = default;
            tfIdfTransformed.Schema["Features"].GetSlotNames(ref slotNames);

            var tfIdfColumn = tfIdfTransformed.GetColumn<VBuffer<float>>(tfIdfTransformed.Schema["Features"]);
            var slots = slotNames.GetValues(); // ReadOnlySpan<ReadOnlyMemory<char>>

            Console.Write("NGrams: ");
            foreach (var featureRow in tfIdfColumn)
            {
                foreach (var item in featureRow.Items())
                {
                    Console.Write($"{slots[item.Key]}  ");
                }
                Console.WriteLine();
            }
        }
    }
}
```

When I run var slots = slotNames.GetValues(); in the notebook, I get:

Error: (3,1): error CS8345: Field or auto-implemented property cannot be of type 'ReadOnlySpan<ReadOnlyMemory<char>>' unless it is an instance member of a ref struct.

In Visual Studio 2019 the same code compiles and runs, but when I loop over the column values:


```csharp
Console.Write("NGrams: ");
foreach (var featureRow in tfIdfColumn)
{
    foreach (var item in featureRow.Items())
    {
        Console.Write($"{slots[item.Key]}  ");
    }
    Console.WriteLine();
}
```

the output repeats some words.

For example, with the two inputs "This is a test" and "I like dotnet", what gets printed is:
This is a test
test I like dotnet

This is just an illustration of the problem, not the exact output.

Another question: how can I do the equivalent of the following Python code in ML.NET?
```python
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# df_balenceado and variance_threshold_selector are defined elsewhere (not shown).
docs = []
for index, row in df_balenceado.iterrows():
    docs.append(row["CleanReview-NoStemming"])

tfidf_vectorizer = TfidfVectorizer()
fitted_vectorizer = tfidf_vectorizer.fit(docs)
tfidf_vectorizer_vectors = fitted_vectorizer.transform(docs)
# On scikit-learn 1.0 and later, use get_feature_names_out() instead.
feature_names = tfidf_vectorizer.get_feature_names()
dictonary = pd.DataFrame(tfidf_vectorizer_vectors.todense().tolist(), columns=feature_names)
dictonary.head()
pickle.dump(fitted_vectorizer, open("tfidf_nostemming.pkl", "wb"))
dictonary = variance_threshold_selector(dictonary, 0.00005)
dictonary.to_csv("DictionaryNoStemming.csv", index=False)
```

@WhiteBlackGoose
Contributor

Smaller repro:

ReadOnlySpan<int> a = new [] { 1, 2, 3 };

@Jrjuniorjr
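The likely reason is that a top-level variable in a script automatically becomes a field, and a ref struct such as ReadOnlySpan<T> cannot be stored in a field of a class.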

For instance,

static ReadOnlySpan<int> Quack()
    => new [] { 1, 2, 3 };

Quack()

will give another error, complaining about not being able to convert it to object.

Custom formatting would not help either: to display the value, the formatter has to convert it to object, which is impossible for a ref struct such as a span.

It looks like a serious issue, but I doubt it can be solved.
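
That said, a possible workaround can be sketched, assuming the restriction only affects top-level variables and displayed values: keep the span as a local inside a method and copy it into an array before it leaves. The method name `CopyValues` is hypothetical and only used for illustration.

```csharp
using System;

// Inside a method body a span is an ordinary local, so this compiles.
static int[] CopyValues()
{
    ReadOnlySpan<int> span = new[] { 1, 2, 3 };
    return span.ToArray(); // copy to a heap array before leaving the method
}

// An array is a normal reference type, so it can live in a top-level variable.
int[] values = CopyValues();
Console.WriteLine(string.Join(", ", values));
```

The cost is one extra copy of the data, which should be negligible for something like a list of slot names.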

@jonsequitur
Contributor

This would need to be addressed in Roslyn scripting.

See: dotnet/roslyn#40213
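
Until that is addressed, a notebook-friendly variant of the original snippet might avoid span-typed variables altogether. The sketch below is not a confirmed fix: it assumes the Microsoft.ML package is referenced, uses `VBuffer<T>.DenseValues()` instead of `GetValues()` so that no `ReadOnlySpan` variable is declared, and reuses the two sample sentences from the issue. It also skips zero weights when printing, on the unverified assumption that the repeated words come from `Items()` returning every slot when the feature vector is stored densely. The names `names` and `present` are illustrative.

```csharp
// Requires the Microsoft.ML NuGet package; in a notebook cell, reference it first with:
//   #r "nuget: Microsoft.ML"
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Text;

var context = new MLContext();
var samples = context.Data.LoadFromEnumerable(new[]
{
    new Input { Text = "This is a test" },
    new Input { Text = "I like dotnet" },
});

var options = new TextFeaturizingEstimator.Options
{
    WordFeatureExtractor = new WordBagEstimator.Options
    {
        NgramLength = 1,
        UseAllLengths = false,
        Weighting = NgramExtractingEstimator.WeightingCriteria.TfIdf,
    },
    CharFeatureExtractor = null,
};

var transformed = context.Transforms.Text
    .FeaturizeText("Features", options, "Text")
    .Fit(samples)
    .Transform(samples);

// Read the slot names into a plain string array instead of a span.
VBuffer<ReadOnlyMemory<char>> slotNames = default;
transformed.Schema["Features"].GetSlotNames(ref slotNames);
string[] names = slotNames.DenseValues().Select(n => n.ToString()).ToArray();

foreach (var row in transformed.GetColumn<VBuffer<float>>("Features"))
{
    // Keep only the n-grams whose weight is non-zero for this row.
    var present = row.Items()
        .Where(kv => kv.Value != 0)
        .Select(kv => names[kv.Key]);
    Console.WriteLine(string.Join(" ", present));
}

public class Input
{
    public string Text { get; set; }
}
```

If the repetition persists after filtering out zero weights, the cause lies elsewhere and this sketch only helps with the CS8345 error.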

jonsequitur changed the title from "ReadOnlySpan<ReadOnlyMemory<char>> in .NET Interactive and Jupyter Notebook" to "[EXTERNAL] ReadOnlySpan<ReadOnlyMemory<char>> in .NET Interactive and Jupyter Notebook" on Jul 12, 2021