CV on Params in DataFrameMapper Transforms? #124

Open
andrewm4894 opened this issue Sep 6, 2017 · 2 comments

Comments

@andrewm4894

Apologies for posting this as an issue, but I feel it could be a useful use case.

I'm wondering whether what I'm trying to do is, or should be, possible.

If I set up a pipeline like:

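from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper

# make pipeline for individual variables
name_to_tfidf = Pipeline([('name_vect', CountVectorizer()), ('name_tfidf', TfidfTransformer())])
ticket_to_tfidf = Pipeline([('ticket_vect', CountVectorizer()), ('ticket_tfidf', TfidfTransformer())])

# map each DataFrame column to its transformer
full_mapper = DataFrameMapper([
    ('Name', name_to_tfidf),
    ('Ticket', ticket_to_tfidf),
    ('Sex', LabelBinarizer())
])

# build full pipeline
# (n_iter is the pre-0.19 scikit-learn argument; later releases use max_iter)
full_pipeline = Pipeline([
    ('mapper', full_mapper),
    ('clf', SGDClassifier(n_iter=15, warm_start=True))
])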

Is there a way to pass a list of options to cross-validate over for individual transforms in the DataFrameMapper, like here:

from copy import deepcopy

# determine full param search space (need to get the params for the mapper parts in here somehow)
full_params = {
    'clf__alpha': [1e-2, 1e-3, 1e-4],
    'clf__loss': ['modified_huber', 'hinge'],
    'clf__penalty': ['l2', 'l1'],
    # now set the params for the datamapper part of the pipeline
    'mapper__features': [[
        ('Name', deepcopy(name_to_tfidf).set_params(name_vect__analyzer=['char', 'char_wb'])),
        ('Ticket', deepcopy(ticket_to_tfidf).set_params(ticket_vect__analyzer=['char', 'char_wb']))
    ]]
}

Ideally I'd like to cross-validate over which params are best for the name_to_tfidf and ticket_to_tfidf pipelines inside the DataFrameMapper.

But passing a list of options to set_params() like this gives me the following error when I go to fit:

ValueError: ['char', 'char_wb'] is not a valid tokenization scheme/analyzer
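
For context, the error is consistent with how `set_params()` behaves: it stores whatever value it is given, so the vectorizer's `analyzer` becomes the whole list instead of one option at a time. Only the top-level values of the dict handed to `GridSearchCV` are expanded into candidates. A minimal sketch that reproduces this on a bare `CountVectorizer` (the sample string is made up for illustration, and the exact error text may differ between scikit-learn versions):

```python
from sklearn.feature_extraction.text import CountVectorizer

# set_params() does no expansion: the list itself becomes the analyzer value.
vect = CountVectorizer().set_params(analyzer=['char', 'char_wb'])
stored_analyzer = vect.analyzer  # the whole list, not a single option

# The invalid value is only rejected once the vectorizer is fitted.
fit_failed = False
try:
    vect.fit(['Braund, Mr. Owen Harris'])  # made-up sample document
except ValueError as exc:
    fit_failed = True
    print(exc)
```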

@scotthuang1989

I think what you want is GridSearchCV: just create a GridSearchCV and pass it to the pipeline as a "normal" estimator, then you will get what you want.
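
One possible reading of this suggestion, sketched under assumptions: a `GridSearchCV` object is itself an estimator, so it can be the final step of a `Pipeline`. The step names, toy data, and `cv=2` below are illustrative choices, not taken from the thread. Note that a search nested this way tunes only the classifier it wraps, not the transforms that come before it.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Inner search: tunes only the classifier's own parameters.
clf_search = GridSearchCV(
    SGDClassifier(random_state=0),
    {'alpha': [1e-2, 1e-3, 1e-4]},
    cv=2,
)

# The search object sits in the pipeline like any other final estimator.
nested_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', clf_search),
])

# Made-up toy data, only to show that the pipeline fits end to end.
names = [
    'Braund Mr Owen', 'Cumings Mrs John', 'Heikkinen Miss Laina',
    'Futrelle Mrs Jacques', 'Allen Mr William', 'Moran Mr James',
    'McCarthy Mr Timothy', 'Palsson Master Gosta',
]
survived = [0, 1, 1, 1, 0, 0, 0, 0]

nested_pipeline.fit(names, survived)
best_alpha = nested_pipeline.named_steps['clf'].best_params_['alpha']
```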

@andrewm4894
Author

My bad - I left that part out. I am doing this:

from sklearn.model_selection import GridSearchCV

# set up grid search
gs_clf = GridSearchCV(full_pipeline, full_params, n_jobs=-1)

And then:

# do the fit
gs_clf.fit(df,df['Survived'])

So I am able to do the CV on the clf params, but I'd also like to do CV on some params within the transforms in the DataFrameMapper - just not sure how to go about this.

Here is a full example notebook.

Basically, I was passing ['char', 'char_wb'] to this line, for example:
('Name',deepcopy(name_to_tfidf).set_params(name_vect__analyzer = ['char', 'char_wb'])),

I was hoping GridSearchCV would then also consider those two values in the grid.
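
A possible workaround, offered as a sketch rather than a confirmed sklearn-pandas feature: since `mapper__features` is already searched as a top-level parameter, each candidate can be a complete feature list built with one concrete analyzer per column. The helper `make_tfidf` and the variable names are hypothetical, and this has not been run against the notebook; it assumes the mapper accepts a new `features` list through `set_params()`, which the error at fit time suggests it does.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer


def make_tfidf(prefix, analyzer):
    """Hypothetical helper: fresh count -> tf-idf pipeline with one analyzer."""
    return Pipeline([
        (prefix + '_vect', CountVectorizer(analyzer=analyzer)),
        (prefix + '_tfidf', TfidfTransformer()),
    ])


analyzers = ['char', 'char_wb']

# One complete feature list per (Name analyzer, Ticket analyzer) combination.
# 'Sex' is repeated in every candidate so no column is dropped from the mapper.
feature_candidates = [
    [
        ('Name', make_tfidf('name', name_analyzer)),
        ('Ticket', make_tfidf('ticket', ticket_analyzer)),
        ('Sex', LabelBinarizer()),
    ]
    for name_analyzer, ticket_analyzer in product(analyzers, analyzers)
]

full_params = {
    'clf__alpha': [1e-2, 1e-3, 1e-4],
    'clf__loss': ['modified_huber', 'hinge'],
    'clf__penalty': ['l2', 'l1'],
    # each entry is one whole candidate value for the mapper's features
    'mapper__features': feature_candidates,
}
```

With two analyzers per text column this gives 4 feature-list candidates, so the grid has 3 × 2 × 2 × 4 = 48 combinations. The trade-off is that every combination has to be spelled out as a full feature list, which grows quickly as more transform options are added.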
