Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TypedOneHotEncoder #322

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Conversation

manuzhang
Copy link

@manuzhang manuzhang commented Sep 9, 2018

This adds a typed API over Spark ML's OneHotEncoderEstimator since Spark 2.3.0

@codecov-io
Copy link

codecov-io commented Sep 9, 2018

Codecov Report

Merging #322 into master will increase coverage by 1.53%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #322      +/-   ##
==========================================
+ Coverage   94.83%   96.36%   +1.53%     
==========================================
  Files          53       56       +3     
  Lines         968      991      +23     
  Branches        9        9              
==========================================
+ Hits          918      955      +37     
+ Misses         50       36      -14
Impacted Files Coverage Δ
...cala/frameless/ml/feature/TypedOneHotEncoder.scala 100% <100%> (ø)
...main/scala/frameless/ops/RelationalGroupsOps.scala 79.16% <0%> (-18.46%) ⬇️
...la/frameless/functions/NonAggregateFunctions.scala 100% <0%> (ø) ⬆️
...re/src/main/scala/frameless/CatalystAbsolute.scala
...re/src/main/scala/frameless/CatalystBitShift.scala 100% <0%> (ø)
.../frameless/CatalystNumericWithJavaBigDecimal.scala 100% <0%> (ø)
core/src/main/scala/frameless/CatalystRound.scala 100% <0%> (ø)
...aset/src/main/scala/frameless/ops/GroupByOps.scala 98.36% <0%> (+31.69%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 012f1a1...b51b337. Read the comment docs.

@imarios
Copy link
Contributor

imarios commented Sep 11, 2018

@manuzhang thanks for the PR! Will be looking at this shortly (I hope!).

}

test("create() compiles only with correct inputs") {
illTyped("TypedOneHotEncoder.create[Double]()")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you meant to use .apply[] instead of .create[]?

@OlivierBlanvillain
Copy link
Contributor

@manuzhang I had a quick look at the diff and it looks good to me, but this is an area for frameless I'm not very familiar with.

@atamborrino Would you have time for a review?

@imarios
Copy link
Contributor

imarios commented Sep 22, 2018

@manuzhang I am taking a closer look at the PR. I think this implementation can be made better and more type-safe. A nice API would be:

val ds: TypedDataset[(String, Int, Long)] = ...
ds.transformOneHot(ds('_1)): TypedDataset[(String, Int, Long, Vector)]

The above will not compile if you try to do one-hot encoding to any column that is not String or Char, so you can add further type-safety restriction on the type of column you want to transform. Also, note how the type of the resulting dataset correctly has a new column of type vector appended to the end.

@imarios imarios self-requested a review September 22, 2018 12:52
@manuzhang
Copy link
Author

@imarios TypedOneHotEncoder is a TypedEstimator which generates a type-safe TypedTransformer with fit. The transformer from TypedOneHotEncoder requires Int as inputs and appended Vector in outputs which are checked at compile time.

@imarios
Copy link
Contributor

imarios commented Dec 5, 2018

@manuzhang I totally dropped the ball on this PR. Let me go over this one more time and merge the PR.


case class Outputs(output: Vector)

sealed abstract class HandleInvalid(val sparkValue: String)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sealed abstract class HandleInvalid(val sparkValue: String)
final class HandleInvalid private(val sparkValue: String) extends AnyVal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants