Skip to content

Commit

Permalink
Update
Browse files Browse the repository at this point in the history
Update

Update

Update

Update

Add a more meaningful check to make sure we are not merging blocks (#4186)

This is a follow-up to
#4176 (comment)

I am now counting the number of blocks with (17) and without (31) block
merging. I double checked to make sure this does not pass when we use an
aggressive region simplification strategy.

[AMD] Skip mfma layout in maybeDuplicate (#4170)

The workaround introduced in
#4048 "forgot" to skip mfma
layout.

[TEST] Merge duplicate `max_num_imprecise_acc` tests and improve code (#4191)

[DOCS][NFC] Fix doc formatting problems (#4195)

1. f-string cannot be used as docstrings in Python.
2. URLs should follow the reStructuredText format.
3. Code snippets in a code block should be indented.

Tested and passed on a local machine.

[BACKEND] Fix regression in pipeliner pre-checks. (#4196)

During some previous refactoring we changed the logic and started
pipeling cases that had incompatible shared encoding. This was missed
because one of the lit test had not been updated :(

Remove tl.multiple_of call from tma persistent kernel (#4198)

[AMD] Guard against null in `BypassEpilogueSMEM` (#4203)

`val.getDefiningOp()` can return `nullptr`. In this case, we must fail
the `BypassEpilogueSMEM` rewrite pass for the given op. This prevents
run-time crashes.

[FRONTEND][NFC] Fix type checking, conditional logic, and loop structures for improved readability and performance (#4208)

Document TRITON_HOME (#4210)

Document the existence of `TRITON_HOME` environment variable.

The `TRITON_HOME` variable controls the location of the `.triton`
directory that stores, among other things, the files downloaded during a
`pip install -e python` virtualenv build. By default, this is located in
the user's home directory, at `~/.triton`.

I was trying to build Triton on my system on a large local disk, but
with limited network home directory space, and the `pip` command kept
failing with out of disk space errors. It turned out that during
installation, large files were downloaded to the `~/.triton` directory
causing failure.

After checking that it was not `pip` doing this, I found the
`TRITON_HOME` variable which allowed me to workaround the issue and
build Triton successfully. After seconding #4007, I decided to
contribute this documentation fix.

Co-authored-by: sree <sree@buckyball>

[BACKEND] Fix regression in i1 reduction (#4215)

Recent refactoring broke i1 shared memory load.

[BUILD] update URL for LLVM tarballs (#4216)

[BACKEND] Fix divisibility analysis for shift ops (#4221)

Divisibility does not ensure that a value is not 0 therefore we cannot
use divisibility as a minimum shifted values.

Support FP8 constant (#4222)

To unblock the compilation of kernels like below which don't operate
arithmetically on FP8.

```
@triton.jit
def triton_poi_fused__scaled_mm__to_copy_constant_pad_nd_lift_fresh_2(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 400624
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 784
    x1 = (xindex // 784)
    x2 = xindex
    tmp0 = x0
    tmp1 = tl.full([1], 769, tl.int64)
    tmp2 = tmp0 < tmp1
    tmp3 = tl.load(in_ptr0 + (x0 + (769*x1)), tmp2 & xmask, other=0.0)
    tmp4 = tmp3.to(tl.float8e4nv)
    tmp5 = tl.full(tmp4.shape, 0.0, tmp4.dtype)
    tmp6 = tl.where(tmp2, tmp4, tmp5)
    tl.store(out_ptr0 + (x2), tmp6, xmask)
```

[INTERPRETER] Implement implicit tensor conversion for assignment operators (#4214)

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update

Update
  • Loading branch information
Jokeren committed Jul 3, 2024
1 parent 54960ca commit bd4ce81
Show file tree
Hide file tree
Showing 20 changed files with 926 additions and 104 deletions.
72 changes: 71 additions & 1 deletion third_party/proton/csrc/include/Data/Metric.h
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@

namespace proton {

enum class MetricKind { Flexible, Kernel, Count };
enum class MetricKind { Flexible, Kernel, PCSampling, Count };

using MetricValueType = std::variant<uint64_t, int64_t, double, std::string>;

Expand Down Expand Up @@ -148,6 +148,76 @@ class KernelMetric : public Metric {
};
};

class PCSamplingMetric : public Metric {
public:
enum PCSamplingMetricKind : int {
NumSamples,
NumStalledSamples,
StalledBranchResolving,
StalledNoInstruction,
StalledShortScoreboard,
StalledWait,
StalledLongScoreboard,
StalledTexThrottle,
StalledBarrier,
StalledMembar,
StalledIMCMiss,
StalledMIOThrottle,
StalledMathPipeThrottle,
StalledDrain,
StalledLGThrottle,
StalledNotSelected,
StalledMisc,
StalledDispatchStall,
StalledSleeping,
StalledSelected,
Count,
};

PCSamplingMetric()
: Metric(MetricKind::PCSampling, PCSamplingMetricKind::Count) {}

PCSamplingMetric(PCSamplingMetricKind kind, uint64_t samples,
uint64_t stalledSamples)
: PCSamplingMetric() {
this->values[kind] = stalledSamples;
this->values[PCSamplingMetricKind::NumSamples] = samples;
this->values[PCSamplingMetricKind::NumStalledSamples] = stalledSamples;
}

virtual const std::string getName() const { return "PCSamplingMetric"; }

virtual const std::string getValueName(int valueId) const {
return VALUE_NAMES[valueId];
}

virtual bool isAggregable(int valueId) const { return true; }

private:
const static inline std::string VALUE_NAMES[PCSamplingMetricKind::Count] = {
"NumSamples",
"NumStalledSamples",
"StalledBranchResolving",
"StalledNoInstruction",
"StalledShortScoreboard",
"StalledWait",
"StalledLongScoreboard",
"StalledTexThrottle",
"StalledBarrier",
"StalledMembar",
"StalledIMCMiss",
"StalledMIOThrottle",
"StalledMathPipeThrottle",
"StalledDrain",
"StalledLGThrottle",
"StalledNotSelected",
"StalledMisc",
"StalledDispatchStall",
"StalledSleeping",
"StalledSelected",
};
};

} // namespace proton

#endif // PROTON_DATA_METRIC_H_
1 change: 1 addition & 0 deletions third_party/proton/csrc/include/Data/TreeData.h
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ class TreeData : public Data {
void init();
void dumpHatchet(std::ostream &os) const;
void doDump(std::ostream &os, OutputFormat outputFormat) const override;
size_t getValidScopeId(size_t scopeId) const;

class Tree;
std::unique_ptr<Tree> tree;
Expand Down
38 changes: 38 additions & 0 deletions third_party/proton/csrc/include/Driver/GPU/CuptiApi.h
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,15 @@
#define PROTON_DRIVER_GPU_CUPTI_H_

#include "cupti.h"
#include "cupti_pcsampling.h"

namespace proton {

namespace cupti {

template <bool CheckSuccess>
CUptiResult getContextId(CUcontext context, uint32_t *pCtxId);

template <bool CheckSuccess>
CUptiResult activityRegisterCallbacks(
CUpti_BuffersCallbackRequestFunc funcBufferRequested,
Expand Down Expand Up @@ -66,6 +70,40 @@ CUptiResult getGraphExecId(CUgraphExec graph, uint32_t *pId);
template <bool CheckSuccess>
CUptiResult getGraphId(CUgraph graph, uint32_t *pId);

template <bool CheckSuccess>
CUptiResult getCubinCrc(CUpti_GetCubinCrcParams *pParams);

template <bool CheckSuccess>
CUptiResult
getSassToSourceCorrelation(CUpti_GetSassToSourceCorrelationParams *pParams);

template <bool CheckSuccess>
CUptiResult
pcSamplingGetNumStallReasons(CUpti_PCSamplingGetNumStallReasonsParams *pParams);

template <bool CheckSuccess>
CUptiResult
pcSamplingGetStallReasons(CUpti_PCSamplingGetStallReasonsParams *pParams);

template <bool CheckSuccess>
CUptiResult pcSamplingSetConfigurationAttribute(
CUpti_PCSamplingConfigurationInfoParams *pParams);

template <bool CheckSuccess>
CUptiResult pcSamplingEnable(CUpti_PCSamplingEnableParams *pParams);

template <bool CheckSuccess>
CUptiResult pcSamplingDisable(CUpti_PCSamplingDisableParams *pParams);

template <bool CheckSuccess>
CUptiResult pcSamplingGetData(CUpti_PCSamplingGetDataParams *pParams);

template <bool CheckSuccess>
CUptiResult pcSamplingStart(CUpti_PCSamplingStartParams *pParams);

template <bool CheckSuccess>
CUptiResult pcSamplingStop(CUpti_PCSamplingStopParams *pParams);

} // namespace cupti

} // namespace proton
Expand Down
132 changes: 132 additions & 0 deletions third_party/proton/csrc/include/Profiler/Cupti/CuptiPCSampling.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,132 @@
#ifndef PROTON_PROFILER_CUPTI_PC_SAMPLING_H_
#define PROTON_PROFILER_CUPTI_PC_SAMPLING_H_

#include "CuptiProfiler.h"
#include "Driver/GPU/CudaApi.h"
#include "Driver/GPU/CuptiApi.h"
#include "Utility/Map.h"
#include "Utility/Singleton.h"
#include <atomic>
#include <mutex>

namespace proton {

struct CubinData {
size_t cubinCrc;
const char *cubin;
size_t cubinSize;

struct LineInfoKey {
uint32_t functionIndex;
uint64_t pcOffset;

bool operator<(const LineInfoKey &other) const {
return functionIndex < other.functionIndex ||
(functionIndex == other.functionIndex &&
pcOffset < other.pcOffset);
}
};

struct LineInfoValue {
uint32_t lineNumber{};
const std::string functionName{};
const std::string dirName{};
const std::string fileName{};

LineInfoValue() = default;

LineInfoValue(uint32_t lineNumber, const std::string &functionName,
const std::string &dirName, const std::string &fileName)
: lineNumber(lineNumber), functionName(functionName), dirName(dirName),
fileName(fileName) {}
};

std::map<LineInfoKey, LineInfoValue> lineInfo;
};

struct ConfigureData {
ConfigureData() = default;

~ConfigureData() {
if (stallReasonNames) {
for (size_t i = 0; i < numStallReasons; i++) {
if (stallReasonNames[i])
std::free(stallReasonNames[i]);
}
std::free(stallReasonNames);
}
if (stallReasonIndices)
std::free(stallReasonIndices);
}

void initialize(CUcontext context);

CUpti_PCSamplingConfigurationInfo configureStallReasons();
CUpti_PCSamplingConfigurationInfo configureSamplingPeriod();
CUpti_PCSamplingConfigurationInfo configureSamplingBuffer();
CUpti_PCSamplingConfigurationInfo configureScratchBuffer();
CUpti_PCSamplingConfigurationInfo configureHardwareBufferSize();
CUpti_PCSamplingConfigurationInfo configureStartStopControl();
CUpti_PCSamplingConfigurationInfo configureCollectionMode();

// The amount of data reserved on the GPU
static constexpr size_t HardwareBufferSize = 128 * 1024 * 1024;
// The amount of data copied from the hardware buffer each time
static constexpr size_t ScratchBufferSize = 16 * 1024 * 1024;
// The number of PCs copied from the scratch buffer each time
static constexpr size_t DataBufferPCCount = 1024;
// The sampling period in cycles = 2^frequency
static constexpr uint32_t DefaultFrequency = 10;

CUcontext context{};
uint32_t contextId;
uint32_t numStallReasons{};
uint32_t numValidStallReasons{};
char **stallReasonNames{};
uint32_t *stallReasonIndices{};
std::map<size_t, size_t> stallReasonIndexToMetricIndex{};
std::set<size_t> notIssuedStallReasonIndices{};
CUpti_PCSamplingData pcSamplingData{};
// The memory storing configuration information has to be kept alive during
// the profiling session
std::vector<CUpti_PCSamplingConfigurationInfo> configurationInfos;
};

class CuptiPCSampling : public Singleton<CuptiPCSampling> {

public:
CuptiPCSampling() = default;
virtual ~CuptiPCSampling() = default;

void initialize(CUcontext context);

void start(CUcontext context);

void stop(CUcontext context, uint64_t externId, bool isAPI);

void finalize(CUcontext context);

void loadModule(CUpti_ResourceData *resourceData);

void unloadModule(CUpti_ResourceData *resourceData);

private:
ConfigureData *getConfigureData(CUcontext context);

CubinData *getCubinData(uint64_t cubinCrc);

void processPCSamplingData(ConfigureData *configureData, uint64_t externId,
bool isAPI);

ThreadSafeMap<uint32_t, ConfigureData> contextIdToConfigureData;
ThreadSafeMap<size_t, CubinData> cubinCrcToCubinData;
ThreadSafeSet<uint32_t> contextInitialized;

std::atomic<bool> pcSamplingStarted{false};
std::mutex pcSamplingMutex{};
std::mutex contextMutex{};
};

} // namespace proton

#endif // PROTON_PROFILER_CUPTI_PC_SAMPLING_H_
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
#ifndef PROTON_PROFILER_CUPTI_PROFILER_H_
#define PROTON_PROFILER_CUPTI_PROFILER_H_

#include "GPUProfiler.h"
#include "Profiler/GPUProfiler.h"

namespace proton {

Expand Down
12 changes: 12 additions & 0 deletions third_party/proton/csrc/include/Profiler/GPUProfiler.h
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,16 @@ class GPUProfiler : public Profiler,
std::unordered_map<uint64_t, std::pair<size_t, size_t>>>;
using ApiExternIdSet = ThreadSafeSet<size_t, std::unordered_set<size_t>>;

ConcreteProfilerT &enablePCSampling() {
pcSamplingEnabled = true;
return dynamic_cast<ConcreteProfilerT &>(*this);
}
ConcreteProfilerT &disablePCSampling() {
pcSamplingEnabled = false;
return dynamic_cast<ConcreteProfilerT &>(*this);
}
bool isPCSamplingEnabled() const { return pcSamplingEnabled; }

protected:
// OpInterface
void startOp(const Scope &scope) override {
Expand Down Expand Up @@ -140,6 +150,8 @@ class GPUProfiler : public Profiler,
ConcreteProfilerT &profiler;
};
std::unique_ptr<GPUProfilerPimplInterface> pImpl;

bool pcSamplingEnabled{false};
};

} // namespace proton
Expand Down
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
#ifndef PROTON_PROFILER_ROCTRACER_PROFILER_H_
#define PROTON_PROFILER_ROCTRACER_PROFILER_H_

#include "GPUProfiler.h"
#include "Profiler/GPUProfiler.h"

namespace proton {

Expand Down
5 changes: 5 additions & 0 deletions third_party/proton/csrc/include/Utility/Atomic.h
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
#ifndef PROTON_UTILITY_ATOMIC_H_
#define PROTON_UTILITY_ATOMIC_H_

#include <atomic>

namespace proton {
Expand All @@ -17,3 +20,5 @@ template <typename T> T atomicMin(std::atomic<T> &target, T value) {
}

} // namespace proton

#endif // PROTON_UTILITY_ATOMIC_H_
5 changes: 5 additions & 0 deletions third_party/proton/csrc/include/Utility/Errors.h
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
#ifndef PROTON_UTILITY_ERRORS_H_
#define PROTON_UTILITY_ERRORS_H_

#include <stdexcept>

namespace proton {
Expand All @@ -8,3 +11,5 @@ class NotImplemented : public std::logic_error {
};

} // namespace proton

#endif // PROTON_UTILITY_ERRORS_H_
11 changes: 11 additions & 0 deletions third_party/proton/csrc/include/Utility/String.h
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,17 @@ inline std::string toLower(const std::string &str) {
return lower;
}

inline std::string replace(const std::string &str, const std::string &src,
const std::string &dst) {
std::string replaced = str;
size_t pos = 0;
while ((pos = replaced.find(src, pos)) != std::string::npos) {
replaced.replace(pos, src.length(), dst);
pos += dst.length();
}
return replaced;
}

} // namespace proton

#endif // PROTON_UTILITY_STRING_H_
Loading

0 comments on commit bd4ce81

Please sign in to comment.